I'm very new to Python and pandas, and I would like to parse a very big text file (200-500 MB) where the information is laid out as records, with each record holding its data fields as rows/lines rather than columns, and with records having different numbers of lines (as they provide different types of data). An example is below:
....
Record Type = Record_Name
attribute1 = 3090 (0x01218)
attribute2 = attribute_name3090 (type: type_name1)
timestamp: 21:47:33.360000
attribute4: 731001-1 (0x0b277911)
attribute5: 50000 (0x000000c350)
attribute6: 3934 (0x0000000f5e)
attribute7: 857 (0x0000000359)
Record Type = Record_Name
attribute1 = 3099 (0x01227)
attribute2 = attribute_name3099 (type: type_name2)
timestamp: 21:49:07.359000
attribute4 = 731001-3 (0x0b277911)
attribute8: 14 (0x0000000e)
attribute9: 17 (0x00000011)
attribute10: 43 (0x0000002b)
attribute11: 40 (0x00000028)
Record Type = Record_Name
attribute1 = 3090 (0x01218)
attribute2 = attribute_name3090 (type: type_name1)
timestamp: 21:51:17.142000
attribute4: 942101-2 (0x0b345872)
attribute5: 2490368 (0x00260000)
attribute6: 24 (0x00000018)
attribute7: 25 (0x00000019)
Record Type = Record_Name
attribute1 = 3102 (0x01230)
attribute2 = attribute_name3102 (type: type_name1)
timestamp: 21:52:42.359000
attribute4: 731001-2 (0x0b277911)
attribute12: 0 (0x0000000000)
attribute13: 80 (0x0000000050)
....
Thus, for post-processing, I would like to extract the desired data into a couple of pandas dataframes, one to summarize the records, and the other to filter for a particular type of record.
DATAFRAME 1 SUMMARY TABLE OF RECORDS: create a dataframe indexed by timestamp showing only a couple of attributes of each record, all kept as strings but without the hex values in parentheses:
attribute1 attribute2 type attribute4
timestamp
...
21:47:33.360000 3090 attribute_name3090 type_name1 731001-1
21:49:07.359000 3099 attribute_name3099 type_name2 731001-3
21:51:17.142000 3090 attribute_name3090 type_name1 942101-2
21:52:42.359000 3102 attribute_name3102 type_name1 731001-2
....
DATAFRAME 2 FILTERED RECORDS: capture only the records with attribute1 = 3090 (of type = type_name1) and create the following dataframe indexed by timestamp, with attributes 5-7 stripped of the hex values in parentheses and converted from string to integers:
attribute4 attribute5 attribute6 attribute7
timestamp
...
21:47:33.360000 731001-1 50000 3934 857
21:51:17.142000 942101-2 2490368 24 25
....
I tried opening the file and reading it line by line, but it takes a large amount of time. I read about "generators" but couldn't figure out how to use them to simplify the code. Your suggestions would be greatly appreciated. Thanks in advance.
Gus
This code shows how to parse the input lines into two dictionaries in a simple way. It assumes that the input is very regular. The main action comes just after reading each line: the line is split on blanks, the trailing colon is stripped from the key token (so the colons inside the timestamp values are preserved) and any equal sign is dropped, all to simplify the subsequent processing, which is driven by the first item on each line.
If you run the code you'll find that it creates two text files, each of which consists of one dictionary per line.
with open('DF1.txt', 'w') as DF1, open('DF2.txt', 'w') as DF2:
    DF1_record = {}
    DF2_record = {}
    with open('bigText.txt') as bigText:
        for inputLine in bigText:
            # Split on blanks and strip the trailing colon from the key token only,
            # so that the colons inside the timestamp values are preserved.
            inputLine = inputLine.strip().split()
            if not inputLine:
                continue
            inputLine[0] = inputLine[0].rstrip(':')
            if '=' in inputLine:
                inputLine.remove('=')
            kind = inputLine[0]
            if kind == 'Record':
                # A new record begins: flush the previous one, if any.
                if DF1_record:
                    DF1.write(str(DF1_record) + '\n')
                if DF2_record:
                    DF2.write(str(DF2_record) + '\n')
                DF1_record = {}
                DF2_record = {}
                continue
            if kind in ['attribute1', 'attribute2', 'attribute4', 'timestamp']:
                DF1_record[kind] = inputLine[1]
                if kind == 'attribute2':
                    DF1_record['type'] = inputLine[3][:-1]  # drop the trailing ')'
            if kind in ['attribute4', 'attribute5', 'attribute6', 'attribute7', 'timestamp']:
                if DF1_record.get('attribute1') == '3090':
                    DF2_record[kind] = inputLine[1]
    # Flush the last record in the file, which is not followed by another header.
    if DF1_record:
        DF1.write(str(DF1_record) + '\n')
    if DF2_record:
        DF2.write(str(DF2_record) + '\n')
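To turn those intermediate files into the two dataframes you describe, one option (a sketch only, assuming the DF1.txt / DF2.txt files produced above and the column names from the question; load_records is a helper name I'm making up here) is to read each line back with ast.literal_eval, build a DataFrame from the resulting list of dicts, set timestamp as the index, and cast the numeric columns of the second frame to integers:

import ast
import pandas as pd

def load_records(path):
    # Read one dict literal per line and return a DataFrame indexed by timestamp.
    with open(path) as f:
        records = [ast.literal_eval(line) for line in f if line.strip()]
    return pd.DataFrame(records).set_index('timestamp')

df1 = load_records('DF1.txt')    # summary table, all columns kept as strings

df2 = load_records('DF2.txt')    # records filtered on attribute1 == 3090
int_cols = ['attribute5', 'attribute6', 'attribute7']    # names taken from the question
df2[int_cols] = df2[int_cols].astype(int)

print(df1.head())
print(df2.head())

If the intermediate files are not needed at all, the same dicts could instead be collected in a list (or yielded from a generator function) and passed straight to pandas.DataFrame, which is one place where the generators mentioned in the question would fit.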