I'm very new to Python and pandas, and I would like to parse a very big text file (200-500 MB) where the information is laid out as records, with each record holding its data fields as rows/lines rather than columns, and with records having different numbers of lines (as they provide different types of data). An example is below:
....
Record Type = Record_Name
attribute1 = 3090 (0x01218)
attribute2 = attribute_name3090 (type: type_name1)
timestamp: 21:47:33.360000
attribute4: 731001-1 (0x0b277911)
attribute5: 50000 (0x000000c350)
attribute6: 3934 (0x0000000f5e)
attribute7: 857 (0x0000000359)
Record Type = Record_Name
attribute1 = 3099 (0x01227)
attribute2 = attribute_name3099 (type: type_name2)
timestamp: 21:49:07.359000
attribute4 = 731001-3 (0x0b277911)
attribute8: 14 (0x0000000e)
attribute9: 17 (0x00000011)
attribute10: 43 (0x0000002b)
attribute11: 40 (0x00000028)
Record Type = Record_Name
attribute1 = 3090 (0x01218)
attribute2 = attribute_name3090 (type: type_name1)
timestamp: 21:51:17.142000
attribute4: 942101-2 (0x0b345872)
attribute5: 2490368 (0x00260000)
attribute6: 24 (0x00000018)
attribute7: 25 (0x00000019)
Record Type = Record_Name
attribute1 = 3102 (0x01230)
attribute2 = attribute_name3102 (type: type_name1)
timestamp: 21:52:42.359000
attribute4: 731001-2 (0x0b277911)
attribute12: 0 (0x0000000000)
attribute13: 80 (0x0000000050)
....
Thus, for post-processing, I would like to extract the desired data into a couple of pandas dataframes, one to summarize the records, and the other to filter for a particular type of record.
DATAFRAME 1 SUMMARY TABLE OF RECORDS: create a dataframe indexed by timestamp showing only a couple of attributes of each record, all kept as strings but without the hex values in parentheses:
attribute1 attribute2 type attribute4
timestamp
...
21:47:33.360000 3090 attribute_name3090 type_name1 731001-1
21:49:07.359000 3099 attribute_name3099 type_name2 731001-3
21:51:17.142000 3090 attribute_name3090 type_name1 942101-2
21:52:42.359000 3102 attribute_name3102 type_name1 731001-2
....
DATAFRAME 2 FILTERED RECORDS: capture only the records with attribute1 = 3090 (of type = type_name1) and create the following dataframe indexed by timestamp, with attributes 5-7 stripped of the hex values in parentheses and converted from string to integers:
attribute4 attribute5 attribute6 attribute7
timestamp
...
21:47:33.360000 731001-1 50000 3934 857
21:51:17.142000 942101-2 2490368 24 25
....
I tried opening the file and reading it line by line, but it takes a large amount of time. I read about "generators" but couldn't figure out how to use them to simplify the code. Your suggestions would be greatly appreciated. Thanks in advance.
Gus
This code shows how to parse the input lines into two dictionaries in a simple way. It assumes that the input is very regular. The main action comes just after reading each line: the line is split on blanks, the trailing colon is stripped from the key token (so the colons inside the timestamp values are preserved) and any equal sign is dropped, all to simplify the subsequent processing, which is driven by the first item on each line.
If you run the code you'll find that it creates two text files, each of which consists of one dictionary per line.
with open('DF1.txt', 'w') as DF1, open('DF2.txt', 'w') as DF2:
    DF1_record = {}
    DF2_record = {}
    with open('bigText.txt') as bigText:
        for inputLine in bigText:
            # Split on blanks and strip the trailing colon from the key token only,
            # so that the colons inside the timestamp values are preserved.
            inputLine = inputLine.strip().split()
            if not inputLine:
                continue
            inputLine[0] = inputLine[0].rstrip(':')
            if '=' in inputLine:
                inputLine.remove('=')
            kind = inputLine[0]
            if kind == 'Record':
                # A new record begins: flush the previous one, if any.
                if DF1_record:
                    DF1.write(str(DF1_record) + '\n')
                if DF2_record:
                    DF2.write(str(DF2_record) + '\n')
                DF1_record = {}
                DF2_record = {}
                continue
            if kind in ['attribute1', 'attribute2', 'attribute4', 'timestamp']:
                DF1_record[kind] = inputLine[1]
                if kind == 'attribute2':
                    DF1_record['type'] = inputLine[3][:-1]  # drop the trailing ')'
            if kind in ['attribute4', 'attribute5', 'attribute6', 'attribute7', 'timestamp']:
                if DF1_record.get('attribute1') == '3090':
                    DF2_record[kind] = inputLine[1]
    # Flush the last record in the file, which is not followed by another header.
    if DF1_record:
        DF1.write(str(DF1_record) + '\n')
    if DF2_record:
        DF2.write(str(DF2_record) + '\n')
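To turn those intermediate files into the two dataframes you describe, one option (a sketch only, assuming the DF1.txt / DF2.txt files produced above and the column names from the question; load_records is a helper name I'm making up here) is to read each line back with ast.literal_eval, build a DataFrame from the resulting list of dicts, set timestamp as the index, and cast the numeric columns of the second frame to integers:

import ast
import pandas as pd

def load_records(path):
    # Read one dict literal per line and return a DataFrame indexed by timestamp.
    with open(path) as f:
        records = [ast.literal_eval(line) for line in f if line.strip()]
    return pd.DataFrame(records).set_index('timestamp')

df1 = load_records('DF1.txt')    # summary table, all columns kept as strings

df2 = load_records('DF2.txt')    # records filtered on attribute1 == 3090
int_cols = ['attribute5', 'attribute6', 'attribute7']    # names taken from the question
df2[int_cols] = df2[int_cols].astype(int)

print(df1.head())
print(df2.head())

If the intermediate files are not needed at all, the same dicts could instead be collected in a list (or yielded from a generator function) and passed straight to pandas.DataFrame, which is one place where the generators mentioned in the question would fit.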