I am working with a dataset that is roughly a 6 GB CSV file. Obviously, loading it straight into pandas raises memory errors. I tried processing it with a chunksize of 1000000, but then it just says "Process finished with exit code -1073741819 (0xC0000005)" (a Windows access violation) when I try to access any of the information.
I will post the code I have below.
Code:
import pandas as pd
import numpy as np
import os
import dask.dataframe as dd
import psutil
from tqdm import tqdm  # import the function; "import tqdm" alone makes tqdm(...) fail with TypeError
f_path = r"..\..\..\Desktop\march_test.csv"  # raw string so the backslashes are not treated as escapes
svmem = psutil.virtual_memory()
print(svmem.available)
df_size = os.path.getsize(f_path)
print(df_size)
df_sample = pd.read_csv(f_path, nrows=10 )
df_sample_size = df_sample.memory_usage(index=True).sum()  # call sum(); without () it prints the bound method
print(df_sample_size)
df = pd.read_csv(f_path, sep=',', nrows=5000000)
print(df.head())
df_chunk = pd.read_csv(f_path, chunksize=1000000)  # read_csv has no "headers" keyword; the header row is inferred by default
chunk_list = []  # append each chunk df here
for chunk in tqdm(df_chunk):
    # perform data filtering
    chunk_filter = chunk_preprocessing(chunk)
    # Once the data filtering is done, append the chunk to list
    chunk_list.append(chunk_filter)
df_concat = pd.concat(chunk_list)
print(df_concat.head())  # inspect the concatenated result rather than the raw list of chunks
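For clarity, this is the filter-and-concat pattern I am attempting, demonstrated on a tiny in-memory CSV so it runs anywhere. The column names come from my schema above; the chunksize, threshold, and filter itself are arbitrary stand-ins for my `chunk_preprocessing` step:

```python
import io
import pandas as pd

# Tiny synthetic CSV standing in for march_test.csv (2 of the real columns)
csv_text = "safegraph_place_id,raw_visit_counts\n" + "\n".join(
    f"p{i},{i * 10}" for i in range(10)
)

# Read in chunks of 4 rows instead of loading everything at once
chunks = pd.read_csv(io.StringIO(csv_text), chunksize=4)

filtered = []
for chunk in chunks:
    # Stand-in for chunk_preprocessing: keep rows above an arbitrary threshold
    filtered.append(chunk[chunk["raw_visit_counts"] > 30])

df = pd.concat(filtered, ignore_index=True)
print(len(df))  # -> 6 (rows with raw_visit_counts of 40..90)
```

Only the filtered rows of each chunk are kept in memory, so the peak footprint is one chunk plus the surviving rows, not the whole file.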
Output:
25124765696
3952642752
<bound method Series.sum of Index                        64
safegraph_place_id           40
location_name                40
street_address               40
city                         40
region                       40
postal_code                  80
brands                       40
naics_code                   80
date_range_start             40
date_range_end               40
raw_visit_counts             80
raw_visitor_counts           80
visits_by_day                40
visits_by_each_hour          40
visitor_home_cbgs            40
visitor_country_of_origin    40
distance_from_home           80
median_dwell                 80
bucketed_dwell_times         40
related_same_day_brand       40
related_same_week_brand      40
device_type                  40
iso_country_code             40
dtype: int64>
Process finished with exit code -1073741819 (0xC0000005)
Thanks in advance for any help!