Process finished with exit code -1073740791 (0xc0000409) when web scraping large data


I had written a script to do some web scraping of webpages. The webpages had JavaScript on them, so I used PyQt5 to render the pages before using BeautifulSoup to scrape the desired content.
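For reference, the render-then-parse step I'm describing follows the usual QtWebEngine approach (a simplified sketch, not my exact code; the URL is a placeholder):

import sys
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from bs4 import BeautifulSoup

class Render(QWebEnginePage):
    """Load a URL, let its JavaScript run, then capture the rendered DOM."""
    def __init__(self, app, url):
        super().__init__()
        self.app = app
        self.html = None
        self.loadFinished.connect(self._on_load_finished)
        self.load(QUrl(url))

    def _on_load_finished(self, ok):
        # toHtml() is asynchronous: it delivers the rendered DOM to a callback
        self.toHtml(self._store_html)

    def _store_html(self, html):
        self.html = html
        self.app.quit()  # stop the event loop once the page is captured

app = QApplication(sys.argv)
page = Render(app, 'https://example.com')  # placeholder URL
app.exec_()
soup = BeautifulSoup(page.html, 'html.parser')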

However, I have a lot of pages to scrape (more than 10,000), and I was attempting to store the content in a dict which I would later convert to a JSON file. I tried writing the JSON file periodically because I assumed the dict was getting too large on account of the number of scrapes, but I still received the exit code.

On another thread someone suggested updating the video card driver (no idea why that would affect my Python script, but I gave it a shot). No progress.

python
web-scraping
asked on Stack Overflow Oct 27, 2018 by boymeetscode • edited Oct 27, 2018 by eyllanesc

1 Answer


The issue (at least in this case) was that the dictionary was getting too large. The way I solved it: every 1000 scrapes, dump the data to a JSON file on the hard drive (appending a counter to the file name), clear the dict, increment the counter, and keep on scraping.

import json

data = {}  # accumulates scraped records for the current batch
id = 0     # key for dict data and counter for file separation

# ... while/for loop iterating over all web pages
    data_table = soup.find('table', attrs={'class': 'dataTable'})
    # ... process data into dict d
    data[id] = d
    if id % 1000 == 0:
        # open in write mode and dump the current batch to disk
        with open(r'datafile-{num}.json'.format(num=id // 1000), 'w') as file:
            json.dump(data, file)
        data.clear()  # release the memory held by the batch
    id += 1  # increment the key for dict data and counter for file separation

It isn't ideal since now I have many files, but at least I have the data I wanted. In case anyone else is getting exit code -1073740791 (0xc0000409) on Windows: if you are dumping lots of data into dictionaries, this could very well be the reason.
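Since the data ends up split across many files, the chunks can be merged back together afterwards. A minimal sketch, assuming the datafile-{num}.json naming scheme from the loop above:

import glob
import json

merged = {}
for path in glob.glob('datafile-*.json'):
    with open(path) as file:
        # note: JSON round-trips dict keys as strings, not ints
        merged.update(json.load(file))

with open('datafile-merged.json', 'w') as file:
    json.dump(merged, file)

Since every id appears in exactly one chunk, the order in which the files are merged doesn't matter.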

answered on Stack Overflow Oct 27, 2018 by boymeetscode

User contributions licensed under CC BY-SA 3.0