Why can my Page(QWebEnginePage) class only support one instance in this scraping application?


What is it about this dynamic_finder(url_path) function, my user-defined Page class, and PyQt5.QtWebEngineWidgets.QWebEnginePage that prevents me from running it more than once?

Within a program that I'm building, I need to scrape some websites that load their content with JavaScript. To handle this, I've made use of PyQt5. It works very well and gets content that I thought was entirely inaccessible to Python's bs4 library. Below is an excerpt of the code I have written:

from bs4 import BeautifulSoup
from colorama import Fore

# Modules for dynamic JS websites

from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
import sys


def dynamic_finder(url_path):
    class Page(QWebEnginePage):
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebEnginePage.__init__(self)
            self.html = ''
            self.loadFinished.connect(self._on_load_finished)
            self.load(QUrl(url))
            self.app.exec_()

        def _on_load_finished(self):
            self.toHtml(self.callable)
            print(Fore.YELLOW + 'Dynamic Load finished')

        def callable(self, html_str):
            self.html = html_str
            self.app.quit()

    page = Page(url_path)
    soupy = BeautifulSoup(page.html, 'html.parser')
    tag = soupy.a
    output = tag.text

    return output, tag.attrs['href']


print(dynamic_finder("https://www.fb.com"))
print(dynamic_finder("https://www.apple.com"))
print(dynamic_finder("https://www.a.co"))
print(dynamic_finder("https://www.netflix.com"))
print(dynamic_finder("https://www.google.com"))

However, whenever I try to make a second instance of Page (by calling dynamic_finder more than once) within PyCharm, the process crashes with the message "Process finished with exit code -1073741819 (0xC0000005)."
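For reference, the two numbers in that message are the same value: -1073741819 is the signed 32-bit reading of 0xC0000005, the code Windows reports for an access violation (STATUS_ACCESS_VIOLATION). That points to a crash inside native code rather than a Python exception. A minimal check of the conversion, using only the value from the message (the variable names are illustrative):

```python
# Exit code exactly as PyCharm prints it (a signed 32-bit integer).
exit_code = -1073741819

# Reinterpret the same 32 bits as an unsigned value.
unsigned = exit_code & 0xFFFFFFFF

print(hex(unsigned))  # prints 0xc0000005
```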

I found one potential solution in this question on Stack Overflow, but even after applying the suggested change in my settings, I still face the same issue. I have also tried changing the variable name (page1, page2, etc.) with each function call, to no avail.

In the last five lines of the code, I illustrate the issue by calling dynamic_finder on five different websites. It fails on the second call. Please note that these websites (homepages for FAANG) are not my actual targets; the true targets require a method that handles content loaded with JavaScript. The problem persists regardless of the URLs.

I'm not sure how Stack Overflow handles package requirements (this is only my second time posting a question), but for those who want to run the code, see the requirements.txt and the Anaconda Prompt (Windows) commands:

beautifulsoup4==4.9.1
colorama==0.4.1
ipython==7.18.1
PyQt5==5.15.2
PyQt5-sip==12.8.1
PyQtWebEngine==5.15.2
requests==2.22.0
virtualenv==20.0.31
widgetsnbextension==3.5.1

pip install virtualenv
virtualenv venv
.\venv\Scripts\activate
pip install -r requirements.txt
ipython

I believe the problem lies within the Page class definition which I adapted from this YouTube video.

Any suggestions are greatly appreciated, even ones that suggest something separate from PyQt5 for websites that load with JavaScript. Still, I would prefer a recommendation that helps me edit the Page class definition such that I can create multiple instances in different calls of dynamic_finder without problems.
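One direction that may be worth trying is sketched below. It assumes the crash comes from each Page constructing its own QApplication, so that the second call creates a second application object in the same process; nothing above confirms this. The sketch reuses whatever QApplication already exists and waits on a local QEventLoop instead of calling app.exec_() and app.quit(). The names render_html and render_many are hypothetical helpers, not part of PyQt5, and the sketch has not been run against the sites listed above.

```python
import sys


def render_html(url):
    """Return the rendered HTML of `url`, reusing one QApplication per process."""
    # Qt imports are local so this sketch can be imported where Qt is not installed.
    from PyQt5.QtCore import QEventLoop, QUrl
    from PyQt5.QtWebEngineWidgets import QWebEnginePage
    from PyQt5.QtWidgets import QApplication

    # Assumption: only one QApplication may be created, so reuse it if it exists.
    # The local name keeps a reference alive for the duration of the call.
    app = QApplication.instance() or QApplication(sys.argv)

    page = QWebEnginePage()
    loop = QEventLoop()  # local loop, so the application itself is never quit
    result = {"html": ""}

    def store_html(html_str):
        # Called asynchronously by toHtml() with the page source.
        result["html"] = html_str
        loop.quit()

    # Request the HTML only after the page reports that loading has finished.
    page.loadFinished.connect(lambda ok: page.toHtml(store_html))
    page.load(QUrl(url))
    loop.exec_()
    return result["html"]


def render_many(urls, render=render_html):
    """Render each URL in turn; `render` can be swapped out to try the loop offline."""
    return {url: render(url) for url in urls}
```

The returned string could then be passed to BeautifulSoup(html, 'html.parser') exactly as in dynamic_finder, for example by replacing the Page(url_path) line with html = render_html(url_path).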

python
class
web-scraping
beautifulsoup
pyqt5
asked on Stack Overflow Dec 4, 2020 by JacobK

0 Answers

Nobody has answered this question yet.

