What about this dynamic_finder(url) function, my user-defined Page class, and PyQt5.QtWebEngineWidgets.QWebEnginePage prevents me from running it more than once?
Within a program that I'm building, I need to scrape some websites that load their content with JavaScript. To handle this, I've used PyQt5. It works very well and gets content that I thought was entirely inaccessible to Python's bs4 library. Below is an excerpt of the code I have written:
from bs4 import BeautifulSoup
from colorama import Fore
# Modules for dynamic JS websites
from PyQt5.QtWebEngineWidgets import QWebEnginePage
from PyQt5.QtWidgets import QApplication
from PyQt5.QtCore import QUrl
import sys


def dynamic_finder(url_path):
    class Page(QWebEnginePage):
        def __init__(self, url):
            self.app = QApplication(sys.argv)
            QWebEnginePage.__init__(self)
            self.html = ''
            self.loadFinished.connect(self._on_load_finished)
            self.load(QUrl(url))
            self.app.exec_()

        def _on_load_finished(self):
            self.toHtml(self.callable)
            print(Fore.YELLOW + 'Dynamic Load finished')

        def callable(self, html_str):
            self.html = html_str
            self.app.quit()

    page = Page(url_path)
    soupy = BeautifulSoup(page.html, 'html.parser')
    tag = soupy.a
    output = tag.text
    return output, tag.attrs['href']


print(dynamic_finder("https://www.fb.com"))
print(dynamic_finder("https://www.apple.com"))
print(dynamic_finder("https://www.a.co"))
print(dynamic_finder("https://www.netflix.com"))
print(dynamic_finder("https://www.google.com"))
However, whenever I try to make a second instance of Page (by calling dynamic_finder more than once) within PyCharm, the process crashes with "Process finished with exit code -1073741819 (0xC0000005)."
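For anyone reading the error message: the two numbers in it are the same value. The decimal form is the exit code read as a signed 32-bit integer, and the hexadecimal form, 0xC0000005, is the Windows status code for an access violation, which points to a crash in native code rather than a Python exception. A small sketch of the conversion (the helper name `to_ntstatus` is only illustrative):

```python
def to_ntstatus(exit_code: int) -> str:
    """Format a signed 32-bit process exit code as an unsigned hex status code."""
    # Masking with 0xFFFFFFFF reinterprets the negative value as unsigned 32-bit.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus(-1073741819))  # 0xC0000005
```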
I found one potential solution in this question on Stack Overflow, but even after applying the suggested change in my settings, I still face the same issue. I have also tried changing the variable name (page1, page2, etc.) with each function call, to no avail.
In the last five lines of the code, I illustrate the issue by calling dynamic_finder on five different websites. It fails on the second call. Please note that these websites (homepages for FAANG) are not my actual targets; the true targets require a method that handles content loaded with JavaScript. The problem persists regardless of the URLs.
I'm not sure how Stack Overflow handles package requirements (this is only my second time posting a question), but for those who want to run the code, see the requirements.txt and the Anaconda Prompt (Windows) commands below:
beautifulsoup4==4.9.1
colorama==0.4.1
ipython==7.18.1
PyQt5==5.15.2
PyQt5-sip==12.8.1
PyQtWebEngine==5.15.2
requests==2.22.0
virtualenv==20.0.31
widgetsnbextension==3.5.1
pip install virtualenv
virtualenv venv
.\venv\Scripts\activate
pip install -r requirements.txt
ipython
I believe the problem lies within the Page class definition which I adapted from this YouTube video.
Any suggestions are greatly appreciated, even ones that suggest something separate from PyQt5 for websites that load with JavaScript. Still, I would prefer a recommendation that helps me edit the Page class definition such that I can create multiple instances in different calls of dynamic_finder without problems.
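As one possible alternative that leaves the Page class untouched, each scrape could run in a fresh Python interpreter, so that no process ever creates more than one QApplication. The sketch below uses only the standard library and assumes the child script prints its result as JSON on its last line of stdout. The helper `run_isolated` and the `DEMO_CHILD` script are hypothetical stand-ins; a real child script would call dynamic_finder once and print `json.dumps` of its return value.

```python
import json
import subprocess
import sys


def run_isolated(script, *args, timeout=60):
    """Run Python source in a new interpreter and return the JSON value
    printed on the last line of its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", script, *args],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the child crashes
    )
    lines = result.stdout.strip().splitlines()
    if not lines:
        raise RuntimeError("child process produced no output")
    # Only the last line is parsed, so earlier status prints are ignored.
    return json.loads(lines[-1])


# Stand-in child script: echoes its argument instead of scraping anything.
DEMO_CHILD = "import json, sys; print(json.dumps({'url': sys.argv[1]}))"

for url in ("https://www.fb.com", "https://www.apple.com"):
    print(run_isolated(DEMO_CHILD, url))
```

The trade-off is the start-up cost of one interpreter per URL, in exchange for a crash in one scrape not taking down the main program.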