I am very new to VB.NET and am currently learning how to scrape and parse websites. My problem in a nutshell: if I use “getElementsByClassName” more than once in my code, it only works the first time. The same happens with “getElementsByTagName”, and even when I parse the HTML manually, it only works the first time.
Here is an example using “getElementsByClassName”. I have Form1 with Button1 and ListBox1. I am trying to get news titles from two websites (Google and BBC) and put them into ListBox1. As you can see, I split my code into two parts. Both parts work well and get the information I need, but only when used individually. When they are put together as in the example below, the first part (Google) executes without problems, but the second part (BBC) gives me an error on the line “Dim AllItemsBBC As Object = SecondBrowser.Document.getElementsByClassName("title-link__title-text")”.
What’s more interesting: if I flip the code around and put the BBC part first and the Google part second, BBC executes without problems and Google gives me an error on the line “Dim AllItemsGoogle As Object = FirstBrowser.Document.getElementsByClassName("titletext")”. Basically, whichever part runs first executes without problems, and the second one fails.
The error message is: “An unhandled exception of type 'System.NotSupportedException' occurred in Microsoft.VisualBasic.dll. Additional information: Exception from HRESULT: 0x800A01B6”.
Example 1:
Public Class Form1
    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        'START OF PART 1
        'Creating and navigating the IE browser to Google news page
        Dim FirstBrowser As Object = CreateObject("InternetExplorer.Application")
        FirstBrowser.Visible = True
        FirstBrowser.Navigate("https://news.google.com/news?cf=all&pz=1&ned=us")
        Do
            Application.DoEvents()
        Loop Until FirstBrowser.readyState = 4
        'Getting the titles from Google news page and adding them to ListBox1
        Dim ItemGoogle As Object
        Dim AllItemsGoogle As Object = FirstBrowser.Document.getElementsByClassName("titletext")
        For Each ItemGoogle In AllItemsGoogle
            ListBox1.Items.Add(ItemGoogle.InnerText)
        Next ItemGoogle
        'Closing the browser
        FirstBrowser.Quit()
        'END OF PART 1
        'START OF PART 2
        'Creating and navigating the IE browser to BBC news page
        Dim SecondBrowser As Object = CreateObject("InternetExplorer.Application")
        SecondBrowser.Visible = True
        SecondBrowser.Navigate("http://www.bbc.com/news")
        Do
            Application.DoEvents()
        Loop Until SecondBrowser.readyState = 4
        'Getting the titles from BBC news page and adding them to ListBox1
        Dim ItemBBC As Object
        Dim AllItemsBBC As Object = SecondBrowser.Document.getElementsByClassName("title-link__title-text")
        For Each ItemBBC In AllItemsBBC
            ListBox1.Items.Add(ItemBBC.InnerText)
        Next ItemBBC
        'Closing the browser
        SecondBrowser.Quit()
        'END OF PART 2
    End Sub
End Class
In my second example I parse the same websites by simply searching the HTML for the phrases I need. The situation is the same: the Google part works, and BBC fails on the line “Dim the_html_code_bbc As String = SecondBrowser.Document.Body.InnerHTML”.
If I flip it around, BBC works and Google fails on the line “Dim the_html_code_google As String = FirstBrowser.Document.Body.InnerHTML”.
The error message is: “An unhandled exception of type 'System.MissingMemberException' occurred in Microsoft.VisualBasic.dll. Additional information: Public member 'InnerHTML' on type 'JScriptTypeInfo' not found.”
Example 2:
Public Class Form1
    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click
        'START OF PART 1
        'Creating and navigating the IE browser to Google news page
        Dim FirstBrowser As Object = CreateObject("InternetExplorer.Application")
        FirstBrowser.Visible = True
        FirstBrowser.Navigate("https://news.google.com/news?cf=all&pz=1&ned=us")
        Do
            Application.DoEvents()
        Loop Until FirstBrowser.readyState = 4
        'Getting the titles from Google news page and adding them to ListBox1
        Dim the_html_code_google As String = FirstBrowser.Document.Body.InnerHTML
        'InStr returns a 1-based position (0 when not found), so this is an Integer
        Dim start_of_code_google As Integer
        Dim code_selection_google As String
        Do
            Application.DoEvents()
            start_of_code_google = InStr(the_html_code_google, "titletext")
            If start_of_code_google > 0 Then
                'Skip past the class name and the closing quote and bracket (+ 11)
                code_selection_google = Mid(the_html_code_google, start_of_code_google + 11, Len(the_html_code_google))
                the_html_code_google = Mid(the_html_code_google, start_of_code_google + 11, Len(the_html_code_google))
                'Keep everything up to the next "<" (Chr(60))
                code_selection_google = Mid(code_selection_google, 1, InStr(code_selection_google, Chr(60)) - 1)
                ListBox1.Items.Add(code_selection_google)
            End If
        Loop Until start_of_code_google = 0
        'Closing the browser
        FirstBrowser.Quit()
        'END OF PART 1
        'START OF PART 2
        'Creating and navigating the IE browser to BBC news page
        Dim SecondBrowser As Object = CreateObject("InternetExplorer.Application")
        SecondBrowser.Visible = True
        SecondBrowser.Navigate("http://www.bbc.com/news")
        Do
            Application.DoEvents()
        Loop Until SecondBrowser.readyState = 4
        'Getting the titles from BBC news page and adding them to ListBox1
        Dim the_html_code_bbc As String = SecondBrowser.Document.Body.InnerHTML
        'InStr returns a 1-based position (0 when not found), so this is an Integer
        Dim start_of_code_bbc As Integer
        Dim code_selection_bbc As String
        Do
            Application.DoEvents()
            start_of_code_bbc = InStr(the_html_code_bbc, "title-link__title-text")
            If start_of_code_bbc > 0 Then
                'Skip past the class name and the closing quote and bracket (+ 24)
                code_selection_bbc = Mid(the_html_code_bbc, start_of_code_bbc + 24, Len(the_html_code_bbc))
                the_html_code_bbc = Mid(the_html_code_bbc, start_of_code_bbc + 24, Len(the_html_code_bbc))
                'Keep everything up to the next "<" (Chr(60))
                code_selection_bbc = Mid(code_selection_bbc, 1, InStr(code_selection_bbc, Chr(60)) - 1)
                ListBox1.Items.Add(code_selection_bbc)
            End If
        Loop Until start_of_code_bbc = 0
        'Closing the browser
        SecondBrowser.Quit()
        'END OF PART 2
    End Sub
End Class
Another thing worth mentioning: if I use one parsing method for the Google part and a different method for the BBC part, everything works.
I must be missing something due to my inexperience with Visual Studio. I am using Visual Studio Express 2013 for Windows Desktop. If you know what’s causing this issue, I would greatly appreciate your advice.