Web parsing issue using VB

I am very new to VB.NET and am currently learning how to scrape and parse websites. My problem in a nutshell: if I call “getElementsByClassName” more than once in my code, only the first call works. The same happens with “getElementsByTagName”. Even when I parse the HTML manually, only the first attempt works.

Here is an example using “getElementsByClassName”. I have Form1 with Button1 and ListBox1. I am trying to get news titles from two websites (Google and BBC) and put them into ListBox1. As you can see, I split my code into two parts. Both parts work well and get the information I need, but only when used individually. When they are put together as in the example below, the first part (Google) executes without problems, but the second part (BBC) throws an error on the line “Dim AllItemsBBC As Object = SecondBrowser.Document.getElementsByClassName("title-link__title-text")”.

More interesting still, if I flip the code around and put the BBC part first and the Google part second, BBC executes without problems and Google throws an error on the line “Dim AllItemsGoogle As Object = FirstBrowser.Document.getElementsByClassName("titletext")”. Basically, whichever part runs first executes without problems, and the second one fails.

The error message is: “An unhandled exception of type 'System.NotSupportedException' occurred in Microsoft.VisualBasic.dll. Additional information: Exception from HRESULT: 0x800A01B6”.

Example 1:

Public Class Form1

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click

        'START OF PART 1
        'Creating and navigating the IE browser to Google news page
        Dim FirstBrowser As Object = CreateObject("InternetExplorer.Application")
        FirstBrowser.Visible = True
        FirstBrowser.Navigate("https://news.google.com/news?cf=all&pz=1&ned=us")
        Do
            Application.DoEvents()
        Loop Until FirstBrowser.readyState = 4

        'Getting the titles from Google news page and adding them to ListBox1
        Dim ItemGoogle As Object
        Dim AllItemsGoogle As Object = FirstBrowser.Document.getElementsByClassName("titletext")
        For Each ItemGoogle In AllItemsGoogle
            ListBox1.Items.Add(ItemGoogle.InnerText)
        Next ItemGoogle

        'Closing the browser
        FirstBrowser.Quit()
        'END OF PART 1

        'START OF PART 2
        'Creating and navigating the IE browser to BBC news page
        Dim SecondBrowser As Object = CreateObject("InternetExplorer.Application")
        SecondBrowser.Visible = True
        SecondBrowser.Navigate("http://www.bbc.com/news")
        Do
            Application.DoEvents()
        Loop Until SecondBrowser.readyState = 4

        'Getting the titles from BBC news page and adding them to ListBox1
        Dim ItemBBC As Object
        Dim AllItemsBBC As Object = SecondBrowser.Document.getElementsByClassName("title-link__title-text")
        For Each ItemBBC In AllItemsBBC
            ListBox1.Items.Add(ItemBBC.InnerText)
        Next ItemBBC

        'Closing the browser
        SecondBrowser.Quit()
        'END OF PART 2

    End Sub
End Class

In my second example I parse the same websites manually by searching the HTML for the phrases I need. The situation is the same: the Google part works, and the BBC part fails on the line “Dim the_html_code_bbc As String = SecondBrowser.Document.Body.InnerHTML”.

If I flip it around, BBC works and Google fails on the line “Dim the_html_code_google As String = FirstBrowser.Document.Body.InnerHTML”.

The error message is: “An unhandled exception of type 'System.MissingMemberException' occurred in Microsoft.VisualBasic.dll. Additional information: Public member 'InnerHTML' on type 'JScriptTypeInfo' not found.”

Example 2:

Public Class Form1

    Private Sub Button1_Click(sender As Object, e As EventArgs) Handles Button1.Click

        'START OF PART 1
        'Creating and navigating the IE browser to Google news page
        Dim FirstBrowser As Object = CreateObject("InternetExplorer.Application")
        FirstBrowser.Visible = True
        FirstBrowser.Navigate("https://news.google.com/news?cf=all&pz=1&ned=us")
        Do
            Application.DoEvents()
        Loop Until FirstBrowser.readyState = 4

        'Getting the titles from Google news page and adding them to ListBox1
        Dim the_html_code_google As String = FirstBrowser.Document.Body.InnerHTML
        Dim start_of_code_google As Integer 'InStr returns an Integer position
        Dim code_selection_google As String
        Do
            Application.DoEvents()
            start_of_code_google = InStr(the_html_code_google, "titletext")
            If start_of_code_google > 0 Then
                code_selection_google = Mid(the_html_code_google, start_of_code_google + 11, Len(the_html_code_google))
                the_html_code_google = Mid(the_html_code_google, start_of_code_google + 11, Len(the_html_code_google))
                code_selection_google = Mid(code_selection_google, 1, InStr(code_selection_google, Chr(60)) - 1)
                ListBox1.Items.Add(code_selection_google)
            End If
        Loop Until start_of_code_google = 0

        'Closing the browser
        FirstBrowser.Quit()
        'END OF PART 1

        'START OF PART 2
        'Creating and navigating the IE browser to BBC news page
        Dim SecondBrowser As Object = CreateObject("InternetExplorer.Application")
        SecondBrowser.Visible = True
        SecondBrowser.Navigate("http://www.bbc.com/news")
        Do
            Application.DoEvents()
        Loop Until SecondBrowser.readyState = 4

        'Getting the titles from BBC news page and adding them to ListBox1
        Dim the_html_code_bbc As String = SecondBrowser.Document.Body.InnerHTML
        Dim start_of_code_bbc As Integer 'InStr returns an Integer position
        Dim code_selection_bbc As String
        Do
            Application.DoEvents()
            start_of_code_bbc = InStr(the_html_code_bbc, "title-link__title-text")
            If start_of_code_bbc > 0 Then
                code_selection_bbc = Mid(the_html_code_bbc, start_of_code_bbc + 24, Len(the_html_code_bbc))
                the_html_code_bbc = Mid(the_html_code_bbc, start_of_code_bbc + 24, Len(the_html_code_bbc))
                code_selection_bbc = Mid(code_selection_bbc, 1, InStr(code_selection_bbc, Chr(60)) - 1)
                ListBox1.Items.Add(code_selection_bbc)
            End If
        Loop Until start_of_code_bbc = 0

        'Closing the browser
        SecondBrowser.Quit()
        'END OF PART 2

    End Sub
End Class
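
For reference, the manual extraction logic from Example 2 can be isolated from the browser so it can be checked on its own. This is only a minimal sketch: the module name TitleExtractor, the function ExtractTitles, its parameters, and the sample HTML string are all hypothetical and not part of my project. It uses String.IndexOf instead of InStr/Mid, and it resumes the search at the next "<" rather than right after the marker, so it is not a line-for-line copy of the loop above.

```vb
Imports System
Imports System.Collections.Generic

Module TitleExtractor

    ' Hypothetical helper: for each occurrence of marker, skips skipLength
    ' characters from the start of the marker and collects the text up to
    ' the next "<" (the same idea as the InStr/Mid loop in Example 2).
    Function ExtractTitles(html As String, marker As String, skipLength As Integer) As List(Of String)
        Dim titles As New List(Of String)
        Dim position As Integer = html.IndexOf(marker, StringComparison.Ordinal)
        While position >= 0
            Dim textStart As Integer = position + skipLength
            If textStart >= html.Length Then Exit While
            Dim textEnd As Integer = html.IndexOf("<"c, textStart)
            If textEnd < 0 Then Exit While
            titles.Add(html.Substring(textStart, textEnd - textStart))
            ' Continue searching after the text that was just collected
            position = html.IndexOf(marker, textEnd, StringComparison.Ordinal)
        End While
        Return titles
    End Function

    Sub Main()
        ' Made-up sample markup, not taken from either website
        Dim sample As String = "<span class=""titletext"">First headline</span>" &
                               "<span class=""titletext"">Second headline</span>"
        ' 11 = length of "titletext" plus the closing quote and ">"
        For Each title As String In ExtractTitles(sample, "titletext", 11)
            Console.WriteLine(title)
        Next
    End Sub
End Module
```

For the BBC page the same call would use the marker "title-link__title-text" with a skip length of 24, matching the offsets in Example 2. This part runs fine on a plain string, which suggests the failure is in reading Document.Body.InnerHTML from the second browser object and not in the string handling.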

Another thing worth mentioning: if I use one parsing method for the Google part and a different method for the BBC part, everything works.

I must be missing something due to my inexperience with Visual Studio. I am using Visual Studio Express 2013 for Windows Desktop. If you know what is causing this issue, I would greatly appreciate your advice.

vb.net
web-scraping
asked on Stack Overflow Mar 22, 2016 by tanbox

User contributions licensed under CC BY-SA 3.0