Iterate through web pages and download PDFs


I have code that crawls through all PDF files on a web page and downloads them to a folder. However, it has now started to throw an error:

System.NullReferenceException HResult=0x80004003 Message=Object reference not set to an instance of an object. Source=NW Crawler
StackTrace: at NW_Crawler.Program.Main(String[] args) in C:\Users\PC\source\repos\NW Crawler\NW Crawler\Program.cs:line 16

The exception points to ProductListPage in foreach (HtmlNode src in ProductListPage).

Is there any hint on how to fix this issue? I have tried to implement async/await without success; maybe I was doing something wrong, though...

Here is the process to be done:

  1. Go to https://www.nordicwater.com/products/waste-water/
  2. List all links in the related-products section. They look like: <a class="ap-area-link" href="https://www.nordicwater.com/product/mrs-meva-multi-rake-screen/">MRS MEVA multi rake screen</a>
  3. Proceed to each link and search for PDF files. PDF files are in:

         <div class="dl-items">
           <a href="https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf" download="">

Here is my full code for testing:

using HtmlAgilityPack;
using System;
using System.Net;


namespace NW_Crawler
{
    class Program
    {
        static void Main(string[] args)
        {

            {
                HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
                HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']//a");
                Console.WriteLine("Here are the links:" + ProductListPage);
                foreach (HtmlNode src in ProductListPage)
                {
                    htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);

                    // Thread.Sleep(5000); // wait some time

                    HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
                    if (LinkTester != null)
                    {
                        foreach (var dllink in LinkTester)
                        {
                            string LinkURL = dllink.Attributes["href"].Value;
                            Console.WriteLine(LinkURL);

                            string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
                            var DLClient = new WebClient();

                            // Thread.Sleep(5000); // wait some time

                            DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
                        }
                    }
                }
            }

        }
    }
}
c#
web-scraping
web-crawler
html-agility-pack
asked on Stack Overflow Dec 27, 2019 by hatman • edited Dec 27, 2019 by hatman

2 Answers


Made a couple of changes to cover the errors you might be seeing.

Changes

  1. Use src.GetAttributeValue("href", string.Empty) instead of src.Attributes["href"].Value. If the href attribute is not present, the latter throws "Object reference not set to an instance of an object".
  2. Check if ProductListPage is valid and not null.
  3. ExtractFilename includes the leading / in the name. Add + 1 to the Substring start index so the character at LastIndexOf("/") is skipped.
  4. Move on to the next iteration if the href is null or empty in either of the loops.
  5. Changed the product-list query to //a[@class='ap-area-link'] from //a[@class='ap-area-link']//a. The original query searched for an <a> inside the <a> tag, which matches nothing, so SelectNodes returned null. Still, if you want to query it that way, the first if statement checking ProductListPage != null will take care of the error.
    HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
    HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");
    if (ProductListPage != null)
        foreach (HtmlNode src in ProductListPage)
        {
            string href = src.GetAttributeValue("href", string.Empty);
            if (string.IsNullOrEmpty(href))
                continue;
            htmlDoc = new HtmlWeb().Load(href);
            HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
            if (LinkTester != null)
                foreach (var dllink in LinkTester)
                {
                    string LinkURL = dllink.GetAttributeValue("href", string.Empty);
                    if (string.IsNullOrEmpty(LinkURL))
                        continue;
                    string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/") + 1);
                    new WebClient().DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
                }
        }
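
One caveat worth noting (not raised in the answer above): WebClient.DownloadFileAsync returns immediately, and because Main exits as soon as the loops finish, the process can terminate before the files have finished downloading. Below is a minimal sketch of a synchronous alternative; the PdfDownloader class and DownloadPdf helper are hypothetical names, and it assumes the target folder C:\temp\ already exists.

    using System;
    using System.Net;

    class PdfDownloader
    {
        // Hypothetical helper, not part of the original answer: downloads one PDF
        // synchronously so the process cannot exit before the transfer finishes.
        static void DownloadPdf(string linkUrl)
        {
            string fileName = linkUrl.Substring(linkUrl.LastIndexOf("/") + 1);
            using (var client = new WebClient())   // WebClient is IDisposable
            {
                // Blocks until the file is fully written to C:\temp\
                client.DownloadFile(new Uri(linkUrl), @"C:\temp\" + fileName);
            }
            Console.WriteLine("Saved " + fileName);
        }

        static void Main()
        {
            // Example call with one of the PDFs from the question
            DownloadPdf("https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf");
        }
    }

On newer C# versions you could instead await client.DownloadFileTaskAsync(...) from an async Main, which keeps the downloads asynchronous while still waiting for them to complete.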

answered on Stack Overflow Dec 27, 2019 by Jawad • edited Dec 27, 2019 by Jawad

The XPath that you used seems to be incorrect. I tried loading the web page in a browser and searched for the XPath, and got no results. I replaced it with //a[@class='ap-area-link'] and was able to find matching elements.
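
For reference, here is a minimal sketch (not part of the answer) that loads the listing page with HtmlAgilityPack and prints whatever the corrected XPath matches, so the result can be checked without a browser; the XPathCheck class name is just for illustration.

    using System;
    using HtmlAgilityPack;

    class XPathCheck
    {
        static void Main()
        {
            var doc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
            // SelectNodes returns null (not an empty collection) when nothing matches
            var nodes = doc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");
            Console.WriteLine("Matches: " + (nodes == null ? 0 : nodes.Count));
            if (nodes != null)
                foreach (var node in nodes)
                    Console.WriteLine(node.GetAttributeValue("href", string.Empty));
        }
    }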


answered on Stack Overflow Dec 27, 2019 by Ram Kumaran

User contributions licensed under CC BY-SA 3.0