Iterate through web pages and download PDFs


I have code that crawls through all PDF files on a web page and downloads them to a folder. However, it has now started to throw an error:

System.NullReferenceException HResult=0x80004003 Message=Object reference not set to an instance of an object. Source=NW Crawler
StackTrace: at NW_Crawler.Program.Main(String[] args) in C:\Users\PC\source\repos\NW Crawler\NW Crawler\Program.cs:line 16

The exception points to ProductListPage in foreach (HtmlNode src in ProductListPage).

Is there any hint on how to fix this issue? I have tried to implement async/await without success; maybe I was doing something wrong, though...

Here is the process to be done:

  1. Go to https://www.nordicwater.com/products/waste-water/
  2. List all links in the related-products section. They look like: <a class="ap-area-link" href="https://www.nordicwater.com/product/mrs-meva-multi-rake-screen/">MRS MEVA multi rake screen</a>
  3. Proceed to each link and search for PDF files. PDF files are in:

         <div class="dl-items">
           <a href="https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf" download="">

Here is my full code for testing:

using HtmlAgilityPack;
using System;
using System.Net;


namespace NW_Crawler
{
    class Program
    {
        static void Main(string[] args)
        {

            {
                HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
                HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']//a");
                Console.WriteLine("Here are the links:" + ProductListPage);
                foreach (HtmlNode src in ProductListPage)
                {
                    htmlDoc = new HtmlWeb().Load(src.Attributes["href"].Value);

                    // Thread.Sleep(5000); // wait some time

                    HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
                    if (LinkTester != null)
                    {
                        foreach (var dllink in LinkTester)
                        {
                            string LinkURL = dllink.Attributes["href"].Value;
                            Console.WriteLine(LinkURL);

                            string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/"));
                            var DLClient = new WebClient();

                            // Thread.Sleep(5000); // wait some time

                            DLClient.DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
                        }
                    }
                }
            }

        }
    }
}
c#
web-scraping
web-crawler
html-agility-pack
asked on Stack Overflow Dec 27, 2019 by hatman • edited Dec 27, 2019 by hatman

2 Answers


Made a couple of changes to cover the errors you might be seeing.

Changes

  1. Use src.GetAttributeValue("href", string.Empty) instead of src.Attributes["href"].Value. If the href attribute is not present, the latter throws "Object reference not set to an instance of an object".
  2. Check if ProductListPage is valid and not null.
  3. ExtractFilename includes the leading / in the name. Add + 1 to the Substring start index so the character at LastIndexOf("/") is skipped.
  4. Move on to the next iteration if the href is null or empty in either of the loops.
  5. Changed the product-list query to //a[@class='ap-area-link'] from //a[@class='ap-area-link']//a. The original query searched for an <a> inside the <a> tag, which matches nothing, so SelectNodes returned null. Still, if you want to query it that way, the first if statement checking ProductListPage != null will take care of the error.
    HtmlDocument htmlDoc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
    HtmlNodeCollection ProductListPage = htmlDoc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");
    if (ProductListPage != null)
        foreach (HtmlNode src in ProductListPage)
        {
            string href = src.GetAttributeValue("href", string.Empty);
            if (string.IsNullOrEmpty(href))
                continue;
            htmlDoc = new HtmlWeb().Load(href);
            HtmlNodeCollection LinkTester = htmlDoc.DocumentNode.SelectNodes("//div[@class='dl-items']//a");
            if (LinkTester != null)
                foreach (var dllink in LinkTester)
                {
                    string LinkURL = dllink.GetAttributeValue("href", string.Empty);
                    if (string.IsNullOrEmpty(LinkURL))
                        continue;
                    string ExtractFilename = LinkURL.Substring(LinkURL.LastIndexOf("/") + 1);
                    new WebClient().DownloadFileAsync(new Uri(LinkURL), @"C:\temp\" + ExtractFilename);
                }
        }
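
One caveat worth noting (not raised in the answer above): WebClient.DownloadFileAsync returns immediately, and because Main exits as soon as the loops finish, the process can terminate before the files have finished downloading. Below is a minimal sketch of a synchronous alternative; the PdfDownloader class and DownloadPdf helper are hypothetical names, and it assumes the target folder C:\temp\ already exists.

    using System;
    using System.Net;

    class PdfDownloader
    {
        // Hypothetical helper, not part of the original answer: downloads one PDF
        // synchronously so the process cannot exit before the transfer finishes.
        static void DownloadPdf(string linkUrl)
        {
            string fileName = linkUrl.Substring(linkUrl.LastIndexOf("/") + 1);
            using (var client = new WebClient())   // WebClient is IDisposable
            {
                // Blocks until the file is fully written to C:\temp\
                client.DownloadFile(new Uri(linkUrl), @"C:\temp\" + fileName);
            }
            Console.WriteLine("Saved " + fileName);
        }

        static void Main()
        {
            // Example call with one of the PDFs from the question
            DownloadPdf("https://www.nordicwater.com/wp-content/uploads/2016/04/S1126-MRS-brochure-EN.pdf");
        }
    }

On newer C# versions you could instead await client.DownloadFileTaskAsync(...) from an async Main, which keeps the downloads asynchronous while still waiting for them to complete.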

answered on Stack Overflow Dec 27, 2019 by Jawad • edited Dec 27, 2019 by Jawad

The XPath that you used seems to be incorrect. I tried loading the web page in a browser and searched for the XPath, and got no results. I replaced it with //a[@class='ap-area-link'] and was able to find matching elements.
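
For reference, here is a minimal sketch (not part of the answer) that loads the listing page with HtmlAgilityPack and prints whatever the corrected XPath matches, so the result can be checked without a browser; the XPathCheck class name is just for illustration.

    using System;
    using HtmlAgilityPack;

    class XPathCheck
    {
        static void Main()
        {
            var doc = new HtmlWeb().Load("https://www.nordicwater.com/products/waste-water/");
            // SelectNodes returns null (not an empty collection) when nothing matches
            var nodes = doc.DocumentNode.SelectNodes("//a[@class='ap-area-link']");
            Console.WriteLine("Matches: " + (nodes == null ? 0 : nodes.Count));
            if (nodes != null)
                foreach (var node in nodes)
                    Console.WriteLine(node.GetAttributeValue("href", string.Empty));
        }
    }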


answered on Stack Overflow Dec 27, 2019 by Ram Kumaran

User contributions licensed under CC BY-SA 3.0