C#/.NET: Use HtmlDocument from a console application?

4

I'm trying to use System.Windows.Forms.HtmlDocument in a console application. First, is this even possible? If so, how can I load a page from the web into it? I was trying to use WebBrowser, but it's telling me:

Unhandled Exception: System.Threading.ThreadStateException: ActiveX control '8856f961-340a-11d0-a96b-00c04fd705a2' cannot be instantiated because the current thread is not in a single-threaded apartment.

There seems to be a severe lack of tutorials on the HTMLDocument object (or Google is just turning up useless results).


I just discovered mshtml.HTMLDocument.createDocumentFromUrl, but that throws:

Unhandled Exception: System.Runtime.InteropServices.COMException (0x80010105): The server threw an exception. (Exception from HRESULT: 0x80010105 (RPC_E_SERVERFAULT))
   at System.RuntimeType.ForwardCallToInvokeMember(String memberName, BindingFlags flags, Object target, Int32[] aWrapperTypes, MessageData& msgData)
   at mshtml.HTMLDocumentClass.createDocumentFromUrl(String bstrUrl, String bstrOptions)
   at iget.Program.Main(String[] args)

What the heck? All I want is a list of <a> tags on a page. Why is this so hard?


For those that are curious, here's the solution I came up with, thanks to TrueWill:

using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

namespace iget
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();

            // Dispose the client and the response stream once the page is parsed.
            using (WebClient wc = new WebClient())
            using (Stream stream = wc.OpenRead("http://google.com"))
            {
                doc.Load(stream);
            }

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links == null)
            {
                return;
            }

            foreach (HtmlNode a in links)
            {
                Console.WriteLine(a.Attributes["href"].Value);
            }
        }
    }
}
c#
.net
console
asked on Stack Overflow Nov 22, 2009 by mpen • edited May 23, 2017 by Community

3 Answers

6

As an alternative, you could use the free Html Agility Pack library. That can parse HTML and will let you query it with LINQ. I used an older version for a project at home and it worked great.

EDIT: You may also want to use the WebClient or WebRequest classes to download the web page. See my blog post on Web scraping in .NET. (Note that I haven't tried this in a console app.)
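To illustrate the LINQ-style querying mentioned above, here is a minimal sketch, not a definitive implementation. It assumes the HtmlAgilityPack package is referenced; the class name LinkLister is hypothetical, and the URL is simply the one used in the question.

```csharp
using System;
using System.Linq;
using System.Net;
using HtmlAgilityPack;

// Hypothetical example class; requires a reference to HtmlAgilityPack.
class LinkLister
{
    static void Main()
    {
        string html;

        // Download the raw HTML as a string.
        using (var wc = new WebClient())
        {
            html = wc.DownloadString("http://google.com");
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // LINQ to Objects over the parsed tree: every <a> that has an href.
        var hrefs = doc.DocumentNode
            .Descendants("a")
            .Where(a => a.Attributes["href"] != null)
            .Select(a => a.Attributes["href"].Value);

        foreach (string href in hrefs)
        {
            Console.WriteLine(href);
        }
    }
}
```

Unlike SelectNodes, Descendants yields an empty sequence when there are no matches, so no null check is needed before the loop.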

answered on Stack Overflow Nov 22, 2009 by TrueWill • edited Nov 22, 2009 by TrueWill

3

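Add the [STAThread] attribute to your Main method. WebBrowser wraps an ActiveX control, which has to be created on a single-threaded apartment (STA) thread, and a C# console application's main thread is not STA unless you mark it: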

    [STAThread]
    static void Main(string[] args)
    {
    }

That should fix it.
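For completeness, here is a rough sketch of how the WebBrowser route might look in a console application once the thread is STA. It assumes a reference to System.Windows.Forms, and it also assumes a message loop is needed so that navigation events are delivered, which is why Application.Run() is called. The URL is the one from the question.

```csharp
using System;
using System.Windows.Forms;

static class Program
{
    [STAThread]
    static void Main()
    {
        using (var browser = new WebBrowser())
        {
            browser.ScriptErrorsSuppressed = true;

            browser.DocumentCompleted += (sender, e) =>
            {
                // DocumentCompleted can fire once per frame; wait until the
                // whole page reports that it has finished loading.
                if (browser.ReadyState != WebBrowserReadyState.Complete)
                {
                    return;
                }

                // Document is a System.Windows.Forms.HtmlDocument.
                foreach (HtmlElement link in browser.Document.Links)
                {
                    Console.WriteLine(link.GetAttribute("href"));
                }

                // End the message loop so Main can return.
                Application.ExitThread();
            };

            browser.Navigate("http://google.com");

            // Pump messages until ExitThread is called in the handler above.
            Application.Run();
        }
    }
}
```

This hosts a full browser control just to read links, so for plain scraping a parser-only approach is usually lighter.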

answered on Stack Overflow Nov 22, 2009 by chris.w.mclean

-1

If the page is well-formed XHTML, load it into an XDocument and pull the anchor tags out of that. If the anchor tags are all you need, you could also extract them with a regular expression.
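A small sketch of the XDocument approach, under the assumption that the markup is well-formed XML. The inline XHTML string and the class name are made up for illustration. Elements in XHTML live in the XHTML namespace, so the query has to qualify the element name.

```csharp
using System;
using System.Xml.Linq;

// Hypothetical example class operating on a made-up XHTML snippet.
class XhtmlLinks
{
    static void Main()
    {
        string page =
            "<html xmlns='http://www.w3.org/1999/xhtml'><body>" +
            "<a href='http://example.com/'>Example</a>" +
            "<a name='top'>No href here</a>" +
            "</body></html>";

        XDocument doc = XDocument.Parse(page);
        XNamespace xhtml = "http://www.w3.org/1999/xhtml";

        // All <a> elements in the XHTML namespace, then their href attributes;
        // anchors without an href simply contribute nothing.
        foreach (XAttribute href in doc.Descendants(xhtml + "a").Attributes("href"))
        {
            Console.WriteLine(href.Value); // prints http://example.com/
        }
    }
}
```

XDocument.Parse throws an XmlException on markup that is not well-formed, which is common for pages found on the web, so this only works when the XHTML assumption really holds.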

answered on Stack Overflow Nov 22, 2009 by Wil P

User contributions licensed under CC BY-SA 3.0