I'm trying to use System.Windows.Forms.HtmlDocument in a console application. First, is this even possible? If so, how can I load a page from the web into it? I tried using WebBrowser, but it fails with:
Unhandled Exception: System.Threading.ThreadStateException: ActiveX control '8856f961-340a-11d0-a96b-00c04fd705a2' cannot be instantiated because the current thread is not in a single-threaded apartment.
There seems to be a severe lack of tutorials on the HtmlDocument object (or Google is just turning up useless results).
I just discovered mshtml.HTMLDocument.createDocumentFromUrl, but that throws:
Unhandled Exception: System.Runtime.InteropServices.COMException (0x80010105): The server threw an exception. (Exception from HRESULT: 0x80010105 (RPC_E_SERVERFAULT))
   at System.RuntimeType.ForwardCallToInvokeMember(String memberName, BindingFlags flags, Object target, Int32[] aWrapperTypes, MessageData& msgData)
   at mshtml.HTMLDocumentClass.createDocumentFromUrl(String bstrUrl, String bstrOptions)
   at iget.Program.Main(String[] args)
What the heck? All I want is a list of <a> tags on a page. Why is this so hard?
For those that are curious, here's the solution I came up with, thanks to TrueWill:
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

namespace iget
{
    class Program
    {
        static void Main(string[] args)
        {
            HtmlDocument doc = new HtmlDocument();

            // Download the page and parse it; dispose the client and stream when done.
            using (WebClient wc = new WebClient())
            using (Stream stream = wc.OpenRead("http://google.com"))
            {
                doc.Load(stream);
            }

            // SelectNodes returns null (not an empty collection) when nothing matches.
            HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
            if (links == null)
            {
                return;
            }

            foreach (HtmlNode a in links)
            {
                Console.WriteLine(a.Attributes["href"].Value);
            }
        }
    }
}
As an alternative, you could use the free Html Agility Pack library. That can parse HTML and will let you query it with LINQ. I used an older version for a project at home and it worked great.
EDIT: You may also want to use the WebClient or WebRequest classes to download the web page. See my blog post on Web scraping in .NET. (Note that I haven't tried this in a console app.)
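To illustrate the LINQ-style querying mentioned in this answer, here is a minimal sketch. It assumes the HtmlAgilityPack package is referenced; the LinkExtractor class and ExtractHrefs method are hypothetical names made up for this example, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, for illustration only.
static class LinkExtractor
{
    // Parses an HTML string and returns the href value of every <a> that has one.
    public static List<string> ExtractHrefs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        return doc.DocumentNode
                  .Descendants("a")                          // every <a> element, in document order
                  .Where(a => a.Attributes["href"] != null)  // skip anchors without an href
                  .Select(a => a.Attributes["href"].Value)
                  .ToList();
    }
}
```

The HTML string could come from WebClient.DownloadString, for example LinkExtractor.ExtractHrefs(new WebClient().DownloadString(url)), which keeps the download step separate from the parsing step.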
Add the [STAThread] attribute to your Main method:
    [STAThread]
    static void Main(string[] args)
    {
    }
That should fix it.
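The attribute removes the apartment exception, but a WebBrowser in a console app most likely also needs a message loop before it will finish loading a page. A rough sketch under that assumption follows; it requires a reference to System.Windows.Forms, the URL is a placeholder, and this is untested in the asker's setup.

```csharp
using System;
using System.Windows.Forms;

class Program
{
    [STAThread] // WebBrowser wraps an ActiveX control, which needs a single-threaded apartment
    static void Main(string[] args)
    {
        using (var browser = new WebBrowser())
        {
            browser.ScriptErrorsSuppressed = true; // avoid script error dialogs

            browser.DocumentCompleted += (sender, e) =>
            {
                // DocumentCompleted can fire once per frame; wait for the whole page.
                if (browser.ReadyState != WebBrowserReadyState.Complete)
                {
                    return;
                }

                foreach (HtmlElement a in browser.Document.GetElementsByTagName("a"))
                {
                    Console.WriteLine(a.GetAttribute("href"));
                }

                Application.ExitThread(); // stop the message loop started below
            };

            browser.Navigate("http://example.com"); // placeholder URL
            Application.Run();                      // pump messages until ExitThread is called
        }
    }
}
```

If all you need is the links, a plain HTML parser is probably simpler than hosting a browser control this way.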
If it's XHTML, load it into an XDocument and parse the anchor tags out. If all you need is the anchor tags, you could also do it with a regex.
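A minimal sketch of the XDocument approach, assuming the input is well-formed XHTML that declares the standard XHTML namespace. The XhtmlLinks class and the sample markup are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, for illustration only.
static class XhtmlLinks
{
    // XHTML elements live in this namespace, so element names must be qualified with it.
    static readonly XNamespace Xhtml = "http://www.w3.org/1999/xhtml";

    public static List<string> ExtractHrefs(string xhtml)
    {
        XDocument doc = XDocument.Parse(xhtml); // throws XmlException if the markup is not well-formed

        return doc.Descendants(Xhtml + "a")
                  .Select(a => (string)a.Attribute("href")) // null when the attribute is missing
                  .Where(href => href != null)
                  .ToList();
    }
}

class Demo
{
    static void Main()
    {
        string page =
            "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>" +
            "<a href=\"/one\">One</a><a name=\"anchor\">No href</a>" +
            "<p><a href=\"/two\">Two</a></p>" +
            "</body></html>";

        // Prints "/one" then "/two".
        foreach (string href in XhtmlLinks.ExtractHrefs(page))
        {
            Console.WriteLine(href);
        }
    }
}
```

This only works when the page really is well-formed XML. Most pages on the web are not, and a page with a DOCTYPE may also need DTD processing enabled on the reader, so a tolerant HTML parser is usually the safer choice.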