C# - How to use HtmlAgilityPack to extract HTML data
I am learning to write a web crawler and found some great examples to get me started, but since I am new to this, I have a few questions regarding the coding method.
The search result example can be found here: search result
When I look at the HTML source of the result, I can see the following:
<hr><center><h3>license information *</h3></center><hr>
<p> <center> 06/03/2014 </center> <br>
<b>name : </b> williams ajaya l <br>
<b>address : </b> new york ny <br>
<b>profession : </b> athletic trainer <br>
<b>license no: </b> 001475 <br>
<b>date of licensure : </b> 01/12/07 <br>
<b>additional qualification : </b> not applicable in profession <br>
<b> <a href="http://www.op.nysed.gov/help.htm#status"> status :</a></b> registered <br>
<b>registered through last day of : </b> 08/15 <br>
How can I use HtmlAgilityPack to scrape the data from the site?
I am trying to implement the example shown below, but I am not sure what to edit to get it working to crawl the page:
private void btnCrawl_Click(object sender, EventArgs e)
{
    // Find the URL of the page currently open in Internet Explorer.
    string url = null;
    foreach (SHDocVw.InternetExplorer ie in shellWindows)
    {
        string filename = Path.GetFileNameWithoutExtension(ie.FullName).ToLower();
        if (filename.Equals("iexplore"))
        {
            url = ie.LocationURL;
            txtUrl.Text = "Now crawling: " + url;
        }
    }
    if (url == null) return; // no Internet Explorer window was found

    string xmlns = "{http://www.w3.org/1999/xhtml}";
    Crawler cl = new Crawler(url);
    XDocument xdoc = cl.GetXDocument();
    var res = from item in xdoc.Descendants(xmlns + "div")
              where item.Attribute("class") != null
                    && item.Attribute("class").Value == "folder-news"
                    && item.Element(xmlns + "a") != null
              //select item;
              select new
              {
                  link = item.Element(xmlns + "a").Attribute("href").Value,
                  image = item.Element(xmlns + "a").Element(xmlns + "img").Attribute("src").Value,
                  title = item.Elements(xmlns + "p").ElementAt(0).Element(xmlns + "a").Value,
                  desc = item.Elements(xmlns + "p").ElementAt(1).Value
              };
    foreach (var node in res)
    {
        MessageBox.Show(node.ToString());
        tb.Text = node + "\n";
    }
    //Console.ReadKey();
}
The Crawler helper class:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;

namespace CrawlerWeb
{
    public class Crawler
    {
        public string Url { get; set; }

        public Crawler() { }

        public Crawler(string url)
        {
            this.Url = url;
        }

        // Downloads the page with HtmlAgilityPack and re-parses it as an XDocument.
        public XDocument GetXDocument()
        {
            HtmlAgilityPack.HtmlWeb doc1 = new HtmlAgilityPack.HtmlWeb();
            doc1.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
            HtmlAgilityPack.HtmlDocument doc2 = doc1.Load(Url);
            doc2.OptionOutputAsXml = true;
            doc2.OptionAutoCloseOnEnd = true;
            doc2.OptionDefaultStreamEncoding = System.Text.Encoding.UTF8;
            XDocument xdoc = XDocument.Parse(doc2.DocumentNode.SelectSingleNode("html").OuterHtml);
            return xdoc;
        }
    }
}
In a multiline textbox... I would like to display the following:
williams ajaya l
new york ny
athletic trainer
license no
date of licensure
additional qualification
not applicable in profession
registered through last day of
I want the second argument added to an array, because the next step is to write it to a SQL database...
I am able to get the URL from IE that has the search result, but how can I code this in my script?
This little snippet should get you started:
HtmlDocument doc = new HtmlDocument();
WebClient client = new WebClient();
string html = client.DownloadString("http://www.nysed.gov/coms/op001/opsc2a?profcd=67&plicno=001475&namechk=wil");
doc.LoadHtml(html);
HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//div");
You can use the WebClient class to download the HTML file and then load the HTML into an HtmlDocument object. Then you need to use XPath to query the DOM tree and search for nodes. In the above example, "nodes" will include all the div elements in the document.
Here's a quick reference for XPath syntax: http://msdn.microsoft.com/en-us/library/ms256086(v=vs.110).aspx
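The markup quoted in the question has no div elements around the license fields, so a query for "//b" may be a better fit for that page. As a rough sketch only: the class name LicenseScraper, the helper ParseFields, and the trimmed SampleHtml constant are all made up for illustration, and the code assumes every label sits in a <b> element that is immediately followed by its value as plain text.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class LicenseScraper
{
    // Trimmed copy of the markup quoted in the question, so the sketch runs offline.
    public const string SampleHtml =
        "<hr><center><h3>license information *</h3></center><hr> <p> " +
        "<b>name : </b> williams ajaya l <br> " +
        "<b>address : </b> new york ny <br> " +
        "<b>license no: </b> 001475 <br> " +
        "<b> <a href=\"http://www.op.nysed.gov/help.htm#status\"> status :</a></b> registered <br>";

    // Hypothetical helper: pairs each <b> label with the text node that follows it.
    public static List<KeyValuePair<string, string>> ParseFields(string html)
    {
        var fields = new List<KeyValuePair<string, string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection labels = doc.DocumentNode.SelectNodes("//b");
        if (labels == null) return fields;

        foreach (HtmlNode label in labels)
        {
            // "name : " becomes "name"; the nested <a> in the status label is flattened by InnerText.
            string name = HtmlEntity.DeEntitize(label.InnerText).Trim().TrimEnd(':').Trim();

            // Assumption: the value is the text node directly after the closing </b>.
            HtmlNode next = label.NextSibling;
            string value = (next != null && next.NodeType == HtmlNodeType.Text)
                ? HtmlEntity.DeEntitize(next.InnerText).Trim()
                : string.Empty;

            fields.Add(new KeyValuePair<string, string>(name, value));
        }
        return fields;
    }

    public static void Main()
    {
        foreach (var field in ParseFields(SampleHtml))
            Console.WriteLine(field.Key + " = " + field.Value);
    }
}
```

To try it against the live page, pass the string returned by client.DownloadString(...) from the snippet above to ParseFields instead of SampleHtml. The list of label/value pairs could then be written to the textbox or mapped to SQL parameters, though the real page may need extra handling if its markup differs from the excerpt.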