AppleScript: parsing HTML from a site
What I'm trying to do is get the names of TV shows from a Wikipedia page.
OK, this is what I did first:
property showsweblist : {}

tell application "Safari"
    set loaddelay to 2 -- in seconds; test for your system
    make new document at end of every document
    set URL of document 1 to "http://en.wikipedia.org/wiki/list_of_television_programs_by_name"
    delay loaddelay
    set nrofuls to do JavaScript "document.getElementById('mw-content-text').querySelectorAll('ul').length;" in document 1
    set nrofuls to nrofuls - 1 as number
    log nrofuls
    repeat with ws from 1 to nrofuls
        delay loaddelay
        set nroflis to do JavaScript "document.getElementById('mw-content-text').getElementsByTagName('ul')[" & ws & "].querySelectorAll('li').length;" in document 1
        set nroflis to nroflis - 1 as number
        log nroflis
        repeat with rs from 0 to nroflis
            delay 0.3
            set ashow to do JavaScript "document.getElementById('mw-content-text').getElementsByTagName('ul')[" & ws & "].getElementsByTagName('li')[" & rs & "].getElementsByTagName('i')[0].getElementsByTagName('a')[0].innerHTML;" in document 1
            -- skip entries where the JavaScript returned nothing usable
            if ashow is not missing value and ashow is not "" then
                copy ashow to the end of showsweblist
            end if
        end repeat
    end repeat
end tell
And it works the way I want it to. The problem is that it takes 15 minutes until it's done, and I have to keep the Safari document in front the whole time. So I thought I would grab the whole page source and parse it myself. That is not easy. This is how the code looks now:
tell application "Safari"
    make new document at end of every document
    set URL of document 1 to "http://en.wikipedia.org/wiki/list_of_television_programs_by_name"
    delay 4
    set orghtml to do JavaScript "document.getElementById('mw-content-text').innerHTML;" in document 1
    set orghtml to orghtml as text
    set readytext to extractbetween(orghtml, "<li><i><a ", "</a></i></li>")
    log (item 0 of readytext)
    set removearray to extractbetween(readytext, "href", ">")
    set completearray to {}
    repeat with rt from 0 to (count readytext)
        repeat with ra from 0 to (count removearray)
            if (item ra of removearray) is in (item rt of readytext) then
                set completename to trim_line((item rt of readytext), (item ra of removearray), 1)
                set end of completearray to completename
            end if
        end repeat
    end repeat
    log completearray
end tell

on extractbetween(searchtext, starttext, endtext)
    set tid to AppleScript's text item delimiters -- save them for later.
    set AppleScript's text item delimiters to starttext -- find the first one.
    set liste to text items of searchtext
    set AppleScript's text item delimiters to endtext -- find the end one.
    set extracts to {}
    repeat with subtext in liste
        if subtext contains endtext then
            copy text item 1 of subtext to end of extracts
        end if
    end repeat
    set AppleScript's text item delimiters to tid -- back to original values.
    return extracts
end extractbetween

on trim_line(this_text, trim_chars, trim_indicator)
    -- 0 = beginning, 1 = end, 2 = both
    set x to the length of the trim_chars
    -- trim beginning
    if the trim_indicator is in {0, 2} then
        repeat while this_text begins with the trim_chars
            try
                set this_text to characters (x + 1) thru -1 of this_text as string
            on error
                -- the text contains nothing but the trim characters
                return ""
            end try
        end repeat
    end if
    -- trim ending
    if the trim_indicator is in {1, 2} then
        repeat while this_text ends with the trim_chars
            try
                set this_text to characters 1 thru -(x + 1) of this_text as string
            on error
                -- the text contains nothing but the trim characters
                return ""
            end try
        end repeat
    end if
    return this_text
end trim_line
It's not smooth, and it's not working. Somehow it seems I can't get the items out of the list, because it doesn't see each one as a list item. Can anyone help me out?
Cheers
I recommend a different approach. Download the page source and grab the title between the tags. The whole script takes under 2 seconds. Start with this:
property baseurl : "http://en.wikipedia.org/wiki/list_of_television_programs_by_name"

set rawhtml to do shell script "curl '" & baseurl & "'"
set pretag to "\" title=\"" -- " title="
set otid to AppleScript's text item delimiters
set AppleScript's text item delimiters to pretag
set rawlist to text items of rawhtml
set namelist to {}
repeat with eachline in rawlist
    set theoff to offset of ">" in eachline
    set thisname to text 1 thru (theoff - 2) of eachline
    -- add error checking here to skip the opening and non-title hits, and to fine-tune the precise title string
    set namelist to namelist & return & thisname
end repeat
set AppleScript's text item delimiters to otid
return namelist
Add a little error checking, and tweak the pretag and posttag to whatever fits best.
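As a rough sketch of what that error checking might look like: the handler below is hypothetical (the name titlesfromhtml and the choice of a closing double quote as posttag are my assumptions, not part of the answer above). It skips the first text item, which is only the markup before the first hit, and ignores any chunk with no closing quote or an empty title.

```applescript
-- Hypothetical helper: collect the values of title="..." attributes from HTML text.
on titlesfromhtml(rawhtml)
	set pretag to "\" title=\"" -- the text just before each title
	set posttag to "\"" -- assumption: a title ends at the next double quote
	set otid to AppleScript's text item delimiters
	set AppleScript's text item delimiters to pretag
	set rawlist to text items of rawhtml
	set AppleScript's text item delimiters to otid -- restore right away
	set namelist to {}
	-- Item 1 is whatever precedes the first pretag, so it is never a title.
	repeat with i from 2 to (count of rawlist)
		set chunk to item i of rawlist
		set endoff to offset of posttag in chunk
		-- Skip hits with no closing quote (0) or an empty title (1).
		if endoff > 1 then
			set end of namelist to text 1 thru (endoff - 1) of chunk
		end if
	end repeat
	return namelist
end titlesfromhtml

-- Small self-contained check on one sample list entry:
set samplehtml to "<li><i><a href=\"/wiki/Lost\" title=\"Lost\">Lost</a></i></li>"
titlesfromhtml(samplehtml)
```

For the sample line this should return a one-item list containing Lost. To run it on the live page, something like `titlesfromhtml(do shell script "curl -sL " & quoted form of baseurl)` should work, where -L lets curl follow a redirect if the site sends one. It will still pick up every title attribute on the page, including navigation links, and titles may contain HTML entities such as &amp;amp;, so some filtering is left to do.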