Python 3.x BeautifulSoup crawling image URLs
I'm trying to crawl image URLs in Python.
Checking the Google Image Search window in the browser's developer tools, I can see about 100 image URLs, and more URLs appear as I scroll down. That part is fine, though.
The problem is that I only got 20 URLs.
I opened the requested address and saved it as an HTML file, and I confirmed that there are only 20 URLs in it.
I think there are only 20 image URLs in the response, so only 20 are output.
How can I get more image URLs?
This is my source code.
#-*- coding: utf-8 -*-
import urllib.request
from bs4 import BeautifulSoup

if __name__ == "__main__":
    print("Crawling!!!!!!!!!!!!!!!")

    hdr = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0)',
           'Referer': 'http://google.com',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Encoding': 'none',
           'Connection': 'keep-alive'}

    inputSearch = "sites:pinterest+white+jeans"
    req = urllib.request.Request("https://www.google.co.kr/searchhl=ko&site=imghp&tbm=isch&source=hp&biw=1600&bih=770&q=" + inputSearch, headers=hdr)
    data = urllib.request.urlopen(req).read()
    bs = BeautifulSoup(data, "html.parser")

    for img in bs.find_all('img'):
        print(img.get('src'))
Your link is wrong. You can use the following code and see if it fits your needs.
You just have to pass a searchTerm, and the program will open the Google page and fetch the URLs of 20 images.
Code:
def get_images_links(searchTerm):
    import requests
    from bs4 import BeautifulSoup

    searchUrl = "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(searchTerm)
    d = requests.get(searchUrl).text
    soup = BeautifulSoup(d, 'html.parser')

    img_tags = soup.find_all('img')

    imgs_urls = []
    for img in img_tags:
        if img['src'].startswith("http"):
            imgs_urls.append(img['src'])

    return imgs_urls
Usage:
get_images_links('computer')
Output:
['https://encrypted-tbn0.gstatic.com/images?q=tbn:and9gcseq5kkisog6zsm2bsrwenyhpzepmoyiilzqf6qfwkzsvuoz5rhoya75dm', 'https://encrypted-tbn1.gstatic.com/images?q=tbn:and9gctbuesihyt4cgasiudruqvvmzubfcug_iv92nxjzpmtpe5v2g626bge0g0', 'https://encrypted-tbn1.gstatic.com/images?q=tbn:and9gcryz8c6luaiyuasxkmroh8dc56afemy63m8fw8-zdutb5edpw1hl0y3xg', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:and9gct33qnycx0ghqhfqs7masrk9uvp6d66vld2djhffql4p6phzcjlxksx0wnt', 'https://encrypted-tbn3.gstatic.com/images?q=tbn:and9gcruf11clrzh2wnfiuj3weaom7veme0_glfwoocs3r5gtqdfcfhmgsnqlyo', 'https://encrypted-tbn1.gstatic.com/images?q=tbn:and9gctxctcv4nptbovorbd4i-ujbyjy4kjar5jamvuxcg33clduqop8iufknw', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:and9gctu8mkwwhdgcobqn_h2n3ss7dpvwu3i-ki1sa_4u5yoet-rafok1kb2jpho', 'https://encrypted-tbn1.gstatic.com/images?q=tbn:and9gcqlgu_y_dhu60unyilmiusuojx5_unmcwr2axgj0w6bmvcxuzisscrtpcw', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:and9gcqn7itgvbhd1h9embc0zfdmznu5nt2l-ek1ckmqe4grntylalyttjqxaly', 'https://encrypted-tbn3.gstatic.com/images?q=tbn:and9gcqyfgwd4wr20oimzk9uc0gggi2-7myqau6mjn2gefkpgltaququm4kl0tuqwq', 'https://encrypted-tbn2.gstatic.com/images?q=tbn:and9gcqr0lfraugiadoo5_qolg9zwegxw0otghzbf1yzoihpqkaiy1f3ynx4jne', 'https://encrypted-tbn2.gstatic.com/images?q=tbn:and9gcruok4npppaudjnzl1pewgwlfq25gjvzfsshmoub0qav925kxhg43wjfwc6', 'https://encrypted-tbn3.gstatic.com/images?q=tbn:and9gcr5aqlfb9safbalzp4z2qtolewqeujqas3ewnhi6fahrcxycpmsjhmivkf8', 'https://encrypted-tbn3.gstatic.com/images?q=tbn:and9gcr6deli7h9dcaxjxjyp7lmoixad5rgo1gblfvq35lewrvpgpoyqj8ccz-4', 'https://encrypted-tbn2.gstatic.com/images?q=tbn:and9gcspqafl2wb-awzilan6nazvzh2xvdu_xjecjqsgodnojdffo7gowhrfd3wu', 'https://encrypted-tbn1.gstatic.com/images?q=tbn:and9gcsb3o5cp8dmk9gqt9wpb1n7q6jtreuwitghlxo65ud5s3xcolj80qudlzw', 'https://encrypted-tbn3.gstatic.com/images?q=tbn:and9gcq18lwmvzzcizvki36buupnbiaa5e4a3tuavdxas6hhj-rod446dmrpph2v', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:and9gcr8xzhvomxcafqehhetm1_zxoufbvwmedabosqx-fiu5xu3u6uwao3xw-m', 'https://encrypted-tbn1.gstatic.com/images?q=tbn:and9gcqiwudrcl9y0xbtc19abcpfswo4n060ipv4znqxnplywx5ufo-qdzjatd0r', 'https://encrypted-tbn2.gstatic.com/images?q=tbn:and9gcqtgqdxef3aosiyuk0j0mbxgzt8c0jsaw3upoumstmfsgxde3betrgsqw']
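Once you have the thumbnail URLs, saving them to disk is straightforward. Below is a minimal sketch; the download_images helper, directory name, and filename scheme are illustrative choices of mine, not part of the original answer:

import os
import requests

def download_images(urls, dest_dir="thumbs"):
    # hypothetical helper: save each thumbnail URL into dest_dir;
    # the directory and filename scheme are arbitrary choices
    os.makedirs(dest_dir, exist_ok=True)
    for i, url in enumerate(urls):
        resp = requests.get(url)
        if resp.status_code == 200:
            # gstatic thumbnails are served as JPEGs, so .jpg is a reasonable guess
            with open(os.path.join(dest_dir, "img_{:03d}.jpg".format(i)), "wb") as f:
                f.write(resp.content)

download_images(get_images_links('computer'))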
Edit:
If you want more than 20 URLs, you must find a way to send an AJAX request to get the rest of the page, or you can use Selenium to simulate the interaction between you and the webpage.
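For the first approach, a rough sketch is below. It assumes the image results page still honors the ijn (page index) and start (result offset) parameters that the scroll-triggered requests appeared to use at the time; both parameters are assumptions and may no longer be accepted:

import requests
from bs4 import BeautifulSoup

def get_images_links_paged(searchTerm, pages=3):
    # assumption: each "page" of the AJAX-loaded results can be fetched
    # with ijn=<page index> and start=<result offset>; Google may change
    # or drop these parameters at any time
    imgs_urls = []
    for page in range(pages):
        url = ("https://www.google.com/search?q={}&tbm=isch"
               "&ijn={}&start={}").format(searchTerm, page, page * 100)
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        for img in soup.find_all('img'):
            src = img.get('src', '')
            if src.startswith("http"):
                imgs_urls.append(src)
    return imgs_urls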
I've used the second approach (there are probably tons of other ways to do this, and if you want, you can optimize this code a lot):
Code 2:
def scrape_all_imgs_google(searchTerm):
    from selenium import webdriver
    from bs4 import BeautifulSoup
    from time import sleep

    def scroll_page():
        # scroll down several times so more results get loaded
        for i in range(7):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            sleep(3)

    def click_button():
        # click the "Show more results" button
        more_imgs_button_xpath = '//*[@id="smb"]'
        driver.find_element_by_xpath(more_imgs_button_xpath).click()

    def create_soup():
        html_source = driver.page_source
        return BeautifulSoup(html_source, 'html.parser')

    def find_imgs(soup):
        imgs_urls = []
        for img in soup.find_all('img'):
            try:
                if img['src'].startswith('http'):
                    imgs_urls.append(img['src'])
            except KeyError:
                # <img> tags without a src attribute are skipped
                pass
        return imgs_urls

    # create webdriver
    driver = webdriver.Chrome()

    # define url using search term
    searchUrl = "https://www.google.com/search?q={}&site=webhp&tbm=isch".format(searchTerm)

    # get url
    driver.get(searchUrl)

    try:
        click_button()
        scroll_page()
    except:
        scroll_page()
        click_button()

    # create soup only after scrolling the page down has loaded the extra images
    soup = create_soup()

    # find imgs in soup
    imgs_urls = find_imgs(soup)

    # close driver
    driver.close()

    # return list of img urls found in page
    return imgs_urls
Usage:
urls = scrape_all_imgs_google('computer')
print(len(urls))
print(urls)
Output:
377
['https://encrypted-tbn1.gstatic.com/images?q=tbn:and9gct5hi9cde5jpygl6g3oyfre7uheie6zm-8q3zqoek0vlqqucgzcwwkggfoe', 'https://encrypted-tbn2.gstatic.com/images?q=tbn:and9gcr0tu_xiyb__pvvdh0hkvpd5n1k-0gvbm5pdr1br9xtyjxc4oru5e8bviif', 'https://encrypted-tbn0.gstatic.com/images?q=tbn:and9gcqqhh6zr6k-7iztfclfk09md19xjzaahbbafcej6s30pkmtoftfkhhs-ksn', etc...
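On the optimization point: the fixed sleep(3) calls in scroll_page can be swapped for Selenium's explicit waits, which return as soon as the condition holds. A minimal sketch, assuming that waiting for <img> elements to be present is a good-enough readiness signal (tune the condition to whatever the page actually renders):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.google.com/search?q=computer&tbm=isch")

# wait up to 10 seconds for at least one <img> element to appear,
# instead of sleeping a fixed interval after each scroll
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "img"))
)
driver.close()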
If you don't want to use this code, you can take a look at Google Scraper and see if it has a method that can be useful for you.