python - Retrieving a subset of href's from findall() in BeautifulSoup -


my goal write python script takes artist's name string input , appends base url goes genius search query.then retrieves lyrics returned web page's links (which required subset of problem contain artist name in every link in subset.).i in initial phase right , have been able retrieve links web page including ones don't want in subset. tried find simple solution failed continuously.

import requests # requests library.  bs4 import beautifulsoup lxml import html  user_input = input("enter artist name = ").replace(" ","+") base_url = "https://genius.com/search?q="+user_input  header = {'user-agent':''} response = requests.get(base_url, headers=header)  soup = beautifulsoup(response.content, "lxml")  link in soup.find_all('a',href=true):         print (link['href']) 

this returns complete list while need ones end lyrics , artist's name (here instance drake). these links should able retrieve lyrics.

https://genius.com/ /signup /login https://www.facebook.com/geniusdotcom/ https://twitter.com/genius https://www.instagram.com/genius/ https://www.youtube.com/user/rapgeniusvideo https://genius.com/new https://genius.com/drake-hotline-bling-lyrics https://genius.com/drake-one-dance-lyrics https://genius.com/drake-hold-on-were-going-home-lyrics https://genius.com/drake-know-yourself-lyrics https://genius.com/drake-back-to-back-lyrics https://genius.com/drake-all-me-lyrics https://genius.com/drake-0-to-100-the-catch-up-lyrics https://genius.com/drake-started-from-the-bottom-lyrics https://genius.com/drake-from-time-lyrics https://genius.com/drake-the-motto-lyrics /search?page=2&q=drake /search?page=3&q=drake /search?page=4&q=drake /search?page=5&q=drake /search?page=6&q=drake /search?page=7&q=drake /search?page=8&q=drake /search?page=9&q=drake /search?page=672&q=drake /search?page=673&q=drake /search?page=2&q=drake /embed_guide /verified-artists /contributor_guidelines /about /static/press mailto:brands@genius.com https://eventspace.genius.com/ /static/privacy_policy /jobs /developers /static/terms /static/copyright /feedback/new https://genius.com/genius-how-genius-works-annotated https://genius.com/genius-how-genius-works-annotated 

my next step use selenium emulate scroll in case of genius.com gives entire set of search results. suggestions or resources appreciated. few comments way wish proceed solution. can make more generic?

p.s. may not have lucidly explained problem have tried best. also, ambiguities welcome too. new scraping , python , programming in so, wanted make sure following right path.

use regex module match links want.

import requests # requests library.  bs4 import beautifulsoup lxml import html re import compile  user_input = input("enter artist name = ").replace(" ","+") base_url = "https://genius.com/search?q="+user_input  header = {'user-agent':''} response = requests.get(base_url, headers=header)  soup = beautifulsoup(response.content, "lxml")  pattern = re.compile("[\s]+-lyrics$")  link in soup.find_all('a',href=true):     if pattern.match(link['href']):         print (link['href']) 

output:

https://genius.com/drake-hotline-bling-lyrics https://genius.com/drake-one-dance-lyrics https://genius.com/drake-hold-on-were-going-home-lyrics https://genius.com/drake-know-yourself-lyrics https://genius.com/drake-back-to-back-lyrics https://genius.com/drake-all-me-lyrics https://genius.com/drake-0-to-100-the-catch-up-lyrics https://genius.com/drake-started-from-the-bottom-lyrics https://genius.com/drake-from-time-lyrics https://genius.com/drake-the-motto-lyrics 

this looks if link matches pattern ending in -lyrics. may use similar logic filter using user_input variable well.

hope helps.


Comments

Popular posts from this blog

inversion of control - Autofac named registration constructor injection -

verilog - Systemverilog dynamic casting issues -

ios - Change Storyboard View using Seague -