BeautifulSoup use cases

Posted at 2021-01-03

Use case 1: read anchor links of interest from a URL

I have the URL of an HTML page that contains links to various courses, and I am interested in extracting the href links of the courses only.

Problem solution outline

  1. The URL is https://www.jdla.org/certificate/engineer/#certificate_No04
  2. First I will fetch the URL and save the raw HTML text in a variable
  3. Then I will instantiate BeautifulSoup and use it to get the list of div blocks with the class wp-block-group
  4. Finally I will parse each div block and extract the href links

Problem solution code

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.jdla.org/certificate/engineer/#certificate_No04"

res = requests.get(url)
html = res.text

soup = BeautifulSoup(html, 'lxml')

divs = soup.find_all('div', attrs={'class':'wp-block-group'})
preliminary_links = [div.find_all('a') for div in divs]

# preliminary_links is a list of lists, and some of the inner lists are empty.
# Flattening it with itertools.chain.from_iterable gets rid of the empty lists at the same time.

import itertools

flat_links = list(itertools.chain.from_iterable(preliminary_links))

# Now I have the list of link tags I am interested in. Next I will extract their text and href attributes.

links = [(link.get_text(),link.get_attribute_list('href')) for link in flat_links]

# get_attribute_list returns each href as a single-element list, so take its first element
clean_links = [(title, href_list[0]) for (title, href_list) in links]
links_df = pd.DataFrame(columns=['title', 'link'], data=clean_links)
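As a side note, the same extraction can also be written more compactly with a CSS selector. This is just an alternative sketch using the soup object built above, not the approach the article takes:

# Alternative sketch: select every <a href=...> inside a div.wp-block-group in one pass
course_links = [
    (a.get_text(strip=True), a.get('href'))
    for a in soup.select('div.wp-block-group a[href]')
]
links_df_alt = pd.DataFrame(course_links, columns=['title', 'link'])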

Use case 2: parse a password-protected URL that does not use basic HTTP auth

Problem

Some URLs are password protected but do not use basic HTTP auth, meaning you cannot access them with the code below:

import requests
from requests.auth import HTTPBasicAuth

res = requests.get(url, auth=HTTPBasicAuth('<username>', '<password>'))
html = res.text
soup = BeautifulSoup(html, "lxml")
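To see that the site is not actually using basic auth, you can inspect the response of the attempt above; this is just a small diagnostic sketch added for illustration:

# A server that expects basic auth usually answers a failed attempt with 401
# and a WWW-Authenticate header; a form-based login typically returns 200
# with the login page HTML instead.
print(res.status_code)
print(res.headers.get('WWW-Authenticate'))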

In such cases, you can leverage a requests Session as below.
The keys in data should match the name attributes of the input elements in the login form of the specific URL you are trying to access.

s = requests.Session()
data = {"u": "<username>", "p": "<password>"}  # keys must match the form's input names
res = s.post(url, data=data)
html = res.text
soup = BeautifulSoup(html, "lxml")
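If you are not sure which keys the form expects, one way is to parse the login page itself and list the name attributes of its input fields. A minimal sketch, assuming the login form is served at the same url (the actual field names, like u and p above, are site-specific):

login_page = BeautifulSoup(s.get(url).text, "lxml")
form = login_page.find('form')
if form is not None:
    # the name attributes found here are the keys to send in `data`
    print([inp.get('name') for inp in form.find_all('input')])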