Use case 1: read anchored links of interest from a URL
I have the URL of an HTML page that contains links to various courses. I want to extract only the href links of the courses.
Problem solution outline
- The url is https://www.jdla.org/certificate/engineer/#certificate_No04
- First I will read the URL and save the raw HTML text in a variable
- Then I will instantiate BeautifulSoup and use it to get the list of `div` blocks with class `wp-block-group`
- Finally I will parse each `div` block and extract the `href` links
Problem solution code
```python
import itertools

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.jdla.org/certificate/engineer/#certificate_No04"
res = requests.get(url)
html = res.text

soup = BeautifulSoup(html, 'lxml')
divs = soup.find_all('div', attrs={'class': 'wp-block-group'})
preliminary_links = [div.find_all('a') for div in divs]

# preliminary_links is a list of lists, and some of the inner lists are
# empty. Flattening gets rid of the empty lists at the same time.
flat_links = list(itertools.chain.from_iterable(preliminary_links))

# Finally I have the list of links I am interested in.
# Extract each link's text and href attribute.
clean_links = [(link.get_text(), link.get('href')) for link in flat_links]

links_df = pd.DataFrame(columns=['title', 'link'], data=clean_links)
```
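The flatten-and-clean step above can be exercised in isolation. A minimal sketch with made-up stand-in data (toy `(title, href)` tuples in place of real `Tag` objects) shows that `itertools.chain.from_iterable` flattens one level and that empty inner lists simply vanish in the same pass:

```python
import itertools

# Toy stand-in for the output of [div.find_all('a') for div in divs]:
# some divs contain no links, producing empty inner lists.
preliminary_links = [
    [('Course A', '/course/a')],
    [],
    [('Course B', '/course/b'), ('Course C', '/course/c')],
]

# chain.from_iterable concatenates the inner lists; an empty list
# contributes nothing, so no separate filtering step is needed.
flat_links = list(itertools.chain.from_iterable(preliminary_links))
# flat_links == [('Course A', '/course/a'),
#                ('Course B', '/course/b'),
#                ('Course C', '/course/c')]
```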
Use case 2: parse a password-protected URL that does not use basic HTTP auth
Problem
Some URLs are password protected but do not use basic HTTP auth, meaning you cannot access them with the code below:
```python
import requests
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup

res = requests.get(url, auth=HTTPBasicAuth('<username>', '<password>'))
html = res.text
soup = BeautifulSoup(html, "lxml")
```
In such cases, you can leverage a `Session` as below. The keys in `data` should match the `name` attributes of the `input` elements in the login form of the specific URL you are trying to access.
```python
import requests
from bs4 import BeautifulSoup

s = requests.Session()
data = {"u": "<username>", "p": "<password>"}
res = s.post(url, data=data)
html = res.text
soup = BeautifulSoup(html, "lxml")
```
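To discover the right keys for `data`, you can inspect the login form and read the `name` attribute of each `input` element. A minimal sketch on a made-up login page (the field names `u` and `p` here are assumptions for illustration; a real site's form will differ):

```python
from bs4 import BeautifulSoup

# Made-up login page HTML; inspect the real page's source instead.
login_page = """
<form action="/login" method="post">
  <input type="text" name="u">
  <input type="password" name="p">
  <input type="submit" value="Log in">
</form>
"""

soup = BeautifulSoup(login_page, 'html.parser')
form = soup.find('form')

# The keys of the POST data must match these input names.
# The nameless submit button is filtered out.
field_names = [i.get('name') for i in form.find_all('input') if i.get('name')]
print(field_names)  # ['u', 'p']
```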