Use case 1: read anchored links of interest from a URL
I have the URL of an HTML page that contains links to various courses. I want to extract only the href links of the courses.
Problem solution outline
- The url is https://www.jdla.org/certificate/engineer/#certificate_No04
- First I will read the URL and save the raw HTML text in a variable
- Then I will instantiate BeautifulSoup and use it to get the list of `div` blocks with class `wp-block-group`
- Finally I will parse each `div` block and extract the `href` links
Problem solution code
```python
import itertools

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.jdla.org/certificate/engineer/#certificate_No04"
res = requests.get(url)
html = res.text

soup = BeautifulSoup(html, 'lxml')
divs = soup.find_all('div', attrs={'class': 'wp-block-group'})
preliminary_links = [div.find_all('a') for div in divs]

# preliminary_links is a list of lists, and some of the inner lists are
# empty. Flattening gets rid of the empty lists at the same time.
flat_links = list(itertools.chain.from_iterable(preliminary_links))

# Finally I have the list of links I am interested in.
# Extract each link's text and href attribute.
clean_links = [(link.get_text(), link.get('href')) for link in flat_links]

links_df = pd.DataFrame(columns=['title', 'link'], data=clean_links)
```
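The flatten-and-clean step above can be exercised in isolation. A minimal sketch with made-up stand-in data (toy `(title, href)` tuples in place of real `Tag` objects) shows that `itertools.chain.from_iterable` flattens one level and that empty inner lists simply vanish in the same pass:

```python
import itertools

# Toy stand-in for the output of [div.find_all('a') for div in divs]:
# some divs contain no links, producing empty inner lists.
preliminary_links = [
    [('Course A', '/course/a')],
    [],
    [('Course B', '/course/b'), ('Course C', '/course/c')],
]

# chain.from_iterable concatenates the inner lists; an empty list
# contributes nothing, so no separate filtering step is needed.
flat_links = list(itertools.chain.from_iterable(preliminary_links))
# flat_links == [('Course A', '/course/a'),
#                ('Course B', '/course/b'),
#                ('Course C', '/course/c')]
```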
Use case 2: parse a password-protected URL that does not use basic HTTP auth
Problem
Some URLs are password protected but do not use basic HTTP auth, meaning you cannot access them with the code below:
```python
import requests
from requests.auth import HTTPBasicAuth
from bs4 import BeautifulSoup

res = requests.get(url, auth=HTTPBasicAuth('<username>', '<password>'))
html = res.text
soup = BeautifulSoup(html, "lxml")
```
In such cases, you can leverage a `Session` as below. The keys in `data` should match the `name` attributes of the `input` elements in the login form of the specific URL you are trying to access.
```python
import requests
from bs4 import BeautifulSoup

s = requests.Session()
data = {"u": "<username>", "p": "<password>"}
res = s.post(url, data=data)
html = res.text
soup = BeautifulSoup(html, "lxml")
```
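To discover the right keys for `data`, you can inspect the login form and read the `name` attribute of each `input` element. A minimal sketch on a made-up login page (the field names `u` and `p` here are assumptions for illustration; a real site's form will differ):

```python
from bs4 import BeautifulSoup

# Made-up login page HTML; inspect the real page's source instead.
login_page = """
<form action="/login" method="post">
  <input type="text" name="u">
  <input type="password" name="p">
  <input type="submit" value="Log in">
</form>
"""

soup = BeautifulSoup(login_page, 'html.parser')
form = soup.find('form')

# The keys of the POST data must match these input names.
# The nameless submit button is filtered out.
field_names = [i.get('name') for i in form.find_all('input') if i.get('name')]
print(field_names)  # ['u', 'p']
```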