
Beautiful Soup > What is the `.a` in `elem.a`? > Probably the HTML `<a>` tag

Last updated at 2018-01-30, posted at 2018-01-30

Environment

GeForce GTX 1070 (8GB)
ASRock Z170M Pro4S [Intel Z170chipset]
Ubuntu 16.04 LTS desktop amd64
TensorFlow v1.2.1
cuDNN v5.1 for Linux
CUDA v8.0
Python 3.5.2
IPython 6.0.0 -- An enhanced Interactive Python.
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
GNU bash, version 4.3.48(1)-release (x86_64-pc-linux-gnu)
scipy v0.19.1
geopandas v0.3.0
MATLAB R2017b (Home Edition)
ADDA v.1.3b6
BeautifulSoup 4.4.1

MemoryEnhancer > MEDC > test_get_list_180130.py > Getting a list of Markdown file names (*.html.md) on GitHub | GitHub REST API v3 | JSON module | BeautifulSoup
I was given the BeautifulSoup code below in a comment on the article above, and I did not understand it.

import requests
from bs4 import BeautifulSoup

IN_URL = "https://github.com/yasokada/TechEnglish_170903/tree/master/data"

res = requests.get(IN_URL)
if res.status_code == requests.codes.ok:
    soup = BeautifulSoup(res.text, 'lxml')
    for elem in soup.find_all('td', class_='content'):
        if elem.a and elem.a.text.endswith('html.md'):
            print(elem.a.text)

What is the `.a` in `elem.a`?

Searching the official site for ".a" returns far too many results to reach the relevant page.

Assuming it referred to a tag, I found the following.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag

Tag

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>

In this code, writing `soup.b` appears to retrieve the [`<b>` tag](http://www.htmq.com/html/b.shtml).
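To check this reading, here is a minimal sketch that compares attribute access with an explicit `find()` call on the same one-line document. It assumes the `html.parser` backend bundled with Python rather than `lxml`, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Same one-line document as in the official Tag example. "html.parser" is an
# assumption made here so that no extra parser package is required.
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")

tag = soup.b                   # attribute access by tag name
same_tag = soup.find("b")      # explicit lookup of the first <b> tag

tag_type = type(tag).__name__  # class name of the returned object
tag_name = tag.name            # the HTML tag name
tag_classes = tag["class"]     # class is multi-valued, so this is a list
tag_text = tag.string          # the text inside the tag

print(tag_type, tag_name, tag_classes, tag_text)
print(tag == same_tag)
```

Both lookups return the same `<b>` element, so the tag name used as an attribute can be read as shorthand for finding the first tag with that name.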

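Likewise, `elem.a` in the code I was given appears to retrieve the first `<a>` tag inside `elem`. It behaves like `elem.find('a')`: if no `<a>` tag is found, the result is `None` (note that `find_all()` would return an empty list instead).
So the processing seems to be: for each `<td>` returned by `find_all()`, if an `<a>` tag is found inside it, print its text when that text ends with `html.md`.
Part of the `/data` page being read looks like this.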

```html
...
<td class="content">
            <span class="css-truncate css-truncate-target"><a href="/yasokada/TechEnglish_170903/blob/master/data/10.html.md" class="js-navigation-open" id="4c209532a5ed8fa078d0b0e1528eb501-49445b32d8a945e33b1e546d14758149333ea7b6" title="10.html.md">10.html.md</a></span>
          </td>
```
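The loop can be tried offline against a small stand-in for that markup. This is only a sketch: `SAMPLE_HTML` is a hypothetical, shortened imitation of the page (not the real GitHub response), and `html.parser` is assumed instead of `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the GitHub page: one cell with a
# matching link, one with a non-matching link, and one with no <a> tag.
SAMPLE_HTML = """
<table><tr>
<td class="content"><span><a href="/data/10.html.md" title="10.html.md">10.html.md</a></span></td>
<td class="content"><span><a href="/data/README.md">README.md</a></span></td>
<td class="content"><span>no link here</span></td>
</tr></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

matched = []
for elem in soup.find_all("td", class_="content"):
    link = elem.a              # first <a> descendant of this cell, or None
    if link is None:           # the third cell has no link, so skip it
        continue
    if link.text.endswith("html.md"):
        matched.append(link.text)

print(matched)
```

The `<a>` tag is nested inside a `<span>`, yet `elem.a` still finds it, because the lookup searches all descendants of `elem`. The `None` check is what `if elem.a and ...` does in the original code.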

(Addendum 2018/01/31)
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
There are examples after the following line on that page.

Here are some simple ways to navigate that data structure:

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# u'title'

soup.title.string
# u'The Dormouse's story'

soup.title.parent.name
# u'head'

soup.p
# <p class="title"><b>The Dormouse's story</b></p>

soup.p['class']
# u'title'

soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

soup.find_all('a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.find(id="link3")
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
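The navigation examples above can be reproduced with a trimmed version of the document used in the official docs. This sketch assumes the `html.parser` backend. In my understanding, under Python 3 with a current Beautiful Soup the `u'...'` prefixes no longer appear and `soup.p['class']` comes back as a list, so treat the quoted outputs as illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed version of the "three sisters" document from the official docs.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

first_link = soup.a                 # only the first <a> in the document
all_links = soup.find_all("a")      # every <a>, as a list
link_ids = [a["id"] for a in all_links]

# Attribute access can be chained: the first <b> inside the first <p>.
nested_text = soup.p.b.string
parent_name = soup.title.parent.name
p_class = soup.p["class"]           # multi-valued attribute, so a list

print(first_link == all_links[0], link_ids, nested_text, parent_name, p_class)
```

`soup.a` gives only the first match, while `find_all('a')` gives all of them, which is why the original code combines `find_all()` for the `<td>` cells with `.a` for the single link inside each cell.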
