
[Tag Recommendation with Machine Learning #2.5] Fixing the Scraping Script


<ENGLISH>

Hello - I hope you're having a good day. A happy weekend should be a happy coding day :smile:

OK, today I won't push the script forward; instead I'd like to revise the previous one. Here is the script from #2:

    import re
    import urllib2
    from bs4 import BeautifulSoup

    # per-site rules: [domain, tag name, attribute name, attribute value]
    scraper = [
            ["hatenablog.com","div","class","entry-content"],
            ["qiita.com","section","itemprop", "articleBody"]
            ]
    # count up c until the domain matching url is found (url is set earlier)
    c = 0
    for domain in scraper:
        print url, domain[0]
        if re.search( domain[0], url):
            break
        c += 1

    response = urllib2.urlopen(url)
    html = response.read()

    soup = BeautifulSoup( html, "lxml" )
    soup.original_encoding
    tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
    # strip tags from the extracted body with a regular expression
    text = ""
    for con in tag.contents:
        p = re.compile(r'<.*?>')
        text += p.sub('', con.encode('utf8'))

Yes, it works, but I want to (1) use BeautifulSoup instead of a regular expression and (2) use a hash list instead of counting inside the for loop.

(1) BeautifulSoup

    soup = BeautifulSoup( html, "lxml" )
    soup.original_encoding
    tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
    # strip tags with a regular expression
    text = ""
    for con in tag.contents:
        p = re.compile(r'<.*?>')
        text += p.sub('', con.encode('utf8'))

Regular expressions are a strong tool, but I have to learn BeautifulSoup better. Beautiful Soup uses its own type for strings (NavigableString), and the user's guide shows how to work with it.
I modified it as below.

    soup = BeautifulSoup( html, "lxml" )
    soup.original_encoding
    tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
    # re-parse just the extracted tag, then join all of its strings
    soup2 = BeautifulSoup(tag.encode('utf8'), "lxml")
    print "".join([string.encode('utf8') for string in soup2.strings])

Looks smarter? :laughing:
You get another soup just to pull out the strings. Which one do you like better?
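
By the way, the unique string type I mentioned is bs4's NavigableString, and bs4 also has get_text(), which flattens a whole subtree into plain text in one call, so the second soup might not even be needed. A minimal sketch of both, assuming the same soup2 and tag as above:

    # each item from .strings is a bs4 NavigableString (a unicode subclass)
    for s in soup2.strings:
        print type(s)

    # alternative: get_text() joins every string in the subtree in one call
    print tag.get_text().encode('utf8')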

(2) A hash list for picking the scraper entry.
This is the part in question:

    scraper = [
            ["hatenablog.com","div","class","entry-content"],
            ["qiita.com","section","itemprop", "articleBody"]
            ]
    c = 0
    for domain in scraper:
        print url, domain[0]
        if re.search( domain[0], url):
            break
        c += 1

To pick the scraper entry for each web site, I used c as a count-up integer. That's not cool. So I modified it as below.

    scraper = [
            ["hatenablog.com","div","class","entry-content"],
            ["qiita.com","section","itemprop", "articleBody"]
            ]
    # map each domain to its index in scraper
    numHash = {}
    for i in range(len(scraper)):
        numHash[scraper[i][0]] = i
    # look the index up directly instead of counting
    for domain in scraper:
        print url, domain[0]
        if re.search( domain[0], url):
            c = numHash[domain[0]]
            break

Yes, it gets longer, but I think it's much better than before, isn't it?
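
Thinking about it a bit more, the dict could map each domain straight to its whole scraper entry, and then the index c would not be needed at all. Just a sketch of that idea (it assumes Python 2.7+ for the dict comprehension; scraperHash is my placeholder name):

    # sketch: map domain -> full scraper entry, skipping the index entirely
    scraperHash = {entry[0]: entry for entry in scraper}
    for domain in scraperHash:
        if re.search(domain, url):
            site, name, attr, value = scraperHash[domain]
            tag = soup.find(name, {attr: value})
            break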

Great, next time I hope I can proceed to the next step... It will be scraping the elements for learning.

<JAPANESE>

Hi there. It's the weekend, isn't it? To have a good weekend, let's get busy coding.
Today, before moving on to the next step, I'd like to fix the script I wrote in #2. Here it is:

    import re
    import urllib2
    from bs4 import BeautifulSoup

    # per-site rules: [domain, tag name, attribute name, attribute value]
    scraper = [
            ["hatenablog.com","div","class","entry-content"],
            ["qiita.com","section","itemprop", "articleBody"]
            ]
    # count up c until the domain matching url is found (url is set earlier)
    c = 0
    for domain in scraper:
        print url, domain[0]
        if re.search( domain[0], url):
            break
        c += 1

    response = urllib2.urlopen(url)
    html = response.read()

    soup = BeautifulSoup( html, "lxml" )
    soup.original_encoding
    tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
    # strip tags from the extracted body with a regular expression
    text = ""
    for con in tag.contents:
        p = re.compile(r'<.*?>')
        text += p.sub('', con.encode('utf8'))

This works as it is, but... the changes are (1) using BeautifulSoup instead of a regular expression to remove tags, and (2) using a hash list instead of a count-up counter to select the scraper entry.

(1) Using BeautifulSoup

    soup = BeautifulSoup( html, "lxml" )
    soup.original_encoding
    tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
    # strip tags with a regular expression
    text = ""
    for con in tag.contents:
        p = re.compile(r'<.*?>')
        text += p.sub('', con.encode('utf8'))

Regular expressions are very handy, but I figured I should make better use of BeautifulSoup since I have it anyway. BS is fully stocked with tools for extracting the strings inside a tag, but it has its own peculiar string type, so it was hard to get into at first. It's well documented, though, so I suppose I just have to use it a lot and get used to it.

And here is the modified version!

    soup = BeautifulSoup( html, "lxml" )
    soup.original_encoding
    tag = soup.find( scraper[c][1], {scraper[c][2] : scraper[c][3]})
    # re-parse just the extracted tag, then join all of its strings
    soup2 = BeautifulSoup(tag.encode('utf8'), "lxml")
    print "".join([string.encode('utf8') for string in soup2.strings])

Doesn't it look cooler now? I changed it to ask for a second helping of soup and pull the strings out of the tag.

(2) Using a hash list for the scraper entries
This is the part:

    scraper = [
            ["hatenablog.com","div","class","entry-content"],
            ["qiita.com","section","itemprop", "articleBody"]
            ]
    c = 0
    for domain in scraper:
        print url, domain[0]
        if re.search( domain[0], url):
            break
        c += 1

It counts up the variable c to work out the index of the scraper entry. Hmm, kind of ugly, isn't it? So, time for a lovely makeover.

    scraper = [
            ["hatenablog.com","div","class","entry-content"],
            ["qiita.com","section","itemprop", "articleBody"]
            ]
    # map each domain to its index in scraper
    numHash = {}
    for i in range(len(scraper)):
        numHash[scraper[i][0]] = i
    # look the index up directly instead of counting
    for domain in scraper:
        print url, domain[0]
        if re.search( domain[0], url):
            c = numHash[domain[0]]
            break

Contrary to my expectations, the script got longer. But I like this version much better. I wonder if I could make it a bit cleaner.
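
One way I might clean it up: wrap the lookup and the extraction into a single small function. Just a rough sketch of the idea (the name scrape_text is my placeholder, not something from #2):

    def scrape_text(url):
        # find the matching scraper entry, then return the tag's text
        for site, name, attr, value in scraper:
            if re.search(site, url):
                html = urllib2.urlopen(url).read()
                soup = BeautifulSoup(html, "lxml")
                tag = soup.find(name, {attr: value})
                return tag.get_text().encode('utf8')
        return None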

So this time the fixes were mostly for my own satisfaction. Next time I think I can move on to the next step: scraping the links and the tag lists that the learning will be based on. When will I finally get to the machine learning part...? People are going to start calling this series a scam soon.
