
株式会社エイアイ・フィールド Advent Calendar 2021

Day 13

[Python] Searching Wikipedia and displaying the results as a WordCloud!

Posted at 2021-12-12

TL;DR

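The source code is here!

Introduction

Hello everyone! This is YomamaBanana.

I wanted to try out WordCloud, so I built something with it in Python.
The text for the word cloud comes from Wikipedia, so I also put together a lightweight search front end on top of the Wikipedia API.

In short, the flow is:
Wikipedia search → fetch the summary → generate the WordCloud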

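Tools used

This project needs the following libraries:

  • wordcloud
  • wikipedia-api

As the names suggest, they are modules for generating word clouds and for searching Wikipedia.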

Install them:

$ pip install wordcloud wikipedia-api

Output

Search results:

$ python .\src\wiki_wordcloud.py

Enter keyword: Mathematics
Searching WIKIPEDIA ...  SUCCESSFUL.
>>> Type 'Y' to choose other languages or 'Enter' to continue in English.
>>> Choose other languages?
-------------------- Title --------------------
Mathematics
-------------------- Summary --------------------
Mathematics (from Greek: μάθημα, máthēma, 'knowledge, study, learning') includes the study of such topics as numbers
 (arithmetic and number theory), formulas and related structures (algebra), shapes and spaces in which they are 
contained (geometry), and quantities and their changes (calculus and analysis). There is no general consensus about its
 exact scope or epistemological status.Most of mathematical activity consists of discovering and proving (by pure 
reasoning) properties of abstract objects. These objects are either abstractions from nature (such as natural numbers 
or "a line"), or (in modern mathematics) abstract entities that are defined by their basic properties, called axioms. A
 proof consists of a succession of applications of some deductive rules to already known results, including previously
 proved theorems, axioms and (in case of abstraction from nature) some basic properties that are considered as true 
starting points of the theory under consideration. The result of a proof is called a theorem. Contrary to physical 
laws, the validity of a theorem (its truth) does not rely on any experimentation but on the correctness of its 
reasoning (though experimentation is often useful for discovering new theorems of interest).

Mathematics is widely used in science for modeling phenomena. This enables the extraction of quantitative predictions... 

// omitted //

-------------------- Sections --------------------
*       Areas of mathematics
**      Number theory
**      Geometry
**      Algebra
**      Calculus and analysis
**      Discrete mathematics
**      Mathematical logic and set theory
**      Applied mathematics
**      Statistics and other decision sciences
**      Computational mathematics
*       History
**      Etymology
*       Philosophy of mathematics
**      Three leading types
**      Logicist definitions
**      Intuitionist definitions
**      Formalist definitions
**      Mathematics as science
*       Inspiration, pure and applied mathematics, and aesthetics
*       Notation, language, and rigor
*       Mathematical awards
*       See also
*       Notes
*       References
*       Bibliography
-------------------- EOF --------------------

Wordcloud:

Mathematics.png
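
In a word cloud, a word is drawn larger the more often it appears in the input text, once common stopwords are dropped. As a rough illustration of that counting step only (not the wordcloud library's actual implementation), a minimal standard-library sketch might look like the following. The function name `word_frequencies` and the tiny `SAMPLE_STOPWORDS` set are hypothetical and exist only for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
SAMPLE_STOPWORDS = {"the", "of", "and", "a", "is", "in", "to"}


def word_frequencies(text, stopwords=SAMPLE_STOPWORDS):
    """Count lowercase words in text, skipping stopwords."""
    # This simple pattern only matches Latin letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords)


if __name__ == "__main__":
    sample = "The result of a proof is called a theorem. A proof is a chain of rules."
    # Print the three most frequent remaining words with their counts.
    for word, count in word_frequencies(sample).most_common(3):
        print(word, count)
```

The words with the highest counts are the ones that would dominate the image.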

Source code

Here is the code used for this article.

wiki_wordcloud.py
# Import modules
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import wikipediaapi
import sys


# ANSI escape codes used to color the terminal output.
class bcolors:
    HEADER = '\033[95m'
    OKBLUE = '\033[94m'
    OKCYAN = '\033[96m'
    OKGREEN = '\033[92m'
    WARNING = '\033[93m'
    FAIL = '\033[91m'
    ENDC = '\033[0m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'

# Print the section titles of the page ('*' for top-level sections, '**' for anything nested).
def list_section(sections, level=1):
    for section in sections:
        print(f"{'*'*level}\t{section.title}")
        list_section(section.sections, level=2)

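# Display the WordCloud and save it as a PNG.
# Note: the wordcloud/ directory must already exist, otherwise savefig() fails.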
def vis_wordcloud(wc, filename):
    plt.figure(figsize = (8, 8), facecolor = None)
    plt.imshow(wc)
    plt.axis("off")
    plt.tight_layout(pad = 0)
    plt.savefig(f"wordcloud/{filename}.png")
    plt.show()

# Generate the WordCloud
def create_wordcloud(text):
    stopwords = set(STOPWORDS)
    wordcloud = WordCloud(width = 800, height = 800,
                    background_color ='white',
                    stopwords = stopwords,
                    min_font_size = 10)
    wc = wordcloud.generate(text)
    return wc

# Print information about the retrieved Wikipedia page.
def output(title, summary, section):
    print(f'{"-"*20} Title {"-"*20}')
    print(title)
    print(f'{"-"*20} Summary {"-"*20}')
    print(summary)
    print(f'{"-"*20} Sections {"-"*20}')
    list_section(section)
    print(f'{"-"*20} EOF {"-"*20}')

# Stop early when the keyword is ambiguous (a disambiguation page) and show the suggestions.
def check_ambiguation(summary, selected_page):
    if " may refer to" in summary:
        print(f"{bcolors.WARNING}WARNING: Ambiguation, see suggestions below.{bcolors.ENDC}")
        print(list(selected_page.links.keys()), flush=True)
        sys.exit()

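# Main function
def main():
    search_word = input("Enter keyword: ")  # read the search keyword
    print("Searching WIKIPEDIA ... ", end=" ", flush=True)

    # Create the API client (English by default).
    # Note: recent wikipedia-api releases require a user_agent argument here.
    wiki_wiki = wikipediaapi.Wikipedia()
    page_py = wiki_wiki.page(search_word)  # look up the page

    if page_py.exists():  # the page was found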
        print(f"{bcolors.OKGREEN}SUCCESSFUL.{bcolors.ENDC}")
        languages = page_py.langlinks
        
        # wikipediaapi searches in English by default.
        # If the page is available in other languages, let the user pick one.
        print(">>> Type 'Y' to choose other languages or 'Enter' to continue in English.")
        select_en = input(">>> Choose other languages? ")
        if select_en in ["Y", "y"]:
            for idx, (k, v) in enumerate(languages.items()):
                print(f"[{idx}]\t{k}: {v}")
            select = int(input(">>> Select number: "))
            selected_page = page_py.langlinks[list(languages)[select]]
        else:
            selected_page = page_py 
        
        title = selected_page.title
        summary = selected_page.summary

        check_ambiguation(summary, selected_page)
        output(title, summary, selected_page.sections)
        
        wc = create_wordcloud(summary)
        vis_wordcloud(wc, search_word)
    else:  # no page was found
        print(f"{bcolors.FAIL}ERROR 404: Page not found. {bcolors.ENDC}")


# Run main() only when the file is executed as a script.
if __name__ == "__main__":
    main()
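
The `list_section` function above walks the section tree only to print titles. The same traversal could, as one possible approach, collect each section's text so that every section gets its own word cloud. The sketch below assumes the section objects expose `title`, `text`, and `sections` attributes. `FakeSection` and `walk_sections` are hypothetical names introduced only so the example runs without network access.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class FakeSection:
    """Hypothetical stand-in with the attributes assumed of a page section."""
    title: str
    text: str = ""
    sections: List["FakeSection"] = field(default_factory=list)


def walk_sections(sections, level=1) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, title, text) for every section, depth-first."""
    for section in sections:
        yield level, section.title, section.text
        # Pass level + 1 so deeper nesting is reported with its real depth.
        yield from walk_sections(section.sections, level + 1)


if __name__ == "__main__":
    tree = [
        FakeSection("History", "Early history text.", [
            FakeSection("Etymology", "Word origin text."),
        ]),
        FakeSection("Notation", "Symbols text."),
    ]
    for depth, title, text in walk_sections(tree):
        print(f"{'*' * depth}\t{title}: {len(text.split())} words")
```

In the program above, each non-empty `text` could then be passed to `create_wordcloud` and the result to `vis_wordcloud`, using the section title as part of the file name.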

Open issues

Several issues remain.

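  • The current WordCloud only handles alphabetic text.
  • For ambiguous keywords, suggestions are displayed, but a follow-up search is not possible yet.
  • The program has not been packaged as an exe, so it is inconvenient to use.
  • WordClouds cannot yet be output per section.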
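
For the ambiguous-keyword issue, one possible direction is to let the user pick one of the suggested titles by number and search again, instead of exiting. The sketch below covers only the selection step. `choose_suggestion` and `show_suggestions` are hypothetical helpers, and the candidate titles are made-up sample data.

```python
from typing import List, Optional


def choose_suggestion(titles: List[str], raw_choice: str) -> Optional[str]:
    """Return the title at the chosen index, or None if the input is invalid."""
    try:
        index = int(raw_choice.strip())
    except ValueError:
        return None  # the input was not a number
    if 0 <= index < len(titles):
        return titles[index]
    return None  # the number was out of range


def show_suggestions(titles: List[str]) -> None:
    """Print the candidates in the same numbered style as the language list."""
    for idx, title in enumerate(titles):
        print(f"[{idx}]\t{title}")


if __name__ == "__main__":
    # Made-up sample candidates; the real program would use the page's links.
    candidates = ["Mercury (planet)", "Mercury (element)", "Mercury (mythology)"]
    show_suggestions(candidates)
    picked = choose_suggestion(candidates, "1")
    print(picked if picked else "Invalid selection.")
```

In the program above, the candidates would be `list(selected_page.links.keys())`, the choice would come from `input()`, and the chosen title would be passed to `wiki_wiki.page(...)` in place of the `sys.exit()` call. The same validation could also guard the language selection, where `int(input(...))` currently fails on non-numeric input.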

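Afterword

This time, I generated a WordCloud from Wikipedia search results in a simple way.

There are still plenty of issues left, but it was a very interesting program to write.
Once those are overcome, I could build my own Wikipedia search engine and do all sorts of things with further processing such as NLP!
Turning it into a GUI might be fun, too!

Looking forward to what comes next!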
