6
5

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 5 years have passed since last update.

Wikipediaが配布しているデータベースを使ってpixivのタグを翻訳する

Last updated at Posted at 2014-04-30

Wikipediaのダウンロードできるデータファイル一覧を使って表題のことをやってみる。

Wikipediaのデータベースのインポート

使うのは以下の2つ。

  • page.sql : ページ情報(page_idやタイトル等)
  • langlinks.sql : 各ページの「他の言語」情報

MySQLのSQLとして配布されているので、ローカルのMySQLに読み込む。

$ curl -O http://dumps.wikimedia.org/jawiki/latest/jawiki-latest-langlinks.sql.gz
$ gunzip jawiki-latest-langlinks.sql.gz
$ mysql -uroot -Dwikipedia < ./jawiki-latest-langlinks.sql

$ curl -O http://dumps.wikimedia.org/jawiki/latest/$ jawiki-latest-page.sql.gz
$ gunzip jawiki-latest-page.sql.gz
$ mysql -uroot -Dwikipedia < ./jawiki-latest-page.sql

データベースのインポートにはそれなりに時間がかかる。

こういうテーブルになる。

mysql> select * from langlinks limit 5;
+---------+---------+--------------------+
| ll_from | ll_lang | ll_title           |
+---------+---------+--------------------+
| 2433742 | aa      | Category:User aa   |
|  254887 | aa      | Category:User de   |
|  254893 | aa      | Category:User de-1 |
|  351074 | aa      | Main Page          |
| 2166573 | aa      | Template:Delete    |
+---------+---------+--------------------+
5 rows in set (0.01 sec)

mysql> select * from page limit 5;
+---------+----------------+----------------------------------------+-------------------+--------------+------------------+-------------+---------------------+----------------+--------------------+-------------+----------+
| page_id | page_namespace | page_title                             | page_restrictions | page_counter | page_is_redirect | page_is_new | page_random         | page_touched   | page_links_updated | page_latest | page_len |
+---------+----------------+----------------------------------------+-------------------+--------------+------------------+-------------+---------------------+----------------+--------------------+-------------+----------+
|       1 |              4 | アップロードログ_2004年4月             | sysop             |          498 |                0 |           0 | 0.00326005555992951 | 20140408110536 | NULL               |     2168855 |   106607 |
|       2 |              4 | 削除記録/過去ログ_2002年12月           | sysop             |           48 |                0 |           0 |  0.0138938864822443 | 20050702201737 | NULL               |     2168856 |      208 |
|       5 |              0 | アンパサンド                           |                   |           10 |                0 |           0 |   0.870079839481115 | 20140410100515 | 20140403073058     |    49408724 |     4201 |
|       6 |              4 | Sandbox                                |                   |           46 |                1 |           0 |  0.7534875168963221 | 20140416150013 | 20140208143521     |    50624493 |      122 |
|       7 |              2 | Brion_VIBBER                           |                   |          360 |                0 |           0 |   0.453168387015507 | 20121029175050 | NULL               |    24874659 |     8489 |
+---------+----------------+----------------------------------------+-------------------+--------------+------------------+-------------+---------------------+----------------+--------------------+-------------+----------+
5 rows in set (0.00 sec)

pixivのタグ上位5000件を用意

↓ここにあるので、(1ページ500タグ) * (10ページ) で5000タグまでは手に入る。

これを元に以下の様なSQLを組み立てる。

INSERT INTO pixiv_tags (name,num) VALUES
("オリジナル",2247619),
("東方",1584753),

データベースをこんな風に作って、

CREATE TABLE `pixiv_tags` (
  `name` varchar(60) NOT NULL DEFAULT '',
  `num` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`name`)
) ENGINE=InnoDB

インポート。

$ mysql -uroot -Dwikipedia < tags.sql

SELECT

こんなSQLで結果を作成。

SELECT
  name,
  num,
  COALESCE(en.ll_title, '') AS en,
  COALESCE(ko.ll_title, '') AS ko,
  COALESCE(zh.ll_title, '') AS zh,
  COALESCE(ru.ll_title, '') AS ru,
  COALESCE(fr.ll_title, '') AS fr,
  COALESCE(es.ll_title, '') AS es,
  COALESCE(th.ll_title, '') AS th
FROM
  pixiv_tags
  LEFT JOIN page ON (name = page_title AND page_namespace = 0)
  LEFT JOIN langlinks AS en ON (page_id = en.ll_from AND en.ll_lang = 'en')
  LEFT JOIN langlinks AS ko ON (page_id = ko.ll_from AND ko.ll_lang = 'ko')
  LEFT JOIN langlinks AS zh ON (page_id = zh.ll_from AND zh.ll_lang = 'zh')
  LEFT JOIN langlinks AS ru ON (page_id = ru.ll_from AND ru.ll_lang = 'ru')
  LEFT JOIN langlinks AS fr ON (page_id = fr.ll_from AND fr.ll_lang = 'fr')
  LEFT JOIN langlinks AS es ON (page_id = es.ll_from AND es.ll_lang = 'es')
  LEFT JOIN langlinks AS th ON (page_id = th.ll_from AND th.ll_lang = 'th')
ORDER BY num DESC

出来上がったものがこちら。

kobito.1398838675.245980.png

6
5
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
6
5

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?