
For some reason, the NLP 100 Exercises (言語処理100本ノック) in Rust: Chapter 3, Part 1

Last updated at 2018-10-27, posted at 2018-10-25

I am working through the NLP 100 Exercises in Rust.

Chapter 3: Regular Expressions

In this chapter I use flate2 to read gzip, serde_json to read JSON, and regex as the regular expression library.
The second part is here.

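20. Reading JSON data

Read the JSON file of Wikipedia articles and display the body of the article about "イギリス" (the United Kingdom). For problems 21-29, work on the article body extracted here.

Looking at the downloaded file, it turned out to be a sequence of objects of the form {"text": "body", "title": "title"}, so I limited the solution to this shape.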

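use std::fs::File;
use std::path::Path;

use flate2::read::GzDecoder;
use serde::Deserialize; // assumes serde's "derive" feature is enabled
use serde_json::Deserializer;

#[derive(Deserialize)]
pub struct Article {
    pub text: String,
    pub title: String,
}

// Returns the first article whose title equals `about`, or None if the file
// cannot be opened or no such article is found.
pub fn json_read_about(path: &Path, about: &str) -> Option<Article> {
    let file = File::open(path).ok()?;
    Deserializer::from_reader(GzDecoder::new(file))
        .into_iter::<Article>()
        .filter_map(Result::ok) // skip records that fail to parse
        .find(|article| article.title == about)
}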

Working with serde is a bit of a hassle (though that may just be a matter of getting used to it)...
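The selection step in json_read_about does not depend on serde, so it can be sketched on its own. The following is a minimal sketch under assumptions: the `Article` here is a plain stand-in without `Deserialize`, and `first_with_title` is a hypothetical helper name, not something from the original code. It shows how `filter_map(Result::ok)` followed by `find` expresses "skip broken records, stop at the first matching title".

```rust
// Stand-in for the article's `Article` type; serde is left out on purpose.
#[derive(Debug, Clone, PartialEq)]
pub struct Article {
    pub text: String,
    pub title: String,
}

// Hypothetical helper: returns the first successfully parsed article whose
// title equals `about`, skipping any records that failed to parse.
pub fn first_with_title<I, E>(records: I, about: &str) -> Option<Article>
where
    I: IntoIterator<Item = Result<Article, E>>,
{
    records
        .into_iter()
        .filter_map(Result::ok) // drop Err records, unwrap Ok ones
        .find(|article| article.title == about) // stop at the first match
}
```

Because iterators are lazy, records after the first match are never read, which matters when the source is a large compressed file.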

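21. Extracting lines that contain a category name

Extract the lines in the article that declare a category name.

According to Wikipedia's "Editing pages" (ページの編集) help page, a category name is written in the form [[Category:hoge]]. Expressed as a regular expression, this becomes \[\[Category.*\]\].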

use regex::Regex;

// Returns every line of the article that contains a category declaration.
pub fn category_line(article: &Article) -> Vec<&str> {
    let regex = Regex::new(r"\[\[Category.*\]\]").unwrap();
    article.text.lines().filter(|l| regex.is_match(l)).collect()
}

Raw strings are handy, aren't they?
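To make the convenience concrete, here is a small sketch comparing the pattern from problem 21 written as a raw string and as an ordinary string literal. The constant and function names are hypothetical and only serve the illustration.

```rust
// The pattern from problem 21, written as a raw string literal.
pub const RAW_PATTERN: &str = r"\[\[Category.*\]\]";

// The same pattern as an ordinary string literal: every backslash is doubled.
pub const ESCAPED_PATTERN: &str = "\\[\\[Category.*\\]\\]";

// Hypothetical helper used only to show that both literals are the same text.
pub fn patterns_are_identical() -> bool {
    RAW_PATTERN == ESCAPED_PATTERN
}
```

Both literals produce the same string, so the choice only affects how readable the source is.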

22. Extracting category names

Extract the category names of the article (by name, not by line).

pub fn category_name(article: &Article) -> Vec<&str> {
    // Excluding ']' from the name keeps the capture inside the first "]]".
    let regex = Regex::new(r"\[\[Category:([^|\]\n]*).*?\]\]").unwrap();
    regex
        .captures_iter(&article.text)
        .map(|captures| captures.get(1).unwrap().as_str())
        .collect()
}
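As a way to cross-check what the regular expression is meant to capture, the same extraction can be sketched with plain string searching. This is not the method used in this article, only a regex-free sketch; `category_names_plain` is a hypothetical name, and it assumes each link is closed by "]]".

```rust
// Hypothetical regex-free helper: collects the name part of every
// `[[Category:name]]` or `[[Category:name|sort key]]` link in `text`.
pub fn category_names_plain(text: &str) -> Vec<&str> {
    const OPEN: &str = "[[Category:";
    let mut names = Vec::new();
    let mut rest = text;
    while let Some(start) = rest.find(OPEN) {
        let after_open = &rest[start + OPEN.len()..];
        // The link ends at the first "]]"; an unterminated link ends the scan.
        let end = match after_open.find("]]") {
            Some(end) => end,
            None => break,
        };
        let inside = &after_open[..end];
        // Everything before the optional '|' is the category name.
        let name = inside.split('|').next().unwrap_or(inside);
        names.push(name);
        rest = &after_open[end + 2..];
    }
    names
}
```

Comparing its output with the regex version on the same text is a quick way to spot a pattern that captures too much or too little.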

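23. Section structure

Display the section names contained in the article and their levels (for example, 1 for "== section name ==").

According to "Editing pages", a section is the part enclosed in = at the start of a line, which settles the core of the pattern as ^(=+)([^=]+)=+. That is almost the whole solution, but since whitespace may be inserted freely, I added matching for it.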

pub fn sections(article: &Article) -> Vec<(usize, &str)> {
    let regex = Regex::new(r"^\s*(=+)\s*([^=]+)\s*=+").unwrap();
    article.text.lines()
        // One captures() call both tests the line and extracts the groups.
        .filter_map(|line| regex.captures(line))
        // "== name ==" is level 1, so the level is the number of '=' minus one.
        // [^=]+ is greedy and swallows trailing spaces, hence trim_end().
        .map(|c| (c.get(1).unwrap().as_str().len() - 1, c.get(2).unwrap().as_str().trim_end()))
        .collect()
}
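The level rule (number of leading = minus one) and the whitespace handling can also be sketched without a regular expression. The helper below is a hypothetical, regex-free sketch rather than the article's method, and it is slightly stricter than the pattern above: it assumes the closing = run is the last thing on the line.

```rust
// Hypothetical regex-free helper: parses one line as a section heading.
// Returns (level, name), where "== name ==" has level 1 as in the problem.
pub fn parse_section(line: &str) -> Option<(usize, &str)> {
    let trimmed = line.trim();
    // '=' is a single byte, so the count is also a valid byte offset.
    let opening = trimmed.chars().take_while(|c| *c == '=').count();
    if opening == 0 {
        return None;
    }
    let rest = &trimmed[opening..];
    let inner = rest.trim_end_matches('=');
    // Require at least one closing '=' at the end of the line.
    if inner.len() == rest.len() {
        return None;
    }
    let name = inner.trim();
    if name.is_empty() || name.contains('=') {
        return None;
    }
    Some((opening - 1, name))
}
```

Lines such as template parameters ("|key = value") are rejected because they do not start with =.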

24. Extracting file references

Extract all media files referenced from the article.

This turned out almost the same as the category name extraction in problem 22.

pub fn files(article: &Article) -> Vec<&str> {
    // [^|] alone also matches newlines and ']', so exclude both from the name.
    let regex = Regex::new(r"\[\[(?:File|ファイル):([^|\]\n]*).*?\]\]").unwrap();
    regex
        .captures_iter(&article.text)
        .map(|captures| captures.get(1).unwrap().as_str())
        .collect()
}
