
Nim Advent Calendar 2022

NLP 100 Exercise (言語処理100本ノック) in Nim (Chapter 2: UNIX Commands)

Posted at 2022-12-25

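Introduction

This article is the day 12 entry of Nim Advent Calendar 2022.

The day 11 entry is UMA821's "NLP 100 Exercise in Nim (Chapter 1: Warm-up)", and
the day 13 entry is UMA821's "Differences and speed comparison between Nim's HashSet, OrderedSet (sets module),
set (system), and PackedSet (packedsets)".

Overview

This article implements Chapter 2 (UNIX Commands) of the NLP 100 Exercise in Nim.

As the chapter title says, every task can be solved with UNIX commands, so
each task is shown in two forms: one that calls UNIX commands from Nim, and one that does the processing without UNIX commands.

Doing the NLP 100 Exercise "now" and "in Nim" really calls for a comparison with Python, but
I started writing this on Christmas Day, ~~on a sudden whim~~ inspired by the title of the day 11 entry, so
I am prioritizing a finished article over missing the end of the Advent calendar and shelving it.

As a result, some parts may be under-researched or under-polished;
I would appreciate it if you pointed out any mistakes (that is my excuse).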

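Environment

Because this implements a chapter about UNIX commands, the code does not run on Windows.

OS: WSL2 Debian
OS Version: bullseye (11.6)
Nim Version: 1.6.10

Build and run

Any options are fine as long as the code builds and runs, but
this article keeps it simple and builds and runs with the commands below.
"-d:ssl" is required when httpclient uses SSL/TLS.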

~$ mkdir -p ${HOME}/言語処理100本ノック
~$ vi "${HOME}/言語処理100本ノック/第2章：UNIXコマンド.nim" # write the implementation
~$ nim r --hints:off -d:ssl '言語処理100本ノック/第2章：UNIXコマンド.nim'

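Overall implementation policy (unrelated to language processing)

popular-names.txt is not saved locally; it is fetched over HTTPS whenever it is needed.

Using UNIX commands

Reading popular-names.txt

curl or Wget would probably be the usual choice, but since this is a Nim article, I use the Nim compiler's --eval:cmd option.
All it does is strip the string fetched with httpclient's getContent and echo it, so
although the import options and the URL make it long, it still just about fits in a one-liner.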

~$ nim r --hints:off -d:ssl --import:httpclient --import:strutils \
  --eval:'echo newHttpClient().getContent("https://nlp100.github.io/data/popular-names.txt").strip'
~$ curl 'https://nlp100.github.io/data/popular-names.txt' # roughly equivalent to 👆

Running UNIX commands from Nim

execShellCmd in the os module runs a command in the shell. It returns the command's exit code, which the examples here discard.

import os

discard execShellCmd "echo hoge"
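
execShellCmd writes the command's output straight to the terminal. If the output is needed as a Nim string, osproc's execCmdEx is one option. The following is a minimal sketch and not part of the approach used in the rest of this article:

```nim
import std/[osproc, strutils]

# execCmdEx runs the command through the shell and returns the captured
# output together with the exit code.
let (output, exitCode) = execCmdEx("echo hoge")

if exitCode == 0:
  echo output.strip  # drop the trailing newline produced by echo
else:
  echo "command failed with exit code ", exitCode
```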

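Without UNIX commands

The overall flow is as follows.

1. Fetch the file with httpclient's getContent
2. Remove the trailing newline with strutils' strip ¹
3. Split into lines with strutils' split
4. Process each of the lines split in step 3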
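
Steps 2 to 4 can be sketched offline as follows. The sample constant is a hypothetical stand-in for the downloaded text, and toLines is a helper name made up for this illustration:

```nim
import std/strutils

# Hypothetical stand-in for the text that getContent would return.
const sample = "Mary\tF\t7065\t1880\nAnna\tF\t2604\t1880\n"

proc toLines(content: string): seq[string] =
  ## Steps 2 and 3: drop the trailing newline, then split into lines.
  content.strip.split('\n')

# Step 4: process each line (here, just print its first column).
for line in toLines(sample):
  echo line.split('\t')[0]
```

Without the strip, the trailing newline would leave an empty string as the last element, and every later step would have to skip it.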

Working through the exercises

10. Counting lines

Using UNIX commands

Nim is the subject of this article, so the UNIX commands themselves are not explained in detail.

import os

const popularNames = "nim r --hints:off -d:ssl --import:httpclient --import:strutils" &
  " --eval:'echo newHttpClient().getContent(\"https://nlp100.github.io/data/popular-names.txt\").strip'"

# From here on, the 2 lines 👆 are omitted

discard execShellCmd popularNames & " | wc -l"

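Without UNIX commands

The overall implementation policy already turns the file into a seq (variable-length array), so all that is needed is to read its length with len.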

import std/[httpclient, strutils]

let popularNames = newHttpClient().getContent("https://nlp100.github.io/data/popular-names.txt")
  .strip.split('\n')

# From here on, the 2 lines 👆 are omitted (when more imports are needed, only the additional ones are shown)

echo popularNames.len

11. Replacing tabs with spaces

Using UNIX commands

Nim is the subject of this article (rest omitted from here on).

discard execShellCmd popularNames & " | tr '\t' ' '"

Without UNIX commands

sequtils' mapIt processes each line, and strutils' join concatenates the results.
For each line, strutils' replace performs the substitution; it accepts either characters or strings.

import std/sequtils

echo popularNames
  .mapIt(it.replace('\t', ' '))
  .join("\n")

12. Saving column 1 to col1.txt and column 2 to col2.txt

Using UNIX commands

discard execShellCmd popularNames & " | cut -f 1 >col1.txt"
discard execShellCmd popularNames & " | cut -f 2 >col2.txt"

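Without UNIX commands

Nim offers many ways to do file I/O, but system/io's writeFile is the easy one.
Which one to use depends on the requirements; that is outside the scope of this article, so it is not covered here.

Also, because error handling is omitted, there is no bounds check on the index, so
IndexDefect is raised if the file at the URL is malformed.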

writeFile "col1.txt", popularNames
  .mapIt(it.split('\t')[0])
  .join("\n")
writeFile "col2.txt", popularNames
  .mapIt(it.split('\t')[1])
  .join("\n")
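
If the bounds check is wanted, one possible shape is a small helper that returns an empty string for a missing column. This is a sketch only; the helper name column and the sample lines are made up for illustration:

```nim
import std/[sequtils, strutils]

proc column(line: string, index: int): string =
  ## Returns the tab-separated field at `index`, or "" if the line is too short.
  let fields = line.split('\t')
  if index >= 0 and index < fields.len: fields[index] else: ""

# Hypothetical sample; the second line has no second column.
let lines = @["Mary\tF\t7065\t1880", "Broken"]
echo lines.mapIt(column(it, 1)).join("\n")
```

Whether a malformed line should produce an empty field, be skipped, or abort the program depends on the use case; the empty string here is just one choice.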

13. Merging col1.txt and col2.txt

Using UNIX commands

discard execShellCmd "paste col1.txt col2.txt"

Without UNIX commands

Nim offers many ways to do file I/O, but system/io's readFile is the easy one.
In addition, a ..< b gives a loop over a numeric range that excludes b.

let col1 = "col1.txt".readFile.split('\n')
let col2 = "col2.txt".readFile.split('\n')
for i in 0 ..< col1.len:
  echo col1[i], '\t', col2[i]
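
The index loop raises IndexDefect when col2 has fewer lines than col1. As an alternative sketch, sequtils' zip pairs the two sequences and stops at the shorter one (paste would pad instead). The helper name mergeColumns and the sample values are assumptions for illustration:

```nim
import std/[sequtils, strutils]

proc mergeColumns(col1, col2: seq[string]): seq[string] =
  ## Joins the i-th elements of both columns with a tab.
  ## zip stops at the shorter input, so no index can go out of range.
  for pair in zip(col1, col2):
    result.add pair[0] & "\t" & pair[1]

echo mergeColumns(@["Mary", "Anna"], @["F", "F"]).join("\n")
```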

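14. Printing the first N lines

system/io's stdin refers to standard input, system/io's readLine reads one line of input, and
strutils' parseUInt converts a string to a uint value.

Because error handling is omitted, the input string is not validated, so
entering a number that uint cannot represent, or a non-numeric string, raises ValueError.

Using UNIX commands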

discard execShellCmd popularNames & " | head -n " & $(stdin.readLine.parseUInt)

Incidentally, strformat's & builds strings by interpolation, much like Python's formatted string literals (f-strings).

import std/strformat

discard execShellCmd &"{popularNames} | head -n {stdin.readLine.parseUInt}"

Without UNIX commands

x[a .. b] and x[a ..< b] extract a sub-sequence (slice).

echo popularNames[0 ..< stdin.readLine.parseUInt].join("\n")
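
If the omitted input validation is wanted, it could look roughly like this. The sketch takes the raw input as a parameter instead of reading stdin, uses int rather than uint, and the helper name firstLines is made up for illustration:

```nim
import std/strutils

proc firstLines(lines: seq[string], rawCount: string): seq[string] =
  ## Returns the first N lines, where N is parsed from `rawCount`.
  ## Invalid or non-positive input yields an empty result, and N is
  ## clamped to the number of available lines.
  var n: int
  try:
    n = rawCount.strip.parseInt
  except ValueError:
    return @[]
  if n <= 0:
    return @[]
  result = lines[0 ..< min(n, lines.len)]

echo firstLines(@["a", "b", "c"], "2").join("\n")
```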

15. Printing the last N lines

Using UNIX commands

discard execShellCmd &"{popularNames} | tail -n {stdin.readLine.parseUInt}"

UNIXコマンドを使用せず処理

lenがintを返却するので、uint値と減算するためキャストしています。
何かの間違いで負の長さが返却される可能性を考慮する場合、適切な範囲チェックが必要です。

echo popularNames[uint(popularNames.len) - stdin.readLine.parseUInt ..< popularNames.len].join("\n")
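
Unsigned subtraction wraps around when N is larger than the number of lines, so the one-liner above relies on N being in range. A sketch that stays in int and clamps N instead (the helper name lastLines is an assumption for illustration):

```nim
import std/strutils

proc lastLines(lines: seq[string], n: int): seq[string] =
  ## Returns the last `n` lines; `n` is clamped to the range 0 .. lines.len.
  let count = max(0, min(n, lines.len))
  result = lines[lines.len - count .. ^1]

echo lastLines(@["a", "b", "c"], 2).join("\n")
```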

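16. Splitting the file into N parts

Using UNIX commands

An N-way split is impossible without knowing the total size, so
split's "-n" option cannot be combined with standard input;
the policy of not saving popular-names.txt locally therefore cannot be kept for this task.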

discard execShellCmd popularNames & " >popular-names.txt"
discard execShellCmd &"split -n l/{stdin.readLine.parseUInt} popular-names.txt"
discard execShellCmd "rm popular-names.txt"

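Without UNIX commands

The code is slightly longer because Nim has no conditional (ternary) operator, so an if expression is used instead of an if statement, and
because the uint division and remainder are written with div and mod. ²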

import std/strformat

let
  l = uint(popularNames.len)
  n = stdin.readLine.parseUInt
  linesPerFile = l.div(n) + (
    if l.mod(n) == 0:
      0'u
    else:
      1'u)
for i in 0 ..< n:
  let currentIndex = i * linesPerFile
  writeFile &"popular-names{i}.txt",
    popularNames[currentIndex ..< min(currentIndex + linesPerFile, l)].join("\n") & '\n'
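
The rounding-up division can also be written without a conditional by using the common (a + b - 1) div b idiom. The sketch below uses int instead of uint and returns the chunks rather than writing files; ceilDivide and chunks are helper names made up for illustration:

```nim
import std/strutils

proc ceilDivide(a, b: int): int =
  ## Rounds a / b up, for non-negative a and positive b.
  (a + b - 1) div b

proc chunks(lines: seq[string], n: int): seq[seq[string]] =
  ## Splits `lines` into at most `n` consecutive chunks of equal size
  ## (the last chunk may be shorter). Assumes n > 0.
  let linesPerChunk = ceilDivide(lines.len, n)
  var start = 0
  while start < lines.len:
    result.add lines[start ..< min(start + linesPerChunk, lines.len)]
    start += linesPerChunk

for i, chunk in chunks(@["a", "b", "c", "d", "e"], 2):
  echo "chunk ", i, ": ", chunk.join(",")
```

Because the loop stops once all lines are consumed, it can produce fewer than n chunks when n is large relative to the line count, instead of slicing past the end.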

17. Distinct strings in the first column

Using UNIX commands

discard execShellCmd popularNames & " | cut -f 1 | sort | uniq"

Without UNIX commands

sets' toHashSet converts the values to a set, which removes duplicates, and algorithm's sorted sorts the result.

import std/[algorithm, sets]

echo popularNames
  .mapIt(it.split('\t')[0])
  .toHashSet
  .toSeq
  .sorted
  .join("\n")
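
A variant that mirrors `sort | uniq` more directly is to sort first and then drop adjacent duplicates with sequtils' deduplicate. This is an alternative sketch, with uniqueFirstColumn as a made-up helper name and made-up sample lines:

```nim
import std/[algorithm, sequtils, strutils]

proc uniqueFirstColumn(lines: seq[string]): seq[string] =
  ## Sorts the first-column values, then drops adjacent duplicates.
  lines.mapIt(it.split('\t')[0]).sorted.deduplicate(isSorted = true)

echo uniqueFirstColumn(@["Mary\tF", "Anna\tF", "Mary\tF"]).join("\n")
```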

18. Sorting lines in descending order of the numeric value in column 3

Using UNIX commands

discard execShellCmd popularNames & " | sort -rnk3"

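Without UNIX commands

algorithm's sortedByIt specifies the sort key, and algorithm's reversed reverses the order of the elements.

The same result can be obtained by passing a comparison function to sorted;
in that case descending order can be specified as part of the sort itself, but writing the comparison function is slightly more tedious.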

echo popularNames
  .sortedByIt(it.split('\t')[2].parseUInt)
  .reversed
  .join("\n")
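
For reference, the comparison-function form mentioned above might look like this. It is a sketch with parseInt in place of parseUInt, and the helper name sortByThirdColumnDesc and the sample lines are made up for illustration:

```nim
import std/[algorithm, strutils]

proc sortByThirdColumnDesc(lines: seq[string]): seq[string] =
  ## Sorts lines by the numeric value of the third column, largest first.
  proc byCount(a, b: string): int =
    cmp(a.split('\t')[2].parseInt, b.split('\t')[2].parseInt)
  lines.sorted(byCount, SortOrder.Descending)

echo sortByThirdColumnDesc(@["a\tF\t1", "b\tF\t3", "c\tF\t2"]).join("\n")
```

Note that this version splits and parses each line on every comparison, so it trades some speed for doing the descending sort in one step.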

19. Finding the frequency of each first-column string and listing them from most to least frequent

Using UNIX commands (the proper way)

discard execShellCmd popularNames & " | cut -f 1 | sort | uniq -c | tr -s ' ' | sort -rn | cut -d ' ' -f 3"

Without UNIX commands

tables' Table holds the frequencies. ³

import std/tables

var counts: Table[string, int]
for name in popularNames.mapIt(it.split('\t')[0]):
  if counts.contains(name):
    counts[name] += 1
  else:
    counts[name] = 1
echo counts
  .pairs
  .toSeq
  .sortedByIt(it[0]) # also sort by name when frequencies are equal; not strictly required by the task
  .sortedByIt(it[1])
  .reversed
  .join("\n")
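
The tables module also provides CountTable, which is designed for this kind of counting. The following sketch is an alternative to the Table-based loop; nameFrequencies is a made-up helper name, and the order of entries with equal counts is not specified here:

```nim
import std/[sequtils, strutils, tables]

proc nameFrequencies(lines: seq[string]): seq[(string, int)] =
  ## Counts first-column values and returns them, most frequent first.
  var counts = lines.mapIt(it.split('\t')[0]).toCountTable
  counts.sort()  # CountTable.sort orders by count, descending by default
  for name, count in counts:
    result.add((name, count))

for (name, count) in nameFrequencies(@["Mary\tF", "Anna\tF", "Mary\tF"]):
  echo count, " ", name
```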

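¹ stripLineEnd would be the exact fit for this purpose, but strip is good enough.
² The conditional would not be needed if a rounding-up uint division were available, but
I could not find one in a quick search.
³ If the counting could be done neatly with sequtils' foldl, it feels like this could be written
as a single statement without mixing for and method chains, but that may just be my imagination.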
