言語処理100本ノック2015
http://www.cl.ecei.tohoku.ac.jp/nlp100/
第2章: UNIXコマンドの基礎
「hightemp.txt: http://www.cl.ecei.tohoku.ac.jp/nlp100/data/hightemp.txt
気象庁が公開している「歴代全国ランキング>観測史上の順位>最高気温の高い方から」を基に,手作業でタブ区切り形式に整形したものです.利用規約等はこちらのページを参照して下さい.
http://www.jma.go.jp/jma/kishou/info/coment.html
」
Unixとなっているが、Linux, MacOSでもbashを使っていれば同じ。
docker でubuntu, Raspberry PIでRaspbianでも、bashをつかっていれば同じ。
dockerのubuntuで、コマンドのhelpの出力を記録する。素のubuntuで/bin/bashを起動している限り、第二章のコマンドはすべて動作しました。
コマンドのhelpは、
command --help
10. 行数のカウント
素人の言語処理100本ノック:10
https://qiita.com/segavvy/items/9b14293e50dec0c4b242
wc
wc --help
Usage: wc [OPTION]... [FILE]...
or: wc [OPTION]... --files0-from=F
Print newline, word, and byte counts for each FILE, and a total line if
more than one FILE is specified. A word is a non-zero-length sequence of
characters delimited by white space.
With no FILE, or when FILE is -, read standard input.
The options below may be used to select which counts are printed, always in
the following order: newline, word, character, byte, maximum line length.
-c, --bytes print the byte counts
-m, --chars print the character counts
-l, --lines print the newline counts
--files0-from=F read input from the files specified by
NUL-terminated names in file F;
If F is - then read names from standard input
-L, --max-line-length print the maximum display width
-w, --words print the word counts
--help display this help and exit
--version output version information and exit
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report wc translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/wc>
or available locally via: info '(coreutils) wc invocation'
11. タブをスペースに置換
素人の言語処理100本ノック:11
https://qiita.com/segavvy/items/85e78ce405daf10d0ed6
sed
sed --help
Usage: sed [OPTION]... {script-only-if-no-other-script} [input-file]...
-n, --quiet, --silent
suppress automatic printing of pattern space
-e script, --expression=script
add the script to the commands to be executed
-f script-file, --file=script-file
add the contents of script-file to the commands to be executed
--follow-symlinks
follow symlinks when processing in place
-i[SUFFIX], --in-place[=SUFFIX]
edit files in place (makes backup if SUFFIX supplied)
-l N, --line-length=N
specify the desired line-wrap length for the `l' command
--posix
disable all GNU extensions.
-E, -r, --regexp-extended
use extended regular expressions in the script
(for portability use POSIX -E).
-s, --separate
consider files as separate rather than as a single,
continuous long stream.
--sandbox
operate in sandbox mode.
-u, --unbuffered
load minimal amounts of data from the input files and flush
the output buffers more often
-z, --null-data
separate lines by NUL characters
--help display this help and exit
--version output version information and exit
If no -e, --expression, -f, or --file option is given, then the first
non-option argument is taken as the sed script to interpret. All
remaining arguments are names of input files; if no input files are
specified, then the standard input is read.
GNU sed home page: <http://www.gnu.org/software/sed/>.
General help using GNU software: <http://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-sed@gnu.org>.
tr
tr --help
Usage: tr [OPTION]... SET1 [SET2]
Translate, squeeze, and/or delete characters from standard input,
writing to standard output.
-c, -C, --complement use the complement of SET1
-d, --delete delete characters in SET1, do not translate
-s, --squeeze-repeats replace each sequence of a repeated character
that is listed in the last specified SET,
with a single occurrence of that character
-t, --truncate-set1 first truncate SET1 to length of SET2
--help display this help and exit
--version output version information and exit
SETs are specified as strings of characters. Most represent themselves.
Interpreted sequences are:
\NNN character with octal value NNN (1 to 3 octal digits)
\\ backslash
\a audible BEL
\b backspace
\f form feed
\n new line
\r return
\t horizontal tab
\v vertical tab
CHAR1-CHAR2 all characters from CHAR1 to CHAR2 in ascending order
[CHAR*] in SET2, copies of CHAR until length of SET1
[CHAR*REPEAT] REPEAT copies of CHAR, REPEAT octal if starting with 0
[:alnum:] all letters and digits
[:alpha:] all letters
[:blank:] all horizontal whitespace
[:cntrl:] all control characters
[:digit:] all digits
[:graph:] all printable characters, not including space
[:lower:] all lower case letters
[:print:] all printable characters, including space
[:punct:] all punctuation characters
[:space:] all horizontal or vertical whitespace
[:upper:] all upper case letters
[:xdigit:] all hexadecimal digits
[=CHAR=] all characters which are equivalent to CHAR
Translation occurs if -d is not given and both SET1 and SET2 appear.
-t may be used only when translating. SET2 is extended to length of
SET1 by repeating its last character as necessary. Excess characters
of SET2 are ignored. Only [:lower:] and [:upper:] are guaranteed to
expand in ascending order; used in SET2 while translating, they may
only be used in pairs to specify case conversion. -s uses the last
specified SET, and occurs after translation or deletion.
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report tr translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/tr>
or available locally via: info '(coreutils) tr invocation'
expand
expand --help
Usage: expand [OPTION]... [FILE]...
Convert tabs in each FILE to spaces, writing to standard output.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-i, --initial do not convert tabs after non blanks
-t, --tabs=NUMBER have tabs NUMBER characters apart, not 8
-t, --tabs=LIST use comma separated list of explicit tab positions
--help display this help and exit
--version output version information and exit
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report expand translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/expand>
or available locally via: info '(coreutils) expand invocation'
12. 1列目をcol1.txtに,2列目をcol2.txtに保存
素人の言語処理100本ノック:12
https://qiita.com/segavvy/items/51a515c19bcd29b13b7f
cut
cut --help
Usage: cut OPTION... [FILE]...
Print selected parts of lines from each FILE to standard output.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-b, --bytes=LIST select only these bytes
-c, --characters=LIST select only these characters
-d, --delimiter=DELIM use DELIM instead of TAB for field delimiter
-f, --fields=LIST select only these fields; also print any line
that contains no delimiter character, unless
the -s option is specified
-n (ignored)
--complement complement the set of selected bytes, characters
or fields
-s, --only-delimited do not print lines not containing delimiters
--output-delimiter=STRING use STRING as the output delimiter
the default is to use the input delimiter
-z, --zero-terminated line delimiter is NUL, not newline
--help display this help and exit
--version output version information and exit
Use one, and only one of -b, -c or -f. Each LIST is made up of one
range, or many ranges separated by commas. Selected input is written
in the same order that it is read, and is written exactly once.
Each range is one of:
N N'th byte, character or field, counted from 1
N- from N'th byte, character or field, to end of line
N-M from N'th to M'th (included) byte, character or field
-M from first to M'th (included) byte, character or field
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report cut translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/cut>
or available locally via: info '(coreutils) cut invocation'
13. col1.txtとcol2.txtをマージ
素人の言語処理100本ノック:13
https://qiita.com/segavvy/items/ec9a3846f779c81e3d10
paste
paste --help
Usage: paste [OPTION]... [FILE]...
Write lines consisting of the sequentially corresponding lines from
each FILE, separated by TABs, to standard output.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-d, --delimiters=LIST reuse characters from LIST instead of TABs
-s, --serial paste one file at a time instead of in parallel
-z, --zero-terminated line delimiter is NUL, not newline
--help display this help and exit
--version output version information and exit
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report paste translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/paste>
or available locally via: info '(coreutils) paste invocation'
14. 先頭からN行を出力
素人の言語処理100本ノック:14
https://qiita.com/segavvy/items/53271a1327bb3c8c5457
head
# head --help
Usage: head [OPTION]... [FILE]...
Print the first 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-c, --bytes=[-]NUM print the first NUM bytes of each file;
with the leading '-', print all but the last
NUM bytes of each file
-n, --lines=[-]NUM print the first NUM lines instead of the first 10;
with the leading '-', print all but the last
NUM lines of each file
-q, --quiet, --silent never print headers giving file names
-v, --verbose always print headers giving file names
-z, --zero-terminated line delimiter is NUL, not newline
--help display this help and exit
--version output version information and exit
NUM may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report head translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/head>
or available locally via: info '(coreutils) head invocation'
15. 末尾のN行を出力
素人の言語処理100本ノック:15
https://qiita.com/segavvy/items/c2675e44ea05c56995e3
##tail
tail --help
Usage: tail [OPTION]... [FILE]...
Print the last 10 lines of each FILE to standard output.
With more than one FILE, precede each with a header giving the file name.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-c, --bytes=[+]NUM output the last NUM bytes; or use -c +NUM to
output starting with byte NUM of each file
-f, --follow[={name|descriptor}]
output appended data as the file grows;
an absent option argument means 'descriptor'
-F same as --follow=name --retry
-n, --lines=[+]NUM output the last NUM lines, instead of the last 10;
or use -n +NUM to output starting with line NUM
--max-unchanged-stats=N
with --follow=name, reopen a FILE which has not
changed size after N (default 5) iterations
to see if it has been unlinked or renamed
(this is the usual case of rotated log files);
with inotify, this option is rarely useful
--pid=PID with -f, terminate after process ID, PID dies
-q, --quiet, --silent never output headers giving file names
--retry keep trying to open a file if it is inaccessible
-s, --sleep-interval=N with -f, sleep for approximately N seconds
(default 1.0) between iterations;
with inotify and --pid=P, check process P at
least once every N seconds
-v, --verbose always output headers giving file names
-z, --zero-terminated line delimiter is NUL, not newline
--help display this help and exit
--version output version information and exit
NUM may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
With --follow (-f), tail defaults to following the file descriptor, which
means that even if a tail'ed file is renamed, tail will continue to track
its end. This default behavior is not desirable when you really want to
track the actual name of the file, not the file descriptor (e.g., log
rotation). Use --follow=name in that case. That causes tail to track the
named file in a way that accommodates renaming, removal and creation.
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report tail translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/tail>
or available locally via: info '(coreutils) tail invocation'
16. ファイルをN分割する
素人の言語処理100本ノック:16
https://qiita.com/segavvy/items/993ea169a1111c6f6f69
##split
# split --help
Usage: split [OPTION]... [FILE [PREFIX]]
Output pieces of FILE to PREFIXaa, PREFIXab, ...;
default size is 1000 lines, and default PREFIX is 'x'.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
-a, --suffix-length=N generate suffixes of length N (default 2)
--additional-suffix=SUFFIX append an additional SUFFIX to file names
-b, --bytes=SIZE put SIZE bytes per output file
-C, --line-bytes=SIZE put at most SIZE bytes of records per output file
-d use numeric suffixes starting at 0, not alphabetic
--numeric-suffixes[=FROM] same as -d, but allow setting the start value
-e, --elide-empty-files do not generate empty output files with '-n'
--filter=COMMAND write to shell COMMAND; file name is $FILE
-l, --lines=NUMBER put NUMBER lines/records per output file
-n, --number=CHUNKS generate CHUNKS output files; see explanation below
-t, --separator=SEP use SEP instead of newline as the record separator;
'\0' (zero) specifies the NUL character
-u, --unbuffered immediately copy input to output with '-n r/...'
--verbose print a diagnostic just before each
output file is opened
--help display this help and exit
--version output version information and exit
The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).
CHUNKS may be:
N split into N files based on size of input
K/N output Kth of N to stdout
l/N split into N files without splitting lines/records
l/K/N output Kth of N to stdout without splitting lines/records
r/N like 'l' but use round robin distribution
r/K/N likewise but only output Kth of N to stdout
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report split translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/split>
or available locally via: info '(coreutils) split invocation'
17. 1列目の文字列の異なり
素人の言語処理100本ノック:17
https://qiita.com/segavvy/items/202e61ce4e4cccc29d06
##sort
sort --help
Usage: sort [OPTION]... [FILE]...
or: sort [OPTION]... --files0-from=F
Write sorted concatenation of all FILE(s) to standard output.
With no FILE, or when FILE is -, read standard input.
Mandatory arguments to long options are mandatory for short options too.
Ordering options:
-b, --ignore-leading-blanks ignore leading blanks
-d, --dictionary-order consider only blanks and alphanumeric characters
-f, --ignore-case fold lower case to upper case characters
-g, --general-numeric-sort compare according to general numerical value
-i, --ignore-nonprinting consider only printable characters
-M, --month-sort compare (unknown) < 'JAN' < ... < 'DEC'
-h, --human-numeric-sort compare human readable numbers (e.g., 2K 1G)
-n, --numeric-sort compare according to string numerical value
-R, --random-sort shuffle, but group identical keys. See shuf(1)
--random-source=FILE get random bytes from FILE
-r, --reverse reverse the result of comparisons
--sort=WORD sort according to WORD:
general-numeric -g, human-numeric -h, month -M,
numeric -n, random -R, version -V
-V, --version-sort natural sort of (version) numbers within text
Other options:
--batch-size=NMERGE merge at most NMERGE inputs at once;
for more use temp files
-c, --check, --check=diagnose-first check for sorted input; do not sort
-C, --check=quiet, --check=silent like -c, but do not report first bad line
--compress-program=PROG compress temporaries with PROG;
decompress them with PROG -d
--debug annotate the part of the line used to sort,
and warn about questionable usage to stderr
--files0-from=F read input from the files specified by
NUL-terminated names in file F;
If F is - then read names from standard input
-k, --key=KEYDEF sort via a key; KEYDEF gives location and type
-m, --merge merge already sorted files; do not sort
-o, --output=FILE write result to FILE instead of standard output
-s, --stable stabilize sort by disabling last-resort comparison
-S, --buffer-size=SIZE use SIZE for main memory buffer
-t, --field-separator=SEP use SEP instead of non-blank to blank transition
-T, --temporary-directory=DIR use DIR for temporaries, not $TMPDIR or /tmp;
multiple options specify multiple directories
--parallel=N change the number of sorts run concurrently to N
-u, --unique with -c, check for strict ordering;
without -c, output only the first of an equal run
-z, --zero-terminated line delimiter is NUL, not newline
--help display this help and exit
--version output version information and exit
KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a
field number and C a character position in the field; both are origin 1, and
the stop position defaults to the line's end. If neither -t nor -b is in
effect, characters in a field are counted from the beginning of the preceding
whitespace. OPTS is one or more single-letter ordering options [bdfgiMhnRrV],
which override global ordering options for that key. If no key is given, use
the entire line as the key. Use --debug to diagnose incorrect key usage.
SIZE may be followed by the following multiplicative suffixes:
% 1% of memory, b 1, K 1024 (default), and so on for M, G, T, P, E, Z, Y.
*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report sort translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/sort>
or available locally via: info '(coreutils) sort invocation'
uniq
uniq --help
Usage: uniq [OPTION]... [INPUT [OUTPUT]]
Filter adjacent matching lines from INPUT (or standard input),
writing to OUTPUT (or standard output).
With no options, matching lines are merged to the first occurrence.
Mandatory arguments to long options are mandatory for short options too.
-c, --count prefix lines by the number of occurrences
-d, --repeated only print duplicate lines, one for each group
-D print all duplicate lines
--all-repeated[=METHOD] like -D, but allow separating groups
with an empty line;
METHOD={none(default),prepend,separate}
-f, --skip-fields=N avoid comparing the first N fields
--group[=METHOD] show all items, separating groups with an empty line;
METHOD={separate(default),prepend,append,both}
-i, --ignore-case ignore differences in case when comparing
-s, --skip-chars=N avoid comparing the first N characters
-u, --unique only print unique lines
-z, --zero-terminated line delimiter is NUL, not newline
-w, --check-chars=N compare no more than N characters in lines
--help display this help and exit
--version output version information and exit
A field is a run of blanks (usually spaces and/or TABs), then non-blank
characters. Fields are skipped before chars.
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use 'sort -u' without 'uniq'.
Also, comparisons honor the rules specified by 'LC_COLLATE'.
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report uniq translation bugs to <http://translationproject.org/team/>
Full documentation at: <http://www.gnu.org/software/coreutils/uniq>
or available locally via: info '(coreutils) uniq invocation'
18. 各行を3コラム目の数値の降順にソート
素人の言語処理100本ノック:18
今日の作業記録 python error(言語処理100本ノック:18)
https://qiita.com/kaizen_nagoya/items/d184d9aec28ca8428f3d
19. 各行の1コラム目の文字列の出現頻度を求め,出現頻度の高い順に並べる
素人の言語処理100本ノック:19
参考資料(reference)
素人の言語処理100本ノック:まとめ
https://qiita.com/segavvy/items/fb50ba8097d59475f760
Dockerでpython言語処理100本ノック
https://qiita.com/taguchi_tomo/items/24483ceaea7638e83310
自己資料(self reference)
言語処理100本ノックをdockerで。
https://qiita.com/kaizen_nagoya/items/7e7eb7c543e0c18438c4
Play with Docker でエラー
https://qiita.com/kaizen_nagoya/items/fbf054705bff725dbc25
言語処理100本ノック 2015(python) 動作確認docker環境構築
https://qiita.com/kaizen_nagoya/items/abaf3fd0198f9f557243
言語処理100本ノック 2015(python) 落ち穂拾い 第1章: 準備運動 00, 01, 02, 03, 04, 05, 06,07
https://qiita.com/kaizen_nagoya/items/ee1b625b0b65cd63d42a
「Python 入門」の入門
https://qiita.com/kaizen_nagoya/items/22c99c5926984ede6573
Windows(MS)にPython(Anaconda)を導入する(6つの罠)
https://qiita.com/kaizen_nagoya/items/7bfd7ecdc4e8edcbd679
65歳からのプログラミング入門
https://qiita.com/kaizen_nagoya/items/1561f910c275b22d7c9f
65歳からのプログラミング入門(2) 二日目
https://qiita.com/kaizen_nagoya/items/57f362fb801fd3132803
なぜdockerで機械学習するか 書籍・ソース一覧作成中 (目標100)
https://qiita.com/kaizen_nagoya/items/ddd12477544bf5ba85e2
dockerで機械学習 with anaconda(1)「ゼロから作るDeep Learning - Pythonで学ぶディープラーニングの理論と実装」斎藤 康毅 著
https://qiita.com/kaizen_nagoya/items/a7e94ef6dca128d035ab
dockerで機械学習with anaconda(2)「ゼロから作るDeep Learning2自然言語処理編」斎藤 康毅 著
https://qiita.com/kaizen_nagoya/items/3b80dfc76933cea522c6
プログラミング言語教育のXYZ
https://qiita.com/kaizen_nagoya/items/1950c5810fb5c0b07be4
文書履歴(document history)
ver. 0.01 初稿 20190120 午前
ver. 0.02 参考資料追記 20190120 昼
ver. 0.03 「素人の言語処理100本ノック」 @segavvy さんのURL追記 20190120 午後
ver. 0.04 表記訂正 20190121 朝
ver. 0.05 資料整理 20190121 昼
#b# 最後までおよみいただきありがとうございました。
いいね 💚、フォローをお願いします。
Thank you very much for reading to the last sentence.
Please press the like icon 💚 and follow me for your happy life.