More than 1 year has passed since last update.

Extracting words from a subtitle file (.srt)

Introduction

  • I wanted to pull a word list out of a subtitle file...
  • Idea: let's cut the words out with grep, using -o together with the \w pattern
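
As a quick illustration of the word-splitting step, here is a minimal sketch. The function name `split_words` is made up for this example, and it assumes a grep that understands `\w` (GNU grep and BSD grep both do; it is an extension, not part of POSIX).

```shell
#!/bin/sh
# split_words (hypothetical helper): read text on stdin and print each
# word on its own line.
#   -o   print only the matched part of each line, one match per line
#   \w+  a run of letters, digits, or underscores
split_words() {
  grep -o -E '\w+'
}

# Two lines of dialogue; punctuation is dropped because it is not matched by \w.
printf 'Come in.\nplease report.\n' | split_words
# Come
# in
# please
# report
```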

Prerequisites

  • The target file is in .srt format
  • Example of .srt
6
00:08:44,033 --> 00:08:46,868
Come in.
please report.
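
Each cue in the example above is a cue number, a timestamp line, and then the dialogue lines. A rough sketch of separating the dialogue from the rest could look like the following. The file name `sample.srt` and the function name `srt_text_lines` are made up for this example, and the CRLF line endings are an assumption based on the `\r` that the commands in this article look for.

```shell
#!/bin/sh
# Hypothetical sample file: the cue from this article, written with CRLF
# line endings (an assumption, not something the article states outright).
printf '6\r\n00:08:44,033 --> 00:08:46,868\r\nCome in.\r\nplease report.\r\n\r\n' > sample.srt

# srt_text_lines FILE (hypothetical helper): print only the dialogue lines.
#   tr -d '\r'   strip carriage returns so the pattern can anchor on $
#   grep -v -E   drop lines made only of digits, ':' ',' '.' '>' '-' and
#                spaces, i.e. cue numbers, timestamp lines, and blank lines
srt_text_lines() {
  tr -d '\r' < "$1" | grep -v -E '^[0-9:,.> -]*$'
}

srt_text_lines sample.srt
# Come in.
# please report.
```

Note that a dialogue line consisting only of digits would also be dropped by this filter.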

Trying it out

  • Word count (number of unique words)
$ cat some_subtitle.srt | grep -v -e '^[.,:0-9> \-]\+\r' | grep -v -e '^\r$' | grep -o -E '\w+' | sort | uniq | wc -l
    1233
  • Show the words
$ cat some_subtitle.srt | grep -v -e '^[.,:0-9> \-]\+\r' | grep -v -e '^\r$' | grep -o -E '\w+' | sort | uniq
000
1
10
10K
12
12th
13th
157
15th
16
 :
 :
wrong
yeah
year
years
yet
yo
you
young
$
  • Save to a file
$ cat some_subtitle.srt | grep -v -e '^[.,:0-9> \-]\+\r' | grep -v -e '^\r$' | grep -o -E '\w+' | sort | uniq > wordlist.txt
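
The three commands above share the same pipeline, so it can be wrapped in a small function. The following is only a sketch under a few assumptions: `srt_words` and `sample.srt` are names made up here, the second cue in the sample is invented test data, and the `\r` patterns are swapped for `tr -d '\r'` because some grep implementations (GNU grep, for example) may not treat `\r` in a pattern as a carriage return.

```shell
#!/bin/sh
# srt_words FILE (hypothetical helper): print the unique words of an .srt
# file, one per line.
srt_words() {
  tr -d '\r' < "$1" |               # normalize CRLF to LF
    grep -v -E '^[0-9:,.> -]*$' |   # drop cue numbers, timestamps, blank lines
    grep -o -E '\w+' |              # one word per line
    sort -u                         # sort and de-duplicate (like sort | uniq)
}

# Hypothetical sample input with CRLF line endings; the second cue is made up.
printf '6\r\n00:08:44,033 --> 00:08:46,868\r\nCome in.\r\nplease report.\r\n\r\n7\r\n00:08:47,000 --> 00:08:48,500\r\nCome in, please.\r\n' > sample.srt

srt_words sample.srt                 # show the words
srt_words sample.srt | wc -l         # count the unique words
srt_words sample.srt > wordlist.txt  # save to a file
```

The comparison is case-sensitive, just like the original commands, so "Come" and "come" would count as two different words.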

Conclusion

  • That was easy, wasn't it?