Extracting words from a subtitle file (.sub)

Posted at 2022-11-27

Introduction

  • I wanted to pull a word list out of a subtitle file.
  • Idea: do it on the command line, using grep -o with the \w (word character) pattern.

Prerequisites

  • The target file is in the .sub format
  • Example of a .sub line ({start}{end} followed by the subtitle text):
{16819}{16850}Yeah, I am.
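
To make the format concrete, here is a minimal sketch of how the two {...} fields can be stripped from that sample line. It uses the same awk expression as the commands in this article; the variable names line and text are only illustrative.

```shell
# One sample line in the {start}{end}text layout.
line='{16819}{16850}Yeah, I am.'

# Each awk call drops everything up to and including the first "}".
# Running it twice removes both {...} fields and leaves only the text.
text=$(printf '%s\n' "$line" \
  | awk '{print substr($0, index($0, "}") + 1)}' \
  | awk '{print substr($0, index($0, "}") + 1)}')

printf '%s\n' "$text"   # prints: Yeah, I am.
```

index($0, "}") returns the position of the first closing brace, so substr(..., index + 1) keeps whatever follows it.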

Trying it out

  • Count the unique words
$ cat some_subtitle.sub | awk '{print substr($0,index($0,"}")+1)}' | awk '{print substr($0,index($0,"}")+1)}' | grep -o -E '\w+' | sort | uniq | wc -l
    1602
  • List the words
$ cat some_subtitle.sub | awk '{print substr($0,index($0,"}")+1)}' | awk '{print substr($0,index($0,"}")+1)}' | grep -o -E '\w+' | sort | uniq
000
1
10
10K
12
12th
13th
157
15th
16
 :
 :
wrong
yeah
year
years
yet
yo
you
young
$
  • Save to a file
$ cat some_subtitle.sub | awk '{print substr($0,index($0,"}")+1)}' | awk '{print substr($0,index($0,"}")+1)}' | grep -o -E '\w+' | sort | uniq > wordlist.txt
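
The same pipeline can also be wrapped in a small function and tried on a throwaway file. This is only a sketch under a few assumptions: sub_words is a hypothetical helper name, the two-line sample file is made up for illustration, [[:alnum:]_]+ is swapped in as the POSIX spelling of \w+, and sort -u stands in for sort | uniq.

```shell
# Hypothetical helper (not part of the original commands):
# prints the unique words of a .sub file, one per line.
sub_words() {
  awk '{print substr($0, index($0, "}") + 1)}' "$1" |  # drop the first {...} field
    awk '{print substr($0, index($0, "}") + 1)}' |     # drop the second {...} field
    grep -o -E '[[:alnum:]_]+' |                         # emit one word per line
    sort -u                                              # sort and de-duplicate
}

# Made-up two-line sample file for a quick check.
sample=$(mktemp)
printf '%s\n' '{16819}{16850}Yeah, I am.' '{16851}{16900}I am sure.' > "$sample"

sub_words "$sample"                                # lists Yeah, I, am, sure (order depends on locale)
count=$(sub_words "$sample" | wc -l | tr -d ' ')   # tr strips the padding some wc versions add
echo "$count"                                      # prints: 4

rm -f "$sample"
```

Passing the file name to awk directly avoids the extra cat. Words are compared case-sensitively, as in the original commands, so "Yeah" and "yeah" would count as two entries.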

Conclusion

  • That was easy, wasn't it?