This time, I'll improve the accuracy of Julius's speech recognition so that it can be used as a user interface.
Creating a custom dictionary
To create a custom dictionary, prepare the following files.
Extension | Description
---|---
yomi | reading file
phone | phoneme file
grammar | grammar file
voca | vocabulary file
pi@raspberrypi:~/julius $ mkdir dict
pi@raspberrypi:~/julius $ cd dict
Creating the reading file
Each line pairs a word with its hiragana reading, separated by whitespace.
pi@raspberrypi:~/julius/dict $ vi sensor.yomi
おはよう おはよう
こんにちは こんにちは
こんばんわ こんばんわ
おんど おんど
しつど しつど
きあつ きあつ
Creating the phoneme file
Generate the phoneme file with the yomi2voca.pl script bundled with Julius.
pi@raspberrypi:~/julius $ ./julius-4.6/gramtools/yomi2voca/yomi2voca.pl ./dict/sensor.yomi > ./dict/sensor.phone
おはよう o h a y o u
こんにちは k o N n i ch i h a
こんばんわ k o N b a N w a
おんど o N d o
しつど sh i ts u d o
きあつ k i a ts u
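What yomi2voca.pl does here is essentially a mora-to-phoneme table lookup. As a rough illustration only, here is a minimal sketch covering just the morae used by the six words above (the real script handles the full kana inventory, yoon like きゃ, long vowels, and geminates):

```python
# Minimal sketch of the kana-to-phoneme lookup that yomi2voca.pl performs.
# Only the morae needed for the six words above are included.
MORA2PHONE = {
    "お": "o", "は": "h a", "よ": "y o", "う": "u",
    "こ": "k o", "ん": "N", "に": "n i", "ち": "ch i",
    "ば": "b a", "わ": "w a", "ど": "d o",
    "し": "sh i", "つ": "ts u", "き": "k i", "あ": "a",
}

def kana2phone(word: str) -> str:
    """Convert a hiragana string to a space-separated phoneme string."""
    return " ".join(MORA2PHONE[ch] for ch in word)

print(kana2phone("おはよう"))    # o h a y o u
print(kana2phone("こんにちは"))  # k o N n i ch i h a
```

Note how ん becomes the single phoneme N while most kana expand to a consonant-vowel pair; that is why the generated .phone lines have varying lengths.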
Creating the grammar file
pi@raspberrypi:~/julius $ cd dict
pi@raspberrypi:~/julius/dict $ vi sensor.grammar
S : NS_B SENSOR NS_E
SENSOR : OHAYOU
SENSOR : KONNICHIHA
SENSOR : KONBANWA
SENSOR : ONDO
SENSOR : SHITSUDO
SENSOR : KIATSU
Creating the vocabulary file
Copy the phoneme file, add a % category line above each word, and append the NS_B/NS_E silence entries.
pi@raspberrypi:~/julius/dict $ cp sensor.phone sensor.voca
pi@raspberrypi:~/julius/dict $ vi sensor.voca
%OHAYOU
おはよう o h a y o u
%KONNICHIHA
こんにちは k o N n i ch i h a
%KONBANWA
こんばんわ k o N b a N w a
%ONDO
おんど o N d o
%SHITSUDO
しつど sh i ts u d o
%KIATSU
きあつ k i a ts u
% NS_B
[s] silB
% NS_E
[/s] silE
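In sensor.grammar, the rule S : NS_B SENSOR NS_E defines a sentence as start silence, exactly one SENSOR word, then end silence, and every terminal symbol used there must have a matching % category in sensor.voca. A quick sanity check of that correspondence (a sketch, with the file contents inlined as strings):

```python
# Grammar and voca category content from the files created above.
GRAMMAR = """\
S : NS_B SENSOR NS_E
SENSOR : OHAYOU
SENSOR : KONNICHIHA
SENSOR : KONBANWA
SENSOR : ONDO
SENSOR : SHITSUDO
SENSOR : KIATSU
"""

VOCA_CATEGORIES = ["OHAYOU", "KONNICHIHA", "KONBANWA",
                   "ONDO", "SHITSUDO", "KIATSU", "NS_B", "NS_E"]

def terminals(grammar_text: str) -> set[str]:
    """Symbols that appear on a right-hand side but never on a left-hand side."""
    lhs, rhs = set(), set()
    for line in grammar_text.splitlines():
        if ":" not in line:
            continue
        left, right = line.split(":", 1)
        lhs.add(left.strip())
        rhs.update(right.split())
    return rhs - lhs

missing = terminals(GRAMMAR) - set(VOCA_CATEGORIES)
print("categories missing from voca:", sorted(missing))
```

If this prints anything other than an empty list, mkdfa will complain (or produce a grammar that cannot match anything), so it is worth checking before generating the dictionary.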
Generating the dictionary files
At first, I used mkdfa.pl to create the sensor.dfa and sensor.dict files.
Running mkdfa.pl gave a "cygpath: not found" error, and only an xxx.dfatmp file was generated instead of xxx.dfa, so I renamed xxx.dfatmp to xxx.dfa and used that.
There was a web page that used it that way, so I didn't worry about it at first, but while experimenting I noticed that silB and silE were coming out reversed.
I tried various things without success. Then I noticed a file called mkdfa.py in the same directory as mkdfa.pl, and when I generated the dictionary with it instead, no errors appeared and the sensor.dfa and sensor.dict files were created properly.
(sensor.term and sensor.dfa.forward files were also created in the same folder.)
pi@raspberrypi:~/julius/julius-4.6/gramtools/mkdfa $ mkdfa.py ~/julius/dict/sensor
/home/pi/julius/dict/sensor.grammar has 7 rules
/home/pi/julius/dict/sensor.voca has 8 categories and 8 words
---
Now parsing grammar file
Now modifying grammar to reduce states[-1]
Now parsing vocabulary file
Now making nondeterministic finite automaton[4/4]
Now making deterministic finite automaton[4/4]
Now making triplet list[4/4]
---
8 categories, 4 nodes, 8 arcs
-> minimized: 4 nodes, 8 arcs
Now parsing grammar file
Now modifying grammar to reduce states[-1]
Now parsing vocabulary file
Now making nondeterministic finite automaton[4/4]
Now making deterministic finite automaton[4/4]
Now making triplet list[4/4]
---
8 categories, 4 nodes, 8 arcs
-> minimized: 4 nodes, 8 arcs
---
generated /home/pi/julius/dict/sensor.dfa /home/pi/julius/dict/sensor.term /home/pi/julius/dict/sensor.dict /home/pi/julius/dict/sensor.dfa.forward
Let's try speech recognition using the dictionary files we just generated.
(omitted)
pass1_best: [s] おんど [/s]
pass1_best_wordseq: 6 3 7
pass1_best_phonemeseq: silB | o N d o | silE
pass1_best_score: -2525.792725
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 107
sentence1: [s] おんど [/s]
wseq1: 6 3 7
phseq1: silB | o N d o | silE
cmscore1: 1.000 1.000 1.000
score1: -2525.794189
pass1_best: [s] しつど [/s]
pass1_best_wordseq: 6 4 7
pass1_best_phonemeseq: silB | sh i ts u d o | silE
pass1_best_score: -1794.630737
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 73
sentence1: [s] しつど [/s]
wseq1: 6 4 7
phseq1: silB | sh i ts u d o | silE
cmscore1: 1.000 0.887 1.000
score1: -1794.629761
(omitted)
The silences at the beginning and end of the sentence are now recognized correctly.
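Since the goal is to use Julius as a user interface, a program will want to pull the final result out of this output. The second-pass result appears on the sentence1: lines, which can be extracted like this (a sketch; a real program would read these lines from a julius subprocess):

```python
import re

# Pattern for Julius's final (2nd-pass) result line, e.g.
#   "sentence1: [s] おんど [/s]"
SENTENCE_RE = re.compile(r"^sentence1:\s*\[s\]\s*(.+?)\s*\[/s\]")

def extract_word(log_line: str):
    """Return the recognized word from a sentence1: line, else None."""
    m = SENTENCE_RE.match(log_line)
    return m.group(1) if m else None

# Against lines from the log above:
print(extract_word("sentence1: [s] おんど [/s]"))   # おんど
print(extract_word("pass1_best: [s] おんど [/s]"))  # None (1st-pass line, skipped)
```

In a real program, you would wrap the julius command with subprocess.Popen, iterate over stdout line by line, and act whenever extract_word returns a word.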
Execution log (unabridged)
pi@raspberrypi:~ $ julius -C ~/julius/dictation-kit-4.5/am-gmm.jconf -nostrip -gram ~/julius/dict/sensor
STAT: include config: /home/pi/julius/dictation-kit-4.5/am-gmm.jconf
WARNING: m_chkparam: "-lmp" only for N-gram, ignored
WARNING: m_chkparam: "-lmp2" only for N-gram, ignored
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: binhmm-header: variance inversed
Stat: read_binhmm: has inversed variances
Stat: read_binhmm: binary format HMM definition
Stat: read_binhmm: this HMM does not need multipath handling
Stat: init_phmm: defined HMMs: 8443
Stat: init_phmm: loading binary hmmlist
Stat: load_hmmlist_bin: reading hmmlist
Stat: aptree_read: 42857 nodes (21428 branch + 21429 data)
Stat: load_hmmlist_bin: reading pseudo phone set
Stat: aptree_read: 3253 nodes (1626 branch + 1627 data)
Stat: init_phmm: logical names: 21429 in HMMList
Stat: init_phmm: base phones: 43 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: pseudo phones are loaded from binary hmmlist file
Stat: hmm_lookup: 12 pseudo phones are added to logical HMM list
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
STAT: reading [/home/pi/julius/dict/sensor.dfa] and [/home/pi/julius/dict/sensor.dict]...
Stat: init_voca: read 8 words
STAT: reading additional forward dfa [/home/pi/julius/dict/sensor.dfa.forward]
STAT: done
STAT: Gram #0 sensor registered
STAT: Gram #0 sensor: new grammar loaded, now mash it up for recognition
STAT: Gram #0 sensor: extracting category-pair constraint for the 1st pass
STAT: Gram #0 sensor: installed
STAT: Gram #0 sensor: turn on active
STAT: grammar update completed
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 120+0=120
STAT: coordination check passed
STAT: multi-gram: beam width set to 120 (guess) by lexicon change
STAT: wchmm (re)build completed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
STAT: [5] prepare for real-time decoding
STAT: All init successfully done
STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.6 (fast)
Engine specification:
- Base setup : fast
- Supported LM : DFA, N-gram, Word
- Extension : LibSndFile
- Compiled by : gcc -g -O2 -fPIC
Library configuration: version 4.6
- Audio input
primary A/D-in driver : alsa (Advanced Linux Sound Architecture)
available drivers : alsa
wavefile formats : various formats by libsndfile ver.1
max. length of an input : 320000 samples, 150 words
- Language Model
class N-gram support : yes
MBR weight support : yes
word id unit : short (2 bytes)
- Acoustic Model
multi-path treatment : autodetect
- External library
file decompression by : zlib library
- Process hangling
fork on adinnet input : no
- built-in SIMD instruction set for DNN
NONE AVAILABLE, DNN computation may be too slow!
- built-in CUDA support: no
------------------------------------------------------------
Configuration of Modules
Number of defined modules: AM=1, LM=1, SR=1
Acoustic Model (with input parameter spec.):
- AM00 "_default"
hmmfilename=/home/pi/julius/dictation-kit-4.5/model/phone_m/jnas-tri-3k16-gid.binhmm
hmmmapfilename=/home/pi/julius/dictation-kit-4.5/model/phone_m/logicalTri-3k16-gid.bin
Language Model:
- LM00 "_default"
grammar #1:
dfa = /home/pi/julius/dict/sensor.dfa
dict = /home/pi/julius/dict/sensor.dict
Recognizer:
- SR00 "_default" (AM00, LM00)
------------------------------------------------------------
Speech Analysis Module(s)
[MFCC01] for [AM00 _default]
Acoustic analysis condition:
parameter = MFCC_E_D_N_Z (25 dim. from 12 cepstrum + energy, abs energy supressed with CMN)
sample frequency = 16000 Hz
sample period = 625 (1 = 100ns)
window size = 400 samples (25.0 ms)
frame shift = 160 samples (10.0 ms)
pre-emphasis = 0.97
# filterbank = 24
cepst. lifter = 22
raw energy = False
energy normalize = False
delta window = 2 frames (20.0 ms) around
hi freq cut = OFF
lo freq cut = OFF
zero mean frame = OFF
use power = OFF
CVN = OFF
VTLN = OFF
spectral subtraction = off
cep. mean normalization = yes, real-time MAP-CMN, updating initial mean with last 500 input frames
initial mean from file = N/A
beginning data weight = 100.00
cep. var. normalization = no
base setup from = Julius defaults
------------------------------------------------------------
Acoustic Model(s)
[AM00 "_default"]
HMM Info:
8443 models, 3090 states, 3090 mpdfs, 49440 Gaussians are defined
model type = context dependency handling ON
training parameter = MFCC_E_N_D_Z
vector length = 25
number of stream = 1
stream info = [0-24]
cov. matrix type = DIAGC
duration type = NULLD
max mixture size = 16 Gaussians
max length of model = 5 states
logical base phones = 43
model skip trans. = not exist, no multi-path handling
AM Parameters:
Gaussian pruning = none (full computation) (-gprune)
short pause HMM name = "sp" specified, "sp" applied (physical) (-sp)
cross-word CD on pass1 = handle by approx. (use average prob. of same LC)
------------------------------------------------------------
Language Model(s)
[LM00 "_default"] type=grammar
DFA grammar info:
4 nodes, 8 arcs, 8 terminal(category) symbols
category-pair matrix: 56 bytes (896 bytes allocated)
additional forward DFA grammar info:
4 nodes, 8 arcs, 8 terminal(category) symbols
category-pair matrix: 0 bytes (0 bytes allocated)
Vocabulary Info:
vocabulary size = 8 words, 40 models
average word len = 5.0 models, 15.0 states
maximum state num = 27 nodes per word
transparent words = not exist
words under class = not exist
Parameters:
found sp category IDs =
------------------------------------------------------------
Recognizer(s)
[SR00 "_default"] AM00 "_default" + LM00 "_default"
Lexicon tree:
total node num = 120
root node num = 8
leaf node num = 8
(-penalty1) IW penalty1 = +0.0
(-penalty2) IW penalty2 = +0.0
(-cmalpha)CM alpha coef = 0.050000
Search parameters:
multi-path handling = no
(-b) trellis beam width = 120 (-1 or not specified - guessed)
(-bs)score pruning thres= disabled
(-n)search candidate num= 1
(-s) search stack size = 500
(-m) search overflow = after 2000 hypothesis poped
2nd pass method = searching sentence, generating N-best
(-b2) pass2 beam width = 30
(-lookuprange)lookup range= 5 (tm-5 <= t <tm+5)
(-sb)2nd scan beamthres = 80.0 (in logscore)
(-n) search till = 1 candidates found
(-output) and output = 1 candidates out of above
IWCD handling:
1st pass: approximation (use average prob. of same LC)
2nd pass: loose (apply when hypo. is popped and scanned)
all possible words will be expanded in 2nd pass
build_wchmm2() used
lcdset limited by word-pair constraint
short pause segmentation = off
fall back on search fail = off, returns search failure
------------------------------------------------------------
Decoding algorithm:
1st pass input processing = real time, on-the-fly
1st pass method = 1-best approx. generating indexed trellis
output word confidence measure based on search-time scores
------------------------------------------------------------
FrontEnd:
Input stream:
input type = waveform
input source = microphone
device API = default
sampling freq. = 16000 Hz
threaded A/D-in = supported, on
zero frames stripping = off
silence cutting = on
level thres = 2000 / 32767
zerocross thres = 60 / sec.
head margin = 300 msec.
tail margin = 400 msec.
chunk size = 1000 samples
FVAD switch value = -1 (disabled)
long-term DC removal = off
level scaling factor = 1.00 (disabled)
reject short input = off
reject long input = off
----------------------- System Information end -----------------------
Notice for feature extraction (01),
*************************************************************
* Cepstral mean normalization for real-time decoding: *
* NOTICE: The first input may not be recognized, since *
* no initial mean is available on startup. *
*************************************************************
------
### read waveform input
Stat: capture audio at 16000Hz
Stat: adin_alsa: latency set to 32 msec (chunk = 512 bytes)
Error: adin_alsa: unable to get pcm info from card control
Warning: adin_alsa: skip output of detailed audio device info
STAT: AD-in thread created
pass1_best: [s] おんど [/s]
pass1_best_wordseq: 6 3 7
pass1_best_phonemeseq: silB | o N d o | silE
pass1_best_score: -2385.513428
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 106
sentence1: [s] おんど [/s]
wseq1: 6 3 7
phseq1: silB | o N d o | silE
cmscore1: 1.000 1.000 1.000
score1: -2385.515625
pass1_best: [s] しつど [/s]
pass1_best_wordseq: 6 4 7
pass1_best_phonemeseq: silB | sh i ts u d o | silE
pass1_best_score: -2274.499512
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 94
sentence1: [s] しつど [/s]
wseq1: 6 4 7
phseq1: silB | sh i ts u d o | silE
cmscore1: 1.000 1.000 1.000
score1: -2274.498291
pass1_best: [s] おんど [/s]
pass1_best_wordseq: 6 3 7
pass1_best_phonemeseq: silB | o N d o | silE
pass1_best_score: -1950.462280
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 82
sentence1: [s] おんど [/s]
wseq1: 6 3 7
phseq1: silB | o N d o | silE
cmscore1: 1.000 1.000 1.000
score1: -1950.461060
pass1_best: [s] こんにちは [/s]
pass1_best_wordseq: 6 1 7
pass1_best_phonemeseq: silB | k o N n i ch i h a | silE
pass1_best_score: -2781.791992
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 113
sentence1: [s] こんにちは [/s]
wseq1: 6 1 7
phseq1: silB | k o N n i ch i h a | silE
cmscore1: 1.000 1.000 1.000
score1: -2781.795654
pass1_best: [s] こんばんわ [/s]
pass1_best_wordseq: 6 2 7
pass1_best_phonemeseq: silB | k o N b a N w a | silE
pass1_best_score: -2732.918213
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 115
sentence1: [s] こんばんわ [/s]
wseq1: 6 2 7
phseq1: silB | k o N b a N w a | silE
cmscore1: 1.000 1.000 1.000
score1: -2732.920410
<<< please speak >>>
Creating the dictionary files with mkdfa.pl (results in an error)
I'm also keeping a record of the failed attempt.
When using mkdfa.pl
Use the mkdfa.pl script bundled with Julius to create the dictionary files that Julius can use.
pi@raspberrypi:~/julius/dict $ cd ~/julius/julius-4.6/gramtools/mkdfa/
pi@raspberrypi:~/julius/julius-4.6/gramtools/mkdfa $ mkdfa.pl ~/julius/dict/sensor
/home/pi/julius/dict/sensor.grammar has 7 rules
/home/pi/julius/dict/sensor.voca has 8 categories and 8 words
---
executing [/usr/local/bin/mkfa -e1 -fg ./tmp1716-rev.grammar -fv ./tmp1716.voca -fo /home/pi/julius/dict/sensor.dfatmp -fh ./tmp1716.h]
Now parsing grammar file
Now modifying grammar to reduce states[-1]
Now parsing vocabulary file
Now making nondeterministic finite automaton[4/4]
Now making deterministic finite automaton[4/4]
Now making triplet list[4/4]
sh: 1: cygpath: not found
sh: 1: cygpath: not found
usage: dfa_minimize [dfafile] [-o outfile]
executing [/usr/local/bin/mkfa -e1 -fg ./tmp1716.grammar -fv ./tmp1716.voca -fo /home/pi/julius/dict/sensor.dfatmp -fh ./tmp1716.h]
Now parsing grammar file
Now modifying grammar to reduce states[-1]
Now parsing vocabulary file
Now making nondeterministic finite automaton[4/4]
Now making deterministic finite automaton[4/4]
Now making triplet list[4/4]
sh: 1: cygpath: not found
sh: 1: cygpath: not found
usage: dfa_minimize [dfafile] [-o outfile]
---
generated: /home/pi/julius/dict/sensor.dfa /home/pi/julius/dict/sensor.term /home/pi/julius/dict/sensor.dict /home/pi/julius/dict/sensor.dfa.forward
pi@raspberrypi:~/julius/julius-4.6/gramtools/mkdfa $ cd ~/julius/dict
pi@raspberrypi:~/julius/dict $ ls
sensor.dfatmp sensor.dict sensor.grammar sensor.phone sensor.term sensor.voca sensor.yomi
pi@raspberrypi:~/julius/dict $ ls -l
total 28
-rw-r--r-- 1 pi pi  92 Oct 24 16:12 sensor.dfatmp
-rw-r--r-- 1 pi pi 206 Oct 24 16:12 sensor.dict
-rw-r--r-- 1 pi pi 123 Oct 24 16:04 sensor.grammar
-rw-r--r-- 1 pi pi 155 Oct 24 15:58 sensor.phone
-rw-r--r-- 1 pi pi  74 Oct 24 16:12 sensor.term
-rw-r--r-- 1 pi pi 242 Oct 24 16:08 sensor.voca
-rw-r--r-- 1 pi pi 150 Oct 24 15:58 sensor.yomi
When the conversion finishes, the files sensor.dfatmp, sensor.dict, and sensor.term have been created.
According to information elsewhere, a .dfa file should be produced rather than .dfatmp, but it wasn't.
Is this caused by cygpath coming up "not found" in the log above?
Then again, this environment is a Raspberry Pi, so cygpath shouldn't matter...
For now, let's copy sensor.dfatmp to sensor.dfa and try using it.
pi@raspberrypi:~/julius/julius-4.6/gramtools/mkdfa $ cd ~/julius/dict
pi@raspberrypi:~/julius/dict $ cp sensor.dfatmp sensor.dfa
There are a few concerns, but all the files are in place, so let's run it.
pi@raspberrypi:~/julius/dict $ julius -C ~/julius/dictation-kit-4.5/am-gmm.jconf -nostrip -gram ~/julius/dict/sensor -input mic
Execution log
pi@raspberrypi:~/julius/dict $ julius -C ~/julius/dictation-kit-4.5/am-gmm.jconf -nostrip -gram ~/julius/dict/sensor -input mic
STAT: include config: /home/pi/julius/dictation-kit-4.5/am-gmm.jconf
WARNING: m_chkparam: "-lmp" only for N-gram, ignored
WARNING: m_chkparam: "-lmp2" only for N-gram, ignored
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: binhmm-header: variance inversed
Stat: read_binhmm: has inversed variances
Stat: read_binhmm: binary format HMM definition
Stat: read_binhmm: this HMM does not need multipath handling
Stat: init_phmm: defined HMMs: 8443
Stat: init_phmm: loading binary hmmlist
Stat: load_hmmlist_bin: reading hmmlist
Stat: aptree_read: 42857 nodes (21428 branch + 21429 data)
Stat: load_hmmlist_bin: reading pseudo phone set
Stat: aptree_read: 3253 nodes (1626 branch + 1627 data)
Stat: init_phmm: logical names: 21429 in HMMList
Stat: init_phmm: base phones: 43 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: pseudo phones are loaded from binary hmmlist file
Stat: hmm_lookup: 12 pseudo phones are added to logical HMM list
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
STAT: reading [/home/pi/julius/dict/sensor.dfa] and [/home/pi/julius/dict/sensor.dict]...
Stat: init_voca: read 8 words
Error: gzfile: unable to open /home/pi/julius/dict/sensor.dfa.forward
Error: init_dfa: failed to open /home/pi/julius/dict/sensor.dfa.forward
STAT: done
STAT: Gram #0 sensor registered
STAT: Gram #0 sensor: new grammar loaded, now mash it up for recognition
STAT: Gram #0 sensor: extracting category-pair constraint for the 1st pass
STAT: Gram #0 sensor: installed
STAT: Gram #0 sensor: turn on active
STAT: grammar update completed
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 120+0=120
STAT: coordination check passed
STAT: multi-gram: beam width set to 120 (guess) by lexicon change
STAT: wchmm (re)build completed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
STAT: [5] prepare for real-time decoding
STAT: All init successfully done
STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.6 (fast)
Engine specification:
- Base setup : fast
- Supported LM : DFA, N-gram, Word
- Extension : LibSndFile
- Compiled by : gcc -g -O2 -fPIC
Library configuration: version 4.6
- Audio input
primary A/D-in driver : alsa (Advanced Linux Sound Architecture)
available drivers : alsa
wavefile formats : various formats by libsndfile ver.1
max. length of an input : 320000 samples, 150 words
- Language Model
class N-gram support : yes
MBR weight support : yes
word id unit : short (2 bytes)
- Acoustic Model
multi-path treatment : autodetect
- External library
file decompression by : zlib library
- Process hangling
fork on adinnet input : no
- built-in SIMD instruction set for DNN
NONE AVAILABLE, DNN computation may be too slow!
- built-in CUDA support: no
------------------------------------------------------------
Configuration of Modules
Number of defined modules: AM=1, LM=1, SR=1
Acoustic Model (with input parameter spec.):
- AM00 "_default"
hmmfilename=/home/pi/julius/dictation-kit-4.5/model/phone_m/jnas-tri-3k16-gid.binhmm
hmmmapfilename=/home/pi/julius/dictation-kit-4.5/model/phone_m/logicalTri-3k16-gid.bin
Language Model:
- LM00 "_default"
grammar #1:
dfa = /home/pi/julius/dict/sensor.dfa
dict = /home/pi/julius/dict/sensor.dict
Recognizer:
- SR00 "_default" (AM00, LM00)
------------------------------------------------------------
Speech Analysis Module(s)
[MFCC01] for [AM00 _default]
Acoustic analysis condition:
parameter = MFCC_E_D_N_Z (25 dim. from 12 cepstrum + energy, abs energy supressed with CMN)
sample frequency = 16000 Hz
sample period = 625 (1 = 100ns)
window size = 400 samples (25.0 ms)
frame shift = 160 samples (10.0 ms)
pre-emphasis = 0.97
# filterbank = 24
cepst. lifter = 22
raw energy = False
energy normalize = False
delta window = 2 frames (20.0 ms) around
hi freq cut = OFF
lo freq cut = OFF
zero mean frame = OFF
use power = OFF
CVN = OFF
VTLN = OFF
spectral subtraction = off
cep. mean normalization = yes, real-time MAP-CMN, updating initial mean with last 500 input frames
initial mean from file = N/A
beginning data weight = 100.00
cep. var. normalization = no
base setup from = Julius defaults
------------------------------------------------------------
Acoustic Model(s)
[AM00 "_default"]
HMM Info:
8443 models, 3090 states, 3090 mpdfs, 49440 Gaussians are defined
model type = context dependency handling ON
training parameter = MFCC_E_N_D_Z
vector length = 25
number of stream = 1
stream info = [0-24]
cov. matrix type = DIAGC
duration type = NULLD
max mixture size = 16 Gaussians
max length of model = 5 states
logical base phones = 43
model skip trans. = not exist, no multi-path handling
AM Parameters:
Gaussian pruning = none (full computation) (-gprune)
short pause HMM name = "sp" specified, "sp" applied (physical) (-sp)
cross-word CD on pass1 = handle by approx. (use average prob. of same LC)
------------------------------------------------------------
Language Model(s)
[LM00 "_default"] type=grammar
DFA grammar info:
4 nodes, 8 arcs, 8 terminal(category) symbols
category-pair matrix: 56 bytes (896 bytes allocated)
Vocabulary Info:
vocabulary size = 8 words, 40 models
average word len = 5.0 models, 15.0 states
maximum state num = 27 nodes per word
transparent words = not exist
words under class = not exist
Parameters:
found sp category IDs =
------------------------------------------------------------
Recognizer(s)
[SR00 "_default"] AM00 "_default" + LM00 "_default"
Lexicon tree:
total node num = 120
root node num = 8
leaf node num = 8
(-penalty1) IW penalty1 = +0.0
(-penalty2) IW penalty2 = +0.0
(-cmalpha)CM alpha coef = 0.050000
Search parameters:
multi-path handling = no
(-b) trellis beam width = 120 (-1 or not specified - guessed)
(-bs)score pruning thres= disabled
(-n)search candidate num= 1
(-s) search stack size = 500
(-m) search overflow = after 2000 hypothesis poped
2nd pass method = searching sentence, generating N-best
(-b2) pass2 beam width = 30
(-lookuprange)lookup range= 5 (tm-5 <= t <tm+5)
(-sb)2nd scan beamthres = 80.0 (in logscore)
(-n) search till = 1 candidates found
(-output) and output = 1 candidates out of above
IWCD handling:
1st pass: approximation (use average prob. of same LC)
2nd pass: loose (apply when hypo. is popped and scanned)
all possible words will be expanded in 2nd pass
build_wchmm2() used
lcdset limited by word-pair constraint
short pause segmentation = off
fall back on search fail = off, returns search failure
------------------------------------------------------------
Decoding algorithm:
1st pass input processing = real time, on-the-fly
1st pass method = 1-best approx. generating indexed trellis
output word confidence measure based on search-time scores
------------------------------------------------------------
FrontEnd:
Input stream:
input type = waveform
input source = microphone
device API = default
sampling freq. = 16000 Hz
threaded A/D-in = supported, on
zero frames stripping = off
silence cutting = on
level thres = 2000 / 32767
zerocross thres = 60 / sec.
head margin = 300 msec.
tail margin = 400 msec.
chunk size = 1000 samples
FVAD switch value = -1 (disabled)
long-term DC removal = off
level scaling factor = 1.00 (disabled)
reject short input = off
reject long input = off
----------------------- System Information end -----------------------
Notice for feature extraction (01),
*************************************************************
* Cepstral mean normalization for real-time decoding: *
* NOTICE: The first input may not be recognized, since *
* no initial mean is available on startup. *
*************************************************************
------
### read waveform input
Stat: capture audio at 16000Hz
Stat: adin_alsa: latency set to 32 msec (chunk = 512 bytes)
Error: adin_alsa: unable to get pcm info from card control
Warning: adin_alsa: skip output of detailed audio device info
STAT: AD-in thread created
<<< please speak >>>
pass1_best: [/s] こんにちは [s]
pass1_best_wordseq: 7 1 6
pass1_best_phonemeseq: silE | k o N n i ch i h a | silB
pass1_best_score: -3234.054932
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 131
sentence1: [/s] こんにちは [s]
wseq1: 7 1 6
phseq1: silE | k o N n i ch i h a | silB
cmscore1: 1.000 0.998 1.000
score1: -3234.052979
pass1_best: [/s] おはよう [s]
pass1_best_wordseq: 7 0 6
pass1_best_phonemeseq: silE | o h a y o u | silB
pass1_best_score: -2979.170166
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 123
sentence1: [/s] おはよう [s]
wseq1: 7 0 6
phseq1: silE | o h a y o u | silB
cmscore1: 1.000 1.000 1.000
score1: -2979.166748
pass1_best: [/s] おんど [s]
pass1_best_wordseq: 7 3 6
pass1_best_phonemeseq: silE | o N d o | silB
pass1_best_score: -1790.548218
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 78
sentence1: [/s] おんど [s]
wseq1: 7 3 6
phseq1: silE | o N d o | silB
cmscore1: 1.000 1.000 1.000
score1: -1790.547974
pass1_best: [/s] しつど [s]
pass1_best_wordseq: 7 4 6
pass1_best_phonemeseq: silE | sh i ts u d o | silB
pass1_best_score: -2026.530518
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 84
sentence1: [/s] しつど [s]
wseq1: 7 4 6
phseq1: silE | sh i ts u d o | silB
cmscore1: 1.000 1.000 1.000
score1: -2026.531372
<<< please speak >>>^C
Results
I tried saying 「こんにちは」「おはよう」「こんばんわ」「おんど」「しつど」「きあつ」 and 「かきくけこ」; every registered word was recognized correctly.
The unregistered 「かきくけこ」 was recognized as 「しつど」, presumably because grammar-constrained recognition always outputs the best-matching registered word instead of rejecting out-of-vocabulary input. Either way, it looks workable as long as the dictionary registers only the words that will actually be used.
### read waveform input
Stat: capture audio at 16000Hz
Stat: adin_alsa: latency set to 32 msec (chunk = 512 bytes)
Error: adin_alsa: unable to get pcm info from card control
Warning: adin_alsa: skip output of detailed audio device info
STAT: AD-in thread created
pass1_best: [/s] こんにちは [s]
pass1_best_wordseq: 7 1 6
pass1_best_phonemeseq: silE | k o N n i ch i h a | silB
pass1_best_score: -3018.866699
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 122
sentence1: [/s] こんにちは [s]
wseq1: 7 1 6
phseq1: silE | k o N n i ch i h a | silB
cmscore1: 1.000 0.999 1.000
score1: -3018.866455
pass1_best: [/s] おはよう [s]
pass1_best_wordseq: 7 0 6
pass1_best_phonemeseq: silE | o h a y o u | silB
pass1_best_score: -2230.211670
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 94
sentence1: [/s] おはよう [s]
wseq1: 7 0 6
phseq1: silE | o h a y o u | silB
cmscore1: 1.000 1.000 1.000
score1: -2230.211914
pass1_best: [/s] こんばんわ [s]
pass1_best_wordseq: 7 2 6
pass1_best_phonemeseq: silE | k o N b a N w a | silB
pass1_best_score: -2848.747559
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 123
sentence1: [/s] こんばんわ [s]
wseq1: 7 2 6
phseq1: silE | k o N b a N w a | silB
cmscore1: 1.000 1.000 1.000
score1: -2848.744629
pass1_best: [/s] おんど [s]
pass1_best_wordseq: 7 3 6
pass1_best_phonemeseq: silE | o N d o | silB
pass1_best_score: -1869.269531
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 82
sentence1: [/s] おんど [s]
wseq1: 7 3 6
phseq1: silE | o N d o | silB
cmscore1: 1.000 1.000 1.000
score1: -1869.268921
pass1_best: [/s] しつど [s]
pass1_best_wordseq: 7 4 6
pass1_best_phonemeseq: silE | sh i ts u d o | silB
pass1_best_score: -1910.348633
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 78
sentence1: [/s] しつど [s]
wseq1: 7 4 6
phseq1: silE | sh i ts u d o | silB
cmscore1: 1.000 0.797 1.000
score1: -1910.348877
pass1_best: [/s] きあつ [s]
pass1_best_wordseq: 7 5 6
pass1_best_phonemeseq: silE | k i a ts u | silB
pass1_best_score: -2064.148193
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 86
sentence1: [/s] きあつ [s]
wseq1: 7 5 6
phseq1: silE | k i a ts u | silB
cmscore1: 1.000 1.000 1.000
score1: -2064.148926
pass1_best: [/s] しつど [s]
pass1_best_wordseq: 7 4 6
pass1_best_phonemeseq: silE | sh i ts u d o | silB
pass1_best_score: -3752.242188
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 140
sentence1: [/s] しつど [s]
wseq1: 7 4 6
phseq1: silE | sh i ts u d o | silB
cmscore1: 1.000 0.977 1.000
score1: -3752.237061
<<< please speak >>>
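Coming back to the goal stated at the top of the article: once recognition works, the recognized word can drive an action through a simple dispatch table. A sketch, where the sensor-reading functions are hypothetical placeholders standing in for real sensor code (e.g. a BME280 driver):

```python
# Hypothetical sensor readers -- placeholders for real sensor code.
def read_temperature():
    return 25.3

def read_humidity():
    return 48.0

def read_pressure():
    return 1013.2

# Map each registered word to the response it should trigger.
ACTIONS = {
    "おんど":  lambda: f"温度は {read_temperature()} 度です",
    "しつど":  lambda: f"湿度は {read_humidity()} %です",
    "きあつ":  lambda: f"気圧は {read_pressure()} hPa です",
    "おはよう": lambda: "おはようございます",
}

def handle(word: str) -> str:
    """Run the action for a recognized word, if one is registered."""
    action = ACTIONS.get(word)
    return action() if action else "(unknown word)"

print(handle("おんど"))
```

Combined with a parser for the sentence1: lines, this is enough scaffolding to turn the grammar-based recognizer into a small voice interface.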