
Trying the OMRON Environment Sensor (2JCIE-BU) on a Raspberry Pi (7)

Posted at 2021-10-24

This time, I will improve the accuracy of Julius's speech recognition so that it can be used as a user interface.

Creating a custom dictionary

To build a custom dictionary, create the following files.

File extension  Description
yomi            reading file
phone           phoneme file
grammar         grammar file
voca            vocabulary file
pi@raspberrypi:~/julius $ mkdir dict
pi@raspberrypi:~/julius $ cd dict

Creating the reading file

pi@raspberrypi:~/julius/dict $ vi sensor.yomi
sensor.yomi
おはよう おはよう
こんにちは こんにちは
こんばんわ こんばんわ
おんど おんど
しつど しつど
きあつ きあつ
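
Each line of the yomi file has two whitespace-separated columns: the word as it should appear in the recognition output, and its reading in hiragana. In this file the two columns happen to be identical because the words themselves are written in hiragana; a hypothetical entry with a kanji surface form (not used in this article) would look like this:

温度 おんど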

Creating the phoneme file

Generate the phoneme file with yomi2voca.pl, which is bundled with Julius. It converts the hiragana reading in the second column of each line into a Julius phoneme sequence (for example, ん becomes the phoneme N).

pi@raspberrypi:~/julius $ ./julius-4.6/gramtools/yomi2voca/yomi2voca.pl ./dict/sensor.yomi > ./dict/sensor.phone
sensor.phone
おはよう    o h a y o u
こんにちは k o N n i ch i h a
こんばんわ k o N b a N w a
おんど   o N d o
しつど   sh i ts u d o
きあつ   k i a ts u

Creating the grammar file

pi@raspberrypi:~/julius $ cd dict
pi@raspberrypi:~/julius/dict $ vi sensor.grammar
sensor.grammar
S : NS_B SENSOR NS_E
SENSOR : OHAYOU
SENSOR : KONNICHIHA
SENSOR : KONBANWA
SENSOR : ONDO
SENSOR : SHITSUDO
SENSOR : KIATSU

Creating the vocabulary file

pi@raspberrypi:~/julius/dict $ cp sensor.phone sensor.voca
pi@raspberrypi:~/julius/dict $ vi sensor.voca 
sensor.voca
%OHAYOU
おはよう    o h a y o u
%KONNICHIHA
こんにちは k o N n i ch i h a
%KONBANWA
こんばんわ k o N b a N w a
%ONDO
おんど   o N d o
%SHITSUDO
しつど   sh i ts u d o
%KIATSU
きあつ   k i a ts u
% NS_B
[s] silB
% NS_E
[/s] silE
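
A quick note on how these two files fit together: each "% NAME" line in the voca file defines a category, and the grammar file is written in terms of those categories. S is the start symbol, NS_B and NS_E are the silence categories (mapped to the silB and silE models), and SENSOR expands to exactly one of the six registered words, so this grammar accepts a single word surrounded by silence. Purely as an illustration (these lines are not part of this article's files and are untested), a phrase like 「おんどをおしえて」 could be accepted by adding a rule and two categories, then regenerating the dictionary:

sensor.grammar (hypothetical addition)
S : NS_B SENSOR WO OSHIETE NS_E

sensor.voca (hypothetical addition)
%WO
を o
%OSHIETE
おしえて o sh i e t e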

Generating the dictionary files

At first I used mkdfa.pl to create the sensor.dfa and sensor.dict files.
Running mkdfa.pl produced a "cygpath: not found" error, and only an xxx.dfatmp file was generated instead of xxx.dfa, so I renamed xxx.dfatmp to xxx.dfa and used that.

There were web pages describing exactly that workaround, so I didn't worry about it at first, but while experimenting I noticed that silB and silE were swapped.

I tried various fixes without success. Then I noticed a file named mkdfa.py in the same directory as mkdfa.pl; generating the dictionary with it produced sensor.dfa and sensor.dict without any errors.
(sensor.term and sensor.dfa.forward were also created in the same folder.)

pi@raspberrypi:~/julius/julius-4.6/gramtools/mkdfa $ mkdfa.py ~/julius/dict/sensor
/home/pi/julius/dict/sensor.grammar has 7 rules
/home/pi/julius/dict/sensor.voca has 8 categories and 8 words
---
Now parsing grammar file
Now modifying grammar to reduce states[-1]
Now parsing vocabulary file
Now making nondeterministic finite automaton[4/4]
Now making deterministic finite automaton[4/4] 
Now making triplet list[4/4]
---
8 categories, 4 nodes, 8 arcs
-> minimized: 4 nodes, 8 arcs
Now parsing grammar file
Now modifying grammar to reduce states[-1]
Now parsing vocabulary file
Now making nondeterministic finite automaton[4/4]
Now making deterministic finite automaton[4/4] 
Now making triplet list[4/4]
---
8 categories, 4 nodes, 8 arcs
-> minimized: 4 nodes, 8 arcs
---
generated /home/pi/julius/dict/sensor.dfa /home/pi/julius/dict/sensor.term /home/pi/julius/dict/sensor.dict /home/pi/julius/dict/sensor.dfa.forward

Let's try speech recognition with the generated dictionary files. (The numbers in wseq1 below are category IDs; sensor.term holds the mapping from IDs to category names.)

(omitted)
pass1_best: [s] おんど [/s]
pass1_best_wordseq: 6 3 7
pass1_best_phonemeseq: silB | o N d o | silE
pass1_best_score: -2525.792725
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 107
sentence1: [s] おんど [/s]
wseq1: 6 3 7
phseq1: silB | o N d o | silE
cmscore1: 1.000 1.000 1.000
score1: -2525.794189

pass1_best: [s] しつど [/s]
pass1_best_wordseq: 6 4 7
pass1_best_phonemeseq: silB | sh i ts u d o | silE
pass1_best_score: -1794.630737
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 73
sentence1: [s] しつど [/s]
wseq1: 6 4 7
phseq1: silB | sh i ts u d o | silE
cmscore1: 1.000 0.887 1.000
score1: -1794.629761
(omitted)

The silences at the start ([s]/silB) and end ([/s]/silE) of the sentence are now recognized in the correct order.
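
Since the goal is to use this as a user interface, the recognition results eventually have to drive something. As a minimal sketch (my own addition, not from the article, and untested), the "sentence1:" lines that Julius prints can be read from a pipe and dispatched in a shell loop; stdbuf -oL is there on the assumption that Julius's output may be block-buffered when piped, and may be unnecessary:

sensor_ui.sh (sketch)
#!/bin/bash
# Read Julius results line by line and react to the recognized word.
stdbuf -oL julius -C ~/julius/dictation-kit-4.5/am-gmm.jconf -nostrip \
    -gram ~/julius/dict/sensor -input mic 2>/dev/null |
while IFS= read -r line; do
    case "$line" in
        sentence1:*おんど*) echo "-> temperature requested" ;;
        sentence1:*しつど*) echo "-> humidity requested" ;;
        sentence1:*きあつ*) echo "-> pressure requested" ;;
        sentence1:*)        echo "-> greeting: $line" ;;
    esac
done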


Execution log (unabridged)
pi@raspberrypi:~ $ julius -C ~/julius/dictation-kit-4.5/am-gmm.jconf -nostrip -gram ~/julius/dict/sensor
STAT: include config: /home/pi/julius/dictation-kit-4.5/am-gmm.jconf
WARNING: m_chkparam: "-lmp" only for N-gram, ignored
WARNING: m_chkparam: "-lmp2" only for N-gram, ignored
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: binhmm-header: variance inversed
Stat: read_binhmm: has inversed variances
Stat: read_binhmm: binary format HMM definition
Stat: read_binhmm: this HMM does not need multipath handling
Stat: init_phmm: defined HMMs:  8443
Stat: init_phmm: loading binary hmmlist
Stat: load_hmmlist_bin: reading hmmlist
Stat: aptree_read: 42857 nodes (21428 branch + 21429 data)
Stat: load_hmmlist_bin: reading pseudo phone set
Stat: aptree_read: 3253 nodes (1626 branch + 1627 data)
Stat: init_phmm: logical names: 21429 in HMMList
Stat: init_phmm: base phones:    43 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: pseudo phones are loaded from binary hmmlist file
Stat: hmm_lookup: 12 pseudo phones are added to logical HMM list
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
STAT: reading [/home/pi/julius/dict/sensor.dfa] and [/home/pi/julius/dict/sensor.dict]...
Stat: init_voca: read 8 words
STAT: reading additional forward dfa [/home/pi/julius/dict/sensor.dfa.forward]
STAT: done
STAT: Gram #0 sensor registered
STAT: Gram #0 sensor: new grammar loaded, now mash it up for recognition
STAT: Gram #0 sensor: extracting category-pair constraint for the 1st pass
STAT: Gram #0 sensor: installed
STAT: Gram #0 sensor: turn on active
STAT: grammar update completed
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 120+0=120
STAT: coordination check passed
STAT: multi-gram: beam width set to 120 (guess) by lexicon change
STAT: wchmm (re)build completed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
STAT: [5] prepare for real-time decoding
STAT: All init successfully done

STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.6 (fast)

Engine specification:
 -  Base setup   : fast
 -  Supported LM : DFA, N-gram, Word
 -  Extension    : LibSndFile
 -  Compiled by  : gcc -g -O2 -fPIC
Library configuration: version 4.6
 - Audio input
    primary A/D-in driver   : alsa (Advanced Linux Sound Architecture)
    available drivers       : alsa
    wavefile formats        : various formats by libsndfile ver.1
    max. length of an input : 320000 samples, 150 words
 - Language Model
    class N-gram support    : yes
    MBR weight support      : yes
    word id unit            : short (2 bytes)
 - Acoustic Model
    multi-path treatment    : autodetect
 - External library
    file decompression by   : zlib library
 - Process hangling
    fork on adinnet input   : no
 - built-in SIMD instruction set for DNN

    NONE AVAILABLE, DNN computation may be too slow!
 - built-in CUDA support: no


------------------------------------------------------------
Configuration of Modules

 Number of defined modules: AM=1, LM=1, SR=1

 Acoustic Model (with input parameter spec.):
 - AM00 "_default"
    hmmfilename=/home/pi/julius/dictation-kit-4.5/model/phone_m/jnas-tri-3k16-gid.binhmm
    hmmmapfilename=/home/pi/julius/dictation-kit-4.5/model/phone_m/logicalTri-3k16-gid.bin

 Language Model:
 - LM00 "_default"
    grammar #1:
        dfa  = /home/pi/julius/dict/sensor.dfa
        dict = /home/pi/julius/dict/sensor.dict

 Recognizer:
 - SR00 "_default" (AM00, LM00)

------------------------------------------------------------
Speech Analysis Module(s)

[MFCC01]  for [AM00 _default]

 Acoustic analysis condition:
           parameter = MFCC_E_D_N_Z (25 dim. from 12 cepstrum + energy, abs energy supressed with CMN)
    sample frequency = 16000 Hz
       sample period =  625  (1 = 100ns)
         window size =  400 samples (25.0 ms)
         frame shift =  160 samples (10.0 ms)
        pre-emphasis = 0.97
        # filterbank = 24
       cepst. lifter = 22
          raw energy = False
    energy normalize = False
        delta window = 2 frames (20.0 ms) around
         hi freq cut = OFF
         lo freq cut = OFF
     zero mean frame = OFF
           use power = OFF
                 CVN = OFF
                VTLN = OFF

    spectral subtraction = off

 cep. mean normalization = yes, real-time MAP-CMN, updating initial mean with last 500 input frames
  initial mean from file = N/A
   beginning data weight = 100.00
 cep. var. normalization = no

     base setup from = Julius defaults

------------------------------------------------------------
Acoustic Model(s)

[AM00 "_default"]

 HMM Info:
    8443 models, 3090 states, 3090 mpdfs, 49440 Gaussians are defined
          model type = context dependency handling ON
      training parameter = MFCC_E_N_D_Z
       vector length = 25
    number of stream = 1
         stream info = [0-24]
    cov. matrix type = DIAGC
       duration type = NULLD
    max mixture size = 16 Gaussians
     max length of model = 5 states
     logical base phones = 43
       model skip trans. = not exist, no multi-path handling

 AM Parameters:
        Gaussian pruning = none (full computation)  (-gprune)
    short pause HMM name = "sp" specified, "sp" applied (physical)  (-sp)
  cross-word CD on pass1 = handle by approx. (use average prob. of same LC)

------------------------------------------------------------
Language Model(s)

[LM00 "_default"] type=grammar

 DFA grammar info:
      4 nodes, 8 arcs, 8 terminal(category) symbols
      category-pair matrix: 56 bytes (896 bytes allocated)

 additional forward DFA grammar info:
      4 nodes, 8 arcs, 8 terminal(category) symbols
      category-pair matrix: 0 bytes (0 bytes allocated)

 Vocabulary Info:
        vocabulary size  = 8 words, 40 models
        average word len = 5.0 models, 15.0 states
       maximum state num = 27 nodes per word
       transparent words = not exist
       words under class = not exist

 Parameters:
   found sp category IDs =

------------------------------------------------------------
Recognizer(s)

[SR00 "_default"]  AM00 "_default"  +  LM00 "_default"

 Lexicon tree:
     total node num =    120
      root node num =      8
      leaf node num =      8

    (-penalty1) IW penalty1 = +0.0
    (-penalty2) IW penalty2 = +0.0
    (-cmalpha)CM alpha coef = 0.050000

 Search parameters: 
        multi-path handling = no
    (-b) trellis beam width = 120 (-1 or not specified - guessed)
    (-bs)score pruning thres= disabled
    (-n)search candidate num= 1
    (-s)  search stack size = 500
    (-m)    search overflow = after 2000 hypothesis poped
            2nd pass method = searching sentence, generating N-best
    (-b2)  pass2 beam width = 30
    (-lookuprange)lookup range= 5  (tm-5 <= t <tm+5)
    (-sb)2nd scan beamthres = 80.0 (in logscore)
    (-n)        search till = 1 candidates found
    (-output)    and output = 1 candidates out of above
     IWCD handling:
       1st pass: approximation (use average prob. of same LC)
       2nd pass: loose (apply when hypo. is popped and scanned)
     all possible words will be expanded in 2nd pass
     build_wchmm2() used
     lcdset limited by word-pair constraint
    short pause segmentation = off
    fall back on search fail = off, returns search failure

------------------------------------------------------------
Decoding algorithm:

    1st pass input processing = real time, on-the-fly
    1st pass method = 1-best approx. generating indexed trellis
    output word confidence measure based on search-time scores

------------------------------------------------------------
FrontEnd:

 Input stream:
                 input type = waveform
               input source = microphone
        device API          = default
              sampling freq. = 16000 Hz
             threaded A/D-in = supported, on
       zero frames stripping = off
             silence cutting = on
                 level thres = 2000 / 32767
             zerocross thres = 60 / sec.
                 head margin = 300 msec.
                 tail margin = 400 msec.
                  chunk size = 1000 samples
           FVAD switch value = -1 (disabled)
        long-term DC removal = off
        level scaling factor = 1.00 (disabled)
          reject short input = off
          reject  long input = off

----------------------- System Information end -----------------------

Notice for feature extraction (01),
    *************************************************************
    * Cepstral mean normalization for real-time decoding:       *
    * NOTICE: The first input may not be recognized, since      *
    *         no initial mean is available on startup.          *
    *************************************************************

------
### read waveform input
Stat: capture audio at 16000Hz
Stat: adin_alsa: latency set to 32 msec (chunk = 512 bytes)
Error: adin_alsa: unable to get pcm info from card control
Warning: adin_alsa: skip output of detailed audio device info
STAT: AD-in thread created
pass1_best: [s] おんど [/s]
pass1_best_wordseq: 6 3 7
pass1_best_phonemeseq: silB | o N d o | silE
pass1_best_score: -2385.513428
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 106
sentence1: [s] おんど [/s]
wseq1: 6 3 7
phseq1: silB | o N d o | silE
cmscore1: 1.000 1.000 1.000
score1: -2385.515625

pass1_best: [s] しつど [/s]
pass1_best_wordseq: 6 4 7
pass1_best_phonemeseq: silB | sh i ts u d o | silE
pass1_best_score: -2274.499512
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 94
sentence1: [s] しつど [/s]
wseq1: 6 4 7
phseq1: silB | sh i ts u d o | silE
cmscore1: 1.000 1.000 1.000
score1: -2274.498291

pass1_best: [s] おんど [/s]
pass1_best_wordseq: 6 3 7
pass1_best_phonemeseq: silB | o N d o | silE
pass1_best_score: -1950.462280
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 82
sentence1: [s] おんど [/s]
wseq1: 6 3 7
phseq1: silB | o N d o | silE
cmscore1: 1.000 1.000 1.000
score1: -1950.461060

pass1_best: [s] こんにちは [/s]
pass1_best_wordseq: 6 1 7
pass1_best_phonemeseq: silB | k o N n i ch i h a | silE
pass1_best_score: -2781.791992
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 113
sentence1: [s] こんにちは [/s]
wseq1: 6 1 7
phseq1: silB | k o N n i ch i h a | silE
cmscore1: 1.000 1.000 1.000
score1: -2781.795654

pass1_best: [s] こんばんわ [/s]
pass1_best_wordseq: 6 2 7
pass1_best_phonemeseq: silB | k o N b a N w a | silE
pass1_best_score: -2732.918213
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 115
sentence1: [s] こんばんわ [/s]
wseq1: 6 2 7
phseq1: silB | k o N b a N w a | silE
cmscore1: 1.000 1.000 1.000
score1: -2732.920410

<<< please speak >>>


Creating the dictionary files with mkdfa.pl (fails with an error)

I'm keeping the notes from the failed attempt as well.


Using mkdfa.pl

Use mkdfa.pl, which is bundled with Julius, to create the dictionary files that Julius can load.

pi@raspberrypi:~/julius/dict $ cd ~/julius/julius-4.6/gramtools/mkdfa/
pi@raspberrypi:~/julius/julius-4.6/gramtools/mkdfa $ mkdfa.pl ~/julius/dict/sensor
/home/pi/julius/dict/sensor.grammar has 7 rules
/home/pi/julius/dict/sensor.voca    has 8 categories and 8 words
---
executing [/usr/local/bin/mkfa -e1 -fg ./tmp1716-rev.grammar -fv ./tmp1716.voca -fo /home/pi/julius/dict/sensor.dfatmp -fh ./tmp1716.h]
Now parsing grammar file
Now modifying grammar to reduce states[-1]
Now parsing vocabulary file
Now making nondeterministic finite automaton[4/4]
Now making deterministic finite automaton[4/4] 
Now making triplet list[4/4]
sh: 1: cygpath: not found
sh: 1: cygpath: not found
usage: dfa_minimize [dfafile] [-o outfile]
executing [/usr/local/bin/mkfa -e1 -fg ./tmp1716.grammar -fv ./tmp1716.voca -fo /home/pi/julius/dict/sensor.dfatmp -fh ./tmp1716.h]
Now parsing grammar file
Now modifying grammar to reduce states[-1]
Now parsing vocabulary file
Now making nondeterministic finite automaton[4/4]
Now making deterministic finite automaton[4/4] 
Now making triplet list[4/4]
sh: 1: cygpath: not found
sh: 1: cygpath: not found
usage: dfa_minimize [dfafile] [-o outfile]
---
generated: /home/pi/julius/dict/sensor.dfa /home/pi/julius/dict/sensor.term /home/pi/julius/dict/sensor.dict /home/pi/julius/dict/sensor.dfa.forward
pi@raspberrypi:~/julius/julius-4.6/gramtools/mkdfa $ cd ~/julius/dict
pi@raspberrypi:~/julius/dict $ ls
sensor.dfatmp  sensor.dict  sensor.grammar  sensor.phone  sensor.term  sensor.voca  sensor.yomi
pi@raspberrypi:~/julius/dict $ ls -l
total 28
-rw-r--r-- 1 pi pi  92 Oct 24 16:12 sensor.dfatmp
-rw-r--r-- 1 pi pi 206 Oct 24 16:12 sensor.dict
-rw-r--r-- 1 pi pi 123 Oct 24 16:04 sensor.grammar
-rw-r--r-- 1 pi pi 155 Oct 24 15:58 sensor.phone
-rw-r--r-- 1 pi pi  74 Oct 24 16:12 sensor.term
-rw-r--r-- 1 pi pi 242 Oct 24 16:08 sensor.voca
-rw-r--r-- 1 pi pi 150 Oct 24 15:58 sensor.yomi

After the conversion, the files sensor.dfatmp, sensor.dict, and sensor.term have been created.
Information found here and there says a .dfa file should be produced instead of .dfatmp, but it is not.
Is this due to the "cygpath: not found" messages in the log above?
This environment is a Raspberry Pi, though, so cygpath should be irrelevant...
Judging from the log, what actually fails seems to be dfa_minimize (only its usage message is printed), the step that should turn the intermediate .dfatmp into the final .dfa. Note also that mkfa runs twice, first on the reversed grammar (tmp1716-rev.grammar) and then on the forward one, both writing to sensor.dfatmp, so the leftover .dfatmp is presumably the forward automaton rather than the reversed one Julius expects as the .dfa.

For the time being, I copied sensor.dfatmp to sensor.dfa and tried using it.

pi@raspberrypi:~/julius/julius-4.6/gramtools/mkdfa $ cd ~/julius/dict
pi@raspberrypi:~/julius/dict $ cp sensor.dfatmp sensor.dfa

A few things are concerning, but the files are all in place, so let's run it.

pi@raspberrypi:~/julius/dict $ julius -C ~/julius/dictation-kit-4.5/am-gmm.jconf -nostrip -gram ~/julius/dict/sensor -input mic

Execution log
pi@raspberrypi:~/julius/dict $ julius -C ~/julius/dictation-kit-4.5/am-gmm.jconf -nostrip -gram ~/julius/dict/sensor -input mic
STAT: include config: /home/pi/julius/dictation-kit-4.5/am-gmm.jconf
WARNING: m_chkparam: "-lmp" only for N-gram, ignored
WARNING: m_chkparam: "-lmp2" only for N-gram, ignored
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: binhmm-header: variance inversed
Stat: read_binhmm: has inversed variances
Stat: read_binhmm: binary format HMM definition
Stat: read_binhmm: this HMM does not need multipath handling
Stat: init_phmm: defined HMMs:  8443
Stat: init_phmm: loading binary hmmlist
Stat: load_hmmlist_bin: reading hmmlist
Stat: aptree_read: 42857 nodes (21428 branch + 21429 data)
Stat: load_hmmlist_bin: reading pseudo phone set
Stat: aptree_read: 3253 nodes (1626 branch + 1627 data)
Stat: init_phmm: logical names: 21429 in HMMList
Stat: init_phmm: base phones:    43 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: pseudo phones are loaded from binary hmmlist file
Stat: hmm_lookup: 12 pseudo phones are added to logical HMM list
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
STAT: reading [/home/pi/julius/dict/sensor.dfa] and [/home/pi/julius/dict/sensor.dict]...
Stat: init_voca: read 8 words
Error: gzfile: unable to open /home/pi/julius/dict/sensor.dfa.forward
Error: init_dfa: failed to open /home/pi/julius/dict/sensor.dfa.forward
STAT: done
STAT: Gram #0 sensor registered
STAT: Gram #0 sensor: new grammar loaded, now mash it up for recognition
STAT: Gram #0 sensor: extracting category-pair constraint for the 1st pass
STAT: Gram #0 sensor: installed
STAT: Gram #0 sensor: turn on active
STAT: grammar update completed
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 120+0=120
STAT: coordination check passed
STAT: multi-gram: beam width set to 120 (guess) by lexicon change
STAT: wchmm (re)build completed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
STAT: [5] prepare for real-time decoding
STAT: All init successfully done

STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.6 (fast)

Engine specification:
 -  Base setup   : fast
 -  Supported LM : DFA, N-gram, Word
 -  Extension    : LibSndFile
 -  Compiled by  : gcc -g -O2 -fPIC
Library configuration: version 4.6
 - Audio input
    primary A/D-in driver   : alsa (Advanced Linux Sound Architecture)
    available drivers       : alsa
    wavefile formats        : various formats by libsndfile ver.1
    max. length of an input : 320000 samples, 150 words
 - Language Model
    class N-gram support    : yes
    MBR weight support      : yes
    word id unit            : short (2 bytes)
 - Acoustic Model
    multi-path treatment    : autodetect
 - External library
    file decompression by   : zlib library
 - Process hangling
    fork on adinnet input   : no
 - built-in SIMD instruction set for DNN

    NONE AVAILABLE, DNN computation may be too slow!
 - built-in CUDA support: no


------------------------------------------------------------
Configuration of Modules

 Number of defined modules: AM=1, LM=1, SR=1

 Acoustic Model (with input parameter spec.):
 - AM00 "_default"
    hmmfilename=/home/pi/julius/dictation-kit-4.5/model/phone_m/jnas-tri-3k16-gid.binhmm
    hmmmapfilename=/home/pi/julius/dictation-kit-4.5/model/phone_m/logicalTri-3k16-gid.bin

 Language Model:
 - LM00 "_default"
    grammar #1:
        dfa  = /home/pi/julius/dict/sensor.dfa
        dict = /home/pi/julius/dict/sensor.dict

 Recognizer:
 - SR00 "_default" (AM00, LM00)

------------------------------------------------------------
Speech Analysis Module(s)

[MFCC01]  for [AM00 _default]

 Acoustic analysis condition:
           parameter = MFCC_E_D_N_Z (25 dim. from 12 cepstrum + energy, abs energy supressed with CMN)
    sample frequency = 16000 Hz
       sample period =  625  (1 = 100ns)
         window size =  400 samples (25.0 ms)
         frame shift =  160 samples (10.0 ms)
        pre-emphasis = 0.97
        # filterbank = 24
       cepst. lifter = 22
          raw energy = False
    energy normalize = False
        delta window = 2 frames (20.0 ms) around
         hi freq cut = OFF
         lo freq cut = OFF
     zero mean frame = OFF
           use power = OFF
                 CVN = OFF
                VTLN = OFF

    spectral subtraction = off

 cep. mean normalization = yes, real-time MAP-CMN, updating initial mean with last 500 input frames
  initial mean from file = N/A
   beginning data weight = 100.00
 cep. var. normalization = no

     base setup from = Julius defaults

------------------------------------------------------------
Acoustic Model(s)

[AM00 "_default"]

 HMM Info:
    8443 models, 3090 states, 3090 mpdfs, 49440 Gaussians are defined
          model type = context dependency handling ON
      training parameter = MFCC_E_N_D_Z
       vector length = 25
    number of stream = 1
         stream info = [0-24]
    cov. matrix type = DIAGC
       duration type = NULLD
    max mixture size = 16 Gaussians
     max length of model = 5 states
     logical base phones = 43
       model skip trans. = not exist, no multi-path handling

 AM Parameters:
        Gaussian pruning = none (full computation)  (-gprune)
    short pause HMM name = "sp" specified, "sp" applied (physical)  (-sp)
  cross-word CD on pass1 = handle by approx. (use average prob. of same LC)

------------------------------------------------------------
Language Model(s)

[LM00 "_default"] type=grammar

 DFA grammar info:
      4 nodes, 8 arcs, 8 terminal(category) symbols
      category-pair matrix: 56 bytes (896 bytes allocated)

 Vocabulary Info:
        vocabulary size  = 8 words, 40 models
        average word len = 5.0 models, 15.0 states
       maximum state num = 27 nodes per word
       transparent words = not exist
       words under class = not exist

 Parameters:
   found sp category IDs =

------------------------------------------------------------
Recognizer(s)

[SR00 "_default"]  AM00 "_default"  +  LM00 "_default"

 Lexicon tree:
     total node num =    120
      root node num =      8
      leaf node num =      8

    (-penalty1) IW penalty1 = +0.0
    (-penalty2) IW penalty2 = +0.0
    (-cmalpha)CM alpha coef = 0.050000

 Search parameters: 
        multi-path handling = no
    (-b) trellis beam width = 120 (-1 or not specified - guessed)
    (-bs)score pruning thres= disabled
    (-n)search candidate num= 1
    (-s)  search stack size = 500
    (-m)    search overflow = after 2000 hypothesis poped
            2nd pass method = searching sentence, generating N-best
    (-b2)  pass2 beam width = 30
    (-lookuprange)lookup range= 5  (tm-5 <= t <tm+5)
    (-sb)2nd scan beamthres = 80.0 (in logscore)
    (-n)        search till = 1 candidates found
    (-output)    and output = 1 candidates out of above
     IWCD handling:
       1st pass: approximation (use average prob. of same LC)
       2nd pass: loose (apply when hypo. is popped and scanned)
     all possible words will be expanded in 2nd pass
     build_wchmm2() used
     lcdset limited by word-pair constraint
    short pause segmentation = off
    fall back on search fail = off, returns search failure

------------------------------------------------------------
Decoding algorithm:

    1st pass input processing = real time, on-the-fly
    1st pass method = 1-best approx. generating indexed trellis
    output word confidence measure based on search-time scores

------------------------------------------------------------
FrontEnd:

 Input stream:
                 input type = waveform
               input source = microphone
        device API          = default
              sampling freq. = 16000 Hz
             threaded A/D-in = supported, on
       zero frames stripping = off
             silence cutting = on
                 level thres = 2000 / 32767
             zerocross thres = 60 / sec.
                 head margin = 300 msec.
                 tail margin = 400 msec.
                  chunk size = 1000 samples
           FVAD switch value = -1 (disabled)
        long-term DC removal = off
        level scaling factor = 1.00 (disabled)
          reject short input = off
          reject  long input = off

----------------------- System Information end -----------------------

Notice for feature extraction (01),
    *************************************************************
    * Cepstral mean normalization for real-time decoding:       *
    * NOTICE: The first input may not be recognized, since      *
    *         no initial mean is available on startup.          *
    *************************************************************

------
### read waveform input
Stat: capture audio at 16000Hz
Stat: adin_alsa: latency set to 32 msec (chunk = 512 bytes)
Error: adin_alsa: unable to get pcm info from card control
Warning: adin_alsa: skip output of detailed audio device info
STAT: AD-in thread created
<<< please speak >>>
pass1_best: [/s] こんにちは [s]
pass1_best_wordseq: 7 1 6
pass1_best_phonemeseq: silE | k o N n i ch i h a | silB
pass1_best_score: -3234.054932
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 131
sentence1: [/s] こんにちは [s]
wseq1: 7 1 6
phseq1: silE | k o N n i ch i h a | silB
cmscore1: 1.000 0.998 1.000
score1: -3234.052979

pass1_best: [/s] おはよう [s]
pass1_best_wordseq: 7 0 6
pass1_best_phonemeseq: silE | o h a y o u | silB
pass1_best_score: -2979.170166
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 123
sentence1: [/s] おはよう [s]
wseq1: 7 0 6
phseq1: silE | o h a y o u | silB
cmscore1: 1.000 1.000 1.000
score1: -2979.166748

pass1_best: [/s] おんど [s]
pass1_best_wordseq: 7 3 6
pass1_best_phonemeseq: silE | o N d o | silB
pass1_best_score: -1790.548218
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 78
sentence1: [/s] おんど [s]
wseq1: 7 3 6
phseq1: silE | o N d o | silB
cmscore1: 1.000 1.000 1.000
score1: -1790.547974

pass1_best: [/s] しつど [s]
pass1_best_wordseq: 7 4 6
pass1_best_phonemeseq: silE | sh i ts u d o | silB
pass1_best_score: -2026.530518
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 84
sentence1: [/s] しつど [s]
wseq1: 7 4 6
phseq1: silE | sh i ts u d o | silB
cmscore1: 1.000 1.000 1.000
score1: -2026.531372

<<< please speak >>>^C

Results

I tried saying 「こんにちは」, 「おはよう」, 「こんばんわ」, 「おんど」, 「しつど」, 「きあつ」, and 「かきくけこ」; all of the registered words were recognized correctly.
The unregistered 「かきくけこ」 came out as 「しつど」. This is presumably because a grammar-based language model always forces the input onto the best-matching sentence the grammar allows, so out-of-vocabulary input gets mapped to the acoustically closest registered word. In any case, it looks workable as long as the dictionary registers only the words I actually intend to use.

### read waveform input
Stat: capture audio at 16000Hz
Stat: adin_alsa: latency set to 32 msec (chunk = 512 bytes)
Error: adin_alsa: unable to get pcm info from card control
Warning: adin_alsa: skip output of detailed audio device info
STAT: AD-in thread created
pass1_best: [/s] こんにちは [s]
pass1_best_wordseq: 7 1 6
pass1_best_phonemeseq: silE | k o N n i ch i h a | silB
pass1_best_score: -3018.866699
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 122
sentence1: [/s] こんにちは [s]
wseq1: 7 1 6
phseq1: silE | k o N n i ch i h a | silB
cmscore1: 1.000 0.999 1.000
score1: -3018.866455

pass1_best: [/s] おはよう [s]
pass1_best_wordseq: 7 0 6
pass1_best_phonemeseq: silE | o h a y o u | silB
pass1_best_score: -2230.211670
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 94
sentence1: [/s] おはよう [s]
wseq1: 7 0 6
phseq1: silE | o h a y o u | silB
cmscore1: 1.000 1.000 1.000
score1: -2230.211914

pass1_best: [/s] こんばんわ [s]
pass1_best_wordseq: 7 2 6
pass1_best_phonemeseq: silE | k o N b a N w a | silB
pass1_best_score: -2848.747559
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 123
sentence1: [/s] こんばんわ [s]
wseq1: 7 2 6
phseq1: silE | k o N b a N w a | silB
cmscore1: 1.000 1.000 1.000
score1: -2848.744629

pass1_best: [/s] おんど [s]
pass1_best_wordseq: 7 3 6
pass1_best_phonemeseq: silE | o N d o | silB
pass1_best_score: -1869.269531
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 82
sentence1: [/s] おんど [s]
wseq1: 7 3 6
phseq1: silE | o N d o | silB
cmscore1: 1.000 1.000 1.000
score1: -1869.268921

pass1_best: [/s] しつど [s]
pass1_best_wordseq: 7 4 6
pass1_best_phonemeseq: silE | sh i ts u d o | silB
pass1_best_score: -1910.348633
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 78
sentence1: [/s] しつど [s]
wseq1: 7 4 6
phseq1: silE | sh i ts u d o | silB
cmscore1: 1.000 0.797 1.000
score1: -1910.348877

pass1_best: [/s] きあつ [s]
pass1_best_wordseq: 7 5 6
pass1_best_phonemeseq: silE | k i a ts u | silB
pass1_best_score: -2064.148193
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 86
sentence1: [/s] きあつ [s]
wseq1: 7 5 6
phseq1: silE | k i a ts u | silB
cmscore1: 1.000 1.000 1.000
score1: -2064.148926

pass1_best: [/s] しつど [s]
pass1_best_wordseq: 7 4 6
pass1_best_phonemeseq: silE | sh i ts u d o | silB
pass1_best_score: -3752.242188
### Recognition: 2nd pass (RL heuristic best-first)
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 140
sentence1: [/s] しつど [s]
wseq1: 7 4 6
phseq1: silE | sh i ts u d o | silB
cmscore1: 1.000 0.977 1.000
score1: -3752.237061

<<< please speak >>>
