Run Julius as a server.
When you start Julius with the -module option, it sends recognition results as XML-formatted data to applications connected over a socket.
pi@raspberrypi:~ $ julius -C ~/julius/dictation-kit-4.5/am-gmm.jconf -nostrip -gram ~/julius/dict/sensor -module
Execution log:
pi@raspberrypi:~ $ julius -C ~/julius/dictation-kit-4.5/am-gmm.jconf -nostrip -gram ~/julius/dict/sensor -module
STAT: include config: /home/pi/julius/dictation-kit-4.5/am-gmm.jconf
WARNING: m_chkparam: "-lmp" only for N-gram, ignored
WARNING: m_chkparam: "-lmp2" only for N-gram, ignored
STAT: jconf successfully finalized
STAT: *** loading AM00 _default
Stat: init_phmm: Reading in HMM definition
Stat: binhmm-header: variance inversed
Stat: read_binhmm: has inversed variances
Stat: read_binhmm: binary format HMM definition
Stat: read_binhmm: this HMM does not need multipath handling
Stat: init_phmm: defined HMMs: 8443
Stat: init_phmm: loading binary hmmlist
Stat: load_hmmlist_bin: reading hmmlist
Stat: aptree_read: 42857 nodes (21428 branch + 21429 data)
Stat: load_hmmlist_bin: reading pseudo phone set
Stat: aptree_read: 3253 nodes (1626 branch + 1627 data)
Stat: init_phmm: logical names: 21429 in HMMList
Stat: init_phmm: base phones: 43 used in logical
Stat: init_phmm: finished reading HMM definitions
STAT: pseudo phones are loaded from binary hmmlist file
Stat: hmm_lookup: 12 pseudo phones are added to logical HMM list
STAT: *** AM00 _default loaded
STAT: *** loading LM00 _default
STAT: reading [/home/pi/julius/dict/sensor.dfa] and [/home/pi/julius/dict/sensor.dict]...
Stat: init_voca: read 8 words
STAT: reading additional forward dfa [/home/pi/julius/dict/sensor.dfa.forward]
STAT: done
STAT: Gram #0 sensor registered
STAT: Gram #0 sensor: new grammar loaded, now mash it up for recognition
STAT: Gram #0 sensor: extracting category-pair constraint for the 1st pass
STAT: Gram #0 sensor: installed
STAT: Gram #0 sensor: turn on active
STAT: grammar update completed
STAT: *** LM00 _default loaded
STAT: ------
STAT: All models are ready, go for final fusion
STAT: [1] create MFCC extraction instance(s)
STAT: *** create MFCC calculation modules from AM
STAT: AM 0 _default: create a new module MFCC01
STAT: 1 MFCC modules created
STAT: [2] create recognition processing instance(s) with AM and LM
STAT: composing recognizer instance SR00 _default (AM00 _default, LM00 _default)
STAT: Building HMM lexicon tree
STAT: lexicon size: 120+0=120
STAT: coordination check passed
STAT: multi-gram: beam width set to 120 (guess) by lexicon change
STAT: wchmm (re)build completed
STAT: SR00 _default composed
STAT: [3] initialize for acoustic HMM calculation
Stat: outprob_init: state-level mixture PDFs, use calc_mix()
Stat: addlog: generating addlog table (size = 1953 kB)
Stat: addlog: addlog table generated
STAT: [4] prepare MFCC storage(s)
STAT: [5] prepare for real-time decoding
STAT: All init successfully done
Stat: server-client: socket ready as server
///////////////////////////////
/// Module mode ready
/// waiting client at 10500
///////////////////////////////
/// Stat: server-client: connect from 127.0.0.1
STAT: ###### initialize input device
----------------------- System Information begin ---------------------
JuliusLib rev.4.6 (fast)
Engine specification:
- Base setup : fast
- Supported LM : DFA, N-gram, Word
- Extension : LibSndFile
- Compiled by : gcc -g -O2 -fPIC
Library configuration: version 4.6
- Audio input
primary A/D-in driver : alsa (Advanced Linux Sound Architecture)
available drivers : alsa
wavefile formats : various formats by libsndfile ver.1
max. length of an input : 320000 samples, 150 words
- Language Model
class N-gram support : yes
MBR weight support : yes
word id unit : short (2 bytes)
- Acoustic Model
multi-path treatment : autodetect
- External library
file decompression by : zlib library
- Process hangling
fork on adinnet input : no
- built-in SIMD instruction set for DNN
NONE AVAILABLE, DNN computation may be too slow!
- built-in CUDA support: no
------------------------------------------------------------
Configuration of Modules
Number of defined modules: AM=1, LM=1, SR=1
Acoustic Model (with input parameter spec.):
- AM00 "_default"
hmmfilename=/home/pi/julius/dictation-kit-4.5/model/phone_m/jnas-tri-3k16-gid.binhmm
hmmmapfilename=/home/pi/julius/dictation-kit-4.5/model/phone_m/logicalTri-3k16-gid.bin
Language Model:
- LM00 "_default"
grammar #1:
dfa = /home/pi/julius/dict/sensor.dfa
dict = /home/pi/julius/dict/sensor.dict
Recognizer:
- SR00 "_default" (AM00, LM00)
------------------------------------------------------------
Speech Analysis Module(s)
[MFCC01] for [AM00 _default]
Acoustic analysis condition:
parameter = MFCC_E_D_N_Z (25 dim. from 12 cepstrum + energy, abs energy supressed with CMN)
sample frequency = 16000 Hz
sample period = 625 (1 = 100ns)
window size = 400 samples (25.0 ms)
frame shift = 160 samples (10.0 ms)
pre-emphasis = 0.97
# filterbank = 24
cepst. lifter = 22
raw energy = False
energy normalize = False
delta window = 2 frames (20.0 ms) around
hi freq cut = OFF
lo freq cut = OFF
zero mean frame = OFF
use power = OFF
CVN = OFF
VTLN = OFF
spectral subtraction = off
cep. mean normalization = yes, real-time MAP-CMN, updating initial mean with last 500 input frames
initial mean from file = N/A
beginning data weight = 100.00
cep. var. normalization = no
base setup from = Julius defaults
------------------------------------------------------------
Acoustic Model(s)
[AM00 "_default"]
HMM Info:
8443 models, 3090 states, 3090 mpdfs, 49440 Gaussians are defined
model type = context dependency handling ON
training parameter = MFCC_E_N_D_Z
vector length = 25
number of stream = 1
stream info = [0-24]
cov. matrix type = DIAGC
duration type = NULLD
max mixture size = 16 Gaussians
max length of model = 5 states
logical base phones = 43
model skip trans. = not exist, no multi-path handling
AM Parameters:
Gaussian pruning = none (full computation) (-gprune)
short pause HMM name = "sp" specified, "sp" applied (physical) (-sp)
cross-word CD on pass1 = handle by approx. (use average prob. of same LC)
------------------------------------------------------------
Language Model(s)
[LM00 "_default"] type=grammar
DFA grammar info:
4 nodes, 8 arcs, 8 terminal(category) symbols
category-pair matrix: 56 bytes (896 bytes allocated)
additional forward DFA grammar info:
4 nodes, 8 arcs, 8 terminal(category) symbols
category-pair matrix: 0 bytes (0 bytes allocated)
Vocabulary Info:
vocabulary size = 8 words, 40 models
average word len = 5.0 models, 15.0 states
maximum state num = 27 nodes per word
transparent words = not exist
words under class = not exist
Parameters:
found sp category IDs =
------------------------------------------------------------
Recognizer(s)
[SR00 "_default"] AM00 "_default" + LM00 "_default"
Lexicon tree:
total node num = 120
root node num = 8
leaf node num = 8
(-penalty1) IW penalty1 = +0.0
(-penalty2) IW penalty2 = +0.0
(-cmalpha)CM alpha coef = 0.050000
Search parameters:
multi-path handling = no
(-b) trellis beam width = 120 (-1 or not specified - guessed)
(-bs)score pruning thres= disabled
(-n)search candidate num= 1
(-s) search stack size = 500
(-m) search overflow = after 2000 hypothesis poped
2nd pass method = searching sentence, generating N-best
(-b2) pass2 beam width = 30
(-lookuprange)lookup range= 5 (tm-5 <= t <tm+5)
(-sb)2nd scan beamthres = 80.0 (in logscore)
(-n) search till = 1 candidates found
(-output) and output = 1 candidates out of above
IWCD handling:
1st pass: approximation (use average prob. of same LC)
2nd pass: loose (apply when hypo. is popped and scanned)
all possible words will be expanded in 2nd pass
build_wchmm2() used
lcdset limited by word-pair constraint
short pause segmentation = off
fall back on search fail = off, returns search failure
------------------------------------------------------------
Decoding algorithm:
1st pass input processing = real time, on-the-fly
1st pass method = 1-best approx. generating indexed trellis
output word confidence measure based on search-time scores
------------------------------------------------------------
FrontEnd:
Input stream:
input type = waveform
input source = microphone
device API = default
sampling freq. = 16000 Hz
threaded A/D-in = supported, on
zero frames stripping = off
silence cutting = on
level thres = 2000 / 32767
zerocross thres = 60 / sec.
head margin = 300 msec.
tail margin = 400 msec.
chunk size = 1000 samples
FVAD switch value = -1 (disabled)
long-term DC removal = off
level scaling factor = 1.00 (disabled)
reject short input = off
reject long input = off
----------------------- System Information end -----------------------
Notice for feature extraction (01),
*************************************************************
* Cepstral mean normalization for real-time decoding: *
* NOTICE: The first input may not be recognized, since *
* no initial mean is available on startup. *
*************************************************************
------
### read waveform input
Stat: capture audio at 16000Hz
Stat: adin_alsa: latency set to 32 msec (chunk = 512 bytes)
Error: adin_alsa: unable to get pcm info from card control
Warning: adin_alsa: skip output of detailed audio device info
STAT: AD-in thread created
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 94
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 69
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 78
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 115
STAT: 00 _default: 8 generated, 8 pushed, 4 nodes popped in 111
socket error, connection closed
Since the command is long, I saved it as a shell script as shown below.
pi@raspberrypi:~/julius $ vi run.sh
pi@raspberrypi:~/julius $ chmod +x run.sh
```bash:run.sh
#!/usr/bin/bash
julius -C ~/julius/dictation-kit-4.5/am-gmm.jconf -nostrip -gram ~/julius/dict/sensor -module
```
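The server can now be started with:
pi@raspberrypi:~/julius $ ./run.sh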
Create a client application that receives the data
When Julius is started as a server, it prints the banner below, so the client connects to the machine where it is running on port 10500.
///////////////////////////////
/// Module mode ready
/// waiting client at 10500
///////////////////////////////
This time, Julius and the client script run on the same Raspberry Pi, so the socket is created with the host set to localhost.
After starting Julius, run the following sample script.
When Julius recognizes speech, it sends the result to the connected client as XML data, which the sample script parses with xml.etree.ElementTree.
The recognized word arrives in the WORD attribute of a WHYPO element, and the sample script simply extracts and prints it.
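For reference, a recognition result arrives roughly in the shape sketched below. This is a minimal, self-contained example; the word, scores, and CM values are illustrative rather than taken from an actual run:

```python
import xml.etree.ElementTree as ET

# Illustrative RECOGOUT message in the shape Julius sends in module mode;
# the word and attribute values here are made up for demonstration.
sample = '''<RECOGOUT>
  <SHYPO RANK="1" SCORE="-1234.5">
    <WHYPO WORD="[s]" CLASSID="0" PHONE="silB" CM="1.000"/>
    <WHYPO WORD="温度" CLASSID="1" PHONE="o N d o" CM="0.972"/>
    <WHYPO WORD="[/s]" CLASSID="2" PHONE="silE" CM="1.000"/>
  </SHYPO>
</RECOGOUT>'''

root = ET.fromstring(sample)
for whypo in root.iter('WHYPO'):
    print(whypo.attrib['WORD'], whypo.attrib['CM'])
```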
```python:sensor.py
# -*- coding: utf-8 -*-
import socket
import xml.etree.ElementTree as ET

host = 'localhost'
port = 10500

# connect to the Julius module server
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect((host, port))

while True:
    recv_data = ''
    # accumulate until the "." line that terminates each Julius message
    while recv_data.find('\n.') == -1:
        recv_data += sock.recv(1024).decode()
    # drop the "." terminator before parsing
    recv_data = recv_data.strip('.\n')
    #print(recv_data)
    root = ET.fromstring(recv_data)
    for i in root.iter('WHYPO'):
        word = i.attrib['WORD']
        cm = i.attrib['CM']
        # skip the sentence start/end markers defined in the grammar
        if word != '[s]' and word != '[/s]':
            print('WORD = ' + word + ' : CM = ' + cm)
```
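With Julius running (e.g. via run.sh), start the script in another terminal. Speaking one of the registered words then prints a line like the following (the word and CM value here are illustrative):
pi@raspberrypi:~/julius $ python3 sensor.py
WORD = 温度 : CM = 0.972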
Next time, I plan to use the item received in WORD to pull the corresponding sensor data and have OpenJTalk speak it.