
Tesseract LSTM in Docker

How to Tune Tesseract in a Docker Container

What this article does

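This article packages the environment needed for tuning Tesseract into a single Dockerfile.
The scripts needed for tuning are assumed to be written through Jupyter Notebook.
The container runs as the host's login user: the host UID is passed in at build time and again at run time.
As a result, the src directory mounted as a volume can also be edited freely from the host side.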

Create the Dockerfile

The Dockerfile itself is as follows.

FROM ubuntu:18.04

############# ↓↓↓root user↓↓↓ #############
ARG UID
ENV UID $UID

# Create the tess user with the host UID and give it passwordless sudo
RUN apt update && apt install -y sudo && \
    groupadd -g ${UID} tess && \
    useradd -m -u ${UID} -g ${UID} -s /bin/bash tess && \
    echo "tess ALL=(ALL) NOPASSWORD:ALL" > /etc/sudoers.d/tess

RUN apt-get -qq update && apt-get -qq -y install curl bzip2 libmysqlclient-dev gcc wget git

# Set up the Japanese locale
RUN apt-get update \
    && apt-get install -y locales \
    && locale-gen ja_JP.UTF-8
# Set LANG with ENV so that it also applies to the tess user (a line in root's ~/.bashrc would not)
ENV LANG ja_JP.UTF-8

# Install tesseract
RUN apt-get install -y software-properties-common && \
    apt update && \
    add-apt-repository ppa:alex-p/tesseract-ocr -y && apt update && \
    apt install -y tesseract-ocr && \
    apt install -y fonts-noto-cjk fonts-takao fonts-vlgothic fonts-ipafont    # <= Feel free to add more fonts.


############# ↓↓↓tess user↓↓↓ #############
USER tess

# Python environment
RUN curl -sSL https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -o /tmp/miniconda.sh \
    && bash /tmp/miniconda.sh -bfp /home/tess/miniconda3 \
    && rm -rf /tmp/miniconda.sh

ENV PATH /home/tess/miniconda3/bin:$PATH

RUN mkdir /home/tess/tesstrain && mkdir /home/tess/tesstrain/tessdata

WORKDIR /home/tess/tesstrain

# Clone the training files from git
RUN git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git && \
    git clone --depth 1 https://github.com/tesseract-ocr/langdata.git

# tesseract settings
ENV TESSDATA_PREFIX /home/tess/tesstrain/tessdata

RUN wget https://github.com/tesseract-ocr/tessdata_best/raw/master/jpn.traineddata -O $TESSDATA_PREFIX/jpn_best.traineddata && \
    wget https://github.com/tesseract-ocr/tessdata_best/raw/master/eng.traineddata -O $TESSDATA_PREFIX/eng_best.traineddata && \
    wget https://github.com/tesseract-ocr/tessdata_best/raw/master/jpn_vert.traineddata -P $TESSDATA_PREFIX && \
    git clone https://github.com/tesseract-ocr/tessconfigs.git $TESSDATA_PREFIX/tessconfigs && \
    ln -s $TESSDATA_PREFIX/tessconfigs/configs $TESSDATA_PREFIX/configs

# Extract the LSTM model from jpn_best.traineddata (the usual starting point for LSTM fine-tuning)
RUN combine_tessdata -e $TESSDATA_PREFIX/jpn_best.traineddata ./jpn_best.lstm

# Switch to bash so that conda can be used
SHELL ["/bin/bash", "-c"]

RUN conda init bash && \
    source /home/tess/.bashrc

RUN conda install -y python=3.7 && \
    conda install -y -c anaconda jupyter

CMD jupyter-notebook --ip=0.0.0.0 --allow-root --NotebookApp.token=''

After that, just run the following commands in the directory that contains the Dockerfile.

$ docker build --build-arg UID=$UID -t tesstrain:latest .
$ docker run -d --name tesstrain -v `pwd`/src:/home/tess/tesstrain/src -p 9595:8888 -u=${UID}:${UID} tesstrain
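
The two commands above rely on $UID, which bash and zsh define but some other shells do not. As a rough sketch (the helper names build_command and run_command are hypothetical, not part of Docker), the same commands can be generated with id -u instead. This is a dry run that only prints the commands. It also prints a mkdir for src, on the assumption that a missing bind-mount directory would otherwise be created by the Docker daemon and typically end up owned by root.

```shell
#!/bin/sh
# Hypothetical helpers: print the build and run commands with the caller's UID filled in.
# `id -u` is used instead of $UID so this also works in shells that do not define UID.

build_command() {
    echo "docker build --build-arg UID=$(id -u) -t tesstrain:latest ."
}

run_command() {
    echo "docker run -d --name tesstrain -v $(pwd)/src:/home/tess/tesstrain/src -p 9595:8888 -u=$(id -u):$(id -u) tesstrain"
}

# Dry run: nothing is executed here, the commands are only printed.
# Creating src first keeps the mounted directory owned by the host user.
echo "mkdir -p src"
build_command
run_command
```

Once the printed commands look right, they can be copied into the terminal one by one.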

Check with docker ps that the container was created successfully. If it was, you can open Jupyter in a browser on host port 9595 (the -p 9595:8888 mapping above).
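
The docker ps check can be scripted roughly like this. It is only a sketch: container_listed is a hypothetical helper, and the URL assumes the browser runs on the same machine as Docker. The docker call is skipped when the docker command is not installed.

```shell
#!/bin/sh
# Hypothetical helper: read container names (one per line) on stdin and
# succeed only if the given name is among them.
container_listed() {
    grep -qxF -- "$1"
}

if command -v docker >/dev/null 2>&1; then
    # `docker ps --format '{{.Names}}'` prints only the names of running containers.
    if docker ps --format '{{.Names}}' 2>/dev/null | container_listed tesstrain; then
        echo "tesstrain is running; open http://localhost:9595 in a browser"
    else
        echo "tesstrain is not running; inspect it with: docker logs tesstrain"
    fi
fi
```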

Working outside Jupyter

For anything other than Jupyter, just enter the container as usual with docker exec -it tesstrain bash.
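
Once inside the container, a quick sanity check along these lines may be handy. It is a sketch that assumes the paths from the Dockerfile above, and report_file is a hypothetical helper name. tesseract --list-langs is only called when the tesseract binary is on PATH.

```shell
#!/bin/sh
# Sanity checks meant to be run inside the container.
# The paths mirror the Dockerfile; report_file is a hypothetical helper.

report_file() {
    # Print "ok <path>" if the file exists, "missing <path>" otherwise.
    if [ -f "$1" ]; then
        echo "ok $1"
    else
        echo "missing $1"
    fi
}

prefix="${TESSDATA_PREFIX:-/home/tess/tesstrain/tessdata}"
report_file "${prefix}/jpn_best.traineddata"
report_file "/home/tess/tesstrain/jpn_best.lstm"

# List the languages Tesseract can see, if the binary is available.
if command -v tesseract >/dev/null 2>&1; then
    tesseract --list-langs || echo "tesseract could not list languages"
fi
```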

I may write up the actual tuning procedure in a separate article.
