
Docker stopped starting after yum update (DeepLearning AMI)

Last updated at 2021-01-03. Posted at 2020-07-02.

tl;dr

Commenting out ExecStartPre=/usr/libexec/docker/docker-setup-runtimes.sh in docker.service, deleting /run/docker/runtimes.env, and restarting the service (after a systemctl daemon-reload) fixed it.

Background

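I wanted to run Docker with a GPU, so about half a year ago I built a server using the DeepLearning AMI (amzn2-based) as the base image.
Recently I launched an EC2 instance from an AMI created from that server, and nvidia-smi did not work.
After running yum update, nvidia-smi worked again, but docker.service died.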

nvidia-smi does not work

Running nvidia-smi produced the following error:

$ nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. 
Make sure that the latest NVIDIA driver is installed and running.

I guessed the NVIDIA driver was too old, so I updated every package:

# Ideally, update only the packages you actually need
sudo yum update -y
sudo reboot

$ nvidia-smi -l
Thu Jul  xxxxxxx
+-----------------------------------------------------------------------------+
| NVIDIA-SMI xxx.xx.xx    Driver Version: xxx.xx.xx    CUDA Version: xx.x     |
|-------------------------------+----------------------+----------------------+
(rest omitted)

No problems.
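
The comment in the commands above says that only the required packages should really be updated. A minimal sketch of how the pending updates could be narrowed down first is shown here. The function name `list_matching_updates` and the pattern `nvidia` are assumptions for illustration; which packages are actually required is not stated in this article.

```shell
#!/bin/sh
# Sketch: read `yum check-update` style output on stdin and print the names of
# pending packages whose name contains the given pattern (case-insensitive).
# list_matching_updates is a hypothetical helper, not part of yum.
list_matching_updates() {
    pattern=$1
    awk -v pat="$pattern" '
        # Package rows have three columns: name.arch, version, repository.
        NF == 3 && index(tolower($1), tolower(pat)) > 0 { print $1 }
    '
}

# Example (the pattern "nvidia" is an assumption):
#   yum -q check-update | list_matching_updates nvidia
#   sudo yum update -y <packages printed above>
```

Reviewing the list before updating keeps the change small, which makes it easier to tell which package update broke something.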

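Docker does not work

After that, running a Docker command failed.
I wondered whether I had forgotten to start the Docker service during the initial setup, so I tried starting it, and that failed too.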

$ docker ps
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.

$ sudo systemctl start docker
Job for docker.service failed because start of the service was attempted too often. See "systemctl status docker.service" and "journalctl -xe" for details.
To force a start use "systemctl reset-failed docker.service" followed by "systemctl start docker.service" again.

Why?! As a first step, I checked journalctl:

$ journalctl -xeu 'docker'
xxx xx xx:xx:xx xxxxxxx dockerd[xxxx]: unable to configure the Docker daemon with file /etc/docker/daemon.json:
the following directives are specified both as a flag and in the configuration file: runtimes:
(from flag: [neuron], from file: map[nvidia:map[path:nvidia-container-runtime runtimeArgs:[]]])

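(Wrapped here for readability; the actual message is one very long line.)

It seems runtimes are being specified both by a flag and in /etc/docker/daemon.json, and the two do not match.
The flag apparently specifies something called neuron. Did I ever set that...?
daemon.json specifies the following (nvidia-container-runtime, for GPU use; this is the DeepLearning AMI default):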

$ cat /etc/docker/daemon.json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

So what is this flag? I never set anything like that. While searching for the error, I found the following:

The conflict between the system docker startup file and the docker daemon.json file.
It is strongly recommended that the installed ExecStart=/usr/bin/dockerd -H fd://, -H fd:// be moved to the daemon.json file.
(https://github.com/docker/for-linux/issues/165)

So the ExecStart in docker.service seems to be at fault.
However, systemctl status shows an ExecStart command that is slightly different.

$ sudo systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu xxxx-xx-xx xx:xx:xx xxx; xxs ago
     Docs: https://docs.docker.com
  Process: yyyy ExecStart=/usr/bin/dockerd $OPTIONS $DOCKER_STORAGE_OPTIONS $DOCKER_ADD_RUNTIMES (code=exited, status=1/FAILURE)
  Process: yyyy ExecStartPre=/usr/libexec/docker/docker-setup-runtimes.sh (code=exited, status=0/SUCCESS)
  Process: yyyy ExecStartPre=/bin/mkdir -p /run/docker (code=exited, status=0/SUCCESS)
 Main PID: yyyy (code=exited, status=1/FAILURE)

xxx xx xx:xx:xx yyyyyyy systemd[1]: Failed to start Docker Application Container Engine.
xxx xx xx:xx:xx yyyyyyy systemd[1]: Unit docker.service entered failed state.
xxx xx xx:xx:xx yyyyyyy systemd[1]: docker.service failed.
xxx xx xx:xx:xx yyyyyyy systemd[1]: docker.service holdoff time over, scheduling restart.
xxx xx xx:xx:xx yyyyyyy systemd[1]: start request repeated too quickly for docker.service
xxx xx xx:xx:xx yyyyyyy systemd[1]: Failed to start Docker Application Container Engine.
xxx xx xx:xx:xx yyyyyyy systemd[1]: Unit docker.service entered failed state.
xxx xx xx:xx:xx yyyyyyy systemd[1]: docker.service failed.

I had no idea what OPTIONS, DOCKER_STORAGE_OPTIONS, or DOCKER_ADD_RUNTIMES were,
but ExecStartPre looked suspicious, so I took a look at docker-setup-runtimes.sh:

$ cat /usr/libexec/docker/docker-setup-runtimes.sh
#!/bin/sh
{
    echo -n "DOCKER_ADD_RUNTIMES=\""
    for file in /etc/docker-runtimes.d/*; do
        [ -f "$file" ] && [ -x "$file" ] && echo -n "--add-runtime $(basename "$file")=$file "
    done
    echo "\""
} > /run/docker/runtimes.env

It loops over the files in /etc/docker-runtimes.d/ and writes an --add-runtime flag for each executable file into runtimes.env:
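
To preview what that loop generates without touching /run/docker/runtimes.env, the same logic can be wrapped in a function that takes the directory as a parameter and prints to stdout. This is only a sketch; `build_add_runtimes` is a hypothetical name and not part of the Docker package.

```shell
#!/bin/sh
# Sketch: same loop as docker-setup-runtimes.sh, but the directory is a
# parameter and the result goes to stdout instead of /run/docker/runtimes.env.
# build_add_runtimes is a hypothetical helper name.
build_add_runtimes() {
    dir=$1
    printf 'DOCKER_ADD_RUNTIMES="'
    for file in "$dir"/*; do
        # Only regular, executable files are registered as runtimes.
        if [ -f "$file" ] && [ -x "$file" ]; then
            printf '%s=%s ' "--add-runtime $(basename "$file")" "$file"
        fi
    done
    printf '"\n'
}

# Example: preview what the unit would receive on this host.
#   build_add_runtimes /etc/docker-runtimes.d
```

An empty directory yields `DOCKER_ADD_RUNTIMES=""`, so no flag reaches dockerd in that case.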

$ ls /etc/docker-runtimes.d/
neuron

$ cat /run/docker/runtimes.env
DOCKER_ADD_RUNTIMES="--add-runtime neuron=/etc/docker-runtimes.d/neuron "
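
With both files in view, the clash that dockerd complained about can be checked before starting the service. The sketch below only does a text match with grep (no real JSON parsing), and `check_runtime_conflict` is a hypothetical helper name. It relies on the behavior seen in the journalctl output above: runtimes given by flag and runtimes given in daemon.json are rejected together even when the names differ.

```shell
#!/bin/sh
# Sketch: report when runtimes are defined both via --add-runtime flags
# (the environment file) and via the "runtimes" key in daemon.json.
# Plain grep only; this is a text match, not real JSON parsing.
check_runtime_conflict() {
    env_file=$1
    json_file=$2
    from_flag=no
    from_file=no
    if [ -f "$env_file" ] && grep -q -e '--add-runtime' "$env_file"; then
        from_flag=yes
    fi
    if [ -f "$json_file" ] && grep -q '"runtimes"' "$json_file"; then
        from_file=yes
    fi
    if [ "$from_flag" = yes ] && [ "$from_file" = yes ]; then
        echo "conflict: runtimes set by flag and by $json_file"
        return 1
    fi
    echo "ok: runtimes come from a single source"
}

# Example:
#   check_runtime_conflict /run/docker/runtimes.env /etc/docker/daemon.json
```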

Found the culprit!
So I moved runtimes.env out of the way and commented out the ExecStartPre that runs docker-setup-runtimes.sh.
Because the service file was edited, systemd needs a daemon reload. ← This is important!

$ sudo mv /run/docker/runtimes.env /run/docker/runtimes.env.old
$ sudo vi /usr/lib/systemd/system/docker.service
(comment out the relevant ExecStartPre line)

$ sudo systemctl daemon-reload
$ sudo systemctl restart docker
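
One caveat: the file under /usr/lib/systemd/system belongs to the package, so a later package update may overwrite the edit. As an alternative that I did not use in this article, the same effect can probably be had with a systemd drop-in. The sketch below assumes the unit looks like the systemctl status output above; the function name and the file name `10-skip-setup-runtimes.conf` are hypothetical.

```shell
#!/bin/sh
# Sketch: write a systemd drop-in instead of editing the packaged unit file.
# An empty "ExecStartPre=" clears the commands inherited from docker.service;
# the mkdir step is then added back, so only the runtime setup script is dropped.
# The function name and the drop-in file name are hypothetical.
write_docker_dropin() {
    dropin_dir=${1:-/etc/systemd/system/docker.service.d}
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/10-skip-setup-runtimes.conf" <<'EOF'
[Service]
ExecStartPre=
ExecStartPre=/bin/mkdir -p /run/docker
EOF
}

# Typical use (as root):
#   write_docker_dropin
#   rm -f /run/docker/runtimes.env
#   systemctl daemon-reload
#   systemctl restart docker
```

The resulting configuration can be inspected with `systemctl cat docker` before restarting.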

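With that, it worked! What a relief.

Summary

If I had properly understood and managed what I actually needed, instead of cutting corners because installing CUDA from scratch is a hassle or because a Docker + GPU environment is complicated, this would not have happened.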
