Installing the NVIDIA Driver with Ansible

Posted at 2018-03-23

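This is a record of installing the NVIDIA driver, CUDA, and cuDNN on CentOS 7.4 with Ansible.
However, when I installed CUDA after installing the NVIDIA driver, the NVIDIA driver ended up in an uninstalled state.
⇛ Reinstalling the driver made it usable in the end. Installing CUDA first appears to avoid the problem, and some articles say the same.

  If you have hit the same problem or know the official procedure, I would appreciate it if you shared it.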

[Updated 2018/3/26]
・I initially installed CUDA 9.1, but tensorflow-gpu raised an error, so I changed the install target to 9.0.
・I was installing the CUDA patches with shell -> rpm, but found that yum can be used,
  so I switched to yum.

# Prerequisite environment
[Control node: Mac]
Mac: macOS High Sierra 10.13.3
ansible 2.4.3.0

[Target node: Linux server]

# cat /etc/redhat-release
CentOS Linux release 7.4.1708 (Core) 
# uname -r
4.15.10-1.el7.elrepo.x86_64

# Ansible directory layout
./ansible
+--[inventory]
|  +- home_inventory.ini
+--[roles]
   +-[nvidia_driver]
      +-[defaults] - main.yml
      +-[handlers] - main.yml
      +-[tasks] - main.yml
      +-[vars] - main.yml

Ansible run command

# ansible-playbook -i ./inventory/home_inventory.ini ./homeserver_deploy.yml
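
The command refers to `homeserver_deploy.yml`, which is not shown in this article. A minimal sketch of what such a playbook might look like, assuming it only applies the `nvidia_driver` role; the host group name `homeserver` and the play-level `become` are assumptions, not taken from the original.

```yaml
---
# homeserver_deploy.yml (sketch; the original file is not shown in the article)
- hosts: homeserver      # hypothetical group defined in inventory/home_inventory.ini
  become: true           # assumption: run the role's tasks as root
  roles:
    - nvidia_driver      # resolved from ./roles/nvidia_driver next to the playbook
```

With a play-level `become: true`, tasks in the role that do not set `become` themselves (for example the yum tasks) would also run with root privileges.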

Ansible code

defaults/main.yml
# defaults file for nvidia-blob-install
nvidia_driver_version: '390.42'
vars/main.yml
---
# vars file for nvidia-blob-install
download_dir: /home/download_files/
nvidia_file_name: NVIDIA-Linux-x86_64-{{nvidia_driver_version}}.run
nvidia_driver_url: http://jp.download.nvidia.com/XFree86/Linux-x86_64/{{nvidia_driver_version}}/{{nvidia_file_name}}
exe: /home/download_files/{{nvidia_file_name}}
cuda_url: https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-rhel7-9-0-local-9.0.176-1.x86_64-rpm
cuda_patch1_url: https://developer.nvidia.com/compute/cuda/9.0/Prod/patches/1/cuda-repo-rhel7-9-0-local-cublas-performance-update-1.0-1.x86_64-rpm
cuda_patch2_url: https://developer.nvidia.com/compute/cuda/9.0/Prod/patches/2/cuda-repo-rhel7-9-0-local-cublas-performance-update-2-1.0-1.x86_64-rpm
#cuda_patch3_url:
cuda_exe: /home/download_files/cuda-repo-rhel7-9-0-local-9.0.176-1.x86_64.rpm
cuda_patch1: /home/download_files/cuda-repo-rhel7-9-0-local-cublas-performance-update-1.0-1.x86_64.rpm
cuda_patch2: /home/download_files/cuda-repo-rhel7-9-0-local-cublas-performance-update-2-1.0-1.x86_64.rpm
#cuda_patch3: /home/download_files/cuda-repo-rhel7-9-1-local-cublas-performance-update-3-1.0-1.x86_64.rpm

rpm_cuda_patch1: cuda-repo-rhel7-9-0-local-cublas-performance-update-1.0-1.x86_64
rpm_cuda_patch2: cuda-repo-rhel7-9-0-local-cublas-performance-update-2-1.0-1.x86_64
#rpm_cuda_patch3: cuda-repo-rhel7-9-1-local-cublas-performance-update-3-1.0-1.x86_64

cuDNN: cudnn-9.0-linux-x64-v7.1.tgz
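
Several of the paths above repeat `/home/download_files/`. A minimal sketch of deriving them from `download_dir` so the directory is defined once; `cuda_rpm_name` is a hypothetical helper variable that does not exist in the original, and the other values mirror the ones above.

```yaml
# vars/main.yml (excerpt, sketch): build file paths from download_dir
download_dir: /home/download_files/
nvidia_file_name: "NVIDIA-Linux-x86_64-{{ nvidia_driver_version }}.run"
exe: "{{ download_dir }}{{ nvidia_file_name }}"
# cuda_rpm_name is a hypothetical helper introduced only for this sketch
cuda_rpm_name: cuda-repo-rhel7-9-0-local-9.0.176-1.x86_64.rpm
cuda_exe: "{{ download_dir }}{{ cuda_rpm_name }}"
```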
tasks/main.yml
---
# Disable the nouveau driver
- name: change boot image file
  shell: "mv /boot/initramfs-4.15.10-1.el7.elrepo.x86_64.img /boot/initramfs-4.15.10-1.el7.elrepo.x86_64-nouveau.img"
  args:
    creates: /boot/initramfs-4.15.10-1.el7.elrepo.x86_64-nouveau.img
  become: true

- name: dracut
  shell: "dracut --omit-drivers nouveau /boot/initramfs-$(uname -r).img $(uname -r)"
  args:
    creates: /boot/initramfs-4.15.10-1.el7.elrepo.x86_64.img
  become: true

# Blacklist the nouveau driver module
- name: nouveau_in_kernel_blacklist
  copy:
    dest: /etc/modprobe.d/nouveau_blacklist.conf
    content: blacklist nouveau
  become: true

- name: nouveau_in_modprobe.conf
  copy:
    dest: /etc/modprobe.d/modprobe.conf
    content: blacklist nouveau
  become: true
  notify: restart server
##
- name: change runlevel 3
  shell: "systemctl set-default multi-user.target"
  become: true

# Install prerequisite packages
- name: install packages
  yum: name={{item}} state=present
  with_items:
    - gcc
    - kernel-devel
    - epel-release
    - dkms

- name: update gcc
  yum: name={{item}} state=latest
  with_items:
    - gcc
  become: true

# A VNC server was running, so stop it
# NOTE: if it is not stopped, the NVIDIA driver installation fails with an error
- name: stop vnc server, if running
  systemd: name={{item}} state=stopped
  with_items:
    - vncserver@:3

# This playbook installs CUDA first, but in practice I installed the NVIDIA driver first

- name: Download cuda
  get_url:
    dest: "{{download_dir}}"
    mode: 0755
    owner: root
    group: root
    url: "{{cuda_url}}"
- name: Download cuda patch1
  get_url:
    dest: "{{download_dir}}"
    mode: 0755
    owner: root
    group: root
    url: "{{cuda_patch1_url}}"
- name: Download cuda patch2
  get_url:
    dest: "{{download_dir}}"
    mode: 0755
    owner: root
    group: root
    url: "{{cuda_patch2_url}}"


- name: Running cuda installer1
  become: yes
  become_user: root
  shell: "rpm -i {{cuda_exe}}"
  register: cuda_log

- name: debug result var
  debug: var=cuda_log

- name: yum clean all
  become: yes
  become_user: root
  shell: "yum clean all"


- name: install cuda
  yum: name={{item}} state=present
  with_items:
    - cuda

# Added because an error occurred (may not be necessary)
- name: install xorg-x11
  yum: name={{item}} state=present
  with_items:
    - xorg-x11-drv-nvidia


- name: yum install patch1
  become: yes
  become_user: root
  yum:
    name: "{{cuda_patch1}}"
    state: present


- name: yum install patch2
  become: yes
  become_user: root
  yum:
    name: "{{cuda_patch2}}"
    state: present



- name: create nvidia driver download directory
  file: path={{download_dir}} state=directory owner=root group=root mode=0755

# tasks file for nvidia-blob-install
- name: Download driver blob
  get_url:
    dest: "{{download_dir}}"
    mode: 0755
    owner: root
    group: root
    url: "{{nvidia_driver_url}}"


- name: Running NVIDIA installer
  become: yes
  become_user: root
  shell: "{{exe}} -s --kernel-source-path=/usr/src/kernels/4.15.10-1.el7.elrepo.x86_64"
  register: nvidia_log
  notify: restart server
  with_first_found:
  - files: /usr/bin/nvidia-smi
    skip: true

- name: debug result var
  debug: var=nvidia_log

- name: start vnc server, if stopped
  systemd: name={{item}} state=started
  with_items:
    - vncserver@:3

- name: change runlevel 5
  become: yes
  become_user: root
  shell: "systemctl set-default graphical.target"

- name: Extract cuDNN into download_dir
  unarchive:
    # use the archive name from vars/main.yml (the cuDNN build for CUDA 9.0)
    src: "{{cuDNN}}"
    dest: "{{download_dir}}"
    mode: 0644
    owner: root
    group: root

- name: Install / Copy  cudnn.h
  become: yes
  become_user: root
  shell: "cp {{download_dir}}cuda/include/cudnn.h /usr/local/cuda/include/"
  args:
    creates: /usr/local/cuda/include/cudnn.h

- name: Install / Copy libcudnn*
  become: yes
  become_user: root
  shell: "cp {{download_dir}}cuda/lib64/libcudnn* /usr/local/cuda/lib64"
  args:
    creates: /usr/local/cuda/lib64/libcudnn.so

- name: chmod
  file:
    path: /usr/local/cuda/include/cudnn.h
    owner: root
    group: root
    mode: 0444

- name: run ldconfig for /usr/local/cuda/lib64
  become: yes
  become_user: root
  shell: "ldconfig /usr/local/cuda/lib64"

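Error & reinstalling the NVIDIA driver

After installing CUDA, nvidia-smi stops working, as shown below.
Trying to uninstall the NVIDIA driver then reports
"There is no NVIDIA driver currently installed.",
meaning the driver is treated as not installed in the first place.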

# nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
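
The same check can be run from the playbook so that a mismatch is reported right after installation. A minimal sketch, not from the original article; `nvidia_smi_result` is a hypothetical variable name.

```yaml
# Sketch: verify that nvidia-smi can talk to the installed driver
- name: run nvidia-smi
  command: nvidia-smi
  register: nvidia_smi_result
  changed_when: false    # a read-only check never changes the host
  failed_when: false     # the next task decides whether this is a failure

- name: stop with a clear message when nvidia-smi does not work
  fail:
    msg: "nvidia-smi failed: {{ nvidia_smi_result.stdout | default('') }} {{ nvidia_smi_result.stderr | default('') }}"
  when: nvidia_smi_result.rc | default(1) != 0
```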

So I re-ran only the following code to install the NVIDIA driver, and the installation completed.
Is CUDA install -> NVIDIA driver install the better order in the first place?
(If you got this working, please share.)

tasks/main.yml
- name: change runlevel 3
  shell: "systemctl set-default multi-user.target"
  become: true

- name: stop vnc server, if running
  systemd: name={{item}} state=stopped
  with_items:
    - vncserver@:3

- name: Running NVIDIA installer
  become: yes
  become_user: root
  shell: "{{exe}} -s --kernel-source-path=/usr/src/kernels/4.15.10-1.el7.elrepo.x86_64"
  register: nvidia_log

- name: change runlevel 5
  become: yes
  become_user: root
  shell: "systemctl set-default graphical.target"
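
On the ordering question, one possible explanation is that the `cuda` meta-package from NVIDIA's rpm repository also installs NVIDIA's packaged driver, which can clash with a driver installed from the `.run` file. NVIDIA's installation guide lists a `cuda-toolkit-9-0` meta-package that does not include the driver. An untested sketch of installing only the toolkit, under the assumption that the local CUDA 9.0 repo rpm has already been installed as in the tasks above:

```yaml
# Sketch (untested on this setup): install the toolkit without the packaged driver
- name: install only the CUDA 9.0 toolkit
  become: true
  yum:
    name: cuda-toolkit-9-0   # meta-package without the driver, per NVIDIA's guide
    state: present
```

Whether this avoids the "Driver/library version mismatch" in this environment has not been confirmed here.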