Recovering a corrupted Proxmox ceph volume (※ when the storage is replicated)

Environment

  1. PROXMOX 8.3 cluster of three PRIMERGY TX1320 M3 nodes
  2. /dev/sda: LVM for the system, ADATA SU650 (2.5-inch SATA SSD, 256 GB)
  3. /dev/sdb: LVM for the ceph pool, CT500BX500SSD1 (2.5-inch SATA SSD, 500 GB)
  4. 40 GB of memory per node

The partition layout on sda is the PROXMOX installer default.

# lsblk -F
NAME                         FSTYPE    FSVER    LABEL UUID                                   FSAVAIL FSUSE% MOUNTPOINTS
sda                                                                                                         
├─sda1                                                                                                      
├─sda2                       vfat      FAT32          588D-F53E                              1010.3M     1% /boot/efi
└─sda3                       LVM2_memb LVM2 001       vLA6ES-DvDL-viY5-lMtC-LmhQ-fQNw-OK2vin                
  ├─pve-swap                 swap      1              b54e017d-1fb4-44b4-9c3f-f5cb90b5a6d5                  
  ├─pve-root                 ext4      1.0            532a4cfd-48bf-40c7-a9df-37f0a67d1e28     47.9G    24% /
  ├─pve-data_tmeta                                                                                          
  │ └─pve-data-tpool                                                                                        
  │   ├─pve-data                                                                                            
  │   └─pve-vm--107--disk--0                                                                                
  └─pve-data_tdata                                                                                          
    └─pve-data-tpool                                                                                        
      ├─pve-data                                                                                            
      └─pve-vm--107--disk--0                                                                                
sdb                          LVM2_memb LVM2 001       wbbo0P-tGmN-6m36-PzTW-fhvV-GGM7-Y56cYR                
└─ceph--f58a1d23--ba4c--4e06--bda6--4e299eae5634-osd--block--e2884e27--9f82--4bed--9e61--440147b99a7e
                             ceph_blue            

Symptoms

  1. Stored a VM backup on remote CIFS storage
  2. Stopped and restarted the OSD
  3. The OSD no longer starts (systemd[1]: Failed to start ceph-osd@2.service - Ceph object storage daemon osd.2.); see the checks right after this list
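
To see why the unit fails, the systemd status, the journal for the OSD, and the kernel log can be checked first. A minimal sketch, assuming the failing OSD is osd.2 on this node:

# systemctl status ceph-osd@2.service
# journalctl -xeu ceph-osd@2.service --no-pager | tail -n 50
# dmesg -T | grep -iE 'rocksdb|bluestore|swap' | tail -n 20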

Analysis

According to dmesg, the OSD aborts with the RocksDB error "SST file is ahead of WALs in CF O-0" and an assert in _txc_apply_kv (FAILED ceph_assert(r == 0)). The SST files held by RocksDB (BlueStore) are corrupted or inconsistent, so the DB cannot be mounted and the OSD terminates.

In addition, errors indicating swap corruption were also occurring. The swap corruption quite possibly triggered the OSD storage damage, but the causal link could not be pinned down.

[111584.838843] get_swap_device: Bad swap file entry 1003ffffffffffff
[111584.838847] BUG: Bad page map in process smbclient ...
...
[111584.838987] get_swap_device: Bad swap offset entry 37fffffffffff

Remedy

  1. Recreate the swap volume
  2. Destroy the OSD volume and regenerate it from the OSD volumes on the redundant nodes

The Ceph storage is replicated across three nodes, so the OSD (ceph volume) on the failed node can be destroyed and regenerated from the remaining nodes.
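
Before destroying anything, it is worth confirming that the remaining OSDs are up and that the pool really is replicated, so the data survives the loss of this OSD. A quick check, with <pool> standing in for your pool name:

# ceph -s                        # overall health; the other OSDs should be up
# ceph osd tree                  # osd.2 down, the remaining OSDs up
# ceph osd pool get <pool> size  # replication factor (3 in this setup)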

1. Recreate the swap volume

For an LVM-type swap of size 8G:

# swapoff /dev/pve/swap
# lvremove /dev/pve/swap
# lvcreate -L 8G -n swap pve
# mkswap /dev/pve/swap
# swapon /dev/pve/swap
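
The LV keeps the name /dev/pve/swap, so the default Proxmox /etc/fstab entry should still match (if your fstab references the swap by UUID instead, update that entry). To confirm the new swap is active:

# swapon --show    # the recreated swap device should be listed with SIZE 8G
# free -h          # Swap total should read about 8Gi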

2. Regenerate the OSD volume

Here the OSD number is 2 and the ceph volume is /dev/sdb.

2.1. Mark osd.2 "out" and remove it from the cluster

root@pve03:~# ceph osd out 2
osd.2 is already out. 

root@pve03:~# ceph osd crush remove osd.2
removed item id 2 name 'osd.2' from crush map

root@pve03:~# ceph auth del osd.2

root@pve03:~# ceph osd rm 2
removed osd.2
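
At this point osd.2 should be gone from both the CRUSH map and the auth database; for example:

# ceph osd tree              # osd.2 no longer listed
# ceph auth ls | grep osd.2  # prints nothing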

2.2. "Zap" (wipe) the OSD volume

For LVM:

root@pve03:~# ceph-volume lvm zap /dev/sdb --destroy
--> Zapping: /dev/sdb
--> Zapping lvm member /dev/sdb. lv_path is /dev/ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8/osd-block-f071e1ed-7e33-4fee-ac77-96269454e21b
--> Unmounting /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/umount -v /var/lib/ceph/osd/ceph-2
 stderr: umount: /var/lib/ceph/osd/ceph-2 unmounted
Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8/osd-block-f071e1ed-7e33-4fee-ac77-96269454e21b bs=1M count=10 conv=fsync
 stderr: 10+0 records in
10+0 records out
 stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.042772 s, 245 MB/s
--> Only 1 LV left in VG, will proceed to destroy volume group ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8
Running command: vgremove -v -f ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8
 stderr: Removing ceph--efd5c8f7--fec4--425a--9138--ba50292a74e8-osd--block--f071e1ed--7e33--4fee--ac77--96269454e21b (252:0)
 stderr: Releasing logical volume "osd-block-f071e1ed-7e33-4fee-ac77-96269454e21b"
 stderr: Archiving volume group "ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8" metadata (seqno 5).
 stdout: Logical volume "osd-block-f071e1ed-7e33-4fee-ac77-96269454e21b" successfully removed.
 stderr: Removing physical volume "/dev/sdb" from volume group "ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8"
 stdout: Volume group "ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8" successfully removed
 stderr: Creating volume group backup "/etc/lvm/backup/ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8" (seqno 6).
Running command: pvremove -v -f -f /dev/sdb
 stdout: Labels on physical volume "/dev/sdb" successfully wiped.
Running command: /usr/bin/dd if=/dev/zero of=/dev/sdb bs=1M count=10 conv=fsync
 stderr: 10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.0436723 s, 240 MB/s
 stderr: 
--> Zapping successful for: <Raw Device: /dev/sdb>
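
After the zap, /dev/sdb should carry no LVM signature any more; for example:

# lsblk -F /dev/sdb    # the FSTYPE column should now be empty
# pvs | grep sdb       # prints nothing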

2.3. Recreate the OSD volume

root@pve03:~# ceph-volume lvm create --data /dev/sdb
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new e2884e27-9f82-4bed-9e61-440147b99a7e
Running command: vgcreate --force --yes ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634 /dev/sdb
 stdout: Physical volume "/dev/sdb" successfully created.
 stdout: Volume group "ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634" successfully created
Running command: lvcreate --yes -l 119234 -n osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634
 stdout: Logical volume "osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e" created.
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
--> Executable selinuxenabled not in PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Running command: /usr/bin/chown -h ceph:ceph /dev/ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634/osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-0
Running command: /usr/bin/ln -s /dev/ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634/osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-2/activate.monmap
 stderr: 2025-01-05T13:03:49.264+0900 7cbc03c006c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2025-01-05T13:03:49.264+0900 7cbc03c006c0 -1 AuthRegistry(0x7cbbfc065d50) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
 stderr: got monmap epoch 1
--> Creating keyring file for osd.2
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2/keyring
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2/
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 2 --monmap /var/lib/ceph/osd/ceph-2/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-2/ --osd-uuid e2884e27-9f82-4bed-9e61-440147b99a7e --setuser ceph --setgroup ceph
 stderr: 2025-01-05T13:03:49.508+0900 7a3049a49840 -1 bluestore(/var/lib/ceph/osd/ceph-2//block) _read_bdev_label unable to decode label /var/lib/ceph/osd/ceph-2//block at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input [buffer:3]
 stderr: 2025-01-05T13:03:49.508+0900 7a3049a49840 -1 bluestore(/var/lib/ceph/osd/ceph-2//block) _read_bdev_label unable to decode label /var/lib/ceph/osd/ceph-2//block at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input [buffer:3]
 stderr: 2025-01-05T13:03:49.509+0900 7a3049a49840 -1 bluestore(/var/lib/ceph/osd/ceph-2//block) _read_bdev_label unable to decode label /var/lib/ceph/osd/ceph-2//block at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input [buffer:3]
 stderr: 2025-01-05T13:03:49.509+0900 7a3049a49840 -1 bluestore(/var/lib/ceph/osd/ceph-2/) _read_fsid unparsable uuid
--> ceph-volume lvm prepare successful for: /dev/sdb
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634/osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e --path /var/lib/ceph/osd/ceph-2 --no-mon-config
Running command: /usr/bin/ln -snf /dev/ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634/osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-0
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/systemctl enable ceph-volume@lvm-2-e2884e27-9f82-4bed-9e61-440147b99a7e
 stderr: Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-2-e2884e27-9f82-4bed-9e61-440147b99a7e.service → /lib/systemd/system/ceph-volume@.service.
Running command: /usr/bin/systemctl enable --runtime ceph-osd@2
Running command: /usr/bin/systemctl start ceph-osd@2
--> ceph-volume lvm activate successful for osd ID: 2
--> ceph-volume lvm create successful for: /dev/sdb

2.4. Returning the OSD to service

Once creation succeeds, osd.2 should start automatically and rejoin ("in") the ceph pool cluster. (At least I believe so; see the checks below.)
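
Whether it actually did, and how the backfill is progressing, can be checked with:

# ceph osd tree    # osd.2 should be up and in
# ceph -s          # recovery/backfill progress, eventually HEALTH_OK
# ceph osd df      # the new OSD fills up as PGs are backfilled onto it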

Root cause

A physical disk or memory failure, or a kernel bug, is suspected, but the root cause is unknown. The problem appeared after a VM backup over CIFS hit CIFS errors, but again no causal relationship could be established.
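
If hardware is suspected, SMART data and a RAM test are the obvious next checks. A minimal sketch, assuming smartmontools and memtester are installed (a memtest86+ boot is more thorough than an in-OS test):

# smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect'
# smartctl -a /dev/sdb | grep -iE 'reallocated|pending|uncorrect'
# memtester 2G 1    # test 2 GB of RAM for one pass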
