Environment
- Proxmox 8.3 cluster of three PRIMERGY TX1320 M3 servers
- /dev/sda: system LVM, ADATA SU650 (2.5-inch SATA SSD, 256 GB)
- /dev/sdb: Ceph pool LVM, CT500BX500SSD1 (2.5-inch SATA SSD, 500 GB)
- 40 GB of RAM per node
The sda partition layout is the Proxmox installer default:
# lsblk -f
NAME FSTYPE FSVER LABEL UUID FSAVAIL FSUSE% MOUNTPOINTS
sda
├─sda1
├─sda2 vfat FAT32 588D-F53E 1010.3M 1% /boot/efi
└─sda3 LVM2_memb LVM2 001 vLA6ES-DvDL-viY5-lMtC-LmhQ-fQNw-OK2vin
├─pve-swap swap 1 b54e017d-1fb4-44b4-9c3f-f5cb90b5a6d5
├─pve-root ext4 1.0 532a4cfd-48bf-40c7-a9df-37f0a67d1e28 47.9G 24% /
├─pve-data_tmeta
│ └─pve-data-tpool
│ ├─pve-data
│ └─pve-vm--107--disk--0
└─pve-data_tdata
└─pve-data-tpool
├─pve-data
└─pve-vm--107--disk--0
sdb LVM2_memb LVM2 001 wbbo0P-tGmN-6m36-PzTW-fhvV-GGM7-Y56cYR
└─ceph--f58a1d23--ba4c--4e06--bda6--4e299eae5634-osd--block--e2884e27--9f82--4bed--9e61--440147b99a7e ceph_blue
Symptoms
- A VM backup was written to remote CIFS storage
- The OSD was stopped and restarted
- The OSD fails to start (systemd[1]: Failed to start ceph-osd@2.service - Ceph object storage daemon osd.2.); see the inspection sketch below
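To see why the unit failed, a reasonable first pass (commands assumed; the original capture starts at the dmesg analysis below) is to read the unit status, the OSD journal, and the kernel log:
# systemctl status ceph-osd@2.service
# journalctl -u ceph-osd@2.service -b --no-pager | tail -n 50
# dmesg | tail -n 100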
Analysis
According to dmesg, the OSD aborts with the RocksDB error "SST file is ahead of WALs in CF O-0" and an assert in _txc_apply_kv (FAILED ceph_assert(r == 0)). The SST files held by RocksDB (BlueStore) are corrupted or inconsistent, so the DB cannot be opened and the OSD dies on startup.
On top of that, swap corruption errors also appear in the kernel log. The swap corruption quite possibly triggered the OSD storage damage, but the causal link could not be pinned down (an offline BlueStore check is sketched after the log excerpt).
[111584.838843] get_swap_device: Bad swap file entry 1003ffffffffffff
[111584.838847] BUG: Bad page map in process smbclient ...
...
[111584.838987] get_swap_device: Bad swap offset entry 37fffffffffff
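Before deciding to rebuild, the BlueStore metadata can be checked offline with ceph-bluestore-tool to confirm the corruption; a sketch, assuming the OSD data directory is /var/lib/ceph/osd/ceph-2 and the OSD is stopped (adding --deep also reads and verifies object data):
# systemctl stop ceph-osd@2
# ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-2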
Fix
- Recreate the swap volume
- Destroy the broken OSD volume and rebuild it from the redundant nodes
Because the Ceph storage is replicated across the three nodes, the OSD (ceph volume) on the failed node can be destroyed and regenerated from the data held by the remaining nodes.
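Before destroying anything, it is worth confirming that the surviving replicas are healthy and that the pool can tolerate losing this OSD; a quick pre-check (assumed, not part of the original log):
# ceph -s
# ceph osd tree
# ceph osd pool ls detail    (size/min_size should leave enough surviving copies)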
1. Recreate the swap volume
For an LVM-based swap of size 8G:
# swapoff /dev/pve/swap
# lvremove /dev/pve/swap
# lvcreate -L 8G -n swap pve
# mkswap /dev/pve/swap
# swapon /dev/pve/swap
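On a default Proxmox install, /etc/fstab refers to the swap LV by path (/dev/pve/swap), so no fstab edit should be needed. A quick verification that the new swap is active:
# swapon --show
# free -h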
2. Rebuild the OSD volume
In this example the OSD id is 2 and the ceph volume is /dev/sdb.
2.1. Mark OSD.2 "out" and remove it from the cluster
root@pve03:~# ceph osd out 2
osd.2 is already out.
root@pve03:~# ceph osd crush remove osd.2
removed item id 2 name 'osd.2' from crush map
root@pve03:~# ceph auth del osd.2
root@pve03:~# ceph osd rm 2
removed osd.2
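After removal, osd.2 should no longer appear in the CRUSH tree or in the auth list; a quick check:
# ceph osd tree
# ceph auth ls | grep osd.2    (should print nothing)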
2.2. "Zap" (wipe) the OSD volume
For LVM:
root@pve03:~# ceph-volume lvm zap /dev/sdb --destroy
--> Zapping: /dev/sdb
--> Zapping lvm member /dev/sdb. lv_path is /dev/ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8/osd-block-f071e1ed-7e33-4fee-ac77-96269454e21b
--> Unmounting /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/umount -v /var/lib/ceph/osd/ceph-2
stderr: umount: /var/lib/ceph/osd/ceph-2 unmounted
Running command: /usr/bin/dd if=/dev/zero of=/dev/ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8/osd-block-f071e1ed-7e33-4fee-ac77-96269454e21b bs=1M count=10 conv=fsync
stderr: 10+0 records in
10+0 records out
stderr: 10485760 bytes (10 MB, 10 MiB) copied, 0.042772 s, 245 MB/s
--> Only 1 LV left in VG, will proceed to destroy volume group ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8
Running command: vgremove -v -f ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8
stderr: Removing ceph--efd5c8f7--fec4--425a--9138--ba50292a74e8-osd--block--f071e1ed--7e33--4fee--ac77--96269454e21b (252:0)
stderr: Releasing logical volume "osd-block-f071e1ed-7e33-4fee-ac77-96269454e21b"
stderr: Archiving volume group "ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8" metadata (seqno 5).
stdout: Logical volume "osd-block-f071e1ed-7e33-4fee-ac77-96269454e21b" successfully removed.
stderr: Removing physical volume "/dev/sdb" from volume group "ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8"
stdout: Volume group "ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8" successfully removed
stderr: Creating volume group backup "/etc/lvm/backup/ceph-efd5c8f7-fec4-425a-9138-ba50292a74e8" (seqno 6).
Running command: pvremove -v -f -f /dev/sdb
stdout: Labels on physical volume "/dev/sdb" successfully wiped.
Running command: /usr/bin/dd if=/dev/zero of=/dev/sdb bs=1M count=10 conv=fsync
stderr: 10+0 records in
10+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.0436723 s, 240 MB/s
stderr:
--> Zapping successful for: <Raw Device: /dev/sdb>
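At this point /dev/sdb should carry no remaining signatures; a quick verification (assumed, not in the original log):
# lsblk -f /dev/sdb    (no LVM2_member entry should remain)
# wipefs /dev/sdb      (with no options wipefs only lists signatures; empty output means the disk is clean)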
2.3. Recreate the OSD volume
root@pve03:~# ceph-volume lvm create --data /dev/sdb
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph-authtool --gen-print-key
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new e2884e27-9f82-4bed-9e61-440147b99a7e
Running command: vgcreate --force --yes ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634 /dev/sdb
stdout: Physical volume "/dev/sdb" successfully created.
stdout: Volume group "ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634" successfully created
Running command: lvcreate --yes -l 119234 -n osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634
stdout: Logical volume "osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e" created.
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
--> Executable selinuxenabled not in PATH: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
Running command: /usr/bin/chown -h ceph:ceph /dev/ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634/osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-0
Running command: /usr/bin/ln -s /dev/ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634/osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /var/lib/ceph/osd/ceph-2/activate.monmap
stderr: 2025-01-05T13:03:49.264+0900 7cbc03c006c0 -1 auth: unable to find a keyring on /etc/pve/priv/ceph.client.bootstrap-osd.keyring: (2) No such file or directory
2025-01-05T13:03:49.264+0900 7cbc03c006c0 -1 AuthRegistry(0x7cbbfc065d50) no keyring found at /etc/pve/priv/ceph.client.bootstrap-osd.keyring, disabling cephx
stderr: got monmap epoch 1
--> Creating keyring file for osd.2
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2/keyring
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2/
Running command: /usr/bin/ceph-osd --cluster ceph --osd-objectstore bluestore --mkfs -i 2 --monmap /var/lib/ceph/osd/ceph-2/activate.monmap --keyfile - --osd-data /var/lib/ceph/osd/ceph-2/ --osd-uuid e2884e27-9f82-4bed-9e61-440147b99a7e --setuser ceph --setgroup ceph
stderr: 2025-01-05T13:03:49.508+0900 7a3049a49840 -1 bluestore(/var/lib/ceph/osd/ceph-2//block) _read_bdev_label unable to decode label /var/lib/ceph/osd/ceph-2//block at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input [buffer:3]
stderr: 2025-01-05T13:03:49.508+0900 7a3049a49840 -1 bluestore(/var/lib/ceph/osd/ceph-2//block) _read_bdev_label unable to decode label /var/lib/ceph/osd/ceph-2//block at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input [buffer:3]
stderr: 2025-01-05T13:03:49.509+0900 7a3049a49840 -1 bluestore(/var/lib/ceph/osd/ceph-2//block) _read_bdev_label unable to decode label /var/lib/ceph/osd/ceph-2//block at offset 102: void bluestore_bdev_label_t::decode(ceph::buffer::v15_2_0::list::const_iterator&) decode past end of struct encoding: Malformed input [buffer:3]
stderr: 2025-01-05T13:03:49.509+0900 7a3049a49840 -1 bluestore(/var/lib/ceph/osd/ceph-2/) _read_fsid unparsable uuid
--> ceph-volume lvm prepare successful for: /dev/sdb
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634/osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e --path /var/lib/ceph/osd/ceph-2 --no-mon-config
Running command: /usr/bin/ln -snf /dev/ceph-f58a1d23-ba4c-4e06-bda6-4e299eae5634/osd-block-e2884e27-9f82-4bed-9e61-440147b99a7e /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-0
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/systemctl enable ceph-volume@lvm-2-e2884e27-9f82-4bed-9e61-440147b99a7e
stderr: Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-2-e2884e27-9f82-4bed-9e61-440147b99a7e.service → /lib/systemd/system/ceph-volume@.service.
Running command: /usr/bin/systemctl enable --runtime ceph-osd@2
Running command: /usr/bin/systemctl start ceph-osd@2
--> ceph-volume lvm activate successful for osd ID: 2
--> ceph-volume lvm create successful for: /dev/sdb
2.4. Return of the OSD
If creation succeeds, OSD.2 starts automatically and rejoins (in) the Ceph pool cluster. (I think.) A way to confirm is sketched below.
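To confirm the rejoin, watch the cluster until recovery/backfill completes and it returns to HEALTH_OK:
# ceph osd tree    (osd.2 should be listed "up" under this host)
# ceph -s          (recovery progress; repeat until HEALTH_OK)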
Root cause
A physical disk or memory fault, or a kernel bug, is suspected, but the root cause remains unknown. The failure appeared right after a VM backup over CIFS hit CIFS errors, but a causal relationship could not be established.
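If hardware is suspected, generic checks can at least rule out the obvious; a sketch (smartmontools and memtester assumed installed; neither was run in this investigation):
# smartctl -a /dev/sda    (check reallocated/pending sectors and the error log; repeat for /dev/sdb)
# memtester 1G 1          (userspace RAM test of 1 GiB, one pass; a memtest86+ boot test is more thorough)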