Problem: an OSD in the Ceph cluster has gone down, and every attempt to restart the OSD fails.
Analysis:
[root@shnode183 ~]# systemctl status ceph-osd@14
● ceph-osd@14.service - Ceph object storage daemon osd.14
Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: disabled)
Active: failed (Result: start-limit) since Mon 2020-06-08 17:47:25 CST; 2s ago
Process: 291595 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i --setuser ceph --setgroup ceph (code=exited, status=1/FAILURE)
Process: 291589 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
Main PID: 291595 (code=exited, status=1/FAILURE)
Jun 08 17:47:25 shnode183 systemd[1]: Unit ceph-osd@14.service entered failed state.
Jun 08 17:47:25 shnode183 systemd[1]: ceph-osd@14.service failed.
Jun 08 17:47:25 shnode183 systemd[1]: ceph-osd@14.service holdoff time over, scheduling restart.
Jun 08 17:47:25 shnode183 systemd[1]: Stopped Ceph object storage daemon osd.14.
Jun 08 17:47:25 shnode183 systemd[1]: start request repeated too quickly for ceph-osd@14.service
Jun 08 17:47:25 shnode183 systemd[1]: Failed to start Ceph object storage daemon osd.14.
Jun 08 17:47:25 shnode183 systemd[1]: Unit ceph-osd@14.service entered failed state.
Jun 08 17:47:25 shnode183 systemd[1]: ceph-osd@14.service failed.
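systemctl only reports that the unit hit its start rate limit after repeated exits with status 1; the underlying error has to come from the OSD log (shown next) or the systemd journal. A sketch for pulling the recent journal entries (not part of the original session):
journalctl -u ceph-osd@14 --no-pager -n 50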
[root@shnode183 ~]# tail /var/log/ceph/ceph-osd.14.log
2020-06-08 17:47:25.091 7f8f9d863a80 0 set uid:gid to 167:167 (ceph:ceph)
2020-06-08 17:47:25.091 7f8f9d863a80 0 ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable), process ceph-osd, pid 291575
2020-06-08 17:47:25.091 7f8f9d863a80 0 pidfile_write: ignore empty --pid-file
2020-06-08 17:47:25.114 7f8f9d863a80 -1 bluestore(/var/lib/ceph/osd/ceph-14/block) _read_bdev_label failed to read from /var/lib/ceph/osd/ceph-14/block: (5) Input/output error
2020-06-08 17:47:25.114 7f8f9d863a80 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-14: (2) No such file or directory
2020-06-08 17:47:25.343 7f826fb16a80 0 set uid:gid to 167:167 (ceph:ceph)
2020-06-08 17:47:25.343 7f826fb16a80 0 ceph version 14.2.8 (2d095e947a02261ce61424021bb43bd3022d35cb) nautilus (stable), process ceph-osd, pid 291595
2020-06-08 17:47:25.343 7f826fb16a80 0 pidfile_write: ignore empty --pid-file
2020-06-08 17:47:25.366 7f826fb16a80 -1 bluestore(/var/lib/ceph/osd/ceph-14/block) _read_bdev_label failed to read from /var/lib/ceph/osd/ceph-14/block: (5) Input/output error
2020-06-08 17:47:25.366 7f826fb16a80 -1 ** ERROR: unable to open OSD superblock on /var/lib/ceph/osd/ceph-14: (2) No such file or directory
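The OSD log shows that BlueStore cannot read the device label behind /var/lib/ceph/osd/ceph-14/block, i.e. the block device backing osd.14 is returning I/O errors. To confirm which physical disk that is, the ceph-volume/LVM mapping can be listed; a sketch (not part of the original session):
ceph-volume lvm list                # shows, per OSD, the osd-block-* LV, its VG and the underlying device
ls -l /var/lib/ceph/osd/ceph-14/    # block is a symlink to the osd-block-* logical volume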
[root@shnode183 ~]# dmesg -T
[Tue Jun 2 04:07:26 2020] sd 0:2:1:0: [sdb] tag#10 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Jun 2 04:07:26 2020] sd 0:2:1:0: [sdb] tag#10 CDB: Read(16) 88 00 00 00 00 02 fc 7f 41 80 00 00 02 00 00 00
[Tue Jun 2 04:07:26 2020] print_req_error: I/O error, dev sdb, sector 12826132864
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#31 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#43 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#31 CDB: Write(16) 8a 00 00 00 00 02 18 71 bc d0 00 00 00 10 00 00
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#43 CDB: Read(16) 88 00 00 00 00 02 bf 09 53 80 00 00 02 00 00 00
[Tue Jun 2 04:07:30 2020] print_req_error: I/O error, dev sdb, sector 11794994048
[Tue Jun 2 04:07:30 2020] print_req_error: I/O error, dev sdb, sector 9000041680
[Tue Jun 2 04:07:30 2020] Buffer I/O error on dev dm-1, logical block 1125004954, lost async page write
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#17 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Jun 2 04:07:30 2020] print_req_error: I/O error, dev sdb, sector 10183874816
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#8 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#18 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#17 CDB: Read(16) 88 00 00 00 00 00 5a 16 d3 80 00 00 00 48 00 00
[Tue Jun 2 04:07:30 2020] Buffer I/O error on dev dm-1, logical block 1125004955, lost async page write
[Tue Jun 2 04:07:30 2020] print_req_error: I/O error, dev sdb, sector 1511445376
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#18 CDB: Read(16) 88 00 00 00 00 02 19 12 0f 00 00 00 00 10 00 00
[Tue Jun 2 04:07:30 2020] print_req_error: I/O error, dev sdb, sector 9010548480
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#8 CDB: Read(16) 88 00 00 00 00 02 19 e7 83 80 00 00 00 10 00 00
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#44 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Jun 2 04:07:30 2020] print_req_error: I/O error, dev sdb, sector 9024537472
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#44 CDB: Read(16) 88 00 00 00 00 02 bf 09 55 80 00 00 01 e8 00 00
[Tue Jun 2 04:07:30 2020] print_req_error: I/O error, dev sdb, sector 11794994560
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#8 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#13 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#8 CDB: Read(16) 88 00 00 00 00 02 19 12 0f 00 00 00 00 08 00 00
[Tue Jun 2 04:07:30 2020] print_req_error: I/O error, dev sdb, sector 9010548480
[Tue Jun 2 04:07:30 2020] sd 0:2:1:0: [sdb] tag#13 CDB: Read(16) 88 00 00 00 00 02 19 e7 83 80 00 00 00 08 00 00
[Tue Jun 2 04:07:30 2020] print_req_error: I/O error, dev sdb, sector 9024537472
[Tue Jun 2 04:07:30 2020] Buffer I/O error on dev dm-1, logical block 1126318304, async page read
[Tue Jun 2 04:07:30 2020] Buffer I/O error on dev dm-1, logical block 1128066928, async page read
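dmesg shows repeated DID_BAD_TARGET read/write failures on /dev/sdb, which explains the (5) Input/output error in the OSD log: the drive behind sdb has stopped responding. Where the controller allows pass-through, drive health can be cross-checked with smartctl; a sketch, assuming smartmontools is installed (the device path and -d option depend on the controller and are assumptions here):
smartctl -H -a /dev/sdb             # directly attached SATA/SAS disk
smartctl -a -d cciss,1 /dev/sdb     # HP Smart Array pass-through; the disk index 1 is only an example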
[root@shnode183 ~]# pvs
Error reading device /dev/sdb at 0 length 512.
Error reading device /dev/sdb at 0 length 4.
Error reading device /dev/sdb at 4096 length 4.
PV VG Fmt Attr PSize PFree
/dev/sdb ceph-0a213fb7-3bdd-49fc-904c-9aecf750ef05 lvm2 a-- 8.73t 0
/dev/sdc ceph-bf6136eb-671c-44ee-aa24-9a460c2901bd lvm2 a-- 8.73t 0
/dev/sdd ceph-22bbd5e1-f98d-40a2-950d-023a08ba5eb3 lvm2 a-- 8.73t 0
/dev/sde ceph-b1df4cad-fc0e-430a-8a2b-8fd08ce4cb62 lvm2 a-- 8.73t 0
/dev/sdf ceph-36c57ac2-0724-4f6f-bdb0-020cd18d0643 lvm2 a-- 8.73t 0
/dev/sdg ceph-52d9bdc0-f9f0-4659-83d4-4b6cc80e387f lvm2 a-- <6.55t 0
/dev/sdh ceph-75b81cc4-095c-4281-8b26-222a7e669d09 lvm2 a-- <6.55t 0
[root@shnode183 ~]# hpssacli ctrl slot=0 show config detail
-bash: hpssacli: command not found
You have new mail in /var/spool/mail/root
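The HPE array CLI is not installed on this node; note that newer HPE tool packages ship the same utility as ssacli rather than hpssacli. If it were installed, the controller and physical-drive state could be checked with something like (commands not run in the original session):
ssacli ctrl slot=0 show config detail
ssacli ctrl slot=0 pd all show status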
[root@shnode183 ~]# vgs
VG #PV #LV #SN Attr VSize VFree
ceph-0a213fb7-3bdd-49fc-904c-9aecf750ef05 1 1 0 wz--n- 8.73t 0
ceph-22bbd5e1-f98d-40a2-950d-023a08ba5eb3 1 1 0 wz--n- 8.73t 0
ceph-36c57ac2-0724-4f6f-bdb0-020cd18d0643 1 1 0 wz--n- 8.73t 0
ceph-52d9bdc0-f9f0-4659-83d4-4b6cc80e387f 1 1 0 wz--n- <6.55t 0
ceph-75b81cc4-095c-4281-8b26-222a7e669d09 1 1 0 wz--n- <6.55t 0
ceph-b1df4cad-fc0e-430a-8a2b-8fd08ce4cb62 1 1 0 wz--n- 8.73t 0
ceph-bf6136eb-671c-44ee-aa24-9a460c2901bd 1 1 0 wz--n- 8.73t 0
You have new mail in /var/spool/mail/root
Troubleshooting shows that the disk /dev/sdb is damaged and needs to be replaced. Before swapping the disk, clear the logical volume information on /dev/sdb; the cluster-side cleanup for osd.14 is sketched below.
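In addition to wiping the LVM metadata, the dead OSD itself normally has to be stopped and removed from the cluster before the disk is swapped. A minimal sketch of the usual steps (not part of the original session; ceph osd destroy can be used instead of purge if the OSD ID should be reused):
systemctl stop ceph-osd@14
systemctl disable ceph-osd@14
ceph osd out 14                             # stop mapping placement groups to osd.14
ceph osd purge 14 --yes-i-really-mean-it    # remove osd.14 from the CRUSH map, auth keys and OSD map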
[root@shnode183 ~]# df -h|grep ceph
tmpfs 63G 24K 63G 1% /var/lib/ceph/osd/ceph-15
tmpfs 63G 24K 63G 1% /var/lib/ceph/osd/ceph-17
tmpfs 63G 24K 63G 1% /var/lib/ceph/osd/ceph-20
tmpfs 63G 24K 63G 1% /var/lib/ceph/osd/ceph-16
tmpfs 63G 24K 63G 1% /var/lib/ceph/osd/ceph-19
tmpfs 63G 24K 63G 1% /var/lib/ceph/osd/ceph-18
tmpfs 63G 24K 63G 1% /var/lib/ceph/osd/ceph-14
[root@shnode183 ~]# umount /var/lib/ceph/osd/ceph-14
[root@shnode183 ~]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
osd-block-fbd4f71a-9ada-4fbd-b87f-9d1f4f9dab93 ceph-0a213fb7-3bdd-49fc-904c-9aecf750ef05 -wi-a----- 8.73t
osd-block-091f4915-e79f-43fa-b40d-89f3cdf1cf4f ceph-22bbd5e1-f98d-40a2-950d-023a08ba5eb3 -wi-ao---- 8.73t
osd-block-f52a5fbd-e4ac-41dd-869c-f25b7867b726 ceph-36c57ac2-0724-4f6f-bdb0-020cd18d0643 -wi-ao---- 8.73t
osd-block-8035bf12-6a30-4a57-910e-ddf7e7f319cd ceph-52d9bdc0-f9f0-4659-83d4-4b6cc80e387f -wi-ao---- <6.55t
osd-block-882e7034-f5d2-480d-be60-3e7c8a746f1b ceph-75b81cc4-095c-4281-8b26-222a7e669d09 -wi-ao---- <6.55t
osd-block-1fbd5079-51bd-479b-9e2e-80a3264f46ba ceph-b1df4cad-fc0e-430a-8a2b-8fd08ce4cb62 -wi-ao---- 8.73t
osd-block-0a09de8e-354f-407e-a57e-cb346d8cac6c ceph-bf6136eb-671c-44ee-aa24-9a460c2901bd -wi-ao---- 8.73t
[root@shnode183 ~]# pvs
Error reading device /dev/sdb at 0 length 512.
Error reading device /dev/sdb at 0 length 4.
Error reading device /dev/sdb at 4096 length 4.
PV VG Fmt Attr PSize PFree
/dev/sdb ceph-0a213fb7-3bdd-49fc-904c-9aecf750ef05 lvm2 a-- 8.73t 0
/dev/sdc ceph-bf6136eb-671c-44ee-aa24-9a460c2901bd lvm2 a-- 8.73t 0
/dev/sdd ceph-22bbd5e1-f98d-40a2-950d-023a08ba5eb3 lvm2 a-- 8.73t 0
/dev/sde ceph-b1df4cad-fc0e-430a-8a2b-8fd08ce4cb62 lvm2 a-- 8.73t 0
/dev/sdf ceph-36c57ac2-0724-4f6f-bdb0-020cd18d0643 lvm2 a-- 8.73t 0
/dev/sdg ceph-52d9bdc0-f9f0-4659-83d4-4b6cc80e387f lvm2 a-- <6.55t 0
/dev/sdh ceph-75b81cc4-095c-4281-8b26-222a7e669d09 lvm2 a-- <6.55t 0
[root@shnode183 ~]# lvremove osd-block-fbd4f71a-9ada-4fbd-b87f-9d1f4f9dab93/ceph-0a213fb7-3bdd-49fc-904c-9aecf750ef05
Volume group "osd-block-fbd4f71a-9ada-4fbd-b87f-9d1f4f9dab93" not found
Cannot process volume group osd-block-fbd4f71a-9ada-4fbd-b87f-9d1f4f9dab93
[root@shnode183 ~]# lvremove ceph-0a213fb7-3bdd-49fc-904c-9aecf750ef05/osd-block-fbd4f71a-9ada-4fbd-b87f-9d1f4f9dab93
Do you really want to remove active logical volume ceph-0a213fb7-3bdd-49fc-904c-9aecf750ef05/osd-block-fbd4f71a-9ada-4fbd-b87f-9d1f4f9dab93?[y/n]: y
Error reading device /dev/sdb at 4096 length 512.
Failed to read metadata area header on /dev/sdb at 4096
WARNING: Failed to write an MDA of VG ceph-0a213fb7-3bdd-49fc-904c-9aecf750ef05.
Failed to write VG ceph-0a213fb7-3bdd-49fc-904c-9aecf750ef05.
The logical volume cannot be removed this way, because LVM can no longer write its metadata to the failed disk. Instead, refresh the LVM device cache with the following command so that the unreadable PV is dropped.
[root@shnode183 ~]# pvscan --cache
You have new mail in /var/spool/mail/root
[root@shnode183 ~]# pvs
Error reading device /dev/sdb at 0 length 512.
Error reading device /dev/sdb at 0 length 4.
Error reading device /dev/sdb at 4096 length 4.
PV VG Fmt Attr PSize PFree
/dev/sdc ceph-bf6136eb-671c-44ee-aa24-9a460c2901bd lvm2 a-- 8.73t 0
/dev/sdd ceph-22bbd5e1-f98d-40a2-950d-023a08ba5eb3 lvm2 a-- 8.73t 0
/dev/sde ceph-b1df4cad-fc0e-430a-8a2b-8fd08ce4cb62 lvm2 a-- 8.73t 0
/dev/sdf ceph-36c57ac2-0724-4f6f-bdb0-020cd18d0643 lvm2 a-- 8.73t 0
/dev/sdg ceph-52d9bdc0-f9f0-4659-83d4-4b6cc80e387f lvm2 a-- <6.55t 0
/dev/sdh ceph-75b81cc4-095c-4281-8b26-222a7e669d09 lvm2 a-- <6.55t 0
[root@shnode183 ~]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
osd-block-091f4915-e79f-43fa-b40d-89f3cdf1cf4f ceph-22bbd5e1-f98d-40a2-950d-023a08ba5eb3 -wi-ao---- 8.73t
osd-block-f52a5fbd-e4ac-41dd-869c-f25b7867b726 ceph-36c57ac2-0724-4f6f-bdb0-020cd18d0643 -wi-ao---- 8.73t
osd-block-8035bf12-6a30-4a57-910e-ddf7e7f319cd ceph-52d9bdc0-f9f0-4659-83d4-4b6cc80e387f -wi-ao---- <6.55t
osd-block-882e7034-f5d2-480d-be60-3e7c8a746f1b ceph-75b81cc4-095c-4281-8b26-222a7e669d09 -wi-ao---- <6.55t
osd-block-1fbd5079-51bd-479b-9e2e-80a3264f46ba ceph-b1df4cad-fc0e-430a-8a2b-8fd08ce4cb62 -wi-ao---- 8.73t
osd-block-0a09de8e-354f-407e-a57e-cb346d8cac6c ceph-bf6136eb-671c-44ee-aa24-9a460c2901bd -wi-ao---- 8.73t
As shown above, the logical volume and physical volume on the corrupted disk have disappeared from the LVM output.
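Once /dev/sdb has been physically replaced, a new OSD can be created on the fresh disk and the cluster will backfill data onto it. A minimal sketch, assuming the replacement disk also appears as /dev/sdb and the OSDs were originally deployed with ceph-volume:
ceph-volume lvm create --bluestore --data /dev/sdb   # builds a new LVM-backed BlueStore OSD
ceph osd tree                                        # confirm the new OSD is up and in
ceph -s                                              # watch recovery/backfill progress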