Ceph: restore cluster after losing all ceph-mons

Here we describe how to restore a Ceph cluster after a disaster in which all ceph-mons are lost. We assume, of course, that the data on the OSD devices is preserved!

The procedure refers to a Ceph cluster created with Juju.

Suppose that you have lost (or removed by mistake) all ceph-mons. The first step is to recreate them, so that you end up with three new ceph-mon units; unfortunately, the new mons know nothing about the OSDs.
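
How the new mons are brought back depends on how they were lost; as a sketch, if the ceph-mon application still exists in the Juju model, new units can simply be added (the unit count and any placement directives are examples to adapt):

# assumes the ceph-mon application is still present in the Juju model;
# add --to placement directives if the mons must live in specific containers
juju add-unit ceph-mon -n 3
juju status ceph-mon    # wait for the new units to settle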

Stop all OSDs

N.B. change the unit IDs according to your cluster:

juju run-action --wait ceph-osd/15 stop osds=all
juju run-action --wait ceph-osd/16 stop osds=all
juju run-action --wait ceph-osd/17 stop osds=all
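
With many OSD units, a small loop avoids listing them by hand (a convenience sketch, relying only on the unit names printed by juju status):

# stop the OSDs on every ceph-osd unit
for u in $(juju status ceph-osd --format=short | grep -o 'ceph-osd/[0-9]*'); do
   juju run-action --wait $u stop osds=all
done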

Stop all MONs and MGRs

SSH to each MON unit and use systemctl to stop the mon and mgr services, e.g.:

systemctl stop ceph-mon@juju-c1a2b7-24-lxd-21.service
systemctl stop ceph-mgr@juju-c1a2b7-24-lxd-21.service
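
If you prefer to do this from the client for all mon units at once, the standard systemd targets can be used (a sketch):

# stop every mon and mgr daemon on all ceph-mon units
juju run --application ceph-mon 'sudo systemctl stop ceph-mon.target ceph-mgr.target'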

Collect the cluster map from the OSDs

On the Juju client machine, collect the cluster map from the existing OSDs with the following script, which saves the result to ./mon-store:

ms=./mon-store
rm -rf $ms
mkdir $ms

# list of ceph-osd units (the cut strips the '*' that marks the leader)
hosts=`juju status ceph-osd | grep 'ceph-osd/' | awk '{print $1}' | cut -d'*' -f1`

# collect the cluster map from the stopped OSDs; the store is copied back and
# forth so that each host adds its own OSDs to the same ./mon-store
for host in $hosts; do
   echo $host
   juju ssh $host sudo rm -rf /tmp/$ms
   juju scp -- -r $ms $host:/tmp
   juju ssh $host sudo bash <<EOF
     for osd in /var/lib/ceph/osd/ceph-*; do
        ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path /tmp/$ms
     done
     # make the updated store readable so it can be copied back without sudo
     chmod -R +r /tmp/$ms
EOF
   rm -rf $ms
   juju scp -- -r $host:/tmp/$ms .
done
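
Before moving on, a quick sanity check that the collected store actually contains data does not hurt:

# the collected store should be non-empty at this point
du -sh ./mon-store
ls ./mon-store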

Keyring creation

Create a keyring file with the following data taken from the new mons (a sketch for gathering the keys follows the list):

  • admin key: copy the content of /etc/ceph/ceph.client.admin.keyring from any mon;

  • mon key: copy the content of /var/lib/ceph/mon/ceph-xxx/keyring from any mon;

  • mgr key: copy the content of /var/lib/ceph/mgr/ceph-xxx/keyring from each mon.
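
As an example, the keys can be gathered from the client with juju ssh and concatenated into a local file named keyring (a sketch; the unit numbers are the ones used in this example, and the sudo sh -c wrapper is only there so the glob is expanded as root):

# admin and [mon.] keys from one mon unit, mgr key from every mon unit
juju ssh ceph-mon/51 'sudo cat /etc/ceph/ceph.client.admin.keyring' > keyring
juju ssh ceph-mon/51 "sudo sh -c 'cat /var/lib/ceph/mon/ceph-*/keyring'" >> keyring
for u in ceph-mon/49 ceph-mon/50 ceph-mon/51; do
   juju ssh $u "sudo sh -c 'cat /var/lib/ceph/mgr/ceph-*/keyring'" >> keyring
done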

The keyring file will look like the following:

[mon.]
        key = AQAr9TphSxvCFxAACOS8KkIROPsvgVCfcFjh1Q==
        caps mon = "allow *"
[client.admin]
        key = AQAmBTthVM5wEhAALJS9IEVTuRKiHYRUztxgng==
        caps mds = "allow *"
        caps mgr = "allow *"
        caps mon = "allow *"
        caps osd = "allow *"
[mgr.juju-74bc65-21-lxd-49]
        key = AQAvBTth4HObLhAAlM140CBeuYrLRhxnuSwdKQ==
        caps mds = "allow *"
        caps mon = "allow profile mgr"
        caps osd = "allow *"
[mgr.juju-74bc65-21-lxd-50]
        key = AQAvBTthqwQ5LRAAISUJt9j4Qb3MZ5jn2B1SwQ==
        caps mds = "allow *"
        caps mon = "allow profile mgr"
        caps osd = "allow *"
[mgr.juju-74bc65-21-lxd-51]
        key = AQAtBTthU36NJBAAjMhYPoPcdDh5L6Coj2grqw==
        caps mds = "allow *"
        caps mon = "allow profile mgr"
        caps osd = "allow *"

N.B. if needed, add the proper capabilities to the mgr keys:

ceph-authtool keyring -n mgr.juju-74bc65-21-lxd-49 --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
ceph-authtool keyring -n mgr.juju-74bc65-21-lxd-50 --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
ceph-authtool keyring -n mgr.juju-74bc65-21-lxd-51 --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
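
You can double-check the resulting file by listing its entries:

# print all entities and capabilities present in the keyring file
ceph-authtool keyring --list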

Rebuild mon store

Copy the local ./mon-store directory and the keyring file from the Juju client machine to one of the mon units, e.g. under /tmp:

juju scp -- -r mon-store keyring ceph-mon/51:/tmp

Log on to that mon unit and rebuild the mon store with the following command:

ceph-monstore-tool /tmp/mon-store/ rebuild -- --keyring /tmp/keyring --mon-ids juju-74bc65-21-lxd-49 juju-74bc65-21-lxd-50 juju-74bc65-21-lxd-51

Important: the order of --mon-ids matters, as it needs to match the "mon host" order in /etc/ceph/ceph.conf on the mon units, otherwise the mons won't start!

Therefore, look for the IP addresses in /etc/ceph/ceph.conf, e.g.:

mon host = 10.7.5.131 10.7.5.132 10.7.5.133
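
This can also be checked from the Juju client, e.g.:

juju ssh ceph-mon/51 'grep "mon host" /etc/ceph/ceph.conf'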

Compare the order with the IP addresses shown by “juju status ceph-mon”:

# juju status ceph-mon
  ...
  Unit          Workload  Agent  Machine    Public address  Ports  Message
  ceph-mon/49   active    idle   21/lxd/29  10.7.5.132             Unit is ready and clustered
  ceph-mon/50*  active    idle   22/lxd/21  10.7.5.133             Unit is ready and clustered
  ceph-mon/51   active    idle   24/lxd/21  10.7.5.131             Unit is ready and clustered

In this example, 10.7.5.131 is the address of 24/lxd/21, 10.7.5.132 is 21/lxd/29, 10.7.5.133 is 22/lxd/21, so --mon-ids must be:

--mon-ids juju-c1a2b7-24-lxd-21 juju-c1a2b7-21-lxd-29 juju-c1a2b7-22-lxd-21

Copy the mon-store to all mons

The ceph-monstore-tool command will rebuild the data in /tmp/mon-store.

Now copy the rebuilt /tmp/mon-store to all the other mon units (e.g. again under /tmp).

On all mon units (where xxx is the unit's hostname; a scripted version follows the list):

- mv /var/lib/ceph/mon/ceph-xxx/store.db /var/lib/ceph/mon/ceph-xxx/store.db.bak
- cp -r /tmp/mon-store/store.db /var/lib/ceph/mon/ceph-xxx/store.db
- chown -R ceph:ceph /var/lib/ceph/mon/ceph-xxx/store.db
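
The same three steps can be run as root on each mon unit with a small script; this sketch assumes the mon id matches the unit's hostname, as it does in this deployment:

# run as root on each mon unit, after /tmp/mon-store has been copied there
d=/var/lib/ceph/mon/ceph-$(hostname)
mv $d/store.db $d/store.db.bak
cp -r /tmp/mon-store/store.db $d/store.db
chown -R ceph:ceph $d/store.db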

Start services

Start the mon and mgr services on all mon units:

systemctl start ceph-mon@juju-c1a2b7-24-lxd-21.service
systemctl start ceph-mgr@juju-c1a2b7-24-lxd-21.service

Start all OSDs

Start all OSDs from the client:

juju run-action --wait ceph-osd/17 start osds=all
juju run-action --wait ceph-osd/16 start osds=all
juju run-action --wait ceph-osd/15 start osds=all

Troubleshooting

Check the status of the cluster:

# ceph -s
  cluster:
    id:     5d26e488-cf89-11eb-8283-00163e78c363
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            67 pgs not deep-scrubbed in time
            67 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum juju-c1a2b7-24-lxd-21,juju-c1a2b7-21-lxd-29,juju-c1a2b7-22-lxd-21 (age 3m)
    mgr: juju-c1a2b7-21-lxd-29(active, since 4m), standbys: juju-c1a2b7-22-lxd-21, juju-c1a2b7-24-lxd-21
    osd: 6 osds: 6 up (since 78s), 6 in (since 3M)

  data:
    pools:   19 pools, 203 pgs
    objects: 5.92k objects, 22 GiB
    usage:   69 GiB used, 23 TiB / 23 TiB avail
    pgs:     203 active+clean

In our case the ceph-osd units got blocked with the message “non-pristine devices detected”. This happened because we had tried to add new devices to ceph-osd:osd-devices while the cluster was down.

Here are some commands to check the status of the devices on the OSD hosts:

# ceph-volume lvm list

====== osd.0 =======

 [block]       /dev/ceph-9eff9b57-a7d7-497d-b0bc-a502e5658e32/osd-block-9eff9b57-a7d7-497d-b0bc-a502e5658e32

   block device              /dev/ceph-9eff9b57-a7d7-497d-b0bc-a502e5658e32/osd-block-9eff9b57-a7d7-497d-b0bc-a502e5658e32
   block uuid                oCfbdk-nKLh-yH2G-R82v-WhbI-cg3Y-Wo3L5A
   cephx lockbox secret
   cluster fsid              5d26e488-cf89-11eb-8283-00163e78c363
   cluster name              ceph
   crush device class        None
   encrypted                 0
   osd fsid                  9eff9b57-a7d7-497d-b0bc-a502e5658e32
   osd id                    0
   osdspec affinity
   type                      block
   vdo                       0
   devices                   /dev/nvme1n1p1


 ...

Another command is lsblk:

# lsblk

...
nvme2n1                                                                                                 259:0    0 11.7T  0 disk
├─nvme2n1p1                                                                                             259:2    0  3.9T  0 part
  └─ceph--5fd8f96e--2ccb--460f--87b8--359ff81cff8a-osd--block--5fd8f96e--2ccb--460f--87b8--359ff81cff8a 253:0    0  3.9T  0 lvm

In our case we needed to zap (i.e. delete everything, including the partition table) the device nvme3n1p1, which had been added but never initialized, and then re-add it to Ceph.

To recover, first fix the keys on the OSD units. On a mon, get the client.bootstrap-osd and client.osd-upgrade keys:

ceph auth get client.bootstrap-osd
ceph auth get client.osd-upgrade

If they are not present, create them with the following commands:

ceph auth get-or-create client.bootstrap-osd mon "allow profile bootstrap-osd"
ceph auth get-or-create client.osd-upgrade mon "allow command \"config-key\"; allow command \"osd tree\"; allow command \"config-key list\"; allow command \"config-key put\"; allow command \"config-key get\"; allow command \"config-key exists\"; allow command \"osd out\"; allow command \"osd in\"; allow command \"osd rm\"; allow command \"auth del\""

Replace the key value in the following files on EACH OSD unit:

/var/lib/ceph/bootstrap-osd/ceph.keyring <---- client.bootstrap-osd key
/var/lib/ceph/osd/ceph.client.osd-upgrade.keyring <----- client.osd-upgrade key

Those keys were created when the new mons were installed, so the copies already present on the OSD units no longer match and must be updated.
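
One possible way to do this from the client is to export the new keys on a mon unit and push them to each OSD unit. The following is only a sketch (unit names are examples); it replaces the whole keyring file rather than editing the key value in place, which should be equivalent as long as the entity names match:

# export the new keys from a mon unit
juju ssh ceph-mon/51 'sudo ceph auth get client.bootstrap-osd' > bootstrap-osd.keyring
juju ssh ceph-mon/51 'sudo ceph auth get client.osd-upgrade' > osd-upgrade.keyring

# copy and install them on an OSD unit (repeat for every OSD unit)
juju scp bootstrap-osd.keyring osd-upgrade.keyring ceph-osd/15:/tmp/
juju ssh ceph-osd/15 'sudo install -m 600 /tmp/bootstrap-osd.keyring /var/lib/ceph/bootstrap-osd/ceph.keyring'
juju ssh ceph-osd/15 'sudo install -m 600 /tmp/osd-upgrade.keyring /var/lib/ceph/osd/ceph.client.osd-upgrade.keyring'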

Now, FOR EACH OSD:

  • ssh to the OSD unit and zap nvme3n1p1:

    !!! Please note this will DESTROY all data on nvme3n1p1 and CANNOT be recovered. !!!
    !!! Please double check before proceeding. !!!
    
    ceph-volume lvm zap /dev/nvme3n1p1 --destroy
    
  • recreate the partition:

    parted -a optimal /dev/nvme3n1 mkpart primary 0% 4268G
    
  • Now go back to the Juju client machine and use juju to remove the device from juju’s internal db:

    juju run-action --wait ceph-osd/X zap-disk devices=/dev/nvme3n1p1 i-really-mean-it=true
    

Then check juju status.

juju status can be forced to update with:

juju run --unit ceph-osd/X 'hooks/update-status'

At this stage the ceph-osd unit status should be back to normal (green).

If not, run the following commands:

juju run-action --wait ceph-osd/21 zap-disk devices=/dev/nvme3n1p1 i-really-mean-it=true
juju run-action --wait ceph-osd/21 add-disk osd-devices=/dev/nvme3n1p1
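
Finally, confirm from a mon unit that the re-added OSD is back in the tree and that the cluster is healthy again:

ceph osd tree
ceph -s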