Ceph: restore cluster after losing all ceph-mon’s¶
Here we describe how to restore a Ceph cluster after a disaster in which all ceph-mon’s are lost. We obviously assume that the data on the OSD devices is preserved!
The procedure refers to a Ceph cluster created with Juju.
Suppose that you have lost (or removed by mistake) all ceph-mon’s. We start by recreating them, i.e. we deploy three new ceph-mon units; unfortunately, the new mon’s know nothing about the OSD’s.
Stop all OSDs¶
N.B. Change the unit IDs according to your cluster:
juju run-action --wait ceph-osd/15 stop osds=all
juju run-action --wait ceph-osd/16 stop osds=all
juju run-action --wait ceph-osd/17 stop osds=all
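Optionally, double-check that no ceph-osd process is left running on the OSD hosts. This is just a sketch (the unit ID is an example):
juju ssh ceph-osd/15 'pgrep -a ceph-osd || echo "no ceph-osd running"'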
Stop all MONs and MGRs¶
SSH to each MON and use systemctl to stop the mon and mgr services, e.g.:
systemctl stop ceph-mon@juju-c1a2b7-24-lxd-21.service
systemctl stop ceph-mgr@juju-c1a2b7-24-lxd-21.service
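Alternatively, on recent Ceph packages the systemd targets stop every mon/mgr daemon on the host at once (a shortcut, not required by the procedure):
systemctl stop ceph-mon.target
systemctl stop ceph-mgr.target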
Rebuild OSD mon map¶
On the juju client machine, rebuild the old mon map from the existing OSDs with the following script, which saves the result to ./mon-store:
ms=./mon-store
rm -rf $ms
mkdir $ms
hosts=`juju status ceph-osd | grep 'ceph-osd/' | awk '{print $1}' | cut -d'*' -f1`
# collect the cluster map from stopped OSDs
for host in $hosts; do
    echo $host
    juju ssh $host sudo rm -rf /tmp/$ms
    juju scp -- -r $ms $host:/tmp
    juju ssh $host sudo bash <<EOF
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-objectstore-tool --data-path \$osd --no-mon-config --op update-mon-db --mon-store-path /tmp/$ms
done
chmod -R +r /tmp/$ms
EOF
    rm -rf $ms
    juju scp -- -r $host:/tmp/$ms .
done
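As a quick sanity check, the collected store should not be empty; at least a store.db directory is expected inside it (the exact contents depend on the Ceph release):
ls -l ./mon-store
du -sh ./mon-store/store.db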
Keyring creation¶
Create a keyring file with the following data from the new mon’s (a sketch of the copy commands follows the list):
admin key: copy the content of /etc/ceph/ceph.client.admin.keyring from any mon;
mon key: copy the content of /var/lib/ceph/mon/ceph-xxx/keyring from any mon;
mgr keys: copy the content of /var/lib/ceph/mgr/ceph-xxx/keyring from each mon.
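A possible way to collect these pieces from the juju client, assuming (as for Juju-deployed mons) that the mon/mgr id equals the unit hostname, and using the unit IDs of this example; check the result against the sample below and add any missing caps lines by hand:
# mon key (identical on every mon) and admin key
juju ssh ceph-mon/51 'sudo cat /var/lib/ceph/mon/ceph-$(hostname)/keyring' >> keyring
juju ssh ceph-mon/51 'sudo cat /etc/ceph/ceph.client.admin.keyring' >> keyring
# one mgr key per mon unit
for u in ceph-mon/49 ceph-mon/50 ceph-mon/51; do
    juju ssh $u 'sudo cat /var/lib/ceph/mgr/ceph-$(hostname)/keyring' >> keyring
done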
The keyring file will look like the following:
[mon.]
key = AQAr9TphSxvCFxAACOS8KkIROPsvgVCfcFjh1Q==
caps mon = "allow *"
[client.admin]
key = AQAmBTthVM5wEhAALJS9IEVTuRKiHYRUztxgng==
caps mds = "allow *"
caps mgr = "allow *"
caps mon = "allow *"
caps osd = "allow *"
[mgr.juju-74bc65-21-lxd-49]
key = AQAvBTth4HObLhAAlM140CBeuYrLRhxnuSwdKQ==
caps mds = "allow *"
caps mon = "allow profile mgr"
caps osd = "allow *"
[mgr.juju-74bc65-21-lxd-50]
key = AQAvBTthqwQ5LRAAISUJt9j4Qb3MZ5jn2B1SwQ==
caps mds = "allow *"
caps mon = "allow profile mgr"
caps osd = "allow *"
[mgr.juju-74bc65-21-lxd-51]
key = AQAtBTthU36NJBAAjMhYPoPcdDh5L6Coj2grqw==
caps mds = "allow *"
caps mon = "allow profile mgr"
caps osd = "allow *"
N.B. If needed, add the proper permissions to the mgr keys:
ceph-authtool keyring -n mgr.juju-74bc65-21-lxd-49 --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
ceph-authtool keyring -n mgr.juju-74bc65-21-lxd-50 --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
ceph-authtool keyring -n mgr.juju-74bc65-21-lxd-51 --cap mon 'allow profile mgr' --cap osd 'allow *' --cap mds 'allow *'
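To double-check that every entry in the keyring file has the expected caps, list its contents:
ceph-authtool keyring -l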
Rebuild mon store¶
Copy the local ./mon-store directory and the keyring file from the juju client machine to a mon unit, e.g. under /tmp:
juju scp -- -r mon-store keyring ceph-mon/51:/tmp
Log on to that mon unit and rebuild the mon store with the following command:
ceph-monstore-tool /tmp/mon-store/ rebuild -- --keyring /tmp/keyring --mon-ids juju-74bc65-21-lxd-49 juju-74bc65-21-lxd-50 juju-74bc65-21-lxd-51
Important: The order of --mon-ids matters: it needs to match the “mon host” order in /etc/ceph/ceph.conf on the mon units, otherwise the mon’s won’t start!
Therefore, look for the IP addresses in /etc/ceph/ceph.conf, e.g.:
mon host = 10.7.5.131 10.7.5.132 10.7.5.133
Compare the order with the IP addresses shown by “juju status ceph-mon”:
# juju status ceph-mon
...
Unit Workload Agent Machine Public address Ports Message
ceph-mon/49 active idle 21/lxd/29 10.7.5.132 Unit is ready and clustered
ceph-mon/50* active idle 22/lxd/21 10.7.5.133 Unit is ready and clustered
ceph-mon/51 active idle 24/lxd/21 10.7.5.131 Unit is ready and clustered
In this example, 10.7.5.131 is the address of 24/lxd/21, 10.7.5.132 is 21/lxd/29, 10.7.5.133 is 22/lxd/21, so –mon-ids must be:
--mon-ids juju-c1a2b7-24-lxd-21 juju-c1a2b7-21-lxd-29 juju-c1a2b7-22-lxd-21
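If in doubt, the mapping between mon IPs and hostnames can be checked with a quick loop from the juju client (unit IDs are the ones of this example):
for u in ceph-mon/49 ceph-mon/50 ceph-mon/51; do
    juju ssh $u 'echo "$(hostname) $(hostname -I)"'
done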
Copy mon-store to all mon’s¶
The ceph-monstore-tool command will rebuild the data in /tmp/mon-store.
Now copy the rebuilt /tmp/mon-store to all the other mon units (e.g. again under /tmp).
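For example, from the juju client (unit IDs as above; make sure the rebuilt files are readable by the ubuntu user before copying, e.g. with chmod -R +r /tmp/mon-store on the mon unit):
# pull the rebuilt store back from the unit where it was rebuilt
# (remove the local pre-rebuild copy first, so scp does not nest the directories)
rm -rf mon-store
juju scp -- -r ceph-mon/51:/tmp/mon-store .
# push it to the remaining mon units
juju scp -- -r mon-store ceph-mon/49:/tmp
juju scp -- -r mon-store ceph-mon/50:/tmp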
Then, on all mon units (a scripted variant is sketched after the list):
- mv /var/lib/ceph/mon/ceph-xxx/store.db /var/lib/ceph/mon/ceph-xxx/store.db.bak
- cp -r /tmp/mon-store/store.db /var/lib/ceph/mon/ceph-xxx/store.db
- chown ceph:ceph -R /var/lib/ceph/mon/ceph-xxx/store.db
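A sketch of doing the same from the juju client in one pass, assuming the mon id equals the unit hostname (as for Juju-deployed mons) and using the unit IDs of this example:
for u in ceph-mon/49 ceph-mon/50 ceph-mon/51; do
    juju ssh $u sudo bash <<'EOF'
set -e
mondir=/var/lib/ceph/mon/ceph-$(hostname)
mv $mondir/store.db $mondir/store.db.bak
cp -r /tmp/mon-store/store.db $mondir/store.db
chown -R ceph:ceph $mondir/store.db
EOF
done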
Start services¶
Start all mon and mgr services on the mon units:
systemctl start ceph-mon@juju-c1a2b7-24-lxd-21.service
systemctl start ceph-mgr@juju-c1a2b7-24-lxd-21.service
Start all OSDs¶
Start all OSDs from the juju client:
juju run-action --wait ceph-osd/17 start osds=all
juju run-action --wait ceph-osd/16 start osds=all
juju run-action --wait ceph-osd/15 start osds=all
Troubleshooting¶
Check the status of the cluster:
# ceph -s
  cluster:
    id:     5d26e488-cf89-11eb-8283-00163e78c363
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            67 pgs not deep-scrubbed in time
            67 pgs not scrubbed in time

  services:
    mon: 3 daemons, quorum juju-c1a2b7-24-lxd-21,juju-c1a2b7-21-lxd-29,juju-c1a2b7-22-lxd-21 (age 3m)
    mgr: juju-c1a2b7-21-lxd-29(active, since 4m), standbys: juju-c1a2b7-22-lxd-21, juju-c1a2b7-24-lxd-21
    osd: 6 osds: 6 up (since 78s), 6 in (since 3M)

  data:
    pools:   19 pools, 203 pgs
    objects: 5.92k objects, 22 GiB
    usage:   69 GiB used, 23 TiB / 23 TiB avail
    pgs:     203 active+clean
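The “mons are allowing insecure global_id reclaim” warning is a general post-Nautilus security warning, not an effect of the recovery; once all clients are up to date it can be cleared with (verify this is appropriate for your release and clients):
ceph config set mon auth_allow_insecure_global_id_reclaim false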
In our case the ceph-osd’s got blocked with the message “non-pristine devices detected”. This happened because we had tried to add new devices to ceph-osd:osd-devices while the cluster was down.
Here are some commands to check the status of the devices on the OSD hosts:
# ceph-volume lvm list
====== osd.0 =======

  [block]       /dev/ceph-9eff9b57-a7d7-497d-b0bc-a502e5658e32/osd-block-9eff9b57-a7d7-497d-b0bc-a502e5658e32

      block device              /dev/ceph-9eff9b57-a7d7-497d-b0bc-a502e5658e32/osd-block-9eff9b57-a7d7-497d-b0bc-a502e5658e32
      block uuid                oCfbdk-nKLh-yH2G-R82v-WhbI-cg3Y-Wo3L5A
      cephx lockbox secret
      cluster fsid              5d26e488-cf89-11eb-8283-00163e78c363
      cluster name              ceph
      crush device class        None
      encrypted                 0
      osd fsid                  9eff9b57-a7d7-497d-b0bc-a502e5658e32
      osd id                    0
      osdspec affinity
      type                      block
      vdo                       0
      devices                   /dev/nvme1n1p1
...
Another command is lsblk:
# lsblk
...
nvme2n1     259:0   0  11.7T  0 disk
├─nvme2n1p1 259:2   0   3.9T  0 part
└─ceph--5fd8f96e--2ccb--460f--87b8--359ff81cff8a-osd--block--5fd8f96e--2ccb--460f--87b8--359ff81cff8a 253:0 0 3.9T 0 lvm
In our case we needed to zap (i.e. delete everything, including the partition table) the device nvme3n1p1, the device that had been added but not initialized, and then re-add it to Ceph.
To recover, first fix the keys on the OSD units. On a mon, get the client.bootstrap-osd and client.osd-upgrade keys:
ceph auth get client.bootstrap-osd
ceph auth get client.osd-upgrade
If they are not present, create them with the following commands:
ceph auth get-or-create client.bootstrap-osd mon "allow profile bootstrap-osd"
ceph auth get-or-create client.osd-upgrade mon "allow command \"config-key\"; allow command \"osd tree\"; allow command \"config-key list\"; allow command \"config-key put\"; allow command \"config-key get\"; allow command \"config-key exists\"; allow command \"osd out\"; allow command \"osd in\"; allow command \"osd rm\"; allow command \"auth del\""
Replace the key value in the following files on EACH OSD unit:
/var/lib/ceph/bootstrap-osd/ceph.keyring <---- client.bootstrap-osd key
/var/lib/ceph/osd/ceph.client.osd-upgrade.keyring <----- client.osd-upgrade key
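To print just the key values to paste after “key =” in those files, you can use ceph auth get-key on a mon:
ceph auth get-key client.bootstrap-osd
ceph auth get-key client.osd-upgrade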
Those keys were created when the new mons were installed, which is why the old values stored on the OSD units must be replaced.
Now, FOR EACH OSD:
SSH to the OSD unit and zap nvme3n1p1:
!!! Please note this will DESTROY all data on nvme3n1p1 and CANNOT be recovered. !!!
!!! Please double check before proceeding. !!!
ceph-volume lvm zap /dev/nvme3n1p1 --destroy
Recreate the partition:
parted -a optimal /dev/nvme3n1 mkpart primary 0% 4268G
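You can verify the new partition with lsblk before handing it back to Ceph:
lsblk /dev/nvme3n1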
Now go back to the juju client machine and use juju to remove the device from juju’s internal db:
juju run-action --wait ceph-osd/X zap-disk devices=/dev/nvme3n1p1 i-really-mean-it=true
Then check juju status.
juju status can be forced to update with:
juju run --unit ceph-osd/X 'hooks/update-status'
At this stage the OSD status should be back to normal (green).
If not, run the following commands:
juju run-action --wait ceph-osd/21 zap-disk devices=/dev/nvme3n1p1 i-really-mean-it=true
juju run-action ceph-osd/21 add-disk osd-devices="/dev/nvme3n1p1"