Ceph: bootstrap cluster after catastrophic event

This file provides some hints on how to restart a whole cluster after a catastrophic event, such as a site-wide power cut.

Make sure all MON nodes listed in the Ceph configuration file under the “mon_initial_members” key are up.
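To check that the monitors are running and have formed a quorum, something like the following should work (assuming a systemd-based deployment where the MON unit is named after the short hostname):

$ systemctl status ceph-mon@$(hostname -s)
$ ceph --cluster <cluster_name> mon stat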

As soon as possible, prevent Ceph from rebalancing:

$ ceph --cluster <cluster_name> osd set noout
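You can confirm the flag took effect by looking at the OSD map flags, for example:

$ ceph --cluster <cluster_name> osd dump | grep flags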

If needed, restart the OSD servers one by one, resolving any boot issues as they arise.
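To see which OSDs are down and on which hosts, the CRUSH tree view is handy; on a systemd-based server, all OSD daemons can also be started at once (the target name assumes a standard packaged installation):

$ ceph --cluster <cluster_name> osd tree
$ systemctl start ceph-osd.target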

Make sure that all, or most, of the OSD disks belonging to each server are up and running, if necessary by executing:

$ ceph-disk activate-all

on each OSD server, as explained in “Ceph-OSD fail to start at boot” in this section.
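You can then check how many OSDs are up and in, for example:

$ ceph --cluster <cluster_name> osd stat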

Clear the “noout” flag once all problems have been fixed, or when only a small fraction of problematic OSDs remains:

$ ceph --cluster <cluster_name> osd unset noout
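Once the flag is cleared, Ceph will mark any OSDs that are still down as “out” after the configured grace period and start recovering the affected data; progress can be followed with, for example:

$ ceph --cluster <cluster_name> -w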

After a while your cluster should reach state “HEALTH_OK”, or at least all PGs should be in state “active” (some PGs may carry additional flags such as “scrubbing”, “degraded”, …).
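The overall state and the per-PG breakdown can be inspected with, for example:

$ ceph --cluster <cluster_name> health detail
$ ceph --cluster <cluster_name> pg stat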

If some OSD disks are still in trouble, you may want to consider removing them from the cluster (see “Ceph-OSD replacing a failed disk”).
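As a rough sketch only (the full procedure is described in “Ceph-OSD replacing a failed disk”; <id> stands for the number of the failed OSD), removal typically looks like:

$ ceph --cluster <cluster_name> osd out osd.<id>
$ ceph --cluster <cluster_name> osd crush remove osd.<id>
$ ceph --cluster <cluster_name> auth del osd.<id>
$ ceph --cluster <cluster_name> osd rm osd.<id>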