OpenStack Release Upgrade

Instructions for upgrading an OpenStack cloud deployed with Juju.

These are the steps for an upgrade from the Mitaka to the Newton OpenStack release, adapted from [1].

For upgrading to Ocata see the release notes.

Warning

For Ocata use openstack-origin=cloud:xenial-ocata

For upgrading to Pike see the release notes.

Warning

For Pike use openstack-origin=cloud:xenial-pike

For upgrading to Queens see the release notes.

Warning

For Queens use openstack-origin=cloud:xenial-queens

For upgrading OpenStack Juju charms see [2].

To align with the naming suggested by the OpenStack documentation, as a preliminary step we rename the “Service Project” to services:

$ juju config keystone service-tenant=services

Check the status of the services, in particular Keystone:

$ juju status keystone
Model      Controller     Cloud/Region   Version
cloudbase  garrmaas       garr/Bari      2.1.2

App                         Version  Status   Scale  Charm      Store       Rev  OS      Notes
defaultgw-ba1-cl2                    active       3  defaultgw  jujucharms    6  ubuntu
keystone-ba1-cl2            9.3.0    active       3  keystone   local         3  ubuntu
keystone-hacluster-ba1-cl2           active       3  hacluster  jujucharms   33  ubuntu
nrpe-keystone-ba1-cl2                unknown      3  nrpe       jujucharms   21  ubuntu

Charm upgrades

Ensure the nrpe-xxx charms use at least revision nrpe-30.
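
The current revision appears in the Rev column of juju status; if it is older, upgrade the charm. A minimal sketch (the application name nrpe-keystone-ba1-cl2 is taken from the status output above and may differ in your deployment):

$ juju status nrpe-keystone-ba1-cl2
$ juju upgrade-charm nrpe-keystone-ba1-cl2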

For Ocata we are using the 17.08 release of the OpenStack Charms.

$ juju upgrade-charm nova-cloud-controller
$ juju upgrade-charm openstack-dashboard
$ juju upgrade-charm ceph-radosgw
$ juju upgrade-charm percona-cluster
$ juju upgrade-charm neutron-api-hacluster
$ juju upgrade-charm neutron-gateway
$ juju upgrade-charm neutron-ovs
$ juju upgrade-charm nova-compute
$ juju upgrade-charm ceilometer
$ juju upgrade-charm ceilometer-agent
$ juju upgrade-charm ceph-proxy
$ juju upgrade-charm cinder
$ juju upgrade-charm cinder-ceph
$ juju upgrade-charm glance
$ juju upgrade-charm gnocchi
$ juju upgrade-charm memcached
$ juju upgrade-charm nagios
$ juju upgrade-charm neutron-api
$ juju upgrade-charm ntp
$ juju upgrade-charm postgresql
$ juju upgrade-charm rabbitmq-server

Clean the database

Remove expired tokens from the Keystone database. First find the leader:

$ juju run --application keystone is-leader

Assuming the leader unit is L:

$ juju ssh keystone/$L sudo keystone-manage token_flush
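
For convenience you can store the leader's unit number in a shell variable, so that the commands in this guide can be used verbatim (the value below is only an example; use the unit that reported True to is-leader):

$ L=1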

Upgrading the OpenStack Services

In a rolling upgrade of an OpenStack service, the units of the service are upgraded one at a time, rolling the update across the service.

This is the procedure to perform the upgrade on each service:

  1. configure the charm of the service for managed upgrade
  2. pause the services on the leader unit in a cluster
  3. perform the upgrade
  4. resume the services on the leader unit

Step 1 exploits the openstack-origin configuration option, which specifies the repository from which the upgraded packages for a service are downloaded. Its value is changed with the juju config command, as in the generic sketch below.
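
A generic sketch of the whole procedure (myservice is a placeholder application name and L the unit number of the leader; the concrete commands for each service follow in the sections below):

$ juju config myservice action-managed-upgrade=true
$ juju config myservice openstack-origin=cloud:xenial-queens
$ juju run-action myservice/$L --wait pause
$ juju run-action myservice/$L --wait openstack-upgrade
$ juju run-action myservice/$L resume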

Keystone

To speed up the upgrade, temporarily disable saml2:

$ juju config keystone enable-saml2=false

Find the leader:

$ juju run --application keystone is-leader

Assuming that the leader is L, stop the identity service:

$ juju run-action keystone/$L --wait pause

Configure for the upgrade:

$ juju config keystone action-managed-upgrade=true

Set the origin for the upgrade:

$ juju config keystone openstack-origin=cloud:xenial-queens

Launch the upgrade:

$ juju run-action keystone/$L --wait openstack-upgrade

Resume the service:

$ juju run-action keystone/$L resume

Repeat the process for the other units of the service:

$ for i in {0..2}; do
    juju run-action keystone/$i --wait pause;
    juju run-action keystone/$i --wait openstack-upgrade;
    juju run-action keystone/$i resume;
done

Warning

If you get the following error in apache2.log of the leader unit:

InternalError (1054, "Unknown column 'user.created_at' in 'field list'")

you need to manually upgrade the Keystone database.

Pause the leader unit of the service, run the migrations, then resume:

$ juju run-action keystone/$L --wait pause
$ juju ssh keystone/$L sudo -u keystone keystone-manage --config-file /etc/keystone/keystone.conf db_sync --expand
$ juju ssh keystone/$L sudo -u keystone keystone-manage --config-file /etc/keystone/keystone.conf db_sync --migrate
$ juju ssh keystone/$L sudo -u keystone keystone-manage --config-file /etc/keystone/keystone.conf db_sync --contract
$ juju run-action keystone/$L resume

Check the status:

$ juju status keystone
Model      Controller     Cloud/Region   Version
cloudbase  garrmaas       garr/Bari      2.1.2

App                         Version  Status   Scale  Charm      Store       Rev  OS      Notes
defaultgw-ba1-cl2                    active       3  defaultgw  jujucharms    6  ubuntu
keystone-ba1-cl2            10.0.1   active       3  keystone   local         3  ubuntu
keystone-hacluster-ba1-cl2           active       3  hacluster  jujucharms   33  ubuntu
nrpe-keystone-ba1-cl2                unknown      3  nrpe       jujucharms   21  ubuntu

Re-enable saml2:

$ juju config keystone enable-saml2=true

Workaround for incomplete relations

With Juju 2.2 a problem occurs when creating new relations after the upgrade. For the moment, Canonical suggests the following workaround.

Find the ID of the relation between percona-cluster and keystone:

$ juju run --unit keystone-ba1-cl2/$L "relation-ids shared-db"
shared-db:33

Set allowed_units on the relation:

$ juju run --unit keystone/$L "relation-set -r shared-db:33 allowed_units='keystone/0 keystone/1 keystone/2'"

Check the results:

$ juju run --unit keystone/$L "relation-get -r shared-db:33 - keystone/0"
allowed_units: keystone/0 keystone/1 keystone/2
database: keystone
hostname: 10.4.4.153
private-address: 10.4.4.153
username: keystone

Reset incomplete relations:

$ juju remove-relation glance keystone

Reset keystone:

$ juju resolved --no-retry keystone/$L

Wait for it to come back and then:

$ juju add-relation glance keystone

All other relations should now be established correctly. Check it with:

$ juju status glance

RabbitMQ

If upgrading the charm fails, you may need to do this:

$ juju ssh rabbitmq-server/$L sudo /bin/mkdir -p /usr/local/lib/nagios/plugins

on all units (a loop over all units is sketched below).
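
Assuming three rabbitmq-server units numbered 0..2, as elsewhere in this guide:

$ for i in {0..2}; do
    juju ssh rabbitmq-server/$i sudo /bin/mkdir -p /usr/local/lib/nagios/plugins;
done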

Glance

$ juju config glance action-managed-upgrade=true
$ juju config glance openstack-origin=cloud:xenial-queens

Find the leader:

$ juju run --application glance is-leader

Assuming that the leader is L:

$ juju run-action glance/$L --wait pause
$ juju run-action glance/$L --wait openstack-upgrade
$ juju run-action glance/$L resume

Repeat on the other units of the service:

$ for i in {0..2}; do
    juju run-action glance/$i --wait pause;
    juju run-action glance/$i --wait openstack-upgrade;
    juju run-action glance/$i resume;
done

Ceph

Upgrade the ceph-proxy charm:

$ juju upgrade-charm ceph-proxy

Cinder

$ juju config cinder action-managed-upgrade=true
$ juju config cinder openstack-origin=cloud:xenial-queens

Find the leader, and then apply this on the leader unit L:

$ juju run-action cinder/$L --wait pause
$ juju run-action cinder/$L --wait openstack-upgrade
$ juju run-action cinder/$L resume

Repeat on the other units of the service:

$ for i in {0..2}; do
    juju run-action cinder/$i --wait pause;
    juju run-action cinder/$i --wait openstack-upgrade;
    juju run-action cinder/$i resume;
done

For Ocata and Pike, there is a bug still being investigated. As a temporary fix, on each unit apply the following patch to the file /usr/lib/python2.7/dist-packages/oslo_messaging/_utils.py (note: we did not observe this issue when upgrading from Ocata to Pike):

*** _utils.py.orig   2017-10-05 15:39:26.728073723 +0000
--- _utils.py        2017-10-05 15:42:08.308323044 +0000
***************
*** 20,25 ****
--- 20,28 ----
      :param imp_version: The version implemented
      :param version: The version requested by an incoming message.
      """
+     # Attardi: workaround for error in cinder-scheduler: "Requested message version, 3.0 is incompatible."
+     return True
+
      if imp_version is None:
          return True

and then issue:

$ for i in {0..2}; do \
     juju ssh cinder/$i sudo service cinder-scheduler restart;
  done

Cinder Ceph

No OpenStack upgrade is currently available, but check by doing:

$ juju config cinder-ceph action-managed-upgrade=true
$ juju config cinder-ceph openstack-origin=cloud:xenial-queens

If present, apply the same procedure as above.

Rados GW

No OpenStack upgrade is currently available, but check by doing:

$ juju config rados-gw action-managed-upgrade=true
$ juju config rados-gw openstack-origin=cloud:xenial-queens

If present, apply the same procedure as above.

Nova Controller

When upgrading to xenial-ocata you need to do this first:

$ juju upgrade-charm nova-cloud-controller

Upgrade OpenStack:

$ juju config nova-cloud-controller action-managed-upgrade=true
$ juju config nova-cloud-controller openstack-origin=cloud:xenial-queens

On the leader unit L do:

$ juju run-action nova-cloud-controller/$L --wait pause
$ juju run-action nova-cloud-controller/$L --wait openstack-upgrade

Upgrade the database on the leader unit L of the service (Mitaka to Newton only):

$ juju ssh nova-cloud-controller/$L sudo nova-manage db sync
$ juju ssh nova-cloud-controller/$L sudo nova-manage api_db sync;
$ juju ssh nova-cloud-controller/$L sudo nova-manage db online_data_migrations;
$ juju ssh nova-cloud-controller/$L sudo service nova-api-os-compute restart;
$ juju ssh nova-cloud-controller/$L sudo service nova-consoleauth restart;
$ juju ssh nova-cloud-controller/$L sudo service nova-scheduler restart;
$ juju ssh nova-cloud-controller/$L sudo service nova-conductor restart;
$ juju ssh nova-cloud-controller/$L sudo service nova-novncproxy restart;
$ juju run-action nova-cloud-controller/$L resume

Nova Compute

Ensure the presence of a relation between nova-compute and percona-cluster:

$ juju add-relation nova-compute percona-cluster

For upgrading to Ocata, also do this:

$ juju add-relation nova-compute cinder-ceph

Perform the upgrade:

$ juju config nova-compute action-managed-upgrade=true
$ juju config nova-compute openstack-origin=cloud:xenial-queens

On the leader unit L do:

$ juju run-action nova-compute/$L --wait pause
$ juju run-action nova-compute/$L --wait openstack-upgrade
$ juju run-action nova-compute/$L resume

Warning

If you get the following error:

juju run-action nova-compute/$L --wait pause
action-id: <id action>
message: exit status 1
status: failed

you need to set the action-managed-upgrade configuration parameter to false and then back to true, and retry the action.
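
For example, for nova-compute:

$ juju config nova-compute action-managed-upgrade=false
$ juju config nova-compute action-managed-upgrade=true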

Complete the upgrade on the other units:

$ for i in {0..2}; do
  juju run-action nova-compute/$i --wait pause;
  juju run-action nova-compute/$i --wait openstack-upgrade;
  juju run-action nova-compute/$i resume;
done

N.B. While upgrading our production cluster, one of our 25 compute nodes failed the upgrade with a “config-changed” error.

Looking at the log of juju-unit-nova-compute, we found that the error was caused by a failing virsh command:

DEBUG config-changed subprocess.CalledProcessError: Command '['virsh', '-c', 'qemu:///system', 'secret-list']' returned non-zero exit status 1
ERROR juju.worker.uniter.operation runhook.go:107 hook "config-changed" failed: exit status 1

and after some debugging we noted that the file /etc/libvirt/libvirtd.conf was empty!

So, we deduced that something went wrong during the installation of the new packages. The solution was to:

- uninstall and re-install libvirt-bin
- uninstall and reinstall nova-libvirt
- reboot the node

After the reboot, all the instances on the hypervisor were shut down; restarting them with nova was enough to bring them back up.

Neutron API

$ juju config neutron-api action-managed-upgrade=true
$ juju config neutron-api openstack-origin=cloud:xenial-queens

On the leader unit L do:

$ juju run-action neutron-api/$L --wait pause;
$ juju run-action neutron-api/$L --wait openstack-upgrade;
$ juju run-action neutron-api/$L resume;

Complete the upgrade on the other units:

$ for i in {0..2}; do
    juju run-action neutron-api/$i --wait pause;
    juju run-action neutron-api/$i --wait openstack-upgrade;
    juju run-action neutron-api/$i resume;
done

Neutron Gateway

$ juju config neutron-gateway action-managed-upgrade=true
$ juju config neutron-gateway openstack-origin=cloud:xenial-queens
$ juju run-action neutron-gateway/$L openstack-upgrade


Warning

If you get the following error:

juju run-action neutron-gateway/$L openstack-upgrade
action-id: <id action>
message: exit status 1
status: failed

you need to set the action-managed-upgrade configuration parameter to false and then back to true, as described above for nova-compute.

Note on upgrading the neutron-gateway charm: we enabled neutron HA on our two neutron-gateway servers, and after upgrading the charm the routers were no longer reachable. The reason was that the routers were in standby state on both servers:

$ neutron l3-agent-list-hosting-router admin-router-test
+--------------------------------------+------------+----------------+-------+----------+
| id                                   | host       | admin_state_up | alive | ha_state |
+--------------------------------------+------------+----------------+-------+----------+
| 775a4d38-d6b7-4f6d-a34b-e5327563e2e5 | pa1-r2-s09 | True           | :-)   | standby  |
| 9f9a2fe0-561f-43e8-aa94-1f3224266f06 | pa1-r1-s11 | True           | :-)   | standby  |
+--------------------------------------+------------+----------------+-------+----------+

We solved the issue by disabling and then re-enabling HA on the router:

$ neutron router-update admin-router-test --admin_state_up=false
$ neutron router-update admin-router-test --ha=false
$ neutron router-update admin-router-test --admin_state_up=true

$ neutron router-update admin-router-test --admin_state_up=false
$ neutron router-update admin-router-test --ha=true
$ neutron router-update admin-router-test --admin_state_up=true

OpenStack Dashboard

When upgrading to xenial-ocata you need to do this first:

$ juju upgrade-charm openstack-dashboard

Upgrade openstack-dashboard:

$ juju config openstack-dashboard action-managed-upgrade=true
$ juju config openstack-dashboard openstack-origin=cloud:xenial-queens
$ for i in {0..2}; do
   juju run-action openstack-dashboard/$i --wait pause;
   juju run-action openstack-dashboard/$i --wait openstack-upgrade;
   juju run-action openstack-dashboard/$i resume;
done

Apply the patch for this bug, which otherwise causes an error in Apache:

https://bugs.launchpad.net/charm-openstack-dashboard/+bug/1678014

Ceilometer

$ juju config ceilometer-agent action-managed-upgrade=true
$ juju config ceilometer-agent openstack-origin=cloud:xenial-queens
$ juju run-action ceilometer-agent/$L openstack-upgrade

$ juju config ceilometer action-managed-upgrade=true
$ juju config ceilometer openstack-origin=cloud:xenial-queens
$ juju run-action ceilometer/$L openstack-upgrade

Gnocchi (csd-garr charm)

The Gnocchi charm has the openstack-origin option, but it does not have the action-managed-upgrade option. As a consequence, it is not possible to perform a managed upgrade of the service to Pike.

General troubleshooting

In the following we outline problems that may appear on any service during maintenance operations.

Message queue (Rabbit)

The connection between a service and the message queue may break during the upgrade process. It is advisable to log into the service units after the upgrade and check the logs for error messages. Problems with the message queue are reported as “… ERROR oslo_messaging … connection timed out | connection refused | host unreachable”.
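
For example, to scan a unit's service logs for such errors (the unit name and log directory below are illustrative; adjust them to the service being checked):

$ juju ssh nova-compute/0 "sudo grep -R 'ERROR oslo_messaging' /var/log/nova/"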

In general the solution is to log into the rabbit units and restart rabbitmq-server:

$ service rabbitmq-server restart
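
Equivalently, from the Juju client, assuming three rabbitmq-server units numbered 0..2:

$ for i in {0..2}; do
    juju ssh rabbitmq-server/$i sudo service rabbitmq-server restart;
done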

Afterwards check that the rabbitmq cluster is active:

$ rabbitmqctl cluster_status

  Cluster status of node 'rabbit@juju-08eaf8-91-lxd-54' ...
  [{nodes,[{disc,['rabbit@juju-08eaf8-156-lxd-9','rabbit@juju-08eaf8-90-lxd-63',
           'rabbit@juju-08eaf8-91-lxd-54']}]},
  {running_nodes,['rabbit@juju-08eaf8-156-lxd-9',
            'rabbit@juju-08eaf8-90-lxd-63',
            'rabbit@juju-08eaf8-91-lxd-54']},
  {cluster_name,<<"rabbit@juju-08eaf8-128-lxd-48.maas">>},
  {partitions,[]}]

In this case we see 3 running nodes, i.e. the cluster is complete.

We once ran into a more complicated issue: RabbitMQ lost the user corresponding to a service (glance). To check users and permissions, run the following commands, whose output we show here for a healthy cluster:

$ rabbitmqctl list_users
Listing users ...
cinder     []
glance     []
guest      [administrator]
nagios-rabbitmq-server-ct1-cl1-15  []
nagios-rabbitmq-server-ct1-cl1-16  []
nagios-rabbitmq-server-ct1-cl1-17  []
nagios-rabbitmq-server-ct1-cl1-18  []
nagios-rabbitmq-server-ct1-cl1-19  []
neutron    []
nova       []

$ rabbitmqctl list_user_permissions glance
 Listing permissions for user "glance" ...
 openstack .*      .*      .*

The solution in this case was to remove and re-add the relation between glance and rabbitmq-server, which reconfigured the services correctly.
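
A sketch of that fix (assuming the applications are named glance and rabbitmq-server in your model; wait for the relation removal to complete before re-adding it):

$ juju remove-relation glance rabbitmq-server
$ juju add-relation glance rabbitmq-server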