February 8, 2019
As of OpenShift Container Platform 3.10 etcd is expected to run in static pods on the master nodes in the control plane. You may have a deployed an HA cluster with dedicated etcd nodes managed with systemd. How do you migrate the this new architecture?
Assumptions:
- You are running OCP 3.9
- You have multiple Master nodes
- You have dedicated Etcd nodes
- You are running RHEL, not Atomic nodes
Outline:
- Backup etcd
- Scale up Etcd cluster to include Master nodes
- Configure Openshift Masters to ignore the old Etcd nodes
- Scale down etcd cluster to remove old Etcd nodes
Detailed Steps
Follow along in this document https://docs.openshift.com/container-platform/3.9/admin_guide/assembly_replace-etcd-member.html You may find some etcd aliases handy before proceeding.
Create an
new_etcdansible group in your inventory file.Add the first Master node to this
new_etcdgroup for testing.Add
new_etcdgroup as a child to theOSEv3ansible group.Confirm your cluster health on the first etcd server.
#!/bin/bash
. /etc/etcd/etcd.conf
ENV=${OPENSHIFT_ENV:-test}
ENDPOINTS="https://ose-${ENV}-etcd-01.example.com:2379,https://ose-${ENV}-etcd-02.example.com:2379,https://ose-${ENV}-etcd-03.example.com:2379"
ETCDCTL_API=3 etcdctl \
--cert ${ETCD_PEER_CERT_FILE} \
--key ${ETCD_PEER_KEY_FILE} \
--cacert ${ETCD_PEER_TRUSTED_CA_FILE:-$ETCD_PEER_CA_FILE} \
--endpoints "$ENDPOINTS" \
endpoint health
[root@ose-test-etcd-01 bin]# ./etcd-health
https://ose-test-etcd-01.example.com:2379 is healthy: successfully committed proposal: took = 2.41743ms
https://ose-test-etcd-03.example.com:2379 is healthy: successfully committed proposal: took = 2.363286ms
https://ose-test-etcd-02.example.com:2379 is healthy: successfully committed proposal: took = 2.213456ms
- Create a backup of your etcd data and configuration.
Because of the migration during the upgrade to 3.7, I am assuming I do not need to back up v2 data. That is somewhat TBD, however.
#!/bin/bash
# https://docs.openshift.com/container-platform/latest/admin_guide/backup_restore.html
# https://access.redhat.com/solutions/1981013#comment-1257931
. /etc/etcd/etcd.conf
ETCD3="etcdctl --cert ${ETCD_PEER_CERT_FILE} \
--key ${ETCD_PEER_KEY_FILE} \
--cacert ${ETCD_PEER_TRUSTED_CA_FILE:-$ETCD_PEER_CA_FILE} \
--endpoints ${ETCD_ADVERTISE_CLIENT_URLS}"
BACKUP_DIR="/var/backup/etcd/$(date +%Y%m%d%H%M)"
mkdir -p ${BACKUP_DIR}/snap
cp -rp /etc/etcd "${BACKUP_DIR}/"
cp -p $0 "${BACKUP_DIR}/"
ETCDCTL_API=3 $ETCD3 \
snapshot save ${BACKUP_DIR}/snap/db
# Restore:
# . ${BACKUP_DIR}/etcd/etcd.conf
# ETCDCTL_API=3 $ETCD3 \
# --name $ETCD_NAME \
# --initial-cluster $ETCD_INITIAL_CLUSTER \
# --initial-cluster-token $ETCD_INITIAL_CLUSTER_TOKEN \
# --initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS \
# snapshot restore ${BACKUP_DIR}/snap/db
- Run the etcd scaleup playbook
#!/bin/bash
# https://docs.openshift.com/container-platform/3.9/admin_guide/assembly_replace-etcd-member.html
PLAYBOOK=/usr/share/ansible/openshift-ansible/playbooks/openshift-etcd/scaleup.yml
ansible-playbook -vvv \
-i hosts "$PLAYBOOK" \
| tee $(date +%Y%m%d-%H%M)-etcd-scaleup.log
In my case I found etcd had been accidentally started by hand with a default config file which listened on localhost. The config file was modified by the etcd role and the restart etcd handler was notified, but it was skipped. This caused the etcd cluster status check task to timeout, and subsequent steps in the playbook to fail.
After restarting etcd at 18:43 the cluster reports as healthy, and I re-ran the playbook successfully.
After the playbook has been run successfuly it can be seen that the master node has been added as an etcd endpoint in /etc/origin/master/master-config.yaml on every master node.
etcdClientInfo:
ca: master.etcd-ca.crt
certFile: master.etcd-client.crt
keyFile: master.etcd-client.key
urls:
- https://ose-test-etcd-01.example.com:2379
- https://ose-test-etcd-02.example.com:2379
- https://ose-test-etcd-03.example.com:2379
- https://ose-test-master-01.example.com:2379
This master is done. Move this first master from the
new_etcdto theetcdansible group. Leave it in any other groups it is already a member of of course.Remove old
ose-test-etcd-03node from theetcdansible group.Update the
master-config.yamlto include only the hosts remaining in theetcdansible group and restart api service.
I considered the modify_yaml but after noticing it inserted some nulls and converted some doule quotes to single quotes, I was happy to find the yedit module.
---
# playbook to replace currently configured master etcd URLs with
# the hosts found in ansible etcd group
- hosts: masters
vars:
openshift_master_fire_handlers: true
roles:
# https://github.com/openshift/openshift-ansible/tree/release-3.9/roles/lib_utils/library
- lib_utils
- openshift_facts
tasks:
- name: Gather Cluster facts
openshift_facts:
role: common
- name: Derive etcd url list
set_fact:
openshift_master_etcd_urls: "{{ groups['etcd'] | lib_utils_oo_etcd_host_urls(l_use_ssl, openshift_master_etcd_port) }}"
vars:
l_use_ssl: "{{ openshift_master_etcd_use_ssl | default(True) | bool}}"
openshift_master_etcd_port: "{{ etcd_client_port | default('2379') }}"
- name: Configure ectcd url list
yedit:
src: "{{ openshift.common.config_base }}/master/master-config.yaml"
key: etcdClientInfo.urls
value: "{{ openshift_master_etcd_urls }}"
backup: yes
notify: restart master api
handlers:
# https://github.com/openshift/openshift-ansible/blob/release-3.9/roles/openshift_master/handlers/main.yml
- import_tasks: /usr/share/ansible/openshift-ansible/roles/openshift_master/handlers/main.yml
Verify OpenShift operation
Remove old
ose-test-etcd-03node from etcd cluster.{% highlight plain %}{% raw %} [root@ose-test-master-01 etcd]# etcdctl3 member list 3cc657644e2e1080, started, ose-test-etcd-02.example.com, https://192.0.2.242:2380, https://192.0.2.242:2379 669fc09764815697, started, ose-test-etcd-03.example.com, https://192.0.2.243:2380, https://192.0.2.243:2379 dd1f136e71579ace, started, ose-test-etcd-01.example.com, https://192.0.2.241:2380, https://192.0.2.241:2379 eafa4cc2f9510e7b, started, ose-test-master-01.example.com, https://192.0.2.251:2380, https://192.0.2.251:2379
[root@ose-test-master-01 etcd]# etcdctl3 member remove 669fc09764815697 {% endraw %}{% endhighlight %}
Repeat for Masters 2 and 3 and etcd nodes 2 and 1.
You are now one step closer to OpenShift 3.10.
At this point etcd should be running only on the 3 Master nodes and not on the old Etcd nodes. All the masters should know this, and you are one step closer to being able to upgrade to OpenShift 3.10.
Problems
Solved
- As I mentioned I had accidentally started etcd with a default config and the scaleup playbook did not expect this condition.
Lingering
- I scaled up 2 masters as etcd nodes which got etcd 3.3.11 installed. When I went to scale up the 3rd master soon after, suddenly the newest etcd RPM was 3.2.22 which is incompatible. In fact OpenShift is not certified to work with etcd 3.3. Etcd 3.3 should be excluded in
yum.confbut it is not BZ 1672518! This KB points out a 3.2 etcd container image got a 3.3 etcd binary into it also! “ETCD hosts were upgraded to version 3.3.11.”. Here is what I did.