Testing logical replication with migrations
Introduction
ManageIQ uses logical replication to provide central administrative functions over objects in other database regions. In order to do this, postgresql’s logical replication is used to setup publications for specific tables in remote regions and subscriptions for each in the central or global region.
Note, this is completely different from HA and failover. Logical replication provides product features such as being able to start or stop a virtual machine centrally regardless of which database it resides in. HA, on the other hand, provides for failover by promoting a backup when a primary database fails.
This interconnectedness of multiple databases with publications and subscriptions with ManageIQ regions means that the upgrade process is a bit different. Basically, the global region cannot be migrated until all databases it’s subscribing to have been migrated. Fortunately, this logic should be mostly automatic.
This logic is what we’re trying to test and verify.
Pre-requisites
This was tested with 3 nightly appliances. They were setup to be at the Jansa codebase with replication. The appliances were then migrated to kasparov. This document could be used for different branches or tags.
The following region numbers were used and will be referenced throughout this document. Replace these values if different numbers are selected:
- 99 - global
- 1 - remote
- 2 - remote
It is assumed the application will be run in production mode on appliances. This is the default on appliances. If using a different mode or not appliances, RAILS_ENV will need to be exported:
export RAILS_ENV=production
Initial setup
- The existing manageiq code from the rpm in /var/www/miq/vmdb was renamed
- ManageIQ was cloned and checked out at the jansa branch
- Now, install all rpms required to build compiled gems:
These commands were found in the rpm_build Dockerfile.
dnf -y group install "development tools" && \
dnf config-manager --setopt=epel.exclude=*qpid-proton* --setopt=tsflags=nodocs --save && \
dnf -y install \
ansible \
cmake \
copr-cli \
glibc-langpack-en \
libcurl-devel \
libpq-devel \
librdkafka \
libssh2-devel \
libxml2-devel \
libxslt-devel \
nodejs \
openssl-devel \
platform-python-devel \
postgresql-server \
postgresql-server-devel \
qpid-proton-c-devel \
ruby-devel \
rubygem-bundler \
wget \
sqlite-devel && \
dnf clean all
Prepare for replication on jansa
The commands below will be run on each appliance and will do the following:
- Checkout the source version (jansa)
- Bundle the gem dependencies
- Initialize the encryption key and default database.yml
- Setup the region number
- Seed the database
Note: Setting RAILS_ENV is not required on appliances as they default to production, but is provided below for completeness.
export DISABLE_DATABASE_ENVIRONMENT_CHECK=1
vmdb
git checkout origin/jansa
bundle check || bundle update
mkdir log
cp -f certs/v2_key.dev certs/v2_key
cp -f config/database.pg.yml config/database.yml
bundle exec rake evm:db:region -- --region XXX
bundle exec rake --trace db:migrate
bundle exec rake --trace db:seed
Substitute XXX for the region number of this appliance:
- 99 - global
- 1 - remote
- 2 - remote
Configure replication for jansa
- Region 1 and 2:
- These will be remotes, meaning they will “publish” the tables to be replicated:
vmdb bin/rails r "MiqRegion.replication_type= :remote"
- Region 99 (global):
- The commands below will:
- Provide the other regions’ connection information
- Create the subscription for each remote region
bin/rails c
Subsitute the proper values below:
$ require 'miq_pglogical' host1 = 'x.x.x.x' host2 = 'y.y.y.y' port = '5432' user = 'root' password = 'zzz' sub1 = PglogicalSubscription.new(:host => host1, :port => port, :user => user, :dbname => 'vmdb_production', :password => password) sub2 = PglogicalSubscription.new(:host => host2, :port => port, :user => user, :dbname => 'vmdb_production', :password => password) MiqPglogical.save_global_region([sub1, sub2], [])
- The commands below will:
Now, replication can verified before moving on.
On the global:
bin/rails c
User.all.pluck(:id)
# This should show ids beginning with the 3 region numbers: 99, 1, 2.
Note, it may take a few minutes before these users are replicated.
Remove replication before upgrade
Whenever performing an upgrade, logical replication must be disabled on the old version and enabled again on the new version. Often, this is because appliances are replaced and therefore the publication and subscription are different. Even when not replacing appliances, it is still best to disable replication on the old version and enable it again on the new version so that the correct version of code is adding the publications and subscriptions.
On the global, assuming the remotes are region 1 and 2:
psql -U root vmdb_production -h global_ipaddress -c 'DROP subscription region_1_subscription;'
psql -U root vmdb_production -h global_ipaddress -c 'DROP subscription region_2_subscription;'
Note: Dropping the subscription on the global will also delete the replication slot on the remote and will do proper cleanup. The replication slot should not be manually removed.
Upgrade to kasparov
On region 1, 2, and 99:
vmdb
git checkout origin/kasparov
bundle check || bundle update
Configure replication for kasparov
Follow the same instructions as above
Attempt migration from global
vmdb
bin/rake db:migrate
Notice that the global will not migrate the database as it is waiting on region 1 to migrate first:
Waiting for remote region 1 to run migration 20200424183342
Leave region 1’s terminal waiting for the other regions to migrate…
Migrate database to kasparov on region 1
On region 1:
vmdb
bin/rake db:migrate
Now, region 99 (global) terminal shows:
Waiting for remote region 1 to run migration 20200424183342
Waiting for remote region 2 to run migration 20200424183342
It is now waiting on region 2 and will not complete yet.
Migrate database to kasparov on region 2
On region 2:
vmdb
bin/rake db:migrate
After region 2 migrates, region 99 will immediately migrate.