1. Overview

1.1. Tanzu RabbitMQ Active/Standby

To improve the availability and disaster recovery capabilities of a system running in your data center, you can implement an active/standby topology across two data centers.

Key benefits of implementing the active/standby architecture solution include:

  • achieving business continuity

  • avoiding message loss

From a business continuity perspective, failing over to the standby site involves two steps:

  • invoking the promotion procedure on the standby site, and

  • redirecting external traffic to the new active site

To achieve the lowest Recovery Time Objective (RTO), applications can run on the standby site so that they are warmed up and ready to connect when the site is promoted to active. However, you must ensure that these applications are resilient to connection failures.

1.2. Multi-cluster schema synchronization

Thanks to the schema synchronization plugin that comes with Tanzu RabbitMQ for Kubernetes, every queue, exchange, and binding (in addition to other RabbitMQ resources) created on the active site is replicated to the standby site.

Do not attempt to create any AMQP resource on the standby site, because it will be deleted by the schema synchronization plugin.

1.3. Multi-cluster queue replication

Thanks to the multi-cluster queue replication capability of the Active/Standby topology, processing can resume for messages that were left unprocessed on the former active site.

This is how it works:

  1. The developer or operator configures which queues are replicated by creating the appropriate replication policies. Every queue matched by at least one replication policy is automatically replicated.

  2. Messages sent to a replicated queue, along with their acknowledgements, are recorded.

  3. When the standby cluster connects to the active cluster, it downloads all recorded messages and acknowledgements as part of normal operation.

  4. When the standby cluster is promoted to active, all unacknowledged messages are restored to their corresponding queues and AMQP traffic is accepted again.

[Figure: promote standby to active]

2. Operator Guide

2.1. Install Tanzu RabbitMQ bundle to private registry

If you haven’t already, download the Tanzu RabbitMQ bundle from Tanzu Network (network.pivotal.io/products/p-rabbitmq-for-kubernetes/).

You must then place this tar file on the filesystem of a machine within the network where you host your registry. On that machine, load the tarball into your registry by running:

imgpkg copy --to-repo your.company.domain/registry/tanzu-rabbitmq-bundle --tar tanzu-rabbitmq-1.1.0.tar

On the machine targeting the Kubernetes cluster, you can now use this bundle by running:

imgpkg pull -b your.company.domain/registry/tanzu-rabbitmq-bundle:1.1.0 -o /your/output/directory
cd /your/output/directory

2.2. Install Tanzu RabbitMQ Kubernetes operators

Follow the instructions in the bundle’s README.md in order to install the Cluster Operator and Messaging Topology Operator in the Kubernetes cluster. Note that you do not yet need to create a RabbitMQ cluster; this is done in the next step.
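As a quick sanity check, you can verify that the operator Pods are running. The sketch below assumes the operators were installed into the rabbitmq-system namespace; adjust the namespace if your installation differs.

# Check that the Cluster Operator and Messaging Topology Operator Pods are running
# (assumes the default rabbitmq-system namespace)
kubectl -n rabbitmq-system get pods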

2.3. Deploying your Active/Standby topology

Once the installation is complete, we can deploy an active/standby topology.

[Figure: installation deploy]

Navigate to the tech_preview/replicated-quorum-queues directory in the bundle. Populate the user-values.yaml file with the parameters relevant to your upstream and downstream clusters.

Here is a sample user-values.yaml for the two clusters, red being the active cluster and blue the standby one.

#@data/values
#@ load("@ytt:overlay", "overlay")
---
#@overlay/match missing_ok=True
upstream: (1)
  name: red (2)
  namespace: rabbitmq-red (3)
  registry_credentials_secret: (4)
    - name: my-docker-credentials
  external_name: red.rabbitmq-red.svc (5)
  external_amqp_port: "5672"
  external_streams_port: "5551"
  schema_sync_mode: upstream
  upstream: (6)
    external_name: blue.rabbitmq-blue.svc
    external_amqp_port: "5672"
    external_streams_port: "5551"
    name: blue
    namespace: rabbitmq-blue

#@overlay/match missing_ok=True
downstreams: (7)
  - name: blue (8)
    namespace: rabbitmq-blue (9)
    registry_credentials_secret: (4)
      - name: my-docker-credentials
    external_name: blue.rabbitmq-blue.svc (10)
    external_amqp_port: "5672"
    external_streams_port: "5551"
    schema_sync_mode: downstream
    upstream: (11)
      external_name: red.rabbitmq-red.svc
      external_amqp_port: "5672"
      external_streams_port: "5551"
      name: red
      namespace: rabbitmq-red
1 Under this section, we configure the active cluster
2 Name of the active RabbitMQ cluster
3 Namespace where the RabbitMQ cluster will be deployed to
4 Name of the Secret containing private registry credentials used to pull images (an example of creating it follows this list)
5 Externally accessible DNS name or IP address used to connect to the active RabbitMQ cluster. This is needed to synchronise data between clusters.
6 Under this sub-section, we configure the external_name, name and namespace of the standby cluster
7 Under this section, we configure the standby cluster
8 Name of the standby RabbitMQ cluster
9 Namespace where the standby RabbitMQ cluster will be deployed to
10 Externally accessible DNS name or IP address used to connect to the standby RabbitMQ cluster. This is needed to synchronise data between clusters.
11 Under this sub-section, we configure the external_name, name and namespace of the active cluster
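Before deploying, make sure the namespaces and the registry credentials Secret referenced in the values file exist in each Kubernetes cluster. The following sketch uses the names from the sample above; the registry server and credentials are placeholders.

# On the primary Kubernetes cluster (active site)
kubectl create namespace rabbitmq-red
kubectl -n rabbitmq-red create secret docker-registry my-docker-credentials \
  --docker-server=your.company.domain/registry \
  --docker-username=REGISTRY_USERNAME \
  --docker-password=REGISTRY_PASSWORD

# Repeat on the secondary Kubernetes cluster (standby site) with the rabbitmq-blue namespace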

First, target your primary Kubernetes cluster and deploy the active RabbitMQ cluster.

ytt -f templates/rabbitmq.lib.yaml \
    -f templates/rabbitmq-dr.lib.yaml \
    -f templates/rabbitmq-topology.lib.yaml \
    -f configurations/internal-values.yaml \
    -f user-values.yaml \ (1)
    -f configurations/rabbitmq-active.yaml \
    -f configurations/rabbitmq-prp-active.yaml \
 | kbld -f ../../.imgpkg/images.yml -f- \
 | kapp -y deploy -a rabbitmq-upstream -f-
1 Populate the user-values.yaml file with the parameters relevant to your upstream and downstream clusters.

To check the status of the deployment, run the following command:

kapp -n default inspect -a rabbitmq-upstream --tree

The deployment is complete when all the status conditions under the 'Conds.' column are true.
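You can also inspect the RabbitMQ cluster resource and its Pods directly. The namespace below assumes the rabbitmq-red namespace from the sample values.

kubectl -n rabbitmq-red get rabbitmqclusters
kubectl -n rabbitmq-red get pods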

Second, target your secondary Kubernetes cluster and deploy the standby RabbitMQ cluster.

ytt -f templates/rabbitmq.lib.yaml \
    -f templates/rabbitmq-dr.lib.yaml \
    -f templates/rabbitmq-topology.lib.yaml \
    -f configurations/internal-values.yaml \
    -f user-values.yaml \  (1)
    -f configurations/rabbitmq-passive.yaml \
    -f configurations/rabbitmq-prp-passive.yaml \
 | kbld -f ../../.imgpkg/images.yml -f- \
 | kapp -y deploy -a rabbitmq-downstream -f-
1 Populate the user-values.yaml file with the parameters relevant to your upstream and downstream clusters.

2.4. Configure Queue Replication

To enable queue replication, you must configure at least one replication policy.

A replication policy is a YAML file like the one below:

Only durable queues (quorum or classic) that match the policy’s pattern are replicated. Non-durable, exclusive, and auto-delete queues are not replicated.

virtualHost: "/" (1)
pattern: "^rep-.*" (2)
1 Only the default virtual host is currently supported
2 Replicates all queues whose names start with rep-

The bundle automatically creates a policy to replicate all queues. To create, modify, or delete a replication policy, edit the ConfigMap named replication-policies. This ConfigMap is located in the same namespace as the Active-Passive components. The keys of this ConfigMap must meet the following criteria:

  • The key name starts with policy-, for example policy-myqueues.yaml

  • The key name ends with .yaml, for example policy-critical.yaml

If two policies conflict, the most permissive takes effect. For example, a policy with the regular expression ^$ would replicate nothing (the regex matches no queue name), and a policy with the regular expression .* would replicate all queues. If both policies exist at the same time, the policy that replicates all queues takes effect, because it is more permissive than the policy that replicates nothing.
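For illustration, a replication-policies ConfigMap containing a single policy key might look like the following sketch; the namespace and policy key name are examples.

apiVersion: v1
kind: ConfigMap
metadata:
  name: replication-policies
  namespace: rabbitmq-red
data:
  policy-rep-queues.yaml: |
    virtualHost: "/"
    pattern: "^rep-.*"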

At present, you must ensure that any change applied to the replication-policies ConfigMap in one cluster (typically the active) is also applied to the other cluster. If the change is not applied to the other site, that site will enforce a different set of replication policies whenever it restarts after being promoted to active.

2.5. Promote Standby cluster to Active

The standby cluster must be running while the active site is running in order to receive replicated messages and acknowledgements.

[Figure: promote standby to active]

To promote a running standby cluster to active, follow this procedure:

All commands must be executed on the passive site.
# Stop the schema replication plugin
# Exec into a RabbitMQ Pod
kubectl exec "RABBITMQ_POD_NAME" -c rabbitmq -- rabbitmqctl disable_schema_replication

# Update the upstream endpoints to the passive site external name or IP
# The schema sync username and password can be customised in the values file
kubectl exec "RABBITMQ_POD_NAME" -c rabbitmq --  rabbitmqctl set_schema_replication_upstream_endpoints \
  "{\"endpoints\": [\"PASSIVE_SITE_EXTERNAL_NAME:5672\"], \"username\": \"SCHEMA_SYNC_USERNAME\", \"password\": \"SCHEMA_SYNC_PASSWORD\"}"

# Configure the schema sync plugin to operate as active
kubectl exec "RABBITMQ_POD_NAME" -c rabbitmq -- rabbitmqctl set_schema_replication_mode upstream

# Configure the Active-Passive components to operate as active
kubectl patch cm dr-components-mode --patch='{"data": {"mode.yaml": "mode: Active\n"}}'

The above procedure starts a "replay" phase. Any messages that were not acknowledged in the former active site are published to their queues, in the same order as they were originally published.

As soon as the replay phase is complete, the RabbitMQ cluster is ready to accept AMQP client applications. To complete the recovery, start the client applications if they are not already running.

At some point, plan to update the configuration of RabbitMQ.

The passive site was deployed with a configuration to operate as standby. This configuration was changed at runtime, but the RabbitMQ configuration file was not updated. Changing the configuration file requires a rolling restart of RabbitMQ. If the configuration file is not updated, RabbitMQ risks starting with the wrong operating mode (that is, as standby) upon a restart.

To update the configuration file of RabbitMQ, run the following command:

kubectl edit rmq RABBITMQ_CLUSTER_NAME
apiVersion: rabbitmq.com/v1beta1
kind: RabbitmqCluster
metadata:
  name: my-rabbitmq-cluster
  [...]
spec:
  rabbitmq:
    additionalConfig: |
      schema_definition_sync.operating_mode = downstream (1)
  [...]
1 Update this value to 'upstream'

3. Developer Guide

3.1. Using Tanzu RabbitMQ Active/Standby

Applications running in both the active and the standby site should use the amqp-proxy service to connect to RabbitMQ.

For example, if the active site is deployed to the rabbitmq-red namespace, applications running in the Kubernetes cluster should use amqp-proxy.rabbitmq-red.svc as the hostname for connecting to RabbitMQ.
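As an illustration, a minimal Python client using the pika library could connect through the proxy and publish a message as follows; the credentials and queue name are examples.

import pika

# Connect through the amqp-proxy service of the active site (example credentials)
params = pika.ConnectionParameters(
    host="amqp-proxy.rabbitmq-red.svc",
    port=5672,
    credentials=pika.PlainCredentials("app-user", "app-password"),
)
connection = pika.BlockingConnection(params)
channel = connection.channel()

# Publish to the default exchange, routed to a durable queue named "rep-orders"
channel.queue_declare(queue="rep-orders", durable=True)
channel.basic_publish(exchange="", routing_key="rep-orders", body=b"hello")
connection.close()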

3.1.1. Resilient applications

You should deploy applications that are resilient to connection failures, so that they do not crash when they either cannot open the initial connection or lose an established connection.

This will be important if you want to deploy applications on the DR site while it is on standby. The standby site will not allow connections until it is promoted to active. However, you can write your applications so that they retry rather than crash.

You may want to have some or all of your applications running on the DR site in order to reduce the Recovery Time Objective (RTO) when the standby site is promoted to active.
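One way to achieve this resilience, sketched below with the pika library, is to wrap the connection attempt in a retry loop so that the application waits and retries instead of crashing while the site is still on standby; the hostname and credentials are examples.

import time
import pika

def connect_with_retry(host="amqp-proxy.rabbitmq-blue.svc", delay_seconds=10):
    # Retry until the standby site is promoted to active and accepts connections
    while True:
        try:
            params = pika.ConnectionParameters(
                host=host,
                credentials=pika.PlainCredentials("app-user", "app-password"),
            )
            return pika.BlockingConnection(params)
        except pika.exceptions.AMQPConnectionError:
            time.sleep(delay_seconds)

connection = connect_with_retry()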

3.2. Configure Queue Replication

Permission to modify ConfigMap(s) is required to configure queue replication.

To enable queue replication, you must configure at least one replication policy.

A replication policy is a YAML file like the one below:

Only durable queues (quorum or classic) that match the policy’s pattern are replicated. Non-durable, exclusive, and auto-delete queues are not replicated.

virtualHost: "/" (1)
pattern: "^rep-.*" (2)
1 Only the default virtual host is currently supported
2 Replicates all queues whose names start with rep-

The bundle automatically creates a policy to replicate all queues. To create, modify, or delete a replication policy, edit the ConfigMap named replication-policies. This ConfigMap is located in the same namespace as the Active-Passive components. The keys of this ConfigMap must meet the following criteria:

  • The key name starts with policy-, for example policy-myqueues.yaml

  • The key name ends with .yaml, for example policy-critical.yaml

If two policies conflict, the most permissive takes effect. For example, a policy with the regular expression ^$ would replicate nothing (the regex matches no queue name), and a policy with the regular expression .* would replicate all queues. If both policies exist at the same time, the policy that replicates all queues takes effect, because it is more permissive than the policy that replicates nothing.

At present, you must ensure that any change applied to the replication-policies ConfigMap in one cluster (typically the active) is also applied to the other cluster. If the change is not applied to the other site, that site will enforce a different set of replication policies whenever it restarts after being promoted to active.
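For example, after editing the policies in the active cluster you could review them and apply the same change in the standby cluster; the kubectl contexts and namespaces below are examples.

# Review the current policies in the active cluster
kubectl --context=primary -n rabbitmq-red get configmap replication-policies -o yaml

# Apply the same change in the standby cluster
kubectl --context=secondary -n rabbitmq-blue edit configmap replication-policies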

3.2.1. How to make sure that all messages are replicated

Every message sent directly via the amqp-proxy Kubernetes service on the AMQP port is replicated, provided it is sent to one of the following exchange types:

  • default exchange

  • direct exchange

  • topic exchange

  • fanout exchange

Headers exchanges are not supported yet.
Messages sent directly via the management UI or REST API are not replicated.

In order to move messages from a source queue into a target queue using Shovel, you should always use the amqp-proxy service on the AMQP port for both the source and the target address of the shovel.
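As a sketch, a dynamic shovel that moves messages between two queues through the proxy could be declared as follows; the shovel name, queue names, credentials, and proxy hostname are examples, and the default virtual host is assumed.

kubectl exec "RABBITMQ_POD_NAME" -c rabbitmq -- rabbitmqctl set_parameter shovel move-orders \
  '{"src-protocol": "amqp091", "src-uri": "amqp://app-user:app-password@amqp-proxy.rabbitmq-red.svc:5672", "src-queue": "orders-old",
    "dest-protocol": "amqp091", "dest-uri": "amqp://app-user:app-password@amqp-proxy.rabbitmq-red.svc:5672", "dest-queue": "rep-orders"}'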

3.2.2. How to make sure that all acknowledgements are replicated

All manual acknowledgements sent via the amqp-proxy service are automatically replicated.

Auto-acknowledgements are not replicated. Auto-acknowledgements do not guarantee the safety of message delivery and so must not be used if data safety is a priority.
AMQP supports acknowledging multiple messages at once. However, this mode is not yet supported by Active/Standby; only single-message acknowledgements are supported.
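The sketch below, again using the pika library, consumes with manual, single-message acknowledgements; the queue name and connection details are examples.

import pika

# Connect through the amqp-proxy service (example connection details)
connection = pika.BlockingConnection(pika.ConnectionParameters(
    host="amqp-proxy.rabbitmq-red.svc",
    credentials=pika.PlainCredentials("app-user", "app-password")))
channel = connection.channel()

def handle(ch, method, properties, body):
    print(body)  # replace with your application logic
    # Acknowledge exactly one message; multiple acknowledgements are not supported by Active/Standby
    ch.basic_ack(delivery_tag=method.delivery_tag, multiple=False)

# auto_ack=False so that acknowledgements are sent manually and therefore replicated
channel.basic_consume(queue="rep-orders", on_message_callback=handle, auto_ack=False)
channel.start_consuming()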

4. Appendix - Release History

v0.0.1

  • First release