Running Zookeeper in replicated mode (for Apache Kafka)

Written by Coforge-Salesforce BU | Aug 22, 2020 6:30:00 PM

In this blog, we will explore the recommendations for running Zookeeper in replicated mode and why this is the best method for using Zookeeper within your Apache Kafka implementation.

Zookeeper is a distributed, open-source coordination service developed by Apache that acts as a centralised service and is used to maintain naming and configuration data to provide flexible and robust synchronisation within distributed systems.

Kafka uses Zookeeper to manage service discovery for Kafka brokers that form the cluster. Zookeeper sends changes of the topology to Kafka, so each node in the cluster knows when a new broker joined, a broker died, a topic was removed or a topic was added, etc.

In summary, Zookeeper provides an in-sync view of the Kafka Cluster configuration.

Why is Zookeeper important for the Kafka Cluster?

Kafka relies on Zookeeper for the following key information about Kafka’s metadata:

1. Topic Registration Info

2. Partition State Info

3. Broker Registration Info

4. Controller EPOCH & Registration

5. Consumer Registration, Owner, Offset Info

6. Reassign partitions

7. Preferred Replica Election

Note: For more granular configuration details, please follow this link.

Why should you run Zookeeper in replicated mode?

Running ZooKeeper in standalone mode is convenient for evaluation, some development, and testing, but in production, you should run ZooKeeper in replicated mode. This is because single node Zookeeper is not considered fault-tolerant and this type of setup is not-at-all recommended in production as a fair co-ordination service.

When running in replicated mode you are working with a replicated group of servers in the same application - this is called a quorum.

Quorum is necessary for Zookeeper’s service and high availability in general for its clients. In this case, it’s the Kafka Cluster co-ordination service. In order to qualify as a reliable co-ordination service with high-availability and minimum fault-tolerance, it has to form a healthy ensemble of Zookeeper nodes.

To give you an idea of how many nodes you should have replicated, we have outlined below how number of nodes can affect tolerance level and high availability status of a Kafka Cluster:

What does the optimal 3 nodes Zookeeper quorum look like in Kafka?

Zookeeper quorum of 3 is minimum HA with tolerance level 1 node failure.
Zookeeper leader coordinates and maintains the Kafka Cluster state correctly until or before it goes offline.
When any Zookeeper is taken down for any upgrade, in 3 nodes quorum, then, further tolerance is zero (0) and there would be only two Zookeeper nodes remaining running in the Kafka cluster but with no co-ordination service. There would be only two Zookeeper nodes remained running in kafka cluster but with no co-ordination service.
If there is no Zookeeper coordination service, then no changes from Kafka cluster is co-ordinated and no further tolerance of DR events or any further failures expected in Zookeeper and Kafka cluster services.

When should 5 nodes Zookeeper quorum be used?

Zookeepers quorum of 5 is the minimum recommended average tolerance of 2 nodes down for production deployments to keep the Zookeeper service highly available for Kafka cluster coordination.
Operational Scenarios,
- Expected Outage - 1 node can be taken down for expected rolling upgrades or patches.
- Unexpected Outage - 1 node is for any unexpected tolerance of DR event / outage / failure.

For more information about the Confluent platform Zookeeper deployments, please check the following Confluent reference documentation:

Multi-node setup of Zookeeper for Kafka

Running Zookeeper in production

Final Thoughts

Having explored running Zookeeper within the Kafka Cluster in replicated mode and given our recommendations, we hope your questions around the optimal number of Zookeeper nodes were answered. If you have any other questions, as a preferred Confluent partner and Kafka expert, we can help - simply contact us here.

If you would like to find out how to become a data-driven organisation with event streaming, Kafka and Confluent, then give us a call or email us at Salesforce@coforge.com

Other useful links:

Journey to the event-driven business with Kafka

Coforge Expert Kafka Services

Coforge Confluent Services

View full post