How ZooKeeper Works from Start to Finish
-
ZooKeeper is a distributed coordination service - it helps multiple distributed systems agree on shared state (like who’s the leader, what configs to use, etc.).
-
You run a ZooKeeper ensemble - a group of servers, usually an odd number (3, 5, 7…), so that a majority quorum can survive failures.
-
One server is elected leader; the rest are followers. All writes go through the leader, while reads can be served by any server.
-
Clients (like Kafka, Hadoop, etc.) connect to any ZooKeeper server - they see a hierarchical, filesystem-like namespace called the znode tree.
-
Znodes are like files/directories - they store small amounts of metadata or coordination data (not large blobs).
-
When a client reads or writes a znode:
- Reads are served locally by whichever server the client is connected to.
- Writes go to the leader, which replicates the change to followers.
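As a concrete illustration, here is a minimal sketch using the official Java client (org.apache.zookeeper.ZooKeeper). The connection string, paths, and payload are hypothetical, and production code would normally wait for the session to be established first (a session example appears further below).

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class ZkQuickStart {
    public static void main(String[] args) throws Exception {
        // Connect to any ensemble member; the client library picks a server from the list.
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> { });

        // Write: the request is forwarded to the leader and committed once a quorum acks it.
        if (zk.exists("/app", false) == null) {
            zk.create("/app", "config-v1".getBytes(),
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Read: answered from the in-memory state of whichever server we are connected to.
        Stat stat = new Stat();
        byte[] data = zk.getData("/app", false, stat);
        System.out.println(new String(data) + " (version " + stat.getVersion() + ")");

        zk.close();
    }
}
```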
-
Every change (write) in ZooKeeper follows a consensus protocol called ZAB (ZooKeeper Atomic Broadcast):
- Leader proposes the update.
- Followers acknowledge.
- Once quorum (majority) confirms, the write is committed.
-
ZooKeeper guarantees strong consistency:
- All clients see the same, consistent view of the data.
- Writes are linearizable and applied in a single global order.
- There’s no split-brain - servers that lose quorum stop serving requests.
-
Znodes can be:
- Persistent → stays until explicitly deleted.
- Ephemeral → automatically deleted when the creating client’s session ends (used for service registration).
- Sequential → a flag that appends a unique, monotonically increasing suffix to the node name at creation (used for leader election and queues).
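A minimal sketch of creating each kind of znode with the Java client; the paths and payloads are hypothetical, and parent znodes (/services, /election) are assumed to exist already.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZnodeModes {
    static void createExamples(ZooKeeper zk) throws KeeperException, InterruptedException {
        // Persistent: survives client disconnects; removed only by an explicit delete.
        zk.create("/config", "v1".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Ephemeral: tied to this session; removed automatically when the session ends.
        zk.create("/services/worker-1", "10.0.0.5:8080".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Ephemeral + sequential: the server appends an increasing suffix, and the
        // returned path (e.g. /election/candidate-0000000042) includes it.
        String path = zk.create("/election/candidate-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        System.out.println("Created " + path);
    }
}
```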
-
ZooKeeper clients can watch znodes - when the znode changes, the client gets notified. This is used for event-driven coordination.
-
It’s used for:
- Leader election
- Service discovery
- Metadata sharing
- Locks and distributed queues
-
If the leader fails, the ensemble holds a leader election, and a new leader takes over.
-
ZooKeeper keeps a transaction log and snapshots to recover state in case of crashes.
-
You should never overload ZooKeeper with large data or frequent writes - it’s optimized for fast, small metadata coordination.
ZooKeeper Architecture Deep Dive
Server Roles and Responsibilities
Leader Server
The leader in a ZooKeeper ensemble has several critical responsibilities:
-
Request Processing:
- Handles all write requests from clients
- Proposes state changes to followers
- Maintains the order of all transactions
-
Quorum Management:
- Tracks active followers
- Ensures majority consensus before committing
- Manages the atomic broadcast protocol
-
Recovery Coordination:
- Brings lagging followers up to date
- Handles server joins and departures
- Coordinates state transfers to new servers
Follower Server
Followers perform several key functions:
-
Read Request Handling:
- Serve read requests directly to clients
- Provide scalability for read-heavy workloads
-
Write Request Forwarding:
- Forward write requests to the leader
- Participate in voting on proposals
-
Failure Detection:
- Send heartbeats to the leader
- Participate in leader election when needed
Observer Server (Optional)
ZooKeeper also supports observer nodes which:
- Don’t participate in voting
- Can serve read requests like followers
- Provide read scalability without affecting write performance
- Are ideal for cross-data-center deployments
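As a rough illustration, an observer is typically declared in zoo.cfg along these lines (a sketch with hypothetical hostnames; check the ZooKeeper Observers documentation for the authoritative syntax):

```
# On the observer host:
peerType=observer

# In every server's configuration, the observer's line carries the :observer suffix:
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888
server.4=observer1.example.com:2888:3888:observer
```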
The ZAB Protocol In Detail
ZooKeeper Atomic Broadcast (ZAB) has two main modes:
-
Recovery Mode:
- Activated when a leader fails or the ensemble starts
- Discovers the latest state among surviving servers
- Elects a new leader that has the most up-to-date state
- Synchronizes followers with the new leader
-
Broadcast Mode:
- Normal operation after recovery
- Two-phase commit process:
- Phase 1: Leader sends a proposal stamped with a zxid (a monotonically increasing transaction ID)
- Phase 2: Leader commits after receiving majority acknowledgments
ZooKeeper Sessions and Watches
Session Management
ZooKeeper sessions provide client-server connection context:
-
Session Establishment:
- Client connects and receives a session ID
- Session timeout negotiated between client and server
-
Heartbeating:
- Client sends periodic pings to keep session alive
- If server doesn’t hear from client within timeout, session expires
- All ephemeral nodes created in session are automatically removed
-
Session Migration:
- If a server fails, the client library reconnects to another ensemble member
- The session (and its ephemeral nodes) survives as long as the client reconnects within the session timeout
- Transparent to the client application
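A sketch of session establishment with the Java client, blocking until the SyncConnected event arrives and reacting to expiry; the class name, connect string, and timeout are placeholders.

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooKeeper;

public class SessionExample {
    public static ZooKeeper connect(String connectString, int sessionTimeoutMs) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        ZooKeeper zk = new ZooKeeper(connectString, sessionTimeoutMs, (WatchedEvent event) -> {
            if (event.getState() == KeeperState.SyncConnected) {
                connected.countDown();      // session established (or re-established)
            } else if (event.getState() == KeeperState.Expired) {
                // Session expired: ephemeral nodes are gone and watches are lost.
                // The only recovery is to create a brand-new ZooKeeper handle.
                System.err.println("Session expired; a new handle is required");
            }
        });

        connected.await();                  // block until the handshake completes
        System.out.println("session id 0x" + Long.toHexString(zk.getSessionId())
                + ", negotiated timeout " + zk.getSessionTimeout() + " ms");
        return zk;
    }
}
```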
Watch Mechanics
Watches are one-time triggers. Key aspects:
-
Registration Process:
- Set as a side effect of read operations (getData, getChildren, exists)
- The operation used determines which event types the watch can fire for
-
Event Types:
- NodeCreated: Node comes into existence
- NodeDeleted: Node is removed
- NodeDataChanged: Node’s data is modified
- NodeChildrenChanged: Node’s children list changes
-
Delivery Guarantees:
- Events delivered in order relative to other changes
- Client sees the watch event before seeing new state
- One-time triggers that must be re-registered if needed again
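A sketch of the register/re-register cycle, here tracking group membership with a children watch; the group path is hypothetical and error handling is reduced to a comment.

```java
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

public class MembershipWatcher implements Watcher {
    private final ZooKeeper zk;
    private final String groupPath;           // e.g. a hypothetical "/services/payments"

    public MembershipWatcher(ZooKeeper zk, String groupPath)
            throws KeeperException, InterruptedException {
        this.zk = zk;
        this.groupPath = groupPath;
        listAndWatch();                       // initial read registers the first watch
    }

    private void listAndWatch() throws KeeperException, InterruptedException {
        // Passing a Watcher to getChildren registers a one-time NodeChildrenChanged trigger.
        List<String> members = zk.getChildren(groupPath, this);
        System.out.println("current members: " + members);
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == EventType.NodeChildrenChanged) {
            try {
                listAndWatch();               // watches fire once: re-read and re-register
            } catch (KeeperException | InterruptedException e) {
                // real code would retry or resynchronize here
            }
        }
    }
}
```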
ZooKeeper Data Model
Znode Structure
Each znode contains:
- Data: Up to 1MB by default (typically much smaller)
- ACLs: Access control lists defining permissions
- Stats:
- czxid: ID of the transaction that created the node
- mzxid: ID of the transaction that last modified the node
- ctime/mtime: Creation and modification timestamps
- version/cversion/aversion: Version counters for data, children, and ACL changes
- ephemeralOwner: Session ID of owner if ephemeral
- dataLength/numChildren: Size of data and number of children
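A small sketch that dumps these fields via the Java client's Stat object (the path argument is hypothetical):

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class StatInspector {
    static void printStat(ZooKeeper zk, String path) throws KeeperException, InterruptedException {
        Stat stat = zk.exists(path, false);               // null if the znode does not exist
        if (stat == null) {
            System.out.println(path + " does not exist");
            return;
        }
        System.out.println("czxid          = " + stat.getCzxid());      // creating transaction
        System.out.println("mzxid          = " + stat.getMzxid());      // last data change
        System.out.println("ctime/mtime    = " + stat.getCtime() + " / " + stat.getMtime());
        System.out.println("version        = " + stat.getVersion());    // data version
        System.out.println("cversion       = " + stat.getCversion());   // children version
        System.out.println("aversion       = " + stat.getAversion());   // ACL version
        System.out.println("ephemeralOwner = 0x" + Long.toHexString(stat.getEphemeralOwner()));
        System.out.println("dataLength     = " + stat.getDataLength());
        System.out.println("numChildren    = " + stat.getNumChildren());
    }
}
```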
Path Namespace
ZooKeeper’s hierarchical namespace:
- Starts with “/” (root)
- Uses Unix-like paths (/app/region1/service1)
- Each path component has a 255 byte limit
- Total path limited to 1024 bytes
- Cannot use relative paths or “.” and “..” notation
ZooKeeper Consistency Guarantees
ZooKeeper provides specific ordering guarantees:
- Linearizable Writes: All servers process updates in the same order
- FIFO Client Order: All requests from a given client executed in order
- Sequential Consistency: Updates from a client are applied in the order submitted
- Sync Method: Forces the connected server to catch up with the leader’s committed writes before subsequent reads
These guarantees enable implementing complex distributed algorithms reliably.
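For example, the sync-then-read idiom looks roughly like this with the Java client's asynchronous sync() call (the path is hypothetical):

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.ZooKeeper;

public class SyncThenRead {
    static byte[] readLatest(ZooKeeper zk, String path) throws Exception {
        CountDownLatch done = new CountDownLatch(1);

        // sync() asks the connected server to catch up with the leader before we read,
        // closing the window in which a follower could return slightly stale data.
        zk.sync(path, (rc, p, ctx) -> done.countDown(), null);
        done.await();

        return zk.getData(path, false, null);
    }
}
```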
Common ZooKeeper Patterns
Leader Election Implementation
A robust leader election pattern:
-
Create Sequential Ephemeral Node:
- Each server creates a znode like /election/node-{sequential-id}
- Gets back its assigned sequence number
-
Find Lowest Sequence:
- Server lists all election nodes
- Determines the lowest sequence number
-
Become Leader or Watch Predecessor:
- If own node has lowest sequence, become leader
- Otherwise, watch the next lower node
-
Handle Node Deletion:
- When watched node disappears, repeat process
- Ensures orderly succession if a server fails
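A condensed sketch of this recipe with the Java client; /election is assumed to exist, class and node names are hypothetical, and retry/error handling is simplified.

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElection {
    private final ZooKeeper zk;
    private String myNode;                    // e.g. "node-0000000007"

    public LeaderElection(ZooKeeper zk) { this.zk = zk; }

    public void volunteer() throws KeeperException, InterruptedException {
        // 1. Create a sequential ephemeral candidate node.
        String path = zk.create("/election/node-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        myNode = path.substring(path.lastIndexOf('/') + 1);
        attemptLeadership();
    }

    private void attemptLeadership() throws KeeperException, InterruptedException {
        // 2. List all candidates and sort by sequence number.
        List<String> candidates = zk.getChildren("/election", false);
        Collections.sort(candidates);

        if (myNode.equals(candidates.get(0))) {
            System.out.println("became leader as " + myNode);   // 3a. lowest sequence wins
            return;
        }

        // 3b. Otherwise watch only the candidate immediately ahead of us (no herd effect).
        String predecessor = candidates.get(candidates.indexOf(myNode) - 1);
        if (zk.exists("/election/" + predecessor, this::onPredecessorChange) == null) {
            attemptLeadership();              // predecessor vanished in between: re-check
        }
    }

    private void onPredecessorChange(WatchedEvent event) {
        if (event.getType() == EventType.NodeDeleted) {
            try {
                attemptLeadership();          // 4. predecessor is gone: repeat the process
            } catch (KeeperException | InterruptedException e) {
                // real code would retry or step down here
            }
        }
    }
}
```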
Distributed Locks
ZooKeeper can implement fair, distributed locks:
-
Create Sequential Ephemeral Node:
- Client creates a znode like /locks/lock-{sequential-id}
-
Check Position:
- Client lists all lock nodes
- If its node has lowest sequence, it holds the lock
- Otherwise, watch the next lower sequence node
-
Wait for Notification:
- When the watched node is deleted, client rechecks position
- Prevents herd effect (only one client wakes up)
-
Release Lock:
- Delete own node when done
- Next client in line gets notified
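A condensed sketch of the same queue idea as a blocking lock() / unlock() pair; /locks is assumed to exist, names are hypothetical, and error handling is simplified. Because the node is ephemeral, a crashed holder releases the lock automatically when its session expires.

```java
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class DistributedLock {
    private final ZooKeeper zk;
    private String myPath;                    // full path of our queue node

    public DistributedLock(ZooKeeper zk) { this.zk = zk; }

    public void lock() throws KeeperException, InterruptedException {
        // 1. Join the queue with a sequential ephemeral node.
        myPath = zk.create("/locks/lock-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        String myNode = myPath.substring(myPath.lastIndexOf('/') + 1);

        while (true) {
            // 2. Check our position in the queue.
            List<String> queue = zk.getChildren("/locks", false);
            Collections.sort(queue);
            if (myNode.equals(queue.get(0))) {
                return;                       // lowest sequence number: we hold the lock
            }

            // 3. Wait only for the node directly ahead of us (avoids the herd effect).
            String predecessor = "/locks/" + queue.get(queue.indexOf(myNode) - 1);
            CountDownLatch released = new CountDownLatch(1);
            boolean stillThere = zk.exists(predecessor, event -> {
                if (event.getType() == EventType.NodeDeleted) {
                    released.countDown();
                }
            }) != null;
            if (stillThere) {
                released.await();
            }
            // Loop and re-check: the predecessor may have crashed rather than released.
        }
    }

    public void unlock() throws KeeperException, InterruptedException {
        zk.delete(myPath, -1);                // 4. delete our node; the next waiter is notified
    }
}
```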
Configuration Management
For dynamic configuration:
-
Central Configuration Node:
- Store configuration in a persistent znode
- All clients watch this node
-
Configuration Updates:
- Administrator updates the configuration znode
- All watching clients get notified
-
Two-phase Commit Approach:
- Use a ready/active two-node approach for complex changes
- Clients first watch for “ready” notification with new config
- Then wait for “active” signal to apply changes simultaneously
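A sketch of the simple (single-znode) variant: an administrator overwrites a configuration znode, and every watching client re-reads it on notification; /config/service is a hypothetical path.

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class DynamicConfig {
    // Administrator side: publish a new configuration value (version -1 = unconditional write).
    static void publish(ZooKeeper zk, byte[] newConfig) throws KeeperException, InterruptedException {
        zk.setData("/config/service", newConfig, -1);
    }

    // Client side: read the current value and keep a watch registered for future changes.
    static void follow(ZooKeeper zk) throws KeeperException, InterruptedException {
        Stat stat = new Stat();
        byte[] config = zk.getData("/config/service", event -> {
            if (event.getType() == EventType.NodeDataChanged) {
                try {
                    follow(zk);                     // one-time watch: re-read and re-register
                } catch (KeeperException | InterruptedException e) {
                    // real code would retry here
                }
            }
        }, stat);
        System.out.println("applying config v" + stat.getVersion() + ": " + new String(config));
    }
}
```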
Performance Characteristics
Read vs. Write Performance
ZooKeeper’s architecture creates asymmetric performance:
-
Reads: Very fast (can be served by any server)
- Local server memory access
- No network coordination required
- Scales horizontally by adding followers
-
Writes: Moderate speed (must go through leader)
- Requires network round-trip to leader
- Leader must get majority acknowledgment
- Limited by leader’s capacity
Scaling Considerations
Best practices for scaling ZooKeeper:
-
Ensemble Size:
- 3 servers: Tolerates 1 failure
- 5 servers: Tolerates 2 failures
- 7+ servers: Diminishing returns due to write performance impact
-
Memory Requirements:
- Entire data tree must fit in memory
- Typical production deployments: 8-32GB heap
-
Disk Performance:
- Write-ahead logging requires good sequential write performance
- SSD recommended for transaction log
- Separate disks for transaction log and snapshot storage
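A hedged zoo.cfg fragment illustrating the separate-disk layout (the directory paths are hypothetical), plus the built-in purge settings that relate to the snapshot-cleaning tip below:

```
# Snapshots on one device, the write-ahead transaction log on a dedicated (ideally SSD) device
dataDir=/data/zookeeper/snapshots
dataLogDir=/ssd/zookeeper/txnlog

# Automatic cleanup of old snapshots and transaction logs (purgeInterval is in hours)
autopurge.snapshotRetainCount=5
autopurge.purgeInterval=12
```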
Real-World Deployment Tips
-
Memory Configuration:
- Set JVM heap appropriately (not too large to avoid GC pauses)
- Leave enough system memory for operating system page cache
-
Network Configuration:
- Dedicated NIC for ZooKeeper traffic in high-volume environments
- Low-latency connections between ensemble members
-
Monitoring Essentials:
- Watch for “fsync” latencies
- Monitor outstanding requests
- Track network latency between servers
- Set alerts on ensemble size changes
-
Operational Best Practices:
- Regular snapshot cleaning
- Rolling server upgrades
- Periodic ensemble expansion planning
- Testing leader failover scenarios
Understanding these deeper aspects of ZooKeeper helps in building and maintaining robust distributed systems that rely on its coordination capabilities. While ZooKeeper appears simple at first glance, its reliable distributed consensus properties make it a foundational component for many complex distributed systems.