System Design: WhatsApp / Real-Time Chat Application

What’s the Goal?

Enable real-time, end-to-end encrypted messaging between users (1-on-1 or in groups), with support for media, read receipts, message sync across devices, and offline delivery.

Core Components

a. User Service

  • Manages user profiles, devices, and online status
  • Handles authentication and session management
  • Stores user metadata (not message content)

b. Messaging Service

  • Routes messages between senders and receivers
  • Manages message delivery status
  • Ensures reliable message delivery

c. Connection Service

  • Manages WebSocket connections for real-time communication
  • Tracks online/offline status of users
  • Optimizes connection management for mobile devices

d. Storage Service

  • Stores message history and media
  • Maintains conversation threads
  • Handles message retention policies

e. Notification Service

  • Sends push notifications for offline users
  • Manages notification preferences
  • Integrates with platform-specific notification services (APNs, FCM)

f. Media Service

  • Processes uploads/downloads of images, audio, video
  • Optimizes media for different devices and connections
  • Manages efficient storage and delivery

g. Encryption Service

  • Ensures end-to-end encryption for all messages
  • Manages key exchange and verification
  • Protects metadata where possible

h. Group Service

  • Manages group membership and permissions
  • Handles message fanout to group members
  • Optimizes group message delivery

i. Sync Service

  • Keeps message history synced across user devices
  • Manages conflict resolution
  • Handles partial/selective sync

How Backend Processes a Message

1. Send Message Request
  • Client sends:
    • Sender ID
    • Receiver ID (or group ID)
    • Message content (already encrypted on device)
    • Timestamp & message ID (UUID)
  • API Gateway routes request to Messaging Service
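
The request fields listed above could be modeled roughly as follows. This is only a sketch: the class name `MessageEnvelope`, its field names, and the JSON wire format are assumptions made for illustration, not a description of WhatsApp's actual protocol.

```python
import base64
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class MessageEnvelope:
    """Hypothetical wire format for a send-message request."""
    sender_id: str
    recipient_id: str   # a user ID or a group ID
    ciphertext: bytes   # content already encrypted on the sender's device
    message_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))

    def to_json(self) -> str:
        payload = asdict(self)
        # Bytes are not JSON-serializable, so the ciphertext is base64-encoded.
        payload["ciphertext"] = base64.b64encode(self.ciphertext).decode("ascii")
        return json.dumps(payload)

    @classmethod
    def from_json(cls, raw: str) -> "MessageEnvelope":
        payload = json.loads(raw)
        payload["ciphertext"] = base64.b64decode(payload["ciphertext"])
        return cls(**payload)
```

Generating the message ID on the client lets the server and the recipient recognize a retried send as a duplicate.
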
2. Validate & Store
  • User Service validates sender's session/token
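  • Messaging Service assigns an internal message ID and stores routing metadata (sender, recipient, timestamp, delivery status) in the DB; it never stores plaintext content
  • If media: the media URL is stored separately, not in the message body
  • The encrypted payload is persisted temporarily until it is delivered (e.g., in Cassandra or DynamoDB)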
3. Delivery Decision (Online vs Offline)
  • Connection Service checks:
    • Is the recipient online?
    • What device(s) are active?
  • If online:
    • Push message via WebSocket directly to client
  • If offline:
    • Hold the message in a per-recipient queue of undelivered messages until the recipient reconnects (e.g., backed by a message queue like Kafka)
    • Trigger push notification
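
The online/offline branch above can be sketched in memory as shown below. `DeliveryRouter` and its methods are hypothetical names; a plain list stands in for a WebSocket, and a list of user IDs stands in for calls to APNs/FCM.

```python
from collections import defaultdict, deque


class DeliveryRouter:
    """In-memory stand-in for the Connection Service's delivery decision."""

    def __init__(self):
        self.connections = {}                    # user_id -> inbox list (stand-in for a socket)
        self.offline_queue = defaultdict(deque)  # user_id -> messages awaiting delivery
        self.push_notifications = []             # user_ids for which a push was triggered

    def connect(self, user_id):
        """Register a connection and flush anything queued while offline."""
        inbox = []
        self.connections[user_id] = inbox
        while self.offline_queue[user_id]:
            inbox.append(self.offline_queue[user_id].popleft())
        return inbox

    def disconnect(self, user_id):
        self.connections.pop(user_id, None)

    def deliver(self, recipient_id, message):
        inbox = self.connections.get(recipient_id)
        if inbox is not None:
            inbox.append(message)                # online: push directly
            return "pushed"
        self.offline_queue[recipient_id].append(message)
        self.push_notifications.append(recipient_id)
        return "queued"                          # offline: queue and notify
```

Flushing the queue inside `connect` is what turns a reconnect into delivery of everything the user missed, in arrival order.
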
4. Receipt Updates
  • When message is delivered: backend sends delivery status back to sender
  • When recipient opens chat:
    • Client sends read-receipt
    • Backend updates message state to "read" and notifies sender (blue ticks)
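
One way to keep receipt handling robust is to treat the message state as something that only moves forward. The sketch below assumes three states and hypothetical names (`MessageState`, `apply_receipt`); a real system may track more states.

```python
from enum import IntEnum


class MessageState(IntEnum):
    SENT = 1
    DELIVERED = 2
    READ = 3


def apply_receipt(current: MessageState, receipt: MessageState) -> MessageState:
    """Advance the state monotonically; late or duplicate receipts are ignored."""
    return max(current, receipt)
```

Because the update is a simple maximum, a "delivered" receipt that arrives after a "read" receipt cannot downgrade the message.
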
5. Message Sync
  • If user logs in on another device:
    • Backend pulls messages from message DB
    • Replays message history to sync conversation
    • The Sync Service then keeps all of the user's devices in sync
6. Group Chat Delivery (Fanout Model)
  • For group messages:
    • Sender sends message once to backend
    • Backend creates one copy per group member (N copies for N recipients)
    • Each message is routed via same logic (online vs offline)
  • Efficient fanout can be done using pub-sub; a Kafka topic per group is one option, though a topic per group becomes costly when there are very many groups
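
A minimal sketch of the fanout step, assuming the envelope is a plain dict and that `deliver` is whatever function implements the online/offline routing for a single recipient. Both the function name `fan_out` and the choice to skip the sender are assumptions.

```python
def fan_out(group_members, sender_id, envelope, deliver):
    """Deliver one copy per member (except the sender); return per-member results."""
    results = {}
    for member_id in group_members:
        if member_id == sender_id:
            continue
        # Each copy is addressed to one member; the encrypted payload is shared.
        copy = dict(envelope, recipient_id=member_id)
        results[member_id] = deliver(member_id, copy)
    return results
```

This is write-time fanout: the cost grows with group size, which is why very large groups are often handled differently from small ones.
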
7. End-to-End Encryption (E2EE)
  • Messages are encrypted on the sender's device
  • Backend never sees plaintext content
  • Only the receiver's device can decrypt using their private key
  • For groups, encryption is done using a group key shared among members

Scaling the System

Component: Scaling Strategy

  • WebSocket Connections: Use a pool of servers behind a load balancer with sticky sessions
  • Message Queues: Kafka/SQS to queue undelivered messages and decouple services
  • Database: Shard user data and messages by user ID or region
  • Media Storage: Store large files in object stores (S3/GCS) and serve via CDN
  • Caching: Redis for quick lookups (e.g., online status, recent messages)
  • Read Receipts: Use event streams to update message states in real time
  • Notification Service: Scale independently with mobile push (FCM/APNs)
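
Sharding by user ID can be sketched with a stable hash. The function name `shard_for` and the use of SHA-256 with a modulo are assumptions for illustration; production systems often prefer consistent hashing so that changing the shard count moves fewer keys.

```python
import hashlib


def shard_for(user_id: str, num_shards: int) -> int:
    """Map a user ID to a shard index using a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

A cryptographic hash is used here rather than Python's built-in `hash()`, which is randomized per process for strings and would route the same user to different shards on different servers.
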

Technical Challenges & Solutions

1. Connection Management

  • Challenge: Maintaining millions of concurrent WebSocket connections
  • Solution:
    • Dedicated connection servers that only hold sockets and forward messages to backend services
    • Heartbeat mechanisms to detect stale connections
    • Graceful connection handling for mobile devices (battery optimization)
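
A heartbeat check can be as simple as remembering when each connection was last heard from. In this sketch `HeartbeatTracker` is a hypothetical name, the 30-second timeout is an arbitrary assumption, and the caller passes the clock in explicitly so the logic is easy to test.

```python
class HeartbeatTracker:
    """Tracks the last heartbeat per connection and reports stale ones."""

    def __init__(self, timeout_s: float = 30.0):
        self.timeout_s = timeout_s
        self.last_seen = {}  # connection_id -> time of last heartbeat (seconds)

    def beat(self, connection_id, now: float) -> None:
        self.last_seen[connection_id] = now

    def reap_stale(self, now: float):
        """Remove and return connections silent for longer than the timeout."""
        stale = [c for c, t in self.last_seen.items() if now - t > self.timeout_s]
        for connection_id in stale:
            del self.last_seen[connection_id]
        return stale
```

On mobile, a longer heartbeat interval saves battery at the cost of slower detection of dead connections, so the timeout is a tuning decision rather than a fixed value.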

2. Message Ordering

  • Challenge: Ensuring messages appear in correct order across devices
  • Solution:
    • Lamport timestamps or vector clocks
    • Server-assigned sequence numbers per conversation
    • Client-side reordering when necessary
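
The sequence-number approach might look like the sketch below: the server hands out consecutive numbers per conversation, and the client holds back any message that arrives ahead of a gap. `Sequencer` and `ReorderBuffer` are hypothetical names, and the in-memory counter stands in for whatever durable counter a real deployment would use.

```python
import itertools
from collections import defaultdict


class Sequencer:
    """Server side: consecutive sequence numbers per conversation, starting at 1."""

    def __init__(self):
        self._counters = defaultdict(lambda: itertools.count(1))

    def next_seq(self, conversation_id) -> int:
        return next(self._counters[conversation_id])


class ReorderBuffer:
    """Client side: release messages in sequence order, holding back gaps."""

    def __init__(self):
        self.next_expected = 1
        self.pending = {}  # seq -> message that arrived early

    def receive(self, seq: int, message):
        if seq < self.next_expected:
            return []  # duplicate of something already shown
        self.pending[seq] = message
        ready = []
        while self.next_expected in self.pending:
            ready.append(self.pending.pop(self.next_expected))
            self.next_expected += 1
        return ready
```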

3. Offline Message Delivery

  • Challenge: Ensuring reliable delivery when users go offline/online
  • Solution:
    • Message queue for undelivered messages
    • Message retention policies
    • Delivery receipts and retry mechanisms
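
Retries imply that a message may arrive more than once, so the receiver has to deduplicate. The following sketch pairs a server-side queue that keeps messages until acknowledged with a receiver that drops duplicates by message ID. `OutboundQueue`, `Receiver`, and the `max_attempts` limit are assumptions for illustration.

```python
class OutboundQueue:
    """Server side: keep each message until it is acked, up to max_attempts sends."""

    def __init__(self, max_attempts: int = 5):
        self.max_attempts = max_attempts
        self.pending = {}  # message_id -> [message, attempts so far]

    def enqueue(self, message_id, message) -> None:
        self.pending[message_id] = [message, 0]

    def due_for_retry(self):
        """Return (message_id, message) pairs to send on this retry pass."""
        batch = []
        for message_id, entry in list(self.pending.items()):
            if entry[1] >= self.max_attempts:
                del self.pending[message_id]  # give up; a real system might dead-letter it
                continue
            entry[1] += 1
            batch.append((message_id, entry[0]))
        return batch

    def ack(self, message_id) -> None:
        self.pending.pop(message_id, None)


class Receiver:
    """Client side: accept each message ID once, but acknowledge every copy."""

    def __init__(self):
        self.seen = set()
        self.inbox = []

    def on_message(self, message_id, message):
        if message_id not in self.seen:
            self.seen.add(message_id)
            self.inbox.append(message)
        return message_id  # the ack; sent even for duplicates so the server stops retrying
```

Together these give at-least-once delivery on the wire and exactly-once display on the device.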

4. Media Handling

  • Challenge: Efficiently storing and delivering large media files
  • Solution:
    • Progressive upload/download
    • Multiple resolutions based on device/connection
    • Background compression and optimization
    • Separate storage path from message content
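
Progressive upload is often built on fixed-size chunks with per-chunk checksums, so an interrupted transfer can resume from the missing pieces. This sketch uses a hypothetical `ChunkedUpload` class and SHA-256 checksums; neither is taken from a specific product.

```python
import hashlib


class ChunkedUpload:
    """Server-side state for one resumable upload split into fixed-size chunks."""

    def __init__(self, total_size: int, chunk_size: int):
        self.total_size = total_size
        self.chunk_size = chunk_size
        self.chunks = {}  # byte offset -> chunk data

    def put(self, offset: int, data: bytes, checksum: str) -> bool:
        """Store a chunk if its SHA-256 checksum matches; reject it otherwise."""
        if hashlib.sha256(data).hexdigest() != checksum:
            return False
        self.chunks[offset] = data
        return True

    def missing_offsets(self):
        """Offsets the client still has to send (used to resume an upload)."""
        return [o for o in range(0, self.total_size, self.chunk_size)
                if o not in self.chunks]

    def assemble(self) -> bytes:
        if self.missing_offsets():
            raise ValueError("upload incomplete")
        return b"".join(self.chunks[o] for o in sorted(self.chunks))
```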

5. Multi-Device Sync

  • Challenge: Keeping conversation state consistent across devices
  • Solution:
    • Central message store with device-specific cursors
    • Conflict resolution strategies
    • Selective sync for older messages
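
Device-specific cursors can be sketched as one append-only log per conversation plus a "last acknowledged sequence number" per device. `ConversationLog` and its method names are hypothetical, and the in-memory list stands in for the central message store.

```python
class ConversationLog:
    """Central store for one conversation with a sync cursor per device."""

    def __init__(self):
        self.messages = []  # index i holds the message with sequence number i + 1
        self.cursors = {}   # device_id -> last sequence number the device acknowledged

    def append(self, message) -> int:
        self.messages.append(message)
        return len(self.messages)  # sequence number of the new message

    def pull(self, device_id, limit: int = 100):
        """Return (new_cursor, batch) with messages the device has not acknowledged."""
        start = self.cursors.get(device_id, 0)
        batch = self.messages[start:start + limit]
        return start + len(batch), batch

    def ack(self, device_id, seq: int) -> None:
        # Cursors only move forward, so a stale ack cannot cause re-delivery.
        self.cursors[device_id] = max(self.cursors.get(device_id, 0), seq)
```

A newly linked device simply starts with no cursor and pulls from the beginning; the `limit` parameter is where selective sync of older history would hook in.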

Security Considerations

End-to-End Encryption Implementation

  • Each user generates public/private key pairs on device registration
  • Public keys are exchanged via the server
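  • Messages are encrypted for the recipient using keys derived from the recipient's public key (typically a symmetric session key agreed via key exchange, rather than encrypting each message directly with the public key)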
  • Server only sees encrypted content
  • Key verification through QR codes or verification numbers
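
To illustrate the verification-number idea: both users can derive the same short digit string from the two public keys and compare it out of band. The function below is a simplified, hypothetical scheme (`safety_number`, SHA-256, six groups of five digits); it is not the algorithm used by WhatsApp or Signal.

```python
import hashlib


def safety_number(public_key_a: bytes, public_key_b: bytes, groups: int = 6) -> str:
    """Derive a comparable digit string from two public keys (illustrative only)."""
    if not 1 <= groups <= 8:
        raise ValueError("groups must be between 1 and 8")  # a 32-byte digest yields 8 groups
    # Sort so that both parties compute the same value regardless of argument order.
    first, second = sorted([public_key_a, public_key_b])
    digest = hashlib.sha256(
        hashlib.sha256(first).digest() + hashlib.sha256(second).digest()
    ).digest()
    chunks = []
    for i in range(groups):
        value = int.from_bytes(digest[i * 4:(i + 1) * 4], "big") % 100000
        chunks.append(f"{value:05d}")
    return " ".join(chunks)
```

If the server substituted a different public key for either user, the two devices would display different numbers, which is what the comparison is meant to catch.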

Data Protection

  • Message content encrypted in transit and at rest
  • Metadata minimization where possible
  • Automatic message expiry options
  • Secure deletion mechanisms

Privacy Features

  • Read receipt controls
  • Last seen privacy options
  • Profile photo visibility settings
  • Group invitation controls

Conclusion

Building a WhatsApp-like system requires careful consideration of real-time communication, encryption, offline capabilities, and scale. The architecture must balance immediate message delivery with reliability, while ensuring end-to-end encryption protects user privacy. By separating concerns into specialized services and implementing proper scaling strategies, you can create a messaging system that handles millions of concurrent users while maintaining performance and security.