System Design: Zoom / Video Conferencing Platform
Video conferencing platforms like Zoom have become ubiquitous for remote collaboration. Designing such a system requires balancing real-time audio and video delivery with reliability and scalability across the globe. This guide walks through the core components, data flow, and scaling strategies for a video conferencing platform capable of hosting millions of concurrent users.
What’s the Goal?
- Start or join video meetings, whether authenticated or as a guest
- Deliver real-time audio and video with minimal latency
- Provide additional collaboration tools like chat, screen sharing, and reactions
- Maintain connection stability across various network conditions
- Allow recordings, mute control, and participant roles
- Scale smoothly to millions of simultaneous users
Core Components
Component | What It Is | Why It's Needed |
---|---|---|
WebRTC | Real-time media exchange | For audio/video streaming |
SFU | Selective media router | To scale group calls |
TURN Server | Media relay server | Handles strict firewalls/NATs |
STUN Server | Public IP discovery | To establish peer connections |
Signaling Server | WebRTC metadata exchanger | For session setup handshake |
UDP Protocol | Fast, connectionless transport | Low-latency stream delivery |
ICE Candidates | Network path options | Picks best route for connection |
Recording Service | Stream capture and storage | Save meetings for later use |
Chat/Metadata Server | Event & message handler | For chat, raise hand, controls |
Auth Service | User identity manager | Access control & meeting roles |
Scaling Each Component
Component | Scale Strategy |
---|---|
Signaling Server | Stateless WebSocket servers behind a load balancer. Each client connects to any node; metadata is passed via pub/sub. Easy horizontal scaling as connections grow. |
SFU (Media Server) | Geo-distributed SFUs (one per region) route streams between users. Stateless and horizontally scalable—spin up more nodes as users increase. Meeting-aware routing assigns participants to the same SFU. |
TURN/STUN Servers | Deployed in multiple regions. Auto-scale TURN clusters to handle edge traffic. STUN is lightweight and easy to scale. |
Metadata Server | Use pub/sub (Redis Streams or Kafka) or clustered gRPC servers to handle chat events, hand raise, and mute/unmute. Stateless and replicated for fault tolerance. |
Recording Service | Runs asynchronously. Streams are copied from SFU and sent to recorder pods. Processed files are stored in S3 or GCS. |
Step 1: User Starts or Joins a Meeting
- Frontend: The user opens a meeting link like https://zoom.com/meet/abc123.
- Authentication: If the user is signed in, a JWT or session token verifies identity. Guests may join with limited permissions.
- Meeting Session: The backend creates the meeting session (or joins the client to an existing one) and records the participant in the database.
- Signaling Connection: The client establishes a WebSocket connection to the nearest signaling server to begin WebRTC negotiation.
Step 2: Connection Establishment at Scale
- Geo-routing: The backend assigns the user to a nearby SFU based on IP or geolocation. This reduces latency by keeping traffic local.
- WebSocket Connection: The browser maintains a persistent WebSocket to the signaling server cluster. The servers are stateless, so any node can handle the connection.
- WebRTC Handshake: ICE candidates and Session Description Protocol (SDP) messages are exchanged over the signaling channel. Clients gather candidates via STUN/TURN and exchange them with the SFU through the signaling server.
- Media Routing: Once negotiation completes, the client sends media streams over UDP to the SFU. The SFU forwards streams to other participants in the call.
- TURN/STUN Fallback: If a participant’s network blocks direct traffic, the TURN server relays traffic through a public address.
- Scalability Notes:
- Each SFU handles thousands of concurrent streams.
- Because SFUs hold only per-meeting routing state, adding more nodes allows near-linear scaling.
- A load balancer or meeting-aware router ensures all participants in the same meeting connect to the same SFU for efficiency.
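A meeting-aware router can be sketched as follows: the first participant to join pins the meeting to the SFU nearest their region, and later joiners are sent to the same node. The region names and hostnames here are invented for illustration.

```python
# Hypothetical SFU fleet; hostnames and regions are illustrative.
SFUS = {
    "us-east": "sfu-use1.example.com",
    "eu-west": "sfu-euw1.example.com",
    "ap-south": "sfu-aps1.example.com",
}

_meeting_sfu: dict[str, str] = {}  # meeting_id -> assigned SFU host

def assign_sfu(meeting_id: str, client_region: str) -> str:
    """Pin a meeting to one SFU; all later participants reuse the assignment."""
    if meeting_id not in _meeting_sfu:
        # Fall back to a default region when no SFU exists near the client.
        _meeting_sfu[meeting_id] = SFUS.get(client_region, SFUS["us-east"])
    return _meeting_sfu[meeting_id]
```

A real deployment would store the assignment in a shared cache (e.g. Redis) so every router node agrees, and might cascade between SFUs for very geographically dispersed meetings.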
Step 3: Media Routing Layer
- 1-on-1 Calls: If there are only two participants, WebRTC can connect them directly peer to peer. However, many platforms still route through the SFU to simplify NAT traversal and allow features like recording.
- Multi-user Meetings: For group calls, each participant sends a single stream up to the SFU. The SFU then selectively forwards streams to others, reducing upstream bandwidth requirements. This is more efficient than each user sending separate streams to every other participant.
- Stream Forwarding: The SFU analyzes each participant’s network conditions and may adjust bitrate or resolution for optimal experience. It can also perform simulcast or SVC (scalable video coding) to adapt quality on the fly.
- Low Latency: Using UDP and keeping SFU servers close to users ensures end-to-end latency is kept under a few hundred milliseconds.
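The simulcast layer selection mentioned above can be sketched as a simple policy: the SFU forwards the highest-quality layer that fits the receiver's estimated bandwidth, keeping some headroom. The layer names and bitrates are assumptions, not real codec settings.

```python
# Hypothetical simulcast layers, highest quality first; bitrates are illustrative.
LAYERS = [  # (name, bitrate_kbps)
    ("1080p", 3000),
    ("720p", 1500),
    ("360p", 600),
    ("180p", 150),
]

def pick_layer(available_kbps: float) -> str:
    """Forward the best layer that fits the receiver's estimated bandwidth."""
    for name, kbps in LAYERS:
        if kbps <= available_kbps * 0.85:  # keep ~15% headroom for bursts
            return name
    return LAYERS[-1][0]  # always forward at least the lowest layer
```

Real SFUs make this decision continuously per receiver, driven by congestion-control feedback (e.g. RTCP reports) rather than a single bandwidth estimate.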
Step 4: Chat, Reactions, and Controls
Audio and video aren’t the only data flowing during a meeting. Participants send chat messages, raise hands, mute/unmute themselves, and share screens. These non-media events typically travel over the WebSocket channel or a gRPC stream handled by the metadata server. By keeping this control layer separate from the heavy media streams, we can scale each piece independently and ensure reliability even if video quality fluctuates.
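The control layer's fan-out can be sketched with a tiny in-memory pub/sub, standing in for the Redis Streams or Kafka backbone described earlier; the function names are illustrative.

```python
from collections import defaultdict

# Minimal in-memory stand-in for the metadata server's pub/sub fan-out.
_subscribers = defaultdict(list)  # meeting_id -> list of delivery callbacks

def subscribe(meeting_id: str, callback) -> None:
    """Register a participant's delivery callback for a meeting's events."""
    _subscribers[meeting_id].append(callback)

def publish(meeting_id: str, event: dict) -> None:
    """Deliver a control event (chat, hand raise, mute) to every subscriber."""
    for callback in _subscribers[meeting_id]:
        callback(event)
```

In the real system each callback would be a WebSocket or gRPC stream to a client, and events would also be persisted for chat history and auditing.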
Step 5: Recording & Storage
- SFU Copy: When recording is enabled, the SFU duplicates incoming media streams and forwards them to a dedicated recording service.
- Processing: The recording service writes streams to disk, possibly transcodes them, and packages audio/video into a final file format (like MP4).
- Cloud Storage: Completed recordings are uploaded to S3, GCS, or a similar object store. Links to these files are saved in a recordings table.
- Access: Participants can later download or stream these recordings based on their permissions.
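The final storage step can be sketched as a helper that derives the object-store key and the row written to the recordings table. The bucket name, key layout, and column names are assumptions for illustration.

```python
from datetime import datetime, timezone

def finalize_recording(meeting_id: str, size_bytes: int) -> dict:
    """Build the object-store key and recordings-table row for a finished file.

    The bucket name and key scheme are hypothetical.
    """
    created_at = datetime.now(timezone.utc)
    key = f"recordings/{meeting_id}/{created_at:%Y%m%dT%H%M%S}.mp4"
    return {
        "meeting_id": meeting_id,
        "url": f"s3://meeting-recordings/{key}",
        "created_at": created_at.isoformat(),
        "size": size_bytes,
    }
```

Access control would then be enforced at download time, typically by issuing short-lived signed URLs rather than exposing the object store directly.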
Step 6: Post-Meeting Tasks
After the meeting ends, asynchronous workers handle follow-up actions. A queue or message bus like Kafka triggers items such as:
- Sending email summaries to attendees
- Pushing notifications for shared recordings
- Collecting feedback through surveys
- Generating meeting insights or transcripts via AI services
These tasks are decoupled from the real-time meeting flow so they don’t impact call quality or latency.
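Fanning out these follow-up actions can be sketched with a queue; here Python's in-process `queue.Queue` stands in for Kafka, and the task names are illustrative.

```python
import queue

# In-memory queue standing in for Kafka; each item is one follow-up task.
task_queue: "queue.Queue[dict]" = queue.Queue()

POST_MEETING_TASKS = (
    "email_summary",
    "recording_notification",
    "feedback_survey",
    "ai_transcript",
)

def schedule_post_meeting_tasks(meeting_id: str) -> int:
    """Enqueue one message per follow-up action; workers consume them later."""
    for task in POST_MEETING_TASKS:
        task_queue.put({"meeting_id": meeting_id, "task": task})
    return task_queue.qsize()
```

Because the meeting is already over when these run, workers can retry freely on failure without any impact on call quality.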
Database Design (Simplified)
Even though most media is streamed through ephemeral servers, the platform still needs a database for persistent data like users, meetings, chat logs, and recording links. Here’s a simplified schema:
Users Table
user_id | name | auth_token | role |
---|---|---|---|
Meetings Table
meeting_id | host_id | start_time | status |
---|---|---|---|
Participants Table
meeting_id | user_id | join_time | left_time | is_muted |
---|---|---|---|---|
Messages Table
meeting_id | user_id | message | timestamp |
---|---|---|---|
Recordings Table
meeting_id | url | created_at | size |
---|---|---|---|
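The schema above can be expressed as DDL. This sketch uses SQLite for self-containment; the column types are assumptions, and a production system would more likely use Postgres with indexes on meeting_id and user_id.

```python
import sqlite3

# The simplified schema above as SQLite DDL; types are illustrative.
DDL = """
CREATE TABLE users        (user_id TEXT PRIMARY KEY, name TEXT, auth_token TEXT, role TEXT);
CREATE TABLE meetings     (meeting_id TEXT PRIMARY KEY, host_id TEXT, start_time TEXT, status TEXT);
CREATE TABLE participants (meeting_id TEXT, user_id TEXT, join_time TEXT, left_time TEXT, is_muted INTEGER);
CREATE TABLE messages     (meeting_id TEXT, user_id TEXT, message TEXT, timestamp TEXT);
CREATE TABLE recordings   (meeting_id TEXT, url TEXT, created_at TEXT, size INTEGER);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```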
Scaling the System
To serve millions of users, each component must be able to scale horizontally and handle bursts of load. Here are some high-level strategies:
- Load Balancers and Auto-scaling: Deploy SFUs, signaling servers, and TURN servers behind load balancers. Auto-scale pods or VMs based on CPU and network metrics.
- Regional Clusters: Place clusters in multiple regions to keep latency low for global users. A front-end router directs users to the nearest region.
- Message Queues: Use Kafka or RabbitMQ for asynchronous tasks like recording processing, chat persistence, and analytics aggregation.
- Microservices Architecture: Break the platform into smaller services (auth, signaling, SFU, recording, metadata) so each can scale and deploy independently.
- Stateless Services: Keep as many components stateless as possible. Persist session metadata in Redis or a distributed cache to allow servers to be replaced without losing state.
- Monitoring and QoS: Track end-to-end latency, packet loss, and server health metrics. Dynamically adjust bitrate or switch to audio-only when networks degrade.
- Security: Implement TLS for signaling channels, encrypt media using SRTP, and enforce authentication/authorization for meeting access and recording downloads.
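The QoS strategy above, degrading quality as networks worsen, can be sketched as a simple policy function. The loss thresholds and mode names are invented for illustration; real systems combine loss, jitter, and round-trip time in their congestion controllers.

```python
# Hypothetical QoS policy: degrade quality as packet loss rises, dropping to
# audio-only under severe loss. Thresholds are illustrative.
def target_mode(packet_loss_pct: float) -> str:
    if packet_loss_pct < 2:
        return "full_video"
    if packet_loss_pct < 8:
        return "reduced_bitrate"
    if packet_loss_pct < 20:
        return "low_res_video"
    return "audio_only"
```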
Putting It All Together
When a user clicks a meeting link, a flurry of small actions occurs that most participants never notice:
- The frontend authenticates the user (or marks them as a guest) and connects to the closest signaling server.
- The signaling server orchestrates a WebRTC handshake, exchanging ICE candidates gathered from STUN/TURN.
- The backend assigns all participants to the same regional SFU. Media streams flow through the SFU over UDP.
- Chat messages, reactions, and metadata events flow over a separate WebSocket or gRPC channel. These are persisted for later viewing or auditing.
- If recording is enabled, the SFU forks a copy of each track to the recording service, which writes the data to object storage.
- When the meeting ends, asynchronous workers process the recording files, update the database, and send follow-up emails or analytics events.
- Users can return later to watch recordings or review chat logs, thanks to persistent data stored in the database tables.
By carefully decoupling concerns and using a mixture of stateless services, message queues, and region-aware routing, a video conferencing platform can scale to handle large volumes of simultaneous meetings with minimal latency.
Challenges and Considerations
While the basic design above covers core functionality, real-world deployments face numerous challenges:
- Network Variability: Participants may join from high-latency mobile networks or corporate VPNs. Adaptive bitrate, jitter buffers, and forward error correction are essential to maintain quality.
- Firewall/NAT Traversal: Many enterprise networks block UDP or restrict outbound ports. TURN servers must be highly available in multiple regions to relay streams in these cases.
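The jitter buffer mentioned above can be sketched as a small reordering window: packets arrive out of order over UDP, are held briefly, and are released in sequence-number order. This is a minimal sketch; real jitter buffers also conceal loss (via FEC or retransmission) and adapt their depth to measured jitter.

```python
import heapq

class JitterBuffer:
    """Hold a small window of packets and release them in sequence order."""

    def __init__(self, depth: int = 3):
        self.depth = depth
        self._heap: list[tuple[int, bytes]] = []

    def push(self, seq: int, payload: bytes) -> list[tuple[int, bytes]]:
        """Insert a packet; return any packets released from the window."""
        heapq.heappush(self._heap, (seq, payload))
        released = []
        while len(self._heap) > self.depth:
            released.append(heapq.heappop(self._heap))
        return released
```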
Example End-to-End Flow
Let’s walk through a typical meeting from start to finish, highlighting how each component interacts:
- User Authentication: Alice logs in and clicks “Start Meeting.” The frontend sends her token to the auth service, which verifies it and returns a meeting ID.
- Joining a Meeting: Bob receives a link to join. He opens it in his browser, which connects to the nearest signaling server over WebSocket.
- Negotiation: Alice’s and Bob’s browsers gather ICE candidates using STUN. These candidates, along with SDP offers and answers, flow through the signaling server and reach the SFU.
- Media Exchange: Both clients begin streaming audio and video to the SFU. The SFU forwards streams to each participant, optimizing bitrate based on network conditions.
- Chat & Controls: Bob types a message in chat. It travels over the metadata channel to Alice’s client. Alice shares her screen; the SFU forwards this track just like camera video.
- Recording: With recording enabled, the SFU duplicates streams to the recording service. When the meeting ends, the recording is stored in object storage.
Throughout this flow, the system maintains low latency by keeping media paths short and relying on stateless servers whenever possible.
Conclusion
Designing a reliable video conferencing platform is a complex undertaking. At its core, the system relies on WebRTC for real-time media, SFUs for scalable routing, TURN/STUN for network traversal, and a variety of stateless services that can scale horizontally. Efficient scaling requires strategic placement of SFUs, auto-scaling of supporting services, and careful monitoring of network performance.
With these components in place, Zoom and similar platforms can deliver smooth, low-latency communication to users across the globe. The architecture can grow to support massive meetings, asynchronous features like recording and transcription, and advanced collaboration tools—all while keeping the user experience responsive.