How MongoDB Works from Start to Finish
-
You install MongoDB, which runs a server called mongod - it listens for connections from clients.
-
Instead of rows and tables, MongoDB stores data in collections and documents.
- A collection is like a table.
- A document is a single record, stored in JSON-like format (called BSON = Binary JSON).
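A single document might look like this in the shell (the users collection and its fields are invented for illustration):

```javascript
// One document in a hypothetical "users" collection
// (BSON on disk; shown here in the shell's JSON-like syntax)
{
  _id: ObjectId("665f1c2e9b1e8a3d4c5b6a70"),   // added automatically if omitted
  name: "Ada",
  email: "ada@example.com",
  tags: ["admin", "beta"],                     // arrays are first-class values
  address: { city: "London", zip: "NW1" }      // documents can nest
}
```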
-
You connect to MongoDB using a client: the mongosh shell, a language driver, or a GUI like Compass.
-
When you insert a document, MongoDB stores it as a BSON object inside the appropriate collection.
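A minimal sketch in mongosh; the shop database and orders collection are invented for illustration:

```javascript
// Connect first: mongosh "mongodb://localhost:27017"
use shop                                   // switch to (and lazily create) a database

// The collection is created on first insert; no DDL needed
db.orders.insertOne({ item: "mug", qty: 2, price: 8.5 })
// => { acknowledged: true, insertedId: ObjectId("...") }
```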
-
MongoDB is schemaless - you don’t need to define columns in advance. Each document in a collection can have different fields.
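For instance, these two differently-shaped documents can coexist in one collection (names invented):

```javascript
// No migration needed: each document carries its own shape
db.people.insertMany([
  { name: "Ada",   languages: ["OCaml", "Python"] },
  { name: "Grace", rank: "Rear Admiral", born: 1906 }
])
```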
-
With the default WiredTiger engine, writes go first to an in-memory cache and the journal, then are flushed to the data files at periodic checkpoints. (Memory-mapped files and storage “extents” belonged to the legacy MMAPv1 engine, which was removed in MongoDB 4.2.)
-
Every document has a unique _id field (like a primary key), and you can query using that or any other field.
-
MongoDB maintains indexes to make queries faster - including default index on _id, and optional ones on other fields.
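A sketch of creating secondary indexes in mongosh (the users collection and its fields are assumptions):

```javascript
// _id is indexed automatically; add secondary indexes as needed
db.users.createIndex({ email: 1 }, { unique: true })  // 1 = ascending; doubles as a constraint
db.users.createIndex({ "address.city": 1 })           // nested fields can be indexed too
```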
-
When you run a query (like find), MongoDB:
- Uses indexes if available.
- Otherwise, scans documents in the collection.
- Returns matching documents (BSON on the wire, rendered as JSON by the shell).
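Two illustrative queries; whether an index scan or a collection scan happens depends on what indexes exist:

```javascript
// Equality match: an index scan if db.users has an email index
db.users.find({ email: "ada@example.com" })

// Range filter plus sort: a collection scan unless a suitable index exists
db.orders.find({ qty: { $gte: 2 } }).sort({ price: -1 }).limit(10)
```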
-
For updates, MongoDB can modify a part of a document ($set, $inc, etc.) or replace the whole thing.
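For example (orderId is assumed to hold an _id fetched earlier):

```javascript
// Modify parts of a document in place ...
db.orders.updateOne(
  { _id: orderId },                                    // filter
  { $set: { status: "shipped" }, $inc: { qty: -1 } }   // targeted changes
)

// ... or swap out everything except _id
db.orders.replaceOne({ _id: orderId }, { item: "mug", qty: 1, status: "shipped" })
```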
-
MongoDB uses Write-Ahead Logging (WiredTiger Journal) - before writing to disk, it writes changes to a journal for crash recovery.
-
MongoDB can be run in replica sets - where one node is primary (handles writes), and others are secondaries (replicas for reads or failover).
-
It supports automatic failover - if the primary goes down, a secondary becomes the new primary.
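A minimal replica-set bootstrap, assuming three mongod instances already running with --replSet rs0 (hostnames invented):

```javascript
// Run once against any one member; the set then elects a primary
rs.initiate({
  _id: "rs0",                                // must match each mongod's --replSet value
  members: [
    { _id: 0, host: "db1.example.com:27017" },
    { _id: 1, host: "db2.example.com:27017" },
    { _id: 2, host: "db3.example.com:27017" }
  ]
})
rs.status()                                  // shows who is PRIMARY vs SECONDARY
```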
-
MongoDB supports sharding for horizontal scaling - it splits large collections across multiple servers (called shards) based on a shard key.
-
Clients send queries to a mongos router, which figures out which shard(s) have the data and routes the request accordingly.
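Sharding commands are issued through mongos; the database, collection, and shard key here are illustrative:

```javascript
// Issued against mongos, not an individual shard
sh.enableSharding("shop")
sh.shardCollection("shop.orders", { customerId: 1 })  // ranged shard key
sh.status()                                           // chunk counts per shard
```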
-
MongoDB supports aggregation pipelines - a way to process and transform data, covering the ground of SQL GROUP BY, WHERE filters, joins ($lookup), and more.
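A sketch of a pipeline over hypothetical orders and customers collections, combining a filter, a grouping, and a join:

```javascript
db.orders.aggregate([
  { $match: { status: "shipped" } },                        // like WHERE
  { $group: {                                               // like GROUP BY
      _id: "$customerId",
      total: { $sum: { $multiply: ["$price", "$qty"] } }
  } },
  { $sort: { total: -1 } },
  { $lookup: {                                              // left outer join
      from: "customers", localField: "_id",
      foreignField: "_id", as: "customer"
  } }
])
```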
-
Reads from the primary are strongly consistent by default; reads from secondaries in replicated and sharded setups can lag and are only eventually consistent. You control these trade-offs with write concerns, read concerns, and read preferences.
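For example, in mongosh (collection names carried over from earlier sketches):

```javascript
// Block until a majority of replica-set members have journaled the write
db.orders.insertOne(
  { item: "mug", qty: 1 },
  { writeConcern: { w: "majority", j: true } }
)

// Accept possibly-stale results in exchange for offloading the primary
db.orders.find({ status: "shipped" }).readPref("secondaryPreferred")
```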
-
MongoDB also supports transactions: single-document operations are always atomic, and multi-document transactions are available since v4.0 (extended to sharded clusters in v4.2), though with some performance cost.
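A sketch of a multi-document transaction in mongosh, assuming a replica set and pre-existing collections (names invented):

```javascript
const session = db.getMongo().startSession()
session.startTransaction({ writeConcern: { w: "majority" } })
try {
  const orders = session.getDatabase("shop").orders
  const stock  = session.getDatabase("shop").stock
  orders.insertOne({ item: "mug", qty: 1 })
  stock.updateOne({ item: "mug" }, { $inc: { onHand: -1 } })
  session.commitTransaction()        // both writes become visible atomically
} catch (e) {
  session.abortTransaction()         // neither write is applied
  throw e
} finally {
  session.endSession()
}
```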
Deep Dive: MongoDB Architecture
Storage Engine
MongoDB’s storage engine (WiredTiger, available since version 3.0 and the default since 3.2) is responsible for how data is stored on disk and in memory:
-
Document Storage: Documents are stored in collections, which WiredTiger lays out on disk as B-tree structures.
-
Memory Management:
- WiredTiger uses an internal cache to keep frequently accessed data in memory
- MongoDB also leverages the operating system’s page cache
- The filesystem cache holds pages in their compressed on-disk form, complementing the uncompressed copies in WiredTiger’s cache
-
Compression:
- Documents are compressed with snappy by default; zlib and (since 4.2) zstd are also available
- Indexes are compressed as well, using prefix compression
- For many workloads this substantially reduces on-disk size, though the exact ratio depends on the data
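If the default doesn’t suit a collection, the compressor can be overridden at creation time; this sketch assumes MongoDB 4.2+ for zstd and an invented collection name:

```javascript
// Per-collection override of the default (snappy) block compressor
db.createCollection("logs", {
  storageEngine: { wiredTiger: { configString: "block_compressor=zstd" } }
})
```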
-
Journaling:
- Before any write is committed to the main data files, it’s recorded in the journal
- Journal entries include enough information to replay operations in case of a crash
- Journal files are synced to disk more frequently than data files (default: every 100ms)
Query Execution
When MongoDB processes a query, it goes through several stages:
-
Query Planning:
- The query optimizer analyzes the query and available indexes
- It generates multiple query plans and chooses the most efficient one
- Plan selection is cached for similar future queries
-
Execution:
- MongoDB retrieves documents either via index scans or collection scans
- For complex queries, it might use a combination of approaches
- Results are assembled in memory (with limits to prevent excessive memory usage)
-
Cursor Management:
- Results are returned via cursors, which allow batch processing
- By default the first batch contains 101 documents (or enough to reach 1 MB); subsequent batches are capped at 16 MB
- Cursors time out after 10 minutes of inactivity by default
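Two ways to observe execution from mongosh (collection and field names are illustrative):

```javascript
// Reveal the chosen plan: IXSCAN vs COLLSCAN, keys and docs examined
db.orders.find({ customerId: 42 }).explain("executionStats")

// Override the server's batching when streaming a large result set
db.orders.find().batchSize(500).forEach(doc => printjson(doc))
```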
Replication Mechanics
MongoDB’s replica sets provide redundancy and high availability:
-
Oplog (Operations Log):
- All write operations on the primary are recorded in a special capped collection called the oplog
- Secondary nodes continuously pull operations from the primary’s oplog
- Each operation is idempotent (can be applied multiple times with same result)
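You can inspect the oplog directly from mongosh; it lives in the local database:

```javascript
use local                                           // the oplog lives here
db.oplog.rs.find().sort({ $natural: -1 }).limit(1)  // most recent operation
```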
-
Elections:
- When a primary becomes unavailable, remaining nodes hold an election
- Election factors include: node priority, replication lag, and connectivity
- A majority of nodes must participate for an election to succeed
-
Read Preferences:
- Applications can specify where reads should be routed:
  - primary: Always read from primary (default, strongest consistency)
  - primaryPreferred: Primary if available, otherwise secondary
  - secondary: Only read from secondaries
  - secondaryPreferred: Secondary if available, otherwise primary
  - nearest: Read from the lowest-latency member
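A read preference can be set per cursor, per connection, or in the connection string (hosts here are invented):

```javascript
// Per connection, from the shell ...
db.getMongo().setReadPref("secondaryPreferred")

// ... or for a whole application via the connection string:
// mongodb://db1.example.com,db2.example.com/shop?replicaSet=rs0&readPreference=nearest
```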
Sharding Internals
For horizontal scaling, MongoDB distributes data across multiple machines:
-
Chunks:
- Collections are divided into chunks (64 MB by default before MongoDB 6.0, 128 MB since)
- Each chunk contains a range of shard key values
- The balancer process migrates chunks between shards to ensure even distribution
-
Config Servers:
- Store metadata about the cluster’s sharded data
- Track which shard contains which chunks
- Implemented as a replica set for reliability
-
Shard Key Selection:
- Critical decision that affects performance and scalability
- Ideal shard keys have high cardinality and even distribution
- Poor shard key choice can lead to “jumbo chunks” that cannot be split
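For example, a monotonically increasing key such as a timestamp or ObjectId funnels all inserts to one shard; hashing it trades efficient range queries for even write distribution (names illustrative):

```javascript
// Hashed shard key: writes spread evenly, but ranged queries hit every shard
sh.shardCollection("shop.events", { _id: "hashed" })
```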
MongoDB vs. Traditional RDBMS
Understanding the key differences helps in designing optimal data models:
-
Schema Flexibility:
- RDBMS: Fixed schema, alterations require DDL statements
- MongoDB: Dynamic schema, no migrations needed
-
Query Language:
- RDBMS: SQL (declarative)
- MongoDB: JSON-based query documents; aggregation pipelines are more procedural, staged transformations
-
Relationships:
- RDBMS: Foreign keys, joins are server-side
- MongoDB: References or embedded documents, joins ($lookup) are less efficient
-
ACID Guarantees:
- RDBMS: Strong ACID at database level
- MongoDB: Document-level atomicity by default, transaction support added later
-
Scalability Approach:
- RDBMS: Primarily vertical scaling, complex horizontal scaling
- MongoDB: Built for horizontal scaling via sharding
Best Practices for MongoDB
To get the most out of MongoDB:
-
Document Design:
- Embed related data that is queried together
- Use references when data is accessed independently or is very large
- Keep document size below the 16 MB hard limit
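A sketch of both patterns (collections and fields invented):

```javascript
// Embed: comments always load with their post, in one round trip
db.posts.insertOne({
  title: "Hello",
  comments: [{ by: "ada", text: "Nice!" }]
})

// Reference: large or independently accessed data lives in its own collection
const authorId = ObjectId("665f1c2e9b1e8a3d4c5b6a70")
db.posts.insertOne({ title: "Hello", authorId })
db.authors.findOne({ _id: authorId })
```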
-
Indexing Strategy:
- Index fields used in query criteria, sorting, and joining
- Create compound indexes for common query patterns
- Be cautious of index overhead on write operations
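For instance, one compound index can serve both the filter and the sort of a common query (names assumed):

```javascript
// One compound index covers the equality filter and the sort
db.orders.createIndex({ status: 1, orderedAt: -1 })
db.orders.find({ status: "shipped" }).sort({ orderedAt: -1 })  // no in-memory sort
```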
-
Query Optimization:
- Use the explain() method to analyze query performance
- Structure queries to utilize indexes
- Use projection to return only needed fields
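An illustrative projection; only the listed fields cross the wire:

```javascript
// Return only the fields the client needs; _id must be excluded explicitly
db.users.find(
  { "address.city": "London" },     // filter
  { name: 1, email: 1, _id: 0 }     // projection
)
```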
-
Hardware Considerations:
- Prioritize RAM for working set
- Use SSDs for improved random I/O performance
- Configure appropriate write concern based on durability needs
-
Monitoring:
- Track key metrics: operation latency, connection count, queue size
- Monitor disk space and memory usage
- Use MongoDB Atlas or MongoDB Ops Manager for comprehensive monitoring
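A handful of these metrics are visible directly from mongosh via serverStatus() (Atlas and Ops Manager surface the same data with dashboards):

```javascript
const s = db.serverStatus()
printjson({
  connections: s.connections,     // current vs. available
  opLatencies: s.opLatencies,     // read / write / command latencies
  cacheBytes:  s.wiredTiger.cache["bytes currently in the cache"]
})
```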
MongoDB’s architecture makes it particularly well-suited for applications with evolving schemas, high write loads, and requirements for horizontal scaling. Understanding these internals helps developers and architects make informed decisions about when and how to use MongoDB effectively.