How Elasticsearch Works from Start to Finish
- You start an Elasticsearch cluster - it can be one or more servers (called nodes).
- You create an index - like a database where your data will be stored.
- You send your data (in JSON format) into that index using an API (see the end-to-end example after this list).
- Elasticsearch stores your data as documents inside the index.
- It automatically breaks the index into smaller pieces called shards so it can handle large data and distribute load.
- It also makes copies (replicas) of shards so your data is safe even if one server fails.
- When you insert a document, Elasticsearch figures out which shard to store it in using a hash function.
- Before storing, it processes the text fields (breaks them into words, lowercases them, etc.) so it can search them easily later.
- Once stored, the document becomes searchable shortly after (usually within 1 second, the default refresh interval).
- When you search, you send a query to Elasticsearch (like “find all users named John”).
- The node that receives your request (the coordinating node) sends the query to all relevant shards in the cluster.
- Each shard runs the query on its own data and returns results.
- The coordinating node collects the results from all shards, combines and sorts them, then sends them back to you.
- You get a list of matching documents, usually with relevance scores if it’s a text search.
- You can also run filters, group data (aggregations), and sort results easily.
- If you update a document, Elasticsearch marks the old version as deleted and indexes a new version behind the scenes.
- If you delete a document, it is marked for deletion and removed later during segment merging.
- Elasticsearch keeps track of changes and cluster health automatically - if a server goes down, replicas take over.
- You can use Kibana (a UI tool) to visualize and explore the data stored in Elasticsearch.
- That’s it: you store, search, update, and analyze data, all in near real-time, distributed, and fast.
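To make the walkthrough concrete, here is a minimal sketch of the full cycle over the REST API; the users index, field names, and document values are hypothetical.
Example:
# create an index (hypothetical name and mapping)
PUT /users
{ "mappings": { "properties": { "name": { "type": "text" } } } }

# index a document
POST /users/_doc
{ "name": "John Smith", "signed_up": "2024-01-15" }

# search it
GET /users/_search
{ "query": { "match": { "name": "john" } } }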
Elasticsearch Technical Deep Dive
Core Architecture
Elasticsearch’s distributed architecture consists of:
- Node Types:
  - Master-eligible nodes: Control the cluster, manage indices
  - Data nodes: Store data and execute search/index operations
  - Ingest nodes: Pre-process documents before indexing
  - Coordinating nodes: Route requests, merge results
  - Machine Learning nodes: Run ML jobs and analytics
- Cluster Formation:
  - Node discovery via configured seed hosts (Zen Discovery in pre-7.0 versions, a rebuilt cluster coordination layer since 7.0)
  - Split-brain prevention via quorum-based master voting
  - Cluster state holds metadata about indices, shards, and mappings
  - Cluster state is managed and published by the elected master node
- Communication Protocol:
  - REST API over HTTP for client-server interaction
  - Binary transport protocol over TCP for internal node-to-node communication
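The roles held by each node can be inspected with the cat nodes API; the output below is illustrative.
Example:
GET /_cat/nodes?v&h=name,node.role,master

name    node.role    master
node-1  cdfhilmrstw  *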
Index Structure
An Elasticsearch index is composed of:
- Shards:
  - Primary shards: The original partitions of an index's data
  - Replica shards: Copies of primary shards
  - Each shard is a self-contained Lucene index
  - Default of 1 primary and 1 replica per index in newer versions (7.0+)
- Segments:
  - Immutable Lucene data files
  - Small segments merge into larger ones (background process)
  - Deleted documents are only marked as deleted inside a segment, not removed immediately
  - Segment merging improves query performance and reclaims space from deleted documents
- Mappings:
  - Schema definition for documents
  - Field types: text, keyword, numeric, date, geo, etc.
  - Dynamic vs. explicit mapping
  - Multi-fields: Same data indexed multiple ways
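A minimal sketch of creating an index with explicit shard settings and a mapping that includes a multi-field; the index name and fields are hypothetical.
Example:
PUT /products
{
  "settings": { "number_of_shards": 3, "number_of_replicas": 1 },
  "mappings": {
    "properties": {
      "name":  { "type": "text", "fields": { "raw": { "type": "keyword" } } },
      "price": { "type": "double" },
      "added": { "type": "date" }
    }
  }
}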
Inverted Index
The heart of Elasticsearch’s search capability:
- Structure:
  - Terms dictionary: Vocabulary of all terms
  - Postings list: Documents containing each term
  - Term frequency: How often a term appears in a document
  - Position information: Where the term occurs in the document
- Tokenization Process:
  - Character filters: Pre-processing text
  - Tokenizers: Breaking text into tokens
  - Token filters: Transforming tokens (lowercasing, stemming)
  - Analyzers: Combinations of the above
Example:
Original: "The Quick Brown Fox"
Tokenized: ["the", "quick", "brown", "fox"]
Inverted Index:
"the" → Doc1, Doc3, Doc7...
"quick" → Doc1, Doc5...
Document Storage
How Elasticsearch manages document data:
- Document Routing:
  - Default: shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the document _id
  - Custom routing values can be supplied for data locality
  - The hash (Murmur3) spreads documents evenly; because the shard count is part of the formula, the number of primary shards is fixed at index creation
- Write Path:
  - Document arrives at a coordinating node
  - Routed to the primary shard
  - Indexed into the in-memory buffer and appended to the translog
  - Replicated to the replica shards
  - Acknowledged once the in-sync replicas confirm
- Refresh Process:
  - In-memory buffer written out as a new segment (default: every 1s)
  - Makes recently indexed data searchable
  - Controlled via the refresh_interval setting
- Flush Process:
  - A Lucene commit fsyncs segments to disk
  - Translog is trimmed
  - Ensures durability
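A sketch of tuning refresh behaviour; the index name is hypothetical, refresh_interval is a dynamic index setting, and refresh=wait_for blocks the indexing request until the next refresh makes the document searchable.
Example:
# slow down refreshes during heavy indexing (hypothetical index)
PUT /logs/_settings
{ "index": { "refresh_interval": "30s" } }

# index a document and wait until it is searchable
POST /logs/_doc?refresh=wait_for
{ "message": "user login", "timestamp": "2024-05-01T12:00:00Z" }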
Query Execution
The lifecycle of a search query:
- Query Types (combined in the example after this list):
  - Match: Full-text search with analysis
  - Term: Exact match without analysis
  - Bool: Combine queries with must/should/must_not/filter
  - Range: Numeric/date ranges
  - Geo: Spatial queries
- Query Phases:
  - Query Phase: Finds matching documents
  - Fetch Phase: Retrieves the documents
  - DFS Query Phase: Optional pre-query that gathers global term statistics for more accurate scoring
- Query Execution Flow:
  - Coordinating node receives the query
  - Broadcasts it to the relevant shards
  - Each shard executes the query locally
  - Returns document IDs and scores
  - Coordinating node merges and sorts the results
  - Requests the full documents for the top N hits
  - Returns the final response
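A minimal sketch combining these query types in a single request; the index and field names are hypothetical, and brand is assumed to be mapped as keyword. The filter clauses do not contribute to scoring and their results are cacheable.
Example:
GET /products/_search
{
  "query": {
    "bool": {
      "must":   [ { "match": { "description": "wireless headphones" } } ],
      "filter": [
        { "term":  { "brand": "acme" } },
        { "range": { "price": { "lte": 200 } } }
      ]
    }
  }
}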
Aggregations Framework
Elasticsearch’s analytics engine:
- Aggregation Types:
  - Bucket: Group documents (terms, date histogram)
  - Metric: Calculate metrics (avg, sum, stats)
  - Pipeline: Process the output of other aggregations
  - Matrix: Operate on multiple fields
- Execution Model:
  - Per-bucket collection of documents
  - Memory-bound operation
  - Circuit breakers prevent out-of-memory errors
  - Doc values optimize field access for aggregations
Example:
{
  "aggs": {
    "by_country": {
      "terms": { "field": "country" },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}
Distributed Operations
How Elasticsearch maintains consistency:
- Versioning:
  - Internal versioning: Auto-incremented per document
  - External versioning: Client-provided version numbers
  - Optimistic concurrency control (sequence numbers and primary terms in recent versions)
- Replication Model:
  - Primary-backup replication
  - Writes go to the primary shard, which forwards them to each in-sync replica; reads can be served by primary or replica copies
  - Replication is synchronous: the write is acknowledged once the in-sync copies confirm
  - wait_for_active_shards controls how many shard copies must be available before a write proceeds
- Cluster Health:
  - Green: All shards allocated
  - Yellow: All primaries allocated, some replicas missing
  - Red: Some primary shards not allocated
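These states can be checked with the cluster health API; the response below is illustrative and truncated.
Example:
GET /_cluster/health

{
  "status": "yellow",
  "number_of_nodes": 1,
  "active_primary_shards": 5,
  "unassigned_shards": 5
}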
Performance Optimization
Key techniques for faster Elasticsearch:
- Indexing Optimization:
  - Use the bulk API for throughput (see the example after this list)
  - Raise refresh_interval (or disable refresh) during heavy indexing
  - Update by query instead of full reindexing where possible
  - Keep sparse fields in separate indices
- Query Optimization:
  - Put non-scoring clauses in filter context (cacheable, no scoring overhead)
  - Avoid leading wildcards in wildcard queries
  - Use custom routing to target specific shards
  - Prefer doc values over in-memory fielddata
- Memory Management:
  - Fielddata circuit breaker limits memory use
  - Doc values (on disk) vs. in-memory fielddata
  - Heap sizing: at most ~50% of RAM and below ~32 GB so compressed object pointers stay enabled
  - Disk-based data structures for everything else
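A minimal sketch of the bulk API's newline-delimited format; the index name and documents are hypothetical, and each action line is followed by its source document.
Example:
POST /_bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "name": "Widget", "price": 9.99 }
{ "index": { "_index": "products", "_id": "2" } }
{ "name": "Gadget", "price": 19.99 }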
Advanced Features
- Elasticsearch SQL:
  - SQL interface to Elasticsearch (see the example at the end of this section)
  - JDBC/ODBC drivers available
  - Translates SQL statements into Elasticsearch queries
- Cross-Cluster Search:
  - Query across multiple clusters
  - Federated search capabilities
  - Remote cluster connections
- Index Lifecycle Management:
  - Automated index management
  - Hot-warm-cold-delete phases
  - Policy-based data retention
- Searchable Snapshots:
  - Search directly from snapshots/backups
  - Reduces local storage costs
  - Partial mounting capability
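A sketch of the SQL interface; the index ("table") and column names are hypothetical.
Example:
POST /_sql?format=txt
{
  "query": "SELECT name, price FROM products ORDER BY price DESC LIMIT 5"
}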