How Elasticsearch Works from Start to Finish

  1. You start an Elasticsearch cluster - it can be one or more servers (called nodes).

  2. You create an index - like a database where your data will be stored.

  3. You send your data (in JSON format) into that index using an API.

  4. Elasticsearch stores your data as documents inside the index.

  5. It splits the index into smaller pieces called shards (the number of primary shards is set when the index is created) so it can handle large data sets and distribute load.

  6. It also makes copies (replicas) of shards so your data is safe even if one server fails.

  7. When you insert a document, Elasticsearch decides which shard stores it by hashing the document's routing value (its ID by default).

  8. Before storing, it processes the text fields (breaks them into words, lowercases, etc.) so it can search them easily later.

  9. Once indexed, the document becomes searchable after the next refresh (by default, within about 1 second).

  10. When you search, you send a query to Elasticsearch (like “find all users named John”).

  11. The node you hit sends the query to all relevant shards in the cluster.

  12. Each shard runs the query on its own data and returns results.

  13. The node that received your request (the coordinating node) collects results from all shards, combines and sorts them, then sends them back to you.

  14. You get a list of matching documents, usually with relevance scores if it’s a text search.

  15. You can also run filters, group data (aggregations), and sort results easily.

  16. If you update a document, Elasticsearch deletes the old one and adds a new version behind the scenes.

  17. If you delete a document, Elasticsearch marks it as deleted and physically removes it later, when segments are merged.

  18. Elasticsearch keeps track of changes and health automatically - if a node goes down, replica shards on other nodes are promoted to primaries.

  19. You can use Kibana (a UI tool) to visualize and explore the data stored in Elasticsearch.

  20. That’s it: you store, search, update, and analyze data in a distributed, fast, near-real-time system.
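
As a rough illustration of steps 2, 3, and 10, the sketch below builds the three REST requests involved without sending them anywhere. The index name "users", the field "name", and the helper function names are hypothetical; the paths follow the standard create-index, index-document, and search endpoints.

```python
import json

# Hypothetical index name, used only for illustration.
INDEX = "users"

def create_index_request(shards=1, replicas=1):
    """Step 2: create an index with explicit shard and replica counts."""
    body = {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}
    return "PUT", f"/{INDEX}", json.dumps(body)

def index_document_request(doc):
    """Step 3: send one JSON document and let Elasticsearch generate the ID."""
    return "POST", f"/{INDEX}/_doc", json.dumps(doc)

def search_request(field, text):
    """Step 10: a full-text match query on a single field."""
    body = {"query": {"match": {field: text}}}
    return "GET", f"/{INDEX}/_search", json.dumps(body)

if __name__ == "__main__":
    requests_to_send = (
        create_index_request(),
        index_document_request({"name": "John", "age": 31}),
        search_request("name", "John"),
    )
    for method, path, body in requests_to_send:
        print(method, path, body)
```

Each tuple could be passed to any HTTP client pointed at a running cluster; nothing here depends on a specific client library.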

NOTE: The content below is additional technical knowledge and not necessary for basic understanding. Feel free to stop here if you're looking for just the essential process.

Elasticsearch Technical Deep Dive

Core Architecture

Elasticsearch’s distributed architecture consists of:

  1. Node Types:

    • Master-eligible nodes: Control the cluster, manage indices
    • Data nodes: Store data and execute search/index operations
    • Ingest nodes: Pre-process documents before indexing
    • Coordinating nodes: Route requests, merge results (every node can do this; a coordinating-only node has no other role)
    • Machine Learning nodes: Run ML jobs and analytics
  2. Cluster Formation:

    • Uses the built-in cluster coordination subsystem for node discovery and master election (Zen Discovery in versions before 7.0)
    • Implements split-brain prevention via quorum-based voting
    • Cluster State contains metadata about indices, shards, mappings
    • Managed by elected master node
  3. Communication Protocol:

    • Uses TCP for node-to-node communication
    • REST API over HTTP for client-server interaction
    • Binary protocol for internal transport
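
The quorum rule behind split-brain prevention can be illustrated with a few lines of arithmetic. This is a simplified sketch of majority voting, not the actual election algorithm, and the function names are made up for this example.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1

def can_elect_master(partition_size: int, master_eligible_nodes: int) -> bool:
    """A network partition can elect a master only if it holds a majority."""
    return partition_size >= quorum(master_eligible_nodes)

if __name__ == "__main__":
    # Three master-eligible nodes split into partitions of 2 and 1:
    # at most one side can reach a majority, so two masters never coexist.
    total = 3
    for partition in (2, 1):
        print(partition, "of", total, "->", can_elect_master(partition, total))
```

Because two disjoint partitions cannot both hold a strict majority, at most one of them can elect a master at any time.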

Index Structure

An Elasticsearch index is composed of:

  1. Shards:

    • Primary shards: Original data segments of an index
    • Replica shards: Copies of primary shards
    • Each shard is a self-contained Lucene index
    • Default of 1 primary shard and 1 replica per index since version 7.0 (5 primaries before that)
  2. Segments:

    • Immutable Lucene data files
    • Small segments merge into larger ones (background process)
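    • Deleted documents are only marked as deleted in per-segment live-docs files (.liv in current Lucene, .del in old versions) and purged when segments merge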
    • Segment merging improves query performance
  3. Mappings:

    • Schema definition for documents
    • Field types: text, keyword, numeric, date, geo, etc.
    • Dynamic vs explicit mapping
    • Multi-fields: Same data indexed multiple ways
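
To make the mapping concepts concrete, here is a sketch of a create-index body with an explicit mapping and one multi-field. The field names (title, price, created_at, location) are hypothetical; the point is the mapping syntax.

```python
import json

# Hypothetical fields for a product index; only the mapping syntax matters here.
index_body = {
    "settings": {"number_of_shards": 1, "number_of_replicas": 1},
    "mappings": {
        "dynamic": "strict",  # reject documents that contain unmapped fields
        "properties": {
            "title": {
                "type": "text",  # analyzed for full-text search
                "fields": {"raw": {"type": "keyword"}},  # exact-match copy (multi-field)
            },
            "price": {"type": "float"},
            "created_at": {"type": "date"},
            "location": {"type": "geo_point"},
        },
    },
}

if __name__ == "__main__":
    # This JSON would be the body of a create-index request.
    print(json.dumps(index_body, indent=2))
```

With this mapping, full-text queries would target title, while sorting and terms aggregations would use title.raw.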

Inverted Index

The heart of Elasticsearch’s search capability:

  1. Structure:

    • Terms dictionary: Vocabulary of all terms
    • Postings list: Documents containing each term
    • Term frequency: How often term appears in a document
    • Position information: Where term occurs in document
  2. Tokenization Process:

    • Character filters: Pre-processing text
    • Tokenizers: Breaking text into tokens
    • Token filters: Transforming tokens (lowercasing, stemming)
    • Analyzers: Combinations of the above

Example:

Original: "The Quick Brown Fox"
Tokenized: ["the", "quick", "brown", "fox"]
Inverted Index:
"the" → Doc1, Doc3, Doc7...
"quick" → Doc1, Doc5...
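
The same idea can be sketched in a few lines of Python. This is a toy model rather than how Lucene stores its data: the analyzer only splits on non-alphanumeric characters and lowercases, and the function names are invented for this example.

```python
import re
from collections import defaultdict

def analyze(text):
    """Toy analyzer: split on non-alphanumeric characters, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}.
    The length of a position list is the term frequency in that document."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(analyze(text)):
            index[term].setdefault(doc_id, []).append(position)
    return index

def search(index, query):
    """Return the IDs of documents that contain every term of the query."""
    postings = [set(index.get(term, {})) for term in analyze(query)]
    return set.intersection(*postings) if postings else set()

if __name__ == "__main__":
    docs = {
        "Doc1": "The Quick Brown Fox",
        "Doc2": "The lazy dog",
        "Doc3": "Quick quick fox",
    }
    index = build_inverted_index(docs)
    print(sorted(index["quick"].items()))
    print(sorted(search(index, "quick fox")))
```

Note that the query goes through the same analyzer as the documents, which is why a search for "Quick" can match the stored term "quick".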

Document Storage

How Elasticsearch manages document data:

  1. Document Routing:

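    • Default: shard = hash(_routing) % number_of_primary_shards, where _routing defaults to the document ID
    • Custom routing for data locality
    • Plain modulo hashing (not consistent hashing) spreads documents evenly, which is why the primary shard count cannot change without reindexing or the split/shrink APIs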
  2. Write Path:

    • Document arrives at coordinating node
    • Routed to primary shard
    • Indexed in memory buffer and appended to the translog
    • Forwarded to in-sync replica shards
    • Acknowledged once the in-sync replicas confirm
  3. Refresh Process:

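    • In-memory buffer written to a new segment (default: every 1s)
    • New segment opened for search (segments are immutable, never reopened for writes)
    • Makes recently indexed data searchable, without an fsync
    • Controlled via the index.refresh_interval setting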
  4. Flush Process:

    • Lucene commit: segments fsynced to disk
    • Translog truncated, since its operations are now in committed segments
    • Between flushes, the translog provides durability
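
The routing formula and the difference between refresh and flush can be modeled with a toy shard. This is a heavily simplified sketch: CRC32 stands in for the Murmur3 hash Elasticsearch uses, and the ToyShard class and its attributes are invented for illustration.

```python
import zlib

def route(routing_key: str, num_primary_shards: int) -> int:
    """Pick a shard by hashing the routing value (the document ID by default).
    CRC32 is a stand-in here; Elasticsearch uses a Murmur3 hash."""
    return zlib.crc32(routing_key.encode("utf-8")) % num_primary_shards

class ToyShard:
    """Very simplified model of one shard's write, refresh, and flush steps."""

    def __init__(self):
        self.buffer = []    # indexed but not yet searchable
        self.translog = []  # would be replayed after a crash
        self.segments = []  # immutable and searchable
        self.committed = 0  # segments made durable by the last flush

    def index(self, doc):
        self.buffer.append(doc)
        self.translog.append(doc)

    def refresh(self):
        """Turn the buffer into a new immutable, searchable segment."""
        if self.buffer:
            self.segments.append(tuple(self.buffer))
            self.buffer = []

    def flush(self):
        """Commit all segments, then clear the translog."""
        self.refresh()
        self.committed = len(self.segments)
        self.translog = []

    def searchable_docs(self):
        return [doc for segment in self.segments for doc in segment]

if __name__ == "__main__":
    shard = ToyShard()
    shard.index({"_id": "1", "name": "John"})
    print("searchable before refresh:", len(shard.searchable_docs()))
    shard.refresh()
    print("searchable after refresh:", len(shard.searchable_docs()))
    # Changing the shard count changes where many IDs land.
    ids = [f"doc-{i}" for i in range(1000)]
    moved = sum(route(i, 5) != route(i, 6) for i in ids)
    print(f"{moved} of {len(ids)} documents would move going from 5 to 6 shards")
```

The last lines show why modulo routing ties documents to a fixed shard count: with a different count, many IDs hash to a different shard.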

Query Execution

The lifecycle of a search query:

  1. Query Types:

    • Match: Full-text search with analysis
    • Term: Exact match without analysis
    • Bool: Combine queries with must/should/must_not
    • Range: Numeric/date ranges
    • Geo: Spatial queries
  2. Query Phases:

    • Query Phase: Finds matching documents
    • Fetch Phase: Retrieves documents
    • DFS Query Phase: Optional pre-query round (search_type=dfs_query_then_fetch) that gathers global term statistics for more accurate scoring
  3. Query Execution Flow:

    • Coordinating node receives query
    • Broadcasts to relevant shards
    • Each shard executes locally
    • Returns document IDs and scores
    • Coordinating node merges results
    • Requests full documents for top N results
    • Returns final response
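
The flow above is a scatter-gather pattern, sketched below with in-memory dictionaries standing in for shards. The scoring (a raw term count) and all function names are hypothetical simplifications; real scoring uses BM25.

```python
import heapq

def query_phase(shard_docs, term, size):
    """Runs on each shard: score local documents and return only
    the top (score, doc_id) pairs, not the documents themselves."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term)  # toy score: term count
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(size, hits)

def search(shards, term, size=2):
    """Runs on the coordinating node: scatter, merge, then fetch."""
    candidates = []
    for shard_id, shard_docs in enumerate(shards):
        for score, doc_id in query_phase(shard_docs, term, size):
            candidates.append((score, doc_id, shard_id))
    top = heapq.nlargest(size, candidates)
    # Fetch phase: only the winning documents are loaded in full.
    return [
        {"_id": doc_id, "_score": score, "text": shards[shard_id][doc_id]}
        for score, doc_id, shard_id in top
    ]

if __name__ == "__main__":
    shards = [
        {"a": "john smith", "b": "john john"},
        {"c": "mary jones", "d": "john john john"},
    ]
    for hit in search(shards, "john"):
        print(hit)
```

Each shard only needs to return its own top N candidates, because the global top N is always contained in the union of the per-shard top N lists.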

Aggregations Framework

Elasticsearch’s analytics engine:

  1. Aggregation Types:

    • Bucket: Group documents (terms, date histogram)
    • Metric: Calculate metrics (avg, sum, stats)
    • Pipeline: Process output of other aggregations
    • Matrix: Operate on multiple fields
  2. Execution Model:

    • Per-bucket collection of documents
    • Memory-bound operation
    • Circuit breakers reject requests that would exceed memory limits, which helps avoid OutOfMemoryError
    • Doc values optimize for aggregations

Example:

{
  "aggs": {
    "by_country": {
      "terms": { "field": "country" },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}
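
To show what this request computes, here is a plain-Python equivalent over a small in-memory list. It is only a sketch of the result shape (buckets with key, doc_count, and a nested avg_price); the sample documents and the function name are made up.

```python
from collections import defaultdict

def terms_with_avg_price(docs, bucket_field="country", metric_field="price"):
    """Group documents by bucket_field and average metric_field per bucket."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[bucket_field]].append(doc[metric_field])
    buckets = [
        {
            "key": key,
            "doc_count": len(values),
            "avg_price": {"value": sum(values) / len(values)},
        }
        for key, values in groups.items()
    ]
    # Largest buckets first; ties are broken by key to keep the output stable.
    buckets.sort(key=lambda bucket: (-bucket["doc_count"], bucket["key"]))
    return buckets

if __name__ == "__main__":
    docs = [
        {"country": "US", "price": 10},
        {"country": "US", "price": 20},
        {"country": "DE", "price": 5},
    ]
    for bucket in terms_with_avg_price(docs):
        print(bucket)
```

In a real cluster each shard computes partial buckets and the coordinating node merges them, so this single-process version only illustrates the grouping logic.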

Distributed Operations

How Elasticsearch maintains consistency:

  1. Versioning:

    • Internal versioning: Auto-incremented on every change to a document
    • External versioning: Client-provided
    • Optimistic concurrency control (via if_seq_no and if_primary_term in current versions)
  2. Replication Model:

    • Primary-backup replication
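    • Primary handles writes first; both primaries and replicas serve reads
    • Primary forwards each write to all in-sync replicas and acknowledges after they respond (synchronous)
    • wait_for_active_shards sets how many shard copies must be available before a write proceeds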
  3. Cluster Health:

    • Green: All primary and replica shards allocated
    • Yellow: All primaries allocated, some replicas missing
    • Red: Some primary shards not allocated
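
The three health colors reduce to a simple rule, sketched below. The shard dictionaries are a hypothetical representation, and treating every state other than "STARTED" as unallocated is a simplification of the real shard states.

```python
def cluster_health(shards):
    """Derive a health color from a list of shard copies.

    Each shard is a dict like {"primary": True, "state": "STARTED"}.
    Simplification: any state other than "STARTED" counts as unallocated.
    """
    unallocated = [shard for shard in shards if shard["state"] != "STARTED"]
    if any(shard["primary"] for shard in unallocated):
        return "red"      # at least one primary is missing: some data unavailable
    if unallocated:
        return "yellow"   # all primaries are up, but some replicas are missing
    return "green"        # every primary and replica is allocated

if __name__ == "__main__":
    shards = [
        {"primary": True, "state": "STARTED"},
        {"primary": False, "state": "UNASSIGNED"},
    ]
    print(cluster_health(shards))
```

A single-node cluster with the default of one replica per shard lands in the yellow case, because a replica is never placed on the same node as its primary.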

Performance Optimization

Key techniques for faster Elasticsearch:

  1. Indexing Optimization:

    • Bulk indexing for throughput
    • Optimal refresh interval
    • Update by query instead of reindexing
    • Sparse fields in separate indices
  2. Query Optimization:

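    • Put yes/no clauses in filter context: they skip scoring and can be cached
    • Avoid leading wildcards (e.g. *term), which force a scan of the terms dictionary
    • Use routing for targeted queries
    • Prefer doc values over fielddata for sorting and aggregations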
  3. Memory Management:

    • Field data circuit breaker
    • Doc values vs. in-memory field data
    • Heap sizing: at most 50% of RAM, and below the compressed object pointers threshold (roughly 32GB)
    • Disk-based data structures
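
Two of these techniques, bulk indexing and filter context, come down to how the request body is shaped. The sketch below builds both bodies without sending them; the index name and field names (title, country) are hypothetical.

```python
import json

def build_bulk_body(index, docs):
    """Build the newline-delimited JSON body for the _bulk endpoint.
    Each document becomes an action line followed by a source line."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"

def filtered_query(text, country):
    """A scored match clause plus a non-scored filter clause."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"title": text}}],        # contributes to _score
                "filter": [{"term": {"country": country}}],  # yes/no only, cacheable
            }
        }
    }

if __name__ == "__main__":
    docs = [("1", {"title": "red shoes"}), ("2", {"title": "blue shoes"})]
    print(build_bulk_body("products", docs), end="")
    print(json.dumps(filtered_query("shoes", "US")))
```

Sending many documents in one bulk request amortizes the per-request overhead, and moving exact-match conditions into the filter list avoids scoring work for clauses that do not affect relevance.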

Advanced Features

  1. Elasticsearch SQL:

    • SQL interface to Elasticsearch
    • JDBC/ODBC drivers available
    • Translates to Elasticsearch queries
  2. Cross-Cluster Search:

    • Query across multiple clusters
    • Federated search capabilities
    • Remote cluster connections
  3. Index Lifecycle Management:

    • Automated index management
    • Hot, warm, cold, frozen, and delete phases
    • Policy-based data retention
  4. Searchable Snapshots:

    • Search data directly from snapshots held in a snapshot repository
    • Reduces storage costs
    • Partial mounting capability
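
As an illustration of a policy-based lifecycle, the sketch below defines a small Index Lifecycle Management policy as a Python dictionary. The phase timings are arbitrary example values, not recommendations, and the policy would need to be sent to the ILM policy endpoint to take effect.

```python
import json

# Example values only: roll over weekly, compact after 30 days, delete after 90.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {"rollover": {"max_age": "7d"}}
            },
            "warm": {
                "min_age": "30d",
                "actions": {"forcemerge": {"max_num_segments": 1}},
            },
            "delete": {
                "min_age": "90d",
                "actions": {"delete": {}},
            },
        }
    }
}

if __name__ == "__main__":
    # This JSON would be the body of a request that creates the policy.
    print(json.dumps(ilm_policy, indent=2))
```

Phases that are left out (cold and frozen here) are simply skipped, so a policy only needs to list the transitions it actually uses.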