Search Engine Indexing
Indexing is the process by which a search engine organizes and stores content from web pages so it can be retrieved quickly when users enter search queries. It happens after a page is crawled and involves parsing, processing, and storing the content in an efficient structure called an inverted index.
We first describe the individual steps in indexing, namely crawling, parsing, tokenization, normalization, building the inverted index, and querying the inverted index. We then describe techniques for storing and optimizing the index for faster queries, explain why indexing is well suited to querying large, unstructured datasets, and list essential Python libraries for implementing indexing. Finally, we summarize the discussion.
Steps involved in indexing
Indexing involves crawling, parsing, tokenization, normalization, and building an inverted index. We discuss each of them in the following sections.
Crawling
Crawling is the first step in the indexing process. Search engines use automated programs called crawlers or bots (e.g., Googlebot) to discover publicly available web pages on the internet. The purpose of a crawler is:
- To discover new web pages.
- To detect updates or changes on existing pages.
- To follow hyperlinks and find more content.
Crawlers follow instructions defined by website owners using the robots.txt file. This file specifies which pages can or cannot be crawled.
# Example robots.txt
User-agent: *
Disallow: /private/
Allow: /public/
Implementation of crawling
- Seed URLs: The crawler starts with a list of known web page URLs (seed list).
- Fetching Content: It sends HTTP requests to download the HTML of each page.
- Extracting Links: It scans the page to find all hyperlinks (anchor tags).
- URL Discovery: New links found on the page are added to the crawl queue if allowed.
- Repeat: This process continues recursively, expanding the crawler's reach across the web.
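The loop above can be sketched with Python's standard library. To keep the sketch self-contained and safe to run, the pages and URLs below are made-up stand-ins for real HTTP fetches; a real crawler would issue requests and respect robots.txt:

```python
from collections import deque
from html.parser import HTMLParser

# A tiny in-memory "web" standing in for real HTTP fetches (hypothetical URLs)
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
}

class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

def crawl(seed):
    queue, seen = deque([seed]), {seed}   # seed URL starts the crawl
    while queue:
        url = queue.popleft()
        html = PAGES.get(url, "")          # stand-in for an HTTP GET
        parser = LinkExtractor()
        parser.feed(html)                  # extract links from the page
        for link in parser.links:          # add newly discovered URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

print(sorted(crawl("http://example.com/")))
```

The queue-plus-visited-set structure is the core of every crawler; production systems add politeness delays, robots.txt checks, and URL canonicalization around this loop.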
Challenges in crawling
- Duplicate content (same content at multiple URLs)
- Dynamic content that requires JavaScript rendering
- Limited crawl budget (number of pages crawled per site)
- Blocked pages due to authentication or robots.txt
Parsing
Parsing is the process of analyzing the HTML content of a crawled web page to extract meaningful data for indexing. After the crawler downloads the page, the HTML is scanned and broken down into structured components.
Purpose of parsing
- To extract the main textual content for indexing.
- To identify metadata such as titles and descriptions.
- To detect links that can be followed for further crawling.
- To recognize structured data used for enhanced search results.
Common elements extracted during parsing
- Title: Extracted from the <title> tag. Used as the clickable headline in search results.
- Meta Description: Extracted from <meta name="description">. May be shown as the page summary in search listings.
- Headings: <h1> to <h6> tags are parsed to understand the page structure and topics.
- Body Content: The main text within <body> tags is extracted for keyword indexing.
- Links: All <a href="..."> tags are collected to find related pages and for link analysis.
- Alt Text: Extracted from alt attributes of <img> tags for image indexing and accessibility.
- Structured Data: JSON-LD, Microdata, or RDFa blocks provide machine-readable information such as reviews, events, or products.
Example HTML content
<html>
<head>
<title>AI Tools Overview</title>
<meta name="description" content="A guide to AI tools for various tasks.">
<script type="application/ld+json">
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "AI Tools Overview",
"author": "Example Author"
}
</script>
</head>
<body>
<h1>Popular AI Tools</h1>
<p>This page lists various AI tools used in automation.</p>
<a href="/more-tools">More Tools</a>
</body>
</html>
Parsing challenges
- JavaScript-rendered content may require headless browsers.
- Malformed HTML may cause parsing errors.
- Hidden or duplicate content can reduce parsing accuracy.
Tokenization
Tokenization is the process of breaking down the parsed textual content into individual units called tokens. Tokens are typically words or terms that are used to build the inverted index. This step prepares the text for efficient search and retrieval.
Purpose of tokenization
- To convert raw text into searchable terms.
- To enable faster lookups during search queries.
- To provide a standardized input for indexing and ranking.
Example
Original Text:
"Search engines index web pages efficiently."
Tokens:
["search", "engines", "index", "web", "pages", "efficiently"]
Tokenization techniques
- Whitespace Tokenization: Splits text on spaces. Simple and fast.
- Regex-Based Tokenization: Uses regular expressions to handle punctuation, numbers, symbols.
- Language-Specific Tokenization: Handles compound words, contractions, and scripts (e.g., Chinese, Arabic).
The tokenization algorithm also needs to handle edge cases. Some examples of edge cases are
- Words with apostrophes (e.g., "don’t")
- Hyphenated terms (e.g., "state-of-the-art")
- URLs, email addresses, and numbers
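One way to keep apostrophes and hyphens inside tokens is a slightly richer regex. The pattern below is an illustrative sketch, not a production tokenizer; URLs and email addresses would still need dedicated handling:

```python
import re

# \w+ optionally followed by groups of a hyphen or apostrophe plus more
# word characters, so "don't" and "state-of-the-art" stay single tokens
pattern = r"\b\w+(?:[-']\w+)*\b"

text = "Don't miss state-of-the-art tools"
tokens = re.findall(pattern, text.lower())
print(tokens)
# Output: ["don't", 'miss', 'state-of-the-art', 'tools']
```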
Example code
import re
text = "Search engines index web pages efficiently."
tokens = re.findall(r'\b\w+\b', text.lower())
print(tokens)
# Output: ['search', 'engines', 'index', 'web', 'pages', 'efficiently']
Normalization
Normalization is the process of standardizing tokens to ensure consistent and meaningful indexing. After tokenization, normalization improves search accuracy by reducing different forms of the same word to a common format.
Purpose of normalization
- To unify variations of the same word (e.g., "Running" → "run").
- To remove inconsistencies in capitalization, punctuation, and spelling.
- To reduce index size and improve match accuracy.
Common normalization steps
- Lowercasing: Converts all tokens to lowercase to avoid case mismatches. "Search" → "search"
- Stemming: Reduces a word to its base or root form by removing suffixes. "running", "runner" → "run"
- Lemmatization: Converts a word to its dictionary form using grammatical rules. "better" → "good", "was" → "be"
- Removing punctuation and special characters: "data-driven" → "datadriven" or "data", "driven"
- Removing stopwords: Common but less informative words are discarded. ["the", "is", "and", "on"] are often removed.
Challenges in normalization
- Over-stemming: Different words may be reduced to the same root incorrectly.
- Under-stemming: Variants may not reduce to a common form.
- Language sensitivity: Words in different languages require different rules.
Python code for normalization
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
import nltk
# Download the stopword list on first run
nltk.download('stopwords', quiet=True)
text = "The running man quickly ran to the data-driven lab."
tokens = re.findall(r'\b\w+\b', text.lower())
filtered = [w for w in tokens if w not in stopwords.words('english')]
stemmer = PorterStemmer()
normalized = [stemmer.stem(w) for w in filtered]
print(normalized)
# Output: ['run', 'man', 'quickli', 'ran', 'data', 'driven', 'lab']
Building the inverted index
The inverted index is the core data structure used in search engines to make searching fast and efficient. It maps each term (word) to a list of documents (and positions) where the term occurs. This allows the search engine to quickly retrieve relevant pages for any search query.
Purpose of the inverted index
- To store terms in a searchable format.
- To link each term to the documents that contain it.
- To support fast keyword lookups and query processing.
Basic structure
The inverted index is usually implemented as a dictionary or hash table:
{
"search": [1, 3],
"engine": [1],
"index": [1, 2, 3],
"fast": [3]
}
Here, the keys are terms, and the values are lists of document IDs where those terms appear.
Posting list details
Each entry in the posting list may include more than just the document ID:
- Document ID – Unique identifier for the document.
- Term frequency – Number of times the term appears in the document.
- Positions – The word positions within the document.
Python code for generating inverted index
from collections import defaultdict
# Step 1: Sample document collection with IDs
documents = {
1: "Search engines use an inverted index",
2: "The index helps retrieve documents",
3: "Fast search is made possible by the index"
}
# Step 2: Initialize an inverted index structure
# It maps each term to a list of (docID, position) pairs
inverted_index = defaultdict(list)
# Step 3: Loop through documents to build the index
for doc_id, text in documents.items():
# Convert text to lowercase and split into words
tokens = text.lower().split()
# Step 4: Track the position of each term in the document
for position, term in enumerate(tokens):
inverted_index[term].append({
"docID": doc_id,
"position": position
})
# Step 5: Print the inverted index
for term, postings in inverted_index.items():
print(term, postings)
The output of the program is:
search [{'docID': 1, 'position': 0}, {'docID': 3, 'position': 1}]
engines [{'docID': 1, 'position': 1}]
use [{'docID': 1, 'position': 2}]
an [{'docID': 1, 'position': 3}]
inverted [{'docID': 1, 'position': 4}]
index [{'docID': 1, 'position': 5}, {'docID': 2, 'position': 1}, {'docID': 3, 'position': 7}]
the [{'docID': 2, 'position': 0}, {'docID': 3, 'position': 6}]
helps [{'docID': 2, 'position': 2}]
retrieve [{'docID': 2, 'position': 3}]
documents [{'docID': 2, 'position': 4}]
fast [{'docID': 3, 'position': 0}]
is [{'docID': 3, 'position': 2}]
made [{'docID': 3, 'position': 3}]
possible [{'docID': 3, 'position': 4}]
by [{'docID': 3, 'position': 5}]
Indexing optimization
- Sorting: Document IDs in posting lists are sorted to enable fast merging during queries.
- Compression: Posting lists are compressed using techniques like delta encoding or variable-byte encoding to reduce storage size.
- Sharding: Large indexes are split across multiple machines (by term or document) for scalability.
Challenges
- High memory and storage requirements for large web corpora.
- Efficiently updating the index with new or deleted documents.
- Handling synonyms, typos, or multilingual content.
Example: Querying an inverted index
This example explains how a search engine uses an inverted index to process a query and find relevant documents efficiently. Suppose the search engine has indexed the following documents:
- Doc 1: How to build an inverted index in Python
- Doc 2: Learn how search engines use indexing
- Doc 3: Building an index for fast document retrieval
A user enters the query: build index. The engine tokenizes the query into ["build", "index"], normalizes it, and prepares to look up each term in its inverted index. The inverted index maps terms to document IDs like this:
{
"build": [1, 3],
"index": [1, 2, 3],
...
}
The engine retrieves:
build → [1, 3]
index → [1, 2, 3]
By intersecting the lists, it finds matching documents: [1, 3]. These are the ones that contain both terms. The engine then scores these documents based on relevance. For example, Document 1 may score higher because it has the exact phrase “build an inverted index”. Document 3 is still relevant, but its wording is slightly different (“building an index”).
The final results are shown in ranked order:
- How to build an inverted index in Python (Document 1)
- Building an index for fast document retrieval (Document 3)
If the user had typed "build an index", the word “an” (a stop word) would be ignored. If the query were in quotes ("build index"), the search would look for the exact phrase, possibly returning only Document 1.
Python code for querying on an inverted index
from collections import defaultdict
from nltk.stem import PorterStemmer
# Sample document collection
documents = {
1: "How to build an inverted index in Python",
2: "Learn how search engines use indexing",
3: "Building an index for fast document retrieval"
}
# Initialize a Porter stemmer for normalization
stemmer = PorterStemmer()
# Create an empty inverted index using defaultdict
inverted_index = defaultdict(list)
# Step 1: Build the inverted index with normalized (stemmed) tokens
for doc_id, text in documents.items():
# Convert text to lowercase and split into words
words = text.lower().split()
# Apply stemming to each word and remove duplicates using set
normalized_words = set(stemmer.stem(word) for word in words)
# Add each normalized word to the inverted index
for word in normalized_words:
inverted_index[word].append(doc_id)
# Step 2: Display the inverted index
print("Inverted Index (with stemming):")
for term in sorted(inverted_index):
print(f"{term}: {inverted_index[term]}")
# Step 3: Get the user query
query = "build index"
# Convert the query to lowercase and split into individual terms
query_terms = query.lower().split()
# Apply stemming to query terms for normalization
normalized_query = [stemmer.stem(term) for term in query_terms]
# Step 4: Retrieve the posting lists (sets of document IDs) for each term
doc_lists = []
for term in normalized_query:
if term in inverted_index:
doc_lists.append(set(inverted_index[term]))
else:
doc_lists.append(set()) # If term not found, use empty set
# Step 5: Perform intersection to find documents containing all query terms (AND query)
if doc_lists:
matching_docs = set.intersection(*doc_lists)
else:
matching_docs = set()
# Step 6: Display search results
print("\nSearch Results for:", query)
if matching_docs:
for doc_id in sorted(matching_docs):
print(f"Doc {doc_id}: {documents[doc_id]}")
else:
print("No matching documents found.")
Output of the Python code
Inverted Index (with stemming):
an: [1, 3]
build: [1, 3]
document: [3]
engin: [2]
fast: [3]
for: [3]
how: [1, 2]
in: [1]
index: [1, 2, 3]
invert: [1]
learn: [2]
python: [1]
retriev: [3]
search: [2]
to: [1]
use: [2]
Search Results for: build index
Doc 1: How to build an inverted index in Python
Doc 3: Building an index for fast document retrieval
Storage and optimization
After building the inverted index, search engines must store and manage it efficiently. Since the web contains billions of documents and trillions of words, optimizing the index for size, speed, and scalability is critical.
Storage format
- In-memory structures: During indexing, terms and posting lists are stored in memory using hash tables or dictionaries.
- On-disk storage: The final index is written to disk using efficient binary formats such as LevelDB, SSTables, or custom file formats.
- Segmented indexes: The index is stored in segments and periodically merged to improve update performance and reduce fragmentation.
Compression techniques
Posting lists (the lists of document IDs and metadata) are often very large. Compression reduces their size without affecting lookup speed.
- Delta encoding: Stores the difference between consecutive document IDs instead of full IDs. Example: [101, 105, 109] → [101, 4, 4]
- Variable-byte encoding: Uses fewer bytes for smaller numbers, making lists more compact.
- Bit-level compression: Stores values in packed bit formats for extreme space efficiency.
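Delta encoding is simple to sketch. The variable-byte encoder below follows one common convention, 7 data bits per byte with the high bit marking the final byte; real systems vary in which bit they set:

```python
def delta_encode(doc_ids):
    # Keep the first ID, then store gaps between consecutive IDs
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    # Running sum restores the original document IDs
    ids, current = [], 0
    for gap in gaps:
        current += gap
        ids.append(current)
    return ids

def vbyte_encode(n):
    # 7 data bits per byte; the high bit flags the last byte of the number
    out = []
    while True:
        out.insert(0, n & 0x7F)
        if n < 128:
            break
        n >>= 7
    out[-1] |= 0x80
    return bytes(out)

print(delta_encode([101, 105, 109]))  # [101, 4, 4]
print(vbyte_encode(5))                # one byte instead of a fixed-width int
```

Small gaps compress well because sorted posting lists make consecutive IDs close together, which is exactly why sorting and delta encoding are paired.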
Index partitioning (sharding)
To scale across many machines, large indexes are split into smaller parts called shards. Each shard handles a portion of the data.
- By term: Each machine stores a subset of terms (term-partitioned).
- By document: Each machine stores a subset of documents (doc-partitioned).
- Hybrid: A mix of both term and document partitioning.
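A document-partitioned layout can be sketched in a few lines; the modulo routing and toy documents below are illustrative assumptions, not how any particular engine assigns shards:

```python
NUM_SHARDS = 3

def shard_of(doc_id, num_shards=NUM_SHARDS):
    # Route each document to one shard (toy policy: modulo on the ID)
    return doc_id % num_shards

# Each shard keeps its own small inverted index
shards = [dict() for _ in range(NUM_SHARDS)]
docs = {1: "fast search", 2: "inverted index", 3: "search index", 4: "fast index"}
for doc_id, text in docs.items():
    index = shards[shard_of(doc_id)]
    for term in text.split():
        index.setdefault(term, []).append(doc_id)

def search(term):
    # Fan the query out to every shard, then merge the partial results
    results = []
    for index in shards:
        results.extend(index.get(term, []))
    return sorted(results)

print(search("index"))
```

In a doc-partitioned system every query touches every shard (scatter-gather), whereas a term-partitioned system touches only the shards owning the query terms but makes updates harder to route.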
Caching
Frequently accessed parts of the index are kept in memory (RAM) to reduce disk reads and improve query response time.
- Term cache: Keeps hot terms and their posting lists in memory.
- Query cache: Stores results of frequent queries for instant retrieval.
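A term cache can be sketched with Python's built-in memoization; the dictionary below stands in for an on-disk index, which is an assumption for illustration:

```python
from functools import lru_cache

# Toy stand-in for posting lists stored on disk
DISK_INDEX = {"search": (1, 3), "index": (1, 2, 3)}

@lru_cache(maxsize=1024)
def get_postings(term):
    # First call per term "hits disk"; repeats are served from memory
    return DISK_INDEX.get(term, ())

get_postings("search")   # miss: loaded from the backing store
get_postings("search")   # hit: returned from the in-memory cache
print(get_postings.cache_info())
```

Real engines size these caches by term popularity and evict with LRU-like policies, which `lru_cache` mimics at toy scale.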
Update and merge strategies
Because web content changes frequently, the index must support additions, deletions, and updates.
- Write-ahead logs: Track updates before they are merged into the main index.
- Segment merging: Combines smaller index segments periodically to keep the structure optimized.
- Versioning: Keeps multiple versions of documents to support rollback and consistency.
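Segment merging can be sketched as combining two term-to-postings maps while dropping deleted documents; the segment contents here are made-up examples:

```python
def merge_segments(old_seg, new_seg, deleted=frozenset()):
    """Merge two index segments; deleted document IDs are dropped."""
    merged = {}
    for segment in (old_seg, new_seg):
        for term, doc_ids in segment.items():
            kept = merged.setdefault(term, set())
            kept.update(d for d in doc_ids if d not in deleted)
    # Posting lists are kept sorted so later merges stay cheap
    return {term: sorted(ids) for term, ids in merged.items()}

old = {"search": [1, 3], "index": [1, 2]}
new = {"index": [4], "fast": [4]}
print(merge_segments(old, new, deleted={2}))
```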
Trade-offs
- Speed vs. Space: More compression reduces disk space but may slow down decompression during search.
- Freshness vs. Cost: Real-time indexing increases freshness but uses more resources.
- Memory vs. Disk: Keeping more data in RAM is faster but costlier.
Why indexing makes search fast
Indexing is what enables search engines to return relevant results in milliseconds. Instead of scanning every document at query time, the inverted index provides a direct mapping from query terms to documents. This significantly reduces computation and makes search scalable and efficient.
The inverted index allows the search engine to quickly find all documents containing a specific term without scanning entire files.
When a query contains multiple terms, the engine performs fast set operations (like intersection or union) on sorted posting lists.
- AND (intersection): Finds documents that contain all terms.
- OR (union): Finds documents that contain any of the terms.
This is much faster than checking each document manually.
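In Python, these set operations fall out directly when posting lists are stored as sets; the posting lists below are toy data:

```python
# Posting lists as sets of document IDs (toy data)
postings = {
    "search": {1, 3},
    "index": {1, 2, 3},
    "fast": {3},
}

# AND query: intersection keeps documents containing every term
print(postings["search"] & postings["fast"])   # {3}
# OR query: union keeps documents containing at least one term
print(postings["search"] | postings["fast"])   # {1, 3}
```

With sorted posting lists, production engines implement the same intersection as a linear merge of the two lists rather than hashing, which pairs well with the delta-encoded storage described earlier.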
During indexing, important ranking signals are calculated and stored with each posting:
- Term frequency (TF): How often the term appears in a document.
- Inverse document frequency (IDF): How rare the term is across all documents.
- TF-IDF or BM25: Combined score used to rank results.
- PageRank or link-based metrics: Scored and stored ahead of time.
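The TF and IDF signals can be computed in a few lines; the corpus below is a made-up toy example, and this plain TF-IDF is a simpler stand-in for the BM25-style scoring production engines use:

```python
import math

# Toy corpus (made-up documents)
docs = {
    1: "search engines index web pages",
    2: "the index maps terms to documents",
    3: "fast search needs an inverted index",
}
tokenized = {d: text.split() for d, text in docs.items()}
N = len(docs)

def tf(term, doc_id):
    # Fraction of the document's words that are this term
    words = tokenized[doc_id]
    return words.count(term) / len(words)

def idf(term):
    # Rarer terms across the corpus get higher weight
    df = sum(term in words for words in tokenized.values())
    return math.log(N / df) if df else 0.0

def tf_idf(term, doc_id):
    return tf(term, doc_id) * idf(term)

# "index" appears in every document, so its IDF (and TF-IDF) is zero;
# "engines" is rarer and therefore scores higher
print(tf_idf("index", 1), tf_idf("engines", 1))
```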
Search engines store cached versions of frequent queries and load high-demand index segments into memory. This reduces disk I/O and improves response time.
- Term cache: Keeps posting lists for frequent terms in memory.
- Query cache: Saves complete results for common searches.
The inverted index can be sharded across multiple servers, allowing distributed search. A query is processed in parallel across nodes, and results are merged.
- Horizontal scalability: More nodes can be added to handle more queries.
- Partial indexes: Each server holds only a slice of the index.
Since most of the processing (tokenization, normalization, scoring) is done during indexing, very little computation is needed at query time. This enables sub-second responses even on large datasets.
Useful Python libraries for implementing search via inverted index
- re for tokenization
- nltk.corpus for stopwords removal
- nltk.stem.PorterStemmer for normalization
- collections.defaultdict for creating the inverted index
Summary
Indexing is the foundation of how search engines organize, store, and retrieve web content efficiently. Without indexing, search engines would be forced to scan the entire web for each query, which is computationally impossible at scale.
Efficient indexing transforms unstructured web data into a structured and accessible format. It is the key reason why search engines are fast, relevant, and reliable. Every search query relies on the index to quickly locate and rank the best-matching results.
Author
Anurag Gupta is an M.S. graduate in Electrical and Computer Engineering from Cornell University. He also holds an M.Tech degree in Systems and Control Engineering and a B.Tech degree in Electrical Engineering from the Indian Institute of Technology, Bombay.