How Search Engines Work: From Crawling to Ranking

When you type a question into a search bar, what happens next feels like magic. Billions of pages exist on the web, yet somehow the best ones appear instantly in front of you. To understand this, let’s step into the world of crawling, indexing, and ranking—not as dry steps, but as a story of how machines turn the chaos of the internet into order.
Think of the internet as an infinite graph:
G=(V,E)
where V represents web pages (nodes) and E represents links (edges).
Search engines deploy “crawlers,” which move from node to node by following edges. Like a random walker in graph theory, an idealized crawler sitting on page u picks one of u's outgoing links uniformly at random:
P(v∣u)=1/outdeg(u)
meaning the chance of moving from page u to a linked page v depends on how many outgoing links u has.
In practice, crawlers don’t move randomly—they prioritize freshness, authority, and importance of pages. But at its heart, crawling is the process of discovering nodes in this massive web graph.
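To make the discovery step concrete, here is a minimal sketch of a crawler walking a tiny in-memory graph. The page names and the WEB dictionary are invented for illustration, and the breadth-first order stands in for the far richer prioritization (freshness, authority, importance) that real crawlers apply; nothing here fetches real URLs.

```python
from collections import deque

# Hypothetical toy web graph: page -> pages it links to (the edges E).
WEB = {
    "home": ["about", "blog"],
    "about": ["home"],
    "blog": ["post-1", "post-2", "home"],
    "post-1": ["blog"],
    "post-2": ["post-1", "deep-page"],
    "deep-page": [],
    "unlinked": ["home"],  # no page links here, so it cannot be discovered from "home"
}

def crawl(graph, seed):
    """Breadth-first discovery of every page reachable from `seed`."""
    frontier = deque([seed])   # pages waiting to be visited
    seen = {seed}              # pages already discovered
    discovered = [seed]        # discovery order, for inspection
    while frontier:
        page = frontier.popleft()
        for target in graph.get(page, []):
            if target not in seen:
                seen.add(target)
                discovered.append(target)
                frontier.append(target)
    return discovered

order = crawl(WEB, "home")
print(order)
```

Note that "unlinked" never shows up in the result: a page with no inbound edges is invisible to a crawler that only follows links, which is one reason sites submit sitemaps.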
Once discovered, each page must be stored and organized. Indexing transforms raw HTML into structured information. Imagine the index as a function:
I:word→{pages containing word}
For example:
I(“AI”)={p1,p2,p3,...}
This is called an inverted index—instead of mapping pages to words, it maps words to pages.
Just like a library catalog, it allows instant retrieval. When you search “how engines work,” the system intersects these sets:
I(“how”)∩I(“engines”)∩I(“work”)
and retrieves only the documents containing all three.
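The same idea fits in a few lines of code. The sketch below builds an inverted index over a made-up three-page corpus and answers a query by intersecting the sets; the page ids, the whitespace tokenizer, and the function names are assumptions for illustration, and production indexes add stemming, positions, compression, and much more.

```python
from collections import defaultdict

# Hypothetical mini-corpus: page id -> page text.
PAGES = {
    "p1": "how search engines work",
    "p2": "how car engines work",
    "p3": "search tips and tricks",
}

def build_index(pages):
    """Map each word to the set of pages containing it (the function I)."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the pages that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # .get avoids creating empty entries for unknown words.
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)

index = build_index(PAGES)
print(sorted(search(index, "how engines work")))  # ['p1', 'p2']
```

Because each lookup is a dictionary access followed by set intersections, the cost depends on the size of the matching sets rather than on the size of the whole corpus.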
Now comes the art and math of ranking. Not all pages are equal. Some are more authoritative, others are more relevant. Ranking blends multiple signals into a scoring function:
S(d,q)=w1⋅R(d,q)+w2⋅A(d)+w3⋅U(d)
where:
R(d,q) = relevance of document d to query q
A(d) = authority of the document (based on backlinks, trust, etc.)
U(d) = user experience signals (speed, mobile-friendliness, engagement)
w1,w2,w3 = weights tuned by the search engine
One famous early formula is PageRank. In simplified form, the importance of a page P is given by:
PR(P)=(1−d)/N+d⋅∑_{Q∈M(P)} PR(Q)/L(Q)
where:
d = damping factor (usually ~0.85)
N = total number of pages
M(P) = set of pages linking to P
L(Q) = number of outbound links from page Q
This elegant formula captures the intuition that a page is important if other important pages link to it.
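The formula can be evaluated by simple repeated substitution. The following is a minimal sketch on a hypothetical three-page graph, assuming every page has at least one outbound link (so L(Q) is never zero) and using a fixed number of iterations instead of a convergence test; the graph and function name are invented for illustration.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(P) = (1 - d)/N + d * sum(PR(Q)/L(Q) for Q in M(P)).

    `links` maps each page to the list of pages it links to.
    Assumes no page has an empty link list.
    """
    pages = list(links)
    n = len(pages)
    # M(P): the pages that link to P.
    inbound = {p: [q for q in pages if p in links[q]] for p in pages}
    # Start from a uniform distribution over pages.
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        pr = {
            p: (1 - d) / n + d * sum(pr[q] / len(links[q]) for q in inbound[p])
            for p in pages
        }
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links back to A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
for page, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(value, 3))
```

In this toy graph C ends up on top because both A and B link to it, and A comes second because it receives all of C's weight, which matches the intuition that importance flows along links.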
Early search engines relied heavily on link-based authority measures like PageRank, a mathematical model developed by Google’s founders in the late 1990s. Over time, these signals were enriched with more complex models that incorporated user behavior, semantic meaning, and real-time personalization.
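Once link-based authority becomes one signal among several, the blended score S(d,q) introduced earlier is easy to sketch. The signal values and the weights below are invented purely for illustration; real engines do not publish theirs, and each signal would come from its own subsystem rather than being typed in by hand.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    name: str
    relevance: float   # R(d, q), assumed precomputed in [0, 1]
    authority: float   # A(d), e.g. a normalized link-based score
    experience: float  # U(d), e.g. speed and mobile-friendliness

def score(doc, w1=0.6, w2=0.3, w3=0.1):
    """S(d, q) = w1*R(d, q) + w2*A(d) + w3*U(d), with hypothetical weights."""
    return w1 * doc.relevance + w2 * doc.authority + w3 * doc.experience

docs = [
    Doc("forum-thread", relevance=0.9, authority=0.2, experience=0.5),
    Doc("official-guide", relevance=0.8, authority=0.9, experience=0.8),
    Doc("slow-mirror", relevance=0.8, authority=0.9, experience=0.1),
]

ranked = sorted(docs, key=score, reverse=True)
print([doc.name for doc in ranked])
```

With these made-up numbers the most relevant page does not win: the forum thread matches the query best, but the two higher-authority pages outrank it, and user experience breaks the tie between them.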
Modern search engines use machine learning to estimate the probability that a document is relevant to a query:
P(relevant∣q,d)=fθ(q,d)
where fθ is a neural scoring function (often powered by deep learning models such as transformers). Unlike static formulas, these models learn from billions of past queries, clicks, and satisfaction signals.
In practice, this means the search engine no longer just asks: “Does the page contain the right words?” Instead, it asks: “Does this page truly answer what the user means?”
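A real fθ is a large neural network, but the shape of the idea shows up even in a deliberately tiny stand-in. The sketch below swaps in plain logistic regression for the neural model and trains it on a four-row, invented click log; the feature names, data, learning rate, and epoch count are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def f_theta(theta, features):
    """Estimate P(relevant | q, d) from a feature vector for the (q, d) pair."""
    return sigmoid(sum(t * x for t, x in zip(theta, features)))

def train(examples, lr=0.5, epochs=500):
    """Fit theta with stochastic gradient descent on the logistic loss."""
    theta = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for features, clicked in examples:
            error = f_theta(theta, features) - clicked
            theta = [t - lr * error * x for t, x in zip(theta, features)]
    return theta

# Hypothetical click log. Features: [bias, term_overlap, authority]; label: clicked or not.
CLICK_LOG = [
    ([1.0, 0.9, 0.8], 1),
    ([1.0, 0.8, 0.3], 1),
    ([1.0, 0.2, 0.9], 0),
    ([1.0, 0.1, 0.1], 0),
]

theta = train(CLICK_LOG)
p_good = f_theta(theta, [1.0, 0.85, 0.5])  # strong match with the query
p_bad = f_theta(theta, [1.0, 0.15, 0.5])   # weak match, same authority
print(p_good > p_bad)
```

The point is that nobody set the weights by hand: they fall out of the observed behavior, which is the difference between a learned scorer and a static formula.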
We are now entering a new era: the shift from retrieval engines (like Google Search or Bing) to generative engines (like ChatGPT, Perplexity, or Google’s Gemini-powered Search).
A traditional search engine follows:
Query → Retrieve documents → Rank results
A generative engine follows instead:
Query → Retrieve knowledge + Model inference → Generate synthesized answer
The difference is crucial. From a mathematical lens, a generative engine is not computing:
argmax_d P(relevant∣q,d)
but rather:
argmax_a P(answer=a∣q,K)
where K is not just the index of documents, but also the latent knowledge encoded in the model’s parameters (Lewis et al., 2020, Retrieval-Augmented Generation).
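The two pipelines can be contrasted in a short sketch. Everything here is hypothetical: the passages, the word-overlap retriever, and especially stub_generate, which is only a placeholder for a language model call and simply stitches retrieved text together. It illustrates the difference in what each engine returns, not how any real product is built.

```python
# Hypothetical knowledge source: document id -> passage.
PASSAGES = {
    "doc-crawl": "Crawlers discover pages by following links.",
    "doc-index": "An inverted index maps words to pages.",
    "doc-rank": "Ranking orders pages by relevance and authority.",
}

def retrieve(query, passages, k=2):
    """Return the ids of the k passages sharing the most words with the query."""
    query_words = set(query.lower().split())

    def overlap(doc_id):
        words = set(passages[doc_id].lower().replace(".", "").split())
        return len(query_words & words)

    return sorted(passages, key=overlap, reverse=True)[:k]

def retrieval_engine(query):
    # Query -> retrieve documents -> rank results: the user gets a list of documents.
    return retrieve(query, PASSAGES)

def generative_engine(query, generate):
    # Query -> retrieve knowledge + model inference -> one synthesized answer.
    context = [PASSAGES[doc_id] for doc_id in retrieve(query, PASSAGES)]
    return generate(query, context)

def stub_generate(query, context):
    # Placeholder for a language model; a real one would also draw on its parameters.
    return " ".join(context)

print(retrieval_engine("how do crawlers discover pages"))
print(generative_engine("how do crawlers discover pages", stub_generate))
```

The first call returns document ids for the user to open; the second returns a single piece of text in which the source documents are no longer visible unless the engine chooses to cite them.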
Generative Engine Optimization (GEO) — also known as Large Language Model Optimization (LLMO) — is the process of optimizing content to increase its visibility and relevance within AI-generated responses from tools like ChatGPT, Gemini, or Perplexity.
Unlike traditional SEO, which targets search engine rankings, GEO focuses on how large language models interpret, prioritize, and present information to users in conversational outputs. The goal is to influence how and when content appears in AI-driven answers.