How Search Engines Work: From Crawling to Ranking


When you type a question into a search bar, what happens next feels like magic. Billions of pages exist on the web, yet somehow the best ones appear instantly in front of you. To understand this, let’s step into the world of crawling, indexing, and ranking—not as dry steps, but as a story of how machines turn the chaos of the internet into order.
Think of the internet as an infinite graph:
G = (V, E)
where V represents web pages (nodes) and E represents links (edges).
Search engines deploy “crawlers,” which move from node to node by following edges. Just like a random walker in graph theory, a crawler has a probability distribution of visiting a new page:
P(v∣u)=1/outdeg(u)
meaning the chance of visiting page v from page u depends on how many outgoing links u has.
In practice, crawlers don’t move randomly—they prioritize freshness, authority, and importance of pages. But at its heart, crawling is the process of discovering nodes in this massive web graph.
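To make the discovery step concrete, here is a minimal sketch of a breadth-first crawler over a toy link graph. The `LINKS` dictionary and the page names are illustrative assumptions, not real crawl data.

```python
from collections import deque

# Toy web graph: page -> pages it links to (illustrative assumption)
LINKS = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com", "d.com"],
    "d.com": [],
}

def crawl(seed):
    """Breadth-first discovery of every page reachable from the seed."""
    frontier = deque([seed])   # pages scheduled for fetching
    discovered = {seed}        # pages already seen, so nothing is crawled twice
    order = []                 # the order in which pages are visited
    while frontier:
        page = frontier.popleft()
        order.append(page)
        for link in LINKS.get(page, []):   # "fetch" the page and extract its links
            if link not in discovered:
                discovered.add(link)
                frontier.append(link)
    return order

print(crawl("a.com"))  # ['a.com', 'b.com', 'c.com', 'd.com']
```

A production crawler would swap the simple FIFO queue for a priority queue keyed by the freshness and authority signals mentioned above.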
Once discovered, each page must be stored and organized. Indexing transforms raw HTML into structured information. Imagine the index as a function:
I: word → {pages containing word}
For example:
I(“AI”) = {p1, p2, p3, ...}
This is called an inverted index: instead of mapping pages to words, it maps words to pages.
Just like a library catalog, it allows instant retrieval. When you search “how engines work,” the system intersects these sets:
I(“how”) ∩ I(“engines”) ∩ I(“work”)
and retrieves only the documents containing all three.
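Here is a minimal sketch of how such an index can be built and queried, using a made-up three-page collection:

```python
from collections import defaultdict

# Tiny document collection (illustrative assumption)
docs = {
    "p1": "how search engines work",
    "p2": "how jet engines work",
    "p3": "search engines and ranking",
}

# Build the inverted index: word -> set of pages containing it
index = defaultdict(set)
for page, text in docs.items():
    for word in text.split():
        index[word].add(page)

def search(query):
    """Return the pages containing every query word (conjunctive retrieval)."""
    postings = [index[word] for word in query.split()]
    return set.intersection(*postings) if postings else set()

print(search("how engines work"))  # {'p1', 'p2'}
```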
Now comes the art and math of ranking. Not all pages are equal. Some are more authoritative, others are more relevant. Ranking blends multiple signals into a scoring function:
S(d, q) = w1·R(d, q) + w2·A(d) + w3·U(d)
where:
- R(d, q) = relevance of document d to query q
- A(d) = authority of the document (based on backlinks, trust, etc.)
- U(d) = user experience signals (speed, mobile-friendliness, engagement)
- w1, w2, w3 = weights tuned by the search engine

One famous early formula is PageRank. In simplified form, the importance of a page P is given by:
PR(P) = (1 − d)/N + d · ∑_{Q ∈ M(P)} PR(Q) / L(Q)
where:
- d = damping factor (usually ~0.85)
- N = total number of pages
- M(P) = set of pages linking to P
- L(Q) = number of outbound links from page Q

This elegant formula captures the intuition that a page is important if other important pages link to it.
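Below is a short power-iteration sketch of that formula over a toy link graph; the graph and the fixed iteration count are illustrative assumptions, and dangling pages with no outgoing links would need extra handling in a real system.

```python
# Toy link graph: page -> pages it links to (illustrative assumption)
LINKS = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com", "d.com"],
    "d.com": ["a.com"],
}

def pagerank(links, d=0.85, iterations=50):
    """Power iteration for PR(P) = (1 - d)/N + d * sum(PR(Q)/L(Q) for Q in M(P))."""
    n = len(links)
    pr = {page: 1.0 / n for page in links}           # start from a uniform distribution
    for _ in range(iterations):
        new_pr = {page: (1 - d) / n for page in links}
        for q, outgoing in links.items():
            share = d * pr[q] / len(outgoing)        # Q passes its rank evenly along its L(Q) links
            for p in outgoing:
                new_pr[p] += share
        pr = new_pr
    return pr

for page, score in sorted(pagerank(LINKS).items(), key=lambda x: -x[1]):
    print(f"{page}: {score:.3f}")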
Early search engines relied heavily on link-based authority measures like PageRank, a mathematical model developed by Google’s founders in the late 1990s. Over time, these signals were enriched with more complex models that incorporated user behavior, semantic meaning, and real-time personalization.
Modern search engines use machine learning to estimate the probability that a document is relevant to a query:
P(relevant∣q,d)=fθ(q,d)
where fθ is a neural scoring function (often powered by deep learning models such as transformers). Unlike static formulas, these models learn from billions of past queries, clicks, and satisfaction signals.
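As a rough illustration of what a scoring function fθ(q, d) does, the sketch below ranks documents by the similarity of query and document vectors. The `embed` function is only a bag-of-words stand-in; an actual system would use a learned neural encoder trained on click and satisfaction data.

```python
import math
from collections import Counter

def embed(text):
    """Placeholder embedding: a normalized bag-of-words vector.
    A real f_theta would use a learned encoder (e.g. a transformer)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}

def score(query, document):
    """Dot product of query and document vectors, standing in for f_theta(q, d)."""
    q_vec, d_vec = embed(query), embed(document)
    return sum(weight * d_vec.get(word, 0.0) for word, weight in q_vec.items())

docs = [
    "How jet engines work: a mechanical overview",
    "How search engines crawl, index, and rank web pages",
]
query = "how do search engines rank pages"
for doc in sorted(docs, key=lambda d: -score(query, d)):
    print(f"{score(query, doc):.2f}  {doc}")
```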
In practice, this means the search engine no longer just asks: “Does the page contain the right words?” Instead, it asks: “Does this page truly answer what the user means?”
We are now entering a new era: the shift from retrieval engines (like Google Search, Bing) to generative engines (like ChatGPT, Perplexity, or Google’s Gemini-powered Search).
A traditional search engine follows:
Query → Retrieve documents → Rank results
A generative engine follows instead:
Query → Retrieve knowledge + Model inference → Generate synthesized answer
The difference is crucial:
From a mathematical lens, a generative engine is not computing:
argmax_d P(relevant∣q,d)
but rather:
argmax_a P(answer=a∣q,K)
where K is not just the index of documents, but also the latent knowledge encoded in the model’s parameters (Lewis et al., 2020, Retrieval-Augmented Generation).
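Here is a hedged sketch of that retrieve-then-generate flow, with a toy word-overlap retriever and a placeholder `generate` function standing in for the language model; the knowledge base and both helpers are assumptions, not any vendor's API.

```python
# Hypothetical knowledge base: id -> passage (illustrative assumption)
KNOWLEDGE_BASE = {
    "doc1": "Crawlers discover new pages by following links between pages.",
    "doc2": "An inverted index maps each word to the pages that contain it.",
    "doc3": "PageRank scores a page by the importance of the pages linking to it.",
}

def retrieve(query, k=2):
    """Toy retriever: rank passages by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE.values(),
        key=lambda passage: -len(q_words & set(passage.lower().split())),
    )
    return ranked[:k]

def generate(query, context):
    """Placeholder for the LLM call that produces argmax_a P(answer = a | q, K).
    A real system would send this prompt to a language model and return its answer."""
    prompt = "Answer the question using only the context below.\n"
    prompt += "\n".join(f"- {passage}" for passage in context)
    prompt += f"\nQuestion: {query}\nAnswer:"
    return prompt  # the model's synthesized answer would replace this

question = "How does a search engine find new pages?"
print(generate(question, retrieve(question)))
```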
Generative Engine Optimization (GEO) — also known as Large Language Model Optimization (LLMO) — is the process of optimizing content to increase its visibility and relevance within AI-generated responses from tools like ChatGPT, Gemini, or Perplexity.
Unlike traditional SEO, which targets search engine rankings, GEO focuses on how large language models interpret, prioritize, and present information to users in conversational outputs. The goal is to influence how and when content appears in AI-driven answers.
Agentic RAG represents a new paradigm in Retrieval-Augmented Generation (RAG).
While traditional RAG retrieves information to improve the accuracy of model outputs, Agentic RAG goes a step further by integrating autonomous agents that can plan, reason, and act across multi-step workflows.
This approach allows systems to plan their own retrieval steps, reason over intermediate results, and act across multi-step workflows.
In other words, Agentic RAG doesn’t just provide better answers: it strategically manages the retrieval process to support more accurate, efficient, and explainable decision-making.
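As a purely illustrative sketch (reusing the toy `retrieve` and `generate` helpers from the previous snippet), an agentic loop might check on each step whether the evidence gathered so far covers the question, and plan a follow-up retrieval if not:

```python
# retrieve() and generate() are the toy helpers defined in the RAG sketch above.

def missing_terms(query, context):
    """Reason: which query words are not yet covered by the retrieved passages?"""
    covered = set(" ".join(context).lower().split())
    return [word for word in query.lower().split() if word not in covered]

def agentic_answer(query, max_steps=3):
    """Plan-act-check loop: keep retrieving until the query is covered or the step budget runs out."""
    context, current_query = [], query
    for _ in range(max_steps):
        context += retrieve(current_query)      # act: gather more evidence
        gaps = missing_terms(query, context)    # check: what is still unanswered?
        if not gaps:
            break
        current_query = " ".join(gaps)          # plan: target the uncovered parts next
    return generate(query, context)

print(agentic_answer("How does a search engine find and rank new pages?"))
```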
RAG (Retrieval-Augmented Generation) is a cutting-edge AI technique that enhances traditional language models by integrating an external search or knowledge retrieval system. Instead of relying solely on pre-trained data, a RAG-enabled model can search a database or knowledge source in real time and use the results to generate more accurate, contextually relevant answers.
For GEO, this is a game changer.
A RAG-enabled generative engine doesn’t just respond with generic language: it retrieves fresh, relevant insights from your company’s knowledge base, documents, or external web content before generating its reply.
By combining the strengths of generation and retrieval, RAG ensures GEO doesn't just sound smart—it is smart, aligned with your source of truth.
GEO (Generative Engine Optimization) is not a rebrand of SEO—it’s a response to an entirely new environment. SEO optimizes for bots that crawl, index, and rank. GEO optimizes for large language models (LLMs) that read, learn, and generate human-like answers.
While SEO is built around keywords and backlinks, GEO is about semantic clarity, contextual authority, and conversational structuring. You're not trying to please an algorithm—you’re helping an AI understand and echo your ideas accurately in its responses. It's not just about being found—it's about being spoken for.
Generative Engine Optimization (GEO) and Answer Engine Optimization (AEO) are closely related strategies, but they serve different purposes in how content is discovered and used by AI technologies.
Structured signals (such as an llms.txt file) help guide how AI systems interpret and prioritize your content.
In short:
AEO helps you be the answer in AI search results. GEO helps you be the source that generative AI platforms trust and cite.
Together, these strategies are essential for maximizing visibility in an AI-first search landscape.