Most articles about AI search tell you to “add structured data” and “write authoritative content” without ever explaining what’s happening on the other side of that advice. What does ChatGPT actually do with your page once it fetches it? Why does the same content get cited by Perplexity and ignored by Gemini? Why can a page rank #1 on Google and still be invisible to half of the AI engines that exist?

The answer lives in architecture, not marketing theory. Different AI systems use genuinely different pipelines to find, read, rank, and reassemble web content into an answer, and it helps to understand those differences before you evaluate generative engine optimisation services. This piece goes under the hood of that machinery – crawling, chunking, embeddings, reranking, and the specific quirks of how ChatGPT, Perplexity, Google AI Mode, Gemini, and Claude each handle it. If you understand the mechanics in this article, most “AI SEO” advice will start to look either obviously correct or obviously wrong.

1. Before retrieval: can the model even see your page?

This is the step almost every GEO guide skips, and it’s the one that silently breaks the most websites.

When an AI crawler visits a page, it doesn’t open a browser. It sends an HTTP request and reads whatever HTML comes back in that first response – nothing more. Independent analysis of hundreds of millions of crawler fetches has confirmed that none of the major AI crawlers – GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, and PerplexityBot – execute JavaScript by default. They download script files (Anthropic’s crawler fetches them at a notably high rate) but read them as plain text, not as code to run.

That has a blunt, practical consequence. If your page is a React, Vue, or Angular single-page application that builds its content client-side – the pattern where the initial HTML is little more than <div id="root"></div> and a script tag – every AI crawler that doesn’t render JavaScript sees an empty shell. Not thin content. No content. Your product descriptions, pricing tables, and FAQ answers can rank #1 on Google and simultaneously not exist for ChatGPT, Claude, or Perplexity.

There’s one consistent exception: Google’s own AI surfaces (Gemini, AI Overviews) inherit Googlebot’s rendering infrastructure, which has executed JavaScript for years. This creates a strange asymmetry – a JavaScript-heavy site can perform fine in Google’s AI Overviews while being structurally invisible everywhere else.

A second, less binary point: not all AI crawlers behave identically, and reports disagree at the margins. Some 2026 crawler analyses describe ClaudeBot as JavaScript-blind like the rest; others note that Anthropic’s crawlers fetch a disproportionate share of image assets, which hints at evolving multimodal crawling behaviour. Treat any single claim about a specific bot’s exact current capabilities as a snapshot, not a permanent spec – these crawlers are updated continuously and quietly.

The practical test comes before anything else in this article: run curl -A "Mozilla/5.0" https://yourpage.com from a terminal, or simply view your page source (not the rendered DOM) in a browser. If your actual text content appears in that raw response, you’ve cleared the first gate. If you see an empty container and a script reference, no amount of GEO copywriting will help – the content has to exist in server-rendered HTML before any of the steps below can begin.
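The same check can be scripted. What follows is a minimal sketch, not a description of how any real crawler parses pages: it assumes you already hold the raw HTML as a string (for example, the output of the curl command), and the helper names `visible_text` and `phrase_in_raw_html` and both sample pages are invented for illustration. It strips script and style blocks and asks whether a phrase you care about survives in what is left.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def visible_text(raw_html: str) -> str:
    """Return the text present in raw HTML without executing any JavaScript."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    return " ".join(parser.parts)


def phrase_in_raw_html(raw_html: str, phrase: str) -> bool:
    """True if `phrase` appears in the un-rendered HTML's text content."""
    return phrase.lower() in visible_text(raw_html).lower()


# Hypothetical sample pages: a client-side shell and a server-rendered page.
spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
ssr_page = "<html><body><h1>Pricing</h1><p>Plans start at $10 per month.</p></body></html>"

print(phrase_in_raw_html(spa_shell, "Plans start at"))  # False
print(phrase_in_raw_html(ssr_page, "Plans start at"))   # True
```

Running this against a handful of your most important phrases per template (product name, price, a key FAQ answer) is usually enough to tell whether a page type is server-rendered or an empty shell.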

2. The architecture decision that splits every platform in two: does it retrieve at all?

The single biggest fork in how AI engines work is whether they search the live web for every query or answer from what they already learnt during training.

Closed-book by default – ChatGPT’s base mode and Claude’s base mode

In their default conversational state, both generate answers entirely from parametric knowledge: statistical patterns absorbed during training, frozen at a knowledge cutoff date. No URLs, no live lookup, no retrieval step. Browsing or search activates only when the system detects a need for current information or the user explicitly invokes it.

Open-book by design – Perplexity

Every single Perplexity query triggers a live retrieval-augmented generation (RAG) pipeline, with no exceptions and no “memory-only” fallback. This is the core product decision that defines Perplexity: it would rather be slower and always sourced than fast and unsourced.

Hybrid and increasingly retrieval-light – Gemini

Google’s models are unusual in that their very long context windows (up to roughly a million tokens) let them sometimes skip the RAG pipeline altogether for tasks where the relevant document can simply be pasted whole into context. Internal comparisons have found this can be both faster and more accurate than a traditional chunk-and-retrieve pipeline for shorter documents – though it doesn’t eliminate retrieval for open-web questions, where Google AI Mode’s separate fan-out-and-search architecture still applies.

This distinction matters enormously for content strategy. If a platform mostly answers from training data, your goal is to get into a future training run – which means a broad, citable, widely syndicated presence (Wikipedia, press, reference sites) long before any specific query happens. If a platform retrieves live every time, your goal is to be retrievable right now – fresh, well-indexed, and structurally clean at the moment someone asks.

3. Inside the retrieval step: how a page actually gets found

For any platform running live retrieval, the same general sequence happens, even though the specific implementation differs by vendor.

Step one: query parsing and fan-out

Modern AI search rarely searches for your exact words. Google’s AI Mode uses a dedicated Gemini model to decompose a single prompt into multiple parallel sub-queries – covering different facets, synonyms, and even questions the user hasn’t consciously asked yet. Google has described this as a deliberate architecture decision, not an incidental side effect: a query like “best CRM for small business” might fan out into sub-queries on pricing, integrations, user reviews, and comparison terms, each searched independently before being synthesised into one answer. ChatGPT’s search mode does something structurally similar when it browses, often appending commercial and temporal modifiers (“best,” “2026,” “comparison”) to its sub-queries even when the user didn’t ask for them.

The implication: a page optimised only for its primary keyword is optimised for a query type that increasingly doesn’t get asked literally. Content that thoroughly covers the adjacent sub-questions a fan-out process would generate has a structural advantage that simple keyword matching never required before.
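To make the fan-out idea concrete, here is a toy sketch. It is an assumption-heavy stand-in: real engines use a language model to decompose the prompt, whereas this version just appends a fixed list of facets, and `fan_out`, `search_all`, `toy_search`, and the `TOY_INDEX` pages are all hypothetical names made up for the example.

```python
from typing import Callable, Iterable


def fan_out(query: str, facets: Iterable[str]) -> list[str]:
    """Build one sub-query per facet; the original query is kept as the first entry."""
    return [query] + [f"{query} {facet}" for facet in facets]


def search_all(sub_queries: list[str], search: Callable[[str], list[str]]) -> list[str]:
    """Run every sub-query independently and merge results, keeping first-seen order."""
    merged: list[str] = []
    seen: set[str] = set()
    for sub_query in sub_queries:
        for url in search(sub_query):
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Hypothetical stand-in index: maps a keyword to the pages that cover it.
TOY_INDEX = {
    "pricing": ["example.com/crm-pricing"],
    "integrations": ["example.com/crm-integrations"],
    "reviews": ["example.org/crm-reviews", "example.com/crm-pricing"],
}


def toy_search(sub_query: str) -> list[str]:
    """Return pages whose keyword appears in the sub-query (toy lookup, not a real engine)."""
    hits: list[str] = []
    for keyword, urls in TOY_INDEX.items():
        if keyword in sub_query:
            hits.extend(urls)
    return hits


sub_queries = fan_out("best CRM for small business", ["pricing", "integrations", "reviews"])
print(sub_queries)
print(search_all(sub_queries, toy_search))
# ['example.com/crm-pricing', 'example.com/crm-integrations', 'example.org/crm-reviews']
```

Notice that the literal head query returns nothing in this toy index. Every page that ends up in the candidate pool got there by answering one of the adjacent sub-questions.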

Step two: candidate retrieval

Once sub-queries exist, the system needs candidates fast, across what is often a very large index. The dominant technique is hybrid retrieval – combining classic keyword-based search (BM25, which scores exact term overlap) with dense embedding search (which converts both the query and every page chunk into a numerical vector and measures semantic similarity, regardless of exact wording). Hybrid search exists because each method has a blind spot the other covers: pure keyword search misses paraphrases and synonyms; pure embedding search can miss precise, rare technical terms and exact match facts.
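A small sketch shows why the two methods are combined. Several things here are assumptions rather than anything a vendor has documented: the BM25 parameters are common defaults, the three-dimensional "embeddings" are hand-written toy vectors rather than output from a real model, and the two rankings are merged with reciprocal rank fusion, which is one widely used fusion method among several.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with the BM25 formula (exact term overlap)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for other in tokenized if term in other)
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
            tf = counts[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avg_len))
        scores.append(score)
    return scores


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def ranks(scores: list[float]) -> dict[int, int]:
    """Map document index to its 1-based rank, best score first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return {doc_id: position + 1 for position, doc_id in enumerate(order)}


def hybrid_rank(keyword: list[float], dense: list[float], k: int = 60) -> list[int]:
    """Fuse two score lists with reciprocal rank fusion; returns doc indices, best first."""
    keyword_ranks, dense_ranks = ranks(keyword), ranks(dense)
    fused = {i: 1 / (k + keyword_ranks[i]) + 1 / (k + dense_ranks[i]) for i in keyword_ranks}
    return sorted(fused, key=fused.get, reverse=True)


docs = [
    "affordable customer management software for small teams",  # paraphrase, no shared terms
    "crm error e1042 troubleshooting steps",                     # shares the exact term "crm"
    "a short history of the fax machine",                        # unrelated
]
query = "cheap crm"

# Toy embeddings (assumed, not model output): axis 0 ~ "buying a CRM", 1 ~ "fixing a CRM", 2 ~ other.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = [[0.8, 0.2, 0.1], [0.4, 0.8, 0.1], [0.0, 0.1, 0.9]]

keyword_scores = bm25_scores(query, docs)
dense_scores = [cosine(query_vec, vec) for vec in doc_vecs]
fused_order = hybrid_rank(keyword_scores, dense_scores)
print(keyword_scores)  # document 0 scores 0.0 on keywords alone
print(fused_order)     # the unrelated document 2 comes last
```

Keyword search alone gives the paraphrased page a score of zero, and embedding search alone ranks the exact-term page second. After fusion, both stay ahead of the unrelated page.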

Step three: reranking

Initial retrieval is built for recall (don’t miss anything plausible), not precision (rank the best candidate first). It typically casts a wide net – Perplexity’s pipeline visits roughly ten candidate pages per query – and then a second, slower, much more accurate model called a cross-encoder reranker rescores each query-document pair, pushing the genuinely best matches to the top before they reach the generation step. This is a meaningfully different and more computationally expensive process than first-pass retrieval, which is exactly why it runs only on a small shortlist rather than the whole index.
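The two-stage shape is easy to sketch, even though the expensive part cannot be reproduced here. In the snippet below, `toy_pair_scorer` is a deliberately crude stand-in for a cross-encoder (it only measures word overlap, whereas a real cross-encoder reads the query and document together through a neural model). The function names and sample documents are assumptions made for illustration.

```python
from typing import Callable


def rerank(
    query: str,
    shortlist: list[str],
    score_pair: Callable[[str, str], float],
    top_k: int = 3,
) -> list[str]:
    """Rescore every (query, document) pair in the shortlist and keep the best top_k."""
    scored = [(score_pair(query, doc), doc) for doc in shortlist]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def toy_pair_scorer(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: the share of query words found in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)


# Shortlist in the order a recall-oriented first pass might return it (assumed).
shortlist = [
    "crm pricing overview",
    "small business crm pricing compared",
    "what is a crm",
]

best = rerank("small business crm pricing", shortlist, toy_pair_scorer, top_k=2)
print(best)
# ['small business crm pricing compared', 'crm pricing overview']
```

The pairwise scorer is called once per shortlisted document. That cost is fine for ten candidates and unaffordable for a whole index.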

4. Why your page gets cut into pieces: chunking, and why it matters more than word count

Almost no AI retrieval system reads your entire page as one block. Pages are split into chunks – typically somewhere in the range of 50 to 600 tokens (very roughly 35–450 words) – and each chunk gets its own embedding and is retrieved independently of the rest of the page.

This has a non-obvious consequence that most content advice ignores: your most important sentence can be retrieved completely divorced from the paragraph around it. If a chunk boundary falls in the middle of your key claim, or if your statistic and its supporting context land in two different chunks, the retrieval system may surface one without the other – or surface neither cleanly. Chunks that are too short lose context; chunks that are too long dilute the embedding’s specificity and increase the odds that the genuinely relevant sentence gets buried inside noise the embedding model has to average over.

The practical fix content teams rarely implement: write in self-contained units. Each H2/H3 section, each FAQ answer, and each key paragraph should make complete sense if it were extracted in total isolation – because, functionally, it might be. A claim that depends on three preceding paragraphs of setup is a claim that’s much harder for a chunk-based retrieval system to surface correctly.
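A short sketch shows how a chunk boundary can cut a claim in half. It makes simplifying assumptions: words stand in for tokens, the sample text is made up, and the two functions (`fixed_window_chunks`, `paragraph_chunks`) are illustrative rather than any vendor's actual chunker.

```python
def fixed_window_chunks(text: str, size: int) -> list[str]:
    """Split into chunks of `size` words, ignoring sentence and paragraph boundaries."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def paragraph_chunks(text: str, max_words: int) -> list[str]:
    """Pack whole paragraphs into chunks, never splitting inside a paragraph.

    A paragraph longer than max_words becomes its own oversized chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for paragraph in (p.strip() for p in text.split("\n\n") if p.strip()):
        n_words = len(paragraph.split())
        if current and count + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Made-up sample page: two short, self-contained paragraphs.
page = (
    "Our CRM starts at $10 per user per month.\n\n"
    "Churn fell 18% after customers enabled automated follow-ups, based on 2025 account data."
)

for chunk in fixed_window_chunks(page, 12):
    print(repr(chunk))
# 'Our CRM starts at $10 per user per month. Churn fell 18%'
# 'after customers enabled automated follow-ups, based on 2025 account data.'

for chunk in paragraph_chunks(page, 12):
    print(repr(chunk))
# 'Our CRM starts at $10 per user per month.'
# 'Churn fell 18% after customers enabled automated follow-ups, based on 2025 account data.'
```

The fixed window attaches the statistic to the pricing sentence and strands its explanation in the next chunk. You cannot choose the chunker an engine uses, but paragraphs that stand alone survive either strategy better than claims that span several paragraphs.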

5. “Lost in the middle”: the bias that explains why position on the page is not neutral

Here’s a finding from peer-reviewed NLP research that almost never makes it into marketing-facing GEO content, despite being directly actionable.

A widely replicated study (Liu et al., published in 2024 in Transactions of the Association for Computational Linguistics) tested how accurately large language models – including GPT-3.5, GPT-4, and Claude 1.3 – could use a specific piece of information depending on where in the input context it was placed. The result was a strikingly consistent U-shaped curve: accuracy is highest when the relevant information sits at the very beginning or very end of the context window and drops sharply, by more than 20% in the worst case the paper reports for GPT-3.5-Turbo, when the same information is placed in the middle.

This pattern has since been replicated across additional model families and is believed to stem from the underlying transformer architecture itself – specifically, how positional encoding schemes like RoPE (Rotary Position Embedding) cause attention weight to decay for tokens that are “distant” in the sequence and how softmax normalisation in attention layers amplifies that decay by concentrating focus on whichever tokens already score highest.

What this means for content placed inside a multi-document AI answer: if your page is one of several sources an AI assembles into a single context window before generating a response, where your content lands in that assembled stack is not random and not neutral. Some production RAG systems now deliberately reorder retrieved chunks – placing the highest-confidence match first, the second-highest last, and lower-confidence material in between – specifically to work around this bias rather than fight it.
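The reordering described above fits in a few lines. This is a minimal sketch of one way to do it, assuming the chunks arrive already sorted best-first; the function name `reorder_for_position_bias` is hypothetical, and production systems may use different variants.

```python
def reorder_for_position_bias(ranked_chunks: list[str]) -> list[str]:
    """Place strong chunks at both ends of the context and weak ones in the middle.

    `ranked_chunks` must be sorted best-first. The 1st, 3rd, 5th... ranked chunks
    fill the front; the 2nd, 4th, 6th... ranked chunks fill the back in reverse order.
    """
    front = ranked_chunks[0::2]
    back = ranked_chunks[1::2]
    return front + back[::-1]


ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_position_bias(ranked))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

The best match lands first, the second-best lands last, and the weakest material sits in the middle, where the U-shaped curve says it costs the least.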

You can’t control where a generative engine places your content relative to a competitor’s. But you can control the structure of your own page so that the single most citable, fact-dense sentence in any given section sits at the start or end of that section – not buried in paragraph three of five.

6. Platform-by-platform: how the major engines actually differ

This is the part where “AI search” stops being one category. Below is a comparison built from independent crawler analyses, citation-pattern studies, and platform documentation through mid-2026. Treat exact percentages as directional – they shift release to release – but the structural differences are stable and meaningful.

	ChatGPT (Search/browsing mode)	Perplexity	Google AI Overviews / AI Mode	Gemini	Claude (web search)
Default behavior	Closed-book (training data); browses only when triggered	Always-on live retrieval for every query	Built directly on Google’s existing organic index	Long-context-first; retrieval for live/open-web queries	Closed-book by default; live search available as a tool
Primary index source	Bing	Own continuously-updated web index (50B+ pages claimed) + live crawl	Google’s own organic search index	Google’s index (for search-grounded queries)	Independent web search, not tied to one underlying index
JavaScript rendering	No (raw HTML only)	No (raw HTML only)	Yes (inherits Googlebot’s rendering service)	Yes, via Google infrastructure	Reported as ‘no’ by most analyses; some note partial/evolving behaviour
Query fan-out	Yes – 5-15 sub-queries per prompt, often with added commercial/temporal modifiers	Less formalized fan-out; relies more on dense real-time retrieval per query	Yes – explicit, Google-confirmed mechanism; can issue many sub-queries for complex prompts	Used for search-grounded responses, less so for long-context tasks	Less publicly documented; appears query-specific
Typical citations per response	3-6	15–20+ (most citation-dense major platform)	Multiple, drawn from existing SERP-style ranking	Varies by mode	Typically a handful; conservative citation behaviour
Freshness weighting	Moderate – depends on Bing’s index freshness	High – explicitly favors very recently published/updated content	High for live AI Mode queries	High for search-grounded queries	Moderate
Dominant cited source type	Encyclopedic / high-authority (Wikipedia heavily represented)	Community and discussion content (Reddit consistently prominent)	Distributed across existing top-ranking organic sources	Distributed, Google-index-dependent	Favors precise, technically detailed sources; comparatively conservative about citing
Overlap with Google’s organic top 10	Low in several 2026 analyses (roughly 7-12% in some studies)	Low – operates largely independently of Google rankings	High by definition – built on the same index	High where Google-index-grounded	Low – independent retrieval logic
Practical implication for content	Build encyclopedic/editorial third-party presence; ensure content sits in first 200-500 words	Publish and refresh frequently; build community/forum visibility	Classical SEO + clean, extractable passage structure	Comprehensive, complete documents reward long-context comprehension	Precision and technical accuracy reward citation; avoid unverifiable claims

A few things worth pulling out of that table explicitly, because they contradict a lot of conventional advice:

Ranking #1 on Google does not transfer automatically to most other platforms. Several independent 2026 analyses converge on the same uncomfortable number: only somewhere around 7-12% of ChatGPT’s cited sources also appear in Google’s organic top 10. Google AI Overviews is the one major exception, precisely because it’s built on the same underlying index rather than a separate retrieval system.
Freshness is not a universal signal with equal weight everywhere. It matters enormously for Perplexity and for live, search-grounded Google AI Mode queries. It matters much less for a model answering purely from frozen training data, where no amount of republishing today changes anything until the next training run.
“Authority” means different things on different platforms. ChatGPT’s apparent preference for encyclopaedic sources rewards a different kind of brand-building (notability, structured reference presence) than Perplexity’s apparent preference for community discussion (which rewards genuine engagement in forums and review platforms over polished owned-media content).

7. From retrieved chunks to a written answer: what synthesis actually does to your content

Once the system has its reranked, chunked, context-assembled evidence, the final step is generation: an LLM writes an answer constrained – in principle – to what was retrieved.

This is where “visibility” stops meaning anything like a search ranking and starts meaning something closer to share of synthesis. Your content might contribute one clause to a four-sentence answer that blends three sources. It might be the dominant voice in the response. It might be retrieved but ultimately not surface in the generated text at all, because the model judged a different chunk to be a better fit for the specific phrasing of the question. There is no simple position-1-through-10 ranking happening at this stage; there’s a probabilistic blending process, and the same query asked twice can produce two different blends.

This is also the stage where keyword-matching tactics from classical SEO actively misfire. A model performing synthesis isn’t scanning for exact phrase repetition – it’s already operating on semantic representations of meaning extracted during retrieval and reranking. Content engineered to repeat an exact-match phrase reads, to the underlying model, as exactly what it is: repetitive, low-information text that doesn’t add new evidence to synthesise. Peer-reviewed testing of this exact effect found that keyword stuffing measurably decreased a source’s visibility in generated answers relative to an unmodified baseline – the opposite of its intended effect and a direct casualty of treating a synthesis engine like a ranking algorithm.

8. What this actually means for how you write

Pulling the mechanics together into something usable:

Make sure the content exists before optimising it

Server-side render anything you want AI crawlers to see. This is upstream of every other tactic in this list and is the single most common reason technically excellent content remains invisible to half the AI ecosystem.

Write in self-contained, chunk-sized units

Each section should be a complete thought that survives being extracted alone – because functionally, it often is extracted alone.

Front-load and back-load your strongest evidence within each section

Given the lost-in-the-middle pattern, the first and last sentences of any block of text are disproportionately likely to be retained and surfaced. Don’t bury your best statistic in sentence four of six.

Treat platforms as separate channels, not one undifferentiated “AI search” category

A page built to win in Perplexity (fresh, conversational, community-corroborated) is not structurally identical to a page built to win in Google AI Overviews (clean extractable passages layered on classical ranking signals) or ChatGPT (encyclopaedic, broadly corroborated, front-loaded answers).

Stop writing for keyword density and start writing for retrievability

Specific statistics, named sources, direct quotes, and complete factual sentences are what both the retrieval layer and the synthesis layer are actually built to recognise and reward. Repetition of a target phrase is not a retrieval signal in a vector space – it’s noise.

“How does the AI choose what to cite” doesn’t have one answer, because there isn’t one AI architecture behind the question. There’s a closed-book model and an always-on retrieval engine. There’s one crawling stack that renders JavaScript and several that don’t. There’s a chunk-reordering step that some pipelines add specifically to work around a U-shaped position bias rooted in the transformer architecture itself. None of this is visible from the outside – which is exactly why so much published advice about “ranking in ChatGPT” sounds like search-engine folklore dressed in new vocabulary.

Understanding the actual pipeline – crawl, parse, retrieve, rerank, chunk, assemble, synthesise – doesn’t just make the tactics make more sense. It tells you which tactics are placebos and which ones are physics.


How Different AI Models Actually Read the Web: Parsing, Retrieval, and Synthesis Explained