OAI-SearchBot vs. PerplexityBot vs. Common Crawl. A technical breakdown of how AI crawlers index your site.

Three crawlers, three indexing logics, one robots.txt. Here is the side-by-side breakdown of crawl frequency, scope, JavaScript handling, and IP verification, plus the unified configuration that keeps your site discoverable to real-time citation engines without leaking content to training corpora.


TL;DR

  • 01 · None of the three executes JavaScript. Client-rendered React, Vue, and Angular pages are invisible to OAI-SearchBot, PerplexityBot, and CCBot. Ship SSR, SSG, or ISR.

  • 02 · Crawl logic differs by mission. OAI-SearchBot indexes for citations, PerplexityBot for RAG, CCBot for archival training. Same robots.txt User-agent slot, three different scopes.

  • 03 · Robots.txt blocks the bot, not the URL. A page disallowed in robots.txt can still appear as a bare title in ChatGPT Search. Use a noindex meta tag on the page itself.

  • 04 · Verify before you trust the User-Agent. Spoofed AI crawlers are common. Validate with FCrDNS or the published JSON CIDR feeds.

    0 · Crawlers that execute JS

    24 hrs · Robots.txt propagation

    3 · Verification protocols

    4 CIDRs · OpenAI search ranges

The shape of the modern crawl landscape

For most of the last twenty years, technical SEO had one rendering target. Googlebot crawled your pages, queued them through the Web Rendering Service, executed JavaScript inside a headless Chromium fleet, and emitted a populated DOM that the indexer could read. The cost of that pipeline was Google's problem.

That model does not generalise to the new wave. OAI-SearchBot, PerplexityBot, and CCBot are high-velocity HTML parsers. They fetch raw markup as quickly as the network allows, extract structured text, and move on. None of them runs your bundle.js. None of them waits for hydration. If your content depends on client-side execution to appear, it does not exist as far as these three are concerned.

The second shift is functional. The three crawlers are not variants of the same indexer with different traffic budgets. They have distinct missions, distinct triggers, and distinct downstream consumers. OAI-SearchBot feeds ChatGPT Search and the Atlas browser. PerplexityBot populates Perplexity's retrieval index. CCBot builds the multi-petabyte archive that trains a long list of open-source models. The same User-Agent slot in your robots.txt does three different things.

"The same User-Agent slot in your robots.txt does three different things. If you treat the three crawlers identically, you are leaving citation share on the table or leaking training data, sometimes both."

Table 01 · Comparative taxonomy · OAI-SearchBot, PerplexityBot, CCBot

| Dimension | OAI-SearchBot | PerplexityBot | CCBot |
| --- | --- | --- | --- |
| Operator | OpenAI | Perplexity AI | Common Crawl Foundation |
| Primary mission | Index for citation in ChatGPT Search | Index for RAG and Perplexity Answers | Open-source web archive |
| Downstream consumer | ChatGPT Search, SearchGPT, Atlas | Perplexity UI and API | Llama, Mistral, academic NLP |
| Trigger | Continuous discovery, query-weighted | Continuous discovery, retrieval-weighted | Periodic batch snapshots |
| Frequency | Dynamic · freshness + popularity | Continuous · ongoing sweeps | Monthly to multi-month batches |
| Scope | Targeted · high-value semantic pages | Targeted · authoritative URLs | Broad · sitemap-driven sampling |
| Robots.txt | Adheres · 24 hr propagation | Adheres · 24 hr propagation | Adheres · supports opt-out registries |
| JavaScript execution | No | No | No |
| Verification | JSON CIDR + UA | JSON CIDR + UA | JSON CIDR + FCrDNS |

Source · vendor documentation, May 2026

Three crawlers, one robots.txt slot each

How each crawler actually behaves

OAI-SearchBot · OpenAI's citation indexer

OAI-SearchBot is OpenAI's proactive retrieval bot. It exists to discover, index, and cache pages for inclusion in ChatGPT Search and Atlas. Functionally, it behaves like a traditional search spider: it prioritises well-structured semantic text that can be summarised and cited with an active link in the user-facing response.

It is not GPTBot. GPTBot compiles training corpora for foundation models; OAI-SearchBot builds a live citation index. The two share infrastructure. If a site allows crawling by both, OpenAI may consolidate the work into a single fetch to satisfy both objectives, which reduces request volume on the origin server but also means a single Allow can serve two purposes you may not have intended.

PerplexityBot · Perplexity's retrieval engine

PerplexityBot is the background agent that maintains Perplexity's real-time retrieval index. Its objective is structural: keep authoritative, content-dense pages discoverable so the retrieval-augmented generation layer can cite them accurately. It deprioritises programmatic listings, faceted directories, and thin paginated archives, which is rational given that those pages rarely become useful citations.

CCBot · Common Crawl's archival harvester

CCBot is the crawler behind the largest open web crawl in the world. Its outputs are released as monthly WARC snapshots and consumed by academic NLP labs, by foundation model teams at Meta and Mistral, and by many downstream open-source training pipelines that do not run their own crawler. Because CCBot has no real-time query to serve, it does not try to fetch every URL on a domain. It samples, and it relies heavily on your XML sitemap to decide which representative pages to capture.

User-triggered fetchers · ChatGPT-User and Perplexity-User

Both OpenAI and Perplexity also operate a second class of agent: ChatGPT-User and Perplexity-User. These do not crawl proactively. They execute a single HTTP GET when a user pastes a URL into the chat interface, or when the assistant needs to fetch a page to answer a question in real time.

Because those fetches represent synchronous human intent, they generally ignore standard Disallow rules in robots.txt. Blocking them at the firewall level does not improve your indexing; it prevents users from sharing your pages inside the assistant. The right mental model is "browser request initiated by a human", not "automated crawler".
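
The distinction between proactive crawlers and user-triggered fetchers can be made explicit when segmenting access logs. The sketch below is a minimal illustration, not a vendor-supplied list: it matches the agent tokens named in this article as case-insensitive substrings, and the `AgentClass` names and `classify_user_agent` helper are hypothetical. It classifies the claimed User-Agent only and does not verify that the claim is genuine.

```python
from enum import Enum


class AgentClass(Enum):
    SEARCH_INDEXER = "search indexer"          # proactive, feeds citations
    TRAINING_CRAWLER = "training crawler"      # proactive, feeds training corpora
    USER_FETCHER = "user-triggered fetcher"    # single GET on behalf of a person
    OTHER = "other"


# Tokens come from the article; the grouping is this sketch's reading of it.
AGENT_TOKENS = [
    ("chatgpt-user", AgentClass.USER_FETCHER),
    ("perplexity-user", AgentClass.USER_FETCHER),
    ("oai-searchbot", AgentClass.SEARCH_INDEXER),
    ("perplexitybot", AgentClass.SEARCH_INDEXER),
    ("gptbot", AgentClass.TRAINING_CRAWLER),
    ("ccbot", AgentClass.TRAINING_CRAWLER),
]


def classify_user_agent(user_agent: str) -> AgentClass:
    """Classify a claimed User-Agent string; the claim itself is unverified."""
    ua = user_agent.lower()
    for token, agent_class in AGENT_TOKENS:
        if token in ua:
            return agent_class
    return AgentClass.OTHER
```

Treating `USER_FETCHER` traffic like browser traffic in firewall rules, rather than like crawler traffic, follows the mental model described above.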

Proactive crawl vs. user-triggered fetch

The JavaScript rendering gap

This is the single biggest source of silent indexing failure on the modern web, and the reason a site can rank well on Google but appear nowhere in ChatGPT or Perplexity. None of the three AI crawlers executes client-side JavaScript. They are not waiting for a hydration event. They are not running a headless Chromium. They take the bytes the server returns, parse the HTML, and move on.

If your application relies on React, Vue, Angular, or Svelte to fetch data from an API and construct the DOM in the browser, an AI crawler sees this and only this:

Initial HTML payload · html
GET /products/atlas-x9

<!-- What the crawler sees before any JS runs -->
<!DOCTYPE html>
<html>
<head>
    <title>Dynamic Web Application</title>
</head>
<body>
    <div id="root"></div>
    <script src="/static/js/bundle.js"></script>
</body>
</html>

That payload is functionally empty. It has a title and a script tag. It has no body copy, no schema markup, no internal links, no navigation. For OAI-SearchBot and PerplexityBot, the URL is effectively a blank page that cannot be cited. For CCBot, it becomes a near-useless row in a WARC archive. The failure modes compound in predictable ways:

  • Empty indexing shells. No text content reaches the index. The URL is registered but cannot be surfaced as a citation.

  • Invisible navigation paths. If internal links are rendered client-side, the crawler cannot follow them. Crawl depth collapses to one.

  • Hidden interactive content. Tabbed panels, accordions, and click-loaded sections never become text.

  • Lost lazy-loaded assets. Images, JSON-LD structured data, and below-the-fold content loaded via IntersectionObserver are not processed.
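
The failure modes above can be checked mechanically by parsing the raw response the way a non-rendering crawler would. The following is a rough sketch using only the Python standard library; the `RawHtmlAudit` class and `audit_raw_html` helper are hypothetical names, and the word count is a crude proxy for "citable text", not a reproduction of any crawler's extraction logic.

```python
from html.parser import HTMLParser


class RawHtmlAudit(HTMLParser):
    """Counts what is visible in raw markup without executing any JavaScript."""

    SKIPPED = {"script", "style"}  # their contents are code, not page text

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._in_body = False
        self.words = 0
        self.links = 0
        self.json_ld_blocks = 0

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body":
            self._in_body = True
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self.json_ld_blocks += 1
        if tag in self.SKIPPED:
            self._skip_depth += 1
        if tag == "a" and attrs.get("href"):
            self.links += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1
        if tag == "body":
            self._in_body = False

    def handle_data(self, data):
        if self._in_body and not self._skip_depth:
            self.words += len(data.split())


def audit_raw_html(html: str) -> dict:
    parser = RawHtmlAudit()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return {
        "words": parser.words,
        "links": parser.links,
        "json_ld_blocks": parser.json_ld_blocks,
    }
```

Fed the empty client-rendered shell shown earlier, this reports zero body words, zero links, and zero JSON-LD blocks, which is the "empty indexing shell" case.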

Which rendering strategy actually works

The fix is to put the populated DOM in the initial HTTP response. There are four ways to do that, each with a different operational cost.

Table 02 · Rendering strategies · AI crawler compatibility (decision matrix)

| Strategy | Mechanism | AI crawler fit | Overhead | Best for |
| --- | --- | --- | --- | --- |
| SSR | Full HTML generated per request | Excellent | High · server compute | E-commerce, dynamic apps |
| SSG | Pre-rendered at build, served from CDN | Excellent | Low | Docs, blogs, marketing sites |
| ISR | Static with on-demand background regeneration | Excellent | Moderate · edge framework | Large content platforms |
| Dynamic prerender | UA-sniff crawlers, serve HTML, SPA to users | Good | Moderate · headless service | Legacy SPAs · cloaking risk |
| CSR (default) | Bundle.js builds the DOM in the browser | Broken | Low | Authenticated apps only |

Compatibility scored against OAI-SearchBot, PerplexityBot, CCBot - Source · vendor docs + RawMktg testing

Verifying real crawlers from spoofed ones

The User-Agent header is a string. A string can be set to anything. Malicious scrapers, competitor monitors, and aggressive data harvesters routinely impersonate AI crawlers to evade rate limits and security policies that grant AI bots a wide lane. If your only filter is the User-Agent string, you are giving that lane to everyone who asks for it.

Two verification techniques are reliable. The first is Forward-Confirmed Reverse DNS, which is the canonical method for CCBot. The second is matching the source IP against a published JSON CIDR feed, which both OpenAI and Perplexity provide for their crawlers.

Forward-Confirmed Reverse DNS

FCrDNS validates that an IP both reverse-resolves to the crawler's official domain and that the resulting hostname forward-resolves back to the same IP. The double check is what makes spoofing hard: an attacker can fake a User-Agent, but cannot easily forge both DNS directions.

FCrDNS verification flow · CCBot example

In practice, you run two host queries and check the answers against each other:

FCrDNS · CCBot verification · bash
3 steps · ~80ms

# Step 1 · reverse lookup on the inbound IP
$ host 18.97.14.84
84.14.97.18.in-addr.arpa domain name pointer 18-97-14-84.crawl.commoncrawl.org.

# Step 2 · pattern check · hostname ends in .crawl.commoncrawl.org ✓

# Step 3 · forward lookup on the resolved hostname
$ host 18-97-14-84.crawl.commoncrawl.org
18-97-14-84.crawl.commoncrawl.org has address 18.97.14.84

# Forward IP matches the inbound IP. Verified CCBot.
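
The same three steps can be scripted for log analysis or an application-level check. This is a minimal sketch using Python's standard `socket` module; the `verify_fcrdns` helper is a hypothetical name, the resolver arguments are injectable so the logic can be exercised without network access, and `socket.gethostbyname_ex` only covers IPv4, so IPv6 sources would need a different forward lookup.

```python
import socket


def verify_fcrdns(ip, allowed_suffix,
                  reverse=socket.gethostbyaddr,
                  forward=socket.gethostbyname_ex):
    """Return True only if ip -> hostname -> ip round-trips inside allowed_suffix."""
    # Step 1: reverse lookup on the inbound IP
    try:
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:  # covers socket.herror and socket.gaierror
        return False

    # Step 2: the hostname must sit under the crawler's domain
    if not hostname.endswith(allowed_suffix.lower()):
        return False

    # Step 3: forward lookup must include the original IP
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

For the CCBot example above, the call would be `verify_fcrdns("18.97.14.84", ".crawl.commoncrawl.org")`. Keeping the leading dot in the suffix avoids accepting a look-alike hostname that merely ends with the same letters.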

Published JSON CIDR feeds

FCrDNS adds latency. For high-volume edges, both OpenAI and Perplexity publish IP range feeds in JSON that you can ingest into your firewall or web server at startup and refresh on a cron.
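
As an illustration of the feed-based check, the sketch below loads a feed body into network objects and tests an address against them with the standard `ipaddress` module. The feed schema (a top-level `prefixes` list of objects with an `ipv4Prefix` or `ipv6Prefix` key) is an assumption about the published format and should be confirmed against the live feed; `load_prefixes` and `ip_in_feed` are hypothetical helper names.

```python
import ipaddress
import json


def load_prefixes(feed_json: str) -> list:
    """Parse a JSON CIDR feed into ipaddress network objects.

    Assumed schema: {"prefixes": [{"ipv4Prefix": "a.b.c.d/n"}, ...]}
    """
    feed = json.loads(feed_json)
    networks = []
    for entry in feed.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr, strict=False))
    return networks


def ip_in_feed(ip: str, networks) -> bool:
    """True if the address falls inside any published range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in networks)
```

Fetching the feed (and caching it between refreshes) is left out on purpose; the function takes the already-downloaded JSON text so the check itself stays free of network calls.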

Nginx · programmatic spoof prevention

The cleanest place to enforce this is the edge. The configuration below uses the Nginx geo and map directives to check that any request claiming to be OAI-SearchBot actually originates from one of OpenAI's published CIDR blocks. Spoofed requests get a 403 before they reach the application.

Nginx · spoof prevention for AI indexers · nginx
edge / WAF layer
# ====================================================================
# NGINX · SPOOF PREVENTION MAP · GENERATIVE-AI INDEXERS
# ====================================================================

# 1. Verified OpenAI IP range map (from openai.com/searchbot.json)
geo $is_verified_openai {
    default 0;
    20.42.10.176/28     1;   # OAI-SearchBot CIDR
    172.203.190.128/28  1;
    51.8.102.0/24       1;
    135.234.64.0/24     1;
}

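# geo matches on $remote_addr. Behind a CDN or load balancer, restore the
# client IP first (for example with the realip module) or every request
# will be judged by the proxy's address.

# 2. Map the claimed User-Agent to a pass/fail flag (1 = pass, 0 = spoofed)
map $http_user_agent $searchbot_claim_ok {
    default            1;                       # no OAI-SearchBot claim: pass through
    "~*OAI-SearchBot"  $is_verified_openai;     # claim must come from a listed OpenAI IP
}

server {
    listen 80;
    listen 443 ssl;
    server_name example.com;

    # "listen ... ssl" needs a certificate pair; these paths are placeholders
    ssl_certificate     /etc/nginx/ssl/example.com.crt;
    ssl_certificate_key /etc/nginx/ssl/example.com.key;

    location / {
        # Claims to be OAI-SearchBot but not from a verified IP → 403
        if ($searchbot_claim_ok = 0) {
            return 403;
        }
        try_files $uri $uri/ /index.html;
    }
}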

Run the equivalent map for PerplexityBot using its JSON feed. For CCBot, layer FCrDNS on top, since Common Crawl's IP ranges shift more often than the OpenAI list and reverse DNS is the authoritative check.
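
Hand-copying CIDR blocks into Nginx tends to drift out of date, so one option is to render the `geo` block from the feed on each refresh. The sketch below is illustrative only: the feed schema is assumed to be a `prefixes` list with `ipv4Prefix` / `ipv6Prefix` keys, `render_geo_block` and the variable name `is_verified_perplexity` are hypothetical, and the sample ranges in the usage note are documentation placeholders, not Perplexity's real ranges.

```python
import ipaddress
import json


def render_geo_block(feed_json: str, variable: str) -> str:
    """Render an Nginx geo block that sets $<variable> to 1 for listed ranges."""
    feed = json.loads(feed_json)
    lines = ["geo $" + variable + " {", "    default 0;"]
    for entry in feed.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if not cidr:
            continue
        # ip_network() validates the value before it reaches the config file
        network = ipaddress.ip_network(cidr, strict=False)
        lines.append("    " + str(network) + " 1;")
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Called with a feed containing the placeholder ranges `192.0.2.0/24` and `198.51.100.0/24`, this produces a block with one `default 0;` line and one `<cidr> 1;` line per range. Writing the result to an included file and running `nginx -t` before reloading keeps a malformed feed from taking down the edge.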

A unified robots.txt and sitemap strategy

The point of the configuration below is to express a specific policy: be discoverable for citation, be invisible to training. Allow OAI-SearchBot and PerplexityBot full access. Disallow GPTBot, which builds OpenAI's foundation-model training set, and CCBot, which feeds downstream open-source training pipelines. Declare the sitemap once so both of the crawlers you allow can find it.

robots.txt · citation-on, training-off · text
/robots.txt
# ====================================================================
# ROBOTS.TXT · ALLOW REAL-TIME CITATION, BLOCK TRAINING
# ====================================================================

# 1. Allow OAI-SearchBot · ChatGPT Search citations
User-agent: OAI-SearchBot
Allow: /
# Crawl-delay is non-standard (not part of RFC 9309); crawlers may ignore it
Crawl-delay: 5

# 2. Allow PerplexityBot · Perplexity answers and RAG
User-agent: PerplexityBot
Allow: /

# 3. Block GPTBot · OpenAI foundation-model training
User-agent: GPTBot
Disallow: /

# 4. Block CCBot · downstream open-source training corpora
User-agent: CCBot
Disallow: /

# Declare the sitemap so allowed crawlers can find fresh content
Sitemap: https://example.com/sitemap.xml
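
Before shipping a policy like this, it can be sanity-checked with Python's standard `urllib.robotparser`. The sketch below is a quick local check, not a guarantee of how each vendor's own parser behaves; `policy_for` is a hypothetical helper, and `ROBOTS_TXT` reproduces the rules from the file above.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /
Crawl-delay: 5

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

Sitemap: https://example.com/sitemap.xml
"""


def policy_for(robots_txt: str, agents, url: str) -> dict:
    """Map each agent token to whether the rules let it fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}
```

Running `policy_for(ROBOTS_TXT, ["OAI-SearchBot", "PerplexityBot", "GPTBot", "CCBot"], "https://example.com/authoritative-guide")` should report the first two as allowed and the last two as blocked. Agents with no matching group fall through to allowed, since the file has no `User-agent: *` group.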

The robots.txt / noindex paradox

This is the trap that catches most teams the first time they try to remove a page from ChatGPT Search. If you globally disallow a URL in robots.txt, OpenAI may still display the title as a bare navigational link if it discovered the URL via a third-party reference. To suppress the URL entirely, you need a noindex meta tag on the page. But the bot must be able to crawl the page to read that tag.

The resolution is to allow the page in robots.txt and add the meta directive in the HTML head:

noindex directive · per-page exclusion · html
<head>
    <title>Internal-only resource</title>
    <!-- Bot-specific: blocks only ChatGPT Search -->
    <meta name="OAI-SearchBot" content="noindex">

    <!-- Universal: blocks all bots that honour the directive -->
    <meta name="robots" content="noindex, nofollow">
</head>
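
The paradox is easy to reintroduce by accident, so it helps to test for it. The sketch below combines the standard `urllib.robotparser` and `html.parser` modules to flag a page whose noindex tag can never be read because robots.txt blocks the bot. The `MetaRobots` class, the `noindex_status` helper, and its four status strings are hypothetical; the bot-specific meta name follows the example above.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobots(HTMLParser):
    """Detects a noindex directive addressed to all robots or to one bot."""

    def __init__(self, bot: str):
        super().__init__()
        self._names = {"robots", bot.lower()}
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = {key: (value or "") for key, value in attrs}
        name = attrs.get("name", "").lower()
        directives = {d.strip().lower() for d in attrs.get("content", "").split(",")}
        if name in self._names and "noindex" in directives:
            self.noindex = True


def noindex_status(robots_txt: str, page_html: str, bot: str, url: str) -> str:
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    crawlable = rules.can_fetch(bot, url)

    meta = MetaRobots(bot)
    meta.feed(page_html)
    meta.close()

    if meta.noindex and crawlable:
        return "noindex-readable"            # the bot can fetch the page and see the tag
    if meta.noindex and not crawlable:
        return "noindex-hidden-by-disallow"  # the paradox: the tag exists but is never read
    if crawlable:
        return "indexable"
    return "disallowed-no-noindex"           # a bare title may still surface
```

Any page returning `noindex-hidden-by-disallow` needs its Disallow rule lifted for the noindex directive to take effect.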

XML sitemaps that actually help

All three of the crawlers we are configuring for use the sitemap, but they use it differently. OAI-SearchBot and PerplexityBot treat it as a freshness hint: a starting point for what to revisit. CCBot treats it as a sampling guide: a way to choose which representative pages to capture in its archive. The implications for sitemap hygiene are the same in either case.

  • Exclude non-canonical URLs. No query parameters, no faceted filters, no duplicate paths. One URL per page of content.

  • Every URL returns 200. Broken sitemap entries train the crawler to deprioritise the entire feed.

  • Partition by velocity. Split the sitemap index into a high-frequency feed for fresh content and a low-frequency feed for static archives. Set lastmod honestly.

  • Include changefreq and priority sparingly. They are hints, not directives. Lying about them costs trust.

sitemap.xml · minimal correct shape · xml
/sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/authoritative-guide</loc>
    <lastmod>2026-05-20</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
</urlset>
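
The hygiene rules above can be linted offline before the sitemap is published. This is a minimal sketch using the standard `xml.etree.ElementTree` module; `sitemap_issues` is a hypothetical helper, it does not make HTTP requests (so it cannot confirm that every URL returns 200), and its lastmod check is a simplification that only accepts values starting with a full `YYYY-MM-DD` date.

```python
import xml.etree.ElementTree as ET
from datetime import date
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_issues(sitemap_xml: bytes) -> list:
    """Return (loc, problem) pairs for entries that break the hygiene rules."""
    root = ET.fromstring(sitemap_xml)
    seen = set()
    issues = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()

        # Non-canonical: query parameters or fragments
        parts = urlsplit(loc)
        if parts.query or parts.fragment:
            issues.append((loc, "has query string or fragment"))

        # One URL per page of content
        if loc in seen:
            issues.append((loc, "duplicate entry"))
        seen.add(loc)

        # lastmod is optional, but if present it should be a real date
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if lastmod:
            try:
                date.fromisoformat(lastmod[:10])
            except ValueError:
                issues.append((loc, "lastmod is not an ISO date"))
    return issues
```

An empty list means the file passes these three checks; the single-entry sitemap shown above does.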

The /llms.txt convention

/llms.txt is an emerging convention for AI-specific discovery. Placed at the root of your domain as a plain markdown file, it gives the crawler a curated map of your highest-value pages, optimised for context-constrained parsing. Treat it as a complement to your sitemap, not a replacement for it. Adoption is not yet universal across the three crawlers, but the cost of publishing one is trivial and the upside is real.

The GEO optimisations that move the needle

Once the rendering, verification, and robots configuration are correct, three secondary optimisations consistently improve citation outcomes across all three crawlers. None of them is novel. All three are skipped by most teams because the audit signal is weak.

ARIA roles and semantic layout

Generative search engines do not just scrape text. They use the structural context of the document to decide what is content, what is navigation, and what is chrome. ChatGPT Atlas in particular uses ARIA roles, states, and properties to map page structure when it needs to interact with a site agentically. The cost of marking up your structure properly is minutes per template. The benefit is durable.

semantic + ARIA · article template · html
structure as signal
<main id="primary-content" role="main">
    <article>
        <header>
            <h1>Technical overview</h1>
            <p class="byline">Published 2026-05-21 · Maya Ramaswamy</p>
        </header>
        <section aria-labelledby="summary">
            <h2 id="summary">Summary</h2>
            <p>Informational text goes here.</p>
        </section>
    </article>
</main>

<button type="button"
        aria-expanded="false"
        aria-controls="collapsible-menu">
    Advanced technical details
</button>
<div id="collapsible-menu" hidden>
    <p>Collapsed detail text goes here.</p>
</div>

UTM referral tracking

To measure traffic from generative search, configure analytics to recognise the UTM query parameters that AI platforms append. ChatGPT Search appends a utm_source=chatgpt.com parameter to outbound citation links. Perplexity and other platforms behave similarly. Without explicit segmentation, this traffic lands in your "direct" or "referral" bucket and becomes invisible.
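
If your analytics tool lacks a built-in channel for this, the segmentation can be done on raw landing URLs. The sketch below is a minimal illustration: `ai_channel` and the `AI_SOURCES` table are hypothetical, and only the `chatgpt.com` value comes from this article, so other platforms' values should be added after confirming them in your own logs.

```python
from urllib.parse import parse_qs, urlsplit

# utm_source value -> channel label. Only chatgpt.com is taken from the
# article; extend this table with values observed in your own traffic.
AI_SOURCES = {
    "chatgpt.com": "ChatGPT Search",
}


def ai_channel(landing_url: str):
    """Return the AI channel label for a landing URL, or None if it has none."""
    query = parse_qs(urlsplit(landing_url).query)
    source = query.get("utm_source", [""])[0].strip().lower()
    return AI_SOURCES.get(source)
```

For example, `ai_channel("https://example.com/guide?utm_source=chatgpt.com")` returns `"ChatGPT Search"`, while a URL with no `utm_source` returns `None` and stays in its existing bucket.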

The pre-deployment validation routine

Before you ship a change that touches templates, headers, or routing, run this four-step audit. It catches roughly 80% of the regressions we see in client work.

pre-deployment validation checklist · bash
4 checks · < 30s
# 1. Confirm the raw HTML contains the content you expect
#    (no JS execution · what the AI crawler will actually see)
curl -sS -A "OAI-SearchBot" -L "https://example.com/target-page" \
  | grep -i "critical-payload"

# 2. Confirm reverse DNS resolves cleanly for your own host
#    (loop: dig can return several records, and host takes one address at a time)
for ip in $(dig +short A example.com | grep -E '^[0-9.]+$'); do
  host "$ip"
done

# 3. Robots.txt returns 200 OK and the right content-type
curl -sI "https://example.com/robots.txt"

# 4. Sitemap is reachable and contains no broken <loc> entries
curl -s "https://example.com/sitemap.xml" \
  | grep -o "<loc>[^<]*" \
  | while read -r url; do
      code=$(curl -o /dev/null -s -w "%{http_code}" "${url#<loc>}")
      echo "$code  ${url#<loc>}"
    done | grep -v "^200"

What to actually do this week

The work in this article is mostly configuration, not strategy. If your team has a one-week window to close the most expensive gaps, the priority order is unambiguous.

  • Day 1 · Verify your rendering. Run curl -A "OAI-SearchBot" $URL against your ten highest-value pages. If the body is empty, the rest of the list does not matter yet. Plan an SSR or ISR migration.

  • Day 2 · Ship the unified robots.txt. Allow OAI-SearchBot and PerplexityBot. Disallow GPTBot and CCBot unless your strategy is explicitly to feed open training corpora.

  • Day 3 · Audit your sitemap. Strip non-canonical URLs. Verify every <loc> returns 200. Split into velocity-based partitions.

  • Day 4 · Stand up verification at the edge. Ingest the OpenAI and Perplexity JSON feeds into your WAF or Nginx config. Layer FCrDNS for CCBot.

  • Day 5 · Add the analytics segment. Carve out utm_source=chatgpt.com and equivalents into a dedicated channel so you can actually measure citation traffic.

Citation share is now the easiest variable to move in technical SEO, because most of your competitors have not done any of the above. The configuration above closes most of the gap.


Methodology

Synthesis of vendor documentation (OpenAI, Perplexity, Common Crawl), security tooling guidance (DataDome, Cloudflare), and field testing on 10 brand sites between February and May 2026. Configuration snippets validated on Nginx 1.26 and a representative SSG / SSR matrix.

What we don't know

Crawler behaviour evolves. The CIDR feeds listed here are accurate as of publication and refresh independently. JavaScript rendering posture has shifted before and may shift again; re-audit quarterly. Adoption of /llms.txt is patchy.
