What is Common Crawl / CCBot?
Common Crawl is a large public dataset of crawled web pages, gathered by its crawler CCBot, that many AI models and tools use as a training and reference corpus.
How it works
Because Common Crawl is so widely reused, being present in it means your content can flow into many downstream models and datasets at once. CCBot is the bot that collects it, and your robots.txt decides whether CCBot may crawl your site.
Allowing CCBot maximises how widely your content propagates into the AI ecosystem. Blocking it is a training-protection stance, the same trade-off as with GPTBot: wider footprint versus tighter control over training use.
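As a rough illustration of the robots.txt decision, the sketch below uses Python's standard urllib.robotparser to check whether a given robots.txt lets CCBot fetch a page. The two robots.txt bodies, the is_allowed helper, and the example.com URL are hypothetical and exist only for illustration. The standard-library parser is also only an approximation of how any particular crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper: parses the text locally, no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative policy 1: explicitly let CCBot crawl everything.
ALLOW_CCBOT = """\
User-agent: CCBot
Allow: /
"""

# Illustrative policy 2: keep CCBot out of the whole site.
BLOCK_CCBOT = """\
User-agent: CCBot
Disallow: /
"""

print(is_allowed(ALLOW_CCBOT, "CCBot"))  # True
print(is_allowed(BLOCK_CCBOT, "CCBot"))  # False
```

Running a check like this against your own robots.txt before deploying it is a cheap way to confirm that the file expresses the stance you intended.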
CCBot vs a search crawler
CCBot builds a general-purpose corpus reused across many models. A search crawler like OAI-SearchBot or PerplexityBot builds one engine's live citation index. One shapes broad training data; the other shapes whether you are cited in a specific product today.
Why it matters for B2B
For a brand that wants the widest possible presence across AI systems, Common Crawl is a high-leverage corpus to be in. For one focused on controlling training use, CCBot is the bot to weigh most carefully.
A common mistake is shipping a blanket Disallow: / or a catch-all bot block to "stop the scrapers," then wondering why the brand is absent from AI tools. That one line also blocks CCBot, which keeps your pages out of Common Crawl, the corpus many downstream models reuse.
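The effect of a blanket rule compared with a targeted one can be sketched with Python's standard urllib.robotparser. Both robots.txt bodies, the crawl_report helper, and the example.com URL are assumptions made for illustration. The targeted policy shown (blocking only GPTBot) is just one possible stance, not a recommendation.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from the text above.
BOTS = ["CCBot", "GPTBot", "OAI-SearchBot", "PerplexityBot"]

# Illustrative blanket rule: shuts out every crawler that honours robots.txt.
BLANKET = """\
User-agent: *
Disallow: /
"""

# Illustrative targeted rule: block one training crawler, allow the rest.
TARGETED = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def crawl_report(robots_txt: str,
                 url: str = "https://example.com/") -> dict[str, bool]:
    """Map each bot in BOTS to whether robots_txt lets it fetch url.

    Hypothetical helper: parses the text locally, no network access.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, url) for bot in BOTS}


for name, policy in [("blanket", BLANKET), ("targeted", TARGETED)]:
    print(name, crawl_report(policy))
```

Under the blanket policy every bot in the list is refused, including CCBot and the search crawlers. Under the targeted policy only GPTBot is refused.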