robots.txt
A plain-text file at the root of a domain that tells crawlers which paths they may or may not request.
Definition
robots.txt is a plain-text file located at the root of a domain (`/robots.txt`) that uses the Robots Exclusion Protocol to tell well-behaved crawlers which paths they should and shouldn't fetch.
robots.txt controls crawling, not indexing. A page disallowed in robots.txt can still appear in Google's index if other pages link to it — Google may simply show a generic snippet because it never fetched the content. To keep a page out of the index, a `noindex` meta tag (on a page Google is allowed to crawl) is the appropriate tool. Each subdomain needs its own robots.txt; rules on `example.com/robots.txt` do not affect `shop.example.com`.
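To make the distinction concrete, here is a minimal sketch using a hypothetical `/private/` path: the robots.txt rule stops compliant crawlers from fetching those URLs, while the `noindex` meta tag, placed in the `<head>` of a page crawlers are allowed to fetch, keeps that page out of search results.

```
# robots.txt: compliant crawlers will not fetch anything under /private/
User-agent: *
Disallow: /private/
```

```html
<!-- In the page's <head>: the page can still be crawled, but will not be indexed -->
<meta name="robots" content="noindex">
```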
Examples
Blocking a directory
A site adds `User-agent: *` and `Disallow: /admin/` to its robots.txt so general crawlers skip the admin URLs. The admin pages are not crawled, though links to them elsewhere on the web could still cause Google to list those URLs in results.
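Put together, the relevant lines of the file read:

```
User-agent: *
Disallow: /admin/
```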
Pointing crawlers at the sitemap
A site adds `Sitemap: https://example.com/sitemap.xml` at the bottom of its robots.txt so any crawler that fetches the file knows where to find the canonical URL list.
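Combining both examples, a complete robots.txt for this hypothetical site might look like:

```
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
```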
Related terms
- Sitemap: A file, usually XML, that lists URLs on a site so search engines can discover and crawl them more efficiently.
- Crawl Budget: The number of URLs a search engine crawler will fetch and the rate at which it fetches them on a given site.
- Indexing: The process by which a search engine analyses a fetched page and stores information about it so the page can later be returned in search results.
Last updated: 2026-05-10