robots.txt
A plain-text file at the root of a domain that tells crawlers which paths they may or may not request.
Definition
robots.txt is a plain-text file located at the root of a domain (`/robots.txt`) that uses the Robots Exclusion Protocol to tell well-behaved crawlers which paths they should and shouldn't fetch.
robots.txt controls crawling, not indexing. A page disallowed in robots.txt can still appear in Google's index if other pages link to it — Google may simply show a generic snippet because it never fetched the content. To keep a page out of the index, a `noindex` meta tag (on a page Google is allowed to crawl) is the appropriate tool. Each subdomain needs its own robots.txt; rules on `example.com/robots.txt` do not affect `shop.example.com`.
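To make the distinction concrete, here is a minimal sketch using a hypothetical `/private/` path: the robots.txt rule stops compliant crawlers from fetching those URLs, while the `noindex` meta tag, placed in the `<head>` of a page crawlers are allowed to fetch, keeps that page out of search results.

```
# robots.txt: compliant crawlers will not fetch anything under /private/
User-agent: *
Disallow: /private/
```

```html
<!-- In the page's <head>: the page can still be crawled, but will not be indexed -->
<meta name="robots" content="noindex">
```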
Examples
Blocking a directory
A site adds `User-agent: *` and `Disallow: /admin/` to its robots.txt so general crawlers skip the admin URLs. The admin pages are not crawled, though links to them elsewhere on the web could still cause Google to list those URLs in results.
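Put together, the relevant lines of the file read:

```
User-agent: *
Disallow: /admin/
```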
Pointing crawlers at the sitemap
A site adds `Sitemap: https://example.com/sitemap.xml` at the bottom of its robots.txt so any crawler that fetches the file knows where to find the canonical URL list.
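Combining both examples, a complete robots.txt for this hypothetical site might look like:

```
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
```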
Related terms
- Sitemap: A file, usually XML, that lists URLs on a site so search engines can discover and crawl them more efficiently.
- Crawl Budget: The number of URLs a search engine crawler will fetch and the rate at which it fetches them on a given site.
- Indexing: The process by which a search engine analyses a fetched page and stores information about it so the page can later be returned in search results.
Last updated: 2026-05-10