Your robots.txt file helps search engines crawl the right pages, conserve crawl budget, and keep low-value or sensitive areas out of the index. Here’s a practical, modern setup for WordPress and WooCommerce — plus what to avoid, how to test, and FAQs.
What is a robots.txt file?
robots.txt is a plain text file placed in the root of your domain (e.g., https://yourdomain.co.za/robots.txt).
It provides instructions (called “directives”) to search engine crawlers about which paths they may crawl and which they should avoid.
While it doesn’t guarantee that every bot will obey, major engines like Google and Bing respect it and use it to plan their crawl.
- Location: must be at the root of the hostname you want to control.
- Format: UTF-8 plain text. No fancy characters needed.
- Purpose: guide crawling (not indexing). For de-indexing, use noindex or removals in Search Console.
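For orientation, here's what a minimal, fully permissive file looks like (the domain is a placeholder; the empty Disallow: line means "block nothing"):

User-agent: *
Disallow:

Sitemap: https://yourdomain.co.za/sitemap_index.xml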
Why robots.txt matters for SEO
- Conserves crawl budget: Stop bots from wasting time on admin areas, plugin files, and URL parameters that don’t need crawling.
- Protects sensitive areas: Directories like /wp-admin/ shouldn't be in search results.
- Keeps the index clean: Reduces duplicate/thin URLs created by parameters such as ?limit= or cart actions.
- Faster discovery of key content: Pair robots.txt with your XML sitemap so bots find your important pages first.
Copy-paste robots.txt for WordPress (+ WooCommerce)
Use this as a safe starting point. It’s grounded in real-world WordPress setups and mirrors the rules I use on production sites.
Recommended template
User-agent: *
Allow: /wp-admin/admin-ajax.php
Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /wp-admin/
Disallow: /?add-to-cart*
Disallow: /?limit*
Disallow: /?add-to-wishlist=*
Disallow: /?wordfence_lh=*
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
User-agent: AhrefsBot
Allow: /
User-agent: SemrushBot
Allow: /
User-agent: MajesticBot
Allow: /
Sitemap: https://yourdomain.co.za/sitemap_index.xml
What each rule does:
- Allow: /wp-admin/admin-ajax.php — keeps AJAX endpoints working for themes/plugins.
- Allow: /wp-content/uploads/ — lets your images and media be crawled and indexed.
- Disallow: /wp-content/plugins/ and /wp-admin/ — hide sensitive plugin and admin files from crawlers.
- Parameter blocks (?add-to-cart, ?limit, ?add-to-wishlist, ?wordfence_lh) — reduce duplicate pages and low-value parameter URLs.
- Sitemap: — points bots to your XML sitemap index for faster discovery of new/updated content.
Note: If your SEO plugin (e.g., Rank Math or Yoast) serves the sitemap at a different path, update the Sitemap: URL accordingly.
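If you rely on WordPress core's built-in sitemap rather than an SEO plugin, the index is typically served at /wp-sitemap.xml (verify the path on your own install), so the directive would read:

Sitemap: https://yourdomain.co.za/wp-sitemap.xml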
How to customise for your stack
Use the template above as the baseline, then adapt.
- WooCommerce stores: Keep the cart and filter parameter blocks. If your theme creates additional filter params (e.g., ?orderby=), consider blocking those too if they generate thin/duplicate pages (see the sketch after this list).
- Membership / LMS: If you gate content behind logins, robots.txt won't protect it — use authentication and ensure private URLs are not linked publicly. Consider noindex where appropriate.
- Staging & dev: Use HTTP auth or block the whole site with Disallow: / (and remove before launch!). I prefer HTTP auth to avoid accidental indexing.
- Images & CSS/JS: Do not block resources needed for rendering. Google needs your CSS/JS to render pages correctly. Leaving /wp-content/uploads/ allowed is essential.
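To illustrate the WooCommerce and staging points above (?orderby= is just an example parameter; confirm which parameters your theme actually generates before blocking anything), you could add another line to the template's User-agent: * group:

Disallow: /?orderby*

Depending on your URL structure you may need the broader /*?orderby= pattern so the rule also matches the parameter when it's appended to category or product paths. A staging-only block, to be deleted before launch, is simply:

User-agent: *
Disallow: /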
Common robots.txt mistakes to avoid
- Accidental site-wide block: User-agent: * followed by Disallow: / tells all bots to avoid your site entirely.
- Relying on robots.txt to hide confidential content: It only requests bots not to crawl; it does not secure URLs.
- Blocking vital resources: If CSS/JS is blocked, Google may render a broken page and judge quality incorrectly.
- Forgetting the XML sitemap: Always include the Sitemap: directive to help with discovery.
- Trying to “noindex” in robots.txt: Google no longer supports noindex in robots rules. Use a <meta name="robots" content="noindex"> tag or HTTP header instead.
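In practice, the two supported noindex forms look like this: a meta tag placed in the page's <head>, or an X-Robots-Tag response header (shown here as a raw header line; how you set it depends on your server or SEO plugin):

<meta name="robots" content="noindex">

X-Robots-Tag: noindex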
When to use noindex vs Disallow
Disallow prevents crawling, but a URL can still be indexed if it’s discovered via links.
noindex tells search engines not to include the page in results (even if it’s crawlable).
- Use noindex for thin tag archives, internal search results, thank-you pages, and utility pages you never want in SERPs.
- Use Disallow for admin directories and repetitive parameter URLs to save crawl budget.
- Use both (carefully) when a page should not be indexed and is not worth crawling: let crawlers see the noindex first, because once a URL is disallowed Google can't recrawl the page to read the noindex tag.
Testing & monitoring
- Live check: Visit /robots.txt in your browser to verify it's accessible.
- robots.txt report: In Google Search Console, check the robots.txt report (under Settings) to confirm the file was fetched without errors, and use URL Inspection to see whether a specific URL is allowed or blocked.
- Crawl stats: Review GSC » Settings » Crawl stats to confirm bots are spending time on the right sections.
- Cache clear: If you use Cloudflare/WP Rocket, purge cache after updates so bots receive the latest file.
Robots.txt FAQ
Do I really need a robots.txt file?
Yes. While a site can function without it, a well-configured robots.txt helps guide crawlers, protect admin areas, and conserve crawl budget.
Where do I put my robots.txt file?
At the domain root, e.g., https://yourdomain.co.za/robots.txt. If you have multiple subdomains, each subdomain needs its own file.
Can I block a specific bot?
Yes. Add a user-agent section with Disallow: /. Be cautious: some bots ignore robots.txt entirely.
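For example, to block a single crawler (BadBot is a placeholder; substitute the bot's actual user-agent token):

User-agent: BadBot
Disallow: /

Crawlers that don't match this group simply fall back to your general User-agent: * rules.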
Should I include my XML sitemap?
Absolutely. The Sitemap: directive accelerates discovery and helps search engines find your most important URLs.
Is robots.txt enough to remove a page from Google?
No. Use noindex (meta tag or HTTP header), or request removal in Google Search Console. Robots.txt only governs crawling, not indexing.
Wrap-up
Keep your robots.txt simple, readable, and aligned with how your site actually works. Start with the template in this post, then adjust for your plugins, parameters, and sitemap location. Pair it with sensible use of noindex, canonicals, and XML sitemaps, and you’ll keep crawlers focused on what matters most: your revenue-driving pages.
