Security

How to Block Bad Bots: Protect Your Server from Scrapers

May 15, 2026


The Bot Problem: Invisible Traffic Draining Your Server

Bots account for nearly half of all internet traffic, and a significant portion of that is malicious. Content scrapers steal your articles and republish them on spam sites. Credential stuffing bots try thousands of stolen username/password combinations against your login page. SEO spam bots probe for vulnerable forms to inject backlinks. Vulnerability scanners relentlessly probe your server for known exploits. And all of them consume your server's CPU, bandwidth, and memory — resources that should be serving your real users.

The challenge is that not all bots are bad. Googlebot needs to crawl your site for search indexing. Uptime monitors check that your site is available. Payment providers like Stripe and PayPal deliver webhook callbacks. You need to block the bad bots while letting the good ones through, and the line between them is not always obvious.

This guide covers practical strategies for identifying and blocking malicious bots: user-agent analysis, behavioral detection, Nginx blocking rules, ModSecurity WAF rules, fail2ban custom filters, honeypot traps, and Cloudflare Bot Management.

What you will learn: How to distinguish good bots from bad bots, analyze access logs for bot patterns, block by user-agent string, implement rate limiting, create honeypot traps, configure ModSecurity rules, set up fail2ban bot filters, and integrate with Cloudflare Bot Management.

Good Bots vs Bad Bots

Before you start blocking, you need to understand what you are dealing with. Here is how to categorize the bots hitting your server:

| Bot Type | Examples | Intent | Action |
| --- | --- | --- | --- |
| Search Engine Crawlers | Googlebot, Bingbot, YandexBot | Index your content for search | Allow |
| Social Media Bots | Facebookbot, Twitterbot, LinkedInBot | Generate link previews | Allow |
| Monitoring Bots | UptimeRobot, Pingdom | Check site availability | Allow |
| AI Training Bots | GPTBot, ClaudeBot, CCBot | Scrape content for AI training | Your choice |
| Content Scrapers | MJ12bot, AhrefsBot, SemrushBot | Scrape content / analyze competitors | Rate limit or block |
| Vulnerability Scanners | Nikto, Nessus, sqlmap | Find exploitable vulnerabilities | Block |
| Credential Stuffing | Custom scripts | Try stolen login credentials | Block |
| Spam Bots | Various | Submit spam forms, inject links | Block |

Identifying Bad Bots in Your Access Logs

The first step is understanding what bots are currently hitting your server. Your Nginx or Apache access logs contain a wealth of information.

Analyzing User-Agent Strings

# Top 20 user agents by request count
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

# Find requests with suspicious user agents
grep -iE "(python|curl|wget|scrapy|bot|spider|crawl)" /var/log/nginx/access.log | wc -l

# Find requests with empty user agents
awk -F'"' '$6 == "-" || $6 == ""' /var/log/nginx/access.log | wc -l

Behavioral Analysis

User-agent strings can be spoofed, so behavioral analysis is often more reliable. Look for these patterns:

Signs of Bad Bot Activity

  • Hundreds of requests per second from one IP
  • Hitting pages in rapid sequence without loading CSS/JS/images
  • Accessing /wp-login.php, /xmlrpc.php, /admin repeatedly
  • Probing for non-existent files (.env, wp-config.php.bak, .git/HEAD)
  • Empty or fake user-agent strings
  • No cookie support (every request starts a new session)

Signs of Legitimate Traffic

  • Requests load CSS, JS, images alongside HTML
  • Reasonable time between page views (reading time)
  • Follows internal links naturally
  • Accepts and returns cookies
  • Supports JavaScript execution
  • Comes from known IP ranges (Google, Bing, etc.)
# Top IPs by request count (potential bots make hundreds/thousands)
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

# IPs hitting sensitive paths
grep -E "(wp-login|xmlrpc|\.env|\.git|wp-config)" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20

# IPs generating 404 errors (probing for vulnerabilities)
awk '$9 == 404 {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

Why robots.txt Is Not Enough

Many people assume that robots.txt blocks bots, but it does not. The robots.txt file is a polite request, not an enforcement mechanism. Good bots (Googlebot, Bingbot) respect it. Bad bots ignore it entirely. In fact, some malicious bots specifically read robots.txt to find the paths you are trying to hide — then they target those paths.

robots.txt is for good bots only. It tells search engines which pages not to index. It does not prevent access to anything. Never put sensitive paths in robots.txt thinking it will protect them — you are actually advertising them to attackers.
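For reference, this is all a robots.txt can do — express preferences that compliant crawlers may honor (the paths below are illustrative):

# robots.txt — a request, not an enforcement mechanism
User-agent: *
Disallow: /admin/     # compliant crawlers skip this; attackers read it as a map
Crawl-delay: 10       # ignored by many crawlers, including Googlebot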

Nginx User-Agent Blocking

The most straightforward defense is blocking bots by their user-agent string. While sophisticated bots spoof their user-agent, many bad bots still identify themselves honestly (or use obviously fake agents).

# /etc/nginx/conf.d/bad-bots.conf
# Map bad user agents to block status
map $http_user_agent $bad_bot {
    default 0;

    # Vulnerability scanners
    ~*nikto       1;
    ~*nessus      1;
    ~*sqlmap      1;
    ~*nmap        1;
    ~*masscan     1;
    ~*dirbuster   1;
    ~*gobuster    1;

    # Content scrapers
    ~*MJ12bot     1;
    ~*DotBot      1;
    ~*BLEXBot     1;
    ~*ZoominfoBot 1;

    # Spam and abuse bots
    ~*Xenu        1;
    ~*Baiduspider 1;
    ~*360Spider   1;

    # Generic automation tools
    ~*python-requests 1;
    ~*scrapy      1;
    ~*java/       1;
    ~*httpclient  1;

    # Empty user agent
    "" 1;
}

Apply the block in your server configuration:

server {
    # Block bad bots
    if ($bad_bot) {
        return 444;  # Close connection without response
    }
}
Why return 444? The 444 status code is an Nginx-specific response that closes the connection immediately without sending any data back. This is more efficient than returning a 403 because it wastes fewer server resources on the bot — no response body to generate and send.

Blocking AI Training Bots

If you want to prevent AI companies from scraping your content for training data, add their known user agents to your block list:

# AI training bots (add to the map block above)
~*GPTBot          1;
~*ChatGPT-User    1;
~*CCBot           1;
~*anthropic-ai    1;
~*ClaudeBot       1;
~*Google-Extended 1;
~*Bytespider      1;
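Server-side blocking is the enforceable option, but some AI crawlers — GPTBot and Google-Extended among them, according to their operators' documentation — do honor robots.txt, so a softer opt-out is also possible:

# robots.txt — opt out of AI training crawls (respected by compliant AI bots only)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /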

Rate Limiting Per IP

Even bots with legitimate-looking user agents can be detected by their request volume. Nginx's rate limiting module is your first line of defense:

# Define rate limit zones
limit_req_zone $binary_remote_addr zone=bot_limit:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=bot_conn:10m;

server {
    # Limit requests per second per IP
    limit_req zone=bot_limit burst=20 nodelay;

    # Limit concurrent connections per IP
    limit_conn bot_conn 20;

    # Return 429 Too Many Requests (not default 503)
    limit_req_status 429;
    limit_conn_status 429;
}
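One caveat: a flat per-IP limit throttles Googlebot during a heavy crawl just as readily as a scraper. A common Nginx pattern is to key the zone on an empty string for trusted sources, since requests with an empty key are not counted. A sketch, replacing the bot_limit zone definition above — the IP range shown is illustrative, so verify it against the crawler's published list:

# Exempt trusted crawler ranges from rate limiting
geo $limit {
    default        1;
    66.249.64.0/19 0;   # example Googlebot range — check Google's published ranges
}

map $limit $limit_key {
    0 "";                    # empty key = request is never counted
    1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=bot_limit:10m rate=10r/s;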

Honeypot Traps: Catching Bots Red-Handed

A honeypot is a trap that looks like a real resource to a bot but is invisible to human visitors. When a bot accesses the honeypot, you know it is malicious (because no human would find it), and you can automatically block its IP.

The Hidden Link Honeypot

Add an invisible link to your HTML pages that only bots will follow. Real users with CSS enabled will never see it:

<!-- Hidden honeypot link — bots follow this, humans cannot see it -->
<a href="/totally-not-a-trap"
   style="position:absolute;left:-9999px;opacity:0;"
   tabindex="-1" aria-hidden="true">Do not click</a>

Then configure Nginx to block any IP that accesses the honeypot URL:

location = /totally-not-a-trap {
    access_log /var/log/nginx/honeypot.log;
    return 444;
}

Combine this with a fail2ban filter that monitors the honeypot log and bans any IP that appears:

# /etc/fail2ban/filter.d/nginx-honeypot.conf
[Definition]
failregex = ^<HOST> -.*"(GET|POST) /totally-not-a-trap.*"$
ignoreregex =
# /etc/fail2ban/jail.d/nginx-honeypot.conf
[nginx-honeypot]
enabled  = true
filter   = nginx-honeypot
logpath  = /var/log/nginx/honeypot.log
maxretry = 1
findtime = 3600
bantime  = 86400
action   = nftables-multiport[name=honeypot, port="80,443"]

With maxretry = 1, a single hit on the honeypot earns a 24-hour ban. No human will ever reach this URL, so false positives are effectively zero — and the next step keeps good bots out of it too.
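Well-behaved crawlers follow ordinary links, hidden or not, so keep them out of the trap by disallowing the path in robots.txt. Good bots respect the rule and stay out; bad bots ignore it and walk straight in — exactly the filter you want:

# robots.txt — steer good bots away from the honeypot
User-agent: *
Disallow: /totally-not-a-trap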

ModSecurity WAF Rules for Bot Detection

ModSecurity with the OWASP Core Rule Set provides sophisticated bot detection that goes beyond simple user-agent matching:

# Block requests with no user-agent
SecRule REQUEST_HEADERS:User-Agent "^$" \
    "id:900001, phase:1, deny, status:403, \
    msg:'Empty User-Agent blocked'"

# Block known vulnerability scanners
SecRule REQUEST_HEADERS:User-Agent \
    "@rx (nikto|sqlmap|nmap|masscan|dirbuster|gobuster|wpscan)" \
    "id:900002, phase:1, deny, status:403, \
    msg:'Known scanner blocked'"

# Block requests probing for common vulnerable files
SecRule REQUEST_URI "@rx \.(env|git|svn|bak|old|orig|save|sql|tar|zip)$" \
    "id:900003, phase:1, deny, status:404, \
    msg:'Sensitive file probe blocked'"

# Block requests with suspicious query strings
SecRule ARGS "@rx (union\s+select|sleep\(\d|benchmark\()" \
    "id:900004, phase:2, deny, status:403, \
    msg:'SQL injection attempt blocked'"

Fail2ban Custom Filters for Bot Detection

Fail2ban can automatically ban IPs based on behavioral patterns detected in your logs. Here are custom filters tailored for bot detection:

Excessive 404 Errors (Vulnerability Probing)

# /etc/fail2ban/filter.d/nginx-404-scanner.conf
[Definition]
failregex = ^<HOST> -.*"(GET|POST|HEAD) .* HTTP/[12](\.\d)?" 404
ignoreregex = \.(css|js|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf) HTTP/
# /etc/fail2ban/jail.d/nginx-404-scanner.conf
[nginx-404-scanner]
enabled  = true
filter   = nginx-404-scanner
logpath  = /var/log/nginx/access.log
maxretry = 30
findtime = 60
bantime  = 7200
action   = nftables-multiport[name=404-scanner, port="80,443"]

This bans, for 2 hours, any IP that generates 30 or more 404 errors within 60 seconds. The ignoreregex excludes requests for common static assets that might legitimately return 404 on page loads.

WordPress-Specific Protection

# /etc/fail2ban/filter.d/nginx-wp-scanner.conf
[Definition]
failregex = ^<HOST> -.*"(GET|POST) /(wp-login\.php|xmlrpc\.php|wp-config\.php).*"$
ignoreregex =
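Pair the filter with a jail. A sketch mirroring the 404-scanner jail above — the maxretry and timing values here are assumptions, so tune them so a user who fat-fingers a password twice is not banned:

# /etc/fail2ban/jail.d/nginx-wp-scanner.conf
[nginx-wp-scanner]
enabled  = true
filter   = nginx-wp-scanner
logpath  = /var/log/nginx/access.log
maxretry = 5
findtime = 300
bantime  = 86400
action   = nftables-multiport[name=wp-scanner, port="80,443"]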

JavaScript Challenges

Most malicious bots do not execute JavaScript. By requiring a JavaScript challenge before granting access, you can filter out the majority of automated traffic while being transparent to real users:

<!-- Simple JS challenge: set a cookie that the server checks -->
<script>
  // Derive a simple token from the timestamp (a stand-in for real proof-of-work)
  const timestamp = Date.now();
  const token = btoa(timestamp.toString(36));
  document.cookie = `bot_check=${token}; path=/; max-age=3600; SameSite=Strict`;

  // First visit: mark the browser as verified, then reload —
  // the server grants access only when it sees the bot_verified cookie
  if (!document.cookie.includes('bot_verified')) {
    document.cookie = 'bot_verified=1; path=/; max-age=3600';
    location.reload();
  }
</script>
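The server-side half of the check is implied but not shown. A minimal Nginx sketch, assuming the bot_verified cookie name from the snippet and a hypothetical /challenge.html page that serves the script above — note that this would also challenge legitimate crawlers, so exempt verified bots before using it on indexed pages:

# Send cookieless clients to the challenge page
map $cookie_bot_verified $needs_challenge {
    ""      1;   # no cookie yet — has not passed the JS challenge
    default 0;
}

server {
    location / {
        if ($needs_challenge) {
            rewrite ^ /challenge.html last;
        }
        # ... normal request handling ...
    }

    # Exact-match location: the challenge page itself is always reachable
    location = /challenge.html { }
}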

This is a simplified example. In practice, Cloudflare's "Under Attack Mode" and "Bot Fight Mode" implement far more sophisticated JavaScript challenges that are difficult for headless browsers to solve.

Cloudflare Bot Management

Cloudflare provides the most comprehensive bot management available. It operates at the edge (before traffic reaches your server), uses machine learning to classify bot traffic, and offers multiple levels of challenge:

Request hits Cloudflare → ML bot score (1-99) → low score = likely bot → challenge or block

Cloudflare Firewall Rules for Bot Blocking

| Rule | Expression | Action |
| --- | --- | --- |
| Block known bad bots | cf.client.bot AND NOT cf.bot_management.verified_bot | Block |
| Challenge suspicious traffic | cf.bot_management.score lt 30 | JS Challenge |
| Block empty user agents | http.user_agent eq "" | Block |
| Challenge high-threat IPs | cf.threat_score gt 40 | CAPTCHA |
| Allow verified bots | cf.bot_management.verified_bot | Allow |
Panelica tip: Panelica provides ModSecurity WAF with OWASP CRS rules, Fail2ban auto-blocking, and Cloudflare integration to detect and block malicious bots automatically from the panel. Configure bot protection rules without ever touching the command line.

Verifying Legitimate Bot Identity

Sophisticated bad bots spoof their user-agent to pretend they are Googlebot or Bingbot. You can verify a bot's identity with a reverse DNS lookup, confirmed by a matching forward lookup:

# Verify Googlebot: reverse DNS should resolve to *.googlebot.com or *.google.com
host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

# Forward DNS should resolve back to the same IP
host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

# If reverse DNS does not match *.googlebot.com, it is fake
host 203.0.113.50
50.113.0.203.in-addr.arpa domain name pointer some-random-host.example.com.
# This is NOT Googlebot despite claiming to be!

Monitoring Bot Traffic

Set up ongoing monitoring to track bot activity trends and catch new threats early:

#!/bin/bash
# bot-report.sh — Daily bot traffic summary
LOG="/var/log/nginx/access.log"

echo "=== Bot Traffic Report ==="
echo ""
echo "Total requests: $(wc -l < "$LOG")"
echo "Bot requests:   $(grep -ciE 'bot|spider|crawl|scraper' "$LOG")"
echo "Empty UA:       $(awk -F'"' '$6 == "-"' "$LOG" | wc -l)"
echo ""
echo "=== Top 10 Bot User Agents ==="
grep -iE 'bot|spider|crawl' "$LOG" | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -10
echo ""
echo "=== Top 10 IPs Blocked by Fail2ban ==="
fail2ban-client status nginx-http-flood 2>/dev/null | grep "Banned IP"
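To get the report every morning, schedule it with cron. This assumes the script is installed at /usr/local/bin/bot-report.sh and that outbound mail works on the host:

# /etc/cron.d/bot-report — daily bot traffic summary at 06:00
0 6 * * * root /usr/local/bin/bot-report.sh | mail -s "Daily bot report" admin@example.com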

Bot Defense Checklist

  • Access logs analyzed for bot patterns
  • Nginx user-agent blocking configured for known bad bots
  • Empty user-agent requests blocked
  • Rate limiting enabled per IP (requests and connections)
  • Fail2ban monitoring access logs for excessive requests
  • Fail2ban monitoring for excessive 404 errors
  • WordPress xmlrpc.php blocked or rate-limited
  • Honeypot trap deployed and monitored
  • ModSecurity WAF rules active for bot detection
  • Sensitive file probing blocked (.env, .git, backups)
  • Cloudflare Bot Management enabled (if using Cloudflare)
  • Legitimate bot identity verified via reverse DNS
  • Daily/weekly bot traffic reports configured

Wrapping Up

Blocking bad bots is an ongoing battle, not a one-time configuration. New bots appear constantly, user-agent strings change, and attack patterns evolve. The key is building a layered defense that catches bots at multiple levels: user-agent filtering for known bad actors, rate limiting for aggressive requesters, honeypots for stealthy crawlers, ModSecurity for request content inspection, fail2ban for behavioral pattern detection, and Cloudflare for large-scale bot management.

Start with the highest-impact measures: block known bad user agents, enable rate limiting, and set up fail2ban for excessive 404 errors and request floods. These three steps alone will eliminate the majority of malicious bot traffic. Then add more sophisticated measures like honeypot traps and WAF rules as needed.

The goal is not to block every single bot — that is impossible. The goal is to make your server an unattractive target by increasing the cost for attackers while keeping the door wide open for legitimate users and the good bots that help your site thrive in search results.
