Security

How to Block Bad Bots: Protect Your Server from Scrapers

May 15, 2026


The Bot Problem: Invisible Traffic Draining Your Server

Bots account for nearly half of all internet traffic, and a significant portion of that is malicious. Content scrapers steal your articles and republish them on spam sites. Credential stuffing bots try thousands of stolen username/password combinations against your login page. SEO spam bots probe for vulnerable forms to inject backlinks. Vulnerability scanners relentlessly probe your server for known exploits. And all of them consume your server's CPU, bandwidth, and memory — resources that should be serving your real users.

The challenge is that not all bots are bad. Googlebot needs to crawl your site for search indexing. Uptime monitors check that your site is available. Payment providers like Stripe and PayPal deliver webhook callbacks. You need to block the bad bots while letting the good ones through, and the line between them is not always obvious.

This guide covers practical strategies for identifying and blocking malicious bots: user-agent analysis, behavioral detection, Nginx blocking rules, ModSecurity WAF rules, fail2ban custom filters, honeypot traps, and Cloudflare Bot Management.

What you will learn: How to distinguish good bots from bad bots, analyze access logs for bot patterns, block by user-agent string, implement rate limiting, create honeypot traps, configure ModSecurity rules, set up fail2ban bot filters, and integrate with Cloudflare Bot Management.

Good Bots vs Bad Bots

Before you start blocking, you need to understand what you are dealing with. Here is how to categorize the bots hitting your server:

| Bot Type | Examples | Intent | Action |
| --- | --- | --- | --- |
| Search Engine Crawlers | Googlebot, Bingbot, YandexBot | Index your content for search | Allow |
| Social Media Bots | Facebookbot, Twitterbot, LinkedInBot | Generate link previews | Allow |
| Monitoring Bots | UptimeRobot, Pingdom | Check site availability | Allow |
| AI Training Bots | GPTBot, ClaudeBot, CCBot | Scrape content for AI training | Your choice |
| Content Scrapers | MJ12bot, AhrefsBot, SemrushBot | Scrape content / analyze competitors | Rate limit or block |
| Vulnerability Scanners | Nikto, Nessus, sqlmap | Find exploitable vulnerabilities | Block |
| Credential Stuffing | Custom scripts | Try stolen login credentials | Block |
| Spam Bots | Various | Submit spam forms, inject links | Block |

Identifying Bad Bots in Your Access Logs

The first step is understanding what bots are currently hitting your server. Your Nginx or Apache access logs contain a wealth of information.

Analyzing User-Agent Strings

# Top 20 user agents by request count
awk -F'"' '{print $6}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

# Find requests with suspicious user agents
grep -iE "(python|curl|wget|scrapy|bot|spider|crawl)" /var/log/nginx/access.log | wc -l

# Find requests with empty user agents
awk -F'"' '$6 == "-" || $6 == ""' /var/log/nginx/access.log | wc -l

Behavioral Analysis

User-agent strings can be spoofed, so behavioral analysis is often more reliable. Look for these patterns:

Signs of Bad Bot Activity

  • Hundreds of requests per second from one IP
  • Hitting pages in rapid sequence without loading CSS/JS/images
  • Accessing /wp-login.php, /xmlrpc.php, /admin repeatedly
  • Probing for non-existent files (.env, wp-config.php.bak, .git/HEAD)
  • Empty or fake user-agent strings
  • No cookie support (every request starts a new session)

Signs of Legitimate Traffic

  • Requests load CSS, JS, images alongside HTML
  • Reasonable time between page views (reading time)
  • Follows internal links naturally
  • Accepts and returns cookies
  • Supports JavaScript execution
  • Comes from known IP ranges (Google, Bing, etc.)
# Top IPs by request count (potential bots make hundreds/thousands)
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

# IPs hitting sensitive paths
grep -E "(wp-login|xmlrpc|\.env|\.git|wp-config)" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head -20

# IPs generating 404 errors (probing for vulnerabilities)
awk '$9 == 404 {print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -rn | head -20

Why robots.txt Is Not Enough

Many people assume that robots.txt blocks bots, but it does not. The robots.txt file is a polite request, not an enforcement mechanism. Good bots (Googlebot, Bingbot) respect it. Bad bots ignore it entirely. In fact, some malicious bots specifically read robots.txt to find the paths you are trying to hide — then they target those paths.

robots.txt is for good bots only. It tells search engines which pages not to index. It does not prevent access to anything. Never put sensitive paths in robots.txt thinking it will protect them — you are actually advertising them to attackers.
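For reference, this is all a robots.txt can do — express preferences that compliant crawlers may honor (the paths below are illustrative):

# robots.txt — a request, not an enforcement mechanism
User-agent: *
Disallow: /admin/     # compliant crawlers skip this; attackers read it as a map
Crawl-delay: 10       # ignored by many crawlers, including Googlebot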

Nginx User-Agent Blocking

The most straightforward defense is blocking bots by their user-agent string. While sophisticated bots spoof their user-agent, many bad bots still identify themselves honestly (or use obviously fake agents).

# /etc/nginx/conf.d/bad-bots.conf
# Map bad user agents to block status
map $http_user_agent $bad_bot {
    default 0;

    # Vulnerability scanners
    ~*nikto       1;
    ~*nessus      1;
    ~*sqlmap      1;
    ~*nmap        1;
    ~*masscan     1;
    ~*dirbuster   1;
    ~*gobuster    1;

    # Content scrapers
    ~*MJ12bot     1;
    ~*DotBot      1;
    ~*BLEXBot     1;
    ~*ZoominfoBot 1;

    # Spam and abuse bots
    ~*Xenu        1;
    ~*Baiduspider 1;
    ~*360Spider   1;

    # Generic automation tools
    ~*python-requests 1;
    ~*scrapy      1;
    ~*java/       1;
    ~*httpclient  1;

    # Empty user agent
    "" 1;
}

Apply the block in your server configuration:

server {
    # Block bad bots
    if ($bad_bot) {
        return 444;  # Close connection without response
    }
}
Why return 444? The 444 status code is an Nginx-specific response that closes the connection immediately without sending any data back. This is more efficient than returning a 403 because it wastes fewer server resources on the bot — no response body to generate and send.

Blocking AI Training Bots

If you want to prevent AI companies from scraping your content for training data, add their known user agents to your block list:

# AI training bots (add to the map block above)
~*GPTBot          1;
~*ChatGPT-User    1;
~*CCBot           1;
~*anthropic-ai    1;
~*ClaudeBot       1;
~*Google-Extended 1;
~*Bytespider      1;
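Server-side blocking is the enforceable option, but some AI crawlers — GPTBot and Google-Extended among them, according to their operators' documentation — do honor robots.txt, so a softer opt-out is also possible:

# robots.txt — opt out of AI training crawls (respected by compliant AI bots only)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /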

Rate Limiting Per IP

Even bots with legitimate-looking user agents can be detected by their request volume. Nginx's rate limiting module is your first line of defense:

# Define rate limit zones
limit_req_zone $binary_remote_addr zone=bot_limit:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=bot_conn:10m;

server {
    # Limit requests per second per IP
    limit_req zone=bot_limit burst=20 nodelay;

    # Limit concurrent connections per IP
    limit_conn bot_conn 20;

    # Return 429 Too Many Requests (not default 503)
    limit_req_status 429;
    limit_conn_status 429;
}
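One caveat: a flat per-IP limit throttles Googlebot during a heavy crawl just as readily as a scraper. A common Nginx pattern is to key the zone on an empty string for trusted sources, since requests with an empty key are not counted. A sketch, replacing the bot_limit zone definition above — the IP range shown is illustrative, so verify it against the crawler's published list:

# Exempt trusted crawler ranges from rate limiting
geo $limit {
    default        1;
    66.249.64.0/19 0;   # example Googlebot range — check Google's published ranges
}

map $limit $limit_key {
    0 "";                    # empty key = request is never counted
    1 $binary_remote_addr;
}

limit_req_zone $limit_key zone=bot_limit:10m rate=10r/s;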

Honeypot Traps: Catching Bots Red-Handed

A honeypot is a trap that looks like a real resource to a bot but is invisible to human visitors. When a bot accesses the honeypot, you know it is malicious (because no human would find it), and you can automatically block its IP.

The Hidden Link Honeypot

Add an invisible link to your HTML pages that only bots will follow. Real users with CSS enabled will never see it:

<!-- Hidden honeypot link — bots follow this, humans cannot see it -->
<a href="/totally-not-a-trap"
   style="position:absolute;left:-9999px;opacity:0;"
   tabindex="-1" aria-hidden="true">Do not click</a>

Then configure Nginx to block any IP that accesses the honeypot URL:

location = /totally-not-a-trap {
    access_log /var/log/nginx/honeypot.log;
    return 444;
}

Combine this with a fail2ban filter that monitors the honeypot log and bans any IP that appears:

# /etc/fail2ban/filter.d/nginx-honeypot.conf
[Definition]
failregex = ^<HOST> -.*"(GET|POST) /totally-not-a-trap.*"$
ignoreregex =
# /etc/fail2ban/jail.d/nginx-honeypot.conf
[nginx-honeypot]
enabled  = true
filter   = nginx-honeypot
logpath  = /var/log/nginx/honeypot.log
maxretry = 1
findtime = 3600
bantime  = 86400
action   = nftables-multiport[name=honeypot, port="80,443"]

With maxretry = 1, a single hit on the honeypot earns a 24-hour ban. No human will ever reach this URL, so false positives are effectively zero — and the next step keeps good bots out of it too.
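Well-behaved crawlers follow ordinary links, hidden or not, so keep them out of the trap by disallowing the path in robots.txt. Good bots respect the rule and stay out; bad bots ignore it and walk straight in — exactly the filter you want:

# robots.txt — steer good bots away from the honeypot
User-agent: *
Disallow: /totally-not-a-trap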

ModSecurity WAF Rules for Bot Detection

ModSecurity with the OWASP Core Rule Set provides sophisticated bot detection that goes beyond simple user-agent matching:

# Block requests with no user-agent
SecRule REQUEST_HEADERS:User-Agent "^$" \
    "id:900001, phase:1, deny, status:403, \
    msg:'Empty User-Agent blocked'"

# Block known vulnerability scanners
SecRule REQUEST_HEADERS:User-Agent \
    "@rx (nikto|sqlmap|nmap|masscan|dirbuster|gobuster|wpscan)" \
    "id:900002, phase:1, deny, status:403, \
    msg:'Known scanner blocked'"

# Block requests probing for common vulnerable files
SecRule REQUEST_URI "@rx \.(env|git|svn|bak|old|orig|save|sql|tar|zip)$" \
    "id:900003, phase:1, deny, status:404, \
    msg:'Sensitive file probe blocked'"

# Block requests with suspicious query strings
SecRule ARGS "@rx (union\s+select|sleep\(\d|benchmark\()" \
    "id:900004, phase:2, deny, status:403, \
    msg:'SQL injection attempt blocked'"

Fail2ban Custom Filters for Bot Detection

Fail2ban can automatically ban IPs based on behavioral patterns detected in your logs. Here are custom filters tailored for bot detection:

Excessive 404 Errors (Vulnerability Probing)

# /etc/fail2ban/filter.d/nginx-404-scanner.conf
[Definition]
failregex = ^<HOST> -.*"(GET|POST|HEAD) .* HTTP/[12](\.\d)?" 404
ignoreregex = \.(css|js|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf) HTTP/
# /etc/fail2ban/jail.d/nginx-404-scanner.conf
[nginx-404-scanner]
enabled  = true
filter   = nginx-404-scanner
logpath  = /var/log/nginx/access.log
maxretry = 30
findtime = 60
bantime  = 7200
action   = nftables-multiport[name=404-scanner, port="80,443"]

This bans, for 2 hours, any IP that generates 30 or more 404 errors within 60 seconds. The ignoreregex excludes requests for common static assets that might legitimately return 404 on page loads.

WordPress-Specific Protection

# /etc/fail2ban/filter.d/nginx-wp-scanner.conf
[Definition]
failregex = ^<HOST> -.*"(GET|POST) /(wp-login\.php|xmlrpc\.php|wp-config\.php).*"$
ignoreregex =
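Pair the filter with a jail. A sketch mirroring the 404-scanner jail above — the maxretry and timing values here are assumptions, so tune them so a user who fat-fingers a password twice is not banned:

# /etc/fail2ban/jail.d/nginx-wp-scanner.conf
[nginx-wp-scanner]
enabled  = true
filter   = nginx-wp-scanner
logpath  = /var/log/nginx/access.log
maxretry = 5
findtime = 300
bantime  = 86400
action   = nftables-multiport[name=wp-scanner, port="80,443"]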

JavaScript Challenges

Most malicious bots do not execute JavaScript. By requiring a JavaScript challenge before granting access, you can filter out the majority of automated traffic while being transparent to real users:

<!-- Simple JS challenge: set a cookie that the server checks -->
<script>
  // Derive a simple token from the timestamp (a stand-in for real proof-of-work)
  const timestamp = Date.now();
  const token = btoa(timestamp.toString(36));
  document.cookie = `bot_check=${token}; path=/; max-age=3600; SameSite=Strict`;

  // First visit: mark the browser as verified, then reload —
  // the server grants access only when it sees the bot_verified cookie
  if (!document.cookie.includes('bot_verified')) {
    document.cookie = 'bot_verified=1; path=/; max-age=3600';
    location.reload();
  }
</script>
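The server-side half of the check is implied but not shown. A minimal Nginx sketch, assuming the bot_verified cookie name from the snippet and a hypothetical /challenge.html page that serves the script above — note that this would also challenge legitimate crawlers, so exempt verified bots before using it on indexed pages:

# Send cookieless clients to the challenge page
map $cookie_bot_verified $needs_challenge {
    ""      1;   # no cookie yet — has not passed the JS challenge
    default 0;
}

server {
    location / {
        if ($needs_challenge) {
            rewrite ^ /challenge.html last;
        }
        # ... normal request handling ...
    }

    # Exact-match location: the challenge page itself is always reachable
    location = /challenge.html { }
}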

This is a simplified example. In practice, Cloudflare's "Under Attack Mode" and "Bot Fight Mode" implement far more sophisticated JavaScript challenges that are difficult for headless browsers to solve.

Cloudflare Bot Management

Cloudflare provides the most comprehensive bot management available. It operates at the edge (before traffic reaches your server), uses machine learning to classify bot traffic, and offers multiple levels of challenge:

Request hits Cloudflare → ML bot score (1-99) → low score = likely bot → challenge or block

Cloudflare Firewall Rules for Bot Blocking

| Rule | Expression | Action |
| --- | --- | --- |
| Block known bad bots | cf.client.bot AND NOT cf.bot_management.verified_bot | Block |
| Challenge suspicious traffic | cf.bot_management.score lt 30 | JS Challenge |
| Block empty user agents | http.user_agent eq "" | Block |
| Challenge high-threat IPs | cf.threat_score gt 40 | CAPTCHA |
| Allow verified bots | cf.bot_management.verified_bot | Allow |
Panelica tip: Panelica provides ModSecurity WAF with OWASP CRS rules, Fail2ban auto-blocking, and Cloudflare integration to detect and block malicious bots automatically from the panel. Configure bot protection rules without ever touching the command line.

Verifying Legitimate Bot Identity

Sophisticated bad bots spoof their user-agent to pretend they are Googlebot or Bingbot. You can verify a bot's identity with a reverse DNS lookup, confirmed by a matching forward lookup:

# Verify Googlebot: reverse DNS should resolve to *.googlebot.com or *.google.com
host 66.249.66.1
1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.

# Forward DNS should resolve back to the same IP
host crawl-66-249-66-1.googlebot.com
crawl-66-249-66-1.googlebot.com has address 66.249.66.1

# If reverse DNS does not match *.googlebot.com, it is fake
host 203.0.113.50
50.113.0.203.in-addr.arpa domain name pointer some-random-host.example.com.
# This is NOT Googlebot despite claiming to be!

Monitoring Bot Traffic

Set up ongoing monitoring to track bot activity trends and catch new threats early:

#!/bin/bash
# bot-report.sh — Daily bot traffic summary
LOG="/var/log/nginx/access.log"

echo "=== Bot Traffic Report ==="
echo ""
echo "Total requests: $(wc -l < "$LOG")"
echo "Bot requests:   $(grep -ciE 'bot|spider|crawl|scraper' "$LOG")"
echo "Empty UA:       $(awk -F'"' '$6 == "-"' "$LOG" | wc -l)"
echo ""
echo "=== Top 10 Bot User Agents ==="
grep -iE 'bot|spider|crawl' "$LOG" | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn | head -10
echo ""
echo "=== Top 10 IPs Blocked by Fail2ban ==="
fail2ban-client status nginx-http-flood 2>/dev/null | grep "Banned IP"
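To get the report every morning, schedule it with cron. This assumes the script is installed at /usr/local/bin/bot-report.sh and that outbound mail works on the host:

# /etc/cron.d/bot-report — daily bot traffic summary at 06:00
0 6 * * * root /usr/local/bin/bot-report.sh | mail -s "Daily bot report" admin@example.com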

Bot Defense Checklist

  • Access logs analyzed for bot patterns
  • Nginx user-agent blocking configured for known bad bots
  • Empty user-agent requests blocked
  • Rate limiting enabled per IP (requests and connections)
  • Fail2ban monitoring access logs for excessive requests
  • Fail2ban monitoring for excessive 404 errors
  • WordPress xmlrpc.php blocked or rate-limited
  • Honeypot trap deployed and monitored
  • ModSecurity WAF rules active for bot detection
  • Sensitive file probing blocked (.env, .git, backups)
  • Cloudflare Bot Management enabled (if using Cloudflare)
  • Legitimate bot identity verified via reverse DNS
  • Daily/weekly bot traffic reports configured

Wrapping Up

Blocking bad bots is an ongoing battle, not a one-time configuration. New bots appear constantly, user-agent strings change, and attack patterns evolve. The key is building a layered defense that catches bots at multiple levels: user-agent filtering for known bad actors, rate limiting for aggressive requesters, honeypots for stealthy crawlers, ModSecurity for request content inspection, fail2ban for behavioral pattern detection, and Cloudflare for large-scale bot management.

Start with the highest-impact measures: block known bad user agents, enable rate limiting, and set up fail2ban for excessive 404 errors and request floods. These three steps alone will eliminate the majority of malicious bot traffic. Then add more sophisticated measures like honeypot traps and WAF rules as needed.

The goal is not to block every single bot — that is impossible. The goal is to make your server an unattractive target by increasing the cost for attackers while keeping the door wide open for legitimate users and the good bots that help your site thrive in search results.
