Most visitors to news i8 are bots. My top two visitors are Amazonbot and PetalBot. These bots are hungry: Amazonbot has visited my roughly 1,700 pages over half a million times since I started logging, including 10,248 visits from a single IP address, 216.244.66.236.
Today alone I have over 5,000 visits from Amazonbot. I have fewer than 1,800 web pages on this site, so what the heck? Is this a swarm of LLM trainers? Real people in stealth mode using “Amazonbot” in their user agent strings to avoid tracking? Hmm…
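For anyone curious how I got counts like these: a quick way to tally hits per user agent from an Apache access log in the standard "combined" format is a short awk pipeline. The log lines below are made up for illustration; on a real server you would point the pipeline at your own log file (the path varies by distro).

```shell
# Three fake log lines in Apache "combined" format, just for illustration.
log='216.244.66.236 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 123 "-" "Amazonbot/0.1"
1.2.3.4 - - [01/Jan/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 456 "-" "Mozilla/5.0"
216.244.66.236 - - [01/Jan/2025:00:00:02 +0000] "GET /b HTTP/1.1" 200 789 "-" "Amazonbot/0.1"'

# The user agent is the 6th field when you split each line on double quotes.
# Real-server version: awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn
printf '%s\n' "$log" | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
```

The same idea works for counting visits per IP address: print `$1` from the whitespace-split line instead of quote-field 6.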
User Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36 (Visits: 541,614).
541,614 visits from Amazonbot? That’s nuts. Nothing personal against Jeff; I think Amazon is one of the best websites ever made and I shop on it often, but there is not really a shopping option on news i8. Plus I have no ads, for Amazon or anyone else. What good is Amazonbot doing for me at this time? I’ll admit that I think news i8 is one of the most interesting sites on the Internet and that it could have a huge historical impact, perhaps in 500 years, but I don’t see the point of all that traffic. It must be capturing every little change. (I make a lot of changes as I work to keep the facts on here accurate.)
Should I Ban Amazonbot and PetalBot?
Banning Amazonbot and PetalBot from crawling your website can be a consideration based on several factors. Here are some reasons to weigh before making that decision:
Reasons to Consider Banning
1. Resource Consumption: Both Amazonbot and PetalBot can consume significant server resources, especially if they crawl aggressively. This can lead to slower performance for regular users or even downtime if your server becomes overwhelmed[2][6].
2. Impact on SEO: If these bots are not contributing positively to your site’s visibility or SEO, blocking them might help preserve bandwidth for more beneficial crawlers like Googlebot. While crawlers help index your site, unnecessary traffic from less relevant bots can dilute your site’s performance metrics[3][4].
3. Potential Overload: Reports indicate that aggressive crawling by bots like PetalBot can lead to performance issues on websites, particularly for e-commerce sites where uptime is critical[2]. If you notice significant traffic spikes from these bots, it may be prudent to restrict their access.
4. Privacy and Security Concerns: Although both bots are designed to operate responsibly, there is always a risk that they could access sensitive information if not properly managed. Blocking them can help mitigate this risk[1][5].
Reasons Not to Ban
1. Indexing Benefits: Both Amazonbot and PetalBot play roles in indexing content for their respective platforms (Amazon services and Huawei’s Petal Search). Blocking them might prevent your site from being indexed in these systems, potentially limiting exposure to users who rely on these services[6][8].
2. Respect for Robots.txt: These bots typically respect the rules set in the `robots.txt` file, which allows you to control their access without outright banning them. By configuring this file properly, you can limit their crawling rate or specific areas of your site they can access without completely blocking them[1][3].
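As an example of that middle ground, a robots.txt along these lines keeps a bot out of one directory and asks it to slow down without banning it site-wide. The directory name here is just a placeholder, and note that not every bot honors Crawl-delay:

```
User-agent: Amazonbot
Disallow: /private/
Crawl-delay: 20
```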
3. Evolving Needs: As search engines and services evolve, having these bots crawl your site could provide future benefits that are currently unforeseen. They may help improve your visibility in different search contexts over time[4].
As a test, given the high traffic, I’m trying this in my robots.txt file:

User-agent: Amazonbot
Disallow: /
It Is Nice to Just Slow Them Down
Instead of blocking bots entirely, you can just ask them to slow down, to one page visit every 20 seconds. Put this in your robots.txt file, for example, and well-behaved bots will slow down:

User-agent: *
Crawl-delay: 20
But Some Bots Don’t Listen
When I use one of these user agent strings, the site is still returned by curl, because curl does not check the robots.txt file first.
curl -A "Amazonbot" http://yourwebsite.com
This should, however, return 403 Forbidden if you set .htaccess to block this string in the user agent. (Make sure you have mod_rewrite on. It always will be if you have WordPress, I believe.)
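Here is a sketch of the kind of .htaccess rules I mean, assuming Apache with mod_rewrite enabled (which bots you list is up to you):

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Match the bot names anywhere in the User-Agent string, case-insensitively.
RewriteCond %{HTTP_USER_AGENT} (Amazonbot|PetalBot) [NC]
# Return 403 Forbidden and stop processing further rules.
RewriteRule .* - [F,L]
</IfModule>
```

With that in place, `curl -A "Amazonbot" -s -o /dev/null -w "%{http_code}" http://yourwebsite.com` should print 403.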
- Ineffective blocking: user-agent blocking is not foolproof, as malicious bots can easily spoof their user agents. This means that while your configuration may block known bad bots, it might not prevent all unwanted traffic.
- Potential for 403 Forbidden errors: if a legitimate bot (like Googlebot) inadvertently matches your blocking conditions due to misconfiguration or overlapping conditions, it could result in a 403 Forbidden response.
If mod_rewrite is on but was not working for the rules you added, check on this:
- If your server configuration does not allow overrides through .htaccess, then none of the rewrite rules will work. Check that AllowOverride All is set in your Apache configuration.
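That setting lives in the main Apache configuration or the virtual host file, not in .htaccess itself. A minimal sketch, assuming a typical Debian/Ubuntu layout (paths vary by distro):

```apache
# e.g. in /etc/apache2/sites-available/yoursite.conf
<Directory /var/www/html>
    AllowOverride All
</Directory>
```

Reload Apache afterwards (for example, `sudo systemctl reload apache2` on Debian/Ubuntu) for the change to take effect.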
Testing LaTeX
Like here is a change I made to this web page, because I don’t want to put up an entire new page. Does this code work to generate something nice-looking or not?
[formula renders here]
or
[formula renders here]
It does! Cool. Now I can add formulas.
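In case it helps anyone trying the same thing, standard LaTeX display math looks like the block below. How you embed it depends entirely on which LaTeX plugin the site runs, so treat the delimiters as an assumption:

```latex
% The quadratic formula in display math:
\[
  x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
\]
```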
Conclusion
Deciding whether to ban Amazonbot (or PetalBot or whatever) involves balancing the potential resource strain against the indexing benefits they provide. If you experience significant negative impacts from their crawling activities, implementing a ban or adjusting `robots.txt` settings may be warranted. Conversely, if their presence is manageable and could enhance your site’s visibility on various platforms, it might be beneficial to allow them access while monitoring their activity closely.
Read More
[1] https://friendlycaptcha.com/wiki/what-is-petalbot/
[2] https://www.hypernode.com/en/blog/huawei-aspiegelbot-is-increasingly-impacting-european-online-stores/
[3] https://www.keycdn.com/blog/web-crawlers
[4] https://www.botify.com/insight/ai-crawler-bots
[5] https://datadome.co/bots/hjxo7g21/
[6] https://datadome.co/bots/amazonbot/
[7] https://weblynx.pro/bots-you-should-block-to-protect-your-content/
[8] https://connect.iftas.org/library/tools-resources/web-crawlers-and-scrapers/
PS. Another question is if it is even possible to ban Amazonbot. Won’t it just start using crawlers with deceptive user agents if it gets blocked?