Most visitors to news i8 are bots. My top two visitors are Amazonbot and PetalBot. These bots are hungry: Amazonbot has visited my roughly 1,700 pages over half a million times since I started logging, including 10,248 visits from a single IP address, 216.244.66.236.
Today alone I have over 5,000 visits from Amazonbot. I have fewer than 1,800 web pages on this site, so what the heck? Is this a swarm of LLM trainers? Real people in stealth mode using “Amazonbot” in their user agent strings to avoid tracking? Hmm…
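For anyone curious how I got counts like these: a quick way to tally hits per user agent from an Apache access log in the standard "combined" format is a short awk pipeline. The log lines below are made up for illustration; on a real server you would point the pipeline at your own log file (the path varies by distro).

```shell
# Three fake log lines in Apache "combined" format, just for illustration.
log='216.244.66.236 - - [01/Jan/2025:00:00:00 +0000] "GET / HTTP/1.1" 200 123 "-" "Amazonbot/0.1"
1.2.3.4 - - [01/Jan/2025:00:00:01 +0000] "GET /a HTTP/1.1" 200 456 "-" "Mozilla/5.0"
216.244.66.236 - - [01/Jan/2025:00:00:02 +0000] "GET /b HTTP/1.1" 200 789 "-" "Amazonbot/0.1"'

# The user agent is the 6th field when you split each line on double quotes.
# Real-server version: awk -F'"' '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -rn
printf '%s\n' "$log" | awk -F'"' '{print $6}' | sort | uniq -c | sort -rn
```

The same idea works for counting visits per IP address: print `$1` from the whitespace-split line instead of quote-field 6.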
User Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36 (Visits: 541,614).
541,614 visits from Amazonbot? That’s nuts. Nothing personal against Jeff; I think Amazon is one of the best websites ever made and I shop on it often, but there is not really a shopping option on news i8. Plus I have no ads, for Amazon or anyone else. What good is Amazonbot doing for me at this time? I’ll admit that I think news i8 is one of the most interesting sites on the Internet and that it could have a huge historical impact, perhaps in 500 years, but I don’t see the point of all that traffic. It must be capturing every little change. (I make a lot of changes as I work to keep the facts on here accurate.)
Should I Ban Amazonbot and PetalBot?
Banning Amazonbot and PetalBot from crawling your website can be a consideration based on several factors. Here are some reasons to weigh before making that decision:
Reasons to Consider Banning
1. Resource Consumption: Both Amazonbot and PetalBot can consume significant server resources, especially if they crawl aggressively. This can lead to slower performance for regular users or even downtime if your server becomes overwhelmed[2][6].
2. Impact on SEO: If these bots are not contributing positively to your site’s visibility or SEO, blocking them might help preserve bandwidth for more beneficial crawlers like Googlebot. While crawlers help index your site, unnecessary traffic from less relevant bots can dilute your site’s performance metrics[3][4].
3. Potential Overload: Reports indicate that aggressive crawling by bots like PetalBot can lead to performance issues on websites, particularly for e-commerce sites where uptime is critical[2]. If you notice significant traffic spikes from these bots, it may be prudent to restrict their access.
4. Privacy and Security Concerns: Although both bots are designed to operate responsibly, there is always a risk that they could access sensitive information if not properly managed. Blocking them can help mitigate this risk[1][5].
Reasons Not to Ban
1. Indexing Benefits: Both Amazonbot and PetalBot play roles in indexing content for their respective platforms (Amazon services and Huawei’s Petal Search). Blocking them might prevent your site from being indexed in these systems, potentially limiting exposure to users who rely on these services[6][8].
2. Respect for Robots.txt: These bots typically respect the rules set in the `robots.txt` file, which allows you to control their access without outright banning them. By configuring this file properly, you can limit their crawling rate or specific areas of your site they can access without completely blocking them[1][3].
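As an example of that middle ground, a robots.txt along these lines keeps a bot out of one directory and asks it to slow down without banning it site-wide. The directory name here is just a placeholder, and note that not every bot honors Crawl-delay:

```
User-agent: Amazonbot
Disallow: /private/
Crawl-delay: 20
```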
3. Evolving Needs: As search engines and services evolve, having these bots crawl your site could provide future benefits that are currently unforeseen. They may help improve your visibility in different search contexts over time[4].
As a test, given the high traffic, I’m trying this in my robots.txt file:

User-agent: Amazonbot
Disallow: /
It Is Nice to Just Slow Them Down
Instead of blocking bots entirely, you can just ask them to slow down, to one page visit every 20 seconds. Put this in your robots.txt file, for example, and well-behaved bots will slow down:

User-agent: *
Crawl-delay: 20
But Some Bots Don’t Listen
When I use one of these user agent strings, the site is still returned by curl, because curl does not check the robots.txt file first.
curl -A "Amazonbot" http://yourwebsite.com
This should, however, return 403 Forbidden if you set .htaccess to block this string in the user agent. (Make sure you have mod_rewrite on. It always will be if you have WordPress, I believe.)
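Here is a sketch of the kind of .htaccess rules I mean, assuming Apache with mod_rewrite enabled (which bots you list is up to you):

```apache
<IfModule mod_rewrite.c>
RewriteEngine On
# Match the bot names anywhere in the User-Agent string, case-insensitively.
RewriteCond %{HTTP_USER_AGENT} (Amazonbot|PetalBot) [NC]
# Return 403 Forbidden and stop processing further rules.
RewriteRule .* - [F,L]
</IfModule>
```

With that in place, `curl -A "Amazonbot" -s -o /dev/null -w "%{http_code}" http://yourwebsite.com` should print 403.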
- Ineffective blocking: user-agent blocking is not foolproof, as malicious bots can easily spoof their user agents. This means that while your configuration may block known bad bots, it might not prevent all unwanted traffic.
- Potential for 403 Forbidden errors: if a legitimate bot (like Googlebot) inadvertently matches your blocking conditions due to misconfiguration or overlapping conditions, it could result in a 403 Forbidden response.
If mod_rewrite is on but was not working for the rules you added, check on this:
- If your server configuration does not allow overrides through .htaccess, then none of the rewrite rules will work. Check that AllowOverride All is set in your Apache configuration.
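That setting lives in the main Apache configuration or the virtual host file, not in .htaccess itself. A minimal sketch, assuming a typical Debian/Ubuntu layout (paths vary by distro):

```apache
# e.g. in /etc/apache2/sites-available/yoursite.conf
<Directory /var/www/html>
    AllowOverride All
</Directory>
```

Reload Apache afterwards (for example, `sudo systemctl reload apache2` on Debian/Ubuntu) for the change to take effect.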
Testing LaTeX
Like here is a change I made to this web page, because I don’t want to put up an entire new page. Does this code work to generate something nice-looking or not?
[formula renders here]
or
[formula renders here]
It does! Cool. Now I can add formulas.
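In case it helps anyone trying the same thing, standard LaTeX display math looks like the block below. How you embed it depends entirely on which LaTeX plugin the site runs, so treat the delimiters as an assumption:

```latex
% The quadratic formula in display math:
\[
  x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
\]
```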
Conclusion
Deciding whether to ban Amazonbot (or PetalBot or whatever) involves balancing the potential resource strain against the indexing benefits they provide. If you experience significant negative impacts from their crawling activities, implementing a ban or adjusting `robots.txt` settings may be warranted. Conversely, if their presence is manageable and could enhance your site’s visibility on various platforms, it might be beneficial to allow them access while monitoring their activity closely.
Read More
[1] https://friendlycaptcha.com/wiki/what-is-petalbot/
[2] https://www.hypernode.com/en/blog/huawei-aspiegelbot-is-increasingly-impacting-european-online-stores/
[3] https://www.keycdn.com/blog/web-crawlers
[4] https://www.botify.com/insight/ai-crawler-bots
[5] https://datadome.co/bots/hjxo7g21/
[6] https://datadome.co/bots/amazonbot/
[7] https://weblynx.pro/bots-you-should-block-to-protect-your-content/
[8] https://connect.iftas.org/library/tools-resources/web-crawlers-and-scrapers/
PS. Another question is if it is even possible to ban Amazonbot. Won’t it just start using crawlers with deceptive user agents if it gets blocked?