
Whether you self-host your own website or have it hosted on someone else’s hardware, it takes up a small part of your brain. I chose to put my websites on other people’s equipment because the mental occupation of self-hosting was a bit more than I wanted to handle: was it running, how was it being accessed, did I have enough resources, had I secured and updated it properly? So I was a little frustrated, after moving to a web host recently, to find my website becoming unavailable. After all, the whole point of paying someone else was not having to worry about those nuts and bolts. But it shows that, even in a hosted situation, you may need to take steps to help your host support you. In this case, I needed to control some overzealous access of my website.

I run an uptime monitor to let me know if this website is unreachable. I like Uptime Robot and have used it for years, but there are many other tools. It’s a simple ping to determine whether a site is responding or not. I have gotten into the habit of assuming that, if I don’t see an uptime warning, the site is working.
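Uptime Robot does the work as a hosted service, but the underlying check is simple enough to sketch. Something like this, with a placeholder URL and a crude definition of what counts as "up," captures the idea:

    # A rough sketch of an uptime check: request the site and report whether it
    # answered. The URL, timeout, and "up" threshold are placeholders, not the
    # settings any particular monitoring service uses.
    import requests

    def site_is_up(url="https://example.com", timeout=10):
        try:
            response = requests.get(url, timeout=timeout)
            return response.status_code < 500  # treat 5xx errors as "down"
        except requests.RequestException:
            return False  # timeouts, DNS failures, refused connections

    if __name__ == "__main__":
        print("UP" if site_is_up() else "DOWN")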

Too Many Requests, One IP Address

I was a bit surprised, then, when I attempted to access the site yesterday (first the Matomo analytics page I use and then, when that wouldn’t load, the home page) and found that it was not responding. Or, rather, it was responding at a machine level but not for visitors. I could tell because, in addition to Uptime Robot, I could ping my own server and see that it was answering.

When I logged into the cPanel dashboard, I could immediately see something was wrong, although I couldn’t tell what. This was the second time in a month it had happened: my entry processes were maxing out. Now, while I love my website and appreciate that people read some of the posts, I know it is not a high-traffic site. At any given moment, I would expect to see single-digit entry processes. Each entry process represents the work the server starts when someone connects to my website and requests information. Once a request is processed, the number of entry processes drops by one, eventually back down to zero.

The dashboard showed entry processes at 100 of the 100 allowed; the number would ebb for a bit, then ramp back up and max out again. I got on a chat with technical support, who, other than trying to upsell me to a dedicated server rather than the shared server I’m on, could only reset and restart my account. You cannot reboot a shared server when it’s just your account that is at issue, but this had the same effect.

This time, I was determined to understand what was going on. I am not sure if I missed it before, but when I checked the visitor logs I saw that a single IP address had been making a lot of requests just before the errors started. The requests continued after the errors began and the site became unreachable. When I chatted with technical support this time, they saw the same thing and suggested limiting the address. I was a bit surprised, because the IP was in a range belonging to my website host, but they confirmed that it was another customer and not an administrative function.
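You can spot this sort of thing yourself by counting requests per IP address in the raw access log. Here is a rough sketch that assumes the common Apache and nginx log formats, where the client IP is the first field on each line; the file name is a placeholder:

    # Tally requests per client IP and print the busiest addresses.
    # Assumes the IP is the first whitespace-separated field on each line,
    # as in the common/combined log formats.
    from collections import Counter

    counts = Counter()
    with open("access.log") as log:
        for line in log:
            if line.strip():
                counts[line.split(" ", 1)[0]] += 1

    for ip, hits in counts.most_common(10):
        print(f"{hits:6d}  {ip}")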

Web Application Firewall to the Rescue

Chasing all of these access issues down can be frustrating. It is one reason I use Cloudflare: you can funnel all website requests through their system and block inappropriate ones before they ever reach your server. The main benefit is that I should be able to preserve my limited resources for visitors who are, well, people and not just bots.

Cloudflare has a bunch of built-in rules to block known bots, plus a bot fight mode that engages when someone’s requests act like a bot. In addition, I have set up my own rules to block other agents and access methods and to try to ensure that actual people are using the site, forcing suspect requests to pass a challenge before they get through. The managed challenge option is great because it means you can spread your net a bit more widely, but anyone caught in it should be able to get themselves out of it. When you start to see managed challenges pile up from a given location or agent, you can tell that, although it is acting like a person, it probably is not.
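To give a flavor of what such a rule looks like (this is a sample, not my exact configuration), here is a Cloudflare expression that challenges requests with an empty user agent or one that looks like an automation library. I’m writing it as a Python string just to keep the examples in one language; in practice you paste the expression into Cloudflare’s rule builder and set the action to Managed Challenge.

    # A sample Cloudflare rule expression (not my actual rule set): challenge
    # requests with no user agent or with common automation-library signatures.
    # The action, chosen in the dashboard, would be Managed Challenge.
    challenge_suspect_agents = (
        '(http.user_agent eq "") '
        'or (http.user_agent contains "python-requests") '
        'or (http.user_agent contains "curl")'
    )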

One of the easiest rules to put in place forces requests from a long list of countries through a managed challenge, because I can see that they are often a source of automated access. You can match on the country code a request is coming from and then ask the visitor to verify they are human. Since some of these requests come from known automated systems, like RSS feed readers, the rule has to be tweaked to let those requests through, even when they come from a country I wouldn’t normally expect.
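Sketched the same way, the country rule looks something like this, with placeholder country codes and a couple of common feed readers standing in for my real exceptions:

    # Sample expression for a country-based managed challenge that lets known
    # feed readers through. "XX" and "YY" are placeholders for real ISO country
    # codes; the feed-reader names are examples, not my full exception list.
    country_challenge = (
        '(ip.geoip.country in {"XX" "YY"}) '
        'and not (http.user_agent contains "Feedly" '
        'or http.user_agent contains "Inoreader")'
    )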

The challenge has been dealing with the rise in scraping for what I am guessing is artificial intelligence training. I had started my own list of user agents to block, using the Dark Visitors list as a starting point. The user agent identifies the technology used by the requester and should tell you a bit about who they are. Search engines like Google and Bing use bots that are clearly marked: Googlebot and Bingbot. As other search options have arisen, some have been honest about their bots (like Mojeek) and others have not (like Brave, which masquerades as Google; ironically, they blocked the ability to save that linked page in the Internet Archive). This matters because search engines are becoming the gateway to AI training content, which means website owners will have to balance being found by the indexes they want against having their content indexed by unknown third-party AI developers.
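Turning a user-agent list into a rule expression is mostly string assembly. A small sketch, using a handful of well-known AI crawler agents rather than my full list:

    # Assemble a user-agent blocking expression from a list of AI crawler names.
    # These agents are a small, illustrative sample; a real list would be
    # maintained from sources like Dark Visitors and updated over time.
    ai_agents = ["GPTBot", "CCBot", "ClaudeBot", "Bytespider", "PerplexityBot"]

    expression = " or ".join(
        f'(http.user_agent contains "{agent}")' for agent in ai_agents
    )
    print(expression)
    # Paste the printed expression into a Cloudflare custom rule with the
    # action set to Block (or Managed Challenge, to be gentler).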

Cloudflare has created its own AI block lists and manages them, taking a bit of the load off individual users. I still maintain my list, but it’s now supplemental. Cloudflare has also just announced a paid AI crawler access program that would allow sites behind the Cloudflare firewall to charge crawlers to be let through. As I said at the top, my site is not big enough to need that, but it is an interesting development for large content sites. I wouldn’t mind the harvesting if it didn’t put my server resources, and consequently my ability to afford to run a website, at risk.

If it is not actual companies obscuring their user agents, like Brave, then it is unknown people trying to scrape on their own hook. Increasingly, I’m seeing IP addresses owned by Microsoft and Amazon trying to scrape content. I do not think these are corporate addresses for those companies; rather, they reflect someone running on a Microsoft Azure or Amazon Web Services resource. That means adding those addresses to whatever blocking I am doing.

I dread chasing IP addresses. They change from time to time and, particularly with individuals, the person may move to a new platform while your block remains. You end up accumulating an extensive block list of odds and ends of IP addresses that may or may not still do what you wanted them to.

Fortunately, Cloudflare has an IP list option that should make this a bit easier. I had not seen it before, but I received a prompt about it when I was tweaking my rules. It allows me to create a list of IP addresses in a single place and then re-use that list in my rules. Otherwise, the IP addresses take up space in the rule itself and eat into the character limit available for the rest of the rule.

Weirdly, this list is not connected to the individual sites in your account but to the account itself. This must make sense in someone else’s approach, but I would not have looked for this resource next to my billing details. In any event, I now know how to find it and can add IP ranges when I find particular problems. I’m still not sure how often I want to revisit the entries. I could export the list and re-import it periodically (you can upload a list of IP ranges as a CSV file to populate the list, which can handle 11,000 entries), or I could just watch for blocks triggered by the list and make an educated guess about whether they are people or automated requests. For now, I am only going to reference the list from rules that use the managed challenge option. That way, I should only stop bots.
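For reference, a rule that uses such a list is short; the list name below is made up, and the action would again be Managed Challenge so that any actual person swept up by it can click through:

    # Sample expression referencing a named Cloudflare IP list. "$scraper_ips"
    # is a placeholder for whatever the list is actually called in the account;
    # the action, set in the dashboard, would be Managed Challenge.
    challenge_listed_ips = '(ip.src in $scraper_ips)'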

It’s a frustrating amount of time to spend playing cat and mouse. I really don’t mind making all of my information freely accessible; I do not want to monetize it. But putting information out for free still has a cost to me, and these scrapers are using up finite resources at everyone else’s expense.