Website scraping, or simply web scraping, is the act of automatically extracting data from the web, usually through a series of automated requests generated by a program. This program is often called a scraping bot.
It’s important to note that not all web scraping is inherently bad. Googlebot and Bingbot, for example, are technically scraping your website and then analyzing the scraped data to determine your website’s appropriate ranking on the SERPs. However, it is true that many malicious bots scrape your website for negative purposes, like replicating your website’s content to post on other sites, finding vulnerabilities to enable further attacks, and so on.
So, in this guide, we will discuss how we can prevent website scraping from bad bots, while still allowing beneficial scrapers to access our site.
Basic Process of Web Scraping
Although various techniques and methods can be used to extract a website’s data, typically the attacker’s program sends periodic HTTP requests to your server, and your server responds by sending back the web page or web files to the bot.
The attacker can then parse the HTML file and extract only the required data, and would repeat this process for thousands or even millions of different pages when required until the bot accomplishes its task.
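To make this fetch-and-parse loop concrete, here is a minimal sketch of the extraction step using Python’s standard-library `html.parser`. The sample HTML and the targeted `price` class are hypothetical; a real bot would run this over thousands of fetched pages.

```python
from html.parser import HTMLParser

# Sketch of the "parse and extract" step a scraper performs.
# The sample markup and the "price" class are hypothetical.
class PriceExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # Flag when we enter an element the scraper is targeting.
        if tag == "span" and ("class", "price") in attrs:
            self.in_price = True

    def handle_data(self, data):
        # Collect only the text inside targeted elements.
        if self.in_price:
            self.prices.append(data.strip())
            self.in_price = False

sample_html = '<div><span class="price">$19.99</span><span class="price">$5.00</span></div>'
extractor = PriceExtractor()
extractor.feed(sample_html)
print(extractor.prices)  # the extracted data, ready to be stored or republished
```

The key point: the bot only needs a stable, predictable markup pattern (here, `span.price`) to pull out exactly the data it wants, which is why several of the countermeasures below attack that predictability.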
However, this process might not be illegal, since technically the bot only extracts information that the website made public (unless the bot extracts hidden files). So, web scraping can be considered a legal grey area.
Why Is Web Scraping Bad for You?
As mentioned above, there are web scraping activities that are actually beneficial for you. However, even if it’s not totally illegal, web scraping done by automated bots can slow down your website.
The idea is fairly simple: your website’s and web server’s resources are limited, and every request from these scraper bots eats into those resources. Uncontrolled website scraping can therefore translate into a massive, overwhelming number of requests with an effect similar to a DDoS (Distributed Denial of Service) attack, and might crash your server.
Also, there are web scraping activities that are downright illegal (or at least, unethical). For example, a web scraper might publish your content on another low-quality website, creating a duplicate content issue. Scrapers can also capture confidential information, like your pre-launch product’s price, which your competitors can then exploit.
This is why detecting these web scrapers and managing them is very important if you want to maintain a competitive advantage.
In short, there are three main reasons to manage web scraping: it drains your server’s resources, it can create duplicate content issues, and it can leak confidential information to your competitors.
How To Prevent Web Scraping Activities
Below, we will discuss some important methods to prevent web scraping:
1. State Your Terms of Use

You can state something like:
- You may not mirror any material contained on this website
- You may only reproduce or use the content contained on this website for your non-commercial and/or personal use
Doing so indicates that scrapers may not extract and use the content of the website for commercial purposes. However, let us take a look at more practical ways of preventing web scraping.
2. Implementing CAPTCHA
Most of us should be familiar with the concept of CAPTCHA, and chances are, we have encountered at least one CAPTCHA test in the past.
CAPTCHA stands for Completely Automated Public Turing test to tell Computers and Humans Apart. So, as the name suggests, it is technically a Turing test whose main purpose is to differentiate between human users and computers (well, bots).
The general rule of thumb is that a CAPTCHA should be easy and fast enough to be answered by any human of average intelligence, but at the same time very hard for computers to solve. However, there are two things to consider when implementing CAPTCHAs:
- Too many CAPTCHAs on your site can annoy users, so use them sparingly
- Modern bots are getting better at mimicking human behavior, and at voice and image recognition, so more sophisticated CAPTCHAs might be necessary.
Here are some tips if you want to implement CAPTCHA on your site:
- You don’t have to create them from scratch. Google’s reCAPTCHA is trustworthy, reliable, and user-friendly, and should fit most of your needs.
- A pretty common mistake is to include the solution to the CAPTCHA in the HTML markup, where bots can simply scrape it.
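To illustrate the second tip, here is a minimal sketch of keeping the CAPTCHA solution server-side. The in-memory `SESSIONS` dict stands in for real session storage, and the simple arithmetic challenge is only for illustration (use reCAPTCHA or similar in production):

```python
import random
import secrets

# In-memory store standing in for real server-side session storage.
SESSIONS = {}

def issue_captcha():
    """Create a simple arithmetic CAPTCHA; only the question goes to the client."""
    a, b = random.randint(1, 9), random.randint(1, 9)
    token = secrets.token_hex(8)
    SESSIONS[token] = a + b           # the solution stays on the server
    question = f"What is {a} + {b}?"  # only this is rendered in the HTML
    return token, question

def check_captcha(token, answer):
    """Verify the answer against the server-side store, then invalidate the token."""
    expected = SESSIONS.pop(token, None)
    return expected is not None and answer == expected
```

Because the rendered page contains only the question and an opaque token, a bot cannot recover the solution by scraping the markup; it would actually have to solve the challenge.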
3. Monitor and Control Your Logs and Traffic Pattern
Monitor your traffic logs regularly. In the case of unusual activity, such as sudden spikes, increases (or decreases) in bounce rate, or increased bandwidth usage in general, you can block or limit access.
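As a minimal sketch of this kind of log analysis, the snippet below counts requests per IP address in a handful of hypothetical access-log lines and flags any IP far above the baseline. Real log formats and thresholds will differ:

```python
from collections import Counter

# Hypothetical access-log lines (abbreviated common log format).
log_lines = [
    '203.0.113.7 - - [10/Oct/2023:13:55:36] "GET /page1 HTTP/1.1" 200',
    '203.0.113.7 - - [10/Oct/2023:13:55:36] "GET /page2 HTTP/1.1" 200',
    '203.0.113.7 - - [10/Oct/2023:13:55:37] "GET /page3 HTTP/1.1" 200',
    '198.51.100.4 - - [10/Oct/2023:13:58:01] "GET /about HTTP/1.1" 200',
]

# Count requests per IP; an IP far above the normal baseline deserves a closer look.
hits = Counter(line.split()[0] for line in log_lines)
suspicious = [ip for ip, count in hits.items() if count >= 3]
print(suspicious)  # ['203.0.113.7']
```

As the next tip explains, per-IP counting alone is not enough against bots that rotate addresses, so treat this as one signal among several.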
Here are some tips for this:
Don’t only monitor IP addresses. Modern web scraper bots are pretty sophisticated and can rotate their user agents and switch between various IP addresses every minute or even every few seconds. Instead, you should also monitor other indicators like:
- Behavioral analysis like mouse movements (linear vs non-linear), how fast they fill out forms, where they click, etc.
- Fingerprinting their browser and/or device. You can gather a lot of information like resolution, time zone, screen size, etc., and use it to differentiate between human users and bots
- Check for the presence of headless browsers (like PhantomJS or headless Chrome), which can be a major red flag for bots
- Check for unusual activity: for example, a high volume of repeated requests from a single IP address is a major sign of a web scraping bot. Other indicators are a user performing an unusual number of searches or viewing an excessive number of pages. Also, especially with web scraping, pay attention to outbound traffic patterns.
- Rate limiting: A common and effective approach is to limit users (and therefore, bots) to a certain number of specific actions in a given time period (e.g., X login attempts per hour, X searches per second, etc.). You can, for example, show a CAPTCHA when actions are completed too fast.
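The rate-limiting idea in the last bullet can be sketched as a sliding-window limiter: each client may perform at most `limit` actions per `window` seconds, and anything beyond that is refused (or challenged with a CAPTCHA). The class below is a minimal in-memory sketch; production systems would typically use a shared store such as a cache server.

```python
import time
from collections import defaultdict, deque

class RateLimiter:
    """Sliding-window limiter: at most `limit` actions per `window` seconds per client."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.events = defaultdict(deque)  # client_id -> timestamps of recent actions

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        q = self.events[client_id]
        # Drop timestamps that have fallen out of the window.
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # over the limit: block, or serve a CAPTCHA instead
        q.append(now)
        return True
```

For example, with `RateLimiter(limit=3, window=1.0)`, a client’s fourth request inside one second is refused, while normal human-paced traffic passes through untouched.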
4. Don’t Put Confidential Information On Your Website
This one might seem obvious but often overlooked. If you are worried about some information falling into the wrong hands, then it shouldn’t have been made public in the first place.
In general, don’t provide a way for web scraper bots to get your whole dataset at once. For example, if your site is a blog with hundreds of articles, you could make those articles reachable only via on-site search; that is, don’t list all the article URLs anywhere on your site.
This means that if a scraper still wants all your content, the bot must search for every possible phrase and find the articles one by one. This is time-consuming and inefficient, and hopefully it will deter the bot.
5. Change Your HTML Markup Regularly
Web scraper bots rely on finding patterns and vulnerabilities in the site’s HTML markup. So, if your HTML markup changes frequently enough (or is inconsistent, pattern-wise), you can confuse scraper bots and might discourage them from spending their resources on your site.
This doesn’t need to be overly complicated. You can, for example, change the ids and classes in your HTML and CSS files, which should be enough to disrupt the scraper bots’ activities. However, this can be a tedious process and difficult to maintain.
Our tip is to do this sparingly: while you can automate the process to an extent, frequent changes can hinder caching.
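One way to automate this, sketched below under assumed names, is to derive class names from a per-deploy salt: your templates reference stable internal names, while the rendered HTML gets rotated ones, so every deploy breaks the selectors a scraper recorded. The salt value and class names here are hypothetical.

```python
import hashlib

# Change this salt on each deploy to rotate every rendered class name.
DEPLOY_SALT = "2023-10-rollout"  # hypothetical per-deploy value

def obfuscated_class(stable_name, salt=DEPLOY_SALT):
    """Map a stable internal class name to a salted, deploy-specific one."""
    digest = hashlib.sha256(f"{salt}:{stable_name}".encode()).hexdigest()[:8]
    return f"c-{digest}"

# Templates keep using "article-body"; the HTML a scraper sees does not.
html = f'<div class="{obfuscated_class("article-body")}">...</div>'
```

The same mapping must of course be applied to your CSS at build time, which is exactly the kind of extra pipeline work (and cache invalidation) the tip above warns about.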
6. Honeypot Page
We can think of honeypot pages as a trap for web scrapers. This method involves putting a link on your site (usually on the home page) that is invisible to humans, but not to bots. If a bot operates by following every link on the website, it will hit this honeypot URL.
This allows us to be quite sure the visitor is a bot and not a human user, and we can start rate-limiting or blocking all requests from it. This way, we can effectively stop the bot’s scraping activity.
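A minimal sketch of the idea, with a hypothetical trap path and an in-memory blocklist: the link is hidden with CSS so humans never click it (and it should also be disallowed in robots.txt so that well-behaved bots like Googlebot avoid it), while a naive scraper following every link requests it and gets flagged.

```python
# Hypothetical trap URL; also disallow it in robots.txt so good bots skip it.
HONEYPOT_PATH = "/do-not-follow"
BLOCKED_IPS = set()  # in-memory blocklist standing in for real infrastructure

def honeypot_html():
    """Render the trap link, hidden from human visitors via CSS."""
    return f'<a href="{HONEYPOT_PATH}" style="display:none" rel="nofollow">secret</a>'

def handle_request(ip, path):
    """Flag any client that requests the honeypot URL, then refuse it everywhere."""
    if path == HONEYPOT_PATH:
        BLOCKED_IPS.add(ip)
    if ip in BLOCKED_IPS:
        return 403  # treated as a bot from now on
    return 200
```

Once an IP trips the trap, every subsequent request from it is refused, which is the rate-limit-or-block step described above.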
We’ve discussed some simple but practical tips you can use to prevent web scraping activities on your website. While there are numerous techniques you can use to combat web scrapers, it’s important to note that there’s no one-size-fits-all technique that offers 100% protection. However, by combining a few, we can create a pretty reliable prevention system to protect our website.
Arguably, however, the best approach is to avoid making sensitive and confidential information accessible on your website in the first place.