Advanced scrapers have made it fairly easy to gather information from the Internet. However, avoiding detection, and the blocks that come with it, remains a serious challenge. As scraping becomes more accessible to non-technical users, website owners keep coming up with new techniques to protect their content.
We will not get into the ethical or legal side of scraping in this guide; we leave all that to your judgment and conscience. In this article, we will focus solely on how to avoid bans while scraping or crawling websites.
Manage your User-Agent and other headers
The User-Agent header contains information about you and your device. It is basically the face you show the destination server when you visit a website. Servers, and the technicians who manage them, constantly monitor visitor activity by User-Agent. If one of them sends too many requests, so many that it can’t possibly be a regular user, the system gets suspicious and blocks that agent.
So it’s only logical to invest some effort into managing your User-Agent if you want to scrape websites successfully. You can switch User-Agents manually or set up the scraper to pick one at random from a list of common values for each request. This is one of the most effective ways to prevent bans and keep your scraping consistent.
Besides the User-Agent, other headers can sabotage your work. Scrapers and crawlers often send Accept and Accept-Encoding headers that differ from the ones a real browser would send. So take your time and adjust all the headers to make sure your requests don’t look like they came from a bot.
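As a sketch of the idea, here is one way to rotate headers in Python using only the standard library. The User-Agent strings below are illustrative examples; keep your own list in sync with real, current browser releases.

```python
import random

# Example User-Agent strings (illustrative; keep your own list current
# with real browser releases).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def browser_like_headers():
    """Build a header set resembling what a real browser sends."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
    }
```

You would then pass these headers to your HTTP client on every request, for example `requests.get(url, headers=browser_like_headers())` if you use the requests library.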
Maintain a reasonable pace
It’s easy to get detected if your crawler sends requests like crazy. When a real person browses a website, the number of requests from their IP address is relatively low, with at least a couple of seconds between them. So if you set up your software to gather data as quickly as possible, the destination server will quickly flag this as suspicious activity and block your bot’s access. Websites are very cautious about large bursts of requests because that’s what a DDoS attack looks like. However, they’re not too happy about getting scraped either.
It’s better to scrape at a slower pace. Of course, the process will take longer. But if you keep getting banned, it will take you forever to gather the information you need. Set up your bot to wait a couple of seconds before sending the next request; that way, chances are high the destination server won’t flag your activity.
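A minimal sketch of such pacing, with a little random jitter so the intervals don’t look mechanical (the two-second base is an assumption; tune it to the site you’re scraping):

```python
import random
import time

def polite_delay(base=2.0, jitter=1.5):
    """Sleep for the base interval plus random jitter, so request
    timing looks more like a human clicking around than a bot."""
    pause = base + random.uniform(0, jitter)
    time.sleep(pause)
    return pause

# Call polite_delay() between consecutive requests in your scraping loop.
```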
Use proxies
Proxies are remote servers that act as an intermediary between you and the destination website. When connected to a proxy, you take on its IP address, covering up your authentic one. When you reach the destination server, it sees the proxy’s data instead of your real information. This lets you appear as different users while you scrape, which is how you can bypass a lot of restrictions.
While it sounds simple, proxies do require some background knowledge, so let’s take a quick look at this technology. When looking for proxies, you will find a lot of free ones; stay away from them, because they’re not reliable. They’re used by many people who don’t always have the best intentions, and as a result, most free IP addresses are already blacklisted. Such proxies will not help you; in fact, they can even make things worse.
It’s much more efficient to get paid proxies, especially since the cost is not too high. Most providers offer different plans and types of proxies, so that’s another thing we need to make clear. Let’s take Infatica as an example, since it’s a provider with a large pool of IP addresses. Visiting this vendor’s website, you will see that it offers three types of proxies: datacenter, residential, and mobile. You should choose residential ones for web scraping, and here is why.
Datacenter proxies are shared servers, and many users connect to them to change their IP addresses. Since you share the same IP with other people, the chance of getting detected is higher. Still, you’re less likely to get banned using datacenter proxies than if you don’t change your data at all. Since this type of proxy is the cheapest, you can use it for scraping on a tight budget, but you might still run into issues.
Residential proxies, on the other hand, are real devices you connect to. You will be the only one connected to the device, and it will have a real domestic IP address issued by an ISP. Using such proxies, you will appear as just another regular user casually browsing the website. Choosing residential proxies, you also get access to a pool of IP addresses you can rotate through.
Residential proxies are a bit more expensive than datacenter ones, but they’re more reliable, so they are worth the higher cost. Residential proxies also usually come with some kind of rotation pattern established by the provider. Often, though, that’s not enough for scraping, so set up your tool to manage IP addresses properly.
Mobile proxies are basically residential ones with one difference: all the IP addresses belong to mobile devices. This is the most expensive service, and it’s usually overkill for scraping. You can stick to cheaper residential proxies and be totally fine.
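To tie this together, here is a minimal sketch of rotating through a proxy pool. The endpoints and credentials below are placeholders; substitute the gateway addresses and login details your provider gives you.

```python
import random

# Placeholder proxy endpoints; replace with the credentials and gateway
# addresses from your provider.
PROXY_POOL = [
    "http://user:pass@proxy1.example.com:8080",
    "http://user:pass@proxy2.example.com:8080",
    "http://user:pass@proxy3.example.com:8080",
]

def pick_proxy():
    """Choose a proxy at random and return it in the mapping format most
    HTTP clients accept (e.g. the proxies= argument in requests)."""
    proxy = random.choice(PROXY_POOL)
    return {"http": proxy, "https": proxy}
```

With the requests library, a call would then look like `requests.get(url, proxies=pick_proxy(), timeout=10)`.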
Take a breather
Set up your bot to back off when it gets 403 or 503 errors. If the scraper keeps pushing requests, it will get banned very quickly. Moreover, if you’re using proxies and your bot keeps trying to reach the destination page, your proxy IPs will get blocked as well. So make sure your program stops when it encounters such an error and leaves the restricted or hidden page alone.
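One way to sketch this back-off logic (the exact status codes and delays are assumptions; adjust them to your target site):

```python
import time

# Assumed set of "back off" codes; 429 "Too Many Requests" is included
# alongside the 403 and 503 mentioned above.
BLOCKING_CODES = {403, 429, 503}

def backoff_delays(attempts=4, base=5.0):
    """Exponential back-off schedule: 5s, 10s, 20s, 40s by default."""
    return [base * (2 ** i) for i in range(attempts)]

def fetch_with_backoff(get, url, attempts=4, base=5.0):
    """Call get(url) (any callable returning an object with a
    .status_code attribute); on a blocking status, wait and retry.
    After the last attempt, give up and return None."""
    for delay in backoff_delays(attempts, base):
        response = get(url)
        if response.status_code not in BLOCKING_CODES:
            return response
        time.sleep(delay)
    return None  # leave the restricted page alone
```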
Remember about robots.txt
Not many websites will block an IP address just for requesting pages disallowed by the robots.txt file. Still, it is a real restriction you should take into account. Obeying robots.txt files will help you scrape efficiently.
If the pages you need are disallowed by this file, you would have to ignore robots.txt to reach them. But we highly recommend respecting the rules a webmaster sets and doing your best to obey this file. Doing so can help you avoid legal issues.
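Python’s standard library can check robots.txt rules for you. Here is a small sketch that parses a sample file inline; normally you would point `set_url()` at the site’s real /robots.txt and call `read()` instead.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# Parse a sample robots.txt inline for illustration; in real use,
# call rp.set_url("https://example.com/robots.txt") and rp.read().
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

print(rp.can_fetch("my-scraper", "https://example.com/public/page"))   # True
print(rp.can_fetch("my-scraper", "https://example.com/private/data"))  # False
```

Checking `can_fetch()` before each request lets your bot skip disallowed pages automatically.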
Sometimes you need to collect cookies
For example, if you’re scraping search engines, they will notice something is off if you don’t send any cookie data; that’s unusual behavior a regular user doesn’t show. So it might be a good move to let the scraper keep and send cookies with each request. But remember that in this case you will receive personalized results from Google.
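Assuming you use the requests library, a Session object handles this for you: it stores cookies the server sets and sends them back on every later request, like a browser would. The cookie below is planted by hand just to show the persistence; in real use, `session.get(url)` fills the jar automatically from Set-Cookie response headers.

```python
import requests

session = requests.Session()
# Simulate a server-set cookie; normally session.get(url) stores these
# from the response's Set-Cookie headers automatically.
session.cookies.set("preferences", "lang=en", domain="example.com")

# Every later request through this session now carries the stored cookie,
# e.g. session.get("https://example.com/search?q=web+scraping")
```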
The main boss – CAPTCHAs
It’s one of the most annoying things on the Internet, hated by robots and humans alike. Since people aren’t very fond of CAPTCHAs either, there are far fewer of them online today than there were a couple of years ago. However, they still exist, and you need to do something about them. If you search the Internet, you will find quite a few solutions to help you bypass CAPTCHAs. For example, there are tools that use Tesseract OCR to recognize text. You can also try API services that help your scraper solve CAPTCHAs with the help of humans. As machine learning and image recognition improve, we can expect future scrapers to crack CAPTCHAs more easily.
Of course, it’s hard to cover every single issue you might encounter when scraping the Internet. Yet, we did our best to tell you about the most popular anti-scraping measures and ways to bypass them. Hopefully, this guide was helpful for you. And, of course, if you have any questions left, or you want to share your experience – leave a comment below.