Mikhail Sisin Co-founder of cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.

Automated CloudFlare challenge solution with Golang



A new version of the Surf library for Golang has been pushed. This version can bypass the latest version of the CloudFlare protection. We use this library in our engine, so all of our users get the benefits. The library bypasses the protection automatically, so you don't need to do anything extra: you load a page as usual, and if there is a CloudFlare challenge, the library resolves it automatically and you get the content of the page you requested.

You are free to use the Surf library from our repo in your own projects; it is under the MIT license and is forked from headzoo/surf. However, we use our own version, which fits the needs of our web scraping engine.

How can you test that it works? Try loading a page that is under protection. The site below is behind CloudFlare. Let's use the following digger config to fetch the page and extract the vendor's website URL:

---
config:
    debug: 2
    agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36
do:
- walk:
    to: https://www.g2crowd.com/products/essbase/details
    do:
    - find:
        path: div.company-info
        do:
        - object_new: item
        - find:
            path: dl > dt:contains("Vendor") + dd
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: item
                field: vendor
        - find:
            path: dl > dt:contains("Description") + dd
            do:
            - parse
            - space_dedupe
            - trim
            - object_field_set:
                object: item
                field: description
        - find:
            path: dl > dt:contains("Company Website") + dd>a
            do:
            - parse:
                attr: href
            - space_dedupe
            - trim
            - object_field_set:
                object: item
                field: website
        - object_save:
            name: item

The data we get will look like this:

{
  "item": {
    "website": "https://www.oracle.com/index.html",
    "vendor": "Oracle",
    "description": "Oracle Corporation develops, manufactures, markets, hosts, and supports database and middleware software, applications software, and hardware systems."
  }
}
