Mikhail Sisin, co-founder of the cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.

How to bypass captcha on Diggernaut web scraping platform: Solving Google reCaptcha v2

February 28, 2018

Google reCaptcha v2 is no longer a problem for our users. We recently integrated the Death by Captcha service, so you can now bypass this type of captcha easily.

Here is what reCaptcha v2 looks like: the familiar “I’m not a robot” checkbox.

If you see this captcha on a website you need to scrape, keep reading: we are going to show you how to handle it in your web scraper. Moreover, we are going to use a real example. We are going to scrape this website: http://www.receita.fazenda.gov.br/PessoaJuridica/CNPJ/cnpjreva/Cnpjreva_Solicitacao2.asp.

To use this solution, you need an account with the Death by Captcha service. This service is not free: solving 1,000 such captchas costs approximately $2.89 (the price as of February 28, 2018).
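To get a feel for the budget, the quoted price can be turned into a quick estimate. This is a minimal Python sketch, assuming the price stays at $2.89 per 1,000 solved captchas; the function name is ours, and actual billing rules should be checked with Death by Captcha.

```python
from decimal import Decimal

# Price quoted in the article: about $2.89 per 1,000 solved reCaptcha v2 captchas.
PRICE_PER_1000_USD = Decimal("2.89")


def estimate_cost_usd(solved_captchas: int) -> Decimal:
    """Return the estimated cost in USD for the given number of solved captchas."""
    if solved_captchas < 0:
        raise ValueError("solved_captchas must be non-negative")
    # Scale the per-1,000 price and round to whole cents.
    return (PRICE_PER_1000_USD * solved_captchas / 1000).quantize(Decimal("0.01"))


print(estimate_cost_usd(1000))   # 2.89
print(estimate_cost_usd(25000))  # 72.25
```

Decimal is used instead of float so that the cents do not pick up rounding noise.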

Diggernaut solves the captcha automatically. You need to load the page containing the captcha and call the special command captcha_resolve with the following parameters:

provider: captcha solution provider, should be set to deathbycaptcha.com
type: captcha type, should be set to nocaptchav2
username: your Death by Captcha account username
password: your Death by Captcha account password

Also, please NOTE! For this captcha to be resolved successfully, the people who solve it must be able to load the page with the captcha from the same IP address as your web scraper. The only way to achieve this is to use YOUR OWN PROXY SERVER in the digger configuration. IP addresses outside our network cannot access our rotating proxies, so those proxies cannot be used for this operation.

So the base code for our scraper to resolve the captcha would be:

---
config:
    debug: 2
    agent: Firefox
    proxy: PUT YOUR OWN PROXY HERE
do:
# Captcha solving does not succeed 100% of the time, so we retry the call until it does. This variable controls the repeat mode of the walk command below
- variable_set:
    field: repeat
    value: "yes"
# Loading page with captcha
- walk:
    to: http://www.receita.fazenda.gov.br/PessoaJuridica/CNPJ/cnpjreva/cnpjreva_solicitacao2.asp
    repeat: <%repeat%>
    do:
    # Resolving captcha
    - captcha_resolve:
        provider: deathbycaptcha.com
        type: nocaptchav2
        username: YOUR DBC USERNAME HERE
        password: YOUR DBC PASSWORD HERE

Don’t run this code yet. If the captcha is resolved successfully, the token is stored in the captcha variable. So the first thing to do after captcha_resolve is to check whether the variable holds a value and, depending on the result, either retry the call or submit the token to the server and get the results.
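Before the full digger configuration, the check-and-retry logic described above can be sketched outside Diggernaut in a few lines of Python. This is only an illustration: solve_once is a hypothetical stand-in for loading the page and running captcha_resolve, max_attempts is a safety cap we added (the configuration itself retries without a limit), and we assume the `match: \w+` check behaves like an unanchored regular expression search.

```python
import re
from typing import Callable, Optional


def solve_with_retries(solve_once: Callable[[], str], max_attempts: int = 5) -> Optional[str]:
    """Call solve_once until it returns a non-empty token or attempts run out.

    solve_once is a hypothetical stand-in for loading the page and running
    captcha_resolve; it returns the token, or "" when solving failed.
    max_attempts is an added safety cap, not part of the article's config.
    """
    for _ in range(max_attempts):
        token = solve_once()
        # Same test as `match: \w+` in the config: any word character means
        # the captcha variable is not empty.
        if re.search(r"\w+", token):
            return token  # corresponds to setting repeat to "no"
    return None


# Usage with a fake solver that fails twice and then succeeds.
answers = iter(["", "", "03AHJ_token"])
print(solve_with_retries(lambda: next(answers)))  # 03AHJ_token
```

A cap on attempts is worth considering in practice, since each retry means another request to a paid solving service.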

---
config:
    debug: 2
    agent: Firefox
    proxy: PUT YOUR OWN PROXY HERE
do:
# Captcha solving does not succeed 100% of the time, so we retry the call until it does. This variable controls the repeat mode of the walk command below
- variable_set:
    field: repeat
    value: "yes"
# Loading page with captcha
- walk:
    to: http://www.receita.fazenda.gov.br/PessoaJuridica/CNPJ/cnpjreva/cnpjreva_solicitacao2.asp
    repeat: <%repeat%>
    do:
    # Resolving captcha
    - captcha_resolve:
        provider: deathbycaptcha.com
        type: nocaptchav2
        username: YOUR DBC USERNAME HERE
        password: YOUR DBC PASSWORD HERE
    # Switching to the body block
    - find:
        path: body
        do:
        # Reading the captcha variable value into the register
        - variable_get: captcha
        # Checking whether the register holds a non-empty value
        - if:
            match: \w+
            do:
            # If it is not empty, turn off repeat mode for the walk command
            - variable_set:
                field: repeat
                value: "no"
            # Submit the token together with the rest of the form data; here we look up a company by its CNPJ number
            - walk:
                to:
                    post: http://www.receita.fazenda.gov.br/PessoaJuridica/CNPJ/cnpjreva/valida_recaptcha.asp
                    data:
                        origem: comprovante
                        cnpj: "05754558000186"
                        g-recaptcha-response: <%captcha%>
                        submit1: Consultar
                        search_type: cnpj
                do:
                - find:
                    path: 'div#principal'
                    do:
                    - object_new: item
                    - find:
                        path: td:haschild(font:contains('NÚMERO DE INSCRIÇÃO')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: registration_number
                    - find:
                        path: td:haschild(font:contains('DATA DE ABERTURA')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: registration_date
                    - find:
                        path: td:haschild(font:contains('NOME EMPRESARIAL')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: company_name
                    - find:
                        path: td:haschild(font:contains('CÓDIGO E DESCRIÇÃO DA ATIVIDADE ECONÔMICA PRINCIPAL')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: primary_code
                    - find:
                        path: td:haschild(font:contains('CÓDIGO E DESCRIÇÃO DAS ATIVIDADES ECONÔMICAS SECUNDÁRIAS')) b
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_push:
                            object: item
                            field: secondary_codes
                    - find:
                        path: td:haschild(font:contains('CÓDIGO E DESCRIÇÃO DA NATUREZA JURÍDICA')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: legal_code
                    - find:
                        path: td:haschild(font:contains('LOGRADOURO')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: street
                    - find:
                        path: td:haschild(font:contains('BAIRRO/DISTRITO')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: district
                    - find:
                        path: td:haschild(font:contains('MUNICÍPIO')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: municipal
                    - find:
                        path: td:haschild(font:contains('TELEFONE')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: phone
                    - find:
                        path: td:haschild(font:contains('E-MAIL')) b
                        slice: 0
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: item
                            field: email
                    - object_save:
                        name: item

You should get the following data when you run it.

[{
    "item": {
        "company_name": "LIST - LOGISTICA INTEGRADA, SERVICOS E TRANSPORTES LTDA",
        "district": "COROADO",
        "legal_code": "206-2 - Sociedade Empresária Limitada",
        "municipal": "MANAUS",
        "phone": "(92) 3622-7885",
        "primary_code": "52.12-5-00 - Carga e descarga",
        "registration_date": "26/06/2003",
        "registration_number": "05.754.558/0001-86",
        "secondary_codes": [
            "49.30-2-02 - Transporte rodoviário de carga, exceto produtos perigosos e mudanças, intermunicipal, interestadual e internacional",
            "77.39-0-99 - Aluguel de outras máquinas e equipamentos comerciais e industriais não especificados anteriormente, sem operador",
            "77.19-5-99 - Locação de outros meios de transporte não especificados anteriormente, sem condutor",
            "52.29-0-99 - Outras atividades auxiliares dos transportes terrestres não especificadas anteriormente",
            "49.30-2-01 - Transporte rodoviário de carga, exceto produtos perigosos e mudanças, municipal",
            "52.50-8-03 - Agenciamento de cargas, exceto para o transporte marítimo"
        ],
        "street": "R PROFESSORA RAYMUNDA MAGALHAES"
    }
}
]
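If you consume this output in your own code, note that the sample has no email key even though the digger sets that field, so it seems safer not to assume every field is present. Below is a minimal Python sketch of reading such output; the function name and the flattened key names are ours, and the sample is a shortened copy of the output above.

```python
import json

# Shortened copy of the sample output above (same structure, fewer fields).
SAMPLE = """
[{"item": {
    "company_name": "LIST - LOGISTICA INTEGRADA, SERVICOS E TRANSPORTES LTDA",
    "municipal": "MANAUS",
    "registration_number": "05.754.558/0001-86",
    "secondary_codes": ["52.50-8-03 - Agenciamento de cargas, exceto para o transporte marítimo"]
}}]
"""


def read_companies(raw: str) -> list:
    """Parse digger output and return one flat dict per company."""
    companies = []
    for record in json.loads(raw):
        item = record["item"]
        companies.append({
            "name": item.get("company_name", ""),
            "number": item.get("registration_number", ""),
            "city": item.get("municipal", ""),
            # The sample output has no "email" key, so fall back to a default.
            "email": item.get("email", ""),
            "secondary_codes": item.get("secondary_codes", []),
        })
    return companies


for company in read_companies(SAMPLE):
    print(company["number"], company["name"], company["city"])
```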

This particular solution can be used in a “Data on demand” scenario, when someone enters a number on your website. Your website sends a command to the Diggernaut API to run the digger with a specific incoming parameter. The API then returns the results to your website, which renders them for your user. Please note that the captcha resolution process is not very fast. For example, resolving reCaptcha v2 may take more than a minute.
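The website side of that flow could look roughly like the Python sketch below. It deliberately does not call the real Diggernaut API, which this article does not describe: run_digger is a hypothetical callable you would implement yourself to start the digger with the CNPJ as the incoming parameter and wait for its records. The helper names are ours as well.

```python
import re
from typing import Callable, Dict, List


def normalize_cnpj(user_input: str) -> str:
    """Strip punctuation from a CNPJ and check that 14 digits remain."""
    digits = re.sub(r"\D", "", user_input)
    if len(digits) != 14:
        raise ValueError("a CNPJ must contain exactly 14 digits")
    return digits


def lookup_company(user_input: str, run_digger: Callable[[str], List[Dict]]) -> Dict:
    """Handle one on-demand request.

    run_digger is a hypothetical callable that wraps the Diggernaut API call:
    it receives the CNPJ as the incoming parameter and blocks until the digger
    returns its records.
    """
    cnpj = normalize_cnpj(user_input)
    records = run_digger(cnpj)  # may take over a minute because of the captcha
    if not records:
        raise LookupError("digger returned no data for CNPJ " + cnpj)
    return records[0]["item"]


# Usage with a fake runner standing in for the real API call.
def fake_runner(cnpj: str) -> List[Dict]:
    return [{"item": {"registration_number": "05.754.558/0001-86", "municipal": "MANAUS"}}]


print(lookup_company("05.754.558/0001-86", fake_runner)["municipal"])  # MANAUS
```

Validating the number before starting the digger avoids paying for a captcha on input that cannot be a valid CNPJ. Because the run can take more than a minute, the page should show a waiting state rather than block silently.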

In any case, we hope this article was useful and that you now know how to bypass reCaptcha v2 with Diggernaut and Death by Captcha. We are planning to implement other captcha solutions soon, so please stay tuned.
