Captcha

Bypassing Captcha

When scraping the web, you have probably run into restrictions that webmasters impose on automated use of their sites. One such restriction is a captcha. A captcha is an automated Turing test that determines whether the user is a human or a robot. Usually, the user is shown an image with letters and numbers and is asked to type these characters into a field, or is shown a series of images and has to select only those that match a given theme, for example only those that show cars or road signs.

There are many services and software products that let a webmaster add a captcha to a site. The best known are the Google ReCaptcha and Funcaptcha services. If a captcha is complex and our OCR functionality does not help you bypass it, specialized services that solve captchas manually can help. We have integrated several such services, so your diggers can pass a captcha to them and automatically receive either a special token or the manually recognized letters and digits from the picture.

To use this functionality, you will need to have an account with one of these services. Since they are paid, you pay their charges yourself using your own account.

Diggernaut - our own service for solving graphic captchas on specific sites. It is free for all our users and does not require connecting additional services. It is available only in the cloud and does not work in compiled diggers. At the moment, it solves captchas for the following websites: Amazon.

2Captcha - a service for solving image captchas and Google ReCaptcha v2. For Google ReCaptcha v2, the "proxyless" mode can be used (in normal mode you must use your own password-protected proxy for scraping, because 2Captcha workers need to access the website with the captcha through your proxy server). This mode is useful for those who do not have their own proxy servers or do not want to use them, preferring our proxy network for scraping. So far, it is the only integrated service that solves Google ReCaptcha v3.

AntiCaptcha - a service for solving image captchas and Google ReCaptcha v2. For Google ReCaptcha v2, the "proxyless" mode can be used (in normal mode you must use your own password-protected proxy for scraping, because AntiCaptcha workers need to access the website with the captcha through your proxy server). This mode is useful for those who do not have their own proxy servers or do not want to use them, preferring our proxy network for scraping.

DeathByCaptcha - can do almost the same as AntiCaptcha, except for the "proxyless" mode. For this provider we have integrated only Google ReCaptcha v2 solving. Therefore, if you want to use the "proxyless" mode or solve an image captcha, use the AntiCaptcha service.
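
As a rough illustration, a DeathByCaptcha call for Google ReCaptcha v2 might look like the sketch below. It uses only the provider, type, username and password parameters described in the parameter table on this page. The URL and the credentials are placeholders, and the proxy setup that normal mode requires is not shown.

```yaml
# LOAD THE PAGE WITH RECAPTCHA V2 (PLACEHOLDER URL)
- walk:
    to: https://example.com/form
    do:
    # DEATHBYCAPTCHA USES ACCOUNT CREDENTIALS INSTEAD OF AN API KEY
    - captcha_resolve:
        provider: deathbycaptcha.com
        type: recaptchav2
        username: your_username
        password: your_password
```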

The captcha_resolve command, which resolves a captcha, may be used in the block or page context. The process is fully automated for you, but because the third-party solving itself is manual work, it can take from 20 seconds to 2 minutes. When the command completes, the recognition result (or a token for ReCaptcha v2) is saved to the captcha variable.

This variable can then be read into the register and sent with a form for validation, or with a request to the server. Below are examples of how to code the logic for resolving ReCaptcha v2 and a standard image captcha.

The command uses the following parameters:

Parameter	Description
provider	Mandatory parameter. Indicates the provider you are going to use for solving the captcha. At the moment the following providers are supported: diggernaut, deathbycaptcha.com, anticaptcha and 2captcha.
type	Type of captcha. At the moment the following options are supported: image to resolve an image captcha (works only with the AntiCaptcha provider), recaptchav2 to resolve Google ReCaptcha v2, proxyless_recaptchav2 to resolve Google ReCaptcha v2 in the "proxyless" mode (works only with the AntiCaptcha and 2Captcha providers) and recaptchav3 to resolve Google ReCaptcha v3 (works only with the 2Captcha provider). When using the Diggernaut provider, specify one of the supported website identifiers in this field: amazon.
image	If the captcha type is image or amazon, this parameter is used to pass the captcha image, encoded in base64.
username	If you are using the Death By Captcha provider, this parameter should contain the username of your account on the deathbycaptcha.com platform.
password	If you are using the Death By Captcha provider, this parameter should contain the password of your account on the deathbycaptcha.com platform.
apikey	If you are using the AntiCaptcha or 2Captcha provider, this parameter should contain your API key for the AntiCaptcha/2Captcha platform.
sitekey	The site key is a unique identifier of the website where ReCaptcha v2 or v3 is used. Usually it is retrieved automatically, but if it cannot be extracted for some reason, you can set it manually with this parameter.
action	Special action parameter used for ReCaptcha v3. Usually it is retrieved automatically, but if it cannot be extracted for some reason, you can set it manually with this parameter.
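
To illustrate the last two parameters, here is a minimal sketch of a ReCaptcha v3 call where sitekey and action are set by hand. Both values are placeholders chosen for the illustration, not real keys or action names; take the actual values from the target page. In the common case, where they are extracted automatically, both lines can be left out.

```yaml
- captcha_resolve:
    provider: 2captcha
    type: recaptchav3
    apikey: xxxxxxxxxxxxxxxxxxx
    # SET THESE ONLY IF AUTOMATIC EXTRACTION FAILS (PLACEHOLDER VALUES)
    sitekey: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
    action: homepage
```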

Below is a sample showing how to solve the Amazon captcha:

# SET VARIABLE TO USE IT IN THE WALK COMMAND
- variable_set:
    field: "repeat"
    value: "yes"
# OPEN PAGE IN REPEAT MODE (AS WE NEED TO RELOAD PAGE IF THERE IS CAPTCHA)
- walk:
    to: https://www.amazon.com
    repeat: <%repeat%>
    do:
    # SWITCH TO THE BODY BLOCK
    - find:
        path: body
        do:
        - parse
        # CHECK IF THERE IS A CAPTCHA
        - if:
            match: Type the characters you see in this image
            do:
            # THERE IS A CAPTCHA
            - variable_set:
                field: "repeat"
                value: "yes"
            # COLLECT ALL REQUIRED PARAMETERS FROM THE PAGE
            # SAVE THEM TO VARIABLES
            - find:
                path: input[name="amzn"]
                do:
                - parse:
                    attr: value
                - normalize:
                    routine: urlencode
                - variable_set: amzn
            - find:
                path: input[name="amzn-r"]
                do:
                - parse:
                    attr: value
                - normalize:
                    routine: urlencode
                - variable_set: amznr
            # SWITCH TO THE BLOCK WITH CAPTCHA IMAGE
            - find:
                path: div.a-row>img
                do:
                # PARSE URL TO THE IMAGE
                - parse:
                    attr: src
                # LOAD THE IMAGE
                - walk:
                    to: value
                    do:
                    # SWITCH TO THE BLOCK WITH THE BASE64 ENCODED IMAGE
                    - find:
                        path: imgbase64
                        do:
                        # PARSE CONTENT AND SAVE IT TO THE VARIABLE
                        - parse
                        - variable_set: capimg
                        # SOLVE CAPTCHA
                        - captcha_resolve:
                            provider: diggernaut
                            type: amazon
                            image: <%capimg%>
                        # READ VARIABLE TO THE REGISTER
                        - variable_get: captcha
                        # IF CAPTCHA IS SOLVED
                        - if:
                            match: \S+
                            do:
                            # SEND IT TO THE AMAZON SERVER
                            - walk:
                                to: https://www.amazon.com/errors/validateCaptcha?amzn=<%amzn%>&amzn-r=<%amznr%>&field-keywords=<%captcha%>
                                do:
            else:
            # THERE IS NO CAPTCHA, TURN OFF REPEAT MODE
            - variable_set:
                field: "repeat"
                value: "no"
            # PARSE PAGE AND GET DATA

Let's review the case where a page uses Google ReCaptcha v2. We will use the "proxyless" mode:

# LOADING THE PAGE WITH CAPTCHA
- walk:
    to: https://www.nebraska.gov/sos/corp/corpsearch.cgi
    do:
    # RESOLVING CAPTCHA
    - captcha_resolve:
        provider: anticaptcha
        type: proxyless_recaptchav2
        apikey: xxxxxxxxxxxxxxxxxxx
    - find: 
        path: body 
        do: 
        # CHECK IF WE HAVE A TOKEN IN THE captcha VARIABLE
        - variable_get: captcha 
        - if:
            match: \S
            do:
            # TOKEN IS OK, SENDING FORM
            - walk:
                to:
                    post: https://www.nebraska.gov/sos/corp/corpsearch.cgi
                    data:
                        search: 1
                        keyword_type: all
                        search_type: num_search
                        corpname: 
                        acct-num: 1000011010101
                        g-recaptcha-response: <%captcha%>
                        submit: submit
                do:
                # PARSE PAGE AND EXTRACT DATA

Another example, this time for an image captcha:

# LOAD PAGE WITH CAPTCHA
- walk:
    to: https://eservices.cmcoh.org/eservices/home.page
    headers:
        Wicket-Focusedelementid: ''
        Wicket-Ajax: ''
    do:
    # FIND ELEMENT WITH IMAGE CAPTCHA
    - find:
        path: img.captchaImg
        do:
        # PARSE URL TO THE IMAGE CAPTCHA
        - parse:
            attr: src
        # LOAD IMAGE IN BASE64 ENCODING
        - walk:
            to: value
            do:
            - find:
                path: imgbase64
                do:
                - parse
                - variable_set: image
                # RESOLVE CAPTCHA
                - captcha_resolve:
                    provider: anticaptcha
                    type: image
                    apikey: xxxxxxxxxxxxxxxxxxx
                    image: <%image%>
        - find:
            path: a.anchorButton
            do:
            - variable_get: captcha
            - if:
                match: \w+
                do:
                # CAPTCHA SEEMS RESOLVED SO HERE WE CAN SEND IT TO THE SERVER WITH THE FORM
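
The last comment in the example above leaves the form submission open. A minimal sketch of that step, placed inside the final do: block, could look like the following. The URL and the form field names (captcha, submit) are hypothetical and must be taken from the real form on the target page; the post syntax follows the ReCaptcha v2 example earlier on this page.

```yaml
# SEND THE RECOGNIZED TEXT WITH THE FORM (HYPOTHETICAL URL AND FIELD NAMES)
- walk:
    to:
        post: https://example.com/verify
        data:
            captcha: <%captcha%>
            submit: submit
    do:
    # PARSE PAGE AND EXTRACT DATA
```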

In some cases, a captcha may be recognized incorrectly. If you detect this, you can report the incorrectly solved captcha using the captcha_report command. In such cases the captcha resolving service usually reimburses the cost of the job. But be extremely careful not to send such a report if the captcha was recognized correctly, otherwise the service you are using for captcha resolving can impose sanctions and limitations on your account.

# OPEN PAGE WITH THE CAPTCHA
- walk:
    to: https://www.nebraska.gov/sos/corp/corpsearch.cgi
    do:
    # RESOLVE CAPTCHA
    - captcha_resolve:
        provider: anticaptcha
        type: proxyless_recaptchav2
        apikey: xxxxxxxxxxxxxxxxxxx
    - find: 
        path: body 
        do: 
        # CHECK IF captcha VARIABLE HAS A TOKEN
        - variable_get: captcha 
        - if:
            match: \S
            do:
            # TOKEN IS THERE, SENDING FORM
            - walk:
                to:
                    post: https://www.nebraska.gov/sos/corp/corpsearch.cgi
                    data:
                        search: 1
                        keyword_type: all
                        search_type: num_search
                        corpname: 
                        acct-num: 1000011010101
                        g-recaptcha-response: <%captcha%>
                        submit: submit
                do:
                - variable_clear: recap
                - find: 
                    path: body 
                    do:
                    # CHECK IF CAPTCHA TOKEN WAS ACCEPTED
                    - find: 
                        path: .g-recaptcha 
                        do: 
                        # IF THIS BLOCK EXISTS, THE TOKEN IS INVALID
                        - parse:
                            attr: data-sitekey
                        - variable_set: recap
                    - variable_get: recap
                    - if:
                        match: \S
                        do:
                        # CAPTCHA WAS RESOLVED WRONGLY, REPORTING
                        - captcha_report
                        else:
                        # CAPTCHA WAS SOLVED PROPERLY, EXTRACTING DATA

Next we are going to learn how to modify images and save them.