Captcha

Bypassing Captcha

While scraping the web, you have probably run into restrictions that webmasters impose on automated use of their sites. One such restriction is the captcha. A captcha is an automated Turing test that determines whether the user is a human or a robot. Usually the user is shown an image with letters and numbers and asked to type these characters into a field, or is shown a series of images and asked to select only those that match a given theme, for example only those that show cars or road signs.

There are many services and software products that let a webmaster add a captcha to a site. The best known are the Google ReCaptcha and Funcaptcha services. If a captcha is too complex for our OCR functionality, specialized services that solve captchas manually can help. We have integrated two such services: your diggers can pass a captcha to them automatically and receive in response either a special token or the manually recognized letters and digits from the picture.

To use this functionality, you need an account with one of these services. They are paid services, and you pay their charges yourself through your own account.

AntiCaptcha - a service for solving image captchas and Google ReCaptcha v2. For Google ReCaptcha v2 you can use the "proxyless" mode. In normal mode you must scrape through your own password-protected proxy, because AntiCaptcha workers need to access the website with the captcha through that same proxy server. The "proxyless" mode is useful if you do not have your own proxy servers, or do not want to use them and prefer to scrape through our proxy network.

DeathByCaptcha - offers almost the same as AntiCaptcha, but without the "proxyless" mode. For this provider we have integrated only Google ReCaptcha v2 solving. Therefore, if you want to use the "proxyless" mode or solve image captchas, use the AntiCaptcha service.

The captcha_resolve command solves a captcha and may be used in the block or page context. The process is fully automated on your side, but because a third-party service solves the captcha manually, it can take from 20 seconds to 2 minutes. When the command completes, the recognition result (or a token for ReCaptcha v2) is saved to the captcha variable.

This variable can then be read into the register and sent with a form or with a request to the server. The examples below show how to code the logic for solving ReCaptcha v2 and a standard image captcha.

The command uses the following parameters:

| Parameter | Description |
| --- | --- |
| provider | Mandatory parameter. The provider you are going to use to solve the captcha. Currently supported providers: deathbycaptcha.com and anticaptcha. |
| type | Type of captcha. Currently supported options: image to solve an image captcha (AntiCaptcha provider only), nocaptchav2 to solve Google ReCaptcha v2, and proxyless_recaptchav2 to solve Google ReCaptcha v2 in the "proxyless" mode (AntiCaptcha provider only). |
| image | If the captcha type is image, this parameter passes the captcha image encoded in base64. |
| username | If you use the Death By Captcha provider, this parameter should contain the username of your account on the deathbycaptcha.com platform. |
| password | If you use the Death By Captcha provider, this parameter should contain the password of your account on the deathbycaptcha.com platform. |
| apikey | If you use the AntiCaptcha provider, this parameter should contain your API key for the AntiCaptcha platform. |
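
The examples on this page all use the AntiCaptcha provider, so here is a minimal sketch of how the same parameters might be combined for the other documented cases. It assumes the provider and type values are written exactly as listed in the parameter descriptions. The credentials are placeholders, and the proxy setup that the normal (non-proxyless) mode requires is not shown.

```yaml
# Sketch only: all credentials below are placeholders.

# Google ReCaptcha v2 through Death By Captcha:
# authenticates with username and password instead of an API key.
- captcha_resolve:
    provider: deathbycaptcha.com
    type: nocaptchav2
    username: your_dbc_username
    password: your_dbc_password

# Google ReCaptcha v2 through AntiCaptcha in normal mode:
# authenticates with an API key; your own proxy is required in this mode.
- captcha_resolve:
    provider: anticaptcha
    type: nocaptchav2
    apikey: xxxxxxxxxxxxxxxxxxx
```

In both cases the token should end up in the captcha variable, the same as in the "proxyless" mode.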

Let's review a case where a page uses Google ReCaptcha v2. We will use the "proxyless" mode:

# LOADING THE PAGE WITH CAPTCHA
- walk:
    to: https://www.nebraska.gov/sos/corp/corpsearch.cgi
    do:
    # RESOLVING CAPTCHA
    - captcha_resolve:
        provider: anticaptcha
        type: proxyless_recaptchav2
        apikey: xxxxxxxxxxxxxxxxxxx
    - find: 
        path: body 
        do: 
        # CHECK IF WE HAVE A TOKEN IN THE captcha VARIABLE
        - variable_get: captcha 
        - if:
            match: \S
            do:
            # TOKEN IS OK, SENDING FORM
            - walk:
                to:
                    post: https://www.nebraska.gov/sos/corp/corpsearch.cgi
                    data:
                        search: 1
                        keyword_type: all
                        search_type: num_search
                        corpname: 
                        acct-num: 1000011010101
                        g-recaptcha-response: <%captcha%>
                        submit: submit
                do:
                # PARSE PAGE AND EXTRACT DATA
                
        

Another example, this time for an image captcha:

# LOAD PAGE WITH CAPTCHA
- walk:
    to: https://eservices.cmcoh.org/eservices/home.page
    headers:
        Wicket-Focusedelementid: ''
        Wicket-Ajax: ''
    do:
    # FIND ELEMENT WITH IMAGE CAPTCHA
    - find:
        path: img.captchaImg
        do:
        # PARSE URL TO THE IMAGE CAPTCHA
        - parse:
            attr: src
        # LOAD IMAGE IN BASE64 ENCODING
        - walk:
            to: value
            do:
            - find:
                path: imgbase64
                do:
                - parse
                - variable_set: image
                # RESOLVE CAPTCHA
                - captcha_resolve:
                    provider: anticaptcha
                    type: image
                    apikey: xxxxxxxxxxxxxxxxxxx
                    image: <%image%>
        - find:
            path: a.anchorButton
            do:
            - variable_get: captcha
            - if:
                match: \w+
                do:
                # CAPTCHA SEEMS RESOLVED SO HERE WE CAN SEND IT TO THE SERVER WITH THE FORM
            
        

In some cases, a captcha may be recognized incorrectly. If you detect this, you can report the incorrectly solved captcha with the captcha_report command. In such cases the captcha solving service usually reimburses what you paid for the job. Be extremely careful, though, and do not send a report if the captcha was recognized correctly; otherwise the service you are using to solve captchas may impose sanctions and limitations on your account.

# OPEN PAGE WITH THE CAPTCHA
- walk:
    to: https://www.nebraska.gov/sos/corp/corpsearch.cgi
    do:
    # RESOLVE CAPTCHA
    - captcha_resolve:
        provider: anticaptcha
        type: proxyless_recaptchav2
        apikey: xxxxxxxxxxxxxxxxxxx
    - find: 
        path: body 
        do: 
        # CHECK IF captcha VARIABLE HAS A TOKEN
        - variable_get: captcha 
        - if:
            match: \S
            do:
            # TOKEN IS THERE, SENDING FORM
            - walk:
                to:
                    post: https://www.nebraska.gov/sos/corp/corpsearch.cgi
                    data:
                        search: 1
                        keyword_type: all
                        search_type: num_search
                        corpname: 
                        acct-num: 1000011010101
                        g-recaptcha-response: <%captcha%>
                        submit: submit
                do:
                - variable_clear: recap
                - find: 
                    path: body 
                    do:
                    # CHECK IF CAPTCHA TOKEN WAS ACCEPTED
                    - find: 
                        path: .g-recaptcha 
                        do: 
                        # IF THIS BLOCK EXISTS, THE TOKEN IS INVALID
                        - parse:
                            attr: data-sitekey
                        - variable_set: recap
                    - variable_get: recap
                    - if:
                        match: \S
                        do:
                        # CAPTCHA WAS RESOLVED WRONGLY, REPORTING
                        - captcha_report
                        else:
                        # CAPTCHA WAS SOLVED PROPERLY, EXTRACTING DATA
            
        

Now it's time to learn more about geospatial data and how you can work with it on the Diggernaut platform.