While scraping the web you have probably run into restrictions that webmasters impose on automated use of their sites. One such restriction is the captcha. A captcha is an automated Turing test that determines whether the user is a human or a robot. Usually the user is shown an image with letters and numbers and asked to type these characters into a field, or is shown a series of images and asked to select only those that match a given theme, for example only those that show cars or road signs.
Many services and software products let a webmaster add a captcha to a site. The best known are the Google ReCaptcha and Funcaptcha services. If a captcha is too complex for our OCR functionality to bypass, specialized services that solve captchas manually can help. We have integrated two such services, so your diggers can hand a captcha over to them and automatically receive in response either a special token or the manually recognized letters and digits from the picture.
To use this functionality you need an account with one of these services. They are paid services, and you pay their charges yourself through your own account.
AntiCaptcha - a service for solving image captchas and Google ReCaptcha v2. For Google ReCaptcha v2 you can use the "proxyless" mode. In normal mode you must scrape through your own password-protected proxy, because AntiCaptcha workers need to access the website with the captcha through that same proxy server. The "proxyless" mode is useful for those who do not have their own proxy servers, or do not want to use them, and prefer to scrape through our proxy network.
DeathByCaptcha - can do almost the same as AntiCaptcha, except for the "proxyless" mode. For this provider we have integrated only Google ReCaptcha v2 solving. Therefore, if you want to use the "proxyless" mode or solve image captchas, use the AntiCaptcha service.
The captcha_resolve command resolves a captcha and may be used in the block or page context. The process is fully automated on your side, but because the captcha is solved manually by a third-party service, it can take from 20 seconds to 2 minutes. When the command completes, the recognition result (or the token for ReCaptcha v2) is saved to the captcha variable.
This variable can then be read into the register and sent with the validation form or with a request to the server. The examples below show how to code the logic for resolving ReCaptcha v2 and a standard image captcha.
The command uses the following parameters:
|provider||Mandatory parameter. Indicates the provider you are going to use for solving the captcha. At the moment the following providers are supported: deathbycaptcha.com and anticaptcha.|
|type||Type of captcha. At the moment the following options are supported: image to resolve an image captcha (works only with the AntiCaptcha provider), nocaptchav2 to resolve Google ReCaptcha v2, and proxyless_recaptchav2 to resolve Google ReCaptcha v2 in the "proxyless" mode (works only with the AntiCaptcha provider).|
|image||If the captcha type is image, this parameter is used to pass the captcha image, encoded in base64.|
|username||If you are using the Death By Captcha provider, this parameter should contain the username of your account on the deathbycaptcha.com platform.|
|password||If you are using the Death By Captcha provider, this parameter should contain the password of your account on the deathbycaptcha.com platform.|
|apikey||If you are using the AntiCaptcha provider, this parameter should contain your API key for the AntiCaptcha platform.|
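As a minimal sketch of how these parameters might combine for the Death By Captcha provider, the fragment below uses only the commands and parameters described on this page. It assumes the provider value is spelled exactly as listed in the table (deathbycaptcha.com); the credentials are placeholders. Since this provider has no "proxyless" mode, it also assumes your digger already scrapes through your own password-protected proxy, which is not shown here.

```yaml
# Sketch only: placed in the page context, right after the page with the captcha is loaded
- captcha_resolve:
    provider: deathbycaptcha.com
    type: nocaptchav2
    # Placeholder credentials for your own deathbycaptcha.com account
    username: your_username
    password: your_password
- find:
    path: body
    do:
    # READ THE captcha VARIABLE INTO THE REGISTER AND CHECK THAT IT IS NOT EMPTY
    - variable_get: captcha
    - if:
        match: \S
        do:
        # TOKEN RECEIVED, SEND IT WITH THE FORM
```

Note that apikey is omitted: according to the table, it applies only to the AntiCaptcha provider.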
Let's review the case when a page uses Google ReCaptcha v2. We will use the "proxyless" mode:

    # LOADING THE PAGE WITH CAPTCHA
    - walk:
        to: https://www.nebraska.gov/sos/corp/corpsearch.cgi
        do:
        # RESOLVING CAPTCHA
        - captcha_resolve:
            provider: anticaptcha
            type: proxyless_recaptchav2
            apikey: xxxxxxxxxxxxxxxxxxx
        - find:
            path: body
            do:
            # CHECK IF WE HAVE A TOKEN IN THE captcha VARIABLE
            - variable_get: captcha
            - if:
                match: \S
                do:
                # TOKEN IS OK, SENDING FORM
                - walk:
                    to:
                        post: https://www.nebraska.gov/sos/corp/corpsearch.cgi
                        data:
                            search: 1
                            keyword_type: all
                            search_type: num_search
                            corpname:
                            acct-num: 1000011010101
                            g-recaptcha-response: <%captcha%>
                            submit: submit
                    do:
                    # PARSE PAGE AND EXTRACT DATA

Another example for image captcha:

    # LOAD PAGE WITH CAPTCHA
    - walk:
        to: https://eservices.cmcoh.org/eservices/home.page
        headers:
            Wicket-Focusedelementid: ''
            Wicket-Ajax: ''
        do:
        # FIND ELEMENT WITH IMAGE CAPTCHA
        - find:
            path: img.captchaImg
            do:
            # PARSE URL TO THE IMAGE CAPTCHA
            - parse:
                attr: src
            # LOAD IMAGE IN BASE64 ENCODING
            - walk:
                to: value
                do:
                - find:
                    path: imgbase64
                    do:
                    - parse
                    - variable_set: image
        # RESOLVE CAPTCHA
        - captcha_resolve:
            provider: anticaptcha
            type: image
            apikey: xxxxxxxxxxxxxxxxxxx
            image: <%image%>
        - find:
            path: a.anchorButton
            do:
            - variable_get: captcha
            - if:
                match: \w+
                do:
                # CAPTCHA SEEMS RESOLVED SO HERE WE CAN SEND IT TO THE SERVER WITH THE FORM

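The last comment in the example above leaves the form submission open. A minimal sketch of that step, reusing the walk POST pattern from the ReCaptcha example, could look like the fragment below. The URL and the captcha_text field name are hypothetical placeholders, not taken from the target site; the real values have to be read from the form on the page you are scraping.

```yaml
# Hypothetical continuation: this would sit inside the "if: match: \w+" branch
- walk:
    to:
        # Placeholder URL, replace it with the real form action
        post: https://example.com/form-handler
        data:
            # Placeholder field name, use the name of the captcha input on the real form
            captcha_text: <%captcha%>
    do:
    # PARSE THE RESPONSE AND EXTRACT DATA
```

The recognized text is substituted through <%captcha%> in the same way as the token in the ReCaptcha v2 example.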
In some cases a captcha may be recognized incorrectly. If you detect this, you can report the incorrectly solved captcha with the captcha_report command. In such cases the captcha solving service usually refunds what you paid for the job. Be extremely careful, however, and do not send such a report if the captcha was recognized correctly; otherwise the service you are using to solve captchas can impose sanctions and limitations on your account.

    # OPEN PAGE WITH THE CAPTCHA
    - walk:
        to: https://www.nebraska.gov/sos/corp/corpsearch.cgi
        do:
        # RESOLVE CAPTCHA
        - captcha_resolve:
            provider: anticaptcha
            type: proxyless_recaptchav2
            apikey: xxxxxxxxxxxxxxxxxxx
        - find:
            path: body
            do:
            # CHECK IF captcha VARIABLE HAS A TOKEN
            - variable_get: captcha
            - if:
                match: \S
                do:
                # TOKEN IS THERE, SENDING FORM
                - walk:
                    to:
                        post: https://www.nebraska.gov/sos/corp/corpsearch.cgi
                        data:
                            search: 1
                            keyword_type: all
                            search_type: num_search
                            corpname:
                            acct-num: 1000011010101
                            g-recaptcha-response: <%captcha%>
                            submit: submit
                    do:
                    - variable_clear: recap
                    - find:
                        path: body
                        do:
                        # CHECK IF CAPTCHA TOKEN WAS ACCEPTED
                        - find:
                            path: .g-recaptcha
                            do:
                            # IF THIS BLOCK EXISTS, THE TOKEN IS INVALID
                            - parse:
                                attr: data-sitekey
                            - variable_set: recap
                        - variable_get: recap
                        - if:
                            match: \S
                            do:
                            # CAPTCHA WAS RESOLVED WRONGLY, REPORTING
                            - captcha_report
                            else:
                            # CAPTCHA WAS SOLVED PROPERLY, EXTRACTING DATA

Now it's time to learn more about geospatial data and how you can work with it on the Diggernaut platform.