{"id":441,"date":"2018-02-28T20:29:51","date_gmt":"2018-02-28T20:29:51","guid":{"rendered":"https:\/\/www.diggernaut.com\/blog\/?p=441"},"modified":"2019-01-12T13:00:48","modified_gmt":"2019-01-12T13:00:48","slug":"bypass-captcha-diggernaut-web-scraping-platform-solving-google-recaptcha-v2","status":"publish","type":"post","link":"https:\/\/www.diggernaut.com\/blog\/bypass-captcha-diggernaut-web-scraping-platform-solving-google-recaptcha-v2\/","title":{"rendered":"How to bypass captcha on Diggernaut web scraping platform: Solving Google reCaptcha v2"},"content":{"rendered":"<p>Google reCaptcha v2 is no longer a problem for our users. We recently integrated the Death by Captcha service, so you can now bypass this captcha easily.<\/p>\n<p>Let\u2019s see what reCaptcha v2 looks like:<\/p>\n<figure id=\"attachment_mmd_442\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/recaptchav21.jpg\"><img width=\"1905\" height=\"525\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/recaptchav21.jpg\" class=\"attachment-full size-full\" alt=\"Bypass captcha: identifying reCaptcha v2 by token API\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/recaptchav21.jpg 1905w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/recaptchav21-768x212.jpg 768w\" sizes=\"auto, (max-width: 1905px) 100vw, 1905px\" \/><\/a><\/figure>\n<p>If you see this captcha on a website you need to scrape, read on: this guide shows how to handle it in your web scraper, using a real example. 
We are going to scrape this website: <a href=\"http:\/\/www.receita.fazenda.gov.br\/PessoaJuridica\/CNPJ\/cnpjreva\/Cnpjreva_Solicitacao2.asp\">http:\/\/www.receita.fazenda.gov.br\/PessoaJuridica\/CNPJ\/cnpjreva\/Cnpjreva_Solicitacao2.asp<\/a>.<\/p>\n<p>To use this solution, you need an account with the <a href=\"http:\/\/www.deathbycaptcha.com\">Death by Captcha<\/a> service. The service is not free: solving 1000 such captchas costs approximately $2.89 (the price as of February 28, 2018).<\/p>\n<p>Diggernaut solves the captcha automatically. You need to load the page with the captcha and call the special command <strong>captcha_resolve<\/strong> with the following parameters:<\/p>\n<p><strong>provider<\/strong>: captcha solution provider, should be set to <em>deathbycaptcha.com<\/em>\n<br>\n<strong>type<\/strong>: captcha type, should be set to <em>nocaptchav2<\/em>\n<br>\n<strong>username<\/strong>: your Death by Captcha account username<br>\n<strong>password<\/strong>: your Death by Captcha account password<\/p>\n<p>Also, please <strong>NOTE!<\/strong> For the captcha to be resolved successfully, the people who solve it must be able to open the page with the captcha from the same IP address as your web scraper. The only way to achieve this is to use <strong>YOUR OWN PROXY SERVER<\/strong> in the digger configuration. 
Our rotating proxies cannot be accessed from IP addresses outside of our network, so they cannot be used for this operation.<\/p>\n<p>The base code for a scraper that resolves the captcha looks like this:<\/p>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">---\nconfig:\n    debug: 2\n    agent: Firefox\n    proxy: PUT YOUR OWN PROXY HERE\ndo:\n# We retry the call until the captcha is resolved successfully (the success rate is not 100%), so we set a variable to use in the walk command\n- variable_set:\n    field: repeat\n    value: &quot;yes&quot;\n# Loading page with captcha\n- walk:\n    to: http:\/\/www.receita.fazenda.gov.br\/PessoaJuridica\/CNPJ\/cnpjreva\/cnpjreva_solicitacao2.asp\n    repeat: &lt;%repeat%&gt;\n    do:\n        # Resolving captcha\n    - captcha_resolve:\n        provider: deathbycaptcha.com\n        type: nocaptchav2\n        username: YOUR DBC USERNAME HERE\n        password: YOUR DBC PASSWORD HERE<\/code><\/pre>\n<p><strong>Don\u2019t run this code yet.<\/strong> If the captcha is resolved successfully, you get the token value in the <strong>captcha<\/strong> variable. 
So the first thing to do after the captcha resolve step is to check whether the variable has a value and, depending on the result, either retry the call or submit the token to the server and get the results.<\/p>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">---\nconfig:\n    debug: 2\n    agent: Firefox\n    proxy: PUT YOUR OWN PROXY HERE\ndo:\n# We retry the call until the captcha is resolved successfully (the success rate is not 100%), so we set a variable to use in the walk command\n- variable_set:\n    field: repeat\n    value: &quot;yes&quot;\n# Loading page with captcha\n- walk:\n    to: http:\/\/www.receita.fazenda.gov.br\/PessoaJuridica\/CNPJ\/cnpjreva\/cnpjreva_solicitacao2.asp\n    repeat: &lt;%repeat%&gt;\n    do:\n        # Resolving captcha\n    - captcha_resolve:\n        provider: deathbycaptcha.com\n        type: nocaptchav2\n        username: YOUR DBC USERNAME HERE\n        password: YOUR DBC PASSWORD HERE\n        # Switching to the body block\n    - find:\n        path: body\n        do:\n                # Reading the captcha variable value into the register\n        - variable_get: captcha\n                # Checking whether the register holds a non-empty value\n        - if:\n            match: \\w+\n            do:\n                        # If it is not empty, turn off repeat mode for the walk command\n            - variable_set:\n                field: repeat\n                value: &quot;no&quot;\n                        # Submit the token along with the other form data required to look up a company by its CNPJ number\n            - walk:\n                to:\n                    post: http:\/\/www.receita.fazenda.gov.br\/PessoaJuridica\/CNPJ\/cnpjreva\/valida_recaptcha.asp\n                    data:\n                        origem: comprovante\n                        cnpj: 05754558000186\n                        g-recaptcha-response: &lt;%captcha%&gt;\n                        
submit1: Consultar\n                        search_type: cnpj\n                do:\n                - find:\n                    path: &#039;div#principal&#039;\n                    do:\n                    - object_new: item\n                    - find:\n                        path: td:haschild(font:contains(&#039;N\u00daMERO DE INSCRI\u00c7\u00c3O&#039;)) b\n                        slice: 0\n                        do:\n                        - parse\n                        - space_dedupe\n                        - trim\n                        - object_field_set:\n                            object: item\n                            field: registration_number\n                    - find:\n                        path: td:haschild(font:contains(&#039;DATA DE ABERTURA&#039;)) b\n                        slice: 0\n                        do:\n                        - parse\n                        - space_dedupe\n                        - trim\n                        - object_field_set:\n                            object: item\n                            field: registration_date\n                    - find:\n                        path: td:haschild(font:contains(&#039;NOME EMPRESARIAL&#039;)) b\n                        slice: 0\n                        do:\n                        - parse\n                        - space_dedupe\n                        - trim\n                        - object_field_set:\n                            object: item\n                            field: company_name\n                    - find:\n                        path: td:haschild(font:contains(&#039;C\u00d3DIGO E DESCRI\u00c7\u00c3O DA ATIVIDADE ECON\u00d4MICA PRINCIPAL&#039;)) b\n                        slice: 0\n                        do:\n                        - parse\n                        - space_dedupe\n                        - trim\n                        - object_field_set:\n                            object: item\n                            field: 
primary_code\n                    - find:\n                        path: td:haschild(font:contains(&#039;C\u00d3DIGO E DESCRI\u00c7\u00c3O DAS ATIVIDADES ECON\u00d4MICAS SECUND\u00c1RIAS&#039;)) b\n                        do:\n                        - parse\n                        - space_dedupe\n                        - trim\n                        - object_field_push:\n                            object: item\n                            field: secondary_codes\n                    - find:\n                        path: td:haschild(font:contains(&#039;C\u00d3DIGO E DESCRI\u00c7\u00c3O DA NATUREZA JUR\u00cdDICA&#039;)) b\n                        slice: 0\n                        do:\n                        - parse\n                        - space_dedupe\n                        - trim\n                        - object_field_set:\n                            object: item\n                            field: legal_code\n                    - find:\n                        path: td:haschild(font:contains(&#039;LOGRADOURO&#039;)) b\n                        slice: 0\n                        do:\n                        - parse\n                        - space_dedupe\n                        - trim\n                        - object_field_set:\n                            object: item\n                            field: street\n                    - find:\n                        path: td:haschild(font:contains(&#039;BAIRRO\/DISTRITO&#039;)) b\n                        slice: 0\n                        do:\n                        - parse\n                        - space_dedupe\n                        - trim\n                        - object_field_set:\n                            object: item\n                            field: district\n                    - find:\n                        path: td:haschild(font:contains(&#039;MUNIC\u00cdPIO&#039;)) b\n                        slice: 0\n                        do:\n                        - parse\n                     
   - space_dedupe\n                        - trim\n                        - object_field_set:\n                            object: item\n                            field: municipal\n                    - find:\n                        path: td:haschild(font:contains(&#039;TELEFONE&#039;)) b\n                        slice: 0\n                        do:\n                        - parse\n                        - space_dedupe\n                        - trim\n                        - object_field_set:\n                            object: item\n                            field: phone\n                    - find:\n                        path: td:haschild(font:contains(&#039;E-MAIL&#039;)) b\n                        slice: 0\n                        do:\n                        - parse\n                        - space_dedupe\n                        - trim\n                        - object_field_set:\n                            object: item\n                            field: email\n                    - object_save:\n                        name: item<\/code><\/pre>\n<p>You should get the following data when you run it.<\/p>\n<pre><code class=\"language-js\">[{\n    &quot;item&quot;: {\n        &quot;company_name&quot;: &quot;LIST - LOGISTICA INTEGRADA, SERVICOS E TRANSPORTES LTDA&quot;,\n        &quot;district&quot;: &quot;COROADO&quot;,\n        &quot;legal_code&quot;: &quot;206-2 - Sociedade Empres\u00e1ria Limitada&quot;,\n        &quot;municipal&quot;: &quot;MANAUS&quot;,\n        &quot;phone&quot;: &quot;(92) 3622-7885&quot;,\n        &quot;primary_code&quot;: &quot;52.12-5-00 - Carga e descarga&quot;,\n        &quot;registration_date&quot;: &quot;26\/06\/2003&quot;,\n        &quot;registration_number&quot;: &quot;05.754.558\/0001-86&quot;,\n        &quot;secondary_codes&quot;: [\n            &quot;49.30-2-02 - Transporte rodovi\u00e1rio de carga, exceto produtos perigosos e mudan\u00e7as, intermunicipal, interestadual e internacional&quot;,\n            
&quot;77.39-0-99 - Aluguel de outras m\u00e1quinas e equipamentos comerciais e industriais n\u00e3o especificados anteriormente, sem operador&quot;,\n            &quot;77.19-5-99 - Loca\u00e7\u00e3o de outros meios de transporte n\u00e3o especificados anteriormente, sem condutor&quot;,\n            &quot;52.29-0-99 - Outras atividades auxiliares dos transportes terrestres n\u00e3o especificadas anteriormente&quot;,\n            &quot;49.30-2-01 - Transporte rodovi\u00e1rio de carga, exceto produtos perigosos e mudan\u00e7as, municipal&quot;,\n            &quot;52.50-8-03 - Agenciamento de cargas, exceto para o transporte mar\u00edtimo&quot;\n        ],\n        &quot;street&quot;: &quot;R PROFESSORA RAYMUNDA MAGALHAES&quot;\n    }\n}\n]\n<\/code><\/pre>\n<p>This solution can be used in a \u201cData on demand\u201d scenario: someone enters a number on your website, and your website sends a command to the Diggernaut API to run the digger with a specific incoming parameter. The API then returns the results to your website, which renders them for your user. Please note that the captcha resolve process is not very fast. For example, resolving reCaptcha v2 may take more than a minute.<\/p>\n<p>We hope this article was useful and that you now know how to bypass reCaptcha v2 with Diggernaut and Death by Captcha. We are planning to implement other captcha solutions soon, so please stay tuned.<\/p>","protected":false},"excerpt":{"rendered":"<p>Google reCaptcha v2 is no longer a problem for our users. We recently integrated the Death by Captcha service, so you can now bypass this captcha easily. 
Let\u2019s see what reCaptcha v2 looks like: If you see this captcha on a website you need to scrape, you can continue reading as we are going [&hellip;]<\/p>","protected":false},"author":4,"featured_media":447,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[32,9,2],"tags":[],"class_list":["post-441","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-codeproject","category-learning-meta-language","category-web-scraping"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/441","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/comments?post=441"}],"version-history":[{"count":7,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/441\/revisions"}],"predecessor-version":[{"id":634,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/441\/revisions\/634"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media\/447"}],"wp:attachment":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media?parent=441"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/categories?post=441"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/tags?post=441"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}