{"id":711,"date":"2019-01-20T15:43:39","date_gmt":"2019-01-20T15:43:39","guid":{"rendered":"https:\/\/www.diggernaut.com\/blog\/?p=711"},"modified":"2020-05-03T00:56:47","modified_gmt":"2020-05-03T00:56:47","slug":"build-web-scraper-for-amazon-in-30-min","status":"publish","type":"post","link":"https:\/\/www.diggernaut.com\/blog\/build-web-scraper-for-amazon-in-30-min\/","title":{"rendered":"Build web scraper for Amazon in 30 min"},"content":{"rendered":"<p>Today we are going to build a web scraper for Amazon.com. The tool will be designed to collect basic information about products from a specific category. If you wish, you can expand the dataset to be collected on your own. Or, if you do not want to spend your time, you have the opportunity <a href=\"https:\/\/www.diggernaut.com\/hire\/\">to hire our developers<\/a>.<\/p>\n<h3>Important points before starting development<\/h3>\n<p>Amazon renders products depending on a geo-factor, which is determined by the client&#8217;s IP address. Therefore, if you are interested in information for the US market, you should use a proxy from the USA. In our Diggernaut platform, you can specify geo-targeting to a specific country using the <a href=\"https:\/\/www.diggernaut.com\/dev\/meta-language-basic-settings-proxies.html\">geo<\/a> option. However, it only works with paid subscription plans. With a free account, you can use your own proxy servers. How to use them is described in our documentation at the link above. If you do not need targeting by country, you can omit any settings in the proxy section. In this case, mixed proxies from our pool will be used (assuming, of course, that you run the web scraper in the cloud). To reduce the chance of blocking, we will also use pauses between requests.<\/p>\n<p>There is one more thing we want to tell you about. Amazon can temporarily block an IP address from which automated requests come, and it can do so in different ways: for example, by showing a captcha or an error page. 
Therefore, for the scraper to work successfully, we need to think about how it will catch and bypass these cases.<\/p>\n<h3>Bypassing Amazon.com captcha<\/h3>\n<figure id=\"attachment_mmd_712\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_captcha.png\"><img width=\"1897\" height=\"563\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_captcha.png\" class=\"attachment-full size-full\" alt=\"Bypassing Amazon.com captcha\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_captcha.png 1897w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_captcha-300x89.png 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_captcha-768x228.png 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_captcha-1024x304.png 1024w\" sizes=\"auto, (max-width: 1897px) 100vw, 1897px\" \/><\/a><\/figure>\n<p>We will bypass the captcha with our internal captcha solver. Since this mechanism works as a microservice, it is available only when running the digger in the cloud, but it is free for all users of the Diggernaut platform. If you want to run the compiled digger on your computer, you will need to use one of the integrated services to solve the captcha: <a href=\"http:\/\/getcaptchasolution.com\/djlpm4vcub\">Anti-captcha<\/a> or <a href=\"https:\/\/2captcha.com\/?from=7106312\">2Captcha<\/a>. You will also need your own account with one of these services. In addition, you will have to change the scraper code a little. 
Namely, configure the parameters of the <a href=\"https:\/\/www.diggernaut.com\/dev\/meta-language-methods-captcha-bypassing-captcha.html\">captcha_resolve<\/a> command.<\/p>\n<h3>Bypassing the access error<\/h3>\n<figure id=\"attachment_mmd_713\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_error.png\"><img width=\"1915\" height=\"880\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_error.png\" class=\"attachment-full size-full\" alt=\"Bypassing the access error\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_error.png 1915w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_error-300x138.png 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_error-768x353.png 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_error-1024x471.png 1024w\" sizes=\"auto, (max-width: 1915px) 100vw, 1915px\" \/><\/a><\/figure>\n<p>To bypass the access error, we will use proxy rotation and the repeat mode of the walk command. This mode allows us to loop the page request until we confirm that everything went well. When rotating proxies, the digger selects the next proxy from the list; when the list ends, the scraper returns to the first proxy. This function works both with proxies specified in the config by the user and with proxies in our cloud, which all users of the Diggernaut platform have access to.<\/p>\n<h3>Amazon Scraping Algorithm<\/h3>\n<p>Since the category has a paginator and many catalog pages with products, we will use a pool. It allows us to describe the parsing logic only once, for the entire pool, rather than for each page separately. Take into account that the maximum number of pages in one category (or search query) given by Amazon is 400. 
Therefore, if there are more than 8000 products in your category and you want to collect as many as possible, you need to narrow the parameters of the search query, or you should collect products from subcategories. Our web scraper will be able to extract product information for any search request, so you can configure all the query filters in the browser and use the URL from the browser address bar as the start page in the config.<\/p>\n<p>The algorithm will be as follows. We create a pool and put the start page in it. Then we iterate over the pool, loading the next page from it. We check whether a captcha page was returned; if so, we solve it and reload the page. We also check whether Amazon has returned the access error; if so, we switch the proxy and reload the page. If all checks pass, the web scraper parses the page and collects information about the products. Finally, we find the paginator, extract the link to the next page, and add it to the pool.<\/p>\n<p>Let&#8217;s look at the start page and define the CSS selectors that we need to find the paginator, the block with the product information, and the fields to be extracted. 
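Sketched in the Diggernaut meta-language, the page loop just described looks roughly like this (a simplified skeleton; the `rip` flag and all the commands come from the complete config given later in this post, and the checks are only hinted at in comments):

```yaml
# Simplified skeleton of the page loop (checks and parsing omitted)
- link_add:
    url:
    - <start page URL>
# Repeat flag: while it is "yes", walk reloads the current page
- variable_set:
    field: rip
    value: "yes"
- walk:
    to: links
    repeat_in_pool: <%rip%>
    do:
    # ... solve the captcha or switch the proxy and keep rip = "yes",
    # or parse listings, queue the next page with link_add,
    # and set rip to "no" to move on to the next pooled page ...
```
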
To define them, we need Google Chrome and the developer tools built into it.<\/p>\n<h3>Find the CSS selector for listings (products)<\/h3>\n<p>Open the following URL in your browser:<br>\n<code>https:\/\/www.amazon.com\/s?bbn=16225011011&amp;rh=n%3A%2116225011011%2Cn%3A284507%2Cn%3A289913%2Cn%3A289940&amp;dc&amp;fst=as%3Aoff&amp;ie=UTF8&amp;qid=1547931533&amp;rnid=289913&amp;ref=sr_nr_n_1<\/code><\/p>\n<p>Right-click on the page and select the &#8220;Inspect&#8221; option.<\/p>\n<figure id=\"attachment_mmd_715\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_open_dev_tools.png\"><img width=\"1524\" height=\"531\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_open_dev_tools.png\" class=\"attachment-full size-full\" alt=\"Open developer tools\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_open_dev_tools.png 1524w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_open_dev_tools-300x105.png 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_open_dev_tools-768x268.png 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_open_dev_tools-1024x357.png 1024w\" sizes=\"auto, (max-width: 1524px) 100vw, 1524px\" \/><\/a><\/figure>\n<p>It will open the developer tools panel. We are interested in the tool for selecting an element on the page, so we activate this tool and click the first product block on the page. 
In the window with the source code of the page, you will immediately see the selected item.<\/p>\n<figure id=\"attachment_mmd_717\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/Amazon_select_listing.png\"><img width=\"1582\" height=\"695\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/Amazon_select_listing.png\" class=\"attachment-full size-full\" alt=\"Select Amazon listing\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/Amazon_select_listing.png 1582w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/Amazon_select_listing-300x132.png 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/Amazon_select_listing-768x337.png 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/Amazon_select_listing-1024x450.png 1024w\" sizes=\"auto, (max-width: 1582px) 100vw, 1582px\" \/><\/a><\/figure>\n<p>If you look closely at the HTML code in this part, you will see that all listings have the class <code>s-result-item<\/code>. 
Therefore, our listing selector will be: <code>div.s-result-item<\/code>.<\/p>\n<figure id=\"attachment_mmd_716\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_selecting_listings.png\"><img width=\"1898\" height=\"711\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_selecting_listings.png\" class=\"attachment-full size-full\" alt=\"CSS selector for Amazon listing\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_selecting_listings.png 1898w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_selecting_listings-300x112.png 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_selecting_listings-768x288.png 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_selecting_listings-1024x384.png 1024w\" sizes=\"auto, (max-width: 1898px) 100vw, 1898px\" \/><\/a><\/figure>\n<h3>Find the paginator and define a CSS selector for the link to the next page<\/h3>\n<p>The paginator is at the bottom of the page.<\/p>\n<figure id=\"attachment_mmd_718\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_paginator.png\"><img width=\"1754\" height=\"354\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_paginator.png\" class=\"attachment-full size-full\" alt=\"Amazon paginator\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_paginator.png 1754w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_paginator-300x61.png 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_paginator-768x155.png 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_paginator-1024x207.png 1024w\" 
sizes=\"auto, (max-width: 1754px) 100vw, 1754px\" \/><\/a><\/figure>\n<p>In the same way as when finding the listing selector, open the developer tools and select the &#8220;Next&#8221; button. We will see the selected element in the HTML source code of the page. It is the <code>a<\/code> tag inside the <code>li<\/code> tag with the <code>a-last<\/code> class. Therefore, our selector will be like this: <code>li.a-last &gt; a<\/code>.<\/p>\n<figure id=\"attachment_mmd_719\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_next_page.png\"><img width=\"1657\" height=\"540\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_next_page.png\" class=\"attachment-full size-full\" alt=\"Next page of paginator\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_next_page.png 1657w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_next_page-300x98.png 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_next_page-768x250.png 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_next_page-1024x334.png 1024w\" sizes=\"auto, (max-width: 1657px) 100vw, 1657px\" \/><\/a><\/figure>\n<p>Now we have selectors for listings and the next page in the catalog. 
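In the digger config, following the next-page link then reduces to finding the button, parsing its href, and pushing it back to the pool (a sketch mirroring a fragment of the full config shown later in this post):

```yaml
# Find the "Next" button and queue the next catalog page in the pool
- find:
    path: li.a-last > a
    do:
    - parse:
        attr: href
    # Only queue it if the href actually has a value
    - if:
        match: \w+
        do:
        - link_add
```
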
We can proceed to extracting selectors for the product fields.<\/p>\n<h3>Selectors for listing fields<\/h3>\n<figure id=\"attachment_mmd_720\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_fields_en.png\"><img width=\"1616\" height=\"255\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_fields_en.png\" class=\"attachment-full size-full\" alt=\"Listing field selectors\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_fields_en.png 1616w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_fields_en-300x47.png 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_fields_en-768x121.png 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_fields_en-1024x162.png 1024w\" sizes=\"auto, (max-width: 1616px) 100vw, 1616px\" \/><\/a><\/figure>\n<p>We can search for selectors on any of the products. The algorithm is the same as for any other selector: activate the element-selection tool, click the element on the page, study the highlighted HTML fragment, and derive a selector. But first, let&#8217;s take a close look at the listing element itself: <code>div.s-result-item<\/code>. This tag has a <code>data-asin<\/code> attribute, which stores the ASIN (a unique identifier of the product variation on Amazon). Having this ASIN, you can easily access the product page, since its URL can be formed using the following template: <code>https:\/\/www.amazon.com\/dp\/&lt;%ASIN%&gt;<\/code>, where <code>&lt;%ASIN%&gt;<\/code> is the Amazon ASIN of the product. In the same way, a link to the page with offers from other sellers on Amazon can be formed: <code>https:\/\/www.amazon.com\/gp\/offer-listing\/&lt;%ASIN%&gt;<\/code>. Therefore, we have to collect it and store it in a data object field. 
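Sketched in config terms, collecting the ASIN and building the product URL looks like this (an excerpt based on the full config later in this post; it assumes we have already switched into the listing block with `find` and created the `item` data object):

```yaml
# Inside the div.s-result-item block: read the ASIN and build the product URL
- parse:
    attr: data-asin
- object_field_set:
    object: item
    field: asin
# Save the ASIN to a variable and expand it into the URL template
- variable_set: asin
- register_set: https://www.amazon.com/dp/<%asin%>
- object_field_set:
    object: item
    field: url
```
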
The selector is not needed for it because it matches the listing selector. So when we switch to the listing block, we need to parse the attribute <code>data-asin<\/code> of the current block.<\/p>\n<p><strong>Product name<\/strong> &#8211; <code>h5<\/code>. There is only one h5 element in the listing block, so you can safely use this selector.<\/p>\n<p><strong>Brand<\/strong>. The brand is not as easy to extract, because all classes in the listing block are generalized and the brand block does not have a unique class or id. Therefore, we need to find some anchor to use. We know that we have only one h5 in the listing and that the block with the brand is in the same parent block as h5. It means that we can select the parent block using the <code>haschild<\/code> directive. This directive lets you select an element that has a direct child element specified in the selector. In this case, the selector for the parent block will be: <code>div.a-section:haschild(h5)<\/code>. Now we need to add a selector for the block with the brand relative to its parent block: <code>div.a-color-secondary<\/code>. As a result, we get the following selector: <code>div.a-section:haschild(h5) &gt; div.a-color-secondary<\/code>. We can also see that in this block the brand is listed with the prefix &#8220;by&#8221;. Therefore, we will have to clean the data before storing it in the data object field using the <a href=\"https:\/\/www.diggernaut.com\/dev\/meta-language-methods-working-with-register-normalize.html\">normalize<\/a> function.<\/p>\n<p><strong>Rating and the number of reviews<\/strong>. We see that the rating and the number of reviews are in <code>span<\/code> tags, and these tags have an <code>aria-label<\/code> attribute. To select all the elements with this attribute, we can use the following selector: <code>span[aria-label]<\/code>. However, there may be 2, 3, 4, or even 5 such elements in one listing. What should we do in this case? 
The <code>slice<\/code> option for the <a href=\"https:\/\/www.diggernaut.com\/dev\/meta-language-methods-navigation-find.html\">find<\/a> command comes to the rescue. With it, we can select only the first element found (rating) and the second (number of reviews). For both fields, we will parse the contents of the <code>aria-label<\/code> attribute. However, the values in these fields contain additional text and symbols, and we want to store numeric values in our data object. If we use the int and float types when <a href=\"https:\/\/www.diggernaut.com\/dev\/meta-language-methods-entity-manipulations-data-objects.html\">storing the object field<\/a>, then after exporting the dataset to Excel, numeric filters and sorting will work properly for numeric columns. In addition, using the <a href=\"https:\/\/www.diggernaut.com\/dev\/website-projects-data-validation.html\">validation scheme for dataset<\/a>, you can filter out unnecessary records by numeric value using numeric filters. So, to extract numeric values, we will use the <code>filter<\/code> option of the <a href=\"https:\/\/www.diggernaut.com\/dev\/meta-language-methods-working-with-register-parse.html\">parse<\/a> command.<\/p>\n<p><strong>Price<\/strong>. Everything is simple. The price is in the <code>span<\/code> tag with the class <code>a-price<\/code>. Inside this element, the price is presented in several formats, so it is easier to extract it from the element: <code>span.a-price &gt; span.a-offscreen<\/code>. However, on some listings, there may be two prices if the product is sold with a discount. Therefore, we will use the <code>slice<\/code> option and select the first element found (element with index 0, since the numbering of the elements of the array starts from 0).<\/p>\n<p><strong>Prime<\/strong> &#8211; an icon indicating whether the product has free express delivery with the Amazon Prime subscription. 
The selector for this element is also simple since this icon has a unique class: <code>i.a-icon-prime<\/code>. It will work as follows. We store the default value (&#8220;no&#8221;) in the &#8220;prime&#8221; field. Then we search for the block with the icon and switch into it. Then we store &#8220;yes&#8221; in the &#8220;prime&#8221; field. If the icon is in the listing, the scraper will go to this block and execute the specified commands. If not, the default value will remain in the field.<\/p>\n<p><strong>Number of sellers<\/strong>. We are not interested in sellers of used items, so we will collect only those items which indicate &#8220;new offers&#8221;. It will not give us entirely accurate data, because sometimes there are both new and used product offers. So if you need to have exact numbers, you have to scrape the page with offers from other sellers. But in this particular case, we are ok with it. The selector for any link is <code>a<\/code>. However, in a listing block, there may be more than one link. Therefore, we are going to use the <code>contains<\/code> directive, which matches only elements whose text contains the specified string. Our selector should be like this: <code>a:contains(&quot;new offers&quot;)<\/code>.<\/p>\n<p><strong>Link to the full-sized image<\/strong>. The image selector is very simple: <code>img.s-image<\/code>. However, the image in the src attribute is not full-sized. How to fix it? Let us share a little secret with you. To make a full-sized image from a trimmed one on Amazon, you need to delete a small piece of the URL. Suppose in the src attribute we have the following URL:<br>\n<code>https:\/\/m.media-amazon.com\/images\/I\/81pMpXtqWrL._AC_UL436_.jpg<\/code>. All we have to do is remove the <code>_AC_UL436_<\/code> string between the dots before the file extension and remove one of the two dots. 
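In the digger config, this URL cleanup can be done with the normalize command&#8217;s replace_substring routine (a sketch mirroring the fragment used in the full config later; saving the result to the object field is omitted here):

```yaml
# Make the image URL full-sized by stripping the size token
# e.g. "81pMpXtqWrL._AC_UL436_.jpg" becomes "81pMpXtqWrL.jpg"
- find:
    path: img.s-image
    do:
    - parse:
        attr: src
    - normalize:
        routine: replace_substring
        args:
            \.[^\.]+\.jpg: '.jpg'
```
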
We will do it using the normalization function.<\/p>\n<h3>Catalog with a different layout<\/h3>\n<p>Sometimes Amazon shows the catalog using a completely different layout. This means such pages have other selectors, and the selectors we have already defined will not work there. We don\u2019t know what determines the choice of one or another template, but we have the opportunity to include logic for a variety of templates in the digger config. We will not go into the details of defining these selectors; they are simply listed below.<\/p>\n<figure id=\"attachment_mmd_721\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_other_layout.png\"><img width=\"1898\" height=\"925\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_other_layout.png\" class=\"attachment-full size-full\" alt=\"Amazon catalog with a different layout\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_other_layout.png 1898w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_other_layout-300x146.png 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_other_layout-768x374.png 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/amazon_other_layout-1024x499.png 1024w\" sizes=\"auto, (max-width: 1898px) 100vw, 1898px\" \/><\/a><\/figure>\n<p><strong>Listing selector<\/strong>: <code>li.s-result-item<\/code>\n<br>\n<strong>Next page selector<\/strong>: <code>a.pagnNext<\/code>\n<br>\n<strong>Product name<\/strong>: <code>h2<\/code>\n<br>\n<strong>Brand<\/strong>: <code>div.a-spacing-mini:has(h2) &gt; div.a-row &gt; span<\/code> (use slice and select last element)<br>\n<strong>Rating<\/strong>: <code>i.a-icon-star&gt;span<\/code>\n<br>\n<strong>Reviews<\/strong>: <code>div.a-spacing-none:has(i.a-icon-star) &gt; a.a-size-small<\/code>\n<br>\n<strong>Price<\/strong>: 
<code>span.a-offscreen<\/code>\n<br>\n<strong>Prime<\/strong>: <code>i.a-icon-prime<\/code>\n<br>\n<strong>Sellers<\/strong>: <code>a:contains(&quot;new offers&quot;)<\/code>\n<br>\n<strong>Link to the full-sized image<\/strong>: <code>img.s-access-image<\/code><\/p>\n<p>Now we have all the necessary selectors and a raw data processing plan. Let&#8217;s start writing the configuration of the web scraper.<\/p>\n<h3>Building the Amazon Web Scraper<\/h3>\n<p>Log into your account on the Diggernaut platform, create a new digger in any of your projects and click on the &#8220;Add config&#8221; button. Then just write the config with us, carefully reading the comments.<\/p>\n<pre><code class=\"language-yaml\">---\nconfig:\n    debug: 2\n    agent: Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/71.0.3578.98 Safari\/537.36\ndo:\n# Add start URL to the pool (you can add a list of start URLs)\n- link_add:\n    url:\n    - https:\/\/www.amazon.com\/s?bbn=16225011011&amp;rh=n%3A%2116225011011%2Cn%3A284507%2Cn%3A289913%2Cn%3A289940&amp;dc&amp;fst=as%3Aoff&amp;ie=UTF8&amp;qid=1547931533&amp;rnid=289913&amp;ref=sr_nr_n_1\n# Set variable to &quot;yes&quot; value to use it with repeat mode\n# while value of this variable is &quot;yes&quot;. 
the walk command will reload current page in the pool\n- variable_set:\n    field: rip\n    value: &quot;yes&quot;\n# Iterating links in the pool and loading them\n- walk:\n    to: links\n    repeat_in_pool: &lt;%rip%&gt;\n    do:\n    # Expecting the worst and setting the repeat variable to the &quot;yes&quot; value\n    - variable_set:\n        field: rip\n        value: &quot;yes&quot;\n    # Switch to the title block to check if access is blocked\n    - find: \n        path: title \n        do: \n        # Parse value of title to the register\n        - parse\n        # Check if there is word &quot;Sorry&quot;\n        - if:\n            match: Sorry\n            do:\n            # If so, access is blocked, so we switch the proxy\n            - proxy_switch\n            else:\n            # If not, then access is allowed. Check if there is captcha on the page.\n            # Switch to the body block. Please note the usage of the &quot;in&quot; option.\n            # Right now we are in the &quot;title&quot; block and there is no body tag inside of it\n            # so we should search for body block in the full document instead of the current block\n            # to do it you can use option in: doc\n            - find:\n                path: body\n                in: doc\n                do:\n                # Parse entire page text to the register\n                - parse\n                # Check if there is specified string in the text\n                - if:\n                    match: Type the characters you see in this image\n                    do:\n                    # If so, we have a captcha on the page and we need to solve it\n                    # to do it we need to extract some parameters from the page and save them to variables\n                    - find:\n                        path: input[name=&quot;amzn&quot;]\n                        do:\n                        - parse:\n                            attr: value\n                        - normalize:\n                 
           routine: urlencode\n                        - variable_set: amzn\n                    - find:\n                        path: input[name=&quot;amzn-r&quot;]\n                        do:\n                        - parse:\n                            attr: value\n                        - normalize:\n                            routine: urlencode\n                        - variable_set: amznr\n                    # Switch to the block with captcha image\n                    - find:\n                        path: div.a-row&gt;img\n                        do:\n                        # Parse URL of the image\n                        - parse:\n                            attr: src\n                        # Load the image\n                        - walk:\n                            to: value\n                            do:\n                            # In imgbase64 block we will have an image encoded as base64\n                            # This is how Diggernaut works with binary data\n                            # Any binary file is encoded as base64  and you can work with it\n                            # using other Diggernaut functionality (such as OCR, save files etc)\n                            - find:\n                                path: imgbase64\n                                do:\n                                # Parse block content\n                                - parse\n                                # save it to the capimg variable\n                                - variable_set: capimg\n                                # Use command for solving captcha\n                                # Use &quot;diggernaut&quot; as provider to use in-house captcha solver\n                                # Our captcha type is &quot;amazon&quot;\n                                # Also we should pass image with captcha here, using variable capimg\n                                - captcha_resolve:\n                                    provider: 
diggernaut\n                                    type: amazon\n                                    image: &lt;%capimg%&gt;\n                                # After captcha_resolve execution we should have\n                                # recognized captcha text in the &quot;captcha&quot; variable\n                                # so we read this variable value to the register\n                                - variable_get: captcha\n                                # And check if there is any value\n                                - if:\n                                    match: \\S+\n                                    do:\n                                    # If the captcha was recognized, we send the answer to the Amazon server\n                                    - walk:\n                                        to: https:\/\/www.amazon.com\/errors\/validateCaptcha?amzn=&lt;%amzn%&gt;&amp;amzn-r=&lt;%amznr%&gt;&amp;field-keywords=&lt;%captcha%&gt;\n                                        do:\n                    else:\n                    # String not found, so the page has no captcha\n                    # and we are working with standard catalog page\n                    # Turn off repeat mode for the current page\n                    - variable_set:\n                        field: rip\n                        value: &quot;no&quot;\n                    # Pause for 5 sec\n                    - sleep: 5\n                    # Start parsing process\n                    # First let's get the next page and push it to the pool\n                    - find:\n                        path: li.a-last &gt; a, a.pagnNext\n                        do:\n                        # Parse href attribute\n                        - parse:\n                            attr: href\n                        # Check if there is value in href\n                        - if:\n                            match: \\w+\n                            do:\n                            # If so put it to 
the pool\n                            - link_add\n                    # Extract listings, jump to each listing\n                    # First layout\n                    - find:\n                        path: div.s-result-item\n                        do:\n                        # Parse data-asin attribute to get ASIN\n                        - parse:\n                            attr: data-asin\n                        # Check if ASIN is in register\n                        - if:\n                            match: \\w+\n                            do:\n                            # Create new data object\n                            - object_new: item\n                            - object_field_set:\n                                object: item\n                                field: asin\n                            # Lets generate URL to product page\n                            # Save ASIN to variable\n                            - variable_set: asin\n                            # Write string to the register and then save URL to the object field\n                            - register_set: https:\/\/www.amazon.com\/dp\/&lt;%asin%&gt;\n                            - object_field_set:\n                                object: item\n                                field: url\n                            # Extract product name\n                            - find:\n                                path: h5\n                                do:\n                                - parse\n                                # Normalize whitespaces\n                                - space_dedupe\n                                # Trim the register value\n                                - trim\n                                # Save register value to the object field\n                                - object_field_set:\n                                    object: item\n                                    field: title\n                            # Extract brand\n                 
           - find:\n                                path: div.a-section:haschild(h5) &gt; div.a-color-secondary\n                                do:\n                                - parse\n                                - space_dedupe\n                                - trim\n                                # Remove &quot;by&quot; word\n                                - normalize:\n                                    routine: replace_substring\n                                    args:\n                                        ^by\\s+: &#039;&#039;\n                                - object_field_set:\n                                    object: item\n                                    field: brand\n                            # Extract rating\n                            - find:\n                                path: span[aria-label]\n                                slice: 0\n                                do:\n                                - parse:\n                                    attr: aria-label\n                                    filter: ^([0-9\\.]+)\n                                # Check if the rating value exists\n                                - if:\n                                    match: \\d+\n                                    do:\n                                    # Save value to the object field as float type\n                                    - object_field_set:\n                                        object: item\n                                        field: rating\n                                        type: float\n                            # Extract reviews\n                            - find:\n                                path: span[aria-label]\n                                slice: 1\n                                do:\n                                - parse:\n                                    attr: aria-label\n                                    filter: (\\d+)\n                                - if:\n              
                      match: \\d+\n                                    do:\n                                    # Save value to the object field as int type\n                                    - object_field_set:\n                                        object: item\n                                        field: reviews\n                                        type: int\n                            # Extract price\n                            - find:\n                                path: span.a-price &gt; span.a-offscreen\n                                slice: 0\n                                do:\n                                - parse:\n                                    filter:\n                                    - ([0-9\\.]+)\\s*\\-\n                                    - ([0-9\\.]+)\n                                - object_field_set:\n                                    object: item\n                                    field: price\n                                    type: float\n                            # Extract prime\n                            # Dafault it to &quot;no&quot;\n                            - register_set: &quot;no&quot;\n                            - object_field_set:\n                                object: item\n                                field: prime\n                            - find:\n                                path: i.a-icon-prime\n                                do:\n                                - register_set: &quot;yes&quot;\n                                - object_field_set:\n                                    object: item\n                                    field: prime\n                            # Extract sellers\n                            - find:\n                                path: a:contains(&quot;new offers&quot;)\n                                do:\n                                - parse:\n                                    filter: (\\d+)\n                                - 
object_field_set:\n                                    object: item\n                                    field: sellers\n                                    type: int\n                            # Extract image\n                            - find:\n                                path: img.s-image\n                                do:\n                                - parse:\n                                    attr: src\n                                # Replace substring in URL to make it full-sized\n                                - normalize:\n                                    routine: replace_substring\n                                    args:\n                                        \\.[^\\.]+\\.jpg: &#039;.jpg&#039;\n                                - normalize:\n                                    routine: url\n                                - object_field_set:\n                                    object: item\n                                    field: image\n                            - object_save:\n                                name: item\n                    # Second layout\n                    - find:\n                        path: li.s-result-item\n                        do:\n                        - parse:\n                            attr: data-asin\n                        - if:\n                            match: \\w+\n                            do:\n                            - object_new: item\n                            - object_field_set:\n                                object: item\n                                field: asin\n                            - variable_set: asin\n                            - register_set: https:\/\/www.amazon.com\/dp\/&lt;%asin%&gt;\n                            - object_field_set:\n                                object: item\n                                field: url\n                            - find:\n                                path: h2\n                                do:\n               
                 - node_remove: span.a-offscreen\n                                - parse\n                                - space_dedupe\n                                - trim\n                                - object_field_set:\n                                    object: item\n                                    field: title\n                            # Slice here will select just last found element\n                            - find:\n                                path: div.a-spacing-mini:has(h2) &gt; div.a-row &gt; span\n                                slice: -1\n                                do:\n                                - parse\n                                - space_dedupe\n                                - trim\n                                - object_field_set:\n                                    object: item\n                                    field: brand\n                            - find:\n                                path: i.a-icon-star&gt;span\n                                do:\n                                - parse:\n                                    filter: ^([0-9\\.]+)\n                                - if:\n                                    match: \\d+\n                                    do:\n                                    - object_field_set:\n                                        object: item\n                                        field: rating\n                                        type: float\n                            - find:\n                                path: div.a-spacing-none:has(i.a-icon-star) &gt; a.a-size-small\n                                do:\n                                - parse:\n                                    filter: (\\d+)\n                                - if:\n                                    match: \\d+\n                                    do:\n                                    - object_field_set:\n                                        object: item\n            
                            field: reviews\n                                        type: int\n                            - find:\n                                path: span.a-offscreen\n                                slice: -1\n                                do:\n                                - parse:\n                                    filter:\n                                    - ([0-9\\.]+)\\s*\\-\n                                    - ([0-9\\.]+)\n                                - object_field_set:\n                                    object: item\n                                    field: price\n                                    type: float\n                            - register_set: &quot;no&quot;\n                            - object_field_set:\n                                object: item\n                                field: prime\n                            - find:\n                                path: i.a-icon-prime\n                                do:\n                                - register_set: &quot;yes&quot;\n                                - object_field_set:\n                                    object: item\n                                    field: prime\n                            - find:\n                                path: a:contains(&quot;new offers&quot;)\n                                do:\n                                - parse:\n                                    filter: (\\d+)\n                                - object_field_set:\n                                    object: item\n                                    field: sellers\n                                    type: int\n                            - find:\n                                path: img.s-access-image\n                                do:\n                                - parse:\n                                    attr: src\n                                - normalize:\n                                    routine: replace_substring\n            
                        args:\n                                        \\.[^\\.]+\\.jpg: &#039;.jpg&#039;\n                                - normalize:\n                                    routine: url\n                                - object_field_set:\n                                    object: item\n                                    field: image\n                            - object_save:\n                                name: item\n<\/code><\/pre>\n<p>You can also download sample of <a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/01\/digger_4885_session_634309.xlsx\">Amazon dataset<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>Today we are going to build a web scraper for Amazon.com. The tool will be designed to collect basic information about products from a specific category. If you wish, you can expand the dataset to be collected on your own. Or, if you do not want to spend your time, you have the opportunity to [&hellip;]<\/p>","protected":false},"author":4,"featured_media":725,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[],"class_list":["post-711","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-web-scraping"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/711","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/comments?post=711"}],"version-history":[{"count":4,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/711\/revisions"}],"predecessor-version":[{"id":726,"
href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/711\/revisions\/726"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media\/725"}],"wp:attachment":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media?parent=711"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/categories?post=711"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/tags?post=711"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}
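<p>To make two of the regex steps in the configuration above concrete (the price filter that prefers the low bound of a range such as "$12.99 - $19.99", and the <code>replace_substring</code> normalization that strips the size modifier from an image URL), here is a minimal Python sketch of the same logic. The function names and the sample URL are illustrative only; they are not part of the Diggernaut meta-language:</p>

```python
import re

def parse_price(text):
    """Mimic the scraper's price filter list: try the range pattern
    first (capture the low bound before the hyphen), then fall back
    to a single price."""
    for pattern in (r"([0-9\.]+)\s*\-", r"([0-9\.]+)"):
        m = re.search(pattern, text)
        if m:
            return float(m.group(1))
    return None

def full_size_image(url):
    """Mimic the replace_substring normalization: drop the size
    modifier segment (e.g. "._AC_UL320_") so ".jpg" follows the
    image ID directly, yielding the full-sized image URL."""
    return re.sub(r"\.[^\.]+\.jpg", ".jpg", url)

print(parse_price("$12.99 - $19.99"))  # 12.99
print(parse_price("$24.50"))           # 24.5
print(full_size_image(
    "https://m.media-amazon.com/images/I/81abcDEF._AC_UL320_.jpg"))
```

<p>Running the filters outside the digger like this is a quick way to check that a pattern captures what you expect before editing the YAML configuration.</p>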