{"id":750,"date":"2019-02-10T13:08:40","date_gmt":"2019-02-10T13:08:40","guid":{"rendered":"https:\/\/www.diggernaut.com\/blog\/?p=750"},"modified":"2020-02-01T07:46:56","modified_gmt":"2020-02-01T07:46:56","slug":"scraping-olx-classified-ads-making-a-one-stop-solution","status":"publish","type":"post","link":"https:\/\/www.diggernaut.com\/blog\/scraping-olx-classified-ads-making-a-one-stop-solution\/","title":{"rendered":"Scraping OLX classified ads: making a one-stop solution."},"content":{"rendered":"<p>Many people probably know what OLX is. In Russia, the company was absorbed by Avito. However, OLX still operates in many other countries, including Ukraine, Poland, and Kazakhstan. A complete list of countries can be found on the main <a href=\"https:\/\/www.olx.com\/\">OLX<\/a> site.<\/p>\n<p><strong>Modified on 01 Feb 2020. Added functionality to handle cases when the proxy is banned.<\/strong><\/p>\n<p>Since all the OLX sites are usually built on the same framework, the web scraper that we build can, in theory, work with the site in any country. There may be exceptions, of course, but as a rule everything should work. Therefore, we are going to use OLX Ukraine as a base, and once the web scraper is ready, we will test it with other sites as well.<\/p>\n<p>So, we will begin with the catalog. We need to understand how navigation between the pages of a section works and find where to get links to the individual ad pages. Let\u2019s choose a random category: <a href=\"https:\/\/www.olx.ua\/detskiy-mir\/detskaya-odezhda\/dnepr\/\">Children\u2019s clothing<\/a>. Open the page in Chrome and enable the developer tools. Go to the Elements tab and select the element inspection tool (1). 
Click on the block we are interested in, the first item in the list (2); then, in the HTML code in the \u201cElements\u201d tab, you should see the selected HTML node (3).<\/p>\n<figure id=\"attachment_mmd_752\" class=\"wp-block-image aligncenter\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx1.jpg\"><img width=\"1873\" height=\"888\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx1.jpg\" class=\"attachment-full size-full\" alt=\"OLX: Selecting ad block\" decoding=\"async\" loading=\"lazy\" align=\"center\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx1.jpg 1873w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx1-300x142.jpg 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx1-768x364.jpg 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx1-1024x485.jpg 1024w\" sizes=\"auto, (max-width: 1873px) 100vw, 1873px\" \/><\/a><\/figure>\n<p>It looks like we can use <code>td.offer<\/code> as the CSS selector for an ad block, but first we need to verify that. To do so, press CTRL + F while the focus is on the HTML code in the \u201cElements\u201d tab, and type the selector into the search bar. If you did everything correctly, you should see that 44 elements (1) have been found. To check that the selector has not matched anything extra, use the up and down buttons (2) and look at which nodes are selected. 
If you want to exclude the promoted (paid) ads at the top, you can use the following selector: <code>td.offer:not(.promoted)<\/code>.<\/p>\n<figure id=\"attachment_mmd_753\" class=\"wp-block-image aligncenter\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx2.jpg\"><img width=\"1595\" height=\"345\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx2.jpg\" class=\"attachment-full size-full\" alt=\"OLX: Checking CSS selector\" decoding=\"async\" loading=\"lazy\" align=\"center\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx2.jpg 1595w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx2-300x65.jpg 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx2-768x166.jpg 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx2-1024x221.jpg 1024w\" sizes=\"auto, (max-width: 1595px) 100vw, 1595px\" \/><\/a><\/figure>\n<p>However, we do not need the block itself, but the link to the ad page. Therefore, let\u2019s expand the HTML elements (1) and find the link we need (2). Thus, our selector for links to ad pages will be <code>td.offer a.link.detailsLink<\/code>. Again, we need to check that it matches exactly 44 links. 
Different versions of OLX may format the ad blocks differently, so we can use the shorter <code>a.link.detailsLink<\/code> selector for better compatibility.<\/p>\n<figure id=\"attachment_mmd_755\" class=\"wp-block-image aligncenter\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx3.jpg\"><img width=\"1876\" height=\"692\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx3.jpg\" class=\"attachment-full size-full\" alt=\"OLX: Finding links to the ad pages\" decoding=\"async\" loading=\"lazy\" align=\"center\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx3.jpg 1876w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx3-300x111.jpg 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx3-768x283.jpg 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx3-1024x378.jpg 1024w\" sizes=\"auto, (max-width: 1876px) 100vw, 1876px\" \/><\/a><\/figure>\n<p>Let\u2019s check the paginator. In the same way we found the ad elements, we locate the link to the next page in the paginator (3). This gives us the selector <code>a[data-cy=&quot;page-link-next&quot;]<\/code>. 
Let\u2019s make sure that exactly one element on the page matches this selector.<\/p>\n<figure id=\"attachment_mmd_756\" class=\"wp-block-image aligncenter\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx4.jpg\"><img width=\"1903\" height=\"836\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx4.jpg\" class=\"attachment-full size-full\" alt=\"OLX: Find the next page link\" decoding=\"async\" loading=\"lazy\" align=\"center\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx4.jpg 1903w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx4-300x132.jpg 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx4-768x337.jpg 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx4-1024x450.jpg 1024w\" sizes=\"auto, (max-width: 1903px) 100vw, 1903px\" \/><\/a><\/figure>\n<p>Now we have everything we need to describe the scraping logic for a specific category. To navigate through the pages of the category, we will use a <a href=\"https:\/\/www.diggernaut.com\/dev\/meta-language-runtime-entities-link-pool.html\">link pool<\/a>. This allows us to use the same code for all pages of the category. 
Therefore, our scraper looks like this:<\/p>\n<pre><code class=\"language-yaml\">---\nconfig:\n    agent: Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/71.0.3578.98 Safari\/537.36\n    debug: 2\ndo:\n# Push start URL to the pool\n- link_add:\n    url: https:\/\/www.olx.ua\/detskiy-mir\/detskaya-odezhda\/dnepr\/\n# Iterate over the pool and load each link\n- walk:\n    to: links\n    do:\n    # Find the link to the next page\n    - find:\n        path: a[data-cy=&quot;page-link-next&quot;]\n        do:\n        # Parse link from href attribute\n        - parse:\n            attr: href\n        # Add it to the pool\n        - link_add\n    # Find a link to an ad\n    - find:\n        path: a.link.detailsLink\n        do:\n        # Parse URL from href attribute\n        - parse:\n            attr: href\n        # We are not going to do anything with it for now\n<\/code><\/pre>\n<p>This code walks through all the pages of the catalog, finds each ad link, and parses the URL of the ad page from it.<\/p>\n<p>Now we need to describe the logic for collecting data from the ad page. To do so, open any ad and find CSS selectors for the blocks you want to extract, the same way we did for the catalog page.<\/p>\n<ul>\n<li>Selector for the ad block on the page: <code>div#offer_active<\/code>. We switch to this block first so that we do not create an empty object if the block is missing.<\/li>\n<li>Ad title: <code>h1<\/code>. 
Note that selectors are built relative to the current block (<code>div#offer_active<\/code>).<\/li>\n<li>Address: <code>address > p<\/code><\/li>\n<li>Ad ID: <code>em > small<\/code> (we need to filter the data when parsing the content to remove extra text)<\/li>\n<li>Date and time of ad placement: <code>em<\/code> (we need to remove the \u201ca\u201d and \u201csmall\u201d nodes before parsing, and also clean up the data a bit)<\/li>\n<li>There is a table with details, but its fields may differ depending on the type of ad. Therefore, we will collect both field names and values. A detailed explanation is given in the code; the selector for this table is <code>table.details<\/code>.<\/li>\n<li>Description: <code>div#textContent<\/code><\/li>\n<li>Image: <code>div#photo-gallery-opener > img<\/code> (since we need the full-sized image, we have to cut off the part of the URL that contains the image size. We will use a filter for that.)<\/li>\n<li>Price: <code>div.price-label<\/code><\/li>\n<li>Seller Name: <code>div.offer-user__details > h4<\/code><\/li>\n<li>Phone: the phone number is not present on the page. To scrape it, we will have to make an additional request. 
We will get back to it a bit later.<\/li>\n<\/ul>\n<p>Let\u2019s write the part of the scraper that collects all the data from the ad page, except for the phone number (for now), and see what we get in the dataset:<\/p>\n<pre><code class=\"language-yaml\">---\nconfig:\n    agent: Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/71.0.3578.98 Safari\/537.36\n    debug: 2\ndo:\n# Push start URL to the pool\n- link_add:\n    url: https:\/\/www.olx.ua\/detskiy-mir\/detskaya-odezhda\/dnepr\/\n# Iterate over the pool and load each link\n- walk:\n    to: links\n    do:\n    # Find the link to the next page\n    - find:\n        path: a[data-cy=&quot;page-link-next&quot;]\n        do:\n        # Parse link from href attribute\n        - parse:\n            attr: href\n        # Add it to the pool\n        - link_add\n    # Find a link to an ad\n    - find:\n        path: a.link.detailsLink\n        do:\n        # Parse URL from href attribute\n        - parse:\n            attr: href\n        # Load page with the ad\n        - walk:\n            to: value\n            do:\n            # Find common container for the ad\n            - find:\n                path: &#039;div#offer_active&#039;\n                do:\n                # Create data object with name item\n                - object_new: item\n                # Find element with ad title\n                - find:\n                    path: h1\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: title\n                # Find element with ad description\n                - find:\n                    path: &#039;div#textContent&#039;\n     
               do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: description\n                # Find element with ad ID\n                - find:\n                    path: &#039;em > small&#039;\n                    do:\n                    # Parse text content using a filter. Since the ID consists of digits only, we apply a filter to extract only digits.\n                    - parse:\n                        filter: (\\d+)\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: ad_id\n                # Find element with ad date and time\n                - find:\n                    path: &#039;em&#039;\n                    do:\n                    # Remove nodes with irrelevant information\n                    - node_remove: a,small\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Remove trailing comma\n                    - normalize:\n                        routine: replace_substring\n                        args:\n                            \\,$: &#039;&#039;\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: date\n                # Find element with ad price\n                - find:\n                    path: div.price-label\n                    do:\n                    # Parse text content\n                 
   - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: price\n                # Find element with seller name\n                - find:\n                    path: div.offer-user__details > h4\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: seller\n                # Find element with address\n                - find:\n                    path: address > p\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: address\n                # Find element with image\n                - find:\n                    path: div#photo-gallery-opener > img\n                    do:\n                    # Parse content of src attribute and filter it to cut the end with size\n                    - parse:\n                        attr: src\n                        filter: ^([^;]+)\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: image\n                # Let&#039;s also save ad URL to the data object\n   
             # we will use content of static variable &quot;url&quot; for it\n                - static_get: url\n                # Save data to the item data object field\n                - object_field_set:\n                    object: item\n                    field: url\n                # Now let&#039;s get data from the table with item details\n                - find:\n                    path: table.details\n                    do:\n                    # Find all table rows which have a child cell with class &quot;value&quot;\n                    - find:\n                        path: tr:haschild(td.value)\n                        do:\n                        # Switch to th to get field name\n                        - find:\n                            path: th\n                            do:\n                            # Parse text content\n                            - parse\n                            # Normalize parsed data, dedupe and trim whitespace\n                            - space_dedupe\n                            - trim\n                            # Save content to the variable &quot;fieldname&quot;\n                            - variable_set: fieldname\n                        # Switch to td to get field data\n                        - find:\n                            path: td\n                            do:\n                            # Parse text content\n                            - parse\n                            # Normalize parsed data, dedupe and trim whitespace\n                            - space_dedupe\n                            - trim\n                            # Save data to the field defined by name kept in the &quot;fieldname&quot; variable of the &quot;item&quot; object\n                            - object_field_set:\n                                object: item\n                                field: <%fieldname%>\n                # Save object item to the dataset\n                - object_save:\n                   
 name: item\n                # Exit here (after the first ad) to check that all data is collected\n                - exit\n<\/code><\/pre>\n<p>As a result, we get the following record in the dataset:<\/p>\n<pre><code class=\"language-json\">[{\n    &quot;item&quot;: {\n        &quot;ad_id&quot;: &quot;574946238&quot;,\n        &quot;address&quot;: &quot;\u0414\u043d\u0435\u043f\u0440, \u0414\u043d\u0435\u043f\u0440\u043e\u043f\u0435\u0442\u0440\u043e\u0432\u0441\u043a\u0430\u044f \u043e\u0431\u043b\u0430\u0441\u0442\u044c, \u0418\u043d\u0434\u0443\u0441\u0442\u0440\u0438\u0430\u043b\u044c\u043d\u044b\u0439&quot;,\n        &quot;date&quot;: &quot;\u0432 22:06, 9 \u0444\u0435\u0432\u0440\u0430\u043b\u044f 2019&quot;,\n        &quot;description&quot;: &quot;\u0421\u041e\u0421\u0422\u041e\u042f\u041d\u0418\u0415 \u041d\u041e\u0412\u041e\u0413\u041e. \u0414\u0435\u0444\u0444\u0435\u043a\u0442\u043e\u0432 \u043d\u0438\u043a\u0430\u043a\u0438\u0445 \u043d\u0435\u0442. \u0411\u0435\u0437 \u0441\u043b\u0435\u0434\u043e\u0432 \u043d\u043e\u0441\u043a\u0438. \u0411\u0440\u0435\u043d\u0434\u043e\u0432\u044b\u0439 \u043a\u0440\u0430\u0441\u0438\u0432\u0435\u043d\u043d\u044b\u0439 \u0434\u0435\u043c\u0438\u0441\u0435\u0437\u043e\u043d\u043d\u044b\u0439 \u043a\u043e\u043c\u0431\u0438\u043d\u0435\u0437\u043e\u043d F&F (\u0410\u043d\u0433\u043b\u0438\u044f) \u0434\u043b\u044f \u043c\u0430\u043b\u044c\u0447\u0438\u043a\u0430 3-6 \u043c\u0435\u0441. \u0421\u0435\u0437\u043e\u043d \u0432\u0435\u0441\u043d\u0430, \u0441\u0440\u0430\u0437\u0443 \u043a\u0430\u043a \u0441\u043d\u0438\u043c\u0438\u0442\u0435 \u0437\u0438\u043c\u043d\u0438\u0439 \u043f\u0430\u0440\u043a\u0438\u0439 \u043a\u043e\u043c\u0431\u0438\u043d\u0435\u0437\u043e\u043d. 
\u041f\u043e\u043a\u0443\u043f\u043a\u043e\u0439 \u0431\u0443\u0434\u0435\u0442\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043e\u0432\u043e\u043b\u044c\u043d\u044b- \u044d\u0442\u0430 \u0432\u0435\u0449\u044c \u0412\u0430\u0441 \u0434\u0435\u0439\u0441\u0442\u0432\u0438\u0442\u0435\u043b\u044c\u043d\u043e \u043f\u043e\u0440\u0430\u0434\u0443\u0435\u0442. \u041a\u0430\u0447\u0435\u0441\u0442\u0432\u043e \u0430\u043d\u0433\u043b\u0438\u0439\u0441\u043a\u043e\u0435, \u0430 \u0437\u043d\u0430\u0447\u0438\u0442 \u0441\u0443\u043f\u0435\u0440, \u043f\u0440\u0438\u044f\u0442\u043d\u0430\u044f \u0442\u043a\u0430\u043d\u044c, \u0448\u0432\u044b \u043d\u0435 \u0442\u043e\u0440\u0447\u0430\u0442. \u0423\u0442\u0435\u043f\u043b\u0438\u0442\u0435\u043b\u044c \u0438 \u043f\u043e\u0434\u043a\u043b\u0430\u0434\u043a\u0430 \u0432 \u0438\u0434\u0435\u0430\u043b\u0435. \u0418\u0437 \u043d\u043e\u0432\u043e\u0439 \u043a\u043e\u043b\u043b\u0435\u043a\u0446\u0438\u0438, \u044f\u0440\u043a\u0438\u0439 \u043f\u0440\u0438\u043d\u0442. \u041d\u0430 \u043c\u0430\u043b\u044b\u0448\u0435 \u0441\u043c\u043e\u0442\u0440\u0438\u0442\u0441\u044f \u0431\u043e\u043c\u0431\u0435\u0437\u043d\u043e. \u0423\u0434\u043e\u0431\u043d\u044b\u0439, \u043b\u0435\u0433\u043a\u043e \u043e\u0434\u0435\u0432\u0430\u0435\u0442\u0441\u044f- \u043f\u0440\u043e\u0434\u043e\u043b\u044c\u043d\u0430\u044f \u043c\u043e\u043b\u043d\u0438\u044f, \u041d\u0415 \u043a\u043d\u043e\u043f\u043a\u0438. \u041c\u043e\u0434\u0435\u043b\u044c\u043a\u0430 \u043e\u0447\u0435\u043d\u044c \u0443\u0434\u0430\u0447\u043d\u0430\u044f, \u044d\u0440\u0433\u043e\u043d\u043e\u043c\u0438\u0447\u043d\u0430\u044f, \u043f\u0440\u0430\u0432\u0438\u043b\u044c\u043d\u043e\u0433\u043e \u043f\u043e\u043a\u0440\u043e\u044f- \u0447\u0451\u0442\u043a\u043e \u0441\u0438\u0434\u0438\u0442 \u043f\u043e \u0444\u0438\u0433\u0443\u0440\u0435 (\u043d\u0435 \u0432\u0438\u0441\u0438\u0442 \u043c\u0435\u0448\u043a\u043e\u043c). 
\u0412\u043d\u0443\u0442\u0440\u0438 \u0434\u043e \u0441\u0435\u0440\u0435\u0434\u0438\u043d\u044b \u0443\u0442\u0435\u043f\u043b\u0451\u043d \u0444\u043b\u0438\u0441\u043e\u043c (\u043f\u043e\u0434\u0445\u043e\u0434\u0438\u0442 \u0438 \u043d\u0430 \u0445\u043e\u043b\u043e\u0434\u043d\u0443\u044e \u0432\u0435\u0441\u043d\u0443). \u041f\u0435\u0440\u0435\u0434 \u043f\u0440\u043e\u0434\u0430\u0436\u0435\u0439 \u043f\u043e\u0441\u0442\u0438\u0440\u0430\u043d- \u0447\u0438\u0441\u0442\u0435\u043d\u044c\u043a\u0438\u0439 - \u043c\u043e\u0436\u043d\u043e \u0441\u0440\u0430\u0437\u0443 \u043d\u043e\u0441\u0438\u0442\u044c. \u0412 \u043a\u043e\u043c\u043f\u043b\u0435\u043a\u0442 \u0432\u0445\u043e\u0434\u044f\u0442 \u0432\u0430\u0440\u0435\u0436\u043a\u0438 \u0438 \u0444\u0438\u0440\u043c\u0435\u043d\u043d\u0430\u044f \u0434\u0435\u043c\u0438 \u0448\u0430\u043f\u043e\u0447\u043a\u0430 Early Days \u0432 \u0442\u043e\u043d \u043a \u043a\u043e\u043c\u0431\u0435\u0437\u0443 (\u0434\u0432\u043e\u0439\u043d\u0430\u044f \u0432\u044f\u0437\u043a\u0430) - \u0441\u043e\u0441\u0442\u043e\u044f\u043d\u0438\u0435 \u043d\u043e\u0432\u043e\u0439, \u0411\u0415\u0417 \u043a\u0430\u0442\u044b\u0448\u0435\u043a. \u0413\u043b\u0443\u0431\u043e\u043a\u0430\u044f, \u0445\u043e\u0440\u043e\u0448\u043e \u043f\u0440\u0438\u043a\u0440\u044b\u0432\u0430\u0435\u0442 \u0443\u0448\u043a\u0438, \u043d\u0435 \u0441\u043f\u043e\u043b\u0437\u0430\u0435\u0442. \u041f\u0440\u043e\u0434\u0430\u0436\u0430 \u0442\u043e\u043b\u044c\u043a\u043e \u043a\u043e\u043c\u043f\u043b\u0435\u043a\u0442\u043e\u043c. 
\u0417\u0430\u043c\u0435\u0440\u044b: \u0434\u043b\u0438\u043d\u0430 \u043e\u0442 \u043f\u043b\u0435\u0447\u0430 \u0434\u043e \u043f\u044f\u0442\u043e\u0447\u043a\u0438 \u043f\u043e \u0441\u043f\u0438\u043d\u043a\u0435 61; \u043e\u0442 \u0448\u0435\u0438 \u0434\u043e \u043f\u044f\u0442\u043e\u0447\u043a\u0438 \u043f\u043e \u0441\u043f\u0438\u043d\u043a\u0435 62; \u043e\u0442 \u0448\u0435\u0438 \u0434\u043e \u043f\u0430\u043c\u043f\u0435\u0440\u0441\u0430 \u043f\u043e \u0441\u043f\u0438\u043d\u043a\u0435 44; \u043e\u0442 \u0432\u0435\u0440\u0445\u0430 \u043a\u0430\u043f\u044e\u0448\u043e\u043d\u0430 \u0434\u043e \u043f\u044f\u0442\u043e\u0447\u043a\u0438 \u043f\u043e \u0441\u043f\u0438\u043d\u043a\u0435 81; \u041f\u041e\u0413 \u043e\u0442 \u043f\u043e\u0434\u043c\u044b\u0448\u043a\u0438 \u0434\u043e \u043f\u043e\u0434\u043c\u044b\u0448\u043a\u0438 34; \u0440\u0443\u043a\u0430\u0432 \u043e\u0442 \u043f\u043b\u0435\u0447\u0430 23; \u0440\u0443\u043a\u0430\u0432 \u043e\u0442 \u0448\u0435\u0438 29; \u0448\u0438\u0440\u0438\u043d\u0430 \u0432 \u043f\u043b\u0435\u0447\u0430\u0445 28; \u0448\u0430\u0433\u043e\u0432\u044b\u0439 \u043e\u0442 \u043f\u0430\u043c\u043f\u0435\u0440\u0441\u0430 \u0434\u043e \u043f\u044f\u0442\u043e\u0447\u043a\u0438 21. \u041f\u0435\u0440\u0435\u0441\u044b\u043b\u0430\u044e. \u0421\u043c\u043e\u0442\u0440\u0438\u0442\u0435 \u0432\u0441\u0435 \u043c\u043e\u0438 \u043e\u0431\u044a\u044f\u0432\u043b\u0435\u043d\u0438\u044f \u0415\u0441\u0442\u044c \u0442\u043e\u0447\u043d\u043e \u0442\u0430\u043a\u043e\u0439 \u0436\u0435 \u043a\u043e\u043c\u0431\u0438\u043d\u0435\u0437\u043e\u043d \u0432 \u0440\u0430\u0437\u043c\u0435\u0440\u0435 0-3 \u043c\u0435\u0441. (\u043f\u043e\u043a\u0443\u043f\u0430\u043b\u0430 \u0440\u043e\u0441\u0442\u043e\u0432\u043a\u043e\u0439 \u0434\u043b\u044f \u0441\u044b\u043d\u0430 \u0438 \u043f\u043b\u0435\u043c\u044f\u0448\u0430). 
\u0421\u043c\u043e\u0442\u0440\u0438\u0442\u0435 \u0432 \u043c\u043e\u0438\u0445 \u043e\u0431\u044a\u044f\u0432\u043b\u0435\u043d\u0438\u044f\u0445 \u041d\u0430 3-6 \u043c\u0435\u0441. \u0435\u0441\u0442\u044c \u0435\u0449\u0435 \u0441\u0435\u0440\u0435\u0431\u0440\u0438\u0441\u0442\u044b\u0439 \u043a\u043e\u043c\u0431\u0435\u0437 \u0447\u0443\u0442\u044c \u043f\u043e\u043b\u0435\u0433\u0447\u0435. \u0415\u0441\u0442\u044c \u043a\u043e\u043c\u0431\u0438\u043d\u0435\u0437\u043e\u043d\u044b \u043d\u0430 \u0434\u0440\u0443\u0433\u043e\u0439 \u0432\u043e\u0437\u0440\u0430\u0441\u0442. \u0415\u0441\u0442\u044c \u043f\u0430\u043a\u0435\u0442\u044b \u0444\u0438\u0440\u043c\u0435\u043d\u043d\u043e\u0439 \u043e\u0434\u0435\u0436\u0434\u044b \u0434\u043b\u044f \u043c\u0430\u043b\u044c\u0447\u0438\u043a\u0430 0-6 \u043c\u0435\u0441. \u0422\u0430\u043a\u0436\u0435 \u043f\u0440\u043e\u0434\u0430\u043c \u043a\u0443\u0440\u0442\u043e\u0447\u043a\u0438 \u043d\u0430 \u0441\u0442\u0430\u0440\u0448\u0438\u0439 \u0432\u043e\u0437\u0440\u0430\u0441\u0442, \u0436\u0438\u043b\u0435\u0442\u043a\u0438 \u0421\u043f\u0440\u0430\u0448\u0438\u0432\u0430\u0439\u0442\u0435, \u043d\u0435 \u0432\u0441\u0451 \u0432\u044b\u0441\u0442\u0430\u0432\u043b\u0435\u043d\u043e. \u0421\u043a\u0438\u043d\u0443 \u0444\u043e\u0442\u043e \u0447\u0442\u043e \u0435\u0441\u0442\u044c. \u041c\u043e\u0436\u043d\u043e \u043f\u0438\u0441\u0430\u0442\u044c \u0438 \u0432 Viber. \u041e\u0442\u0432\u0435\u0447\u0430\u044e \u0441\u0440\u0430\u0437\u0443&quot;,\n        &quot;image&quot;: &quot;https:\/\/apollo-ireland.akamaized.net:443\/v1\/files\/yxyp673xu3zj2-UA\/image&quot;,\n        &quot;price&quot;: &quot;600 \u0433\u0440\u043d.&quot;,\n        &quot;seller&quot;: &quot;BRAND CLOTHING&quot;,\n        &quot;title&quot;: &quot;\u041a\u043e\u043c\u0431\u0438\u043d\u0435\u0437\u043e\u043d F&F\u0434\u0435\u043c\u0438\u0441\u0435\u0437\u043e\u043d\u043d\u044b\u0439 3-6 \u043c\u0435\u0441. 
\u0412\u0435\u0441\u043d\u0430 next gap \u0434\u0435\u043c\u0438 + \u0448\u0430\u043f\u043a\u0430&quot;,\n        &quot;url&quot;: &quot;https:\/\/www.olx.ua\/obyavlenie\/kombinezon-f-fdemisezonnyy-3-6-mes-vesna-next-gap-demi-shapka-IDCUpMq.html#006bc65a76;promoted&quot;,\n        &quot;\u041e\u0431\u044a\u044f\u0432\u043b\u0435\u043d\u0438\u0435 \u043e\u0442&quot;: &quot;\u0427\u0430\u0441\u0442\u043d\u043e\u0433\u043e \u043b\u0438\u0446\u0430&quot;,\n        &quot;\u0420\u0430\u0437\u043c\u0435\u0440&quot;: &quot;68&quot;,\n        &quot;\u0421\u043e\u0441\u0442\u043e\u044f\u043d\u0438\u0435&quot;: &quot;\u0411\/\u0443&quot;,\n        &quot;\u0422\u0438\u043f \u043e\u0434\u0435\u0436\u0434\u044b&quot;: &quot;\u041e\u0434\u0435\u0436\u0434\u0430 \u0434\u043b\u044f \u043c\u0430\u043b\u044c\u0447\u0438\u043a\u043e\u0432&quot;\n    }\n}]\n<\/code><\/pre>\n<p>Everything looks fine, so now let\u2019s see how to collect the phone number. To do so, open the ad page, open the developer tools, and go to the Network tab (1). In this tab, filter the list to show only XHR requests (2) and click the button that clears all requests. Then click the \u201cShow Phone\u201d button (3). 
We will see that the browser has made a request to the server (4).<\/p>\n<figure id=\"attachment_mmd_757\" class=\"wp-block-image aligncenter\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx5.jpg\"><img width=\"1892\" height=\"854\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx5.jpg\" class=\"attachment-full size-full\" alt=\"OLX: Catching XHR request for phone number retrieval\" decoding=\"async\" loading=\"lazy\" align=\"center\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx5.jpg 1892w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx5-300x135.jpg 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx5-768x347.jpg 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx5-1024x462.jpg 1024w\" sizes=\"auto, (max-width: 1892px) 100vw, 1892px\" \/><\/a><\/figure>\n<p>Now open the request and see the address (1) where it is sent and what data (2) it sends.<\/p>\n<figure id=\"attachment_mmd_758\" class=\"wp-block-image aligncenter\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx6.jpg\"><img width=\"1873\" height=\"199\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx6.jpg\" class=\"attachment-full size-full\" alt=\"OLX: Extracting URL and Query for XHR request\" decoding=\"async\" loading=\"lazy\" align=\"center\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx6.jpg 1873w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx6-300x32.jpg 300w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx6-768x82.jpg 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2019\/02\/olx6-1024x109.jpg 1024w\" sizes=\"auto, (max-width: 1873px) 100vw, 1873px\" \/><\/a><\/figure>\n<p>Now we have the URL: <code>https:\/\/www.olx.ua\/ajax\/misc\/contact\/phone\/qsKeK\/<\/code>\n<br>\nand 
parameter pt \n<code>cda38f1d74d6e50f6f5a248ea2578ba04d44b58ccb6648718ce825a15dd1c036494b2cd1c6cb27762a8de30f5f58676149a11ee8a228998fd7f6b8cde5bb83a9<\/code><\/p>\n<p>Obviously, in order to emulate such request, we need to have the ad ID (which is <code>qsKeK<\/code> in this particular case) and the parameter <code>pt<\/code>. If we search for them in the page source (\u201cElements\u201d tab), we find that the <code>pt<\/code> parameter is in JavaScript on the page, which means we can extract it using a regular expression. The ad ID can be pulled from the \u201cShow Phone\u201d button. It will also give us the opportunity to pick up phones only if this button exists on the page. The logic will be simple, we will go into the node with the button and do certain actions, and if the button does not exist, then the actions will not be performed. Let\u2019s make changes to our scraper and add a snippet to collect the phone number.<\/p>\n<pre><code class=\"language-yaml\">---\nconfig:\n    agent: Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/71.0.3578.98 Safari\/537.36\n    debug: 2\ndo:\n# Push start URL to the pool\n- link_add:\n    url: https:\/\/www.olx.ua\/detskiy-mir\/detskaya-odezhda\/dnepr\/\n# Iterate over the pool and load each link\n- walk:\n    to: links\n    do:\n    # Find the link to the next page\n    - find:\n        path: a[data-cy=&quot;page-link-next&quot;]\n        do:\n        # Parse link from href attribute\n        - parse:\n            attr: href\n        # Add it to the pool\n        - link_add\n    # Find a link to an ad\n    - find:\n        path: a.link.detailsLink\n        do:\n        # Parse URL from href attribute\n        - parse:\n            attr: href\n        # Load page with the ad\n        - walk:\n            to: value\n            do:\n            # Find common container for the ad\n            - find:\n                path: &#039;div#offer_active&#039;\n                do:\n          
      # Create data object with name item\n                - object_new: item\n                # Find element with ad title\n                - find:\n                    path: h1\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: title\n                # Find element with ad description\n                - find:\n                    path: &#039;div#textContent&#039;\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: description\n                # Find element with ad ID\n                - find:\n                    path: &#039;em > small&#039;\n                    do:\n                    # Parse text content using the filter. 
Since the ID consists of digits only, we apply a filter to extract only digits.\n                    - parse:\n                        filter: (\\d+)\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: ad_id\n                # Find element with ad date and time\n                - find:\n                    path: &#039;em&#039;\n                    do:\n                    # Remove nodes with irrelevant information\n                    - node_remove: a,small\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Remove trailing comma\n                    - normalize:\n                        routine: replace_substring\n                        args:\n                            \\,$: &#039;&#039;\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: date\n                # Find element with ad price\n                - find:\n                    path: div.price-label\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: price\n                # Find element with seller name\n                - find:\n                    path: div.offer-user__details > h4\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim 
whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: seller\n                # Find element with address\n                - find:\n                    path: address > p\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: address\n                # Find element with image\n                - find:\n                    path: div#photo-gallery-opener > img\n                    do:\n                    # Parse content of src attribute and filter it to cut the end with size\n                    - parse:\n                        attr: src\n                        filter: ^([^;]+)\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: image\n                # Let&#039;s also save the ad URL to the data object\n                # we will use content of static variable &quot;url&quot; for it\n                - static_get: url\n                # Save data to the item data object field\n                - object_field_set:\n                    object: item\n                    field: url\n                # Now let&#039;s get data from the table with item details\n                - find:\n                    path: table.details\n                    do:\n                    # Find all table rows that have a child cell with class &quot;value&quot;\n                    - find:\n                        path: 
tr:haschild(td.value)\n                        do:\n                        # Switch to th to get field name\n                        - find:\n                            path: th\n                            do:\n                            # Parse text content\n                            - parse\n                            # Normalize parsed data, dedupe and trim whitespace\n                            - space_dedupe\n                            - trim\n                            # Save content to the variable &quot;fieldname&quot;\n                            - variable_set: fieldname\n                        # Switch to td to get field data\n                        - find:\n                            path: td\n                            do:\n                            # Parse text content\n                            - parse\n                            # Normalize parsed data, dedupe and trim whitespace\n                            - space_dedupe\n                            - trim\n                            # Save data to the field defined by name kept in the &quot;fieldname&quot; variable of the &quot;item&quot; object\n                            - object_field_set:\n                                object: item\n                                field: <%fieldname%>\n                # Find the script element with phoneToken (we need to search the whole document, as the current block does not contain this script tag)\n                - find:\n                    in: doc\n                    path: script:contains(&quot;phoneToken&quot;)\n                    do:\n                    # Parse only token using regular expression\n                    - parse:\n                        filter: \&#039;([^&#039;]+)\&#039;\n                    # Save value to the variable\n                    - variable_set: token\n                # Find the &quot;Show phone&quot; button\n                - find:\n                    path: li.link-phone\n                    
do:\n                    # Parse ID of the ad\n                    - parse:\n                        attr: class\n                        filter: \&#039;id\&#039;\:\&#039;([^&#039;]+)\&#039;\n                    # Save value to the variable\n                    - variable_set: id\n                    # Pause for a random interval of 5 to 10 seconds\n                    - sleep: 5:10\n                    # Send request to the server\n                    - walk:\n                        to: https:\/\/www.olx.ua\/uk\/ajax\/misc\/contact\/phone\/<%id%>\/?pt=<%token%>\n                        headers:\n                            accept: &#039;*\/*&#039;\n                            accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7\n                            x-requested-with: XMLHttpRequest\n                        do:\n                        # Exit here to see the HTML content we are getting from the server\n                        - exit\n                # Save object item to the dataset\n                - object_save:\n                    name: item\n<\/code><\/pre>\n<p>If we run the scraper in debug mode, the log shows that the server returns the following structure:<\/p>\n<pre><code class=\"language-html\"><html><head><\/head><body><body_safe>\n<body_safe>\n<value>067-XXX-XX-XX<\/value>\n<\/body_safe>\n<\/body_safe><\/body><\/html>\n<\/code><\/pre>\n<p>So, to collect the phone number, we can use the <code>body_safe > value<\/code> CSS selector. 
Let\u2019s add it to our web scraper:<\/p>\n<pre><code class=\"language-yaml\">---\nconfig:\n    agent: Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/71.0.3578.98 Safari\/537.36\n    debug: 2\ndo:\n# Push start URL to the pool\n- link_add:\n    url: https:\/\/www.olx.ua\/detskiy-mir\/detskaya-odezhda\/dnepr\/\n# Iterate over the pool and load each link\n- walk:\n    to: links\n    do:\n    # Find the link to the next page\n    - find:\n        path: a[data-cy=&quot;page-link-next&quot;]\n        do:\n        # Parse link from href attribute\n        - parse:\n            attr: href\n        # Add it to the pool\n        - link_add\n    # Find a link to an ad\n    - find:\n        path: a.link.detailsLink\n        do:\n        # Parse URL from href attribute\n        - parse:\n            attr: href\n        - variable_set:\n            field: repeat\n            value: &quot;yes&quot;\n        # Load page with the ad\n        - walk:\n            to: value\n            repeat: <%repeat%>\n            do:\n            - variable_clear: ok\n            # Find common container for the ad\n            - find:\n                path: &#039;div#offer_active&#039;\n                do:\n                - variable_set:\n                    field: ok\n                    value: 1\n                # Create data object with name item\n                - object_new: item\n                # Find element with ad title\n                - find:\n                    path: h1\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: title\n                # Find element with ad description\n        
        - find:\n                    path: &#039;div#textContent&#039;\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: description\n                # Find element with ad ID\n                - find:\n                    path: &#039;em > small&#039;\n                    do:\n                    # Parse text content using the filter. Since the ID consists of digits only, we apply a filter to extract only digits.\n                    - parse:\n                        filter: (\\d+)\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: ad_id\n                # Find element with ad date and time\n                - find:\n                    path: &#039;em&#039;\n                    do:\n                    # Remove nodes with irrelevant information\n                    - node_remove: a,small\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Remove trailing comma\n                    - normalize:\n                        routine: replace_substring\n                        args:\n                            \\,$: &#039;&#039;\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: date\n                # Find element with ad price\n                - find:\n                    path: div.price-label\n       
             do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: price\n                # Find element with seller name\n                - find:\n                    path: div.offer-user__details > h4\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: seller\n                # Find element with address\n                - find:\n                    path: address > p\n                    do:\n                    # Parse text content\n                    - parse\n                    # Normalize parsed data, dedupe and trim whitespace\n                    - space_dedupe\n                    - trim\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        field: address\n                # Find element with image\n                - find:\n                    path: div#photo-gallery-opener > img\n                    do:\n                    # Parse content of src attribute and filter it to cut the end with size\n                    - parse:\n                        attr: src\n                        filter: ^([^;]+)\n                    # Save data to the item data object field\n                    - object_field_set:\n                        object: item\n                        
field: image\n                # Let&#039;s also save the ad URL to the data object\n                # we will use content of static variable &quot;url&quot; for it\n                - static_get: url\n                # Save data to the item data object field\n                - object_field_set:\n                    object: item\n                    field: url\n                # Now let&#039;s get data from the table with item details\n                - find:\n                    path: table.details\n                    do:\n                    # Find all table rows that have a child cell with class &quot;value&quot;\n                    - find:\n                        path: tr:haschild(td.value)\n                        do:\n                        # Switch to th to get field name\n                        - find:\n                            path: th\n                            do:\n                            # Parse text content\n                            - parse\n                            # Normalize parsed data, dedupe and trim whitespace\n                            - space_dedupe\n                            - trim\n                            # Save content to the variable &quot;fieldname&quot;\n                            - variable_set: fieldname\n                        # Switch to td to get field data\n                        - find:\n                            path: td\n                            do:\n                            # Parse text content\n                            - parse\n                            # Normalize parsed data, dedupe and trim whitespace\n                            - space_dedupe\n                            - trim\n                            # Save data to the field defined by name kept in the &quot;fieldname&quot; variable of the &quot;item&quot; object\n                            - object_field_set:\n                                object: item\n                                field: <%fieldname%>\n                # 
Find the script element with phoneToken (we need to search the whole document, as the current block does not contain this script tag)\n                - find:\n                    in: doc\n                    path: script:contains(&quot;phoneToken&quot;)\n                    do:\n                    # Parse only token using regular expression\n                    - parse:\n                        filter: \&#039;([^&#039;]+)\&#039;\n                    # Save value to the variable\n                    - variable_set: token\n                # Find the &quot;Show phone&quot; button\n                - find:\n                    path: li.link-phone\n                    do:\n                    # Parse ID of the ad\n                    - parse:\n                        attr: class\n                        filter: \&#039;id\&#039;\:\&#039;([^&#039;]+)\&#039;\n                    # Save value to the variable\n                    - variable_set: id\n                    # Pause for a random interval of 5 to 10 seconds\n                    - sleep: 5:10\n                    # Send request to the server\n                    - walk:\n                        to: https:\/\/www.olx.ua\/uk\/ajax\/misc\/contact\/phone\/<%id%>\/?pt=<%token%>\n                        headers:\n                            accept: &#039;*\/*&#039;\n                            accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7\n                            x-requested-with: XMLHttpRequest\n                        do:\n                        # Find element with phone number\n                        - find:\n                            path: body_safe > value\n                            do:\n                            # Parse text\n                            - parse\n                            # Save data to the item data object field\n                            - object_field_set:\n                                object: item\n                                field: phone\n                # Save object 
item to the dataset\n                - object_save:\n                    name: item\n                - cookie_reset\n            - find:\n                path: body\n                do:\n                - variable_get: ok\n                - if:\n                    match: 1\n                    do:\n                    - variable_clear: repeat\n                    else:\n                    - error: Proxy is banned or page layout has been changed\n                    - cookie_reset\n                    - proxy_switch\n    - cookie_reset\n<\/code><\/pre>\n<p>The scraper works well on the OLX Ukraine website and collects all the data we need. It can also work on other OLX sites. For example, to make it work on the OLX Kazakhstan website, you need to:<\/p>\n<ol>\n<li>Change the starting URL on line 8: <a href=\"https:\/\/www.olx.kz\/kk\/moda-i-stil\/odezhda\/\">https:\/\/www.olx.kz\/kk\/moda-i-stil\/odezhda\/<\/a><\/li>\n<li>Change the URL on line 210 (used to pick up the phone number): https:\/\/www.olx.kz\/kk\/ajax\/misc\/contact\/phone\/&lt;%id%&gt;\/?pt=&lt;%token%&gt;<\/li>\n<\/ol>","protected":false},"excerpt":{"rendered":"<p>Probably many people know what OLX is. In Russia, the company was absorbed by Avito. However, OLX still exists in many other countries: Ukraine, Poland, Kazakhstan, and many others. A complete list of countries can be found on the main site OLX. Modified on 01 Feb 2020. 
Added functionality to bypass cases when proxy is [&hellip;]<\/p>","protected":false},"author":4,"featured_media":759,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[31,30,9,2],"tags":[],"class_list":["post-750","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ecommerce-scraping","category-free-scrapers","category-learning-meta-language","category-web-scraping"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/750","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/comments?post=750"}],"version-history":[{"count":6,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/750\/revisions"}],"predecessor-version":[{"id":834,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/750\/revisions\/834"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media\/759"}],"wp:attachment":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media?parent=750"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/categories?post=750"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/tags?post=750"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}