{"id":405,"date":"2018-02-12T09:39:55","date_gmt":"2018-02-12T09:39:55","guid":{"rendered":"https:\/\/www.diggernaut.com\/blog\/?p=405"},"modified":"2020-02-06T06:07:50","modified_gmt":"2020-02-06T06:07:50","slug":"how-to-scrape-pages-infinite-scroll-extracting-data-from-instagram","status":"publish","type":"post","link":"https:\/\/www.diggernaut.com\/blog\/how-to-scrape-pages-infinite-scroll-extracting-data-from-instagram\/","title":{"rendered":"How to scrape pages with infinite scroll: extracting data with Instagram scraper"},"content":{"rendered":"<p>In this article, we will show you how to scrape data from websites with infinite scroll in general, and how to build an Instagram scraper. All you need is your browser, a free account on the Diggernaut platform, and your head and hands.<\/p>\n<p><strong>Updated on 01.10.2020<\/strong><\/p>\n<p>Infinite scroll on a web page relies on JavaScript. Therefore, to find out what URL we need to access and what parameters to use, we need to either thoroughly study the JS code that runs on the page or, preferably, examine the requests that the browser makes when you scroll down the page. We can study requests using the Developer Tools, which are built into all modern browsers. In this article, we are going to use Google Chrome, but you can use any other browser. Just keep in mind that the developer tools may look different in different browsers.<\/p>\n<p>We will use the official <a href=\"https:\/\/www.instagram.com\/instagram\/\">Instagram<\/a> channel. Open this page in the browser and open Chrome Dev Tools, the developer tools built into Google Chrome. 
To do it, you need to right-click anywhere on the page and select the &#8220;Inspect&#8221; option or press &#8220;Ctrl + Shift + I&#8221;:<\/p>\n<figure id=\"attachment_mmd_407\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram1_en.jpg\"><img width=\"902\" height=\"645\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram1_en.jpg\" class=\"attachment-full size-full\" alt=\"Scraping Instagram: turning on Dev Tools\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram1_en.jpg 902w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram1_en-768x549.jpg 768w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram1_en-211x150.jpg 211w\" sizes=\"auto, (max-width: 902px) 100vw, 902px\" \/><\/a><\/figure>\n<p>It will open the tool window, where we go to the Network tab, and in the filters, we select only XHR requests. We do it to filter out requests we don&#8217;t need. 
After that, reload the page in the browser using the Reload button in the browser interface or the &#8220;F5&#8221; key on the keyboard.<\/p>\n<figure id=\"attachment_mmd_408\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram2_en.jpg\"><img width=\"1442\" height=\"452\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram2_en.jpg\" class=\"attachment-full size-full\" alt=\"Scraping Instagram: setting up Dev Tools\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram2_en.jpg 1442w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram2_en-768x241.jpg 768w\" sizes=\"auto, (max-width: 1442px) 100vw, 1442px\" \/><\/a><\/figure>\n<p>Now let&#8217;s scroll down the page several times with the mouse wheel. This triggers content loading: whenever we reach the bottom of the page, JS makes an XHR request to the server, receives the data, and adds it to the page. As a result, we should have several requests in the list that look almost the same. 
Most likely, they are what we are looking for.<\/p>\n<figure id=\"attachment_mmd_409\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram3.jpg\"><img width=\"1920\" height=\"353\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram3.jpg\" class=\"attachment-full size-full\" alt=\"Scraping Instagram: looking for specific XHR requests\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram3.jpg 1920w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram3-768x141.jpg 768w\" sizes=\"auto, (max-width: 1920px) 100vw, 1920px\" \/><\/a><\/figure>\n<p>To make sure, click on one of the requests and go to the Preview tab in the newly opened panel. There we can see the formatted content that the server returns to the browser for this request. Let&#8217;s expand the tree down to one of the leaf elements and make sure that it contains data about the images we see on the page.<\/p>\n<figure id=\"attachment_mmd_410\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram4_en.jpg\"><img width=\"1916\" height=\"554\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram4_en.jpg\" class=\"attachment-full size-full\" alt=\"Scraping Instagram: making sure we found right requests\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram4_en.jpg 1916w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram4_en-768x222.jpg 768w\" sizes=\"auto, (max-width: 1916px) 100vw, 1916px\" \/><\/a><\/figure>\n<p>After making sure that these are the requests we need, let&#8217;s look at one of them more carefully. To do it, go to the Headers tab. 
There, we can see which URL is used to make the request, what type of request it is (POST or GET), and which parameters are passed with it.<\/p>\n<figure id=\"attachment_mmd_411\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram5_en.jpg\"><img width=\"1919\" height=\"339\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram5_en.jpg\" class=\"attachment-full size-full\" alt=\"Scraping Instagram: finding request parameters\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram5_en.jpg 1919w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram5_en-768x136.jpg 768w\" sizes=\"auto, (max-width: 1919px) 100vw, 1919px\" \/><\/a><\/figure>\n<p>It&#8217;s better to check the query string parameters in the Query String Parameters section. To see it, scroll down to the bottom of the pane:<\/p>\n<figure id=\"attachment_mmd_413\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram6_fix.jpg\"><img width=\"1465\" height=\"113\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram6_fix.jpg\" class=\"attachment-full size-full\" alt=\"Scraping Instagram: checking query string\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram6_fix.jpg 1465w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram6_fix-768x59.jpg 768w\" sizes=\"auto, (max-width: 1465px) 100vw, 1465px\" \/><\/a><\/figure>\n<p>As a result of our analysis, we get the following:<\/p>\n<p><strong>Request URL:<\/strong> https:\/\/www.instagram.com\/graphql\/query\/<br>\n<strong>Request type:<\/strong> GET<br>\n<strong>Query string parameters:<\/strong> query_hash and 
variables<\/p>\n<p>Obviously, some static id is passed as query_hash; it is either generated by JS or exists somewhere on the page, in a cookie, or in some JS file. The parameters that define exactly what you get from the server are passed in JSON format as the variables query parameter.<\/p>\n<p>Now we need to understand where query_hash comes from. If we go to the Elements tab and try to find (CTRL + F) our query_hash <strong>e769aa130647d2354c40ea6a439bfc08<\/strong>, we find out that it doesn&#8217;t exist on the page itself, which means that it is loaded or generated somewhere in the JavaScript code, or comes with cookies. Therefore, go back to the Network tab and set the filter to JS, so that we see only requests for JS files. Going through the requests one by one, we search for our id in the loaded JS files: just click on the request, then open the Response tab in the opened panel to see the content of the JS file and search for our id (CTRL + F). After several unsuccessful attempts, we find that our id is in the following JS file:<\/p>\n<p>https:\/\/www.instagram.com\/static\/bundles\/ProfilePageContainer.js\/031ac4860b53.js<\/p>\n<p>and the code fragment that surrounds the id looks like this:<\/p>\n<pre><code class=\"language-js\">profilePosts.byUserId.get(n))||void 0===s?void 0:s.pagination},queryId:&quot;e769aa130647d2354c40ea6a439bfc08&quot;,queryParams\n<\/code><\/pre>\n<p>So, to get query_hash, we need to:<\/p>\n<ol>\n<li>Load the main channel page<\/li>\n<li>Find the URL of the file whose filename contains <em>ProfilePageContainer.js<\/em><\/li>\n<li>Extract this URL<\/li>\n<li>Load the JS file<\/li>\n<li>Parse the id we need<\/li>\n<li>Write it into a variable for later use<\/li>\n<\/ol>\n<p>Now let&#8217;s see what data is passed as the variables parameter:<\/p>\n<pre><code 
class=\"language-js\">{&quot;id&quot;:&quot;25025320&quot;,&quot;first&quot;:12,&quot;after&quot;:&quot;AQAzEauY26BEUyDxOz9NhBP2gjLbTTD3OD1ajDxZIHvldwFwboiBnIcglaL6Kb_yDssRABBoUDdIls5V8unGC86hC2qk_IeLFUcH2QPTrY3f4A&quot;}\n<\/code><\/pre>\n<p>If we analyze all XHR requests that load data, we find that only the <em>after<\/em> parameter changes. Therefore, <em>id<\/em> is most likely the id of the channel, which we need to extract from somewhere, <em>first<\/em> is the number of records that the server should return, and <em>after<\/em> is the id of the last record shown.<\/p>\n<p>We need to find where we can extract the channel id. So the first thing we do is look for the text <strong>25025320<\/strong> in the source code of the main channel page. Let&#8217;s go to the Elements tab and do a search (CTRL + F) for our id. We find that it exists in the JSON structure on the page itself, and we can easily extract it:<\/p>\n<figure id=\"attachment_mmd_418\" class=\"wp-block-image alignnone\"><a href=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram8.jpg\"><img width=\"1920\" height=\"443\" src=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram8.jpg\" class=\"attachment-full size-full\" alt=\"Scraping Instagram: JSON structure with data\" decoding=\"async\" loading=\"lazy\" align=\"none\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram8.jpg 1920w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2018\/02\/instagram8-768x177.jpg 768w\" sizes=\"auto, (max-width: 1920px) 100vw, 1920px\" \/><\/a><\/figure>\n<p>It seems everything is clear, but where do we get the <em>after<\/em> value for each subsequent load? It is straightforward: since it changes with each new load, it most likely comes with the data feed. 
Let&#8217;s look at the loaded data again more carefully:<\/p>\n<p>We will see the following data structure there:<\/p>\n<pre><code class=\"language-js\">data: {\n    user: {\n        edge_owner_to_timeline_media: {\n            count: 5014,\n            page_info: {\n                has_next_page: true,\n                end_cursor: &quot;AQCCoEpYvQtj0-NgbaQUg9g4ffOJf8drV2RieFJw1RA3E9lDoc8euxXjeuwlUEtXB6CRS9Zs2ZGJcNKseKF9f6b0cN0VC3ck8rnTfOw5q8nlJw&quot;\n            }\n        }\n    }\n}\n<\/code><\/pre>\n<p>Here, <em>end_cursor<\/em> looks like what we are looking for. There is also a <em>has_next_page<\/em> field, which is handy: it lets us stop loading data feeds when no more data is available.<\/p>\n<p>Now let&#8217;s write the first part of our Instagram scraper: load the main channel page and try to load the JS file with query_hash. Create a digger in your Diggernaut account and add the following configuration to it:<\/p>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">---\nconfig:\n    agent: Firefox\n    debug: 2\ndo:\n# Load main channel page\n- walk:\n    to: https:\/\/www.instagram.com\/instagram\/\n    do:\n    # Find all elements that loads Javascript files\n    - find:\n        path: script[type=&quot;text\/javascript&quot;]\n        do:\n        # Parse value in the src attribute\n        - parse:\n            attr: src\n        # Check if filename contains ProfilePageContainer.js string\n        - if:\n            match: ProfilePageContainer\\.js\n            do:\n            # If check is true, load JS file\n            - walk:\n                to: value\n                do:<\/code><\/pre>\n<p>Set the digger to Debug mode. Now run the Instagram scraper and, when the job is done, check the log. At the end of the log, we can see how Diggernaut handles JS files. 
It converts them into the following structure:<\/p>\n<pre><code class=\"language-html\">&lt;html&gt;\n  &lt;head&gt;&lt;\/head&gt;\n    &lt;body&gt;\n      &lt;body_safe&gt;\n          &lt;script&gt;\n              ... JS code will be located here\n                &lt;\/script&gt;\n      &lt;\/body_safe&gt;\n    &lt;\/body&gt;\n&lt;\/html&gt;\n<\/code><\/pre>\n<p>So the CSS selector for all JS script content will be <strong>script<\/strong>. Let&#8217;s add the query_hash parsing function:<\/p>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">---\nconfig:\n    agent: Firefox\n    debug: 2\ndo:\n# Load main channel page\n- walk:\n    to: https:\/\/www.instagram.com\/instagram\/\n    do:\n    # Find all elements that loads Javascript files\n    - find:\n        path: script[type=&quot;text\/javascript&quot;]\n        do:\n        # Parse value in the src attribute\n        - parse:\n            attr: src\n        # Check if filename contains ProfilePageContainer.js string\n        - if:\n            match: ProfilePageContainer\\.js\n            do:\n            # If check is true, load JS file\n            - walk:\n                to: value\n                do:\n                # Find element with JS content\n                - find:\n                    path: script\n                    do:\n                    # Parse content of the block and apply regular expression filter to extract only query_hash\n                    - parse:\n                        filter: profilePosts\\.byUserId\\.get[^,]+,queryId\\:\\&amp;\\s*quot\\;([^&amp;]+)\\&amp;\\s*quot\\;\n                    # Set extracted value to the variable queryid\n                    - variable_set: queryid<\/code><\/pre>\n<p>Let&#8217;s save our digger configuration and rerun it. Wait until it finishes the job and recheck the log. 
In the log we see the following line:<\/p>\n<p><em>Set variable queryid to register value: df16f80848b2de5a3ca9495d781f98df<\/em><\/p>\n<p>It means that query_hash was successfully extracted and written to the variable named queryid.<\/p>\n<p>Now we need to extract the channel id. As you remember, it is in the JSON object on the page itself. So we need to parse the contents of a specific <strong>script<\/strong> element, pull JSON out of there, convert it to XML, and take the value we need using the CSS selector.<\/p>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">---\nconfig:\n    agent: Firefox\n    debug: 2\ndo:\n# Load main channel page\n- walk:\n    to: https:\/\/www.instagram.com\/instagram\/\n    do:\n    # Find all elements that loads Javascript files\n    - find:\n        path: script[type=&quot;text\/javascript&quot;]\n        do:\n        # Parse value in the src attribute\n        - parse:\n            attr: src\n        # Check if filename contains ProfilePageContainer.js string\n        - if:\n            match: ProfilePageContainer\\.js\n            do:\n            # If check is true, load JS file\n            - walk:\n                to: value\n                do:\n                # Find element with JS content\n                - find:\n                    path: script\n                    do:\n                    # Parse content of the block and apply regular expression filter to extract only query_hash\n                    - parse:\n                        filter: profilePosts\\.byUserId\\.get[^,]+,queryId\\:\\&amp;\\s*quot\\;([^&amp;]+)\\&amp;\\s*quot\\;\n                    # Set extracted value to the variable queryid\n                    - variable_set: queryid\n    # Find element script, which contains string window._sharedData\n    - find:\n        path: script:contains(&quot;window._sharedData&quot;)\n        do:\n        # Parse only JSON content\n        - parse:\n            filter: 
window\\._sharedData\\s+\\=\\s+(.+)\\s*;\\s*$\n        # Convert JSON to XML\n        - normalize:\n            routine: json2xml\n        # Convert XML content of the register to the block\n        - to_block\n        # Find elements where channel id is kept\n        - find:\n            path: entry_data &gt; profilepage &gt; graphql &gt; user &gt; id\n            do:\n            # Parse content of the current block\n            - parse\n            # Set parsed value to the variable chid\n            - variable_set: chid<\/code><\/pre>\n<p>If you look closely at the log, you can see that the JSON structure is transformed into an XML DOM like this:<\/p>\n<pre><code class=\"language-html\">&lt;body_safe&gt;\n    &lt;activity_counts&gt;&lt;\/activity_counts&gt;\n    &lt;config&gt;\n        &lt;csrf_token&gt;qNVodzmebd0ZnAEOYxFCPpMV1XWGEaDz&lt;\/csrf_token&gt;\n        &lt;viewer&gt;&lt;\/viewer&gt;\n    &lt;\/config&gt;\n    &lt;country_code&gt;US&lt;\/country_code&gt;\n    &lt;display_properties_server_guess&gt;\n        &lt;orientation&gt;&lt;\/orientation&gt;\n        &lt;pixel_ratio&gt;1.5&lt;\/pixel_ratio&gt;\n        &lt;viewport_height&gt;480&lt;\/viewport_height&gt;\n        &lt;viewport_width&gt;360&lt;\/viewport_width&gt;\n    &lt;\/display_properties_server_guess&gt;\n    &lt;entry_data&gt;\n        &lt;profilepage&gt;\n            &lt;logging_page_id&gt;profilePage_25025320&lt;\/logging_page_id&gt;\n            &lt;graphql&gt;\n                &lt;user&gt;\n                    &lt;biography&gt;Discovering &mdash; and telling &mdash; stories from around the world. 
Curated by Instagram&rsquo;s community\n                        team.&lt;\/biography&gt;\n                    &lt;blocked_by_viewer&gt;false&lt;\/blocked_by_viewer&gt;\n                    &lt;connected_fb_page&gt;&lt;\/connected_fb_page&gt;\n                    &lt;country_block&gt;false&lt;\/country_block&gt;\n                    &lt;external_url&gt;http:\/\/blog.instagram.com\/&lt;\/external_url&gt;\n                    &lt;external_url_linkshimmed&gt;http:\/\/l.instagram.com\/?u=http%3A%2F%2Fblog.instagram.com%2F&amp;e=ATM_VrrL-_PjBU0WJ0OT_xPSlo-70w2PtE177ZsbPuLY9tmVs8JmIXfYgban04z423i2IL8M&lt;\/external_url_linkshimmed&gt;\n                    &lt;followed_by&gt;\n                            &lt;count&gt;230937095&lt;\/count&gt;\n                    &lt;\/followed_by&gt;\n                    &lt;followed_by_viewer&gt;false&lt;\/followed_by_viewer&gt;\n                    &lt;follows&gt;\n                            &lt;count&gt;197&lt;\/count&gt;\n                    &lt;\/follows&gt;\n                    &lt;follows_viewer&gt;false&lt;\/follows_viewer&gt;\n                    &lt;full_name&gt;Instagram&lt;\/full_name&gt;\n                    &lt;has_blocked_viewer&gt;false&lt;\/has_blocked_viewer&gt;\n                    &lt;has_requested_viewer&gt;false&lt;\/has_requested_viewer&gt;\n                    &lt;id&gt;25025320&lt;\/id&gt;\n                    &lt;is_private&gt;false&lt;\/is_private&gt;\n                    &lt;is_verified&gt;true&lt;\/is_verified&gt;\n                    &lt;edge_owner_to_timeline_media&gt;\n                        &lt;count&gt;5014&lt;\/count&gt;\n                        &lt;edges&gt;\n                            &lt;node&gt;\n                                &lt;safe___typename&gt;GraphVideo&lt;\/safe___typename&gt;\n                                &lt;comments_disabled&gt;false&lt;\/comments_disabled&gt;\n                                &lt;dimensions&gt;\n                                        
&lt;height&gt;607&lt;\/height&gt;\n                                        &lt;width&gt;1080&lt;\/width&gt;\n                                &lt;\/dimensions&gt;\n                                &lt;display_url&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/9cdd0906e30590eed4ad793888595629\/5A5F5679\/t51.2885-15\/s1080x1080\/e15\/fr\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/display_url&gt;\n                                &lt;edge_media_preview_like&gt;\n                                        &lt;count&gt;573448&lt;\/count&gt;\n                                &lt;\/edge_media_preview_like&gt;\n                                &lt;edge_media_to_caption&gt;\n                                        &lt;edges&gt;\n                                                &lt;node&gt;\n                                                        &lt;text&gt;Video by @yanndixon Spontaneous by nature,\n                                                                a flock of starlings swarm as one\n                                                                at sunset in England. 
#WHPspontaneous&lt;\/text&gt;\n                                                &lt;\/node&gt;\n                                        &lt;\/edges&gt;\n                                &lt;\/edge_media_to_caption&gt;\n                                &lt;edge_media_to_comment&gt;\n                                        &lt;count&gt;4709&lt;\/count&gt;\n                                &lt;\/edge_media_to_comment&gt;\n                                &lt;id&gt;1688175842423510712&lt;\/id&gt;\n                                &lt;is_video&gt;true&lt;\/is_video&gt;\n                                &lt;owner&gt;\n                                        &lt;id&gt;25025320&lt;\/id&gt;\n                                &lt;\/owner&gt;\n                                &lt;shortcode&gt;Bdtmvv-DJa4&lt;\/shortcode&gt;\n                                &lt;taken_at_timestamp&gt;1515466361&lt;\/taken_at_timestamp&gt;\n                                &lt;thumbnail_resources&gt;\n                                        &lt;config_height&gt;150&lt;\/config_height&gt;\n                                        &lt;config_width&gt;150&lt;\/config_width&gt;\n                                        &lt;src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/1ec5640a0a97e98127a1a04f1be62b6b\/5A5F436E\/t51.2885-15\/s150x150\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/src&gt;\n                                &lt;\/thumbnail_resources&gt;\n                                &lt;thumbnail_resources&gt;\n                                        &lt;config_height&gt;240&lt;\/config_height&gt;\n                                        &lt;config_width&gt;240&lt;\/config_width&gt;\n                                        &lt;src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/8c972cdacf536ea7bc6764279f3801b3\/5A5EF038\/t51.2885-15\/s240x240\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/src&gt;\n                                
&lt;\/thumbnail_resources&gt;\n                                &lt;thumbnail_resources&gt;\n                                        &lt;config_height&gt;320&lt;\/config_height&gt;\n                                        &lt;config_width&gt;320&lt;\/config_width&gt;\n                                        &lt;src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/a74e8d0f933bffe75b28af3092f12769\/5A5EFC3E\/t51.2885-15\/s320x320\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/src&gt;\n                                &lt;\/thumbnail_resources&gt;\n                                &lt;thumbnail_resources&gt;\n                                        &lt;config_height&gt;480&lt;\/config_height&gt;\n                                        &lt;config_width&gt;480&lt;\/config_width&gt;\n                                        &lt;src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/59790fbcf0a358521f5eb81ec48de4a6\/5A5F4F4D\/t51.2885-15\/s480x480\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/src&gt;\n                                &lt;\/thumbnail_resources&gt;\n                                &lt;thumbnail_resources&gt;\n                                        &lt;config_height&gt;640&lt;\/config_height&gt;\n                                        &lt;config_width&gt;640&lt;\/config_width&gt;\n                                        &lt;src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/556243558c189f5dfff4081ecfdf06cc\/5A5F43E1\/t51.2885-15\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/src&gt;\n                                &lt;\/thumbnail_resources&gt;\n                                &lt;thumbnail_src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/556243558c189f5dfff4081ecfdf06cc\/5A5F43E1\/t51.2885-15\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/thumbnail_src&gt;\n                                
&lt;video_view_count&gt;2516274&lt;\/video_view_count&gt;\n                            &lt;\/node&gt;\n                        &lt;\/edges&gt;\n                        ...\n                        &lt;page_info&gt;\n                                &lt;end_cursor&gt;AQAchf_lNcgUmnCZ0JTwqV_p3J0f-N21HeHzR2xplwxalNZDXg9tNmrBCzkegX1lN53ROI_HVoUZBPtdxZLuDyvUsYdNoLRb2-z6HMtJoTXRYQ&lt;\/end_cursor&gt;\n                                &lt;has_next_page&gt;true&lt;\/has_next_page&gt;\n                        &lt;\/page_info&gt;\n                    &lt;\/edge_owner_to_timeline_media&gt;\n                &lt;\/user&gt;\n            &lt;\/graphql&gt;\n        &lt;\/profilepage&gt;\n    &lt;\/entry_data&gt;\n    &lt;rollout_hash&gt;45ca3dc3d5fd&lt;\/rollout_hash&gt;\n    &lt;show_app_install&gt;true&lt;\/show_app_install&gt;\n    &lt;zero_data&gt;&lt;\/zero_data&gt;\n&lt;\/body_safe&gt;\n<\/code><\/pre>\n<p>This helps us build CSS selectors to get the first 12 records and the last record marker, which we are going to use as the <em>after<\/em> parameter in the request we send to the server to get the next 12 records. Let&#8217;s write the logic for data extraction, and let&#8217;s also use a pool of links. We are going to iterate over the links in this pool and keep adding the next page URL to it. Once the next 12 records are loaded, we will stop and see how the loaded JSON is transformed to XML, so we can build CSS selectors for the data we want to extract.<\/p>\n<p>Recently, Instagram changed its public API. Authorization is no longer done with the CSRF token but with a special signature, which is calculated using the new rhx_gis parameter passed in the sharedData object on the page. You can work out the algorithm by studying Instagram&#8217;s JS code. Essentially, it is just an MD5 sum of several parameters joined into a single string. So we use this algorithm to sign requests automatically. 
To do it, we need to extract the rhx_gis parameter.<\/p>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">---\nconfig:\n    agent: Firefox\n    debug: 2\ndo:\n# Load main channel page\n- walk:\n    to: https:\/\/www.instagram.com\/instagram\/\n    do:\n    # Find all elements that loads Javascript files\n    - find:\n        path: script[type=&quot;text\/javascript&quot;]\n        do:\n        # Parse value in the src attribute\n        - parse:\n            attr: src\n        # Check if filename contains ProfilePageContainer.js string\n        - if:\n            match: ProfilePageContainer\\.js\n            do:\n            # If check is true, load JS file\n            - walk:\n                to: value\n                do:\n                # Find element with JS content\n                - find:\n                    path: script\n                    do:\n                    # Parse content of the block and apply regular expression filter to extract only query_hash\n                    - parse:\n                        filter: profilePosts\\.byUserId\\.get[^,]+,queryId\\:\\&amp;\\s*quot\\;([^&amp;]+)\\&amp;\\s*quot\\;\n                    # Set extracted value to the variable queryid\n                    - variable_set: queryid\n    # Find element script, which contains string window._sharedData\n    - find:\n        path: script:contains(&quot;window._sharedData&quot;)\n        do:\n        - parse\n        - space_dedupe\n        - trim\n        # extracting JSON\n        - filter: \n            args: window\\._sharedData\\s+\\=\\s+(.+)\\s*;\\s*$\n        # Convert JSON to XML\n        - normalize:\n            routine: json2xml\n        # Convert XML content of the register to the block\n        - to_block\n        - exit\n        # Find elements where channel id is kept\n        - find:\n            path: entry_data &gt; profilepage &gt; graphql &gt; user &gt; id\n            do:\n            # Parse content of the current block\n        
    - parse\n            # Set parsed value to the variable chid\n            - variable_set: chid\n        # Find elements where rhx_gis is kept\n        - find:\n            path: rhx_gis\n            do:\n            # Parse content of the current block\n            - parse\n            # Set parsed value to the variable rhxgis\n            - variable_set: rhxgis\n        # Find record elements and iterate over them\n        - find:\n            path: entry_data &gt; profilepage &gt; graphql &gt; user &gt; edge_owner_to_timeline_media &gt; edges &gt; node\n            do:\n            # Create new object named item\n            - object_new: item\n            # Find element with URL of image\n            - find:\n                path: display_url\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: url\n            # Find element with record description\n            - find:\n                path: edge_media_to_caption &gt; edges &gt; node &gt; text\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: caption\n            # Find element indicating if record is video or not\n            - find:\n                path: is_video\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: video\n            # Find element with number of comments\n            - find:\n                path: edge_media_to_comment &gt; count\n                do:\n                # Parse content\n                - parse\n                # Save 
value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: comments\n            # Find element with number of likes\n            - find:\n                path: edge_media_preview_like &gt; count\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: likes\n            # Save object item to the DB\n            - object_save:\n                name: item\n        # Find element where next page data kept\n        - find:\n            path: entry_data &gt; profilepage &gt; graphql &gt; user &gt; edge_owner_to_timeline_media &gt; page_info\n            do:\n            # Find element indicating if there is next page\n            - find:\n                path: has_next_page\n                do:\n                # parse content\n                - parse\n                # Save value to the variable\n                - variable_set: hnp\n            # Read variable hnp to the register\n            - variable_get: hnp\n            # Check if value is &#039;true&#039;\n            - if:\n                match: &#039;true&#039;\n                do:\n                # If yes, then find element with last shown record marker\n                - find:\n                    path: end_cursor\n                    do:\n                    # Parse content\n                    - parse\n                    # Save value to the variable with name cursor\n                    - variable_set: cursor\n                    # Apply URL-encode for the value (since it may contain characters not allowed in the URL)\n                    - eval:\n                        routine: js\n                        body: &#039;(function () {return encodeURIComponent(&quot;&quot;)})();&#039;\n                    # Save value to the variable with name 
cursor\n                    - variable_set: cursor_encoded\n                    # Form pool of links and add first link to this pool\n                    - link_add:\n                        url: https:\/\/www.instagram.com\/graphql\/query\/?query_hash=&amp;variables=%7B%22id%22%3A%22%22%2C%22first%22%3A12%2C%22after%22%3A%22%22%7D\n                    # Calculate signature\n                    - register_set: &#039;:{&quot;id&quot;:&quot;&quot;,&quot;first&quot;:12,&quot;after&quot;:&quot;&quot;}&#039;\n                    - normalize:\n                        routine: md5\n                    - variable_set: signature\n    # Set counter for number of loads to 0\n    - counter_set:\n        name: pages\n        value: 0\n    # Iterate over the pool and load current URL using signature in request header\n    - walk:\n        to: links\n        headers:\n            x-instagram-gis: \n            x-requested-with: XMLHttpRequest\n        do:<\/code><\/pre>\n<p>Again, save the configuration and run the digger. Wait for completion and check the log. 
You should see the following structure for the loaded JSON with the next 12 records:<\/p>\n<pre><code class=\"language-html\">&lt;html&gt;\n\n&lt;head&gt;&lt;\/head&gt;\n\n&lt;body&gt;\n&lt;body_safe&gt;\n    &lt;activity_counts&gt;&lt;\/activity_counts&gt;\n    &lt;config&gt;\n        &lt;csrf_token&gt;qNVodzmebd0ZnAEOYxFCPpMV1XWGEaDz&lt;\/csrf_token&gt;\n        &lt;viewer&gt;&lt;\/viewer&gt;\n    &lt;\/config&gt;\n    &lt;country_code&gt;US&lt;\/country_code&gt;\n    &lt;display_properties_server_guess&gt;\n        &lt;orientation&gt;&lt;\/orientation&gt;\n        &lt;pixel_ratio&gt;1.5&lt;\/pixel_ratio&gt;\n        &lt;viewport_height&gt;480&lt;\/viewport_height&gt;\n        &lt;viewport_width&gt;360&lt;\/viewport_width&gt;\n    &lt;\/display_properties_server_guess&gt;\n    &lt;entry_data&gt;\n        &lt;profilepage&gt;\n            &lt;logging_page_id&gt;profilePage_25025320&lt;\/logging_page_id&gt;\n            &lt;graphql&gt;\n                &lt;user&gt;\n                    &lt;biography&gt;Discovering &mdash; and telling &mdash; stories from around the world. 
Curated by Instagram&rsquo;s community\n                        team.&lt;\/biography&gt;\n                    &lt;blocked_by_viewer&gt;false&lt;\/blocked_by_viewer&gt;\n                    &lt;connected_fb_page&gt;&lt;\/connected_fb_page&gt;\n                    &lt;country_block&gt;false&lt;\/country_block&gt;\n                    &lt;external_url&gt;http:\/\/blog.instagram.com\/&lt;\/external_url&gt;\n                    &lt;external_url_linkshimmed&gt;http:\/\/l.instagram.com\/?u=http%3A%2F%2Fblog.instagram.com%2F&amp;e=ATM_VrrL-_PjBU0WJ0OT_xPSlo-70w2PtE177ZsbPuLY9tmVs8JmIXfYgban04z423i2IL8M&lt;\/external_url_linkshimmed&gt;\n                    &lt;followed_by&gt;\n                            &lt;count&gt;230937095&lt;\/count&gt;\n                    &lt;\/followed_by&gt;\n                    &lt;followed_by_viewer&gt;false&lt;\/followed_by_viewer&gt;\n                    &lt;follows&gt;\n                            &lt;count&gt;197&lt;\/count&gt;\n                    &lt;\/follows&gt;\n                    &lt;follows_viewer&gt;false&lt;\/follows_viewer&gt;\n                    &lt;full_name&gt;Instagram&lt;\/full_name&gt;\n                    &lt;has_blocked_viewer&gt;false&lt;\/has_blocked_viewer&gt;\n                    &lt;has_requested_viewer&gt;false&lt;\/has_requested_viewer&gt;\n                    &lt;id&gt;25025320&lt;\/id&gt;\n                    &lt;is_private&gt;false&lt;\/is_private&gt;\n                    &lt;is_verified&gt;true&lt;\/is_verified&gt;\n                    &lt;edge_owner_to_timeline_media&gt;\n                        &lt;count&gt;5014&lt;\/count&gt;\n                        &lt;edges&gt;\n                            &lt;node&gt;\n                                &lt;safe___typename&gt;GraphVideo&lt;\/safe___typename&gt;\n                                &lt;comments_disabled&gt;false&lt;\/comments_disabled&gt;\n                                &lt;dimensions&gt;\n                                        
&lt;height&gt;607&lt;\/height&gt;\n                                        &lt;width&gt;1080&lt;\/width&gt;\n                                &lt;\/dimensions&gt;\n                                &lt;display_url&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/9cdd0906e30590eed4ad793888595629\/5A5F5679\/t51.2885-15\/s1080x1080\/e15\/fr\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/display_url&gt;\n                                &lt;edge_media_preview_like&gt;\n                                        &lt;count&gt;573448&lt;\/count&gt;\n                                &lt;\/edge_media_preview_like&gt;\n                                &lt;edge_media_to_caption&gt;\n                                        &lt;edges&gt;\n                                                &lt;node&gt;\n                                                        &lt;text&gt;Video by @yanndixon Spontaneous by nature,\n                                                                a flock of starlings swarm as one\n                                                                at sunset in England. 
#WHPspontaneous&lt;\/text&gt;\n                                                &lt;\/node&gt;\n                                        &lt;\/edges&gt;\n                                &lt;\/edge_media_to_caption&gt;\n                                &lt;edge_media_to_comment&gt;\n                                        &lt;count&gt;4709&lt;\/count&gt;\n                                &lt;\/edge_media_to_comment&gt;\n                                &lt;id&gt;1688175842423510712&lt;\/id&gt;\n                                &lt;is_video&gt;true&lt;\/is_video&gt;\n                                &lt;owner&gt;\n                                        &lt;id&gt;25025320&lt;\/id&gt;\n                                &lt;\/owner&gt;\n                                &lt;shortcode&gt;Bdtmvv-DJa4&lt;\/shortcode&gt;\n                                &lt;taken_at_timestamp&gt;1515466361&lt;\/taken_at_timestamp&gt;\n                                &lt;thumbnail_resources&gt;\n                                        &lt;config_height&gt;150&lt;\/config_height&gt;\n                                        &lt;config_width&gt;150&lt;\/config_width&gt;\n                                        &lt;src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/1ec5640a0a97e98127a1a04f1be62b6b\/5A5F436E\/t51.2885-15\/s150x150\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/src&gt;\n                                &lt;\/thumbnail_resources&gt;\n                                &lt;thumbnail_resources&gt;\n                                        &lt;config_height&gt;240&lt;\/config_height&gt;\n                                        &lt;config_width&gt;240&lt;\/config_width&gt;\n                                        &lt;src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/8c972cdacf536ea7bc6764279f3801b3\/5A5EF038\/t51.2885-15\/s240x240\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/src&gt;\n                                
&lt;\/thumbnail_resources&gt;\n                                &lt;thumbnail_resources&gt;\n                                        &lt;config_height&gt;320&lt;\/config_height&gt;\n                                        &lt;config_width&gt;320&lt;\/config_width&gt;\n                                        &lt;src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/a74e8d0f933bffe75b28af3092f12769\/5A5EFC3E\/t51.2885-15\/s320x320\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/src&gt;\n                                &lt;\/thumbnail_resources&gt;\n                                &lt;thumbnail_resources&gt;\n                                        &lt;config_height&gt;480&lt;\/config_height&gt;\n                                        &lt;config_width&gt;480&lt;\/config_width&gt;\n                                        &lt;src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/59790fbcf0a358521f5eb81ec48de4a6\/5A5F4F4D\/t51.2885-15\/s480x480\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/src&gt;\n                                &lt;\/thumbnail_resources&gt;\n                                &lt;thumbnail_resources&gt;\n                                        &lt;config_height&gt;640&lt;\/config_height&gt;\n                                        &lt;config_width&gt;640&lt;\/config_width&gt;\n                                        &lt;src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/556243558c189f5dfff4081ecfdf06cc\/5A5F43E1\/t51.2885-15\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/src&gt;\n                                &lt;\/thumbnail_resources&gt;\n                                &lt;thumbnail_src&gt;https:\/\/scontent-iad3-1.cdninstagram.com\/vp\/556243558c189f5dfff4081ecfdf06cc\/5A5F43E1\/t51.2885-15\/e15\/c236.0.607.607\/26158234_2061044554178629_8867446855789707264_n.jpg&lt;\/thumbnail_src&gt;\n                                
&lt;video_view_count&gt;2516274&lt;\/video_view_count&gt;\n                            &lt;\/node&gt;\n                        &lt;\/edges&gt;\n                        ...\n                        &lt;page_info&gt;\n                                &lt;end_cursor&gt;AQAchf_lNcgUmnCZ0JTwqV_p3J0f-N21HeHzR2xplwxalNZDXg9tNmrBCzkegX1lN53ROI_HVoUZBPtdxZLuDyvUsYdNoLRb2-z6HMtJoTXRYQ&lt;\/end_cursor&gt;\n                                &lt;has_next_page&gt;true&lt;\/has_next_page&gt;\n                        &lt;\/page_info&gt;\n                    &lt;\/edge_owner_to_timeline_media&gt;\n                &lt;\/user&gt;\n            &lt;\/graphql&gt;\n        &lt;\/profilepage&gt;\n    &lt;\/entry_data&gt;\n    &lt;rollout_hash&gt;45ca3dc3d5fd&lt;\/rollout_hash&gt;\n    &lt;show_app_install&gt;true&lt;\/show_app_install&gt;\n    &lt;zero_data&gt;&lt;\/zero_data&gt;\n&lt;\/body_safe&gt;\n&lt;\/body&gt;\n&lt;\/html&gt;\n<\/code><\/pre>\n<p>We shortened the source on purpose by removing all records from the data except the first one. Since the structure is the same for all records, seeing one record is enough to build CSS selectors. Now we can define the scraping logic for all the fields we need. We are also going to limit the number of loads, let&#8217;s say to 10, and add a pause for less aggressive scraping. 
As a result, we get the final version of our Instagram scraper.<\/p>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">---\nconfig:\n    agent: Firefox\n    debug: 2\ndo:\n# Load main channel page\n- walk:\n    to: https:\/\/www.instagram.com\/instagram\/\n    do:\n    # Find all elements that loads Javascript files\n    - find:\n        path: script[type=&quot;text\/javascript&quot;]\n        do:\n        # Parse value in the src attribute\n        - parse:\n            attr: src\n        # Check if filename contains ProfilePageContainer.js string\n        - if:\n            match: ProfilePageContainer\\.js\n            do:\n            # If check is true, load JS file\n            - walk:\n                to: value\n                do:\n                # Find element with JS content\n                - find:\n                    path: script\n                    do:\n                    # Parse content of the block and apply regular expression filter to extract only query_hash\n                    - parse:\n                        filter: profilePosts\\.byUserId\\.get[^,]+,queryId\\:\\&amp;\\s*quot\\;([^&amp;]+)\\&amp;\\s*quot\\;\n                    # Set extracted value to the variable queryid\n                    - variable_set: queryid\n    # Find element script, which contains string window._sharedData\n    - find:\n        path: script:contains(&quot;window._sharedData&quot;)\n        do:\n        - parse\n        - space_dedupe\n        - trim\n        # extracting JSON\n        - filter: \n            args: window\\._sharedData\\s+\\=\\s+(.+)\\s*;\\s*$\n        # Convert JSON to XML\n        - normalize:\n            routine: json2xml\n        # Convert XML content of the register to the block\n        - to_block\n        # Find elements where channel id is kept\n        - find:\n            path: entry_data &gt; profilepage &gt; graphql &gt; user &gt; id\n            do:\n            # Parse content of the current block\n            
- parse\n            # Set parsed value to the variable chid\n            - variable_set: chid\n        # Find elements where rhx_gis is kept\n        - find:\n            path: rhx_gis\n            do:\n            # Parse content of the current block\n            - parse\n            # Set parsed value to the variable rhxgis\n            - variable_set: rhxgis\n        # Find record elements and iterate over them\n        - find:\n            path: entry_data &gt; profilepage &gt; graphql &gt; user &gt; edge_owner_to_timeline_media &gt; edges &gt; node\n            do:\n            # Create new object named item\n            - object_new: item\n            # Find element with URL of image\n            - find:\n                path: display_url\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: url\n            # Find element with record description\n            - find:\n                path: edge_media_to_caption &gt; edges &gt; node &gt; text\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: caption\n            # Find element indicating if record is video or not\n            - find:\n                path: is_video\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: video\n            # Find element with number of comments\n            - find:\n                path: edge_media_to_comment &gt; count\n                do:\n                # Parse content\n                - parse\n                # Save value 
to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: comments\n            # Find element with number of likes\n            - find:\n                path: edge_media_preview_like &gt; count\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: likes\n            # Save object item to the DB\n            - object_save:\n                name: item\n        # Find element where next page data kept\n        - find:\n            path: entry_data &gt; profilepage &gt; graphql &gt; user &gt; edge_owner_to_timeline_media &gt; page_info\n            do:\n            # Find element indicating if there is next page\n            - find:\n                path: has_next_page\n                do:\n                # parse content\n                - parse\n                # Save value to the variable\n                - variable_set: hnp\n            # Read variable hnp to the register\n            - variable_get: hnp\n            # Check if value is &#039;true&#039;\n            - if:\n                match: &#039;true&#039;\n                do:\n                # If yes, then find element with last shown record marker\n                - find:\n                    path: end_cursor\n                    do:\n                    # Parse content\n                    - parse\n                    # Save value to the variable with name cursor\n                    - variable_set: cursor\n                    # Apply URL-encode for the value (since it may contain characters not allowed in the URL)\n                    - eval:\n                        routine: js\n                        body: &#039;(function () {return encodeURIComponent(&quot;&quot;)})();&#039;\n                    # Save value to the variable with name cursor\n    
                - variable_set: cursor_encoded\n                    # Form pool of links and add first link to this pool\n                    - link_add:\n                        url: https:\/\/www.instagram.com\/graphql\/query\/?query_hash=&amp;variables=%7B%22id%22%3A%22%22%2C%22first%22%3A12%2C%22after%22%3A%22%22%7D\n                    # Calculate signature\n                    - register_set: &#039;:{&quot;id&quot;:&quot;&quot;,&quot;first&quot;:12,&quot;after&quot;:&quot;&quot;}&#039;\n                    - normalize:\n                        routine: md5\n                    - variable_set: signature\n    # Set counter for number of loads to 0\n    - counter_set:\n        name: pages\n        value: 0\n    # Iterate over the pool and load current URL using signature in request header\n    - walk:\n        to: links\n        headers:\n            x-instagram-gis: \n            x-requested-with: XMLHttpRequest\n        do:\n        - sleep: 3\n        # Find element that hold data used for loading next page\n        - find:\n            path: edge_owner_to_timeline_media &gt; page_info\n            do:\n            # Find element indicating if there is next page available\n            - find:\n                path: has_next_page\n                do:\n                # Parse content\n                - parse\n                # Save value to the variable\n                - variable_set: hnp\n            # Read the variable to the register\n            - variable_get: hnp\n            # Check if value is &#039;true&#039;\n            - if:\n                match: &#039;true&#039;\n                do:\n                # If yes, check loads counter if its greater than 10\n                - counter_get: pages\n                - if:\n                    type: int\n                    gt: 10\n                    else:\n                    # If not, find element with the cursor\n                    - find:\n                        path: end_cursor\n                     
   do:\n                        # Parse content\n                        - parse\n                        # Save value to the variable\n                        - variable_set: cursor\n                        # Doing URL-encode\n                        - eval:\n                            routine: js\n                            body: &#039;(function () {return encodeURIComponent(&quot;&quot;)})();&#039;\n                        # Save value to the variable\n                        - variable_set: cursor_encoded\n                        # Add next page URL to the links pool\n                        - link_add:\n                            url: https:\/\/www.instagram.com\/graphql\/query\/?query_hash=&amp;variables=%7B%22id%22%3A%22%22%2C%22first%22%3A12%2C%22after%22%3A%22%22%7D\n                        # Calculate signature\n                        - register_set: &#039;:{&quot;id&quot;:&quot;&quot;,&quot;first&quot;:12,&quot;after&quot;:&quot;&quot;}&#039;\n                        - normalize:\n                            routine: md5\n                        - variable_set: signature\n        # Find record elements and iterate over them\n        - find:\n            path: edge_owner_to_timeline_media &gt; edges &gt; node\n            do:\n            # Create object with name item\n            - object_new: item\n            # Find element with URL of image\n            - find:\n                path: display_url\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: url\n            # Find element with record description\n            - find:\n                path: edge_media_to_caption &gt; edges &gt; node &gt; text\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - 
object_field_set:\n                    object: item\n                    field: caption\n            # Find element indicating if record is video or not\n            - find:\n                path: is_video\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: video\n            # Find element with number of comments\n            - find:\n                path: edge_media_to_comment &gt; count\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: comments\n            # Find element with number of likes\n            - find:\n                path: edge_media_preview_like &gt; count\n                do:\n                # Parse content\n                - parse\n                # Save value to the field of the object item\n                - object_field_set:\n                    object: item\n                    field: likes\n            # Save object item to the DB\n            - object_save:\n                name: item\n        # Increment loads counter by 1\n        - counter_increment:\n            name: pages\n            by: 1<\/code><\/pre>\n<p>We hope this article helps you learn the meta-language and that you can now handle tasks that require parsing pages with infinite scroll.<\/p>\n<p>Happy Scraping!<\/p>","protected":false},"excerpt":{"rendered":"<p>In this article, we will teach you how generally you can scrape the data from the websites with infinite scroll and build the Instagram scraper. All you will need to do it is just your browser, a free account on the Diggernaut platform, and your head and hands. 
Updated on 01.10.2020 Infinite scroll on the [&hellip;]<\/p>","protected":false},"author":4,"featured_media":424,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[32,9,2],"tags":[],"class_list":["post-405","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-codeproject","category-learning-meta-language","category-web-scraping"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/405","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/comments?post=405"}],"version-history":[{"count":38,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/405\/revisions"}],"predecessor-version":[{"id":836,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/405\/revisions\/836"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media\/424"}],"wp:attachment":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media?parent=405"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/categories?post=405"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/tags?post=405"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}