How to scrape pages with infinite scroll: extracting data from Instagram

Infinite scroll on a webpage is driven by JavaScript. Therefore, to find out which URL we need to access and which parameters to use, we can either thoroughly study the JS code that runs on the page or, preferably, examine the requests the browser makes when you scroll down the page. We can study these requests using the Developer Tools built into all modern browsers. In this article we will use Google Chrome, but you can use any other browser; just keep in mind that the developer tools may look slightly different in different browsers.

We will walk through the task using Instagram as an example, specifically the official Instagram channel. Open this page in the browser and run Chrome Dev Tools, the developer tools built into Google Chrome. To do it, right-click anywhere on the page and select the “Inspect” option, or press “Ctrl + Shift + I”:

Scraping Instagram: turning on Dev Tools

It opens the tool window, where we go to the Network tab and select only XHR requests in the filters. We do this to filter out requests we don’t need. After that, reload the page using the Reload button in the browser interface or the “F5” key on the keyboard.

Scraping Instagram: setting up Dev Tools

Now let’s scroll down the page several times with the mouse wheel to trigger content loading. Every time we scroll to the bottom of the page, JS makes an XHR request to the server, receives the data and adds it to the page. As a result, we should have several requests in the list that look almost the same. Most likely this is what we are looking for.

Scraping Instagram: looking for specific XHR requests

To make sure, click on one of the requests and go to the Preview tab in the newly opened panel. There we can see the formatted content that the server returns to the browser for this request. Let’s drill down to one of the leaf elements in the tree and confirm that it contains data about the images we see on the page.

Scraping Instagram: making sure we found right requests

After making sure that these are the requests we need, let’s look at one of them more carefully. To do it, go to the Headers tab. There we can find which URL is used for the request, what type of request it is (POST or GET), and which parameters are passed with it.

Scraping Instagram: finding request parameters

It’s better to check the query string parameters in the Query String Parameters section. To see it, scroll the pane down to the bottom:

Scraping Instagram: checking query string

As a result of our analysis we get the following:

Request URL: https://www.instagram.com/graphql/query/
Request type: GET
Query string parameters: query_hash and variables

Evidently, some static id is passed as query_hash; it is either generated by JS or exists somewhere on the page, in a cookie, or in a JS file. The parameters that define exactly what you get from the server are passed in JSON format as the variables query parameter.

Let’s run a small experiment and take the URL, with its parameters, that was used to load the data:

https://www.instagram.com/graphql/query/?query_hash=472f257a40c653c64c666ce877d59d2b&variables=%7B%22id%22%3A%2225025320%22%2C%22first%22%3A12%2C%22after%22%3A%22AQAzEauY26BEUyDxOz9NhBP2gjLbTTD3OD1ajDxZIHvldwFwboiBnIcglaL6Kb_yDssRABBoUDdIls5V8unGC86hC2qk_IeLFUcH2QPTrY3f4A%22%7D

Paste it into the address bar of the browser and press Enter. We’ll see the data loaded in JSON format:

Scraping Instagram: data feed
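
The same experiment can be reproduced outside the browser. Below is a minimal Python sketch (a hypothetical illustration using the requests library, not part of the original walkthrough) that fetches the same feed URL and prints the beginning of the JSON response. Keep in mind that Instagram may require extra headers or a logged-in session, so this only illustrates the request format:

    import json
    import requests

    # Feed URL captured in Dev Tools: query_hash and variables are passed as query parameters.
    FEED_URL = (
        "https://www.instagram.com/graphql/query/"
        "?query_hash=472f257a40c653c64c666ce877d59d2b"
        "&variables=%7B%22id%22%3A%2225025320%22%2C%22first%22%3A12%2C%22after%22%3A"
        "%22AQAzEauY26BEUyDxOz9NhBP2gjLbTTD3OD1ajDxZIHvldwFwboiBnIcglaL6Kb_yDssRABBoUDdIls5V8unGC86hC2qk_IeLFUcH2QPTrY3f4A%22%7D"
    )

    # A browser-like User-Agent makes the request resemble the one we saw in Dev Tools.
    response = requests.get(FEED_URL, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()

    # Print the first part of the JSON feed, just like the Preview tab shows it.
    print(json.dumps(response.json(), indent=2)[:1000])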

Now we need to understand where query_hash comes from. If we go to the Elements tab and try to find (CTRL + F) our query_hash 472f257a40c653c64c666ce877d59d2b, we discover that it doesn’t exist on the page itself, which means it is loaded or generated somewhere in the JavaScript code, or comes with cookies. So we go back to the Network tab and set the filter to JS, which shows only requests for JS files. Browsing the requests one by one, we search for our id in the loaded JS files: click on a request, open the Response tab in the panel that appears to see the JS content, and search for our id (CTRL + F). After several unsuccessful attempts, we find that our id is in the following JS file:

https://www.instagram.com/static/bundles/ProfilePageContainer.js/031ac4860b53.js

and the code fragment that surrounds the id looks like this:

So, to get query_hash we need to:

  1. Load the main channel page
  2. Find the URL of the file whose filename contains ProfilePageContainer.js
  3. Extract this URL
  4. Load the JS file
  5. Parse out the id we need
  6. Write it into a variable for later use (a sketch of these steps follows below).
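
The article implements these steps in the Diggernaut meta-language below, but the same logic can be sketched in plain Python. The regular expressions here are assumptions about how the bundle URL and the hash appear in the markup and in the JS file (bundles of that era typically stored it as queryId:"..."), so adjust them to whatever you actually see in Dev Tools:

    import re
    import requests

    CHANNEL_URL = "https://www.instagram.com/instagram/"
    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"

    # Steps 1-3: load the main channel page and extract the URL of the ProfilePageContainer.js bundle.
    html = session.get(CHANNEL_URL).text
    bundle_path = re.search(r'(?:src|href)="([^"]*ProfilePageContainer\.js[^"]*)"', html).group(1)
    bundle_url = "https://www.instagram.com" + bundle_path

    # Steps 4-5: load the JS file and parse out the hash (assumed to appear as queryId:"<32 hex chars>").
    bundle_js = session.get(bundle_url).text
    query_hash = re.search(r'queryId:"([0-9a-f]{32})"', bundle_js).group(1)

    # Step 6: keep it for later use.
    print("query_hash =", query_hash)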

Now let’s see what data is passed as the variables parameter:
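
The variables value is simply URL-encoded JSON. Decoding it from the request URL above (a quick Python check, shown just for illustration) reveals the structure:

    import json
    from urllib.parse import unquote

    encoded = ("%7B%22id%22%3A%2225025320%22%2C%22first%22%3A12%2C%22after%22%3A"
               "%22AQAzEauY26BEUyDxOz9NhBP2gjLbTTD3OD1ajDxZIHvldwFwboiBnIcglaL6Kb_yDssRABBoUDdIls5V8unGC86hC2qk_IeLFUcH2QPTrY3f4A%22%7D")

    variables = json.loads(unquote(encoded))
    print(variables)
    # -> {'id': '25025320', 'first': 12, 'after': 'AQAzEauY26BEUyDxOz9NhBP2gjLbTTD3...'}  (after shortened here)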

If we analyze all the XHR requests that load data, we find that only the after parameter changes. Therefore id is most likely the id of the channel, which we need to extract from somewhere, first is the number of records the server should return, and after is evidently the marker of the last record shown.

We need to find where we can extract the channel id from. The first thing to do is look for the text 25025320 in the source code of the main channel page. Let’s go to the Elements tab and search (CTRL + F) for our id. We will find that it exists in a JSON structure on the page itself, so we can easily extract it:

Scraping Instagram: JSON structure with data
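
In code, the same extraction can be done by pulling that JSON structure out of the page source. Below is a hedged Python sketch: at the time of writing Instagram embedded this data in a window._sharedData assignment inside a script tag, and the path to the user id inside it is an assumption you should verify against what you see in the Elements tab:

    import json
    import re
    import requests

    html = requests.get("https://www.instagram.com/instagram/",
                        headers={"User-Agent": "Mozilla/5.0"}).text

    # Assumption: the profile data sits in a JSON assignment like
    # <script>window._sharedData = {...};</script>
    raw = re.search(r"window\._sharedData\s*=\s*(\{.*?\});</script>", html, re.S).group(1)
    shared_data = json.loads(raw)

    # Assumed path to the channel id; confirm it by browsing the JSON in Dev Tools.
    user = shared_data["entry_data"]["ProfilePage"][0]["graphql"]["user"]
    print("channel id:", user["id"])  # e.g. 25025320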

Everything seems clear now, but where do we get the after value for each subsequent load? It is very easy: since it changes with every load, it is most likely returned with the data feed itself. Let’s load the data feed in the browser again and look at the data more carefully:

https://www.instagram.com/graphql/query/?query_hash=472f257a40c653c64c666ce877d59d2b&variables=%7B%22id%22%3A%2225025320%22%2C%22first%22%3A12%2C%22after%22%3A%22AQAzEauY26BEUyDxOz9NhBP2gjLbTTD3OD1ajDxZIHvldwFwboiBnIcglaL6Kb_yDssRABBoUDdIls5V8unGC86hC2qk_IeLFUcH2QPTrY3f4A%22%7D

We will see the following data structure there:

where end_cursor looks like what we are looking for. There is also a has_next_page field, which is very handy: it lets us stop loading data feeds when there is no more data available.
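
In code terms, each response already contains everything needed to build the next request. Here is a small hedged helper; the path data → user → edge_owner_to_timeline_media is an assumption based on how the feed looked at the time of writing:

    import json

    def next_variables(feed_json, channel_id="25025320", page_size=12):
        """Build the variables for the next feed request, or return None when there is no more data."""
        # Assumed location of the paging block inside the feed response.
        page_info = feed_json["data"]["user"]["edge_owner_to_timeline_media"]["page_info"]
        if not page_info["has_next_page"]:
            return None
        return json.dumps({"id": channel_id, "first": page_size, "after": page_info["end_cursor"]})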

Now we’ll write the first part of our scraper: it loads the main channel page and tries to load the JS file with query_hash. Create a digger in your Diggernaut account and add the following configuration to it:

Set the digger to Debug mode, run the scraper, and check the log when the job is done. At the end of the log we will see how Diggernaut works with JS files: it converts them into the following structure:

So the CSS selector for all JS script content will be script. Let’s add the query_hash parsing function:

Let’s save our digger configuration and run it again. Wait until it finishes the job and check the log again. In the log we will see the following line:

Set variable queryid to register value: 472f257a40c653c64c666ce877d59d2b

It means that query_hash was successfully extracted and written to the variable named queryid.

Now we need to extract the channel id. As you remember, it is in the JSON object on the page itself. So we need to parse the contents of a specific script element, pull the JSON out of it, convert it to XML, and take the value we need using a CSS selector.

If you look closely at the log, you will see that the JSON structure is transformed into an XML DOM like this:

This will help us build CSS selectors to get the first 12 records and the last-record marker, which will be used in the after parameter of the request we send to the server to get the next 12 records. Let’s write the logic for data extraction, and let’s also use a pool of links: we will iterate over the links in this pool and keep adding the next page URL to it. Once the next 12 records are loaded we will stop and see how the loaded JSON is transformed to XML, so we can build CSS selectors for the data we want to extract.

Again, save the configuration and run the digger. Wait for completion and check the log. You should see the following structure for the loaded JSON with the next 12 records:

We shortened the output on purpose, removing all record data except the first record. Since the structure is the same for all records, we only need to see the structure of one record in order to build CSS selectors. Now we can define the scraping logic for all the fields we need. We will also limit the number of loads, let’s say to 10, and add a pause for less aggressive scraping. As a result, we get the final version of our Instagram scraper.
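
For readers who want to mirror the same logic outside Diggernaut, here is a hedged end-to-end Python sketch of that final scraper. The field names inside each record (display_url, shortcode, the caption edge) are assumptions based on the feed structure at the time of writing, so check them against the JSON you actually receive:

    import json
    import time
    import requests

    BASE_URL = "https://www.instagram.com/graphql/query/"
    QUERY_HASH = "472f257a40c653c64c666ce877d59d2b"  # extracted earlier from the JS bundle
    CHANNEL_ID = "25025320"                          # extracted earlier from the page JSON
    MAX_LOADS = 10                                   # limit the number of feed loads

    session = requests.Session()
    session.headers["User-Agent"] = "Mozilla/5.0"

    records, after = [], None
    for _ in range(MAX_LOADS):
        variables = {"id": CHANNEL_ID, "first": 12}
        if after:
            variables["after"] = after
        feed = session.get(BASE_URL, params={"query_hash": QUERY_HASH,
                                             "variables": json.dumps(variables)}).json()

        media = feed["data"]["user"]["edge_owner_to_timeline_media"]  # assumed path
        for edge in media["edges"]:
            node = edge["node"]
            captions = node.get("edge_media_to_caption", {}).get("edges", [])
            records.append({
                "shortcode": node.get("shortcode"),
                "image": node.get("display_url"),
                "caption": captions[0]["node"]["text"] if captions else "",
            })

        if not media["page_info"]["has_next_page"]:  # stop when there is no more data
            break
        after = media["page_info"]["end_cursor"]     # marker of the last record shown
        time.sleep(2)                                # pause for less aggressive scraping

    print(json.dumps(records[:3], indent=2))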

We hope this article helps you learn the meta-language, and that you can now solve tasks that require scraping pages with infinite scroll without problems.

Happy Scraping!
