How to scrape pages with infinite scroll: extracting data from Instagram

Updated on 31.10.2018. Infinite scroll on a webpage is driven by JavaScript. Therefore, to find out which URL we need to access and which parameters to use, we have to either thoroughly study the JS code that runs on the page or, preferably, examine the requests the browser makes when you scroll down the page. We can study these requests using the Developer Tools built into all modern browsers. In this article we will use Google Chrome, but you can use any other browser; just keep in mind that the developer tools may look different in different browsers.

We will study the task using Instagram as an example, specifically the official Instagram channel. Open this page in the browser and launch Chrome Dev Tools, the developer tools built into Google Chrome. To do it, right-click anywhere on the page and select the “Inspect” option, or press Ctrl + Shift + I:

Scraping Instagram: turning on Dev Tools

This opens the tools window, where we go to the Network tab and, in the filters, select only XHR requests. We do this to filter out the requests we don’t need. After that, reload the page using the Reload button in the browser interface or the F5 key on the keyboard.

Scraping Instagram: setting up Dev Tools

Now let’s scroll down the page several times with the mouse wheel; this causes more content to load. Every time we scroll down to the bottom of the page, JS makes an XHR request to the server, receives the data, and adds it to the page. As a result, we should have several requests in the list that look almost the same. Most likely, this is what we are looking for.

Scraping Instagram: looking for specific XHR requests

To make sure, click on one of the requests and, in the newly opened panel, go to the Preview tab. There we can see the formatted content that the server returns to the browser for this request. Drill down to one of the leaf elements in the tree and confirm that it contains the data about the images we see on the page.

Scraping Instagram: making sure we found right requests

After confirming that these are the requests we need, let’s look at one of them more closely. To do it, go to the Headers tab. There we can find the URL used for the request, the request type (POST or GET), and the parameters passed with the request.

Scraping Instagram: finding request parameters

It’s better to check the query string parameters in the Query String Parameters section. To see it, scroll the pane down to the bottom:

Scraping Instagram: checking query string

As a result of our analysis, we get the following:

Request URL: https://www.instagram.com/graphql/query/
Request type: GET
Query string parameters: query_hash and variables

Obviously, some static id is passed as query_hash; it is either generated by JS or already present on the page, in a cookie, or in some JS file. The parameters that define what exactly you get from the server are passed in JSON format as the variables query parameter.

Let’s run a small experiment and take the URL with the parameters that was used to load the data (please NOTE that you currently cannot retrieve the feed this way, as it now requires a special parameter to be sent in the headers):

https://www.instagram.com/graphql/query/?query_hash=df16f80848b2de5a3ca9495d781f98df&variables=%7B%22id%22%3A%2225025320%22%2C%22first%22%3A12%2C%22after%22%3A%22AQDsbvCEthjsp_O_8UO9vPTHKy6Qea2H_RRxe7v46B2XKXhSYVTv8FLSDk0BxmXqLw_T1R9aB8DB51Kp2hp80mP51bKdG9Ahy4eKWT9h3QplzA%22%7D

Paste it into the address bar of the browser and press Enter. We’ll see the page load in JSON format:

Scraping Instagram: data feed
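
Before digging further, here is a minimal Python sketch (standard library only) that decodes the variables parameter of that long URL back into readable JSON, so you can see exactly what is being sent:

```python
import json
from urllib.parse import urlparse, parse_qs

url = ("https://www.instagram.com/graphql/query/"
       "?query_hash=df16f80848b2de5a3ca9495d781f98df"
       "&variables=%7B%22id%22%3A%2225025320%22%2C%22first%22%3A12%2C%22after"
       "%22%3A%22AQDsbvCEthjsp_O_8UO9vPTHKy6Qea2H_RRxe7v46B2XKXhSYVTv8FLSDk0B"
       "xmXqLw_T1R9aB8DB51Kp2hp80mP51bKdG9Ahy4eKWT9h3QplzA%22%7D")

# parse_qs percent-decodes the query string for us.
params = parse_qs(urlparse(url).query)
print(params["query_hash"][0])
print(json.dumps(json.loads(params["variables"][0]), indent=2))
# -> {"id": "25025320", "first": 12, "after": "AQDsbvCEthjsp..."}
```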

Now we need to understand where query_hash comes from. If we go to the Elements tab and try to find (Ctrl + F) our query_hash df16f80848b2de5a3ca9495d781f98df, we discover that it doesn’t exist on the page itself, which means it is loaded or generated somewhere in the JavaScript code, or comes with cookies. Therefore, let’s go back to the Network tab and set the filter to JS, so that we see only requests for JS files. Browsing request by request, we search for our id in the loaded JS files: click on a request, then open the Response tab in the opened panel to see the content of the JS, and search for our id (Ctrl + F). After several unsuccessful attempts, we find that our id is in the following JS file:

https://www.instagram.com/static/bundles/ProfilePageContainer.js/031ac4860b53.js

and the code fragment that surrounds the id looks like this:

So, to get query_hash we need to:

  1. Load the main channel page
  2. Find the URL of the file whose filename contains ProfilePageContainer.js
  3. Extract this URL
  4. Load the JS file
  5. Parse out the id we need
  6. Write it into a variable for later use (see the sketch after this list)
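
Outside of Diggernaut, the same six steps can be prototyped in Python with requests and two regular expressions. Both patterns are assumptions based on the markup and the JS fragment described above, so adjust them if Instagram’s pages change:

```python
import re
import requests

BASE = "https://www.instagram.com"

# Step 1: load the main channel page.
html = requests.get(BASE + "/instagram/").text

# Steps 2-3: find the script tag whose src contains "ProfilePageContainer.js"
# and extract its URL (the src is root-relative on the page).
src = re.search(r'<script[^>]+src="([^"]*ProfilePageContainer\.js[^"]*)"', html)
js_url = BASE + src.group(1)

# Step 4: load the JS file itself.
js_code = requests.get(js_url).text

# Steps 5-6: pull out the 32-character hex id and keep it for later.
# The queryId:"..." context is an assumption about the surrounding code.
query_hash = re.search(r'queryId:"([0-9a-f]{32})"', js_code).group(1)
print(query_hash)
```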

Now let’s see what data is passed in the variables parameter:

If we analyze all the XHR requests that load data, we find that only the after parameter changes. Therefore, id is most likely the id of the channel (which we need to extract from somewhere), first is the number of records the server should return, and after is obviously the id of the last record shown.

We need to find where we can extract the channel id from. The first thing to do is search for the text 25025320 in the source code of the main channel page. Let’s go to the Elements tab and do a search (Ctrl + F) for our id. We find that it exists in a JSON structure on the page itself, and we can easily extract it:

Scraping Instagram: JSON structure with data

It seems everything is clear, but where do we get the after value for each subsequent data load? It is very easy: since it changes with each new load, it is most likely delivered with the data feed itself. Let’s load the data feed into the browser again and look at the data more carefully:

https://www.instagram.com/graphql/query/?query_hash=df16f80848b2de5a3ca9495d781f98df&variables=%7B%22id%22%3A%2225025320%22%2C%22first%22%3A12%2C%22after%22%3A%22AQAzEauY26BEUyDxOz9NhBP2gjLbTTD3OD1ajDxZIHvldwFwboiBnIcglaL6Kb_yDssRABBoUDdIls5V8unGC86hC2qk_IeLFUcH2QPTrY3f4A%22%7D

We will see the following data structure there:

where end_cursor looks like exactly what we need. There is also a has_next_page field, which is very handy: it lets us stop loading data feeds once no more data is available.
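
In Python terms, picking up the cursor from one feed response could look like the helper below. The JSON path data.user.edge_owner_to_timeline_media reflects the structure the feed had at the time of writing and is an assumption here:

```python
def next_cursor(feed_json):
    """Return the cursor for the next page, or None when there is no next page.

    feed_json is the parsed JSON of one GraphQL feed response.
    """
    page_info = feed_json["data"]["user"]["edge_owner_to_timeline_media"]["page_info"]
    return page_info["end_cursor"] if page_info["has_next_page"] else None
```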

Now let’s write the beginning of our scraper: load the main channel page and try to load the JS file with query_hash. Create a digger in your Diggernaut account and add the following configuration to it:

Set the digger to Debug mode. Now run the scraper, and when the job is done, check the log. At the end of the log we will see how Diggernaut works with JS files: it converts them into the following structure:

So the CSS selector for the content of any JS script will be script. Let’s add the query_hash parsing function:

Let’s save our digger configuration and run it again. Wait until it finishes the job and check the log again. In the log we will see the following line:

Set variable queryid to register value: df16f80848b2de5a3ca9495d781f98df

It means that query_hash was successfully extracted and written to the variable named queryid.

Now we need to extract the channel id. As you remember, it sits in a JSON object on the page itself. So we need to parse the contents of a specific script element, pull the JSON out of it, convert it to XML, and take the value we need using a CSS selector.
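
To prototype this step outside Diggernaut, a Python sketch might read the id from the window._sharedData object that Instagram pages carried at the time of writing; both the object name and the JSON path below are assumptions based on that structure:

```python
import json
import re
import requests

html = requests.get("https://www.instagram.com/instagram/").text

# The page embeds its state as "window._sharedData = {...};" in a script tag.
shared = re.search(r"window\._sharedData\s*=\s*(\{.*?\});</script>", html, re.S)
data = json.loads(shared.group(1))

# Assumed path to the channel id within the embedded JSON.
channel_id = data["entry_data"]["ProfilePage"][0]["graphql"]["user"]["id"]
print(channel_id)  # e.g. 25025320
```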

If you look closely at the log, you will see that the JSON structure is transformed into an XML DOM like this:

This will help us build CSS selectors to get the first 12 records and the marker of the last record, which will be used as the after parameter in the request we send to the server for the next 12 records. Let’s write the logic for data extraction and also use a pool of links: we will iterate over the links in this pool and keep adding the next page URL to it. Once the next 12 records are loaded, we will stop and see how the loaded JSON is transformed to XML, so we can build CSS selectors for the data we want to extract.

Most recently, Instagram made changes to its public API. Authorization is now done not with a CSRF token but with a special signature, which is calculated using the new rhx_gis parameter passed in the sharedData object on the page. You can learn the exact algorithm by studying Instagram’s JS; basically, it is just an MD5 sum of a few parameters joined into a single string. So we use this algorithm and sign requests automatically. To do it, we need to extract the rhx_gis parameter.
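
The article doesn’t spell the recipe out, but at the time this signature was commonly described as the MD5 of rhx_gis and the variables string joined with a colon, sent in the X-Instagram-GIS request header. Treat the sketch below as an assumption along those lines rather than a confirmed spec:

```python
import hashlib
import json

def sign_request(rhx_gis, variables):
    """Compute the assumed X-Instagram-GIS header value for a GraphQL request."""
    return hashlib.md5(f"{rhx_gis}:{variables}".encode("utf-8")).hexdigest()

# "extracted_rhx_gis_value" is a placeholder for the value pulled from sharedData.
variables = json.dumps({"id": "25025320", "first": 12}, separators=(",", ":"))
headers = {"X-Instagram-GIS": sign_request("extracted_rhx_gis_value", variables)}
print(headers)
```

If Instagram changes the signing scheme again, this helper is the only place that would need updating.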

Again, save the configuration and run the digger. Wait for completion and check the log. You should see the following structure for the loaded JSON with the next 12 records:

We shortened the source on purpose, removing the data of all records except the first one. Since the structure is the same for every record, we only need to see one record to build the CSS selectors. Now we can define the scraping logic for all the fields we need. We will also set a limit on the number of loads, say 10, and add a pause for less aggressive scraping. As a result, we get the final version of our Instagram scraper.
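
The final scraper itself is written in Diggernaut’s meta-language, but for reference, here is a rough stand-alone Python equivalent of the same loop (same page size, a load limit of 10, and a pause between requests), built on the assumptions from the earlier sketches:

```python
import hashlib
import json
import time
import requests
from urllib.parse import urlencode

QUERY_HASH = "df16f80848b2de5a3ca9495d781f98df"  # or extract from JS as shown above
CHANNEL_ID = "25025320"
RHX_GIS = "extracted_rhx_gis_value"              # placeholder, pulled from sharedData
MAX_LOADS = 10                                   # limit on the number of loads

session = requests.Session()
after = None

for _ in range(MAX_LOADS):
    vars_json = json.dumps(
        {"id": CHANNEL_ID, "first": 12, **({"after": after} if after else {})},
        separators=(",", ":"),
    )
    # Assumed signature recipe (see the note on rhx_gis above).
    gis = hashlib.md5(f"{RHX_GIS}:{vars_json}".encode("utf-8")).hexdigest()
    url = "https://www.instagram.com/graphql/query/?" + urlencode(
        {"query_hash": QUERY_HASH, "variables": vars_json}
    )
    data = session.get(url, headers={"X-Instagram-GIS": gis}).json()

    media = data["data"]["user"]["edge_owner_to_timeline_media"]
    for edge in media["edges"]:
        node = edge["node"]
        print(node.get("shortcode"), node.get("display_url"))

    page_info = media["page_info"]
    if not page_info["has_next_page"]:
        break                 # no more data available
    after = page_info["end_cursor"]
    time.sleep(2)             # pause for less aggressive scraping
```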

We hope this article helps you learn the meta-language, and that you can now solve tasks that require scraping pages with infinite scroll without problems.

Happy Scraping!

Co-founder of the cloud-based web scraping and data extraction platform Diggernaut

25 comments

    • Mikhail Sisin

      Thank you for reporting it; we are going to check shortly and update the article. I will also add a notice.

      The article has been updated to fit the new JS structure; query_hash should now be extracted properly.

  • Nadia

    Hi Mikhail, Instagram has changed their JS structure again. It also requires x-instagram-gis now. Is it still possible to scrape Instagram like this?

    • Mikhail Sisin

      Hi Nadia, thanks for commenting on it. We will check it today and update the article.

      The article has been updated to cover the recent changes; later we will try to come up with a more universal solution for query hash extraction, as they seem to change the JS structure pretty frequently.
      Thanks again for pointing it out 🙂

  • nahkampf

    This works like a charm! But say I have a list of 15 Instagram accounts to scrape: do I have to make a scraper for each and every one, or is there some way to make this scraper loop through a list of accounts?

    • Mikhail Sisin

      You can use an iterator to scrape multiple accounts; in this case, the beginning of your config will look like this:

  • Yosi

    Great Tutorial!
    I have a question: does the query_hash change sometimes? If so, how often does it happen: every day? Week? Month?

    Thanks again!

    • Mikhail Sisin

      Yes, it does change, so extracting it from the JS is a better idea than hardcoding it in the config. However, there is another issue: when they change the JS structure, query_hash extraction may break. We are hoping to come up with a more universal solution at some point. I cannot say how often they change query_hash; to me it looks completely random.

  • André

    Hi,

    Sorry, I was wrong. It still works the same way. Please don’t approve my last comment, so as not to confuse anybody.

    Thank you for this great article, this is exactly what I was looking for!

    Cheers! 🙂

  • Martin

    Hi, does this still work?
    “Let’s run a small experiment, take the URL with the parameters that was used to load the data:”

    results in a “not authorized” page. Do they check cookies now, maybe?

  • Almaz

    Hi there!
    Sorry if I’m asking a stupid question; I just can’t quite figure this out.
    As I understand it, this is not an official way to scrape, and this is not the Instagram API for developers.
    So will the API request limitations apply to this kind of scraping method?
    Or are there no limitations, so I can make endless requests and pull the data?
    Thanks in advance!

  • Amit Moondra

    Great tutorial.
    It seems that the graphql/query request doesn’t work anymore:
    “https://www.instagram.com/graphql/query/?query_hash=3f1ec7fcdad5fb10359a6b14054721d3&variables=%7B%22tag_name%22%3A%22cat%22%2C%22include_reel%22%3Afalse%2C%22include_logged_out%22%3Atrue%7D”

    • Mikhail Sisin

      Sorry for the delay; we checked the revision of the digger configuration posted in this article and it seems to work fine. You can no longer load this query request in the browser because it now requires a special parameter to be sent with the request headers. But the query works perfectly in the digger.
