Learning how to build a web scraper when your source is an RSS feed

As you know, a lot of sites use RSS feeds to distribute content. Many news websites publish RSS, almost every blog has an RSS feed enabled, and plenty of other services use RSS feeds too. The nice thing about this kind of source is that the same scraper you build can be used with RSS on any website, because the format is standardized and has the same layout everywhere. An RSS feed is a very simple case, but since you have just started working with our meta-language, we have to start with simple things. So we will use Google News as the source and teach you to create scrapers yourself.

We will use Google Chrome as our main tool for working with the website. To start, we recommend installing an extension for Google Chrome: Quick Javascript Switcher. It lets you quickly turn Javascript off and on for specific websites. This is a very handy tool, as it shows you how the data gets onto the page: rendered on the server side or with Javascript (the data may be embedded in the JS on the page, sit in a hidden block that is made visible by JS, or be fetched with an additional XHR (Ajax) request). When we work with an RSS feed this extension is not necessary, because RSS doesn't use JS, but it is very useful when you work with HTML pages.

Let's open the top news page, https://news.google.com/news/?ned=us&gl=US&hl=en, in the browser, press Ctrl+F to open the find bar, and search for the word RSS. We find the link we are looking for in the footer of the page.

[Image: searching for the RSS feed link]

You won't always find the RSS feed that easily; sometimes you need to look in the page source. For example, on our blog you have to open the page source in your browser and search for RSS:

[Image: searching for the RSS feed in the page source]

So now we need to grab the link to the RSS feed we found on the page. To do that, right-click on "RSS" and select the "Copy link address" option. This copies the feed URL to your clipboard.

[Image: copying the RSS feed URL]

Now paste (Ctrl + V) the URL from the clipboard into the address bar of the browser and load the RSS feed. For your reference, the URL of the Google Top News RSS feed is: https://news.google.com/news/rss/?ned=us&gl=US&hl=en. Once you open it, you will see the XML source code of the RSS feed. As you can see, it has some header data blocks; we pick the feed title from there and put it into our dataset, so we can easily identify the source of each record in the future. This is useful if you are gathering data from many sources. You can also find many "item" blocks, which contain the news entries. That is the data we actually want to scrape, so our scraper needs to iterate over these "item" blocks, extract data from them, and write it as record fields into our dataset.

[Image: XML source of the RSS feed]
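For reference, a trimmed skeleton of what that XML looks like (RSS 2.0 structure; the element values here are placeholders, and the real feed carries more elements and attributes):

```xml
<rss version="2.0">
  <channel>
    <title>Top Stories - Google News</title>
    <link>https://news.google.com/news/?ned=us&amp;gl=US&amp;hl=en</link>
    <item>
      <title>Example headline</title>
      <link>https://example.com/article</link>
      <category>Top Stories</category>
      <pubDate>Mon, 01 Jan 2018 12:00:00 GMT</pubDate>
      <description>...escaped HTML with related coverage...</description>
    </item>
    <!-- more <item> blocks follow -->
  </channel>
</rss>
```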

In Diggernaut you traverse the DOM using CSS selectors. Once you load the page, you are at the root of the loaded document, so we need to build our selectors from the current block (the root). To get the title of the RSS feed, you extract the text from the title tag, which is a child of the channel element, so the CSS selector for this field can be channel > title. Now let's define the selector for the news items. Each item is enclosed in an item tag, which is also a child of the channel element, so again the CSS selector can be channel > item.
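To make these selectors concrete outside of Diggernaut, here is a minimal Python sketch of the same idea. This is a hedged illustration, not Diggernaut's meta-language; it assumes the third-party requests, beautifulsoup4, and lxml packages are installed, and the variable names are ours:

```python
import requests
from bs4 import BeautifulSoup

FEED_URL = "https://news.google.com/news/rss/?ned=us&gl=US&hl=en"

# Fetch the feed and parse it as XML so tag names keep their case (pubDate, etc.)
response = requests.get(FEED_URL)
soup = BeautifulSoup(response.content, "xml")

# channel > title: the feed title lives in the channel header block
feed_title = soup.select_one("channel > title").get_text(strip=True)

# channel > item: each news entry is an <item> child of <channel>
items = soup.select("channel > item")
print(feed_title, len(items))
```

Parsing as XML rather than HTML matters here: HTML parsers lowercase tag names, which would break the case-sensitive pubDate tag we need below.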

Since each news item has a specific structure and a few fields we would like to extract, let's take a look at it.

[Image: news item structure]

When we iterate over the items, the current block will be an item, so our selectors should be relative to the current block. We can now define the news item selectors (a code sketch follows the list):

  1. Headline: title
  2. URL of the source: link
  3. Category of the news item: category
  4. Date of publication: pubDate
  5. Description: description
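Continuing the Python sketch from above, the relative selectors are applied inside each item block. The text_of helper is our own convenience, guarding against items that lack a tag such as category:

```python
def text_of(block, selector):
    """Return the text of the first match inside a block, or an empty string."""
    node = block.select_one(selector)
    return node.get_text(strip=True) if node else ""

# Relative selectors, applied per <item> block
for item in items:
    record = {
        "feed_title": feed_title,          # taken from the channel header
        "title": text_of(item, "title"),
        "link": text_of(item, "link"),
        "category": text_of(item, "category"),
        "pubDate": text_of(item, "pubDate"),
        "description": text_of(item, "description"),  # holds escaped HTML
    }
    # (record is extended with nested data in the next sketch)
```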

That's it for basic RSS. Sometimes RSS items include media elements that could be scraped as well, but that is not the case with Google News. Also, in this particular case we see that the description contains HTML with links to news relevant to the event. We can extract them and use them as nested objects. If you look into the first item, you will see HTML source like the following in the description (prettified to make it a bit more readable):
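The exact markup changes from item to item, so the skeleton below is only an illustration, reconstructed from the selectors discussed next (all URLs and texts are placeholders):

```html
<table border="0" cellpadding="2" cellspacing="7">
  <tr>
    <td>
      <img src="https://example.com/thumbnail.jpg">
    </td>
    <td>
      <ol>
        <strong><li><a href="https://example.com/main-story">Main headline</a> <font>Source Name</font></li></strong>
        <li><a href="https://example.com/related-1">Related headline</a> <font>Other Source</font></li>
        <li><a href="https://example.com/related-2">One more headline</a> <font>Third Source</font></li>
      </ol>
    </td>
  </tr>
</table>
```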

To traverse this HTML DOM, we first need to parse it as HTML and then turn the register content into a block. The digger then switches to the context of this block, so we end up at the root of this HTML structure and can easily navigate it. So let's define CSS selectors for the elements we are going to extract data from for our dataset. There is an image we could probably use, and since there is just one image, we will keep it in the main news item object. As for relevant news, there may be many, so we will save them into the main object as an array of nested data objects. Each such nested object will have: the URL of the source, the headline, and the source name.

Navigating to the image element is easy. We assume the table has only one row (tr). If not, you may want to either pick just the first row or iterate over the rows; in the latter case you may need to store each such record as an element of an array of nested objects (the same way we are going to handle relevant news records). So in this particular case the CSS selector for the image will be table tr > td > img.

Now, since we are going to store relevant news as nested data objects, we first want to navigate to the root element of each relevant news entry. As we can see, the root element is li, but we also see that the first li is enclosed in a strong tag, so an ol > li selector will not work. We can omit the >, which tells the selector that li does not have to be a direct child of ol. So our CSS selector may look like: table tr > td > ol li.

Once we get to a relevant news item element, we need to extract the URL, the headline, and the source. Again, we use relative selectors (see the sketch after this list):

  1. URL of the news: a
  2. Headline: a (yes, it is the same selector as for the URL, but for the URL we extract the href attribute, while for the headline we extract the text)
  3. Source: font
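In the Python sketch, this block switch corresponds to parsing the description HTML into its own tree; the selector strings come straight from the text above. This fragment runs inside the per-item loop from the earlier sketch:

```python
# Parse the escaped HTML held in <description> into its own block
desc = BeautifulSoup(record["description"], "html.parser")

# table tr > td > img: the single image of the news item
img = desc.select_one("table tr > td > img")
record["image"] = img["src"] if img else ""

# table tr > td > ol li: one entry per relevant news record
record["related"] = []
for li in desc.select("table tr > td > ol li"):
    link = li.select_one("a")
    record["related"].append({
        "url": link["href"] if link else "",   # href, not src, for anchors
        "title": link.get_text(strip=True) if link else "",
        "source": text_of(li, "font"),
    })
```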

It seems we have defined all the data we need to extract and decided what resulting data structure we will have, so now we can build our scraper:
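The digger configuration itself is written in Diggernaut's YAML meta-language. As a hedged, language-neutral stand-in that follows the same plan, the sketches above combine into one runnable Python script (again an illustration, not the digger configuration; requests, beautifulsoup4, and lxml are assumed installed):

```python
import json

import requests
from bs4 import BeautifulSoup

FEED_URL = "https://news.google.com/news/rss/?ned=us&gl=US&hl=en"


def text_of(block, selector):
    """Return the text of the first match inside a block, or an empty string."""
    node = block.select_one(selector)
    return node.get_text(strip=True) if node else ""


def scrape_feed(url):
    # Load the feed; the "xml" parser keeps tag case (pubDate) intact
    soup = BeautifulSoup(requests.get(url).content, "xml")
    feed_title = text_of(soup, "channel > title")

    records = []
    for item in soup.select("channel > item"):
        record = {
            "feed_title": feed_title,
            "title": text_of(item, "title"),
            "link": text_of(item, "link"),
            "category": text_of(item, "category"),
            "pubDate": text_of(item, "pubDate"),
        }

        # The description holds HTML: parse it into its own block
        desc = BeautifulSoup(text_of(item, "description"), "html.parser")
        img = desc.select_one("table tr > td > img")
        record["image"] = img["src"] if img else ""

        # Relevant news entries become an array of nested objects
        record["related"] = []
        for li in desc.select("table tr > td > ol li"):
            link = li.select_one("a")
            record["related"].append({
                "url": link["href"] if link else "",
                "title": link.get_text(strip=True) if link else "",
                "source": text_of(li, "font"),
            })
        records.append(record)
    return records


if __name__ == "__main__":
    print(json.dumps(scrape_feed(FEED_URL), indent=2))
```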

You need to create a new digger on the Diggernaut platform, copy and paste this digger configuration into it, and run it. We hope this material was useful and helped you study our meta-language. As a result, you should get a dataset like the one below:
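The concrete values depend on the day's news, but each record in the dataset has roughly this shape (all values below are illustrative):

```json
{
  "feed_title": "Top Stories - Google News",
  "title": "Example headline",
  "link": "https://example.com/article",
  "category": "Top Stories",
  "pubDate": "Mon, 01 Jan 2018 12:00:00 GMT",
  "image": "https://example.com/thumbnail.jpg",
  "related": [
    {
      "url": "https://example.com/related-1",
      "title": "Related headline",
      "source": "Other Source"
    }
  ]
}
```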

Happy scraping!

Co-founder of the cloud-based web scraping and data extraction platform Diggernaut
