As you already know, we use YAML as the markup language for digger configurations. We chose it because it uses indentation to separate scopes and logical blocks, which makes it easier to read than, for example, JSON or XML. You can also write YAML faster than either of those formats. However, if you are not yet familiar with YAML, it is easy to make a markup mistake, and your digger will then return an error. While you are getting started, you can check the correctness of your code with the Linter.
We already mentioned that you do not need to know any other programming language, because Diggernaut uses its own meta-language, which is simple enough to master in a couple of evenings. In return, you will be able to scrape websites of any complexity and get the data in the format you need.
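To make the point about indentation concrete, here is a small sketch of how indentation decides which block a setting belongs to. The key names are borrowed from the configuration example later in this section, and the indentation width is an assumption. The mistakes shown are general YAML behaviour, not a description of how a digger reports them.

```yaml
# Correct: debug and agent are indented under config, so they belong to it.
- config:
    debug: 2
    agent: Firefox

# Mistake 1 (kept as a comment so this block stays valid YAML):
# debug sits at the same level as config, so it is read as a sibling key
# and config itself ends up empty.
# - config:
#   debug: 2

# Mistake 2: a tab character used for indentation. YAML does not allow tabs
# for indentation, so a parser rejects the file. Use spaces only.
```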
Source website research
Structure of digger configurations
After you have analyzed the site and worked out where the data you need is located and how to reach it, you can proceed to creating the configuration.
The configuration is divided into three large logical parts. The first contains the initial digger settings, the second initializes the iterators, and the third contains the main logic of your scraper: the sequence of commands that the digger executes when you run it. The first two parts are optional and can be used as needed, but the configuration must always have the third part; otherwise the digger simply does nothing, because there are no commands to execute.
The fragment below shows a configuration containing all three parts described above.
---
# FIRST BLOCK - INITIAL DIGGER SETTINGS
- config:
    debug: 2
    agent: Firefox
    proxy: 18.104.22.168:80
# SECOND BLOCK - DATES ITERATOR INITIALIZATION
- iterator:
    - type: date
      interval: 50
      period: 100
      template: '%B %d %Y'
# THIRD BLOCK - MAIN LOGIC OF THE WEBSITE SCRAPER
- do:
    - walk:
        to: http://www.com
        do:
            - find:
                path: td.ages
The first block begins with the keyword config and contains the initial settings: the debug level, the user-agent identifier used to pose as a real browser, and a directive to send requests through a custom proxy server. You can read more about these and other settings in the Basic settings section.
The second block, opened by the iterator command, initializes a date iterator. There are several types of iterators, and you will learn more about them in the corresponding section.
The third configuration block, opened by the do command, contains the main logic of the website scraper. In this particular example it goes to http://www.com and finds the element (or elements) matching the CSS path td.ages. You will find a tutorial for all available commands in this section of the documentation.
All three blocks are located at the root of the configuration (they begin without indentation).
Later we will look at these blocks in detail, but first let's get acquainted with runtime entities, because without this basic knowledge you will find it harder to understand how a digger works.
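Because the first two parts are optional, the smallest working configuration would contain only the do block. The sketch below illustrates that, reusing the walk and find commands, the placeholder URL, and the CSS path from the example above. The indentation width is an assumption, and a real digger would normally need further commands to extract and save data.

```yaml
---
# Only the mandatory part: the main logic of the scraper.
# The config and iterator blocks are omitted because they are optional.
- do:
    - walk:
        to: http://www.com      # placeholder URL taken from the example above
        do:
            - find:
                path: td.ages   # CSS path taken from the example above
```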