Methods for Navigation

Navigation is used to load pages, documents, and files from a website, as well as to traverse the DOM structure of the loaded document.

Walk

The walk method is used to load pages and other documents (JSON, JS, iCal, XML, images) from various web resources. If the downloaded file is in a format other than HTML or XML, the digger automatically converts the content of the resource into XML. This way, you can use the same approach to extract data from heterogeneous resources.

The main points of the walk method:

  1. The method can be called from any context
  2. It can work with the contents of the register and use the values of arguments and variables for substitution
  3. The execution of the block logic can be looped until a certain condition is met
  4. It can iterate over a link pool
  5. Custom request headers can be used
  6. The method can make GET and POST requests
  7. If a page or document is loaded successfully, the digger enters the page context and works with the downloaded content

Parameters that you can use in the walk method:

Parameter Description
to The value that defines which request the digger should make. If the value is a literal, a GET request is executed; if it is a dictionary, a POST request is made. As a literal, you can use the URL of the resource the digger should download; variables and arguments can be used in the URL. If you want to use a URL value from the register, use the reserved word value. And if you want the digger to iterate over a link pool, use the reserved word links. To make a POST request, build a dictionary with the fields described below and use this dictionary as the value of the to parameter.
headers A dictionary where you can include any headers that will be sent to the server with the request. You can use any standard and non-standard headers, except user-agent. The User-Agent header is populated with the value you define in the config section of the digger's configuration.
mode Enables a mode in which only unique URLs (across all digger sessions) are loaded. To enable it, set the value of this parameter to unique. In this mode, the digger caches all downloaded URLs in the database, and the next time a URL is accessed, it checks the database to see whether this URL has been fetched before. If so, the URL is skipped. In some cases, this mode helps to save on the resources (page requests) you pay for.
pool The name of the link pool. Used only if the reserved word links is used as the value of the to parameter. If this parameter is omitted, the digger uses the default pool.
repeat A special flag that loops the block of the walk command while the value of this flag is equal to "yes". In practice, a variable is used as the value of this flag and is initially set to "yes". During execution of the loop, when the digger meets some condition, it changes the variable to another value; the digger then breaks out of the loop and continues executing the code after the walk block.
repeat_in_pool Works the same as repeat, but for a link pool.
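For instance, the unique mode described above can be enabled by adding a single parameter to the walk command. The following is a minimal sketch; the URL and CSS selector are hypothetical placeholders:

```yaml
---
do:
# LOAD THE PAGE ONLY IF ITS URL HAS NOT BEEN FETCHED IN ANY PREVIOUS SESSION
- walk:
    to: http://www.somesite.com/
    mode: unique
    do:
    - find:
        path: .somepath
        do:
```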

GET

The following are examples of GET requests with some parameters:

---
do:
# LOADING PAGE LOCATED AT SPECIFIED URL AND RUN LOGIC INSIDE THE `walk` BLOCK FOR ITS CONTENT
- walk:
    to: http://www.somesite.com/
    do:
    # FIND ALL LINKS OF THIS PAGE
    - find:
        path: a
        do:
        # PUT VALUE OF `href` ATTRIBUTE TO THE REGISTER
        - parse:
            attr: href
        # LOAD PAGE WITH URL WE HAVE IN REGISTER
        - walk:
            to: value
            do:
              
---
do:
# ADD URLS OF PAGES TO THE LINK POOL NAMED `somepool`
- link_add:
    pool: somepool
    url:
    - http://www.somesite.com/page-1/
    - http://www.somesite.com/page-2/
    - http://www.somesite.com/page-3/
# ITERATING OVER POOL (OVER URLS ONE BY ONE)
# FOR EACH URL WE RUN LOGIC INSIDE `walk` BLOCK
- walk:
    to: links
    pool: somepool
    do: 
    - find:
        path: .somepath
        do:
              
---
do:
# DECLARE VARIABLE `repeatable` AND SET IT TO `yes`
- variable_set:
    field: repeatable
    value: 'yes'
# LET'S IMAGINE THAT THE WEBSITE WE ARE SCRAPING IS NOT STABLE
# AND SOMETIMES DOESN'T RETURN A PROPER PAGE, OR IS JUST UNAVAILABLE
# LET'S PUT THE `walk` COMMAND INTO A LOOP USING THE VARIABLE `repeatable`
# THE `walk` COMMAND WILL BE REPEATED UNTIL THE CSS PATH `.somepath`
# IS FOUND ON THE LOADED PAGE
- walk:
    repeat: <%repeatable%>
    to: http://www.somesite.com/
    do:
    - find:
        path: .somepath
        do:
        # CSS PATH IS FOUND, LET'S CLEAR THE VARIABLE TO STOP LOOPING THE `walk` COMMAND
        - variable_clear: repeatable
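The repeat_in_pool flag works analogously while the digger iterates over a link pool. The following is a hedged sketch combining the pool iteration and looping patterns shown above; the pool name, selector, and exact looping semantics are assumptions based on the parameter descriptions:

```yaml
---
do:
# DECLARE VARIABLE `repeatable` AND SET IT TO `yes`
- variable_set:
    field: repeatable
    value: 'yes'
# ITERATE OVER THE POOL IN A LOOP WHILE `repeatable` EQUALS `yes`
- walk:
    repeat_in_pool: <%repeatable%>
    to: links
    pool: somepool
    do:
    - find:
        path: .somepath
        do:
        # CONDITION MET, CLEAR THE VARIABLE TO STOP THE LOOP
        - variable_clear: repeatable
```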
            
---
do:
# LOAD PAGE LOCATED AT GIVEN URL WITH COMMAND `walk`
- walk:
    to: http://www.somesite.com/
    # WE ARE GOING TO SEND SOME HEADERS WITH PAGE REQUEST
    headers:
        Cookie: JSESSIONID=1234123412321; OTHERCOOKIE=<%somevar%>;
        Accept: text/xml
    do:
    - find:
        path: .somepath
        do:
              

POST

To make a POST request, you need to use a specially formed dictionary in the to parameter:

Parameter Description
post The URL of the web resource where your POST request, with data encoded as X-WWW-FORM-URLENCODED, should be sent.
json The URL of the web resource where your POST request, with data encoded as APPLICATION/JSON, should be sent.
headers A dictionary where you can include any headers that will be sent to the server with the request. You can use any standard and non-standard headers, except user-agent. The User-Agent header is populated with the value you define in the config section of the digger's configuration. Attention: for POST requests, headers should be placed in the to scope, not in the root walk scope as for GET requests.
data A dictionary with all the fields/values of the query that should be sent with the request. Variables and arguments can be used in field names and values to substitute data. The maximum nesting level of the dictionary is 2. If your data in JSON format needs a deeper level of nesting, use the payload parameter.
payload A string in JSON format, which is passed instead of the data parameter for APPLICATION/JSON queries.
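As noted above, for POST requests the headers dictionary goes inside the to scope. A minimal sketch of this placement; the URL, cookie, and form field values are hypothetical placeholders:

```yaml
---
do:
- walk:
    to:
        post: http://www.somesite.com/login
        # FOR POST REQUESTS, HEADERS LIVE INSIDE THE `to` SCOPE
        headers:
            Cookie: JSESSIONID=1234123412321;
        data:
            user: somename
    do:
```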

A few examples of POST requests:

---
config:
  debug: 2
do:
- walk:
    to:
        post: https://mockbin.org/request
        data:
            fizz: buzz
    do:
            
Time Level Message
2017-10-23 22:02:30:452 info Scrape is done
2017-10-23 22:02:30:436 debug Page content: <html><head></head><body><body_safe> <bodysize>9</bodysize> <clientipaddress>1.1.1.1</clientipaddress> <cookies></cookies> <headers> <accept-encoding>gzip</accept-encoding> <cf-connecting-ip>1.1.1.1</cf-connecting-ip> <cf-visitor>{&#34;scheme&#34;:&#34;https&#34;}</cf-visitor> <connect-time>2</connect-time> <connection>close</connection> <content-length>9</content-length> <content-type>application/x-www-form-urlencoded</content-type> <host>mockbin.org</host> <total-route-time>0</total-route-time> <user-agent>Surf/1.0 (Linux 3.19.0-65-generic; go1.9)</user-agent> <via>1.1 vegur</via> <x-forwarded-for>1.1.1.1, 1.1.1.1</x-forwarded-for> <x-forwarded-port>80</x-forwarded-port> <x-forwarded-proto>http</x-forwarded-proto> <x-request-start>1508785350353</x-request-start> </headers> <headerssize>556</headerssize> <httpversion>HTTP/1.1</httpversion>
<method>POST</method>
<postdata>
<mimetype>application/x-www-form-urlencoded</mimetype>
<params>
<fizz>buzz</fizz>
</params>
<text>fizz=buzz</text>
</postdata>
<querystring></querystring> <starteddatetime>2017-10-23T19:02:30.355Z</starteddatetime> <url>https://mockbin.org/request</url> </body_safe></body></html>
2017-10-23 22:02:29:405 info Retrieving page (POST): https://mockbin.org/request
2017-10-23 22:02:29:398 info Starting scrape
2017-10-23 22:02:29:382 debug Setting up default proxy
2017-10-23 22:02:29:367 debug Setting up surf
2017-10-23 22:02:29:336 info Starting digger: meta-lang-post-x-www [1862]

Note that since the mockbin.org server sends the response in JSON format, the digger converted the response to XML.
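Because the JSON response is converted to XML, you can work with it using the same DOM navigation as with HTML pages. For example, the fizz field visible in the log above could be located like this; a sketch, where the path assumes the converted structure shown in the log:

```yaml
---
do:
- walk:
    to:
        post: https://mockbin.org/request
        data:
            fizz: buzz
    do:
    # THE JSON RESPONSE IS NOW XML, SO ELEMENT NAMES CAN BE USED AS PATHS
    - find:
        path: fizz
        do:
```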

---
config:
    debug: 2
do:
# LET'S INITIALIZE A COUPLE OF VARIABLES
- variable_set:
    field: field_name
    value: age
- variable_set:
    field: field_value
    value: 25
- walk:
    to:
        json: https://mockbin.org/request
        data:
            fizz: buzz
            <%field_name%>: <%field_value%>
    do:
              
Time Level Message
2017-10-24 01:31:08:538 info Scrape is done
2017-10-24 01:31:08:523 debug Page content: <html><head></head><body><body_safe> <bodysize>26</bodysize> <clientipaddress>1.1.1.1</clientipaddress> <cookies></cookies> <headers> <accept-encoding>gzip</accept-encoding> <cf-connecting-ip>1.1.1.1</cf-connecting-ip> <cf-visitor>{&#34;scheme&#34;:&#34;https&#34;}</cf-visitor> <connect-time>1</connect-time> <connection>close</connection> <content-length>26</content-length> <content-type>application/json</content-type> <host>mockbin.org</host> <total-route-time>0</total-route-time> <user-agent>Surf/1.0 (Linux 3.19.0-65-generic; go1.9)</user-agent> <via>1.1 vegur</via> <x-forwarded-for>1.1.1.1, 1.1.1.1</x-forwarded-for> <x-forwarded-port>80</x-forwarded-port> <x-forwarded-proto>http</x-forwarded-proto> <x-request-start>1508797868503</x-request-start> </headers> <headerssize>539</headerssize> <httpversion>HTTP/1.1</httpversion>
<method>POST</method>
<postdata>
<mimetype>application/json</mimetype>
<params></params>
<text>{&#34;age&#34;:&#34;25&#34;,&#34;fizz&#34;:&#34;buzz&#34;}</text>
</postdata>
<querystring></querystring> <starteddatetime>2017-10-23T22:31:08.509Z</starteddatetime> <url>https://mockbin.org/request</url> </body_safe></body></html>
2017-10-24 01:31:08:052 info Retrieving page (POST/JSON): https://mockbin.org/request
2017-10-24 01:31:08:044 debug Variable field_value has been set to value: 25
2017-10-24 01:31:08:035 debug Variable field_name has been set to value: age
2017-10-24 01:31:08:028 info Starting scrape
2017-10-24 01:31:08:015 debug Setting up default proxy
2017-10-24 01:31:08:002 debug Setting up surf
2017-10-24 01:31:07:971 info Starting digger: meta-lang-post-json [1863]
---
config:
    debug: 2
do:
- variable_set:
    field: age
    value: 25
- walk:
    to:
        json: https://mockbin.org/request
        payload: '{"fizz":"buzz","age":"<%age%>"}'
    do:
              
Time Level Message
2017-10-24 02:00:06:387 info Scrape is done
2017-10-24 02:00:06:374 debug Page content: <html><head></head><body><body_safe> <bodysize>26</bodysize> <clientipaddress>1.1.1.1</clientipaddress> <cookies></cookies> <headers> <accept-encoding>gzip</accept-encoding> <cf-connecting-ip>1.1.1.1</cf-connecting-ip> <cf-visitor>{&#34;scheme&#34;:&#34;https&#34;}</cf-visitor> <connect-time>1</connect-time> <connection>close</connection> <content-length>26</content-length> <content-type>application/json</content-type> <host>mockbin.org</host> <total-route-time>0</total-route-time> <user-agent>Surf/1.0 (Linux 3.19.0-65-generic; go1.9)</user-agent> <via>1.1 vegur</via> <x-forwarded-for>1.1.1.1, 1.1.1.1</x-forwarded-for> <x-forwarded-port>80</x-forwarded-port> <x-forwarded-proto>http</x-forwarded-proto> <x-request-start>1508799606293</x-request-start> </headers> <headerssize>540</headerssize> <httpversion>HTTP/1.1</httpversion>
<method>POST</method>
<postdata>
<mimetype>application/json</mimetype>
<params></params>
<text>{&#34;fizz&#34;:&#34;buzz&#34;,&#34;age&#34;:&#34;25&#34;}</text>
</postdata>
<querystring></querystring> <starteddatetime>2017-10-23T23:00:06.298Z</starteddatetime> <url>https://mockbin.org/request</url> </body_safe></body></html>
2017-10-24 02:00:05:098 info Retrieving page (POST/JSON): https://mockbin.org/request
2017-10-24 02:00:05:089 debug Variable age has been set to value: 25
2017-10-24 02:00:05:081 info Starting scrape
2017-10-24 02:00:05:069 debug Setting up default proxy
2017-10-24 02:00:05:062 debug Setting up surf
2017-10-24 02:00:05:035 info Starting digger: meta-lang-post-payload [1864]

In the next part, we'll look at the find method, which is used to navigate the DOM structure of the loaded document.