Methods for Working with the Register

Parse

The parse command extracts data from the current block and writes it to the register. You can then modify or normalize that data and save it to variables or object fields.

Please note:
The parse command always writes a value to the register; if the block contains nothing to extract, it writes an empty string. The command can return the text or HTML content of the block, or the value of an attribute of the current HTML element (the root element of the block).

Some examples of using the command:

- find:
    path: a.somepath
    do:
    # PARSE TEXT CONTENT OF THE CURRENT BLOCK
    - parse

    # PARSE `href` ATTRIBUTE OF THE CURRENT ELEMENT `a`
    - parse:
        attr: href

    # PARSE HTML CONTENT OF THE CURRENT BLOCK
    - parse:
        format: html
          
# USING FILTERS (REGULAR EXPRESSIONS) TO EXTRACT ONLY THE REQUIRED PARTS OF THE DATA
- find:
    path: .somepath
    do:
    - parse:
        filter:
        - .+=(\d+)
        - (.+)

- find:
    path: .somepath
    do:
    - parse:
        attr: class
        filter:
        - .+=(\d+)
        - (.+)
          

The parse command supports the following parameters:

| Parameter | Description |
| --- | --- |
| format | The format of the extracted data: text or html. Defaults to text if the parameter is omitted. |
| attr | If set, the value of the given attribute of the current block's root tag is extracted. If omitted, the entire block content is extracted. |
| filter | One or more regular expressions used to extract only certain data from the content. The data to extract must be enclosed in parentheses in the regular expression (that is, defined as a group). If more than one regular expression is specified, they are tried in order until one of them matches the content. |
| joinby | The string used to join the groups found by the filter. If omitted and the filter finds several groups, they are joined with an empty string. |
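
The parameters can be combined in a single parse call. The following is a minimal sketch, not taken from the original examples: it assumes an `a.somepath` element with an absolute URL in its `href`, and the regular expression is purely illustrative. According to the descriptions above, the two captured groups (scheme and host) would be joined with the `joinby` string.

```yaml
# ILLUSTRATIVE SKETCH: THE SELECTOR AND THE REGULAR EXPRESSION ARE ASSUMPTIONS
- find:
    path: a.somepath
    do:
    - parse:
        # READ THE `href` ATTRIBUTE INSTEAD OF THE BLOCK CONTENT
        attr: href
        filter:
        # GROUP 1 CAPTURES THE SCHEME, GROUP 2 CAPTURES THE HOST
        - (https?)://([^/]+)
        # THE CAPTURED GROUPS ARE JOINED WITH A SINGLE SPACE
        joinby: ' '
```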

Now let's look at more detailed examples of the parse command. As the source, we'll use the following fragment of an HTML document:

<ul class="list">
  <li class="list-item" id="1">Some text</li>
  <li class="list-item" id="item=2"><a href="http://somesite.com/">Link</a></li>
  <li class="list-item" id="item=3">Some other text</li>
</ul>
          

Usages:

# FIND `ul` AND FILL THE REGISTER WITH ITS TEXT CONTENT
- find:
    path: .list
    do:
    - parse
    # REGISTER CONTENT: "Some textLinkSome other text"
              
# FIND `a` TAGS AND WRITE THE `href` ATTRIBUTE VALUE TO THE REGISTER
- find:
    path: .list-item > a
    do:
    - parse:
        attr: href
    # REGISTER CONTENT: "http://somesite.com/"
                            
# FIND `ul` AND WRITE ITS HTML CONTENT TO THE REGISTER
- find:
    path: ul
    do:
    - parse:
        format: html
    # REGISTER CONTENT:
    # <li class="list-item" id="1">Some text</li>
    # <li class="list-item" id="item=2"><a href="http://somesite.com/">Link</a></li>
    # <li class="list-item" id="item=3">Some other text</li>
                
# SIMPLE TEXT FILTER
- find:
    path: .list-item
    do:
    - parse:
        filter:
        - Some\s*(\S+)\s*text
    # FOR THE FIRST TWO `li` ELEMENTS THE FILTER FINDS NOTHING AND RETURNS AN EMPTY STRING ("") TO THE REGISTER
    # BUT FOR THE THIRD ELEMENT THE FILTER RETURNS THE VALUE "other"

# LET'S FIND `li` AND EXTRACT ONLY THE DIGITS FROM THE `id` ATTRIBUTE
- find:
    path: .list-item
    do:
    - parse:
        # EXTRACTING DATA FROM ATTRIBUTE `id`
        attr: id
        filter:
            # IF THE FIRST FILTER EXTRACTS AT LEAST ONE GROUP,
            # FILTERING STOPS AND THE EXTRACTED DATA IS PLACED IN THE REGISTER
            # PLEASE NOTE: IF MULTIPLE GROUPS ARE FOUND, THEY WILL BE JOINED TOGETHER
            # IF OUR EXAMPLE CONTAINED THE FOLLOWING:
            # <li class="list-item" id="item=2sub=3"><a href="http://somesite.com/">Link</a></li>
            # YOU WOULD GET `23` IN THE REGISTER
            - .+=(\d+)
            # IF THE FIRST FILTER HAS NO MATCHES, THE SECOND FILTER IS USED, AND SO ON
            - (.+)

# FIND `li` AND FILL THE REGISTER WITH THE LETTERS FROM THE `class` ATTRIBUTE
- find:
    path: .list-item
    slice: 0
    do:
    - parse:
        # EXTRACTING THE VALUE OF THE `class` ATTRIBUTE
        attr: class
        # SELECT EACH LETTER OF THE VALUE AND JOIN ALL GROUPS WITH A COMMA
        filter: ([A-Za-z]{1})
        joinby: ','
    # NOW WE HAVE "l,i,s,t,i,t,e,m" IN THE REGISTER
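
The ordered fallback shown for the `id` attribute should work the same way on text content. The sketch below is an illustration built only from the parameters described above, applied to the same HTML fragment; the register values in the final comment follow from the documented rule that filters are tried in order until one matches, and should be treated as expected rather than verified output.

```yaml
# ILLUSTRATIVE SKETCH: FALL BACK TO THE WHOLE TEXT WHEN THE FIRST FILTER FINDS NOTHING
- find:
    path: .list-item
    do:
    - parse:
        filter:
        # MATCHES ONLY THE THIRD `li` ("Some other text")
        - Some\s*(\S+)\s*text
        # USED FOR THE FIRST TWO `li` ELEMENTS, WHERE THE FIRST FILTER HAS NO MATCHES
        - (.+)
    # EXPECTED REGISTER CONTENT FOR EACH `li` IN TURN: "Some text", "Link", "other"
```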
              

Next, we'll cover another command that can set the value of the register: register_set.