Methods for Working with the Register

Normalize

The normalize command is used to manipulate data in the register.

Please note:
The normalize command works only with the register.
It takes data from the register as input and writes the result back to the register. Some normalization modes require additional arguments. For example, the replace_substring mode requires a list of regular expressions to search for and the corresponding replacement values.

Example of usage:

- normalize:
    # SPECIFYING MODE WE ARE GOING TO USE
    routine: replace_substring
    # SUPPLY ADDITIONAL PARAMETERS IF REQUIRED
    args:
        - ^\s+|\s+$: ''
          

Currently the following modes (routines) are supported:

| Mode | Description |
|---|---|
| replace_substring | Searches the register for all occurrences of the given substring and replaces every match with the given value. The substring can be given as a regular expression. A single pair has the format substring_to_change: replacement_value, and pairs are passed in the args parameter. You can pass more than one pair; in that case prefer a list of pairs over a dictionary, because a dictionary does not guarantee the order in which pairs are processed, while a list is always processed in the order written. Search and replace is performed sequentially for each pair. |
| replace_matched | Searches the register for at least one occurrence of the given substring and, if it is found, replaces the whole value of the register with the specified one. The substring can be given as a regular expression. A single pair has the format substring_to_change: replacement_value, and pairs are passed in the args parameter. You can pass more than one pair; in that case prefer a list of pairs over a dictionary, because a dictionary does not guarantee the order in which pairs are processed, while a list is always processed in the order written. Pairs are tried sequentially until the first match. If nothing matches, the register value stays unchanged. |
| increment | Increases the value of the register by 1. The register must hold an integer value. |
| decrement | Decreases the value of the register by 1. The register must hold an integer value. |
| capitalize | Changes the first letter of each word in the register to a capital letter. Short connecting words (or, and, of, etc.) are the exception. |
| upper_first | Changes the first letter of the register content to a capital letter. |
| upper | Changes all letters of the register content to capital letters. |
| lower | Changes all letters of the register content to lowercase. |
| url | If the URL in the register is relative, normalizes it and makes it absolute. |
| escape_html | Converts characters that are not valid in HTML text (such as <, > and ") to HTML entities. |
| unescape_html | Converts all HTML entities to the corresponding characters. |
| urlencode | Encodes the register content so it can be used as a GET parameter in a URL. |
| json2xml | Converts register content from JSON format to XML format. |
| transit2xml | Converts register content from Transit+JSON format to XML format. Works similarly to json2xml. |
| base64 | Encodes register content to Base64. |
| base64zlib_decode | Decodes register content from Base64ZLIB format. |
| md5 | Calculates the MD5 hash of the register content. |
| signature | Calculates a signature of the register content using the given algorithm and cypher, then encodes it in the given encoding format. |
| date_format | Parses, modifies and reformats dates and times. |
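
The two ways of passing pairs to replace_substring and replace_matched can be sketched as follows. The patterns and replacement values here are made up purely for illustration; the first step uses a list, the second a dictionary.

```yaml
# LIST OF PAIRS: PROCESSED IN THE ORDER THEY ARE WRITTEN (PREFERRED)
- normalize:
    routine: replace_substring
    args:
        - foo: bar
        - bar: baz

# DICTIONARY OF PAIRS: PROCESSING ORDER IS NOT GUARANTEED
- normalize:
    routine: replace_substring
    args:
        foo: bar
        bar: baz
```

With the list, a register value of "foo" always ends up as "baz", because the second pair sees the output of the first. With the dictionary, the result depends on which pair happens to be processed first.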

Let's go through the modes. Suppose we have the following HTML source:

<div class="container">
    <ul class="list">
        <li class="item" id="li1">Text and text</li>
        <li class="item" id="li2">text and text</li>
        <li class="item" id="li3">01/21/2017</li>
        <script>
          var items = {"items":[{"item": {"somefield1": "text", "somefield2": "another text"}}]};
        </script>
    </ul>
    <a href="sandbox.html">Link</a>
</div>
          

The replace_substring and replace_matched modes work as follows:

# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # USE `parse` TO FILL THE REGISTER WITH TEXT CONTENT OF CURRENT BLOCK, REGISTER VALUE: "Text and text"
    - parse

    # DO SOME NORMALIZATION
    - normalize:
        routine: replace_substring
        args:
            # REPLACE THE WORD "text" WITH "some another text"
            - text: 'some another text'
            # REGISTER VALUE: "Text and some another text"

            # REPLACE THE WORD "and" WITH "or"
            - and: or
            # REGISTER VALUE: "Text or some another text"

            # REPLACE THE WORD "another" WITH "other"
            - another: other

    # REGISTER VALUE: "Text or some other text"
              
# PLEASE NOTE:
# `replace_matched`, UNLIKE `replace_substring`, STOPS AT THE FIRST MATCH

# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # PARSE TEXT TO THE REGISTER
    - parse

    # NORMALIZE DATA
    - normalize:
        routine: replace_matched
        args:
            # REPLACE THE WHOLE VALUE OF THE REGISTER WITH "some another text"
            # IF THE WORD "text" OCCURS IN THE REGISTER VALUE
            - text: 'some another text'
            # REGISTER VALUE: "some another text"

            # SINCE MATCH HAPPENED, ALL ARGUMENTS BELOW WILL BE IGNORED
            - and: or
            - another: other

    # REGISTER VALUE: "some another text"
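
Since the substring can be a regular expression, replace_matched can also be used to classify a value. The following is a minimal sketch, assuming the regular expression dialect supports `\d` and `{n}` quantifiers; the replacement labels "date" and "number" are made up for illustration.

```yaml
# FINDS THIRD `li`
- find:
    path: "li#li3"
    do:
    # PARSE TEXT TO THE REGISTER
    - parse
    # REGISTER VALUE: "01/21/2017"

    - normalize:
        routine: replace_matched
        args:
            # FIRST PAIR MATCHES, SO THE WHOLE REGISTER IS REPLACED
            - ^\d{2}/\d{2}/\d{4}$: date
            # NOT TRIED, BECAUSE A MATCH ALREADY HAPPENED
            - ^\d+$: number

    # EXPECTED REGISTER VALUE: "date"
```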
              

Examples for modes increment, decrement, capitalize, upper_first, upper and lower:

# TO USE THIS MODE YOU NEED TO HAVE AN INTEGER VALUE IN THE REGISTER (1, 2, 158, 203040523421 AND SO ON)
# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # GET ATTRIBUTE `id` VALUE AND EXTRACT JUST DIGITS
    - parse:
        attr: id
        filter: (\d+)
    # REGISTER VALUE: "1"

    # USE NORMALIZATION
    - normalize:
        routine: increment

    # REGISTER VALUE: "2"
              
# TO USE THIS MODE YOU NEED TO HAVE AN INTEGER VALUE IN THE REGISTER (1, 2, 158, 203040523421 AND SO ON)
# FINDS SECOND `li`
- find:
    path: "li#li2"
    do:
    # GET ATTRIBUTE `id` VALUE AND EXTRACT JUST DIGITS
    - parse:
        attr: id
        filter: (\d+)
    # REGISTER VALUE: "2"

    # DOING DECREMENT
    - normalize:
        routine: decrement

    # REGISTER VALUE: "1"
              
# CAPITALIZE THE STRING IN THE REGISTER
# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # PARSE TEXT TO THE REGISTER
    - parse
    # REGISTER VALUE: "Text and text"

    # APPLY CAPITALIZE ROUTINE
    - normalize:
        routine: capitalize

    # REGISTER VALUE: "Text and Text"
              
# CHANGE ONLY THE FIRST LETTER OF THE STRING IN THE REGISTER TO UPPERCASE
# FINDS SECOND `li`
- find:
    path: "li#li2"
    do:
    # PARSE TEXT CONTENT TO THE REGISTER
    - parse
    # REGISTER VALUE: "text and text"

    # NORMALIZE CONTENT OF THE REGISTER
    - normalize:
        routine: upper_first

    # REGISTER VALUE: "Text and text"
              
# CONVERT ALL LETTERS OF THE REGISTER CONTENT TO UPPER CASE
# FINDS SECOND `li`
- find:
    path: "li#li2"
    do:
    # PARSE TEXT TO THE REGISTER
    - parse
    # REGISTER VALUE: "text and text"

    # APPLY NORMALIZATION
    - normalize:
        routine: upper

    # REGISTER VALUE: "TEXT AND TEXT"
              
# CONVERT ALL LETTERS OF THE REGISTER CONTENT TO LOWER CASE
# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # PARSE TEXT
    - parse
    # REGISTER VALUE: "Text and text"

    # NORMALIZE REGISTER
    - normalize:
        routine: lower

    # REGISTER VALUE: "text and text"
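
Because normalize reads its input from the register and writes the result back, several modes can be applied one after another. A minimal sketch using only the modes shown above; the literal value set with register_set is made up for illustration.

```yaml
# PUT A LITERAL VALUE INTO THE REGISTER
- register_set: '  TEXT AND TEXT  '

# STRIP LEADING AND TRAILING WHITESPACE
- normalize:
    routine: replace_substring
    args:
        - ^\s+|\s+$: ''
# REGISTER VALUE: "TEXT AND TEXT"

# CONVERT EVERYTHING TO LOWER CASE
- normalize:
    routine: lower
# REGISTER VALUE: "text and text"

# CAPITALIZE ONLY THE FIRST LETTER
- normalize:
    routine: upper_first
# REGISTER VALUE: "Text and text"
```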
              

Modes url, escape_html and unescape_html:

# CHANGING RELATIVE LINKS TO ABSOLUTE
# FINDS `a`
- find:
    path: a
    do:
    # EXTRACT VALUE OF ATTRIBUTE `href`
    - parse:
        attr: href
    # REGISTER VALUE: sandbox.html

    # LETS NORMALIZE URL
    - normalize:
        routine: url

    # REGISTER VALUE: https://www.diggernaut.com/sandbox.html
              
# CONVERTING SYMBOLS LIKE <, >, " TO HTML ENTITIES
# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
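    # PARSE HTML CONTENT FROM CURRENT BLOCK
    - parse:
        format: html
    # REGISTER VALUE: <li class="item" id="li1">Text and text</li>

    # APPLY NORMALIZATION
    - normalize:
        routine: escape_html

    # REGISTER VALUE: &lt;li class=&quot;item&quot; id=&quot;li1&quot;&gt;Text and text&lt;/li&gt;
    # (THE DOUBLE QUOTE MAY BE WRITTEN AS &#34; INSTEAD OF &quot;)

# REPLACE ALL HTML ENTITIES WITH THE CORRESPONDING CHARACTERS LIKE <, >, " ETC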
# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # PUT A LITERAL VALUE WITH HTML ENTITIES INTO THE REGISTER
    - register_set: '&lt;li class=&quot;item&quot; id=&quot;li1&quot;&gt;Text and text&lt;/li&gt;'

    # NORMALIZE REGISTER VALUE
    - normalize:
        routine: unescape_html

    # REGISTER VALUE: <li class="item" id="li1">Text and text</li>
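
The urlencode mode has no example on this page, so here is a minimal sketch of how it might be used. The literal value is made up, and the exact encoded form (for instance, whether a space becomes `+` or `%20`) is not stated here, so no result is shown.

```yaml
# PUT A VALUE WITH SPACES AND SPECIAL CHARACTERS INTO THE REGISTER
- register_set: 'text & more text'

# ENCODE IT FOR USE AS A GET PARAMETER
- normalize:
    routine: urlencode

# REGISTER NOW HOLDS THE URL-ENCODED FORM OF THE STRING
```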
              

Mode json2xml:

# CONVERTING JSON DOCUMENT TO THE XML BLOCK
# FINDS FIRST `script`
- find:
    path: script
    slice: 0
    do:
    # PARSE TEXT
    - parse
    # REGISTER VALUE: var items = {"items":[{"item": {"somefield1": "text", "somefield2": "another text"}}]};

    # REMOVING STUFF WE DONT NEED
    - normalize:
        routine: replace_substring
        args:
            - \s*var\s*items\s*=\s*: ''
            - \s*;\s*$: '' 
    # REGISTER NOW HAS JSON CONTENT: {"items":[{"item": {"somefield1": "text", "somefield2": "another text"}}]}

    # CONVERT IT TO XML
    - normalize:
        routine: json2xml
    # REGISTER NOW HAS:
    # <body_safe>
    #   <items>
    #     <item>
    #       <somefield1>text</somefield1>
    #       <somefield2>another text</somefield2>
    #     </item>
    #   </items>
    # </body_safe>

    # PARSE XML CONTENT, CONVERT IT TO THE BLOCK AND SWITCH TO THIS NEW BLOCK CONTEXT
    - to_block

    # NOW WE ARE IN THE NEW BLOCK AND CAN TRAVERSE OVER ITS DOM STRUCTURE
    - find:
        path: item > somefield1
        do:
        - parse
        # REGISTER VALUE: "text"
              

In the example above we used the to_block command, which creates a new block from the contents of the register. For more information about this command, see Creating a new block.

To use the normalize command in date_format mode, you need to pass additional arguments:

| Argument | Description |
|---|---|
| format_in | The template for the input data (the value in the register), so that the digger can parse the provided date. The tag table below lists everything you can use in the template. |
| format_out | The template for the output data (how the modified value is written back to the register). The tag table below lists everything you can use in the template. |
| add_years | Adds the specified number of years to the value in the register. Both positive and negative integers are allowed. |
| add_months | Adds the specified number of months to the value in the register. Both positive and negative integers are allowed. |
| add_days | Adds the specified number of days to the value in the register. Both positive and negative integers are allowed. |
| add_hours | Adds the specified number of hours to the value in the register. Both positive and negative integers are allowed. |
| add_minutes | Adds the specified number of minutes to the value in the register. Both positive and negative integers are allowed. |
| add_seconds | Adds the specified number of seconds to the value in the register. Both positive and negative integers are allowed. |
| timezone | Converts the date and time into the specified time zone (TZ). For example: America/New_York |

The table below shows all possible tags that can be used in templates (format_in, format_out) and examples of usage:

| Tag | Description | Template example | Sample value |
|---|---|---|---|
| %a | abbreviated weekday name, e.g. Mon or Fri | %a, %d %B | Fri, 20 February |
| %A | full weekday name, e.g. Monday or Friday | %A, %d %B | Friday, 20 February |
| %b | abbreviated month name, e.g. Feb or Sep | %A, %d %b | Friday, 20 Jun |
| %B | full month name, e.g. February or September | %A, %d %B | Friday, 20 June |
| %C | century number, from 00 to 99 | %C/%y | 20/17 |
| %d | day of month, from 01 to 31 | %Y-%m-%d | 2017-10-01 |
| %D | preset template, same as %m/%d/%y | %D | 05/08/17 |
| %e | day of month, from 1 to 31 | %e %B | 5 January |
| %F | preset template, same as %Y-%m-%d | %F | 2017-10-01 |
| %g | 2-digit year according to the ISO 8601:1988 standard | %g | 17 |
| %G | 4-digit year according to the ISO 8601:1988 standard | %G | 2017 |
| %h | same as %b | %A, %d %h | Friday, 20 Jun |
| %H | hour in the 24-hour system, from 00 to 23 | %H:%M:%S | 08:35:26 |
| %I | hour in the 12-hour system, from 01 to 12 | %I:%M:%S | 08:35:26 |
| %j | day of year, from 1 to 366 | Today is %j day of year | Today is 183 day of year |
| %k | hour in the 24-hour system, from 0 to 23 | %k hrs %M mnt | 8 hrs 35 mnt |
| %l | hour in the 12-hour system, from 1 to 12 | %l hrs %M mnt | 8 hrs 35 mnt |
| %m | month number, from 01 to 12 | %Y-%m-%d | 2017-10-01 |
| %M | minutes, from 00 to 59 | %l hrs %M mnt | 8 hrs 35 mnt |
| %n | newline character | %Y%n%m | 2017\n10 |
| %p | AM or PM depending on the time, used with the 12-hour system | %I%p | 08AM |
| %P | am or pm depending on the time, used with the 12-hour system | %I%P | 08am |
| %r | same as %I:%M:%S %p | %r | 04:12:37 PM |
| %R | same as %H:%M | %R | 22:35 |
| %s | Unix timestamp: number of seconds since the start of the epoch (1 January 1970) | %s | 1506867213 |
| %S | seconds, from 00 to 59 | %H:%M:%S | 08:35:26 |
| %t | tab character | %Y%t%m | 2017\t10 |
| %T | same as %H:%M:%S | %T | 08:35:26 |
| %u | weekday number, from 1 (Monday) to 7 (Sunday) | Today is %u week day | Today is 5 week day |
| %U | week number of the year with weeks starting on Sunday, from 00 to 53 | It was %U week | It was 23 week |
| %V | week number of the year by the ISO standard with weeks starting on Monday, from 01 to 53. If the week containing 1 January has 4 or more days in the new year, it counts as the first week of the new year; otherwise it counts as the last week of the previous year. | It was %V week | It was 23 week |
| %w | weekday number, from 0 (Sunday) to 6 (Saturday) | Today is %w day of week | Today is 5 day of week |
| %W | week number of the year with weeks starting on Monday, from 00 to 53 | It was %W week | It was 23 week |
| %y | 2-digit year | %m/%d/%y | 10/01/17 |
| %Y | 4-digit year | %Y-%m-%d | 2017-10-01 |
| %z | offset from UTC in the format +HHMM or -HHMM, where + means east of GMT, - means west of GMT, HH is the number of hours and MM is the number of minutes | %z | +0300 |
| %Z | time zone abbreviation | %Z | PST |
| %+ | same as %a %b %e %H:%M:%S %Z %Y | %+ | Mon Sep 20 13:24:55 PST 2017 |
| %% | literal % character | %Y%%%m | 2017%10 |

Example of usage:

# DATE MANIPULATIONS
# FINDS THIRD `li`
- find:
    path: 'li#li3'
    do:
    # PARSE TEXT
    - parse
    # REGISTER VALUE: 01/21/2017

    # CONVERT DATE TO UNIX TIMESTAMP
    - normalize:
        routine: date_format
        args:
            format_in: "%m/%d/%Y"
            format_out: "%s"
    # REGISTER VALUE: 1484956800

    # -----------------------------------------------------
    # LETS PARSE TEXT AGAIN
    - parse
    # REGISTER VALUE: 01/21/2017

    # ADD 2 YEARS TO THE DATE
    - normalize:
        routine: date_format
        args:
            format_in: "%m/%d/%Y"
            add_years: 2
            format_out: "%m/%d/%Y"
    # REGISTER VALUE: 01/21/2019

    # -----------------------------------------------------
    # PARSE TEXT ONE MORE TIME
    - parse

    # SUBTRACT 1 YEAR AND CHANGE OUTPUT FORMAT
    - normalize:
        routine: date_format
        args:
            format_in: "%m/%d/%Y"
            add_years: -1
            format_out: "%B,%d %Y"
    # REGISTER VALUE: January,21 2016

    # -----------------------------------------------------
    # NOW SET A LITERAL VALUE IN THE REGISTER
    - register_set: '01/21/2017 00:00:00 +0000 UTC'

    # CONVERT IT TO THE TIMEZONE: America/New_York
    - normalize:
        routine: date_format
        args:
            format_in: "%m/%d/%Y %T %z %Z"
            timezone: 'America/New_York'
            format_out: "%m/%d/%Y %T %z %Z"
    # REGISTER VALUE: 01/20/2017 19:00:00 -0500 EST

    # -----------------------------------------------------
    # AS WE KNOW, add_* HAS VARIATIONS:
    # add_years   - YEARS
    # add_months  - MONTHS
    # add_days    - DAYS
    # add_hours   - HOURS
    # add_minutes - MINUTES
    # add_seconds - SECONDS
    # YOU CAN USE THEM ALL TOGETHER
    - register_set: '01/21/2017 00:00:00 +0000 UTC'
    - normalize:
        routine: date_format
        args:
            format_in: "%m/%d/%Y %T %z %Z"
            add_years: 10
            add_months: 5
            add_days: 1
            add_hours: 6
            add_minutes: 15
            add_seconds: 50
            format_out: "%m/%d/%Y %T %z %Z"
    # REGISTER VALUE: 06/22/2027 06:15:50 +0000 UTC
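
As a further sketch, tags from the table can be combined to turn a numeric date into a more readable one. The expected value assumes that %A and %B produce English names, as in the table samples; 21 January 2017 fell on a Saturday.

```yaml
# FINDS THIRD `li`
- find:
    path: 'li#li3'
    do:
    # PARSE TEXT
    - parse
    # REGISTER VALUE: 01/21/2017

    # REFORMAT THE DATE WITH WEEKDAY AND MONTH NAMES
    - normalize:
        routine: date_format
        args:
            format_in: "%m/%d/%Y"
            format_out: "%A, %d %B %Y"
    # EXPECTED REGISTER VALUE: Saturday, 21 January 2017
```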
                

To use the normalize command in signature mode, you need to pass the following arguments:

| Argument | Description |
|---|---|
| algo | Algorithm to use. Currently only HMAC is supported. |
| cypher | Cypher to use. Currently only SHA256 is supported. |
| encode | Encoder to use. Currently only Base64 is supported. |
| secret | Secret key. |

Usage example:

# CREATE SIGNATURE
# FINDS FIRST `li`
- find:
    path: 'li#li1'
    do:
    # PARSE TEXT
    - parse

    # REGISTER VALUE: "Text and text"
    - normalize:
        routine: signature
        args:
            algo: HMAC
            cypher: SHA256
            encode: Base64
            secret: A6A7A8A9

    # REGISTER VALUE: eQZxtV7Ae/wu3Enx8C2po9L7j3cefePEwoFZiTmkb7M=
              

Modes base64 and md5:

# FINDS FIRST `li`
- find:
    path: 'li#li1'
    do:
    # PARSE TEXT
    - parse
    # REGISTER VALUE: "Text and text"

    - normalize:
        routine: base64
    # REGISTER VALUE: VGV4dCBhbmQgdGV4dA==
              
# FINDS FIRST `li`
- find:
    path: 'li#li1'
    do:
    # PARSE TEXT
    - parse
    # REGISTER VALUE: "Text and text"

    - normalize:
        routine: md5
    # REGISTER VALUE: be8bde1051516b04402f02e00b4687b7
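
Modes can also be combined with the ones shown earlier. One possible use, sketched here as an assumption rather than a recommended practice, is to hash an absolute URL to get a compact, stable key for a link.

```yaml
# FINDS `a`
- find:
    path: a
    do:
    # EXTRACT VALUE OF ATTRIBUTE `href`
    - parse:
        attr: href
    # REGISTER VALUE: sandbox.html

    # MAKE THE URL ABSOLUTE
    - normalize:
        routine: url

    # HASH THE ABSOLUTE URL
    - normalize:
        routine: md5
    # REGISTER NOW HOLDS THE MD5 HASH OF THE ABSOLUTE URL
```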
              

base64zlib_decode mode:

# SET AN ENCODED VALUE DIRECTLY IN THE REGISTER
# IT IS THE ENCODED FORM OF THE STRING `some text`
- register_set: eJwrzs9NVShJrSgBABHoA5o=

# DECODE VALUE
- normalize:
    routine: base64zlib_decode

# REGISTER VALUE: some text
              

In the next section, we'll discuss how to work with variables, arguments, and objects.