Methods for Working with the Register

Normalize

The normalize command is used to manipulate data in the register.

Please note:
The normalize command works only with the register.
It takes data from the register as input and writes the result of the manipulation back to the register. Some normalization modes require additional arguments. For example, the replace_substring mode requires a list of regular expressions to search for and the corresponding replacement values.

Example of usage:

- normalize:
    # SPECIFY THE MODE WE ARE GOING TO USE
    routine: replace_substring
    # SUPPLY ADDITIONAL PARAMETERS IF REQUIRED
    args:
        - ^\s+|\s+$: ''

Currently the following modes (routines) are supported:

Mode	Description
replace_substring	Searches the register for all occurrences of the given substring and replaces every match with the given value. The substring can be given as a regular expression. Each pair has the format substring_to_change: replacement_value, and pairs are passed in the args parameter. You can pass more than one pair; in that case it is better to use a list of pairs rather than a dictionary, because a dictionary does not guarantee the order in which pairs are processed, while the order of list elements is always fixed. Search and replace is then performed sequentially for each pair.
replace_matched	Searches the register for at least one occurrence of the given substring and, if it is found, replaces the whole value of the register with the specified one. The substring can be given as a regular expression. Each pair has the format substring_to_change: replacement_value, and pairs are passed in the args parameter. You can pass more than one pair; in that case it is better to use a list of pairs rather than a dictionary, because a dictionary does not guarantee the order in which pairs are processed, while the order of list elements is always fixed. Pairs are tried sequentially until the first match. If no match is found, the register value stays unchanged.
increment	Increases the value of the register by 1, the register must have an integer value.
decrement	Decreases the value of the register by 1, the register must have an integer value.
capitalize	Changes the first letter of each word in the contents of the register to the capital letter. The exception is short connecting words (or, and, of, etc.).
upper_first	Changes the first letter of the contents of the register to the capital letter.
upper	Changes all the letters of the contents of the register to capital letters.
lower	Changes all letters of the contents of the register to lowercase.
url	If the URL in the register is relative, normalizes it and makes it absolute.
escape_html	Converts sequences of characters that are not valid in HTML to HTML entities.
unescape_html	Converts all HTML entities to the corresponding characters.
urlencode	Encodes the contents of the register to be used as a parameter for the GET request in the URL.
urldecode	Decodes the URL-encoded contents of the register. It is the reverse of urlencode.
json2xml	Converts register content in JSON format to XML format.
transit2xml	Converts register content in Transit+JSON format to XML format. Works similarly to json2xml.
base64	Encodes register content to Base64.
base64_decode	Decodes register content from Base64.
base64gzip_decode	Decodes register content from Base64GZIP format.
base64zip_decode	Decodes register content from Base64ZIP format.
base64deflate_decode	Decodes register content from Base64Deflate format.
base64zlib_decode	Decodes register content from Base64ZLIB format.
md5	Calculates MD5 hash of the register content.
signature	Calculates a signature of the register content using the given algorithm and cypher, then encodes it in the given encoding format.
date_format	Parses, modifies and reformats date and time values.

Let's go through the modes. Suppose we have the following HTML source:

<div class="container">
    <ul class="list">
        <li class="item" id="li1">Text and text</li>
        <li class="item" id="li2">text and text</li>
        <li class="item" id="li3">01/21/2017</li>
        <script>
          var items = {"items":[{"item": {"somefield1": "text", "somefield2": "another text"}}]};
        </script>
    </ul>
    <a href="sandbox.html">Link</a>
</div>

The replace_substring and replace_matched modes work as follows:

# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # USE `parse` TO FILL THE REGISTER WITH TEXT CONTENT OF CURRENT BLOCK, REGISTER VALUE: "Text and text"
    - parse

    # DO SOME NORMALIZATION
    - normalize:
        routine: replace_substring
        args:
            # REPLACE THE WORD "text" WITH "some another text"
            - text: 'some another text'
            # REGISTER VALUE: "Text and some another text"

            # REPLACE THE WORD "and" WITH "or"
            - and: or
            # REGISTER VALUE: "Text or some another text"

            # REPLACE THE WORD "another" WITH "other"
            - another: other

    # REGISTER VALUE: "Text or some other text"

# PLEASE NOTE:
# `replace_matched`, UNLIKE `replace_substring`, STOPS AT THE FIRST MATCH

# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # PARSE TEXT TO THE REGISTER
    - parse

    # NORMALIZE DATA
    - normalize:
        routine: replace_matched
        args:
            # REPLACE THE WHOLE REGISTER VALUE WITH "some another text"
            # IF THE REGISTER VALUE CONTAINS THE WORD "text"
            - text: 'some another text'
            # REGISTER VALUE: "some another text"

            # SINCE MATCH HAPPENED, ALL ARGUMENTS BELOW WILL BE IGNORED
            - and: or
            - another: other

    # REGISTER VALUE: "some another text"
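
Because replace_matched stops at the first matching pair and leaves the register unchanged when nothing matches, an ordered list of pairs behaves like a lookup with fall-through. The fragment below is only an illustrative sketch: the patterns and replacement values are made up for this example, and the expected register value follows from the mode description above rather than from a recorded run.

```yaml
# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    - parse
    # REGISTER VALUE: "Text and text"

    - normalize:
        routine: replace_matched
        args:
            # NO MATCH: "foo" DOES NOT OCCUR, SO THE NEXT PAIR IS TRIED
            - foo: 'has foo'
            # MATCH: THE WHOLE REGISTER IS REPLACED AND PROCESSING STOPS
            - ^Text: 'starts with Text'
            # NEVER REACHED, EVEN THOUGH "and" ALSO OCCURS IN THE VALUE
            - and: 'has and'

    # EXPECTED REGISTER VALUE: "starts with Text"
```

Using a list here matters: with a dictionary the pairs could be tried in a different order, and the result would depend on which matching pair happens to come first.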

Examples for modes increment, decrement, capitalize, upper_first, upper and lower:

# TO USE THIS MODE, THE REGISTER MUST HOLD AN INTEGER VALUE (1, 2, 158, 203040523421 AND SO ON)
# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # GET ATTRIBUTE `id` VALUE AND EXTRACT JUST DIGITS
    - parse:
        attr: id
        filter: (\d+)
    # REGISTER VALUE: "1"

    # USE NORMALIZATION
    - normalize:
        routine: increment

    # REGISTER VALUE: "2"

# TO USE THIS MODE, THE REGISTER MUST HOLD AN INTEGER VALUE (1, 2, 158, 203040523421 AND SO ON)
# FINDS SECOND `li`
- find:
    path: "li#li2"
    do:
    # GET ATTRIBUTE `id` VALUE AND EXTRACT JUST DIGITS
    - parse:
        attr: id
        filter: (\d+)
    # REGISTER VALUE: "2"

    # DOING DECREMENT
    - normalize:
        routine: decrement

    # REGISTER VALUE: "1"

# CAPITALIZE THE STRING IN THE REGISTER
# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # PARSE TEXT TO THE REGISTER
    - parse
    # REGISTER VALUE: "Text and text"

    # APPLY CAPITALIZE ROUTINE
    - normalize:
        routine: capitalize

    # REGISTER VALUE: "Text and Text"

# CHANGE ONLY THE FIRST LETTER OF THE STRING IN THE REGISTER TO UPPERCASE
# FINDS SECOND `li`
- find:
    path: "li#li2"
    do:
    # PARSE TEXT CONTENT TO THE REGISTER
    - parse
    # REGISTER VALUE: "text and text"

    # NORMALIZE CONTENT OF THE REGISTER
    - normalize:
        routine: upper_first

    # REGISTER VALUE: "Text and text"

# CONVERT ALL LETTERS OF THE REGISTER CONTENT TO UPPERCASE
# FINDS SECOND `li`
- find:
    path: "li#li2"
    do:
    # PARSE TEXT TO THE REGISTER
    - parse
    # REGISTER VALUE: "text and text"

    # APPLY NORMALIZATION
    - normalize:
        routine: upper

    # REGISTER VALUE: "TEXT AND TEXT"

# CONVERT ALL LETTERS OF THE REGISTER CONTENT TO LOWERCASE
# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # PARSE TEXT
    - parse
    # REGISTER VALUE: "Text and text"

    # NORMALIZE REGISTER
    - normalize:
        routine: lower

    # REGISTER VALUE: "text and text"
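
Since every normalize call reads the register and writes its result back, calls can be chained one after another. The fragment below is a small sketch of that idea using only the modes shown above; the intermediate values are what the mode descriptions imply, not output captured from a run.

```yaml
# FINDS SECOND `li`
- find:
    path: "li#li2"
    do:
    - parse
    # REGISTER VALUE: "text and text"

    - normalize:
        routine: upper
    # EXPECTED REGISTER VALUE: "TEXT AND TEXT"

    - normalize:
        routine: lower
    # EXPECTED REGISTER VALUE: "text and text"

    - normalize:
        routine: upper_first
    # EXPECTED REGISTER VALUE: "Text and text"
```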

Modes url, escape_html and unescape_html:

# CHANGING RELATIVE LINKS TO ABSOLUTE
# FINDS `a`
- find:
    path: a
    do:
    # EXTRACT VALUE OF ATTRIBUTE `href`
    - parse:
        attr: href
    # REGISTER VALUE: sandbox.html

    # LETS NORMALIZE URL
    - normalize:
        routine: url

    # REGISTER VALUE: https://www.diggernaut.com/sandbox.html

# CONVERTING SYMBOLS LIKE <, >, " TO HTML ENTITIES
# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # PARSE HTML CONTENT FROM CURRENT BLOCK
    - parse:
        format: html
    # REGISTER VALUE: <li class="item" id="li1">Text and text</li>

    # APPLY NORMALIZATION
    - normalize:
        routine: escape_html

    # REGISTER VALUE: THE SAME HTML WITH SPECIAL CHARACTERS (<, >, ") CONVERTED TO HTML ENTITIES

# REPLACE ALL HTML ENTITIES WITH THE CORRESPONDING CHARACTERS LIKE <, >, " ETC
# FINDS FIRST `li`
- find:
    path: "li#li1"
    do:
    # PUT A LITERAL VALUE (ESCAPED HTML) INTO THE REGISTER
    - register_set: '&lt;li class=&quot;item&quot; id=&quot;1&quot;&gt;Text and text&lt;/li&gt;'

    # NORMALIZE REGISTER VALUE
    - normalize:
        routine: unescape_html

    # REGISTER VALUE: <li class="item" id="1">Text and text</li>
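
The urlencode and urldecode modes have no example of their own, so here is a hedged sketch of a round trip. It assumes a literal value set with register_set; the exact encoded form (for instance, whether a space becomes "+" or "%20") is not stated in this document, so it is deliberately not shown.

```yaml
# PUT A LITERAL VALUE INTO THE REGISTER
- register_set: 'Text and text'

# ENCODE IT FOR USE AS A GET PARAMETER
- normalize:
    routine: urlencode
# THE REGISTER NOW HOLDS THE URL-ENCODED FORM OF THE VALUE

# DECODE IT BACK
- normalize:
    routine: urldecode
# EXPECTED REGISTER VALUE: Text and text
```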

Mode json2xml:

# CONVERTING A JSON DOCUMENT TO AN XML BLOCK
# FINDS FIRST `script`
- find:
    path: script
    slice: 0
    do:
    # PARSE TEXT
    - parse
    # REGISTER VALUE: var items = {"items":[{"item": {"somefield1": "text", "somefield2": "another text"}}]};

    # REMOVE THE PARTS WE DON'T NEED
    - normalize:
        routine: replace_substring
        args:
            - \s*var\s*items\s*=\s*: ''
            - \s*;\s*$: '' 
    # REGISTER NOW HAS JSON CONTENT: {"items":[{"item": {"somefield1": "text", "somefield2": "another text"}}]}

    # CONVERT IT TO XML
    - normalize:
        routine: json2xml
    # REGISTER NOW HAS:
    # <body_safe>
    #   <items>
    #     <item>
    #       <somefield1>text</somefield1>
    #       <somefield2>another text</somefield2>
    #     </item>
    #   </items>
    # </body_safe>

    # PARSE XML CONTENT, CONVERT IT TO THE BLOCK AND SWITCH TO THIS NEW BLOCK CONTEXT
    - to_block

    # NOW WE ARE IN THE NEW BLOCK AND CAN TRAVERSE OVER ITS DOM STRUCTURE
    - find:
        path: item > somefield1
        do:
        - parse
        # REGISTER VALUE: "text"

In the example above we used the to_block command, which creates a new block from the contents of the register. For more information about this command, see Creating a new block.

To use the normalize command in date_format mode, you need to pass additional arguments:

Argument	Description
format_in	The template for the input data (the value in the register), so that the digger can parse the provided date. The table below shows all the tags that you can use in the template.
format_out	The template for the output data (how the modified data will be written to the register). The table below shows all the tags that you can use in the template.
add_years	Adds the specified number of years to the value in the register. You can specify both positive and negative integer values.
add_months	Adds the specified number of months to the value in the register. You can specify both positive and negative integer values.
add_days	Adds the specified number of days to the value in the register. You can specify both positive and negative integer values.
add_hours	Adds the specified number of hours to the value in the register. You can specify both positive and negative integer values.
add_minutes	Adds the specified number of minutes to the value in the register. You can specify both positive and negative integer values.
add_seconds	Adds the specified number of seconds to the value in the register. You can specify both positive and negative integer values.
timezone	Converts the date and time into the specified time zone (TZ). For example: America/New_York

The table below shows all possible tags that can be used in templates (format_in, format_out) and examples of usage:

Tag	Description	Template Example	Value Sample
%a	abbreviation for weekdays, eg Mon or Fri	%a, %d %B	Fri, 20 February
%A	weekday, eg Monday or Friday	%A, %d %B	Friday, 20 February
%b	month abbreviation, eg Feb or Sep	%A, %d %b	Friday, 20 Jun
%B	month name, eg February or September	%A, %d %B	Friday, 20 June
%C	number of century, takes values from 00 to 99	%C/%y	20/17
%d	day of month, takes values from 01 to 31	%Y-%m-%d	2017-10-01
%-d	day of month, takes values from 1 to 31	%Y-%m-%-d	2017-10-1
%_d	day of month, takes values from 1 or 01 to 31	%Y-%m-%_d	2017-10-1 or 2017-10-01
%D	preset template, same as %m/%d/%y	%D	05/08/17
%e	day of month, takes values from 1 to 31	%e %B	5 January
%F	preset template, same as %Y-%m-%d	%F	2017-10-01
%g	2-digit number of year according to ISO-8601:1988 standard	%g	17
%G	4-digit number of year according to ISO-8601:1988 standard	%G	2017
%h	same as %b	%A, %d %h	Friday, 20 Jun
%H	hour on the 24-hour clock, takes values from 00 to 23	%H:%M:%S	08:35:26
%I	hour on the 12-hour clock, takes values from 01 to 12	%I:%M:%S	08:35:26
%-I	hour on the 12-hour clock, takes values from 1 to 12	%-I:%M:%S	8:35:26
%_I	hour on the 12-hour clock, takes values from 1 or 01 to 12	%_I:%M:%S	8:35:26 or 08:35:26
%j	number of day of year, takes values from 1 to 366	Today is %j day of year	Today is 183 day of year
%k	hour on the 24-hour clock, takes values from 0 to 23	%k hrs %M mnt	8 hrs 35 mnt
%l	hour on the 12-hour clock, takes values from 1 to 12	%l hrs %M mnt	8 hrs 35 mnt
%m	number of month, takes values from 01 to 12	%Y-%m-%d	2017-10-01
%-m	number of month, takes values from 1 to 12	%Y-%-m-%d	2017-1-01
%_m	number of month, takes values from 1 or 01 to 12	%Y-%_m-%d	2017-1-01 or 2017-01-01
%M	minutes, takes values from 00 to 59	%l hrs %M mnt	8 hrs 35 mnt
%n	new line symbol	%Y%n%m	2017\n10
%p	value AM or PM depending on time, used with the 12-hour clock	%I%p	08AM
%P	value am or pm depending on time, used with the 12-hour clock	%I%P	08am
%r	same as %I:%M:%S %p	%r	04:12:37 PM
%R	same as %H:%M	%R	22:35
%s	Unix timestamp, the number of seconds since the start of the epoch (1 January 1970)	%s	1506867213
%S	seconds, takes values from 00 to 59	%H:%M:%S	08:35:26
%t	tabulation symbol	%Y%t%m	2017\t10
%T	same as %H:%M:%S	%T	08:35:26
%u	number of weekday from 1 (monday) to 7 (sunday)	Today is %u week day	Today is 5 week day
%U	number of week of year, if week starts with Sunday, takes values from 00 to 53	It was %U week	It was 23 week
%V	number of week of year by ISO standard, where the week starts with Monday, takes values from 01 to 53. If the week containing 1 January has 4 or more days in the new year, it is counted as the first week of the new year; otherwise it is counted as the last week of the previous year.	It was %V week	It was 23 week
%w	number of day of week from 0 (Sunday) to 6 (Saturday)	Today is %w day of week	Today is 5 day of week
%W	number of week of year, where the week starts with Monday, takes values from 00 to 53	It was %W week	It was 23 week
%y	2-digit number of year	%m/%d/%y	10/01/17
%Y	4-digit number of year	%Y-%m-%d	2017-10-01
%z	offset from UTC, shown in the format +HHMM or -HHMM, where + means east of GMT, - means west of GMT, HH is the number of hours and MM is the number of minutes.	%z	+0300
%Z	abbreviation of the time zone	%Z	PST
%+	same as %a %b %e %H:%M:%S %Z %Y	%+	Mon Sep 20 13:24:55 PST 2017
%%	symbol %	%Y%%%m	2017%10

Example of usage:

# DATE MANIPULATIONS
# FINDS THIRD `li`
- find:
    path: 'li#li3'
    do:
    # PARSE TEXT
    - parse
    # REGISTER VALUE: 01/21/2017

    # CONVERT DATE TO UNIX TIMESTAMP
    - normalize:
        routine: date_format
        args:
            format_in: "%m/%d/%Y"
            format_out: "%s"
    # REGISTER VALUE: 1484956800

    # -----------------------------------------------------
    # LETS PARSE TEXT AGAIN
    - parse
    # REGISTER VALUE: 01/21/2017

    # ADD 2 YEARS TO THE DATE
    - normalize:
        routine: date_format
        args:
            format_in: "%m/%d/%Y"
            add_years: 2
            format_out: "%m/%d/%Y"
    # REGISTER VALUE: 01/21/2019

    # -----------------------------------------------------
    # PARSE TEXT ONE MORE TIME
    - parse

    # SUBTRACT 1 YEAR AND CHANGE OUTPUT FORMAT
    - normalize:
        routine: date_format
        args:
            format_in: "%m/%d/%Y"
            add_years: -1
            format_out: "%B,%d %Y"
    # REGISTER VALUE: January,21 2016

    # -----------------------------------------------------
    # NOW LETS SET LITERAL VALUE TO THE REGISTER
    - register_set: '01/21/2017 00:00:00 +0400 UTC'

    # CONVERT IT TO THE TIMEZONE: America/New_York
    - normalize:
        routine: date_format
        args:
            format_in: "%m/%d/%Y %T %z %Z"
            timezone: 'America/New_York'
            format_out: "%m/%d/%Y %T %z %Z"
    # REGISTER VALUE: 01/20/2017 19:00:00 -0500 EST

    # -----------------------------------------------------
    # AS WE KNOW add_* HAS VARIATIONS:
    # add_years   - YEARS
    # add_months  - MONTHS
    # add_days    - DAYS
    # add_hours   - HOURS
    # add_minutes - MINUTES
    # add_seconds - SECONDS
    # YOU CAN USE THEM ALL TOGETHER
    - register_set: '01/21/2017 00:00:00 +0400 UTC'
    - normalize:
        routine: date_format
        args:
            format_in: "%m/%d/%Y %T %z %Z"
            add_years: 10
            add_months: 5
            add_days: 1
            add_hours: 6
            add_minutes: 15
            add_seconds: 50
            format_out: "%m/%d/%Y %T %z %Z"
    # REGISTER VALUE: 06/22/2027 06:15:50 +0000 UTC
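
As a further hedged sketch, the fragment below combines a preset tag from the tag table (%F) with a negative add_days value. The literal date is set with register_set purely for illustration, and the expected values are derived from the tag and argument descriptions above rather than from a recorded run.

```yaml
# CONVERT A US-STYLE DATE TO ISO FORMAT USING THE PRESET TAG %F
- register_set: '01/21/2017'
- normalize:
    routine: date_format
    args:
        format_in: "%m/%d/%Y"
        format_out: "%F"
# EXPECTED REGISTER VALUE: 2017-01-21

# SUBTRACT 21 DAYS AND PRINT THE WEEKDAY AND MONTH NAME
- register_set: '01/21/2017'
- normalize:
    routine: date_format
    args:
        format_in: "%m/%d/%Y"
        add_days: -21
        format_out: "%A, %d %B %Y"
# EXPECTED REGISTER VALUE: Saturday, 31 December 2016
```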

To use the normalize command in signature mode, you need to pass the following arguments:

Argument	Description
algo	Algorithm to use. Currently only HMAC is supported.
cypher	Cypher to use. Currently only SHA256 is supported.
encode	Encoder to use. Currently only Base64 is supported.
secret	Secret key.

Usage example:

# CREATE SIGNATURE
# FINDS FIRST `li`
- find:
    path: 'li#li1'
    do:
    # PARSE TEXT
    - parse

    # REGISTER VALUE: "Text and text"
    - normalize:
        routine: signature
        args:
            algo: HMAC
            cypher: SHA256
            encode: Base64
            secret: A6A7A8A9

    # REGISTER VALUE: eQZxtV7Ae/wu3Enx8C2po9L7j3cefePEwoFZiTmkb7M=

base64, md5 modes:

# FINDS FIRST `li`
- find:
    path: 'li#li1'
    do:
    # PARSE TEXT
    - parse
    # REGISTER VALUE: "Text and text"

    - normalize:
        routine: base64
    # REGISTER VALUE: VGV4dCBhbmQgdGV4dA==

# FINDS FIRST `li`
- find:
    path: 'li#li1'
    do:
    # PARSE TEXT
    - parse
    # REGISTER VALUE: "Text and text"

    - normalize:
        routine: md5
    # REGISTER VALUE: be8bde1051516b04402f02e00b4687b7
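
The base64_decode mode from the table reverses base64, so the two can be used for a round trip. This is a minimal sketch that assumes a literal value set with register_set; the encoded string is the same one shown in the base64 example above.

```yaml
# PUT A LITERAL VALUE INTO THE REGISTER
- register_set: 'Text and text'

# ENCODE TO BASE64
- normalize:
    routine: base64
# REGISTER VALUE: VGV4dCBhbmQgdGV4dA==

# DECODE BACK FROM BASE64
- normalize:
    routine: base64_decode
# EXPECTED REGISTER VALUE: Text and text
```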

base64zlib_decode mode:

# SET THE ENCODED VALUE DIRECTLY IN THE REGISTER
# THIS IS THE ENCODED FORM OF THE STRING `some text`
- register_set: eJwrzs9NVShJrSgBABHoA5o=

# DECODE VALUE
- normalize:
    routine: base64zlib_decode

# REGISTER VALUE: some text

In the next section, we'll discuss how to work with variables, arguments, and objects.