Conditional Flow

Using If

When scraping, you often need to perform some operations only if a certain condition holds. To check a condition, use the if command. It works with string, integer, and floating-point values. The command compares the value of the register with a value passed explicitly as a parameter, so it can only be used in a block context.

You can use the following parameters:

Parameter            Description

match|eq|gt|lt|nlt   Comparison mode:
                     match - checks whether the value of the register matches the passed regular expression (for a plain word, whether the register contains it);
                     eq - checks whether the value of the register is equal to the passed value;
                     gt - checks whether the value of the register is greater than the passed value;
                     lt - checks whether the value of the register is less than the passed value;
                     nlt - checks whether the value of the register is not less than (greater than or equal to) the passed value.

type                 Type of the values being compared:
                     string - compare as strings;
                     int - compare as integers;
                     float - compare as floating-point numbers.
                     If not passed, the string type is used by default.

do                   A block of commands to execute if the condition is met. Optional.

else                 A block of commands to execute if the condition is not met. Optional.

Examples of if notation:

# COMPARISON USING A REGULAR EXPRESSION
- if:
    match: regex
    # IF THE REGEX MATCHES, THE `do` BLOCK IS EXECUTED
    do:
    ..
    ..
    # OTHERWISE, THE `else` BLOCK IS EXECUTED
    else:
    ..
    ..
          
# COMPARISON USING A REGULAR EXPRESSION,
# WITH ONLY THE `do` BLOCK AND NO `else` BLOCK
# WORKS THE SAME WAY IN OTHER MODES
- if:
    match: regex
    # IF THE REGEX MATCHES, THE `do` BLOCK IS EXECUTED
    do:
    ..
    ..
          
# COMPARISON USING A REGULAR EXPRESSION,
# WITH ONLY THE `else` BLOCK
# WORKS THE SAME WAY IN OTHER MODES
- if:
    match: regex
    # IF THE REGEX DOES NOT MATCH, THE `else` BLOCK IS EXECUTED
    else:
    ..
    ..
          
# COMPARISON OF INTEGER VALUES
- if:
    eq: 0
    type: int
    # IF THE REGISTER VALUE IS EQUAL TO 0, THE `do` BLOCK IS EXECUTED
    do:
    ..
    ..
    # OTHERWISE, THE `else` BLOCK IS EXECUTED
    else:
    ..
    ..
          
# COMPARISON OF INTEGER VALUES
- if:
    gt: 0
    type: int
    # IF THE REGISTER VALUE IS GREATER THAN 0, THE `do` BLOCK IS EXECUTED
    do:
    ..
    ..
    # OTHERWISE, THE `else` BLOCK IS EXECUTED
    else:
    ..
    ..

- if:
    lt: 0
    type: int
    # IF THE REGISTER VALUE IS LESS THAN 0, THE `do` BLOCK IS EXECUTED
    do:
    ..
    ..
    # OTHERWISE, THE `else` BLOCK IS EXECUTED
    else:
    ..
    ..

- if:
    nlt: 0
    type: int
    # IF THE REGISTER VALUE IS NOT LESS THAN 0, THE `do` BLOCK IS EXECUTED
    do:
    ..
    ..
    # OTHERWISE, THE `else` BLOCK IS EXECUTED
    else:
    ..
    ..
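
The notation examples above cover only string and int comparisons. As a minimal sketch of the float type, the fragment below assumes the register already holds a decimal number such as 19.99 (for instance, after a parse with a suitable filter). The threshold 9.99, the object name `product`, and the field name `price` are hypothetical and chosen only for illustration.

```yaml
# COMPARISON OF FLOAT VALUES
- if:
    # TRUE WHEN THE REGISTER VALUE IS NOT LESS THAN 9.99
    nlt: 9.99
    type: float
    do:
    # HYPOTHETICAL OBJECT AND FIELD NAMES
    - object_field_set:
        object: product
        field: price
```

Without `type: float`, the default string type would be used, so it seems safer to state the type explicitly whenever a numeric comparison is intended.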
          

Let's look at different use cases of the if command, using the following HTML source:

<ul class="list">
    <li class="list-item" id="1">Some text</li>
    <li class="list-item" id="item=2"><a href="http://somesite.com/">Link</a></li>
    <li class="list-item" id="item=3">Some other text</li>
</ul>
          

Examples of match mode usage:

# FIND ALL `li`
- find:
    path: li
    do:
    - parse

    # CHECK WHETHER THE REGISTER CONTAINS THE WORD `text`
    - if:
        match: text
        do:
        # IF TRUE, SET THE OBJECT FIELD TO THE VALUE OF THE REGISTER
        - object_field_set:
            object: someobj
            field: somefield
              
# FIND ALL `li`
- find:
    path: li
    do:
    - parse

    # CHECK WHETHER THE REGISTER CONTAINS THE WORD `text`
    - if:
        match: text
        # IF NOT FOUND, FIND `a`
        else:
        - find:
            path: a
            do:
            # PARSE THE `href` ATTRIBUTE INTO THE REGISTER
            - parse:
                attr: href

            # NORMALIZE THE URL
            - normalize:
                routine: url

            # LOAD THE PAGE LOCATED AT THAT URL
            - walk:
                to: value
                do:
                ..
                ..
              
# FIND ALL `li`
- find:
    path: li
    do:
    - parse

    # CHECK WHETHER THE REGISTER CONTAINS THE WORD `text`
    - if:
        match: text
        do:
        # IF TRUE, SET THE OBJECT FIELD TO THE VALUE OF THE REGISTER
        - object_field_set:
            object: someobj
            field: somefield

        # IF NOT FOUND, FIND `a`
        else:
        - find:
            path: a
            do:
            # PARSE THE `href` ATTRIBUTE INTO THE REGISTER
            - parse:
                attr: href

            # NORMALIZE THE URL
            - normalize:
                routine: url

            # LOAD THE PAGE LOCATED AT THAT URL
            - walk:
                to: value
                do:
                ..
                ..
              

Examples of gt, lt, nlt, and eq mode usage:

# FIND ALL `li`
- find:
    path: li
    do:
    # PARSE THE `id` ATTRIBUTE VALUE AND EXTRACT ONLY THE DIGITS
    - parse:
        attr: id
        filter:
            - (\d+)

    # CHECK WHETHER THE VALUE OF THE REGISTER IS GREATER THAN `2`
    - if:
        gt: 2
        # SPECIFY THE `int` TYPE
        type: int
        do:
        # IF TRUE, SET THE OBJECT FIELD TO THE VALUE OF THE REGISTER
        - object_field_set:
            object: someobj
            field: somefield
              
# FIND ALL `li`
- find:
    path: li
    do:
    # PARSE THE `id` ATTRIBUTE VALUE AND EXTRACT ONLY THE DIGITS
    - parse:
        attr: id
        filter:
            - (\d+)

    # CHECK WHETHER THE VALUE OF THE REGISTER IS LESS THAN `2`
    - if:
        lt: 2
        # SPECIFY THE `int` TYPE
        type: int
        do:
        # IF TRUE, SET THE OBJECT FIELD TO THE VALUE OF THE REGISTER
        - object_field_set:
            object: someobj
            field: somefield
              
# FIND ALL `li`
- find:
    path: li
    do:
    # PARSE THE `id` ATTRIBUTE VALUE AND EXTRACT ONLY THE DIGITS
    - parse:
        attr: id
        filter:
            - (\d+)

    # CHECK WHETHER THE VALUE OF THE REGISTER IS NOT LESS THAN (GREATER THAN OR EQUAL TO) `2`
    - if:
        nlt: 2
        # SPECIFY THE `int` TYPE
        type: int
        do:
        # IF TRUE, SET THE OBJECT FIELD TO THE VALUE OF THE REGISTER
        - object_field_set:
            object: someobj
            field: somefield
              
# FIND ALL `li`
- find:
    path: li
    do:
    # PARSE THE `id` ATTRIBUTE VALUE AND EXTRACT ONLY THE DIGITS
    - parse:
        attr: id
        filter:
            - (\d+)

    # CHECK WHETHER THE VALUE OF THE REGISTER IS EQUAL TO `1`
    - if:
        eq: 1
        # SPECIFY THE `int` TYPE
        type: int
        else:
        # IF FALSE (THE VALUE IS NOT `1`), SET THE OBJECT FIELD TO THE VALUE OF THE REGISTER
        - object_field_set:
            object: someobj
            field: somefield
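
All of the comparisons above set type: int. As a sketch of eq with the default string type, the fragment below runs against the same HTML source and omits the type parameter. Assuming eq requires the whole register content to equal the passed value, only the first `li` (Some text) should satisfy the condition, while the third (Some other text) should not. This illustrates the parameter description rather than verified behavior.

```yaml
# FIND ALL `li`
- find:
    path: li
    do:
    - parse

    # NO `type` IS GIVEN, SO THE DEFAULT string COMPARISON IS USED
    - if:
        eq: Some text
        do:
        # ASSUMPTION: REACHED ONLY WHEN THE REGISTER IS EXACTLY `Some text`
        - object_field_set:
            object: someobj
            field: somefield
```

Compare this with match: text, which is satisfied by both the first and the third `li`, because match only needs the pattern to occur somewhere in the register.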
              

In the next chapter, you will learn how to use optical character recognition (OCR) to extract text from images.