Entity Manipulations

Using Hashes

Hashes are used as dictionaries for various purposes. For example, you can keep a dictionary of pages that were already visited and check it before loading a page, so the same page is not loaded twice. You can also build reference data sets and use them later to fill objects with data. Hashes cannot be used directly as data for substitution, but you can read their values to the register and work with them there.

The hashmap_set command writes a value to a field of the hash. The value is taken either from the register (only in the block context) or directly from the value parameter:

# SWITCHING TO THE BLOCK
- find:
    path: .somepath
    do:
    - parse

    # WRITING REGISTER VALUE TO THE HASH FIELD
    - hashmap_set:
        name: currency
        field: EUR

# WRITING VALUE TO THE HASH FIELD DIRECTLY
- hashmap_set:
    name: currency
    field: USD
    value: United States Dollar
              

The hashmap_get command reads the value of a hash field to the register:

# SWITCHING TO THE BLOCK
- find:
    path: .somepath
    do:
    # READING HASH FIELD TO THE REGISTER
    - hashmap_get:
        name: currency
        field: EUR
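
The two commands can be combined to build the kind of reference data set mentioned at the start of this section. The fragment below is only a sketch under stated assumptions: the selector `.currency-code` and the object field `currency_name` are hypothetical, and it assumes that `<%code%>` is substituted in the `field` parameter the same way the duplicate-check example in this section substitutes a variable in `name`.

```yaml
# FILL THE REFERENCE HASH DIRECTLY
- hashmap_set:
    name: currency
    field: USD
    value: United States Dollar
- hashmap_set:
    name: currency
    field: EUR
    value: Euro

# SWITCHING TO THE BLOCK
- find:
    # HYPOTHETICAL SELECTOR OF AN ELEMENT HOLDING A CURRENCY CODE
    path: .currency-code
    do:
    - parse
    # KEEP THE CODE (E.G. `USD`) IN A VARIABLE
    - variable_set: code
    - object_new: item
    # READ THE FULL CURRENCY NAME FROM THE HASH TO THE REGISTER
    # ASSUMPTION: <%code%> IS SUBSTITUTED IN `field` AS IT IS IN `name`
    - hashmap_get:
        name: currency
        field: <%code%>
    # SAVE REGISTER VALUE TO THE HYPOTHETICAL FIELD `currency_name`
    - object_field_set:
        object: item
        field: currency_name
    - object_save:
        name: item
```

If the code found on the page has no matching field in the hash, the register is expected to stay empty, so the object field would be empty as well.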
              

Let's see how you can use a hash to prevent the collection of duplicated events, that is, events that share the same Activity number. The source HTML for our scraper is available at the sandbox URL used in the configuration below and is reproduced after it.
Please note that Activity number 363101-09 appears twice in the table. We only need to collect the first record encountered and ignore all subsequent records with the same number.

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: https://www.diggernaut.com/sandbox/meta-lang-hash-table-en.html
    do:
    - find:
        # FIND ALL `tr` TAGS
        path: tbody > tr
        do:
        # CLEAR VARIABLE FOR KEEPING ACTIVITY NUMBER
        - variable_clear: number
        - find:
            path: td.col2
            do:
            - parse
            # SAVE NUMBER TO THE VARIABLE
            - variable_set: number
        # TRYING TO FIND HASH WITH NAME AS ACTIVITY NUMBER AND READ FIELD `name` TO THE REGISTER
        - hashmap_get:
            name: <%number%>
            field: name
        - if:
            # CHECK IF REGISTER IS NOT EMPTY
            match: \S
            # IF IT IS EMPTY, THIS ACTIVITY NUMBER HAS NOT BEEN SEEN YET
            else:
            # CREATE OBJECT `item`
            - object_new: item
            - find:
                path: td.col3
                do:
                - parse
                # CREATE HASH NAMED AFTER THE ACTIVITY NUMBER AND SAVE REGISTER VALUE (THE NAME OF THE ACTIVITY) TO THE FIELD `name`
                # THIS HASH WILL BE USED FURTHER FOR DUPLICATE CHECKING
                - hashmap_set:
                    name: <%number%>
                    field: name
                # SAVE VALUE OF THE REGISTER TO THE FIELD name OF THE OBJECT item
                - object_field_set:
                    object: item
                    field: name
            - find:
                path: td.col4
                do:
                - parse
                # SAVE LOCATION TO THE OBJECT
                - object_field_set:
                    object: item
                    field: location
            - find:
                path: td.col10
                do:
                - parse
                # SAVE STATUS OF EVENT TO THE OBJECT
                - object_field_set:
                    object: item
                    field: isAvailable
            # SAVE OBJECT TO THE DB
            - object_save:
                name: item
              
Source HTML of the page:

<!DOCTYPE html>
<html lang="en">

<head>
    <meta charset="UTF-8">
    <title>Diggernaut | Meta-language | Hash table sample</title>
</head>

<body>
    <div class="result-content">
        <div>
            <h3>363101&nbsp;-&nbsp;Jr Golf Clinic Orange</h3>
        </div>
        <table cellspacing="2" border="1" cellpadding="5">
            <thead>
                <tr>
                    <th>Activity</th>
                    <th>Description</th>
                    <th>Location</th>
                    <th>Status</th>
                </tr>
            </thead>
            <tbody>
                <tr>
                    <td class="col2">
                        <span class="nowrap">363101-07</span>
                    </td>
                    <td class="col3">Jr Golf-Orange 4,5:31</td>
                    <td class="col4">Randall Oaks Golf Cl</td>
                    <td class="col10">
                        <span class="success arstatus">Available</span>
                    </td>
                </tr>
                <tr>
                    <td class="col2">
                        <span class="nowrap">363101-09</span>
                    </td>
                    <td class="col3">Jr Golf-Orange 4,5:30</td>
                    <td class="col4">Randall Oaks Golf Cl</td>
                    <td class="col10">
                        <span class="success arstatus">Available</span>
                    </td>
                </tr>
                <tr>
                    <td class="col2">
                        <span class="nowrap">363101-09</span>
                    </td>
                    <td class="col3">Jr Golf-Orange 5,5:30</td>
                    <td class="col4">Randall Oaks Golf Cl</td>
                    <td class="col10">
                        <span class="success arstatus">Available</span>
                    </td>
                </tr>
                <tr>
                    <td class="col2">
                        <span class="nowrap">363101-10</span>
                    </td>
                    <td class="col3">Jr Golf-Orange 5,6:23</td>
                    <td class="col4">Randall Oaks Golf Cl</td>
                    <td class="col10">
                        <span class="success arstatus">Available</span>
                    </td>
                </tr>
            </tbody>
        </table>
    </div>
</body>
</html>
              
Resulting dataset (the second 363101-09 row is skipped):

[
    {
        "item": {
            "isAvailable": "Available",
            "location": "Randall Oaks Golf Cl",
            "name": "Jr Golf-Orange 4,5:31"
        }
    },
    {
        "item": {
            "isAvailable": "Available",
            "location": "Randall Oaks Golf Cl",
            "name": "Jr Golf-Orange 4,5:30"
        }
    },
    {
        "item": {
            "isAvailable": "Available",
            "location": "Randall Oaks Golf Cl",
            "name": "Jr Golf-Orange 5,6:23"
        }
    }
]
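
The same check-then-set pattern can serve as the dictionary of visited pages mentioned at the start of this section. The fragment below is a sketch, not a confirmed recipe: the selector `a.next-page` is hypothetical, and it assumes that `parse` with an `attr` parameter reads an attribute value to the register and that `walk` accepts a substituted variable in `to`.

```yaml
- find:
    # HYPOTHETICAL SELECTOR FOR LINKS TO FOLLOW
    path: a.next-page
    do:
    # CLEAR VARIABLE FOR KEEPING THE URL
    - variable_clear: url
    # ASSUMPTION: `parse` WITH `attr` READS THE ATTRIBUTE VALUE TO THE REGISTER
    - parse:
        attr: href
    # SAVE URL TO THE VARIABLE
    - variable_set: url
    # TRY TO READ FIELD `visited` OF THE HASH NAMED AFTER THE URL
    - hashmap_get:
        name: <%url%>
        field: visited
    - if:
        # REGISTER IS NOT EMPTY: THE PAGE WAS ALREADY VISITED, SKIP IT
        match: \S
        # REGISTER IS EMPTY: THE PAGE IS NEW
        else:
        # MARK THE PAGE AS VISITED BEFORE LOADING IT
        - hashmap_set:
            name: <%url%>
            field: visited
            value: done
        # ASSUMPTION: `to` ACCEPTS A SUBSTITUTED VARIABLE
        - walk:
            to: <%url%>
            do:
            - find:
                path: h1
                do:
                - parse
```

Marking the page before loading it means that a link to the same URL found on the loaded page itself is also recognized as visited.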
            

Next, we will look at how useful counters can be and what methods are provided for them.