Files

Switch to File Context

Sometimes you need to download a binary file (e.g., a zipped document) from a website and save it to cloud storage or a local drive. To do so, switch to the file context using the file command. This command works only in the block context and requires the base64-encoded file content to be in the register.
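To make the base64 requirement more concrete, the following Python sketch shows how binary content round-trips through base64 text. It is only an illustration of the encoding, not the digger's own code; the sample bytes and variable names are made up for the example.

```python
import base64

# Stand-in for the bytes of a downloaded binary file:
# the 4-byte ZIP signature followed by 8 zero bytes.
raw_bytes = b"PK\x03\x04" + bytes(8)

# Encoding turns arbitrary bytes into plain ASCII text,
# which is the form a text register can hold safely.
encoded_text = base64.b64encode(raw_bytes).decode("ascii")

# Decoding restores the original bytes exactly.
decoded_bytes = base64.b64decode(encoded_text)

print(encoded_text)
print(decoded_bytes == raw_bytes)
```

The point of the round trip is that nothing is lost: whatever ends up in the register as base64 text can be turned back into the original file byte for byte.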

Let's look at this in detail. First, load the file, go to the block with the encoded content, scrape the contents of the block to the register, and switch to the file context:

# LOAD FILE
- walk:
    to: http://www.td-systems.com/download/pw.zip
    do:
    # FIND THE BLOCK WITH THE BASE64 ENCODED FILE CONTENT
    - find:
        path: file
        do:
        # SCRAPE THE CONTENT
        - parse
        # SWITCH TO THE FILE CONTEXT
        - file:
            do:
            # SAVE OR TRANSFER THE FILE

Save the File

You can save the file to a local drive (available only in a compiled digger), to cloud storage (currently supported: Amazon S3 and Yandex Object Storage), or to an FTP server. The save command is used to save the file.

The command supports the following parameters:

Parameter  Description
ext        An extension that defines the type of file being saved. If omitted, the default value "ext" shall be used.
to         The type of storage. The following types are currently supported: file, s3, yandex and ftp.

The file type saves the file to a local drive. It works only in compiled diggers. This storage type accepts the following parameters, both of which are optional:

Parameter  Description
name       Filename without an extension. If not specified, a unique name will be generated.
path       A path to the directory where you want to save the file. If not specified, the file will be saved to the current directory.

# LOAD FILE
- walk:
    to: http://www.td-systems.com/download/pw.zip
    do:
    # FIND THE BLOCK WITH THE BASE64 ENCODED FILE CONTENT
    - find:
        path: file
        do:
        # SCRAPE THE CONTENT
        - parse
        # SWITCH TO THE FILE CONTEXT
        - file:
            do:
            # SAVE TO THE FILE (e://myscripts/myscript.zip)
            - save:
                to: file
                ext: zip
                name: myscript
                path: 'e://myscripts'
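
The way path, name, and ext combine into the final location can be sketched in Python as follows. This is a hypothetical helper written for illustration, not part of the digger: the function name build_target_path is invented, and a random UUID stands in for the "unique name" because the actual naming scheme is not documented here.

```python
import posixpath
import uuid
from typing import Optional


def build_target_path(ext: str = "ext",
                      name: Optional[str] = None,
                      path: Optional[str] = None) -> str:
    """Combine the save parameters into a target path (illustrative only)."""
    # Assumption: a random hex UUID stands in for the generated unique name.
    base_name = name if name else uuid.uuid4().hex
    file_name = f"{base_name}.{ext}"
    # Without a path, only the file name is returned
    # (current directory or bucket root, depending on the storage type).
    if not path:
        return file_name
    return posixpath.join(path, file_name)


print(build_target_path("zip", name="myscript", path="e://myscripts"))
# prints "e://myscripts/myscript.zip"
```

With path set to '/scripts' or 'scripts', the same helper yields '/scripts/myscript.zip' and 'scripts/myscript.zip' respectively.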

The s3 type saves the file to Amazon S3 cloud storage. This storage type accepts the following parameters:

Parameter  Description
key        AWS S3 access key. Mandatory.
secret     AWS S3 secret. Mandatory.
region     AWS S3 region. Mandatory.
bucket     AWS S3 bucket name. Mandatory.
token      AWS S3 token. Optional.
name       Filename without an extension. If not specified, a unique name will be generated.
path       A path to the directory where you want to save the file. If not specified, the file will be saved to the root of the bucket.

# LOAD FILE
- walk:
    to: http://www.td-systems.com/download/pw.zip
    do:
    # FIND THE BLOCK WITH THE BASE64 ENCODED FILE CONTENT
    - find:
        path: file
        do:
        # SCRAPE THE CONTENT
        - parse
        # SWITCH TO THE FILE CONTEXT
        - file:
            do:
            # SAVE FILE TO THE S3 STORAGE (/scripts/myscript.zip)
            - save:
                to: s3
                key: AWSAJJDJJSJDJDJFK
                secret: AWSSERETTDHFJJJDJSKFJFJSJJFJJGKRI
                region: us-east-1
                bucket: mybucket
                name: myscript
                ext: zip
                path: '/scripts'

The yandex type saves the file to Yandex Object Storage. This storage type accepts the following parameters:

Parameter  Description
key        Yandex Object Storage access key. Mandatory.
secret     Yandex Object Storage secret. Mandatory.
region     Yandex Object Storage region. Mandatory.
bucket     Yandex Object Storage bucket name. Mandatory.
token      Yandex Object Storage token. Optional.
name       Filename without an extension. If not specified, a unique name will be generated.
path       A path to the directory where you want to save the file. If not specified, the file will be saved to the root of the bucket.

# LOAD FILE
- walk:
    to: http://www.td-systems.com/download/pw.zip
    do:
    # FIND THE BLOCK WITH THE BASE64 ENCODED FILE CONTENT
    - find:
        path: file
        do:
        # SCRAPE THE CONTENT
        - parse
        # SWITCH TO THE FILE CONTEXT
        - file:
            do:
            # SAVE FILE TO THE YANDEX STORAGE (/scripts/myscript.zip)
            - save:
                to: yandex
                key: AWSAJJDJJSJDJDJFK
                secret: AWSSERETTDHFJJJDJSKFJFJSJJFJJGKRI
                region: ru-central1
                bucket: mybucket
                name: myscript
                ext: zip
                path: '/scripts'

The ftp type saves the file to an FTP server. This storage type accepts the following parameters:

Parameter  Description
host       IP address or hostname of the FTP server. Mandatory.
port       Port of the FTP server. Optional; if omitted, the default port 21 is used.
username   FTP server username. Optional; if omitted, an empty username is used.
password   FTP server password. Optional; if omitted, an empty password is used.
name       Filename without an extension. If not specified, a unique name will be generated.
path       A path to the directory where you want to save the file. If not specified, the file will be saved to the current directory after the user logs in to the FTP server.

# LOAD FILE
- walk:
    to: http://www.td-systems.com/download/pw.zip
    do:
    # FIND THE BLOCK WITH THE BASE64 ENCODED FILE CONTENT
    - find:
        path: file
        do:
        # SCRAPE THE CONTENT
        - parse
        # SWITCH TO THE FILE CONTEXT
        - file:
            do:
            # SAVE FILE TO THE FTP SERVER (scripts/myscript.zip)
            - save:
                to: ftp
                host: ftp.mywebsite.com
                port: 21
                username: mylogin
                password: mypassword
                name: myscript
                ext: zip
                path: 'scripts'
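
Since each storage type has its own mandatory parameters, it can help to check a save configuration before running the digger. The Python sketch below is a hypothetical pre-flight check based only on the parameter tables above; the names MANDATORY and missing_parameters are invented for the example, and it does not reflect how the digger itself validates its configuration.

```python
from typing import Any, Dict, List

# Mandatory parameters per storage type, as listed in the parameter tables.
MANDATORY: Dict[str, List[str]] = {
    "file": [],
    "s3": ["key", "secret", "region", "bucket"],
    "yandex": ["key", "secret", "region", "bucket"],
    "ftp": ["host"],
}


def missing_parameters(save_args: Dict[str, Any]) -> List[str]:
    """Return the mandatory parameters absent from a save configuration."""
    storage = save_args.get("to")
    if storage not in MANDATORY:
        raise ValueError(f"unsupported storage type: {storage!r}")
    # A parameter counts as missing if it is absent or empty.
    return [param for param in MANDATORY[storage] if not save_args.get(param)]


# An s3 configuration that lacks the secret and the region.
print(missing_parameters({"to": "s3", "key": "k", "bucket": "mybucket"}))
# prints ['secret', 'region']
```

For the ftp example above, only host is mandatory, so the check returns an empty list even when port, username, and password are left out.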

Please note that when you run the digger in the cloud and save files to cloud storage or FTP, the transfers count against a bandwidth quota that depends on your subscription plan. For example, on the free plan the quota is 10 megabytes per month. Once the quota is reached, files are no longer saved to cloud storage or FTP. Files are also not saved while your digger is in debug mode.
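
As a rough way to estimate how far a quota goes, the following Python sketch counts how many files fit before the limit is reached. It rests on two assumptions that the text above does not confirm: that a megabyte means 1024 * 1024 bytes, and that a file which would exceed the remaining quota is not saved at all. The function name files_within_quota is made up for the example.

```python
from typing import Iterable

# Assumption: 1 megabyte = 1024 * 1024 bytes.
FREE_PLAN_QUOTA_BYTES = 10 * 1024 * 1024


def files_within_quota(file_sizes: Iterable[int],
                       quota_bytes: int = FREE_PLAN_QUOTA_BYTES) -> int:
    """Count how many files, taken in order, fit before the quota runs out."""
    used = 0
    saved = 0
    for size in file_sizes:
        # Assumption: a file that would exceed the quota is skipped
        # and nothing after it is saved.
        if used + size > quota_bytes:
            break
        used += size
        saved += 1
    return saved


three_mb = 3 * 1024 * 1024
print(files_within_quota([three_mb] * 4))  # prints 3
```

Under these assumptions, four 3 MB archives would not all fit on the free plan: the first three use 9 MB and the fourth would push the total past 10 MB.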

Next, we'll look at features designed to work with news websites and articles.