Other Commands

Reloading Pages

While fetching pages from a site, the server may return an incomplete page, or it may block your proxy, for example when you fetch pages too fast. To handle such situations, we added the page_reload command. It reloads the current page and works in both the block and page contexts.

You might also need to reread the page content from the client. For example, when you use a headless browser as the client, you may have to wait until all scripts on the page finish rendering it. To do this, use the page_reread command.

Usage examples:

- find:
    path: body
    do:
    # CHECK IF PROXY IS BLOCKED
    - parse
    - if:
        # PLEASE NOTE THAT TEXT "request has been blocked" MAY BE DIFFERENT
        # IN YOUR CASE, AS IT DEPENDS ON SOURCE SITE
        # IN SUCH CASES YOU JUST NEED TO USE TEXT YOUR WEBSITE USES
        # TO TELL CLIENT THAT REQUEST HAS BEEN BLOCKED
        match: "request has been blocked"
        do:
        # SWITCHING PROXY
        - proxy_switch
        # RELOADING PAGE
        - page_reload

- walk:
    to: http://somesite.com/page.html
    do:
    # OUR PAGE IS RENDERED WITH JS IN A FEW SECONDS AFTER LOADING, SO LET'S WAIT FOR 5 SEC
    - sleep: 5
    # REREAD PAGE CONTENT
    - page_reread
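
The logic of the first example (check for a block message, switch the proxy, reload) can also be sketched outside the configuration language. The Python below is only a conceptual illustration, not how the page_reload or proxy_switch commands are implemented. The names fetch_with_proxy_rotation, fake_fetch, the proxy labels, and the max_attempts limit are all hypothetical, and the block check is a plain substring test, which is an assumption about how the match is applied.

```python
import itertools
from typing import Callable, Iterator, Optional

# Site-specific text; use whatever your source site returns when it blocks a request.
BLOCK_MARKER = "request has been blocked"


def fetch_with_proxy_rotation(
    fetch: Callable[[str, str], str],
    url: str,
    proxies: Iterator[str],
    max_attempts: int = 5,
) -> Optional[str]:
    """Fetch url, switching proxy and retrying while the block marker appears."""
    proxy = next(proxies)
    for _ in range(max_attempts):
        body = fetch(url, proxy)
        if BLOCK_MARKER not in body:
            return body        # page loaded normally
        proxy = next(proxies)  # analogous to proxy_switch
        # the next loop iteration is analogous to page_reload
    return None                # still blocked after max_attempts tries


def fake_fetch(url: str, proxy: str) -> str:
    # Hypothetical stand-in for a real HTTP client: only "proxy-c" is not blocked.
    if proxy == "proxy-c":
        return "<body>real content</body>"
    return "<body>request has been blocked</body>"


if __name__ == "__main__":
    pool = itertools.cycle(["proxy-a", "proxy-b", "proxy-c"])
    print(fetch_with_proxy_rotation(fake_fetch, "http://somesite.com/page.html", pool))
```

The max_attempts limit in this sketch keeps a persistently blocked page from being retried forever.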
                

In the next chapter, you'll learn how to remove a URL from the cache of loaded pages.