Mikhail Sisin, co-founder of the cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.

New in Diggernaut: expanded functionality to work with Selenium, new static variables, and proxy management

2 min read

New in Diggernaut

Paid subscribers can now set the proxy type used in their diggers.

You can choose between data center, residential, IPv6, and Tor proxy pools. The default pool is always the data center pool, which contains mainly US proxies; as a rule, all proxies in it are fast and reliable. Residential proxies are much slower and less reliable, but they let you collect information from sites that block access from data center proxies. IPv6 proxies can be used with sites that support the protocol (e.g., Instagram, Google, Yandex). Tor is the slowest option, and some websites cannot be reached through it at all; on the other hand, Tor gives access to the darknet, which means you can now scrape the darknet. Users with a free account only have access to the Tor pool.

However, you can always use your own proxies, and now not only HTTP ones: we have added support for SOCKS4 and SOCKS5 proxies. So even free accounts have more proxy options now. More details about the proxy settings can be found in our documentation.
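To illustrate the supported schemes for your own proxies, the list below sketches one proxy of each type. The URL-style prefixes and the exact entry format are assumptions for illustration; consult the proxy section of our documentation for the authoritative format.

```yaml
# Hypothetical proxy list showing the supported schemes
# (addresses are placeholder TEST-NET IPs):
proxies:
  - http://user:password@203.0.113.10:8080    # plain HTTP proxy
  - socks4://203.0.113.11:1080                # SOCKS4, no authentication
  - socks5://user:password@203.0.113.12:1080  # SOCKS5 with authentication
```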

We have added a few new static variables to provide more flexibility when creating digger configurations.

You can use the responseCode variable to handle errors returned by the source website's server. After each request, this variable holds the three-digit HTTP status code of the server response: for example, 200 if everything is OK, 500 if an error occurred on the server, 503 if the service is temporarily unavailable, and so on (you can read more about response codes here). You can then build logical constructs in the configuration to work around errors that occur when a website actively resists scraping or is simply unstable.
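As a minimal sketch, a configuration could branch on responseCode right after a page is fetched. The exact keys of the `if` condition below are an assumption; see the flow-control section of our documentation for the authoritative syntax.

```yaml
# Sketch: branch on the HTTP status of the last request.
- walk:
    to: https://www.example.com/page
    do:
    - if:
        type: string
        value: <%responseCode%>
        match: "200"
        do:
        - parse                                     # page loaded, extract data
        else:
        - register_set: 'got HTTP <%responseCode%>' # e.g. 500 or 503
```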

The filename variable can be useful if you transfer binaries and images using a digger. It is automatically set to the name of the last saved file, and you can write it to a dataset object to associate a specific record with the file.
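For example, after an image has been downloaded by the digger, the saved file name could be attached to the current record like this. The object and field names here are illustrative, not prescribed:

```yaml
# Sketch: store the name of the last saved file in the current object.
- object_field_set:
    object: item
    field: image_file
    value: <%filename%>   # name of the last saved file
- object_save:
    name: item
```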

If you use update mode when saving objects, the object_saved variable may come in handy. It lets you find out whether the object was actually saved and, depending on the outcome, take certain actions. The variable holds either “yes” or “no”.
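A branch on object_saved might look like the sketch below. As above, the `if` condition layout is an assumption; refer to the documentation for the exact syntax.

```yaml
# Sketch: in update mode, check whether the last save actually happened.
- object_save:
    name: item
- if:
    type: string
    value: <%object_saved%>
    match: "yes"
    do:
    - register_set: 'object was saved'
    else:
    - register_set: 'object was not saved'
```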

You can find more information on static variables here.

We have expanded the capabilities for working with Selenium on our Diggernaut platform.

Using the scrollto command, you can scroll the page to a specific element. The element then becomes visible in the browser window, and you can perform various actions on it; for example, you can scroll to the button you want to click.

The execute command lets you run JavaScript snippets to manipulate elements on the web page. For example, you can find a specific element on the page and hide it, or, conversely, show it. This is useful when, say, the page has a sticky header that overlaps part of the page: when you scroll to a button, you cannot click it because it lies behind the header.

Some web pages contain inline frames; you can now use the fetch_content command to access their content. The HTML content of the frame is saved to the register, so you can create a block from the register content and work with it. Refer to our documentation to learn how to work with Selenium.
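Putting the three new commands together, a Selenium-mode snippet could look like the sketch below. The parameter names (`path`, `script`) are assumptions for illustration, and the selectors are hypothetical; consult the Selenium section of our documentation for the authoritative forms.

```yaml
# Sketch of the three new Selenium commands:
- scrollto:
    path: button.load-more      # scroll until the button is visible
- execute:
    # hide a sticky header that would otherwise cover the button
    script: "document.querySelector('.sticky-header').style.display = 'none';"
- fetch_content:
    path: iframe#comments       # the iframe's HTML is saved to the register
```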

