Extracting data from El Paso County Sheriff Blotter
This article is intended for educational purposes and shows how to work with our Excavator application. We assume that you already have the Excavator application installed. If not, please log in to your Diggernaut account, go to the Visual Extractor section, and follow the instructions to install it.
First, let’s open the El Paso Sheriff Blotter website in the Google Chrome browser, click on the Excavator icon, and then click the “Start Excavator” button.
The application will open, and within a few seconds the site will be loaded into it. Once that’s done, we can start working with it.
First, take a look at the CSS Selector option at the top right of the application. There you can turn classes, ids, and attributes on and off. A property will not be used in CSS paths if it is turned off. We usually keep classes turned on, as this gives good results in most cases, but sometimes you may want to turn them off. On this page, each table row has a class such as .even or .odd. If we use classes in CSS selectors, we cannot select all table rows with a single “Find” command; we would need to go through the .even and .odd rows separately, which means duplicating the main logic block. That is not a good approach. To avoid it, we can turn classes off for CSS selectors, and then we will be able to select all table rows with a single “Find” command.
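To illustrate why class-based paths split the rows into two groups, here is a minimal Python sketch. It does not use Excavator or Diggernaut; the markup is a hypothetical, simplified stand-in for the blotter table, and the standard library's ElementTree path syntax is used in place of CSS selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the blotter table.
HTML = """
<table>
  <tr class="even"><td>2016-001</td></tr>
  <tr class="odd"><td>2016-002</td></tr>
  <tr class="even"><td>2016-003</td></tr>
</table>
"""

table = ET.fromstring(HTML)

# With classes in the path, each query matches only part of the rows,
# so the row-handling logic would have to be written twice.
even_rows = table.findall(".//tr[@class='even']")
odd_rows = table.findall(".//tr[@class='odd']")

# Without classes, a single query matches every row.
all_rows = table.findall(".//tr")

print(len(even_rows), len(odd_rows), len(all_rows))
```

The class-free path is less specific, which is exactly what we want here: one selection covers the whole table.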
Next, let’s change how we load the first page. Since we need multiple pages and there is a paginator on the page, it is better to use the links pool and iterate over the URLs in this pool. So instead of walking straight to the first page, let’s first add it to the links pool and then walk over the pool. To do this, we need to drag in the “Add to links” command.
Let’s copy and paste the URL from the “Walk” command into the “Add to links” command, and finally change the “Walk” command mode to “links”.
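Conceptually, a links pool behaves like a queue of URLs that is seeded with the start page and then processed one URL at a time. The following Python sketch shows that idea only; it is not how Diggernaut implements the pool. The `walk_pool` function, the `fake_site` dictionary, the example.com URLs, and the duplicate check are all assumptions made for illustration.

```python
from collections import deque


def walk_pool(start_url, visit):
    """Process every URL in the pool; `visit` may return more URLs to add."""
    pool = deque([start_url])   # like "Add to links": seed the pool
    seen = {start_url}          # assumption: skip URLs already queued
    order = []
    while pool:                 # like "Walk" in "links" mode
        url = pool.popleft()
        order.append(url)
        for new_url in visit(url):
            if new_url not in seen:
                seen.add(new_url)
                pool.append(new_url)
    return order


# Hypothetical site: page 1 links to page 2, page 2 links to nothing.
fake_site = {
    "https://example.com/blotter?page=1": ["https://example.com/blotter?page=2"],
    "https://example.com/blotter?page=2": [],
}

visited = walk_pool("https://example.com/blotter?page=1", lambda url: fake_site[url])
print(visited)
```

Because the loop runs until the pool is empty, any URL pushed while a page is being processed is visited later without extra logic.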
Now let’s click on the first (or any other) cell in the first row of the data table. You will see that only the cell is selected, not the whole row. This is common: very often you cannot select a row by simply clicking on it, because the cells cover the row entirely. To solve this and select the row, you can use the “Select Parent” icon to select the parent of the cell, which is the row. Click on it, and you will see that the whole row is now selected.
As the next step, we need to pull the “Find all” command into the walk block, as we need to select all rows in the table.
If you move the mouse over the “Find” command now, you will see that it selects all rows in the table. You will also see that the “Find” command has a “Parse” block inside. Since we are not going to parse the content of the row as a whole and will instead walk into each cell and get the data separately, we need to delete the “Parse” command from the “Find” block by clicking on the delete icon.
Now let’s turn classes back on for CSS selectors, as they generally help a lot.
We now have logic that loads the URL and then proceeds into each row, so the next step is to create the logic that populates the data object. Each row in the table represents one data object we need, and each cell contains one or more fields of this data object. So the first thing to do after getting into a table row is to create a new data object. To do this, simply drag the “Object” command into the “Find” block.
Then we need to select an option for the “Object” command. Since we need to create a new object, we select “New” and then give our object a name: “incident”.
We are now going to walk into each cell and parse the data for our data fields. First, let’s click on the first cell inside the row we have selected. Then let’s pull in the “Find” command.
If you move the mouse over this new “Find” command, you will see that the first cell of each row is highlighted, which means we did it the right way. You can see that there is a “Parse” command in the “Find” block we just pulled in. This command parses the text from the selected block. To preview the data that “Parse” will extract, click on the “Preview” icon; you can then preview the extracted text for each selected block.
It looks like we extracted the call number properly, so let’s put it into a field of the data structure. We can do this by pulling in the “Object” command, then selecting the “Field set” option and specifying the field name and object name.
Then we can fold this “Find” block by clicking on the “Fold” icon.
The next cell has date and time data, so let’s extract them separately. Click on the date.
Then let’s do the same as we did for the first cell. When it’s done, you should have a second “Find” block with its own “Parse” and “Object” commands.
After that, let’s do the same for the time and for the other cells in the row.
Finally, we need to save our data object by dragging in the “Object” command, selecting the “Save” option, and specifying the object name “incident”.
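The row logic built so far (new object, one field per cell, save) can be sketched in plain Python as follows. This is only an illustration of the flow, not the digger configuration itself. The markup, the `span` elements around date and time, the field names, and the `saved` list standing in for the database are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a call number cell and a combined date/time cell.
ROWS = """
<table>
  <tr><td>2016-001</td><td><span>01/02/2016</span> <span>13:45</span></td></tr>
  <tr><td>2016-002</td><td><span>01/03/2016</span> <span>08:10</span></td></tr>
</table>
"""

saved = []  # stands in for the database

for row in ET.fromstring(ROWS).findall(".//tr"):    # "Find" all rows
    incident = {}                                   # "Object" -> "New"
    cells = row.findall("td")
    # "Find" first cell -> "Parse" -> "Object" -> "Field set"
    incident["call_number"] = cells[0].text.strip()
    # Date and time live in the same cell but go to separate fields.
    date_el, time_el = cells[1].findall("span")
    incident["date"] = date_el.text.strip()
    incident["time"] = time_el.text.strip()
    saved.append(incident)                          # "Object" -> "Save"

print(saved)
```

Creating the object at the top of the loop and saving it at the bottom ensures that each table row produces exactly one record.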
Our main logic block is now done. If we ran it right away, it would push the URL to the links pool and then start iterating over this pool, so the first thing it would do is walk to the first link in the pool. Then the “Find” command finds all table rows and iterates over them. The digger walks into each row, creates a new data object named “incident”, walks into each cell, parses the data, and saves it to the specific field inside our data object; then the object is saved to the DB. We would get a bunch of records in our database, but only the rows from this particular page. What do we need to do to get data from all pages? There is a paginator on the page, and it has a “Next” link.
The easiest way to make the digger jump to the next page is to get the next page link and push it to the links pool. So let’s click on the “Next” link and then drag in the “Find” command.
By default, the “Parse” command extracts text from the selected block, but we need to extract the link (the href attribute). So we need to select “Attribute” and the attribute “href”. To make sure you selected it properly, you can use the “Preview” function.
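The difference between parsing the text of a block and parsing one of its attributes can be seen in a short Python sketch. The link markup and its href value are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical "Next" link from a paginator.
link = ET.fromstring('<a href="/blotter?page=2">Next</a>')

text_value = link.text          # default behaviour: the visible text
href_value = link.get("href")   # attribute mode: the link target

print(text_value, href_value)
```

Only the attribute value is useful as a URL for the links pool; the text is just the label shown to the visitor.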
Finally, you need to pull in the “Add to links” command to push the parsed URL to the pool.
Now, after the digger gets all the data from the page, it checks for the next page link, and if it exists, the digger gets the URL, pushes it to the pool, and then walks to this URL. This way we get data from all pages.
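Putting the pieces together, the pagination flow can be sketched in Python as below. This is a rough model of the behaviour described above, not the digger's actual implementation. The `PAGES` dictionary stands in for loading real pages, and the URLs, the `next` class on the link, and the `scrape_all` function are assumptions.

```python
import xml.etree.ElementTree as ET
from collections import deque
from urllib.parse import urljoin

# Hypothetical two-page site; only the first page has a "Next" link.
PAGES = {
    "https://example.com/blotter?page=1":
        '<div><table><tr><td>A</td></tr></table>'
        '<a class="next" href="/blotter?page=2">Next</a></div>',
    "https://example.com/blotter?page=2":
        '<div><table><tr><td>B</td></tr></table></div>',
}


def scrape_all(start_url):
    pool = deque([start_url])                  # seed the links pool
    records = []
    while pool:
        url = pool.popleft()
        page = ET.fromstring(PAGES[url])       # stands in for loading the page
        for row in page.findall(".//tr"):      # extract data from every row
            records.append(row.find("td").text)
        next_link = page.find(".//a[@class='next']")
        if next_link is not None:              # push the next page, if any
            pool.append(urljoin(url, next_link.get("href")))
    return records


records = scrape_all("https://example.com/blotter?page=1")
print(records)
```

The loop stops on its own when a page has no “Next” link, because nothing new is pushed and the pool runs empty.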
Since we are done, we can click on the “Save” button to save our digger to the Diggernaut account. Select a project, and give a name for the digger and the URL of the site you are scraping. Then click on the “Save” button.
If all went well, you will see a “Success” message.
That’s it. Now you can log in to your Diggernaut account, find the digger you created, and run it, then go to the data section and download the file with the data. The configuration file for the digger can be downloaded here.
Co-founder of cloud based web scraping and data extraction platform Diggernaut