{"id":47,"date":"2016-09-21T12:37:53","date_gmt":"2016-09-21T12:37:53","guid":{"rendered":"https:\/\/blog.diggernaut.com\/?p=47"},"modified":"2019-01-12T21:43:15","modified_gmt":"2019-01-12T21:43:15","slug":"extract-data-from-ical-there-is-nothing-easier","status":"publish","type":"post","link":"https:\/\/www.diggernaut.com\/blog\/extract-data-from-ical-there-is-nothing-easier\/","title":{"rendered":"Extract data from iCal? It couldn&#8217;t be easier."},"content":{"rendered":"<p>Extract data from iCal? It couldn&#8217;t be easier.<\/p>\n<p>Today we will write a script for scraping resources that use files in the iCal format to publish event data.<\/p>\n<p>Apple invented this format, and many websites now let you export calendar events in it. In this case, you do not need to scrape the website and parse the HTML; you only need to fetch and parse a file in iCal format, which makes the whole process much more manageable.<\/p>\n<p><a href=\"https:\/\/www.diggernaut.com\">Diggernaut.com<\/a> natively supports this format and automatically converts it to XML, so we can work with iCal data just as we would with a regular HTML page.<\/p>\n<p>Let&#8217;s see how it works by extracting data from a Science Fiction Conventions calendar I found on the icalshare.com website. Let&#8217;s start writing the config by defining some basic settings. First, we need to set the digger to debug level 2. 
This is the only way we can see the source code of the converted file, and we need to inspect it so we can write the navigation instructions for walking to the blocks with the data we want to extract.<\/p>\n<p>The calendar file we are going to use is: <a href=\"https:\/\/www.google.com\/calendar\/ical\/lirleni%40gmail.com\/public\/basic.ics\">https:\/\/www.google.com\/calendar\/ical\/lirleni%40gmail.com\/public\/basic.ics<\/a><\/p>\n<p>So, our config starts with:<\/p>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">---\nconfig:\n    debug: 2\n    agent: Firefox\ndo:\n- walk:\n    to: https:\/\/www.google.com\/calendar\/ical\/lirleni%40gmail.com\/public\/basic.ics\n    do:\n    - stop<\/code><\/pre>\n<p>In this code we set debug level 2, configure the digger to use Firefox as the browser name, and fetch the iCal file. Now we need to log in to our account at <a href=\"https:\/\/www.diggernaut.com\">Diggernaut.com<\/a>, select an existing project or create a new one, create a new digger, and save the config we wrote to it.<\/p>\n<p>Make sure the digger is set to Debug mode (in the Status column you should see Debug). If it is not, switch the digger to Debug mode using the selector in the Status column. Then start the digger and wait until the run finishes. 
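<\/p>\n<p>While the digger runs, it helps to know what the raw source looks like. An iCal file is plain text made of VEVENT blocks, as defined by the iCalendar specification (RFC 5545). A reconstruction of the first event in this calendar, illustrative rather than byte-exact, looks roughly like this:<\/p>\n<pre><code class=\"language-txt\">BEGIN:VEVENT\nUID:b82f342c-0b54-11de-b762-000d936372a6\nDTSTART;VALUE=DATE:20110128\nDTEND;VALUE=DATE:20110131\nSUMMARY:WinterWar 38\nDESCRIPTION:Gaming Convention\nLOCATION:Champaign IL (USA)\nSTATUS:CONFIRMED\nEND:VEVENT<\/code><\/pre>\n<p>Diggernaut turns each such VEVENT block into an &lt;event&gt; element in the converted XML. 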
When it&#8217;s done, we need to check the logs by clicking the &#8220;Log&#8221; button.<\/p>\n<pre><code class=\"language-html\">&lt;html&gt;\n\n&lt;head&gt;&lt;\/head&gt;\n\n&lt;body&gt;\n    &lt;body_safe&gt;\n        &lt;event&gt;\n            &lt;alarmtime&gt;0s&lt;\/alarmtime&gt;\n            &lt;class&gt;PUBLIC&lt;\/class&gt;\n            &lt;created&gt;2009-03-07 20:15:36 +0000 UTC&lt;\/created&gt;\n            &lt;description&gt;Gaming Convention&lt;\/description&gt;\n            &lt;end&gt;2011-01-31 00:00:00 +0000 UTC&lt;\/end&gt;\n            &lt;id&gt;2a994c3e3b80f5af6e8fa178a3af45d4&lt;\/id&gt;\n            &lt;importedid&gt;b82f342c-0b54-11de-b762-000d936372a6&lt;\/importedid&gt;\n            &lt;location&gt;Champaign IL (USA)&lt;\/location&gt;\n            &lt;modified&gt;2011-02-20 23:29:49 +0000 UTC&lt;\/modified&gt;\n            &lt;rrule&gt;&lt;\/rrule&gt;\n            &lt;sequence&gt;4&lt;\/sequence&gt;\n            &lt;start&gt;2011-01-28 00:00:00 +0000 UTC&lt;\/start&gt;\n            &lt;status&gt;CONFIRMED&lt;\/status&gt;\n            &lt;summary&gt;WinterWar 38&lt;\/summary&gt;\n            &lt;wholedauyevent&gt;true&lt;\/wholedauyevent&gt;\n        &lt;\/event&gt;\n        &lt;event&gt;\n            &lt;alarmtime&gt;0s&lt;\/alarmtime&gt;\n            &lt;class&gt;&lt;\/class&gt;\n            &lt;created&gt;2009-01-10 17:33:11 +0000 UTC&lt;\/created&gt;\n            &lt;description&gt;Gaming convention&lt;\/description&gt;\n            &lt;end&gt;2011-08-08 00:00:00 +0000 UTC&lt;\/end&gt;\n            &lt;id&gt;5c8f16d772ede097822e73a0c2e51c6c&lt;\/id&gt;\n            &lt;importedid&gt;356F0F0C-FE52-47A8-AEAB-8E78F57D4F52&lt;\/importedid&gt;\n            &lt;location&gt;Indianapolis IN&lt;\/location&gt;\n            &lt;modified&gt;2011-02-20 23:29:48 +0000 UTC&lt;\/modified&gt;\n            &lt;rrule&gt;&lt;\/rrule&gt;\n            &lt;sequence&gt;10&lt;\/sequence&gt;\n            &lt;start&gt;2011-08-04 00:00:00 +0000 UTC&lt;\/start&gt;\n 
           &lt;status&gt;CONFIRMED&lt;\/status&gt;\n            &lt;summary&gt;GenCon&lt;\/summary&gt;\n            &lt;wholedauyevent&gt;true&lt;\/wholedauyevent&gt;\n        &lt;\/event&gt; ...\n<\/code><\/pre>\n<p>As you can see, the page structure consists of &lt;event&gt; blocks, so all we need to do is go through these blocks and pick the fields we want from each. Let&#8217;s take one block and reformat it so we can see more clearly which data fields to extract and which filters we might need.<\/p>\n<pre><code class=\"language-html\">&lt;event&gt;\n    &lt;alarmtime&gt;0s&lt;\/alarmtime&gt;\n    &lt;class&gt;PUBLIC&lt;\/class&gt;\n    &lt;created&gt;2009-03-07 20:15:36 +0000 UTC&lt;\/created&gt;\n    &lt;description&gt;Gaming Convention&lt;\/description&gt;\n    &lt;end&gt;2011-01-31 00:00:00 +0000 UTC&lt;\/end&gt;\n    &lt;id&gt;2a994c3e3b80f5af6e8fa178a3af45d4&lt;\/id&gt;\n    &lt;importedid&gt;b82f342c-0b54-11de-b762-000d936372a6&lt;\/importedid&gt;\n    &lt;location&gt;Champaign IL (USA)&lt;\/location&gt;\n    &lt;modified&gt;2011-02-20 23:29:49 +0000 UTC&lt;\/modified&gt;\n    &lt;rrule&gt;&lt;\/rrule&gt;\n    &lt;sequence&gt;4&lt;\/sequence&gt;\n    &lt;start&gt;2011-01-28 00:00:00 +0000 UTC&lt;\/start&gt;\n    &lt;status&gt;CONFIRMED&lt;\/status&gt;\n    &lt;summary&gt;WinterWar 38&lt;\/summary&gt;\n    &lt;wholedauyevent&gt;true&lt;\/wholedauyevent&gt;\n&lt;\/event&gt;\n<\/code><\/pre>\n<p>We are not going to pick all of these fields; let&#8217;s get only the summary, description, start date\/time, end date\/time, and location. 
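<\/p>\n<p>One thing to note before we write the config: the start and end values in the converted XML carry a timezone suffix (+0000 UTC) that we do not want in our dataset, so for these two fields we will parse with a regular expression filter that keeps only the date and time portion. As an isolated fragment, taken from the full config that follows, the construct looks like this:<\/p>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">- find:\n    path: start\n    do:\n    - parse:\n        filter: (\\d{4}-\\d{2}-\\d{2}\\s+\\d{2}:\\d{2}:\\d{2})\n    - object_field_set:\n        object: event\n        field: start_date<\/code><\/pre>\n<p>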
It&#8217;s very easy to do: first we walk to the event block and create a data object; then we walk to each field&#8217;s block, parse the data, and save it to the object&#8217;s fields; finally, we save the data object. Since iCal escapes characters such as commas with a backslash, we also strip backslashes from the text fields with the replace_substring routine.<\/p>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">---\nconfig:\n    agent: Firefox\ndo:\n- walk:\n    to: https:\/\/www.google.com\/calendar\/ical\/lirleni%40gmail.com\/public\/basic.ics\n    do:\n    - find:\n        path: event\n        do:\n        - object_new: event\n        - find:\n            path: summary\n            do:\n            - parse\n            - normalize:\n                routine: replace_substring\n                args:\n                    \\\\: &#039;&#039;\n            - object_field_set:\n                object: event\n                field: summary\n        - find:\n            path: description\n            do:\n            - parse\n            - normalize:\n                routine: replace_substring\n                args:\n                    \\\\: &#039;&#039;\n            - object_field_set:\n                object: event\n                field: description\n        - find:\n            path: location\n            do:\n            - parse\n            - normalize:\n                routine: replace_substring\n                args:\n                    \\\\: &#039;&#039;\n            - object_field_set:\n                object: event\n                field: location\n        - find:\n            path: start\n            do:\n            - parse:\n                filter: (\\d{4}-\\d{2}-\\d{2}\\s+\\d{2}:\\d{2}:\\d{2})\n            - object_field_set:\n                object: event\n                field: start_date\n        - find:\n            path: end\n            do:\n            - parse:\n                filter: (\\d{4}-\\d{2}-\\d{2}\\s+\\d{2}:\\d{2}:\\d{2})\n            - object_field_set:\n                object: event\n                field: end_date\n        - object_save:\n            name: 
event<\/code><\/pre>\n<p>Let&#8217;s put our config into the digger and run it. Once it&#8217;s done, let&#8217;s jump to the Data section and make sure the data we scraped is in good shape. You should see something like:<\/p>\n<pre><code class=\"language-txt\">Item #1 \nstart_date  2011-01-28 00:00:00\nsummary WinterWar 38\ndescription Gaming Convention\nend_date    2011-01-31 00:00:00\nlocation    Champaign IL (USA)\nItem #2 \nstart_date  2011-08-04 00:00:00\nsummary GenCon\ndescription Gaming convention\nend_date    2011-08-08 00:00:00\nlocation    Indianapolis IN\nItem #3 \nstart_date  2011-03-11 00:00:00\nsummary Madicon 20\ndescription Science Fiction, with a large proportion of Gaming\nend_date    2011-03-14 00:00:00\nlocation    Harrisonburg VA, USA\n<\/code><\/pre>\n<p>If the data looks good, let&#8217;s switch our digger to Active mode: in Debug mode you cannot download data, only review a limited sample of it. Let&#8217;s start the digger again and wait for completion, then go to the Data section again and download the data in the format we need. You can <a href=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/09\/digger_589_data.xlsx\">download a sample in XLSX format here<\/a>.<\/p>\n<p>As you can see, it&#8217;s straightforward to work with iCal at <a href=\"https:\/\/www.diggernaut.com\">Diggernaut<\/a>!<\/p>","protected":false},"excerpt":{"rendered":"<p>Extract data from iCal? It couldn&#8217;t be easier. Today we will write a script for scraping resources that use files in the iCal format to publish event data. Apple invented this format, and many websites now let you export calendar events in it. 
In this case, you do not need to scrape the [&hellip;]<\/p>","protected":false},"author":4,"featured_media":50,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9,2],"tags":[],"class_list":["post-47","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-learning-meta-language","category-web-scraping"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/47","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/comments?post=47"}],"version-history":[{"count":11,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/47\/revisions"}],"predecessor-version":[{"id":689,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/47\/revisions\/689"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media\/50"}],"wp:attachment":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media?parent=47"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/categories?post=47"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/tags?post=47"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}