{"id":171,"date":"2016-11-21T15:12:19","date_gmt":"2016-11-21T15:12:19","guid":{"rendered":"https:\/\/blog.diggernaut.com\/?p=171"},"modified":"2019-01-12T18:15:14","modified_gmt":"2019-01-12T18:15:14","slug":"using-json-schema-to-validate-your-data","status":"publish","type":"post","link":"https:\/\/www.diggernaut.com\/blog\/using-json-schema-to-validate-your-data\/","title":{"rendered":"Using JSON schema to validate your data"},"content":{"rendered":"<p>We recently added a couple of neat features that let you work with data more efficiently. One of them is JSON Schema support. JSON Schema is useful in many cases, e.g., when you need to make sure that a digger still works properly and the data you are getting is still in good shape, or when you need only specific records and want to skip the rest. For example, if you are gathering events, you may want only the events that are not canceled or that still have open slots. If the website provides this information, you can easily set rules in a JSON schema to pick only the records you need.<\/p>\n<p>So what is JSON Schema? As <a href=\"http:\/\/json-schema.org\/\">json-schema.org<\/a> states: \u201cJSON Schema is a vocabulary that allows you to annotate and validate JSON documents.\u201d We recommend learning more about it from that site, as this article does not cover JSON Schema syntax and usage. You can quickly learn it and experiment with it in debug mode at <a href=\"https:\/\/www.diggernaut.com\">Diggernaut<\/a> without paying a dime.<\/p>\n<p>So how do you set a JSON schema for a digger? 
First, log in to your Diggernaut account, go to Projects > Diggers, find the digger you need, and click the \u201cConfig\u201d button.<\/p>\n<p><a href=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/11\/jsonschema1.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/11\/jsonschema1.png\" alt=\"jsonschema1\" width=\"1642\" height=\"327\" class=\"alignnone size-full wp-image-172\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/11\/jsonschema1.png 1642w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/11\/jsonschema1-768x153.png 768w\" sizes=\"auto, (max-width: 1642px) 100vw, 1642px\" \/><\/a><\/p>\n<p>This opens the editor panel where you usually enter the digger config. You can see that it now has 2 additional tabs. Click the \u201cValidator\u201d tab.<\/p>\n<p><a href=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/11\/jsonschema2.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/11\/jsonschema2.png\" alt=\"jsonschema2\" width=\"1633\" height=\"676\" class=\"alignnone size-full wp-image-173\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/11\/jsonschema2.png 1633w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/11\/jsonschema2-768x318.png 768w\" sizes=\"auto, (max-width: 1633px) 100vw, 1633px\" \/><\/a><\/p>\n<p>Then paste in your JSON schema and click the \u201cSave\u201d button.<\/p>\n<p><a href=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/11\/jsonschema3.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/11\/jsonschema3.png\" alt=\"jsonschema3\" width=\"1615\" height=\"650\" class=\"alignnone size-full wp-image-175\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/11\/jsonschema3.png 1615w, 
https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/11\/jsonschema3-768x309.png 768w\" sizes=\"auto, (max-width: 1615px) 100vw, 1615px\" \/><\/a><\/p>\n<p>Next time your digger is running, it applies your JSON scheme for data validation. To understand it better, you may want to look into digger config we used for tests:<\/p>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">---\nconfig:\n    debug: 2\ndo:\n  - link_add: &#039;https:\/\/diggernaut.com\/sandbox\/&#039;\n  - walk:\n      to: links\n      do:\n        - sleep: 1\n        - find:\n            path: .result-content\n            do:\n              - variable_clear: name\n              - variable_clear: descr\n              - find:\n                  path: h3\n                  do:\n                    - parse\n                    - variable_set: name\n              - find:\n                  path: p\n                  do:\n                    - parse\n                    - variable_set: descr\n              - find:\n                  path: table\n                  do:\n                    - find:\n                        path: &#039;tbody > tr&#039;\n                        do:\n                          - object_new: item\n                          - variable_get: name\n                          - object_field_set:\n                              object: item\n                              field: name\n                          - variable_get: descr\n                          - object_field_set:\n                              object: item\n                              field: descr\n                          - find:\n                              path: .col2\n                              do:\n                                - parse\n                                - object_field_set:\n                                    object: item\n                                    field: number\n                          - find:\n                              path: .col3\n           
                   do:\n                                - parse\n                                - object_field_set:\n                                    object: item\n                                    field: short_descr\n                          - find:\n                              path: .col4\n                              do:\n                                - parse\n                                - object_field_set:\n                                    object: item\n                                    field: location\n                          - find:\n                              path: .col5\n                              do:\n                                - object_new: date\n                                - find:\n                                    path: &#039; .nowrap:nth-child(1)&#039;\n                                    do:\n                                      - parse\n                                      - object_field_set:\n                                          object: date\n                                          field: start\n                                - find:\n                                    path: &#039; .nowrap:nth-child(2)&#039;\n                                    do:\n                                      - parse\n                                      - object_field_set:\n                                          object: date\n                                          field: end\n                                - object_save:\n                                    name: date\n                                    to: item\n                          - find:\n                              path: .col6\n                              do:\n                                - object_new: time\n                                - find:\n                                    path: &#039; .nowrap:nth-child(1)&#039;\n                                    do:\n                                      - parse\n                                    
  - object_field_set:\n                                          object: time\n                                          field: start\n                                - find:\n                                    path: &#039; .nowrap:nth-child(2)&#039;\n                                    do:\n                                      - parse\n                                      - object_field_set:\n                                          object: time\n                                          field: end\n                                - object_save:\n                                    name: time\n                                    to: item\n                          - find:\n                              path: .col7\n                              do:\n                                - parse\n                                - object_field_set:\n                                    object: item\n                                    field: days\n                          - find:\n                              path: .col8\n                              do:\n                                - parse:\n                                    filter:\n                                      - &quot;\\\\s*\\\\$\\\\s*(\\\\d+)\\\\\/&quot;\n                                      - &quot;\\\\s*\\\\$\\\\s*(\\\\d+)&quot;\n                                - object_field_set:\n                                    object: item\n                                    type: int\n                                    field: member_fee\n                                - parse:\n                                    filter:\n                                      - &quot;\\\\s*\\\\\/\\\\s*\\\\$\\\\s*(\\\\d+)&quot;\n                                      - &quot;\\\\s*\\\\$\\\\s*(\\\\d+)&quot;\n                                - object_field_set:\n                                    object: item\n                                    type: int\n                                    field: non_member_fee\n       
                   - find:\n                              path: .col9\n                              do:\n                                - parse\n                                - object_field_set:\n                                    object: item\n                                    field: ages\n                          - find:\n                              path: .col10\n                              do:\n                                - parse\n                                - object_field_set:\n                                    object: item\n                                    field: is_available\n                          - find:\n                              path: .ajaxLoad.info-icon.tooltips\n                              do:\n                                - parse:\n                                    attr: href\n                                - walk:\n                                    to: value\n                                    do:\n                                      - find:\n                                          path: &#039;tr:nth-of-type(2) td:nth-of-type(2)&#039;\n                                          do:\n                                            - parse\n                                            - object_field_set:\n                                                object: item\n                                                field: gender\n                          - object_save:\n                              name: item\n        - find:\n            path: .next a\n            do:\n              - parse:\n                  attr: href\n              - link_add<\/code><\/pre>\n<p>And JSON scheme we used for it:<\/p>\n<pre><code class=\"language-js\">{\n    &quot;$schema&quot;: &quot;http:\/\/json-schema.org\/draft-04\/schema#&quot;,\n    &quot;title&quot;: &quot;Activities&quot;,\n    &quot;description&quot;: &quot;Park district activities&quot;,\n    &quot;type&quot;: &quot;object&quot;,\n    &quot;properties&quot;: {\n       
 &quot;item&quot;: {\n            &quot;type&quot;: &quot;object&quot;,\n            &quot;properties&quot;: {\n                &quot;number&quot;: {\n                    &quot;description&quot;: &quot;The unique identifier for an activity&quot;,\n                    &quot;type&quot;: &quot;string&quot;\n                },\n                &quot;name&quot;: {\n                    &quot;description&quot;: &quot;Activity name&quot;,\n                    &quot;type&quot;: &quot;string&quot;\n                },\n                &quot;descr&quot;: {\n                    &quot;description&quot;: &quot;Activity description&quot;,\n                    &quot;type&quot;: &quot;string&quot;\n                },\n                &quot;gender&quot;: {\n                    &quot;description&quot;: &quot;Gender specification for an activity&quot;,\n                    &quot;type&quot;: &quot;string&quot;\n                },\n                &quot;short_descr&quot;: {\n                    &quot;description&quot;: &quot;Activity short description&quot;,\n                    &quot;type&quot;: &quot;string&quot;\n                },\n                &quot;ages&quot;: {\n                    &quot;description&quot;: &quot;Allowed ages&quot;,\n                    &quot;type&quot;: &quot;string&quot;\n                },\n                &quot;days&quot;: {\n                    &quot;description&quot;: &quot;Weekdays when activity takes place&quot;,\n                    &quot;type&quot;: &quot;string&quot;\n                },\n                 &quot;member_fee&quot;: {\n                    &quot;description&quot;: &quot;Fee for members&quot;,\n                    &quot;type&quot;: &quot;number&quot;\n                },\n                 &quot;non_member_fee&quot;: {\n                    &quot;description&quot;: &quot;Fee for non-members&quot;,\n                    &quot;type&quot;: &quot;number&quot;\n                },\n                 &quot;is_available&quot;: {\n                    
&quot;description&quot;: &quot;Shows if activity is still available&quot;,\n                    &quot;type&quot;: &quot;string&quot;\n                },\n                 &quot;location&quot;: {\n                    &quot;description&quot;: &quot;Location where activity takes place&quot;,\n                    &quot;type&quot;: &quot;string&quot;\n                },\n                 &quot;dates&quot;: {\n                    &quot;type&quot;: &quot;array&quot;,\n                    &quot;items&quot;: {\n                        &quot;type&quot;: &quot;object&quot;,\n                        &quot;properties&quot;: {\n                             &quot;start&quot;: {\n                                &quot;description&quot;: &quot;Start date for activity session&quot;,\n                                &quot;type&quot;: &quot;string&quot;\n                            },\n                             &quot;end&quot;: {\n                                &quot;description&quot;: &quot;End date for activity session&quot;,\n                                &quot;type&quot;: &quot;string&quot;\n                            }\n                        },\n                        &quot;required&quot;: [&quot;start&quot;,&quot;end&quot;]\n                    },\n                    &quot;minItems&quot;: 1,\n                    &quot;uniqueItems&quot;: true\n                },\n                 &quot;time&quot;: {\n                    &quot;type&quot;: &quot;array&quot;,\n                    &quot;items&quot;: {\n                        &quot;type&quot;: &quot;object&quot;,\n                        &quot;properties&quot;: {\n                             &quot;start&quot;: {\n                                &quot;description&quot;: &quot;Start time for activity event&quot;,\n                                &quot;type&quot;: &quot;string&quot;\n                            },\n                             &quot;end&quot;: {\n                                &quot;description&quot;: 
&quot;End time for activity event&quot;,\n                                &quot;type&quot;: &quot;string&quot;\n                            }\n                        },\n                        &quot;required&quot;: [&quot;start&quot;,&quot;end&quot;]\n                    },\n                    &quot;minItems&quot;: 1,\n                    &quot;uniqueItems&quot;: true\n                }\n           },\n            &quot;required&quot;: [&quot;number&quot;,&quot;name&quot;,&quot;gender&quot;]\n        }\n    },\n    &quot;required&quot;: [&quot;item&quot;]\n\n}\n<\/code><\/pre>","protected":false},"excerpt":{"rendered":"<p>Recently we added a couple of neat functions which let you work with data more efficiently. So one of these functions is JSON schema support. JSON schema can be used in many cases, e.g., if you need to ensure that digger still works appropriately and data you are getting is still in good state, or [&hellip;]<\/p>","protected":false},"author":4,"featured_media":178,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2,23],"tags":[7,20,6,11],"class_list":["post-171","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-web-scraping","category-website-and-api","tag-data","tag-data-extraction","tag-scraping","tag-web-scraping"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/171","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/comments?post=171"}],"version-history":[{"count":4,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/w
p\/v2\/posts\/171\/revisions"}],"predecessor-version":[{"id":671,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/171\/revisions\/671"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media\/178"}],"wp:attachment":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media?parent=171"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/categories?post=171"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/tags?post=171"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}