{"id":366,"date":"2018-02-10T00:32:47","date_gmt":"2018-02-10T00:32:47","guid":{"rendered":"https:\/\/www.diggernaut.com\/blog\/?p=366"},"modified":"2019-01-12T16:38:42","modified_gmt":"2019-01-12T16:38:42","slug":"scraping-fashion-retail-data-machine-learning-purposes-bloomingdales","status":"publish","type":"post","link":"https:\/\/www.diggernaut.com\/blog\/scraping-fashion-retail-data-machine-learning-purposes-bloomingdales\/","title":{"rendered":"Scraping fashion retail data for machine learning purposes from Bloomingdales"},"content":{"rendered":"<p>Bloomingdale\u2019s is a multi-brand store chain founded in April 1872. The chain is currently owned by Macy\u2019s, Inc. Using the scraper below, you can collect a large amount of fashion retail data, including prices and images, from the bloomingdales.com online store. This data can be used for brand research, computer vision, and other machine learning tasks.<\/p>\n<p><strong>Approx number of goods:<\/strong> 350000<br>\n<strong>Approx number of page requests:<\/strong> 350000<br>\n<strong>Recommended subscription plan:<\/strong> Medium<\/p>\n<p><strong>PLEASE NOTE!<\/strong> The number of requests can exceed the number of products because data about variations, images, etc. may have to be scraped from other resources, which requires additional requests. Part of the product data may also be delivered through XHR requests, which further increases the total number of page requests.<\/p>\n<h3>How to use the web scraper to extract data about goods and prices from bloomingdales.com<\/h3>\n<p>To use the web scraper for the Bloomingdale\u2019s website, you need an account with our Diggernaut service. 
Follow these steps:<\/p>\n<ol>\n<li>Use this <a href=\"https:\/\/www.diggernaut.com\/accounts\/signup\/\">registration link<\/a> to open a free account with <a href=\"https:\/\/www.diggernaut.com\">Diggernaut<\/a>.<\/li>\n<li>After registering and confirming your email address, <a href=\"https:\/\/www.diggernaut.com\/accounts\/login\/\">log in to your account<\/a>.<\/li>\n<li>Create a project with any name and description. If you do not know how to do this, please refer to our <a href=\"https:\/\/www.diggernaut.com\/dev\/website-projects-create-new-project.html\">documentation<\/a>.<\/li>\n<li>Switch to the created project and create a digger with any name. If you do not know how to do this, please refer to our <a href=\"https:\/\/www.diggernaut.com\/dev\/website-projects-create-new-digger.html\">documentation<\/a>.<\/li>\n<li>Copy the digger configuration below to the clipboard and paste it into the digger you created. If you do not know how to do this, please refer to our <a href=\"https:\/\/www.diggernaut.com\/dev\/website-projects-digger-config.html\">documentation<\/a>.<\/li>\n<li><strong>PLEASE NOTE!<\/strong> Basic proxy servers may not work with this site, so you may need to use your own proxy servers. Specify your proxy servers in the digger configuration at the place marked by the comment. 
If anything about this step is unclear, please contact us through the <a href=\"https:\/\/helpdesk.diggernaut.com\/\">support system<\/a> or our online chat; we will be glad to help.<\/li>\n<li>Switch the mode of the digger from Debug to Active. If you do not know how to do this, please refer to our <a href=\"https:\/\/www.diggernaut.com\/dev\/website-projects-edit-digger.html\">documentation<\/a>.<\/li>\n<li>Run your digger and wait until it completes. If you do not know how to do this, please refer to our <a href=\"https:\/\/www.diggernaut.com\/dev\/website-projects-run-digger.html\">documentation<\/a>.<\/li>\n<li>Download the scraped dataset in the format you need. If you do not know how to do this, please refer to our <a href=\"https:\/\/www.diggernaut.com\/dev\/website-projects-scraped-data.html\">documentation<\/a>.<\/li>\n<\/ol>\n<p>You can also set up a schedule to run your scraper and collect data regularly.<\/p>\n<h3>Scraping configuration for the digger<\/h3>\n<pre class=\"language-yaml line-numbers\"><code class=\"language-yaml\">---\nconfig:\n    debug: 2\n    agent: Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/61.0.3163.100 Safari\/537.36\n    proxy: #USE YOUR OWN PROXY LIST HERE, READ DOCUMENTATION ON HOW TO USE IT OR CONTACT US\ndo:\n- walk:\n    to: https:\/\/www.bloomingdales.com\/index\n    headers:\n        accept: text\/html,application\/xhtml+xml,application\/xml;q=0.9,image\/webp,image\/apng,*\/*;q=0.8\n        accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7\n        cache-control: no-cache\n    do:\n    - find: \n        path: &#039;#globalFlyouts a&#039;\n        do: \n        - pool_clear: main\n        - parse:\n            attr: href\n            filter:\n                - \\?id=(\\d+)\n        - variable_set: pur\n        - variable_set: \n            field: first\n            value: 1\n        - if:\n            match: (\\d)\n            do:\n            - register_set: 
https:\/\/www.bloomingdales.com\/api\/navigation\/categories\/facet?categoryId=&facet=false&pageIndex=1&bcomNavPPP=undefine\n            - link_add:\n                pool: main\n            - walk:\n                to: links\n                pool: main\n                headers:\n                    accept: text\/html,application\/xhtml+xml,application\/xml;q=0.9,image\/webp,image\/apng,*\/*;q=0.8\n                    accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7\n                    cache-control: no-cache\n                do:\n                - find: \n                    path: productids\n                    do: \n                    - parse\n                    - if:\n                        match: (\\d)\n                        do:\n                        - register_set: https:\/\/www.bloomingdales.com\/shop\/product\/?ID=&CategoryID=\n                        - link_add:\n                            pool: sub\n                - find: \n                    path: productcount\n                    do: \n                    - parse\n                    - if:\n                        match: (\\d)\n                        do:\n                        - variable_set: count\n                        - variable_get: first\n                        - if:\n                            match: (\\d+)\n                            do:\n                            - variable_clear: first\n                            - eval:\n                                routine: js\n                                body: (function () {var pages = []; for (var i=2; i*90 ; i++) {pages.push(i)}; return pages.join(&quot;,&quot;);})();\n                            - to_block\n                            - split:\n                                context: text\n                                delimiter: &quot;,&quot;\n                            - find: \n                                path: .splitted \n                                do: \n                                - parse\n            
                    - register_set: https:\/\/www.bloomingdales.com\/api\/navigation\/categories\/facet?categoryId=&facet=false&pageIndex=&bcomNavPPP=undefine\n                                - link_add:\n                                    pool: main\n- walk:\n    to: links\n    headers:\n        accept: text\/html,application\/xhtml+xml,application\/xml;q=0.9,image\/webp,image\/apng,*\/*;q=0.8\n        accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7\n        cache-control: no-cache\n    pool: sub\n    do:\n    - proxy_switch\n    - cookie_reset\n    - variable_clear: allli\n    - variable_clear: descr\n    - variable_clear: n\n    - object_new: product\n    - find: \n        in: doc\n        path: head \n        do: \n        - eval:\n            routine: js\n            body: &#039;(function (){var d = new Date(); return d.toISOString()})();&#039;\n        - object_field_set:\n            object: product\n            field: date\n        - static_get: url\n        - filter:\n            args:\n             - (.+\\?[idID]+=\\d+)\\&\n        - object_field_set:\n            object: product\n            field: url\n        - register_set: Bloomingdale\n        - object_field_set:\n            object: product\n            field: brand\n    - find: \n        path: &#039;#productId&#039; \n        do: \n        - parse:\n            attr: value\n        - if:\n            match: (\\d)\n            do:\n            - object_field_set:\n                object: product\n                field: sku\n    - find: \n        path: &#039;#brandNameLink&#039; \n        do: \n        - parse\n        - space_dedupe\n        - trim\n        - object_field_set:\n            object: product\n            field: brand\n    - find: \n        path: &#039;#productName, #productTitle&#039; \n        do: \n        - variable_get: n\n        - if:\n            match: (\\d)\n            else:\n            - parse\n            - space_dedupe\n            - trim\n            - 
object_field_set:\n                object: product\n                field: name\n            - variable_set: \n                field: n\n                value: 1\n    - find: \n        path: .selectedFOB\n        do: \n        - parse\n        - space_dedupe\n        - trim\n        - normalize:\n            routine: lower\n        - object_field_set:\n            object: product\n            field: category\n            joinby: &quot;|&quot;\n    - find: \n        path: &#039;script#pdp_data&#039; \n        do: \n        - parse\n        - normalize:\n            routine: json2xml\n        - to_block\n        - find: \n            path: colorwayadditionalimages > *, colorwayprimaryimages > *, additionalimages, imagesource\n            do: \n            - parse\n            - split:\n                context: text\n                delimiter: &#039;,&#039;\n            - find: \n                path: .splitted \n                do: \n                - parse\n                - if:\n                    match: (\\S)\n                    do:\n                    - register_set: https:\/\/images.bloomingdales.com\/is\/image\/BLM\/products\/\n                    - object_field_set:\n                        object: product\n                        field: images\n                        joinby: &quot;|&quot;\n        - find: \n            path: colorfamily > * \n            do: \n            - parse\n            - if:\n                match: (\\S)\n                do:\n                - object_field_set:\n                    object: product\n                    field: variations\n                    joinby: &quot;|&quot;\n        - find: \n            path: product > seokeywords\n            slice: 0:-2\n            do: \n            - parse\n            - space_dedupe\n            - trim\n            - normalize:\n                routine: lower\n            - if:\n                match: (\\S)\n                do:\n                - object_field_set:\n                    
object: product\n                    field: category\n                    joinby: &quot;|&quot;\n        - find: \n            path: longdescription \n            do: \n            - parse\n            - space_dedupe\n            - trim\n            - if:\n                match: (\\S)\n                do:\n                - object_field_set:\n                    object: product\n                    field: description\n        - find: \n            path: product > price \n            do: \n            - parse\n            - space_dedupe\n            - trim\n            - if:\n                match: (\\d)\n                do:\n                - object_field_set:\n                    object: product\n                    field: price\n                    type: float\n                - register_set: USD\n                - object_field_set:\n                    object: product\n                    field: currency\n    - object_save:\n        name: product<\/code><\/pre>\n<h3>Sample of scraped data<\/h3>\n<p>Below is a sample of a dataset with several products in JSON format (so you can easily review it and see data structure). The dataset can be downloaded as CSV, XLSX, XML, or any other text format using the templates.<\/p>\n<pre><code class=\"language-js\">[{\n    &quot;product&quot;: {\n        &quot;brand&quot;: &quot;Michael Aram&quot;,\n        &quot;category&quot;: &quot;home|#homegoals&quot;,\n        &quot;currency&quot;: &quot;USD&quot;,\n        &quot;date&quot;: &quot;2017-12-07T21:43:20.868Z&quot;,\n        &quot;description&quot;: &quot;In the designer&#039;s own words, the Molten collection is distinguished by \\&quot;streamlined, timeless shapes... objects which reverberate with the skill of their maker and yet do not fit into a traditional interpretation of craft. 
The pieces possess a soulfulness and organic energy only possible through the handmade process.\\&quot;&quot;,\n        &quot;images&quot;: &quot;https:\/\/images.bloomingdales.com\/is\/image\/BLM\/products\/5\/optimized\/8722225_fpx.tif|https:\/\/images.bloomingdales.com\/is\/image\/BLM\/products\/5\/optimized\/8722225_fpx.tif&quot;,\n        &quot;name&quot;: &quot;Michael Aram Molten 5-Piece Place Setting&quot;,\n        &quot;price&quot;: 80,\n        &quot;sku&quot;: &quot;1197985&quot;,\n        &quot;url&quot;: &quot;https:\/\/www.bloomingdales.com\/shop\/product\/michael-aram-molten-5-piece-place-setting?ID=1197985&quot;,\n        &quot;variations&quot;: &quot;Silver&quot;\n    }\n}\n,{\n    &quot;product&quot;: {\n        &quot;brand&quot;: &quot;Lagostina&quot;,\n        &quot;category&quot;: &quot;home|#homegoals&quot;,\n        &quot;currency&quot;: &quot;USD&quot;,\n        &quot;date&quot;: &quot;2017-12-07T21:43:24.007Z&quot;,\n        &quot;description&quot;: &quot;Showcasing Lagostina&#039;s core values of impeccable Italian craftsmanship, technical innovation and elegant design, this ultrastrong grill pan triple-wall construction and a sturdy grooved surface for perfect grilling and searing.&quot;,\n        &quot;images&quot;: &quot;https:\/\/images.bloomingdales.com\/is\/image\/BLM\/products\/6\/optimized\/8694046_fpx.tif|https:\/\/images.bloomingdales.com\/is\/image\/BLM\/products\/6\/optimized\/8694046_fpx.tif&quot;,\n        &quot;name&quot;: &quot;Lagostina Accademia Bistecchiera 11\\&quot; Grill Pan&quot;,\n        &quot;price&quot;: 180,\n        &quot;sku&quot;: &quot;1205514&quot;,\n        &quot;url&quot;: &quot;https:\/\/www.bloomingdales.com\/shop\/product\/lagostina-accademia-bistecchiera-11-grill-pan?ID=1205514&quot;\n    }\n}\n,{\n    &quot;product&quot;: {\n        &quot;brand&quot;: &quot;Iittala&quot;,\n        &quot;category&quot;: &quot;home|#homegoals&quot;,\n        &quot;currency&quot;: &quot;USD&quot;,\n        
&quot;date&quot;: &quot;2017-12-07T21:43:26.117Z&quot;,\n        &quot;description&quot;: &quot;Designed by Kaj Franck for Iittala, the Kartio carafe is a perfect balance of pure material and simple geometric form. Stripped of the superfluous, it is clean and timeless.&quot;,\n        &quot;images&quot;: &quot;https:\/\/images.bloomingdales.com\/is\/image\/BLM\/products\/5\/optimized\/1232615_fpx.tif|https:\/\/images.bloomingdales.com\/is\/image\/BLM\/products\/5\/optimized\/1232615_fpx.tif&quot;,\n        &quot;name&quot;: &quot;Iittala Kartio Carafe\/Pitcher, 1 quart&quot;,\n        &quot;price&quot;: 100,\n        &quot;sku&quot;: &quot;1239359&quot;,\n        &quot;url&quot;: &quot;https:\/\/www.bloomingdales.com\/shop\/product\/iittala-kartio-carafe-pitcher-1-quart?ID=1239359&quot;,\n        &quot;variations&quot;: &quot;White&quot;\n    }\n}]\n<\/code><\/pre>","protected":false},"excerpt":{"rendered":"<p>Bloomingdale\u2019s is a multi-brand store chain founded in April 1872. The chain is currently owned by Macy\u2019s, Inc. Using the scraper below, you can collect a large amount of fashion retail data, including prices and images, from the bloomingdales.com online store. 
This data can be used for brand research, computer vision and any other machine [&hellip;]<\/p>","protected":false},"author":4,"featured_media":368,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[31,30,2],"tags":[],"class_list":["post-366","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ecommerce-scraping","category-free-scrapers","category-web-scraping"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/366","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/comments?post=366"}],"version-history":[{"count":3,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/366\/revisions"}],"predecessor-version":[{"id":649,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/366\/revisions\/649"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media\/368"}],"wp:attachment":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media?parent=366"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/categories?post=366"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/tags?post=366"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}