Scraping Anthropologie.com and collecting product data

Anthropologie is an American clothing retailer. The company currently operates more than 200 stores around the world and offers a carefully selected assortment of clothing, jewelry, underwear, home furnishings and decor, beauty products, and gifts. In August 1992, Richard Hayne came up with the idea of opening a clothing store for creative, educated women aged 30 to 45, and that is how Anthropologie began. Scraping anthropologie.com with Diggernaut is an easy process: you can use the web scraper provided here to collect product and price data from the online store.

Approx number of goods: 50000
Approx number of page requests: 50000
Recommended subscription plan: Small

PLEASE NOTE! The number of requests can exceed the number of products, because data about variations, images, etc. may be scraped from other resources, which requires additional requests. Also, part of the product data may be delivered through XHR requests, which further increases the total number of page requests.
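
As a rough illustration of why requests can outnumber products, the sketch below adds listing pages and optional extra requests to the one-page-per-product baseline. The function name, the page size of 100 products per listing page, and the extra-request count are hypothetical values chosen for the example, not figures measured on anthropologie.com.

```python
import math


def estimate_requests(products, per_listing_page=100, extra_requests_per_product=0):
    """Rough estimate: listing pages + one page per product + extras.

    All parameters other than `products` are illustrative assumptions.
    """
    listing_pages = math.ceil(products / per_listing_page)
    product_pages = products
    extra = products * extra_requests_per_product
    return listing_pages + product_pages + extra


# Baseline: product pages plus the listing pages needed to discover them.
baseline = estimate_requests(50000)

# Same catalog if every product needed one additional XHR request.
with_xhr = estimate_requests(50000, extra_requests_per_product=1)

print(baseline, with_xhr)
```

Even in the baseline case the total is slightly above the product count, which is why the figures above are approximate.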

How to use the web scraper to extract data about goods and prices from anthropologie.com

To use the web scraper for the Anthropologie store's website, you need an account with our Diggernaut service. Follow these steps:

  1. Follow this registration link to open a free Diggernaut account.
  2. After registering and confirming your email address, log in to your account.
  3. Create a project with any name and description. If you do not know how to do this, please refer to our documentation.
  4. Switch to the created project and create a digger with any name. If you do not know how to do this, please refer to our documentation.
  5. Copy the digger configuration from this page to the clipboard and paste it into the digger you created. If you do not know how to do this, please refer to our documentation.
  6. Switch the digger's mode from Debug to Active. If you do not know how to do this, please refer to our documentation.
  7. Run your digger and wait until it completes. If you do not know how to do this, please refer to our documentation.
  8. Download the scraped dataset in the format you need. If you do not know how to do this, please refer to our documentation.

You can also set up a schedule to run your scraper and collect data regularly.

Scraping configuration for the digger

---
config:
    debug: 2
    agent: Firefox
do:
- walk:
    to: https://www.anthropologie.com
    do:
    - find: 
        path: .c-main-navigation__li--level-1 
        do: 
        - find: 
            path: span
            slice: 0
            do: 
            - parse
            - space_dedupe
            - trim
            - normalize:
                routine: lower
            - variable_set: cat1
        - find: 
            path: .c-main-navigation__li--level-2 
            do: 
            - variable_clear: subcat
            - find: 
                path: .c-main-navigation__a--level-2
                do: 
                - parse
                - space_dedupe
                - trim
                - normalize:
                    routine: lower
                - variable_set: cat2
            - find: 
                path: .c-main-navigation__li--level-3 a 
                do: 
                - parse
                - space_dedupe
                - trim
                - normalize:
                    routine: lower
                - variable_set: cat3
                - variable_set:
                    field: subcat
                    value: 1
                - parse:
                    attr: href
                - pool_clear: main
                - link_add:
                    pool: main
                - walk:
                    to: links
                    pool: main
                    do:
                    - find: 
                        path: .js-pagination__arrow--next
                        slice: 0
                        do: 
                        - parse:
                            attr: href
                        - link_add:
                            pool: main
                    - find: 
                        path: .c-product-tile__image-link 
                        do: 
                        - parse:
                            attr: href
                            filter:
                                - (.+)\?
                                - (.+)
                        - normalize:
                            routine: url
                        - walk:
                            to: value
                            do:
                            - find: 
                                path: body
                                do: 
                                - object_new: product
                                - eval:
                                    routine: js
                                    body: '(function (){var d = new Date(); return d.toISOString()})();'
                                - object_field_set:
                                    object: product
                                    field: date
                                - register_set: Anthropologie
                                - object_field_set:
                                    object: product
                                    field: brand
                                - static_get: url
                                - object_field_set:
                                    object: product
                                    field: url
                                - find: 
                                    path: meta[> img.c-product-image 
                                    do: 
                                    - parse:
                                        attr: src
                                        filter:
                                            - (.+)\?
                                            - (.+)
                                    - normalize:
                                        routine: url
                                    - object_field_set:
                                        object: product
                                        field: images
                                        joinby: "|"
                                    
                                - find: 
                                    path: script:matches(window\.productData) 
                                    do: 
                                    - parse:
                                        filter:
                                            - window.productData\s*=\s*\'\s*(.+)\s*\'\s*;
                                    - normalize:
                                        routine: Base64ZLIBDecode
                                    - normalize:
                                        routine: json2xml
                                    - to_block
                                    - find: 
                                        path: body_safe 
                                        do: 
                                        - find: 
                                            path: primaryslice:hasChild(displaylabel:matches(Color)) 
                                            do: 
                                            - find: 
                                                path: sliceitems > displayname
                                                do: 
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: variations
                                                    joinby: "|"
                                            - find: 
                                                path: sliceitems
                                                do: 
                                                - variable_clear: iid
                                                
                                                - find: 
                                                    path:  id
                                                    slice: 0
                                                    do: 
                                                    - parse
                                                    - variable_set: iid
                                                - find: 
                                                    path: images
                                                    do: 
                                                    - parse
                                                    - register_set: http://images.anthropologie.com/is/image/Anthropologie/_
                                                    - object_field_set:
                                                        object: product
                                                        field: images
                                                        joinby: "|"
                                                    
                                                
                                        - find: 
                                            path: product > stylenumber
                                            slice: 0
                                            do: 
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: sku
                                        - find: 
                                            path: product > product > brand
                                            do: 
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: brand
                                        - find: 
                                            path: product > product > displayname
                                            do: 
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: name
                                        - find: 
                                            path: product > product > longdescription 
                                            do: 
                                            - parse
                                            - space_dedupe
                                            - trim
                                            - object_field_set:
                                                object: product
                                                field: description
                                - variable_get: cat1
                                - if:
                                    match: (\S)
                                    do:
                                    - object_field_set:
                                        object: product
                                        field: category
                                        joinby: "|"
                                - variable_get: cat2
                                - if:
                                    match: (\S)
                                    do:
                                    - object_field_set:
                                        object: product
                                        field: category
                                        joinby: "|"
                                - variable_get: cat3
                                - if:
                                    match: (\S)
                                    do:
                                    - object_field_set:
                                        object: product
                                        field: category
                                        joinby: "|"
                                - object_save:
                                    name: product
            - variable_get: subcat
            - if:
                match: (1)
                else:
                - find: 
                    path: .c-main-navigation__a--level-2
                    do: 
                    - parse:
                        attr: href
                    - pool_clear: main
                    - link_add:
                        pool: main
                    - walk:
                        to: links
                        pool: main
                        do:
                        - find: 
                            path: .js-pagination__arrow--next
                            slice: 0
                            do: 
                            - parse:
                                attr: href
                            - link_add:
                                pool: main
                        - find: 
                            path: .c-product-tile__image-link 
                            do: 
                            - parse:
                                attr: href
                                filter:
                                    - (.+)\?
                                    - (.+)
                            - normalize:
                                routine: url
                            - walk:
                                to: value
                                do:
                                - find: 
                                    path: body
                                    do: 
                                    - object_new: product
                                    - eval:
                                        routine: js
                                        body: '(function (){var d = new Date(); return d.toISOString()})();'
                                    - object_field_set:
                                        object: product
                                        field: date
                                    - register_set: Anthropologie
                                    - object_field_set:
                                        object: product
                                        field: brand
                                    - static_get: url
                                    - object_field_set:
                                        object: product
                                        field: url
                                    - find: 
                                        path: meta[> img.c-product-image 
                                        do: 
                                        - parse:
                                            attr: src
                                            filter:
                                                - (.+)\?
                                                - (.+)
                                        - normalize:
                                            routine: url
                                        - object_field_set:
                                            object: product
                                            field: images
                                            joinby: "|"
                                    - find: 
                                        path: script:matches(window\.productData) 
                                        do: 
                                        - parse:
                                            filter:
                                                - window.productData\s*=\s*\'\s*(.+)\s*\'\s*;
                                        - normalize:
                                            routine: Base64ZLIBDecode
                                        - normalize:
                                            routine: json2xml
                                        - to_block
                                        - find: 
                                            path: body_safe 
                                            do: 
                                            - find: 
                                                path: primaryslice:hasChild(displaylabel:matches(Color)) 
                                                do: 
                                                - find: 
                                                    path: sliceitems > displayname
                                                    do: 
                                                    - parse
                                                    - space_dedupe
                                                    - trim
                                                    - object_field_set:
                                                        object: product
                                                        field: variations
                                                        joinby: "|"
                                                - find: 
                                                    path: sliceitems
                                                    do: 
                                                    - variable_clear: iid
                                                    - find: 
                                                        path:  id
                                                        slice: 0
                                                        do: 
                                                        - parse
                                                        - variable_set: iid
                                                    - find: 
                                                        path: images
                                                        do: 
                                                        - parse
                                                        - register_set: http://images.anthropologie.com/is/image/Anthropologie/_
                                                        - object_field_set:
                                                            object: product
                                                            field: images
                                                            joinby: "|"
                                            - find: 
                                                path: product > stylenumber
                                                slice: 0
                                                do: 
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: sku
                                            - find: 
                                                path: product > product > brand
                                                do: 
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: brand
                                            - find: 
                                                path: product > product > displayname
                                                do: 
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: name
                                            - find: 
                                                path: product > product > longdescription 
                                                do: 
                                                - parse
                                                - space_dedupe
                                                - trim
                                                - object_field_set:
                                                    object: product
                                                    field: description
                                    - variable_get: cat1
                                    - if:
                                        match: (\S)
                                        do:
                                        - object_field_set:
                                            object: product
                                            field: category
                                            joinby: "|"
                                    - variable_get: cat2
                                    - if:
                                        match: (\S)
                                        do:
                                        - object_field_set:
                                            object: product
                                            field: category
                                            joinby: "|"
                                    - variable_get: cat3
                                    - if:
                                        match: (\S)
                                        do:
                                        - object_field_set:
                                            object: product
                                            field: category
                                            joinby: "|"
                                    - object_save:
                                        name: product
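
The configuration above reads the product details from a script that assigns window.productData, then applies the Base64ZLIBDecode routine before converting the JSON to XML. If you want to check that step outside of Diggernaut, a minimal Python sketch could look like the one below. It assumes the embedded string is Base64 text wrapping zlib-compressed JSON, which is what the routine name suggests; the function name and the synthetic payload are made up for the example, and the live site may use a different format.

```python
import base64
import json
import re
import zlib

# Assumed page format (not verified against the live site):
#   window.productData = '<Base64 of zlib-compressed JSON>';
PRODUCT_DATA_RE = re.compile(r"window\.productData\s*=\s*'\s*(.+?)\s*'\s*;")


def decode_product_data(html):
    """Return the decoded productData object, or None if it is not present."""
    match = PRODUCT_DATA_RE.search(html)
    if match is None:
        return None
    compressed = base64.b64decode(match.group(1))
    return json.loads(zlib.decompress(compressed))


# Synthetic round trip: build a payload the same way, then decode it.
sample = {"product": {"styleNumber": "44448363"}}
payload = base64.b64encode(
    zlib.compress(json.dumps(sample).encode("utf-8"))
).decode("ascii")
html = "<script>window.productData = '" + payload + "';</script>"

decoded = decode_product_data(html)
print(decoded)
```

The example only proves that the decoding chain is consistent with itself. Against a real page, a decoding error at the zlib step would indicate that the assumption about the payload format does not hold.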

Sample of scraped data

Below is a sample of a dataset with several products in JSON format, so you can easily review it and see the data structure. The dataset can also be downloaded as CSV, XLSX, XML, or any other text format using templates.

[{
    "product": {
        "brand": "Illume",
        "category": "gifts|features|the gift guide",
        "date": "2017-12-05T21:15:58.241Z",
        "description": "New from the fragrance masters at Illume, Anatomy of a Fragrance bath and beauty products are sophisticated, lighthearted luxuries. Each is crafted in Minnesota, where Illume combines their signature scents with beautiful packaging designed in-house. From lavish hand creams to triple-milled soaps to nature-inspired perfumes, their line is ready-made for gifting and indulging. **Honey Rose**: a warm, romantic scent with notes of lily of the valley, sandalwood and bergamot **Orchid Vanille**: a bright, fresh combination of orange blossom, jasmine, black currant and praline **Wildflower Bergamot**: A zesty blend of bergamot, lemon and mango layered with cedar and sandalwood",
        "images": "https://images.anthropologie.com/is/image/Anthropologie/44448363_040_b|http://images.anthropologie.com/is/image/Anthropologie/44448363_040_b|http://images.anthropologie.com/is/image/Anthropologie/44448363_070_b|http://images.anthropologie.com/is/image/Anthropologie/44448363_065_b",
        "name": "Anatomy of a Fragrance Gift Set",
        "sku": "44448363",
        "url": "https://www.anthropologie.com/shop/anatomy-of-a-fragrance-gift-set",
        "variations": "Wildflower Bergamot|Orchid Vanille|Honey Rose"
    }
}
,{
    "product": {
        "brand": "Capri Blue",
        "category": "gifts|features|the gift guide",
        "date": "2017-12-05T21:15:59.713Z",
        "description": "Capri Blue's iconic vessels and fragrances - proudly designed and poured in Mississippi - are a long-standing favorite at Anthropologie. The line pairs striking visuals with intoxicating scents to create beautifully aromatic products like soy-blended candles and vegan-formulated beauty care. **Volcano**: tropical fruits, sugared oranges, lemons and limes, redolent with lightly exotic mountain greens **Coastal**: notes of pineapple, verbena and coconut, accented by sparkling lemon, bergamot and grapefruit **Fir & Firewood**: a fruity, green aroma of apple, clove, fir, pine needle, white birch, cedar, vetiver and musk **Japanese Quince & Cedar**: aromatic cedar wood is embellished with sun-ripened cassis, sugared quince, accents of red currant and a splash of sparkling pomelo **Gardenia & Fig**: bright greens and fresh peach mingle with gardenia, rose, ylang ylang and coconut over a base of light musk **Cinnamon Toddy**: a mouthwatering medley of ripe apple, warm cinnamon, golden clove and grated nutmeg topped with notes of honey and maple **Spiced Cider**: nutmeg, clove and cinnamon are layered over fresh apple and juicy orange notes **Lagoon**: top notes of freesia, incense and tamarind blend over a musky base of cashmere, wood and vetiver **Grapefruit Neroli**: sun-kissed grapefruit, quince and tangerine over neroli, vanilla, orchid and currant",
        "images": "https://images.anthropologie.com/is/image/Anthropologie/19851559_033_b|https://images.anthropologie.com/is/image/Anthropologie/19851559_033_b10|http://images.anthropologie.com/is/image/Anthropologie/19851559_033_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_033_b10|http://images.anthropologie.com/is/image/Anthropologie/19851559_090_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_090_b10|http://images.anthropologie.com/is/image/Anthropologie/19851559_090_b15|http://images.anthropologie.com/is/image/Anthropologie/19851559_090_b16|http://images.anthropologie.com/is/image/Anthropologie/19851559_049_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_026_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_098_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_040_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_007_b|http://images.anthropologie.com/is/image/Anthropologie/19851559_007_b2",
        "name": "Capri Blue Iridescent Jar Candle",
        "sku": "19851559",
        "url": "https://www.anthropologie.com/shop/capri-blue-iridescent-jar-candle8",
        "variations": "Fir and Firewood|Spiced Cider|Volcano|Spiced Cider|Fir and Firewood|Volcano|Volcano"
    }
}
,{
    "product": {
        "brand": "Anthropologie",
        "category": "gifts|features|the gift guide",
        "date": "2017-12-05T21:16:00.340Z",
        "images": "https://images.anthropologie.com/is/image/Anthropologie/39336862_001_b3|https://images.anthropologie.com/is/image/Anthropologie/39336862_001_b|https://images.anthropologie.com/is/image/Anthropologie/39336862_001_b2|https://images.anthropologie.com/is/image/Anthropologie/39336862_001_b14|http://images.anthropologie.com/is/image/Anthropologie/39336862_001_b3|http://images.anthropologie.com/is/image/Anthropologie/39336862_001_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_001_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_001_b14|http://images.anthropologie.com/is/image/Anthropologie/39336862_074_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_074_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_074_b3|http://images.anthropologie.com/is/image/Anthropologie/39336862_074_b14|http://images.anthropologie.com/is/image/Anthropologie/39336862_010_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_010_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_010_b15|http://images.anthropologie.com/is/image/Anthropologie/39336862_030_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_030_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_030_b15|http://images.anthropologie.com/is/image/Anthropologie/39336862_040_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_040_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_040_b3|http://images.anthropologie.com/is/image/Anthropologie/39336862_040_b14|http://images.anthropologie.com/is/image/Anthropologie/39336862_065_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_065_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_065_b3|http://images.anthropologie.com/is/image/Anthropologie/39336862_051_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_051_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_051_b10|http://images.anthropologie.com/is/image/Anthropologie/39336862_066_b|http://images.anthropologie.com/is/image/Anthropologie/39336862_066_b2|http://images.anthropologie.com/is/image/Anthropologie/39336862_066_b10",
        "name": "Slivered Geode Coaster",
        "sku": "39336862",
        "url": "https://www.anthropologie.com/shop/geode-coaster",
        "variations": "Black Quartz|Dyed Citron|White Quartz|Adventurian|Dyed Blue|Dyed Magenta|Amethyst|Rose quartz"
    }
}
,{
    "product": {
        "brand": "Floreat",
        "category": "gifts|features|the gift guide",
        "date": "2017-12-05T21:16:01.211Z",
        "images": "https://images.anthropologie.com/is/image/Anthropologie/43663541_000_b|https://images.anthropologie.com/is/image/Anthropologie/43663541_000_b2|https://images.anthropologie.com/is/image/Anthropologie/43663541_000_b3|https://images.anthropologie.com/is/image/Anthropologie/43663541_000_b4|http://images.anthropologie.com/is/image/Anthropologie/43663541_000_b|http://images.anthropologie.com/is/image/Anthropologie/43663541_000_b2|http://images.anthropologie.com/is/image/Anthropologie/43663541_000_b3|http://images.anthropologie.com/is/image/Anthropologie/43663541_000_b4|http://images.anthropologie.com/is/image/Anthropologie/43663541_049_b|http://images.anthropologie.com/is/image/Anthropologie/43663541_049_b2|http://images.anthropologie.com/is/image/Anthropologie/43663541_049_b3|http://images.anthropologie.com/is/image/Anthropologie/43663541_049_b4",
        "name": "Floreat Printed Sleep Pants",
        "sku": "43663541",
        "url": "https://www.anthropologie.com/shop/floreat-printed-sleep-pants",
        "variations": "ASSORTED|BLUE MOTIF"
    }
}]
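
In the sample above, the category, images, and variations fields are joined with "|", and some values repeat (the same image appears with both http and https, and a variation name can occur more than once). If you post-process the downloaded JSON yourself, a small sketch like the following may help. The helper names are hypothetical, the record is abridged from the sample, and treating the http and https forms of a URL as the same image is an assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def split_field(value):
    """Split a pipe-joined field into a list, skipping empty parts."""
    return [part for part in value.split("|") if part]


def unique(items):
    """Drop duplicates while keeping the first-seen order."""
    return list(dict.fromkeys(items))


def force_https(url):
    """Rewrite the scheme to https (assumes both schemes serve the same image)."""
    parts = urlsplit(url)
    return urlunsplit(("https",) + tuple(parts[1:]))


def clean_product(record):
    """Return a copy of record['product'] with list-valued, de-duplicated fields."""
    product = dict(record["product"])
    product["category"] = split_field(product.get("category", ""))
    product["variations"] = unique(split_field(product.get("variations", "")))
    product["images"] = unique(
        force_https(url) for url in split_field(product.get("images", ""))
    )
    return product


# Record abridged from the sample dataset (image list shortened).
record = {
    "product": {
        "brand": "Capri Blue",
        "category": "gifts|features|the gift guide",
        "images": (
            "https://images.anthropologie.com/is/image/Anthropologie/19851559_033_b"
            "|https://images.anthropologie.com/is/image/Anthropologie/19851559_033_b10"
            "|http://images.anthropologie.com/is/image/Anthropologie/19851559_033_b"
        ),
        "variations": "Fir and Firewood|Spiced Cider|Volcano|Spiced Cider|Fir and Firewood|Volcano|Volcano",
    }
}

cleaned = clean_product(record)
print(cleaned["variations"])
print(cleaned["images"])
```

Fields that are missing from a record, such as description in two of the sample products, are simply left out of the cleaned copy, and the three list fields default to empty lists.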
Mikhail Sisin: Co-founder of cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.