Mikhail Sisin, co-founder of the cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.

Gathering product and price data from the Bed Bath & Beyond online store

Bed Bath & Beyond is a chain of home goods stores in the USA, Puerto Rico, Canada and Mexico. In 1971, Warren Eisenberg and Leonard Feinstein opened a store called Bed 'n Bath in Springfield, New Jersey. By 1985, they operated 17 stores in New York and California. To reflect its growth, the company was renamed Bed Bath & Beyond. This web scraper makes it easy to gather product and price data from the bedbathandbeyond.com website.

Scraper updated on 02.15.2019 due to changes to the website layout

Approx number of goods: 200000
Approx number of page requests: 400000
Recommended subscription plan: Medium

PLEASE NOTE! The number of requests can exceed the number of products because data about variations, images, etc. may be scraped from other resources, which requires additional requests. Part of the product data may also be delivered through XHR requests, which further increases the total number of page requests.
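
As a rough illustration of why the request count can be about twice the product count, the arithmetic can be sketched in Python. The page size of 96 comes from the /1-96 suffix used in the configuration in this article; the extra_per_product parameter is an assumption introduced here to stand in for additional requests (variations, images, XHR) and is not a measured value.

```python
import math


def estimate_requests(products: int, per_page: int = 96,
                      extra_per_product: float = 0.0) -> int:
    """Rough lower-bound estimate of page requests for one full crawl."""
    # Category listing pages needed to enumerate every product once.
    listing_pages = math.ceil(products / per_page)
    # One request per product page.
    product_pages = products
    # Assumed additional requests per product (hypothetical parameter).
    extra = math.ceil(products * extra_per_product)
    return listing_pages + product_pages + extra


print(estimate_requests(200_000))                         # 202084
print(estimate_requests(200_000, extra_per_product=1.0))  # 402084
```

With one assumed extra request per product, the estimate lands near the 400000 figure quoted above. Retries after a blocked proxy would push the real number higher.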

How to use the web scraper to extract data about goods and prices from bedbathandbeyond.com

To use the web scraper for the Bed Bath & Beyond website, you need an account with our Diggernaut service. Follow these steps:
1. Use this registration link to open a free Diggernaut account.
2. After registering and confirming your email address, log in to your account.
3. Create a project with any name and description. If you do not know how to do it, please refer to our documentation.
4. Switch to the created project and create a digger with any name. If you do not know how to do it, please refer to our documentation.
5. Copy the following digger configuration to the clipboard and paste it into the digger you created. If you do not know how to do it, please refer to our documentation.
6. PLEASE NOTE! Basic proxy servers may not work with this site, so you may need to use your own proxy servers. Specify your proxy server in the proxy setting of the digger configuration, at the place marked by the comment. If anything about this step is unclear, please contact us through the support system or our online chat, and we will be glad to help you.
7. Switch the digger mode from Debug to Active. If you do not know how to do it, please refer to our documentation.
8. Run your digger and wait until it completes. If you do not know how to do it, please refer to our documentation.
9. Download the scraped dataset in the format you need. If you do not know how to do it, please refer to our documentation.

You can also set up a schedule to run your scraper and collect data regularly.

Scraping configuration for the digger

---
config:
    debug: 2
    agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36
    proxy: #USE YOUR PROXY HERE LIKE 1.1.1.1:8888
do:
- variable_set:
    field: repeatwalk
    value: "yes"
- variable_set:
    field: repeatcat
    value: "yes"
- variable_set:
    field: repeatitems
    value: "yes"
## --------------------
## categories collector
- link_add: 'https://www.bedbathandbeyond.com'
- walk:
    to: links
    repeat_in_pool: <%repeatwalk%>
    headers:
        accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
        accept-encoding: deflate
        accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7
        cache-control: no-cache
        pragma: no-cache
        upgrade-insecure-requests: 1
    do:
    - find:
        path: title
        do:
        - parse
        - if:
            match: Access Denied
            do:
            - proxy_switch
            else:
            - find:
                path: html
                in: doc
                do:
                ## removing repeat
                - variable_clear: repeatwalk
            - find:
                path: '#ctl00_InvalidRequest'
                in: doc
                do:
                - parse
                - if:
                    match: \S
                    do:
                    - proxy_switch
                    - variable_set:
                        field: repeatwalk
                        value: "yes"
            - find:
                path: body
                in: doc
                do:
                - variable_get: repeatwalk
                - if:
                    match: \S
                    else:
                    ## main logic
                    - find:
                        path: script:contains("window.__INITIAL_STATE")
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - filter: 
                            args: 'window\.__INITIAL_STATE__\s+\=\s+(.+)\s*;\s*window.__INITIAL_STATE__.sitespect'
                        - normalize:
                            routine: json2xml
                        - to_block
                        - find:
                            path: body_safe > navigation > data
                            slice: 0
                            do:
                            - find:
                                path: menu > items
                                slice: 0
                                do:
                                - find:
                                    path: url:contains("category")
                                    do:
                                    - parse:
                                        filter: ^([^\?]+)
                                    - normalize:
                                        routine: url
                                    - if:
                                        match: \/category\/[\w\-]+\/[a-zA-Z]+
                                        do:
                                        - normalize:
                                            routine: replace_substring
                                            args:
                                                - \/?$: '/1-96'
                                        - link_add:
                                            pool: categories
- walk:
    to: links
    pool: categories
    repeat_in_pool: <%repeatcat%>
    headers:
        accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
        accept-encoding: deflate
        accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7
        cache-control: no-cache
        pragma: no-cache
        upgrade-insecure-requests: 1
    do:
    - find:
        path: title
        do:
        - parse
        - if:
            match: Access Denied
            do:
            - proxy_switch
            else:
            - find:
                path: html
                in: doc
                do:
                ## removing repeat
                - variable_clear: repeatcat
            - find:
                path: '#ctl00_InvalidRequest'
                in: doc
                do:
                - parse
                - if:
                    match: \S
                    do:
                    - proxy_switch
                    - variable_set:
                        field: repeatcat
                        value: "yes"
            - find:
                path: body
                in: doc
                do:
                - variable_get: repeatcat
                - if:
                    match: \S
                    else:
                    ## main logic to gather product item links
                    - find:
                        path: div.mt0.tealium-product-grid
                        in: doc
                        do:
                        ## collect all item links from page
                        - find:
                            path: 'div.tealium-product-tile > div[class*="ProductTile-"] > a[class*="PrimaryLink_"]'
                            do:
                            - parse:
                                attr: href
                                filter: ^([^\?]+)
                            - normalize:
                                routine: url
                            - link_add:
                                pool: items
                    ## and let's try to find next page here
                    - find:
                        path: a.Pagination__btnNext
                        in: doc
                        do:
                        - parse:
                            attr: aria-disabled
                        - if:
                            match: "true"
                            do:
                            ## next page is not found
                            else:
                            ## found next page
                            ## add new page link into pool
                            - static_get: url
                            - variable_set: url
                            - filter:
                                args: '\/(\d+)\-\d+$'
                            - variable_set: pageid
                            - eval:
                                routine: js
                                body: '(function () {
                                            var cnt = <%pageid%>;
                                            return cnt + 1;
                                        })();'
                            - variable_set: pageid
                            - variable_get: url
                            - normalize:
                                routine: replace_substring
                                args:
                                - \/\d+\-\d+$: '/<%pageid%>-96'
                            - link_add:
                                pool: categories
- walk:
    to: links
    pool: items
    repeat_in_pool: <%repeatitems%>
    headers:
        accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
        accept-encoding: deflate
        accept-language: ru-RU,ru;q=0.9,en-US;q=0.8,en;q=0.7
        cache-control: no-cache
        pragma: no-cache
        upgrade-insecure-requests: 1
    do:
    - find:
        path: title
        do:
        - parse
        - if:
            match: Access Denied
            do:
            - proxy_switch
            else:
            - find:
                path: html
                in: doc
                do:
                ## removing repeat
                - variable_clear: repeatitems
            - find:
                path: '#ctl00_InvalidRequest'
                in: doc
                do:
                - parse
                - if:
                    match: \S
                    do:
                    - proxy_switch
                    - variable_set:
                        field: repeatitems
                        value: "yes"
    - find:
        path: body
        in: doc
        do:
        - variable_get: repeatitems
        - if:
            match: \S
            else:
            ## save item
            - object_new: product
            - find:
                path: script:contains("window.__INITIAL_STATE")
                do:
                - parse
                - space_dedupe
                - trim
                - filter: 
                    args: 'window\.__INITIAL_STATE__\s+\=\s+(.+)\s*;\s*window.__INITIAL_STATE__.sitespect'
                - normalize:
                    routine: json2xml
                - to_block
                - find:
                    path: body_safe > pdp > productdetails > data
                    slice: 0
                    do:
                    - variable_clear: pid
                    - variable_set:
                        field: brand
                        value: BedBathAndBeyond
                    - eval:
                        routine: js
                        body: '(function () {
                                    var d = new Date();
                                    return d.toISOString();
                                })();'
                    - object_field_set:
                        object: product
                        field: date
                    - static_get: url
                    - object_field_set:
                        object: product
                        field: url
                    - find:
                        path: brand_name
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - variable_set: brand
                    - variable_get: brand
                    - object_field_set:
                        object: product
                        field: brand
                    - find:
                        path: display_name
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: product
                            field: name
                    - find:
                        path: description
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - if:
                            match: \w+
                            do:
                            - object_field_set:
                                object: product
                                field: description
                    - find:
                        path: product_id
                        do:
                        - parse:
                            filter: (\d+)
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: product
                            field: sku
                    - find:
                        path: low_price
                        do:
                        - parse:
                            filter: ([\d\.]+)
                        - object_field_set:
                            object: product
                            type: float
                            field: price
                    - find:
                        path: variations > all_colors
                        do:
                        - find:
                            path: color
                            do:
                            - parse
                            - space_dedupe
                            - trim
                            - if:
                                match: \w+
                                do:
                                - object_field_set:
                                    object: product
                                    joinby: "|"
                                    field: variations
                    - find:
                        path: image_id
                        do:
                        - parse
                        - normalize:
                            routine: replace_substring
                            args:
                                - '^\s*': 'https://b3h2.scene7.com/is/image/BedBathandBeyond/'
                        - variable_set: image_url
                        - register_set: <%image_url%>?scl=1
                        - object_field_set:
                            object: product
                            joinby: "|"
                            field: images
                    - find:
                        path: alt_img
                        do:
                        - split:
                            context: text
                            delimiter: ','
                        - find:
                            path: div.splitted
                            do:
                            - parse
                            - normalize:
                                routine: replace_substring
                                args:
                                    - '^\s*': 'https://b3h2.scene7.com/is/image/BedBathandBeyond/'
                            - variable_set: image_url
                            - register_set: <%image_url%>?scl=1
                            - object_field_set:
                                object: product
                                joinby: "|"
                                field: images
            - find:
                path: 'div#first > ul[class*="Breadcrumbs-"] > li > a'
                slice: 0:-1
                do:
                - parse
                - space_dedupe
                - trim
                - if:
                    match: \w+
                    do:
                    - object_field_set:
                        object: product
                        joinby: "|"
                        field: category
            - object_save:
                name: product
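
The product walker above reads its data from the JSON assigned to window.__INITIAL_STATE__ in an inline script rather than from the visible HTML. Outside Diggernaut, the same idea can be sketched in Python as follows. The regular expression mirrors the one in the configuration; the sample payload and its key names are made up for illustration and are not the site's real structure.

```python
import json
import re

# Same pattern as the config's filter, with the dots escaped.
STATE_RE = re.compile(
    r"window\.__INITIAL_STATE__\s+=\s+(.+)\s*;\s*"
    r"window\.__INITIAL_STATE__\.sitespect"
)


def extract_initial_state(script_text: str) -> dict:
    """Return the object assigned to window.__INITIAL_STATE__."""
    # Collapse whitespace runs and trim, like space_dedupe + trim.
    collapsed = " ".join(script_text.split())
    match = STATE_RE.search(collapsed)
    if match is None:
        raise ValueError("window.__INITIAL_STATE__ assignment not found")
    return json.loads(match.group(1))


# Hypothetical script body; the keys are invented for this example.
sample_script = """
  window.__INITIAL_STATE__ = {"pdp": {"productdetails": {"data":
      {"display_name": "Example Mixer", "low_price": "$39.99"}}}};
  window.__INITIAL_STATE__.sitespect = {};
"""

state = extract_initial_state(sample_script)
data = state["pdp"]["productdetails"]["data"]
print(data["display_name"])  # Example Mixer
```

Parsing the embedded state avoids depending on generated CSS class names, which is presumably why the configuration takes this route for product fields.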

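The category walker builds listing URLs by appending a page-and-size suffix such as /1-96, then increments the page number while the next button is enabled. A minimal Python sketch of that URL arithmetic, assuming the suffix format stays page-size, might look like this. The category URL in the example is hypothetical.

```python
import re

# Trailing "/<page>-<size>" suffix, e.g. "/1-96".
PAGE_RE = re.compile(r"/(\d+)-(\d+)$")


def first_page_url(category_url: str, per_page: int = 96) -> str:
    """Drop the query string and append the first-page suffix."""
    base = category_url.split("?", 1)[0].rstrip("/")
    return f"{base}/1-{per_page}"


def next_page_url(url: str) -> str:
    """Increment the page number in the trailing suffix."""
    match = PAGE_RE.search(url)
    if match is None:
        raise ValueError(f"no page suffix in {url!r}")
    page, per_page = int(match.group(1)), match.group(2)
    return PAGE_RE.sub(f"/{page + 1}-{per_page}", url)


# Hypothetical category URL, used only to exercise the helpers.
start = first_page_url(
    "https://www.bedbathandbeyond.com/store/category/bedding/sheets/10001/?x=1"
)
print(start)
print(next_page_url(start))
```

The pattern is anchored with an unescaped $ so that only the suffix at the end of the URL is rewritten.
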
Sample of scraped data

Below is a sample dataset with several products in JSON format, so you can easily review the data structure. The dataset can also be downloaded as CSV, XLSX, XML, or any other text format using templates.

[{
    "product": {
        "brand": "Dyson",
        "category": "Gifts|Gifts by Category|Unique Gifts",
        "currency": "USD",
        "date": "2017-12-07T00:05:23.532Z",
        "description": "Dyson's Supersonic Hair Dryer uses intelligent heat control technology to help to prevent heat damage to your hair, preserving its natural shine. This high-speed and powerful hair dryer works to straighten and smooth delivering beautiful silky hair.",
        "images": "https://s7d9.scene7.com/is/image/BedBathandBeyond/145513347275522p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/98918847339040p?scl=1|https://s7d2.scene7.com/is/image/BedBathandBeyond/10160953308317m?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/10160953308317m?scl=1",
        "name": "Dyson Supersonic Hair Dryer",
        "price": 399.99,
        "url": "https://www.bedbathandbeyond.com/store/product/dyson-supersonic-hair-dryer/3308317",
        "variations": "IRON/FUCHSIA|WHITE/SILVER"
    }
}
,{
    "product": {
        "brand": "KitchenAid",
        "category": "Kitchen|Small Appliances|Mixers & Attachments",
        "currency": "USD",
        "date": "2017-12-07T00:05:25.430Z",
        "description": "This high-performance, 325 watt KitchenAid Artisan Stand Mixer is reason enough for you to get busy in the kitchen. With a 5 qt. ultra durable stainless steel mixing bowl and 10 speed settings, this tilt-back-head all-metal mixer is a kitchen essential.",
        "images": "https://s7d9.scene7.com/is/image/BedBathandBeyond/21686512370920p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/15710817825569p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/68875814073710p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/46977543004843p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/7366314872353p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/18935118698528p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/58050514872485p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/21685612370938p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/17041218088827p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/31002313317640p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/24925813080976p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/150305412370911p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/21685714017224p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/5789314222944p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/21686413324514p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/21686612963238p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/21685812370962p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/104721943004836p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/21685912370989p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/109395460419590p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/109395760419613p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/21686212863004p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/109395660419606p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/21685412370903p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/25119914872426p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/58001413227713p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/21686012370997p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/7366514872434p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/21686112371004p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/31722642049784p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/26824312371012p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/21685312366590p?scl=1|https://s7d1.scene7.com/is/image/BedBathandBeyond/150305412370911p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/150305412370911p?scl=1",
        "name": "KitchenAid® Artisan® 5 qt. Stand Mixer",
        "price": 279.99,
        "url": "https://www.bedbathandbeyond.com/store/product/kitchenaid-reg-artisan-reg-5-qt-stand-mixer/102986",
        "variations": "ALMOND|AQUA|BLUE WILLOW|BORDEAUX|BOYSENBERRY|BROWN|BUTTERCUP|COBALT BLUE|CONTOUR SILVER|CRANBERRY|CRYSTAL BLUE|EMPIRE RED|GLOSS CINNAMON|GREEN APPLE|ICE|IMPERIAL BLACK|IMPERIAL GREY|LAVENDER|MAJESTIC YELLOW|MATTE BLACK|MATTE GRAY|METALLIC CHROME|OCEAN DRIVE|ONYX BLACK|PERSIMMON|PINK|PISTACHIO|SILVER|TANGERINE|WATERMELON|WHITE/SILVER|WHITE/WHITE"
    }
}
,{
    "product": {
        "brand": "All-Clad",
        "category": "Gifts|Gifts by Interest|Gifts for the Cook",
        "currency": "USD",
        "date": "2017-12-07T00:05:29.438Z",
        "description": "All-Clad is the first choice of serious cooks. Three-ply bonded construction has a pure aluminum core for even heat distribution and a non-reactive stainless-steel interior and exterior for stick-resistant and easy-to-clean benefits.",
        "images": "https://s7d1.scene7.com/is/image/BedBathandBeyond/1861812460112p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/1861812460112p?scl=1",
        "name": "All-Clad 12-Quart Stainless Steel Multi-Cooker",
        "price": 149.99,
        "sku": "12460112",
        "url": "https://www.bedbathandbeyond.com/store/product/all-clad-12-quart-stainless-steel-multi-cooker/1012460112"
    }
}
,{
    "product": {
        "brand": "Homedics",
        "category": "Health & Beauty|Massage & Relaxation|Massage",
        "currency": "USD",
        "date": "2017-12-07T00:05:30.079Z",
        "description": "Feel the soothing warmth of the HoMedics Shiatsu Neck and Shoulder Massager with the added heat to the shiatsu, vibrating, or combined settings. It's all customizable so you can feel comfortable and natural in your relaxation.",
        "images": "https://s7d1.scene7.com/is/image/BedBathandBeyond/46662342763468p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/46662342763468p?scl=1|https://s7d9.scene7.com/is/image/BedBathandBeyond/46662342763468p__1?scl=1",
        "name": "HoMedics® Shiatsu Neck and Shoulder Massager with Heat",
        "price": 39.99,
        "sku": "42763468",
        "url": "https://www.bedbathandbeyond.com/store/product/homedics-reg-shiatsu-neck-and-shoulder-massager-with-heat/1042763468"
    }
}
,{
    "product": {
        "brand": "Presto",
        "category": "Gifts|Gifts by Category|Unique Gifts",
        "currency": "USD",
        "date": "2017-12-07T00:05:30.730Z",
        "description": "Make delicious, authentic pizza parlor pizza at home. With the exclusive Roto-bake technology you can choose exactly how bubbly the cheese should be and precisely how crispy or chewy you'd like the crust.",
        "images": "https://s7d1.scene7.com/is/image/BedBathandBeyond/397311975038p?scl=1",
        "name": "Presto Pizzazz Pizza Cooker",
        "price": 59.99,
        "sku": "11975038",
        "url": "https://www.bedbathandbeyond.com/store/product/presto-pizzazz-pizza-cooker/1011975038"
    }
}]
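
Multi-valued fields such as category, images and variations are stored as single strings joined with a pipe character, as in the sample above. A small Python sketch for splitting them back into lists after download could look like this. The function name split_multi_valued is just an illustrative choice, and the record is a shortened copy of the last sample entry.

```python
import json

MULTI_VALUED = ("category", "images", "variations")


def split_multi_valued(records: list) -> list:
    """Unwrap each {"product": {...}} record and split pipe-joined fields."""
    rows = []
    for record in records:
        product = dict(record["product"])
        for field in MULTI_VALUED:
            value = product.get(field)
            # Missing fields (e.g. no variations) become empty lists.
            product[field] = value.split("|") if value else []
        rows.append(product)
    return rows


raw = """[{"product": {
    "brand": "Presto",
    "category": "Gifts|Gifts by Category|Unique Gifts",
    "images": "https://s7d1.scene7.com/is/image/BedBathandBeyond/397311975038p?scl=1",
    "name": "Presto Pizzazz Pizza Cooker",
    "price": 59.99,
    "sku": "11975038"
}}]"""

rows = split_multi_valued(json.loads(raw))
print(rows[0]["category"])    # ['Gifts', 'Gifts by Category', 'Unique Gifts']
print(rows[0]["variations"])  # []
```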