Mikhail Sisin Co-founder of cloud-based web scraping and data extraction platform Diggernaut. Over 10 years of experience in data extraction, ETL, AI, and ML.

How to extract user generated content for an internet shop with a small budget

You’ve probably seen galleries of user-generated content in online stores that sell clothing, shoes, home products, and so on. They help sell a product because they let a potential buyer see how an item looks on a real person rather than on a model, and so make a more informed decision. You may want to extract user-generated content for your own store but not know how to do it on a limited budget.

The technical implementation of such a mechanism is as follows: an aggregator service collects user-generated images from the Internet (for example, from Instagram), determines the brand and model of the item or items shown in each photo, and delivers the result as a feed. Connecting to such a service can be costly for a small store, so mainly large mono-brand and multi-brand online stores can afford it.

The second option is to build such an aggregation service yourself, but this is a long, time-consuming, and expensive process, much more expensive than connecting a single online store to an existing aggregator.

However, there is a budget option. Many brands and well-known online stores are already customers of such aggregators and have their own feeds of user-generated photos with information about the corresponding products. Therefore, if you sell products from the same brands, you can get information from these feeds, process the received data, and use it in your online store to sell those brands' products.

You might object that coding scrapers for every site and brand, when there are hundreds of them, is tedious and time-consuming. However, you do not need to scrape the websites themselves. You only need the feed with user content, and such feeds are provided by a limited set of aggregators. So technically you need only one scraper with standard logic, which uses different URLs or parameters to pick up the feeds for different stores and brands.
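The idea of one scraper driven by different parameters can be sketched in a few lines of Python. This is only an illustration and not part of Diggernaut: the two URL patterns are copied from the digger configuration shown later in this article, while the function names `gallery_url` and `feed_url` and the sample `rid` value are hypothetical. Whether these endpoints still respond in this form has not been verified here.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# URL patterns copied from the digger configuration in this article.
GALLERY_URL = "https://like2buy.curalate.com/{shop}/"
FEED_URL = "http://api.curalate.com/v1/like2buy/{shop}/products.json"


def gallery_url(shop: str) -> str:
    """Return the public gallery page for one store ID."""
    return GALLERY_URL.format(shop=quote(shop, safe=""))


def feed_url(shop: str, rid: str, bookmark: Optional[str] = None) -> str:
    """Return the feed URL for one store ID; a bookmark selects a later page."""
    params = {}
    if bookmark is not None:
        params["qBookmark"] = bookmark
    params["rid"] = rid
    return FEED_URL.format(shop=quote(shop, safe="")) + "?" + urlencode(params)


if __name__ == "__main__":
    # The same two functions serve every store; only the ID changes.
    for shop in ["guess", "aldo_shoes"]:
        print(gallery_url(shop))
        print(feed_url(shop, rid="example-rid"))
```

Only the store ID changes between brands, which is why a single scraper is enough.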

One such service is Like2Buy, provided by Curalate. It serves more than 6,000 online stores and brands. You can find the feeds by googling “like2buy.curalate.com” and clicking the “show all results” link. For reference, we also list below a few stores and their IDs for use with the free web scraper we share in this article.

This data can be useful not only for online stores but also for companies conducting research for brands, as well as companies working in the machine learning area.

To use the scraper, you need a free account with our Diggernaut service. Follow these steps:

  1. Follow the registration link to open a free Diggernaut account.
  2. After registering and confirming your email address, log in to your account.
  3. Create a project with any name and description. If you do not know how to do it, please refer to our documentation.
  4. Switch to the created project and create a digger with any name. If you do not know how to do it, please refer to our documentation.
  5. Copy the digger configuration below to the clipboard and paste it into the digger you created. If you do not know how to do it, please refer to our documentation.
  6. In the iterator configuration inside the digger config, enter one or more (comma-separated) store IDs from the table below.
  7. Switch the digger mode from Debug to Active. If you do not know how to do it, please refer to our documentation.
  8. Run your digger and wait until it completes. If you do not know how to do it, please refer to our documentation.
  9. Download the scraped dataset in the format you need. If you do not know how to do it, please refer to our documentation.

You can also set up a schedule for running your scraper and collect data regularly.

The scraper configuration is shown below. You can copy it into any of your diggers, enter one or more IDs from the store table, and start your digger.

---
config:
    debug: 2
    agent: Firefox
iterator:
    type: csv
    name: shop
    value: # Set a single store ID here, or several store IDs separated by commas
do:
- walk:
    to: https://like2buy.curalate.com/<%shop%>/
    do:
    - pool_clear: sub
    - find:
        path: html
        do:
        # Generate a random UUID v4-style string and store it in the "rid" variable
        - eval:
            routine: js
            body: '(function() {return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, function(e) {var t = 16 * Math.random() | 0, r = "x" === e ? t : 3 & t | 8; return r.toString(16)})})();'
        - variable_set: rid
        # Build the URL of the first feed page for this store
        - register_set: http://api.curalate.com/v1/like2buy/<%shop%>/products.json?rid=<%rid%>
        - link_add:
            pool: sub
        - walk:
            to: links
            pool: sub
            do:
            # Pagination: if the response contains a qbookmark, queue the next feed page
            - find:
                path: qbookmark
                do:
                - parse
                - register_set: http://api.curalate.com/v1/like2buy/<%shop%>/products.json?qBookmark=<%register%>&rid=<%rid%>
                - link_add:
                    pool: sub
            - find: 
                path: items 
                do: 
                - object_new: item
                - argument_get: shop
                - object_field_set:
                    object: item
                    field: shop
                - find:
                    path: largephotourl
                    slice: 0
                    do:
                    - parse
                    - normalize:
                        routine: url
                    - object_field_set:
                        object: item
                        field: image
                - find: 
                    path: products
                    do: 
                    - parse
                    - object_new: product
                    - find: 
                        path: destinationurl
                        do:
                        - parse
                        - object_field_set:
                            object: product
                            field: url
                    - find: 
                        path: name
                        do:
                        - parse
                        - space_dedupe
                        - trim
                        - object_field_set:
                            object: product
                            field: name
                    - object_save:
                        name: product
                        to: item
                - object_save:
                    name: item
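
For readers who think more easily in a general-purpose language, the flow of this configuration can be sketched in Python: generate a random request ID, fetch the first feed page, and keep following the bookmark while the response contains one. This is a rough equivalent under stated assumptions, not what Diggernaut runs internally. The names `new_rid`, `iter_items`, and `FAKE_PAGES` are hypothetical, the JSON key casing `qBookmark` is assumed from the query parameter name (the config selects it as `qbookmark`), and the HTTP call is replaced by canned, made-up pages so the sketch runs offline. The guard against a repeated bookmark is an addition of this sketch.

```python
import uuid
from typing import Callable, Iterator, Optional


def new_rid() -> str:
    """Random UUID version 4 string, the same shape the inline JavaScript builds."""
    return str(uuid.uuid4())


def iter_items(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[dict]:
    """Yield items from every page, following the bookmark until none is left."""
    bookmark: Optional[str] = None
    seen = set()
    while True:
        page = fetch_page(bookmark)
        yield from page.get("items", [])
        bookmark = page.get("qBookmark")
        # Stop when there is no next page, or when a bookmark repeats.
        if not bookmark or bookmark in seen:
            break
        seen.add(bookmark)


# Stand-in for the HTTP request: two canned pages keyed by bookmark (made-up data).
FAKE_PAGES = {
    None: {"items": [{"id": 1}, {"id": 2}], "qBookmark": "page2"},
    "page2": {"items": [{"id": 3}]},
}

if __name__ == "__main__":
    print(new_rid())
    print([item["id"] for item in iter_items(FAKE_PAGES.__getitem__)])
```

In a real run, `fetch_page` would request the feed URL for the given bookmark and return the decoded JSON.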

As a result, you get a dataset with the following structure:

[{
    "item": {
        "image": "https://d28m5bx785ox17.cloudfront.net/v1/img/PPYWso07RgBC_UHzxcrgAO_Wk0twhD3XHvviHlJ7-ZY=/d/l",
        "product": [
            {
                "name": "Marco Faux-Leather Moto Jacket",
                "url": "https://shop.guess.com/en/catalog/view/women/jackets-and-outerwear/view-all/marco-faux-leather-moto-jacket/w74l10r72y1?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&crl8_id=670ce9b5-3465-4372-b0fe-df6a0c71ed4b"
            },
            {
                "name": "CAN: Marco Faux-Leather Moto Jacket",
                "url": "https://www.guess.ca/en/catalog/view/women/jackets-and-outerwear/view-all/marco-faux-leather-moto-jacket/w74l10r72y1?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&utm_content=w74l10r72y1&crl8_id=670ce9b5-3465-4372-b0fe-df6a0c71ed4b"
            }
        ],
        "shop": "guess"
    }
}
,{
    "item": {
        "image": "https://d28m5bx785ox17.cloudfront.net/v1/img/Wn0kXxTmnzmAy6hTP3_bynEdtv9Ph7Y0M9FOVyLen00=/d/l",
        "product": [
            {
                "name": "US: Silver-Tone Charm Bracelet Box Set",
                "url": "https://shop.guess.com/en/catalog/view/434044G21?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&utm_content=434044G21&crl8_id=a0dd62bd-9024-4224-bcdf-323d6e6e601d"
            },
            {
                "name": "US: Boxed Rose Gold-Tone Charm Bracelet",
                "url": "https://shop.guess.com/en/catalog/view/434042G21?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&utm_content=434042G21&crl8_id=a0dd62bd-9024-4224-bcdf-323d6e6e601d"
            },
            {
                "name": "US: GUESS 1981 Eau De Toilette, 3.4 oz.",
                "url": "https://shop.guess.com/en/catalog/view/accessories/women/fragrance/guess-1981-eau-de-toilette-3-4-oz/32667861000?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&utm_content=32667861000&crl8_id=a0dd62bd-9024-4224-bcdf-323d6e6e601d"
            },
            {
                "name": "US: Metallic Mini Backpack Keychain",
                "url": "https://shop.guess.com/en/catalog/view/17GUP248?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&utm_content=17GUP248&crl8_id=a0dd62bd-9024-4224-bcdf-323d6e6e601d"
            },
            {
                "name": "CAN: Boxed Gold-Tone Stud Earring Set",
                "url": "https://guess.ca/en/Catalog/View/434046GC21/?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&utm_content=434046GC21&crl8_id=a0dd62bd-9024-4224-bcdf-323d6e6e601d#434046GC21"
            },
            {
                "name": "CAN: GUESS 1981 Eau De Toilette, 3.4 oz.",
                "url": "https://www.guess.ca/en/catalog/view/accessories/women/fragrance/guess-1981-eau-de-toilette-3-4-oz/32667861000?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&utm_content=32667861000&crl8_id=a0dd62bd-9024-4224-bcdf-323d6e6e601d"
            },
            {
                "name": "CAN: Metallic Mini Backpack Keychain",
                "url": "https://www.guess.ca/en/Catalog/View/17GUP248/?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&utm_content=17GUP248&crl8_id=a0dd62bd-9024-4224-bcdf-323d6e6e601d#17GUP248"
            },
            {
                "name": "EU: Holiday Delivery",
                "url": "https://www.guess.eu/en/CustomerCare/guaranteed-delivery/?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&crl8_id=a0dd62bd-9024-4224-bcdf-323d6e6e601d"
            }
        ],
        "shop": "guess"
    }
}
,{
    "item": {
        "image": "https://d28m5bx785ox17.cloudfront.net/v1/img/oCSER6z1bD-KgCCgMcbH9Xk9OifDOvwuXgXNwAQmIeI=/d/l",
        "product": [
            {
                "name": "CAN: Lily Faux-Fur Coat",
                "url": "https://www.guess.ca/en/catalog/view/women/jackets-and-outerwear/faux-fur/lily-faux-fur-coat/w74l14w9t70?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&utm_content=w74l14w9t70&crl8_id=9a9df613-d531-4252-a63e-2566d16dedd2"
            },
            {
                "name": "EU: Floral Faux-Fur Coat",
                "url": "https://www.guess.eu/en/catalog/view/women/apparel/coats-and-jackets/floral-faux-fur-coat/w74l14w9t70?color=dpid%3FCMP%3DSMC-INSTAGRAM-LIKETOBUY&crl8_id=9a9df613-d531-4252-a63e-2566d16dedd2"
            }
        ],
        "shop": "guess"
    }
}]
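
If you want to load this dataset into a shop database or a spreadsheet, one row per photo and product pair is often easier to work with. The following Python sketch flattens the structure shown above and, optionally, removes the `utm_*` and `crl8_id` tracking parameters from product links. The function names `strip_tracking` and `flatten` are hypothetical, and the choice to drop tracking parameters is an assumption about what a store would want, not something the feed requires. The sample record is the first item from the dataset above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_tracking(url: str) -> str:
    """Drop utm_* and crl8_id query parameters and keep everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith("utm_") and key != "crl8_id"
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def flatten(dataset: list) -> list:
    """Turn nested item/product records into one flat row per product."""
    rows = []
    for record in dataset:
        item = record["item"]
        products = item.get("product", [])
        # Defensive assumption: a lone product might arrive as a single object.
        if isinstance(products, dict):
            products = [products]
        for product in products:
            rows.append({
                "shop": item.get("shop"),
                "image": item.get("image"),
                "name": product.get("name"),
                "url": strip_tracking(product.get("url", "")),
            })
    return rows


SAMPLE = [{
    "item": {
        "image": "https://d28m5bx785ox17.cloudfront.net/v1/img/PPYWso07RgBC_UHzxcrgAO_Wk0twhD3XHvviHlJ7-ZY=/d/l",
        "product": [
            {
                "name": "Marco Faux-Leather Moto Jacket",
                "url": "https://shop.guess.com/en/catalog/view/women/jackets-and-outerwear/view-all/marco-faux-leather-moto-jacket/w74l10r72y1?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&crl8_id=670ce9b5-3465-4372-b0fe-df6a0c71ed4b",
            },
            {
                "name": "CAN: Marco Faux-Leather Moto Jacket",
                "url": "https://www.guess.ca/en/catalog/view/women/jackets-and-outerwear/view-all/marco-faux-leather-moto-jacket/w74l10r72y1?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&utm_content=w74l10r72y1&crl8_id=670ce9b5-3465-4372-b0fe-df6a0c71ed4b",
            },
        ],
        "shop": "guess",
    }
}]

if __name__ == "__main__":
    for row in flatten(SAMPLE):
        print(row["shop"], "|", row["name"], "|", row["url"])
```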

As you can see, our basic scraper extracts only the image URL and the names and URLs of the products. By changing the scraper logic, you can extract other data available in the feed and transform the extracted data in any way you like, shaping the dataset precisely as you need it. Below is the structure of one source feed object, which should help you compose CSS selectors for the containers that hold the data:

<items>
        <candelete>false</candelete>
        <caption_safe>Introducing the next generation of #GUESSConnect Smartwatches ⌚️? Powered by Android Wear (and compatible
                with iOS 9+), our fav feature is swiping through the hundreds of watch faces to pair perfectly
                with whatever you're wearing + the Google Assistant! ➡️ Click the link in our bio to
                discover more #GUESSWatches #LoveGUESS</caption_safe>
        <commentcount>182</commentcount>
        <isfeatured>true</isfeatured>
        <largephotourl>https://d28m5bx785ox17.cloudfront.net/v1/img/9w5j3aXjw6pKZUvbDwEAEB9wXM8RqUpsxHL3wHF0i5A=/d/l</largephotourl>
        <largevideourl>https://scontent.cdninstagram.com/vp/d9e6c226c2cadbf3bc45167c1f24fff9/5A3D679E/t50.2886-16/24383086_151063558867804_2812871925800370176_n.mp4</largevideourl>
        <likecount>13306</likecount>
        <mediumphotourl>https://d28m5bx785ox17.cloudfront.net/v1/img/9w5j3aXjw6pKZUvbDwEAEB9wXM8RqUpsxHL3wHF0i5A=/d/m</mediumphotourl>
        <mediumvideourl>https://scontent.cdninstagram.com/vp/d9e6c226c2cadbf3bc45167c1f24fff9/5A3D679E/t50.2886-16/24383086_151063558867804_2812871925800370176_n.mp4</mediumvideourl>
        <networkidentifier>f1ffd186-3ee1-42ec-b463-135b26139ab7</networkidentifier>
        <networkurl>https://www.instagram.com/p/BcNdy1oluYh/</networkurl>
        <originalfileidandsource>
                <fileid>9w5j3aXjw6pKZUvbDwEAEB9wXM8RqUpsxHL3wHF0i5A=</fileid>
                <osource>instagram</osource>
        </originalfileidandsource>
        <products>
                <croppedthumbnailimageurl>https://d28m5bx785ox17.cloudfront.net/v1/img/dXSPdD25vkxoHZMw7xCH21i3Xm5Bda6gi5-MMFGEBNI=/sc/350x350</croppedthumbnailimageurl>
                <destinationurl>https://shop.guess.com/en/catalog/browse/lifestyle/guess-connect-touch/?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&crl8_id=f1ffd186-3ee1-42ec-b463-135b26139ab7</destinationurl>
                <fileid>dXSPdD25vkxoHZMw7xCH21i3Xm5Bda6gi5-MMFGEBNI=</fileid>
                <id>0</id>
                <imageurl>https://d28m5bx785ox17.cloudfront.net/v1/img/dXSPdD25vkxoHZMw7xCH21i3Xm5Bda6gi5-MMFGEBNI=/d/l</imageurl>
                <name>US: GUESS CONNECT</name>
                <position>1</position>
                <productstyleid>u_2765_00c88d1540a358f1f4cadff87341b5122c7ac0900f11568a7e434923c71aa2f4</productstyleid>
                <sourceimageurl>https://d28m5bx785ox17.cloudfront.net/v1/img/dXSPdD25vkxoHZMw7xCH21i3Xm5Bda6gi5-MMFGEBNI=</sourceimageurl>
        </products>
        <products>
                <croppedthumbnailimageurl>https://d28m5bx785ox17.cloudfront.net/v1/img/j3aEX6aK9BPSma4E8OrRXxT4JjCrcJn7zmhJ_rEFcPA=/sc/350x350</croppedthumbnailimageurl>
                <destinationurl>https://shop.guess.ca/en/catalog/browse/lifestyle/guess-connect-touch/?utm_source=instagram&utm_medium=social&utm_campaign=like2buy&crl8_id=f1ffd186-3ee1-42ec-b463-135b26139ab7</destinationurl>
                <fileid>j3aEX6aK9BPSma4E8OrRXxT4JjCrcJn7zmhJ_rEFcPA=</fileid>
                <id>0</id>
                <imageurl>https://d28m5bx785ox17.cloudfront.net/v1/img/j3aEX6aK9BPSma4E8OrRXxT4JjCrcJn7zmhJ_rEFcPA=/d/l</imageurl>
                <name>CAN: GUESS CONNECT</name>
                <position>2</position>
                <productstyleid>u_2765_8a7e0d6ae928e7b95cd25781dadb917ab9d5d5826cb0dd14c7425e5c9c99c5e5</productstyleid>
                <sourceimageurl>https://d28m5bx785ox17.cloudfront.net/v1/img/j3aEX6aK9BPSma4E8OrRXxT4JjCrcJn7zmhJ_rEFcPA=</sourceimageurl>
        </products>
        <storeid>938</storeid>
        <timeposted>1512240829000</timeposted>
</items>
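
As an example of using the extra fields in this object, the Python sketch below picks out the source post URL and the like and comment counts, and converts `timeposted` to a readable date. It assumes that `timeposted` is a Unix timestamp in milliseconds (the 13-digit value suggests this, but the article does not state it) and that the field names arrive lowercased as displayed above; the function name `summarize` is hypothetical. The values in `FEED_ITEM` are copied from the sample object.

```python
from datetime import datetime, timezone


def summarize(feed_item: dict) -> dict:
    """Pick a few engagement fields and convert the post time to ISO 8601 (UTC)."""
    # Assumption: timeposted is milliseconds since the Unix epoch.
    posted = datetime.fromtimestamp(int(feed_item["timeposted"]) / 1000, tz=timezone.utc)
    return {
        "source_url": feed_item.get("networkurl"),
        "likes": int(feed_item.get("likecount", 0)),
        "comments": int(feed_item.get("commentcount", 0)),
        "posted_at": posted.isoformat(),
    }


# Values copied from the sample feed object above (all arrive as strings).
FEED_ITEM = {
    "networkurl": "https://www.instagram.com/p/BcNdy1oluYh/",
    "likecount": "13306",
    "commentcount": "182",
    "timeposted": "1512240829000",
}

if __name__ == "__main__":
    print(summarize(FEED_ITEM))
```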

Below, we list some of the stores and brands that use Like2Buy to deliver user-generated content, together with their IDs. The list is incomplete: if you cannot find the brand or store you are interested in, try googling it or ask us. We are always happy to help 🙂

| Store or brand | ID | Store or brand | ID |
|---|---|---|---|
| Aldo | aldo_shoes | Ann Taylor | anntaylor |
| Anthropologie | anthropologie | Bed, Bath and Beyond | bedbathandbeyond |
| Brilliant Earth | brilliantearth | Cartier | cartier |
| CB2 | cb2 | Champion | champion |
| Chobani | chobani | Chumbak | chumbak |
| Crate and Barrel | crateandbarrel | Creative Recreation | creativerecreation |
| Covergirl | covergirl | David’s Bridal | davidsbridal |
| Disney | disney | Dune London | dune_london |
| Farfetch | farfetch | Fawn Shoppe | fawn_shoppe |
| Forever21 | forever21,forever21men | Fossil | fossil |
| Free People | freepeople | Gap | gap |
| Garage Clothing | garageclothing | Guess | guess |
| HauteLook | hautelook | Herbal Essences | herbalessences |
| Hot Topic | hottopic | House of Lashes | houseoflashes |
| J. Crew | jcrew | Karl Lagerfeld | karllagerfeld |
| Kohl’s | kohls | Laura Mercier | lauramercier |
| Lilly Pulitzer | lillypulitzer | Louis Vuitton | louisvuitton |
| lululemon | lululemon | Lulus | lulus |
| Macy’s | macys | Misspap | misspap |
| Neiman Marcus | neimanmarcus | Next Com AU | nextofficial_au |
| Nordstrom | nordstrom | Paint Nite | paintnite |
| PB Teen | pbteen | Pendleton | pendletonwm |
| Pier 1 | pier1 | Pottery Barn | potterybarn |
| Raymour & Flanigan | raymourflanigan | Schoolhouse Electric & Supply Co | schoolhouse |
| Schutz | schutzshoes | Sephora | sephora |
| Sperry | sperry | Target | target |
| The Bump | thebump | The Company Store | thecompanystore |
| Topman | topman | TopShop | topshop |
| Victoria’s Secret | victoriassecret | Vineyard Vines | vineyardvines |
| West Elm | westelm | Williams Sonoma | williamssonoma |
| Windsor | windsorstore | Z Gallerie | zgallerie |
| Zumiez | zumiez | | |