{"id":94,"date":"2016-10-22T22:37:15","date_gmt":"2016-10-22T22:37:15","guid":{"rendered":"https:\/\/blog.diggernaut.com\/?p=94"},"modified":"2019-01-12T20:12:24","modified_gmt":"2019-01-12T20:12:24","slug":"what-is-more-efficient-language-for-web-scraping-purposes","status":"publish","type":"post","link":"https:\/\/www.diggernaut.com\/blog\/what-is-more-efficient-language-for-web-scraping-purposes\/","title":{"rendered":"What is the most efficient language for web scraping purposes"},"content":{"rendered":"<p>We ran this small test to find out which programming language is the most efficient for web scraping in terms of speed, CPU usage, and RAM usage. We wrote all the scraping scripts in the same manner and ran each of them in a single thread. We ran each scraper for 10 minutes on the same machine, at almost the same time. The test machine was Ubuntu Linux 14.04 (under VirtualBox) with 1 CPU core and 4 GB of RAM.<\/p>\n<p>We compared the following programming languages (frameworks): Golang + Diggernaut meta-language, Perl, PHP5, Python 2.7, Python + Scrapy, and Ruby. As a target, we used the <a href=\"https:\/\/www.healthdata.gov\">U.S. Department of Health & Human Services<\/a> website.<\/p>\n<p>Let\u2019s look at the speed chart.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1182\" height=\"409\" src=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/10\/chart1.png\" alt=\"chart1\" class=\"alignnone size-medium wp-image-95\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/10\/chart1.png 1182w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/10\/chart1-768x266.png 768w\" sizes=\"auto, (max-width: 1182px) 100vw, 1182px\" \/><\/p>\n<p>As you can see, there are 3 leaders: Golang + Diggernaut was able to fetch almost 3K pages, Ruby \u2013 approx. 2.5K, and Python + Scrapy \u2013 approx. 1.5K. 
The other languages were noticeably slower.<\/p>\n<p>However, the CPU usage chart shows a somewhat different picture.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1181\" height=\"410\" src=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/10\/chart2.png\" alt=\"chart2\" class=\"alignnone size-medium wp-image-96\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/10\/chart2.png 1181w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/10\/chart2-768x267.png 768w\" sizes=\"auto, (max-width: 1181px) 100vw, 1181px\" \/><\/p>\n<p>First place here goes to PHP5, which used just 2.5% of the CPU, followed by Golang + Diggernaut with 3.5% and Perl with approx. 4%. The other languages are close behind, except Python + Scrapy: we think 11% is way too much.<\/p>\n<p>The last parameter we measured is RAM usage:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1185\" height=\"411\" src=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/10\/chart3.png\" alt=\"chart3\" class=\"alignnone size-medium wp-image-97\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/10\/chart3.png 1185w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/10\/chart3-768x266.png 768w\" sizes=\"auto, (max-width: 1185px) 100vw, 1185px\" \/><\/p>\n<p>The winner here is Golang + Diggernaut with 26 MB, followed by Perl with 29 MB and PHP5 with 39 MB. Ruby lags behind here with 154 MB of RAM usage.<\/p>\n<p>To summarize the measurements, we scored each language on a 100-point scale. 
Each measure is scored separately (the best result gets 100 points, the worst gets 0 points), and then we take the average of the three scores.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"1185\" height=\"410\" src=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/10\/chart4.png\" alt=\"chart4\" class=\"alignnone size-medium wp-image-98\" srcset=\"https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/10\/chart4.png 1185w, https:\/\/www.diggernaut.com\/blog\/wp-content\/uploads\/2016\/10\/chart4-768x266.png 768w\" sizes=\"auto, (max-width: 1185px) 100vw, 1185px\" \/><\/p>\n<p>Golang is the clear winner in this run.<\/p>\n<p>We have attached the files we used for the test so that you can try them and verify the results yourself: <a href=\"https:\/\/blog.diggernaut.com\/wp-content\/uploads\/2016\/10\/scripts.zip\">scripts<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>We ran this small test to find out which programming language is the most efficient for web scraping in terms of speed, CPU usage, and RAM usage. We wrote all the scraping scripts in the same manner and ran each of them in a single thread. 
Each scraper we ran for 10 minutes on the same machine, almost [&hellip;]<\/p>","protected":false},"author":4,"featured_media":675,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[17,16,14,15,12,13,11],"class_list":["post-94","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-web-scraping","tag-benchmark","tag-golang","tag-perl","tag-php","tag-python","tag-ruby","tag-web-scraping"],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/94","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/comments?post=94"}],"version-history":[{"count":10,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/94\/revisions"}],"predecessor-version":[{"id":676,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/posts\/94\/revisions\/676"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media\/675"}],"wp:attachment":[{"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/media?parent=94"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/categories?post=94"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.diggernaut.com\/blog\/wp-json\/wp\/v2\/tags?post=94"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}