Web Scraping with Go vs. Node.js

The tools for making sense of vast and varied information

It’s a dirty job, but someone has to do it. Let’s face it, data on the web is neither standardized nor clean. Sometimes we need to scrape this unstructured data to enhance the user experience or to direct people toward information, and that’s a messy problem. Fortunately we have more APIs these days, and many sites are adopting Open Graph tags and making better use of meta tags to help this process. Despite all that, we still have a need for responsible web scraping.

I’m going to keep this post less technical, so beyond a few small illustrative sketches you won’t see much in the way of code or techniques in here. I’ll save those for some future posts. My goal here is to help you understand some of the options you have and how the process works.

This way, you can get a better understanding of what goes into the job before you hire someone. Or, if you’re a developer, this might shine a light on whether you want to be using Node.js or Go (or try your hand at something else).

Why We Scrape the Web

Sometimes we need to understand what web content is about in order to direct people to it. This is basically what search engines do. However, we have another issue here with the advent of mobile apps. We now have a need to use our own content in a new way. Perhaps in an unexpected way. You may not have an API for your own web site, or you may wish to aggregate and direct users to multiple sites without APIs. So you’ll need to scrape those pages to understand what they’re about.

Aside from trying to understand the internet better, web scraping can also be used on your own web pages for automated testing. We can scrape our sites to ensure content appears the way we expect and that everything still loads fine.
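
To make that idea concrete, here’s a bare-bones smoke test sketched in Go; the URL, file layout, and expected phrase are all placeholder assumptions, not a prescription.

```go
// smoke_test.go: run with `go test`. A minimal sketch of scraping
// your own site to verify it still loads and contains what you expect.
package scrape

import (
	"io"
	"net/http"
	"strings"
	"testing"
)

func TestHomePageLoads(t *testing.T) {
	resp, err := http.Get("https://example.com/") // placeholder URL
	if err != nil {
		t.Fatal(err)
	}
	defer resp.Body.Close()

	// The page should still respond successfully...
	if resp.StatusCode != http.StatusOK {
		t.Fatalf("expected 200 OK, got %s", resp.Status)
	}

	// ...and still contain the content we expect to see.
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		t.Fatal(err)
	}
	if !strings.Contains(string(body), "Example Domain") { // placeholder phrase
		t.Error("expected content not found on page")
	}
}
```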

The negative association with web scraping is unfortunately due to some people with the bright idea of theft. Some scrapers gather written content to re-post in hopes of driving traffic to ads. Fortunately, this accounts for less and less web scraping activity. The web has come a long way with regard to digital rights, and search engines are getting better at detecting these kinds of sites.

The Alternative?

Yes, there’s actually an alternative to all this. It’s called the semantic web. It’s not a new idea, but unfortunately not many people adopted it. We’re starting to see a second attempt at it here with social media. Now that the web has a greater desire to share content, we’ve realized the need again. This time around we see things like Open Graph tags and other meta tag conventions.

I’m still holding out hope that we’ll start seeing more semantic data in web page meta tags. Unfortunately, the web just isn’t there yet.

How Web Scraping Works

It’s pretty simple really, but there are a few things to keep in mind. First, if a script is looking for just plain text, it can make a simple request and not load any JavaScript or images from the page. This results in a very fast request and is what most search engines do. This is why one must be careful when using JavaScript to dynamically load in content: search engines may not be indexing some of your content.
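
Here’s roughly what that kind of plain request looks like in Go; the URL is just a placeholder. Notice that only the raw HTML source comes back; nothing on the page is executed or rendered.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// One simple GET request: fast, because no JavaScript runs
	// and no images load; we only receive the raw HTML source.
	resp, err := http.Get("https://example.com/") // placeholder URL
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	html, err := io.ReadAll(resp.Body)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(html))
}
```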

However, with Node.js and tools like PhantomJS we can also load the full content of a web site, including the JavaScript. PhantomJS is a web browser that runs on the server, so we never see a window. We can still ask it for screenshots, though, which is why you’ll often see PhantomJS used for automated testing. This of course takes longer and requires more server power when gathering data.

The fun begins once content has been loaded from a web page. Now what? Well, it’s going to depend on the application. If we’re trying to get a summary to share on social media, then we can often use pattern matching to look for the HTML title element in the source code. We can also look at the meta tags in the source code for the description, author, and more, if present.
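
As a rough sketch of what that pattern matching looks like, here’s the idea in Go using the goquery package (which comes up again below); the HTML is a stand-in for a fetched page:

```go
package main

import (
	"fmt"
	"strings"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	// A trivial page standing in for real fetched HTML source.
	src := `<html><head>
		<title>An Example Post</title>
		<meta name="description" content="A short summary.">
		<meta property="og:title" content="An Example Post">
	</head><body></body></html>`

	doc, err := goquery.NewDocumentFromReader(strings.NewReader(src))
	if err != nil {
		panic(err)
	}

	// The title element is easy to pick out with a selector.
	fmt.Println("title:", doc.Find("title").Text())

	// Meta tags may or may not exist, so check the second return value.
	if desc, ok := doc.Find(`meta[name="description"]`).Attr("content"); ok {
		fmt.Println("description:", desc)
	}
	if og, ok := doc.Find(`meta[property="og:title"]`).Attr("content"); ok {
		fmt.Println("og:title:", og)
	}
}
```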

If the web page doesn’t have good meta tags, then you can see where it gets trickier. We no longer have a clear, dedicated place to look. If we wanted to get the author of a blog post, we would need to look for phrases like “written by” and such. This can involve a lot of natural language processing and get quite complex. Sometimes we can try a few matching rules and get by, but this requires a lot of trial and error.
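
For illustration, here’s one such naive matching rule in Go; the pattern and the sample text are invented, and plenty of real pages would defeat it, which is exactly the trial-and-error problem:

```go
package main

import (
	"fmt"
	"regexp"
)

// A naive byline heuristic: find "written by" or "posted by" followed
// by capitalized words. Real pages break rules like this constantly.
var byline = regexp.MustCompile(
	`(?i:written\s+by|posted\s+by)\s+([A-Z][\w.'-]*(?:\s+[A-Z][\w.'-]*)*)`)

func main() {
	text := "Posted by Jane Doe on March 3rd." // invented sample text
	if m := byline.FindStringSubmatch(text); m != nil {
		fmt.Println("author guess:", m[1])
	} else {
		fmt.Println("no author found")
	}
}
```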

The most important thing to understand is that what works on one page may not work on the next due to differences in content.

Why Go & Node.js?

Simple. Both are capable of making multiple requests for content at the same time, whereas a language like PHP would (for the most part) make them one by one. This makes these languages faster at gathering large amounts of data from the web.
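
As a sketch of what that concurrency looks like in Go, the standard library alone is enough; the URLs below are placeholders:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

func main() {
	// Placeholder URLs; any list of pages to scrape works the same way.
	urls := []string{
		"https://example.com/",
		"https://example.org/",
		"https://example.net/",
	}

	var wg sync.WaitGroup
	for _, u := range urls {
		wg.Add(1)
		go func(u string) { // each request runs in its own goroutine
			defer wg.Done()
			resp, err := http.Get(u)
			if err != nil {
				fmt.Println(u, "error:", err)
				return
			}
			resp.Body.Close()
			fmt.Println(u, resp.Status)
		}(u)
	}
	wg.Wait() // all requests are in flight at once, not one by one
}
```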

Second, PhantomJS is a headless web browser that lets us gather a page’s entire content, including what JavaScript renders. Go has something similar, but less mature, called WebLoop.

Last, both of these languages have DOM selector packages. This means code can be written to ingest HTML source and easily pick out the meta tags. Node.js has several, such as Cheerio and JSDom. Go has goquery (and probably some others).

Go’s Strengths

I honestly have to say that Go is just really good at pushing data around. Unlike Node.js, it doesn’t need a virtual machine to run the code; it compiles to machine code and just runs as a binary. This makes it faster in many ways, but more importantly it brings the application a bit closer to the machine. It provides the developer with more control over system resource usage and concurrency.

It’s fast to code in as well. Defining a structure for a web page really helps make sense of the data. Not that you couldn’t get a good sense of organization with Node.js, but it requires more discipline. Go works with data structures in a very clean way. This may sound odd given that the web contains unstructured data, but with a little bit of thought you can start to normalize that data a good bit.
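
For example, a scraped page might be normalized into a struct like this; the fields are just an assumption about what an application could care about:

```go
package main

import "fmt"

// Page is one way to normalize what comes out of a scrape.
type Page struct {
	URL         string
	Title       string
	Description string
	Author      string
}

func main() {
	p := Page{
		URL:         "https://example.com/post", // placeholder values
		Title:       "An Example Post",
		Description: "A short summary.",
		Author:      "Jane Doe",
	}
	fmt.Printf("%+v\n", p)
}
```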

All this means cheaper web scraping and a better ability to do it on a large scale. If your application doesn’t need to do a lot of web scraping, then many of Go’s strengths are not going to matter for you.

Node.js’ Strengths

Node.js is going to use more RAM when web scraping, that’s for sure. But what it lacks in resource efficiency it makes up for in accessibility. In spades. Remember, Node.js is JavaScript, and that means loading a web page that relies on JavaScript is far easier. Go has far less tooling for this.

Node has far more packages than Go because the internet went wild for it. Its ecosystem grew very quickly, and as such there are more tools for analyzing a web page. Not just more DOM parsers, but more tools for natural language processing as well.

I think Node.js has a lower barrier to entry as well. Hiring a developer for Node.js is likely to be cheaper than hiring a Go developer. Not by a lot, and much of that has to do with the rarity of Go developers, but it should still be cheaper. This means it might cost less to do your web scraping in Node.js despite it not being as efficient as Go when it comes to system resources. You may spend a little more on hosting a web server, but save so much in other areas of the process.

Bonus Option: Services!

Yes, there are some SaaS products out there for web scraping. Two come to mind here: Import.io and Kimono.

These services will not only turn unstructured data into structured data, but also leave you with a nice API for using that data. Depending on your needs, one of these may be more cost effective than building your own solution.
