Curl with NodeJS
If you are familiar with curl, the command line tool for transferring data with URL syntax, you might wonder how that can be done with Node.js. Looking at the Node.js API, it's clear that http.request gives you something that looks like a curl equivalent. But I want to write the response to a file and do some looping so that I can use an array of URLs instead of just making a request to one URL.
Looking around, I found a great Node.js module called request, which is a small abstraction on top of http.request and makes writing my script much easier. First, I added the request and filesystem (fs) modules to my script, plus the array of URLs I want to scrape. Then I wrote a for loop to iterate over the array, and inside the loop I put the request function, passing the file and url parameters, along with some console.log() calls to print what's happening:
As you can see from the output, only the second url and file were scraped. This is because I/O in Node.js is asynchronous: the loop finishes before any response callback runs, so every callback sees the last values of url and file. The fix is to pass the url and file from the loop to a separate function:
Another good Node.js module for doing curl-like web scraping is curlrequest. Of course, for some serious web scraping, I recommend using PhantomJS.