Grabbing HTML source code with PhantomJS or CasperJS
Have you ever tried to retrieve the source code of a page with PhantomJS or CasperJS? Sure, there is a method in the API to save the source to a file, but what if you want to work with it inside your script? One way to approach this is to work on the DOM with the page.evaluate method. After all, you can retrieve the source code of any page from the DOM by finding the HTML object. Here’s how to do it with PhantomJS or CasperJS.
As a first step, we evaluate the page, return the document object, and pass the result through JSON.stringify.
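A minimal sketch of that step, assuming PhantomJS's webpage module. The URL and the helper names getDocument and dumpDocument are placeholders chosen for illustration. Note that the PhantomJS documentation recommends returning only JSON-serializable values from evaluate, so how much of the document object survives the trip may vary between versions.

```javascript
// Runs inside the page context. PhantomJS serializes the return value
// before handing it back to the script.
function getDocument() {
    return document;
}

// Takes a PhantomJS-style page object (anything with an evaluate() method)
// and returns the document object as a JSON string.
function dumpDocument(page) {
    var doc = page.evaluate(getDocument);
    return JSON.stringify(doc);
}

// Only drive a real page when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http://example.com/'; // placeholder URL (assumption)

    page.open(url, function (status) {
        if (status === 'success') {
            console.log(dumpDocument(page));
        } else {
            console.log('Failed to load ' + url);
        }
        phantom.exit();
    });
}
```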
If you print this out, you will see a massive JSON representation of the document object. It is huge, but if you know how to parse and search through this JSON, it reveals the page source plus a lot more. document.all[0] contains the HTML object, and since we only need its outerHTML, which is already a string, there is no need to run it through JSON.stringify. document.all[0].outerHTML gives you the source code, and it works the same way in a PhantomJS or CasperJS script.
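Here is a rough sketch of the outerHTML approach, again assuming PhantomJS's webpage module. The URL and the helper names getOuterHTML and getSource are made up for this example. Keep in mind that outerHTML serializes the live DOM, so the result reflects any changes scripts have made and does not include the doctype.

```javascript
// Runs inside the page context. document.all[0] is the root <html> element.
function getOuterHTML() {
    return document.all[0].outerHTML;
}

// Takes any object with an evaluate() method and returns the page source.
// The result is already a string, so JSON.stringify is not needed.
function getSource(page) {
    return page.evaluate(getOuterHTML);
}

// Only drive a real page when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var page = require('webpage').create();
    var url = 'http://example.com/'; // placeholder URL (assumption)

    page.open(url, function (status) {
        if (status === 'success') {
            var source = getSource(page);
            console.log(source); // work with the string here
        } else {
            console.log('Failed to load ' + url);
        }
        phantom.exit();
    });
}
```

Because getSource only relies on an evaluate() method, the same helper should also accept a CasperJS instance inside a step, for example getSource(this) within casper.then(), though that usage is an assumption and has not been checked here.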
You can also grab cookie information (document.cookie), the browser console (document.defaultView.console), the browser history (document.defaultView.history), and a lot more.
So, what about external resources? After all, PhantomJS or CasperJS has to retrieve them in order to render the page, so they must exist somewhere. Consider CSS: if you run that document object JSON through an online JSON viewer, you will find CSS declarations from your external stylesheet(s). However, they are not collected under one key the way outerHTML holds the HTML source. Instead, you get the individual declarations as parsed by the browser, and it would take some painful loops to concatenate them back into one source file. Thus, if you need the source of an external resource, it is probably best to make another PhantomJS or CasperJS request and retrieve it that way, or to use a proxy that listens for these requests and saves them.
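The "make another request" option could be sketched roughly as follows. This assumes PhantomJS's webpage module and assumes that page.plainText returns the raw stylesheet text when a CSS URL is opened directly. The URL and the helper names listStylesheetUrls and fetchAll are hypothetical.

```javascript
// Runs inside the page context: collect the href of every external stylesheet.
function listStylesheetUrls() {
    var urls = [];
    for (var i = 0; i < document.styleSheets.length; i++) {
        var href = document.styleSheets[i].href;
        if (href) { // inline <style> blocks have no href
            urls.push(href);
        }
    }
    return urls;
}

// Opens each URL in turn with its own page and passes a map of
// url -> text (or null on failure) to onDone.
function fetchAll(createPage, urls, onDone) {
    var sources = {};

    function next(i) {
        if (i >= urls.length) {
            onDone(sources);
            return;
        }
        var p = createPage();
        p.open(urls[i], function (status) {
            sources[urls[i]] = (status === 'success') ? p.plainText : null;
            if (p.close) {
                p.close(); // release the page once its text has been read
            }
            next(i + 1);
        });
    }

    next(0);
}

// Only drive real pages when running under PhantomJS.
if (typeof phantom !== 'undefined') {
    var webpage = require('webpage');
    var page = webpage.create();
    var url = 'http://example.com/'; // placeholder URL (assumption)

    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Failed to load ' + url);
            phantom.exit(1);
            return;
        }
        var urls = page.evaluate(listStylesheetUrls);
        fetchAll(function () { return webpage.create(); }, urls, function (sources) {
            console.log(JSON.stringify(sources, null, 2));
            phantom.exit();
        });
    });
}
```

Fetching the stylesheets one at a time keeps the sketch simple. A proxy, as mentioned above, avoids the second round of requests entirely.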