Data, Maps, Usability, and Performance

Saving Website HTML, JS, and CSS with PHP

Last updated on July 16, 2012 in Development

Scraping websites with wget

Have you ever wanted to save an online website locally? In most browsers you can do this with a couple of clicks, but what if you wanted to do it programmatically? Maybe you want to save all the resources, not just the HTML page but also the external JavaScript and style sheets, and control where they are saved. When it comes to PHP, I generally think of web scraping via cURL, but in this situation I really need more flexibility and features. I don’t want to use regex to find external resources (or pull in an extra library like PHP Simple HTML DOM Parser), make multiple cURL calls in a loop, etc. So, it’s time to take a look at wget.

I have a list of websites that I would like to save locally, and I want their external CSS and JS living in one folder. wget has an overwhelming number of options, so this can be done pretty easily. Now, I still want to trigger this by hitting a URL, not from the command line, so I will use the PHP exec command.

In terms of the options, I am going to use -p --convert-links in order to save the HTML, JS, and CSS resources and convert the links in the HTML page so they reference the local copies. This leaves a messy directory structure, however, so I also use the -nd option to disable the creation of directories when retrieving recursively. Finally, -P downloads saves the website I just scraped into the downloads directory. Pretty neat.
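Put together, the command looks something like this (the target URL is just a placeholder, not one of the sites from my list):

```shell
# -p              grab page requisites: images, CSS, JS
# --convert-links rewrite links in the saved HTML to point at the local copies
# -nd             no directory hierarchy; everything lands in one folder
# -P downloads    save output under the "downloads" directory
wget -p --convert-links -nd -P downloads http://example.com/
```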

Here you can learn more about wget commands. And here is the code:
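A minimal sketch of the idea, wrapping the wget call in PHP's exec. The URL and the downloads directory are example values, not from the original post:

```php
<?php
// Example target; in practice this would come from my list of websites.
$url = 'http://example.com/';

// Build the wget command described above:
// -p downloads page requisites, --convert-links rewrites links to the
// local copies, -nd flattens the directory structure, and -P downloads
// puts everything into the downloads directory.
// escapeshellarg() keeps an untrusted URL from injecting shell syntax.
$cmd = 'wget -p --convert-links -nd -P downloads ' . escapeshellarg($url);

exec($cmd, $output, $status);

if ($status === 0) {
    echo "Saved $url into the downloads directory\n";
} else {
    echo "wget exited with status $status\n";
}
```

Since exec runs synchronously, looping this over a list of URLs scrapes the sites one at a time, each into the same folder.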

