
Grab Wikipedia pictures by API with PHP

Last updated on July 13, 2012 in Development


There are many examples online that show how to retrieve content from Wikipedia, but most of them focus on article text. What if you just want to grab the pictures? One reason you might: images on Wikipedia are generally freely licensed (public domain or Creative Commons), although each file's description page lists the exact terms and some licenses require attribution. They are also usually good quality and available in many sizes. Wikipedia has an API, and it is pretty simple to write a small PHP script that retrieves the images for a given page title.

If you search the API page for prop=images you will see that this returns all images contained on the given page(s). So, a query for Albert Einstein would look like this:

http://en.wikipedia.org/w/api.php?action=query&titles=Albert%20Einstein&prop=images&format=json

This returns the file titles of up to 10 images used on the Albert Einstein page, in JSON format. There are some parameters you can add: imlimit controls how many images are returned (the default is 10), imcontinue is used to page through the results, and imdir controls the sort direction (ascending or descending). For our example, we will keep most of the defaults but change imlimit to 5.
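As a rough sketch of how that query URL could be assembled in PHP, the snippet below builds it with http_build_query. The function name buildImagesQueryUrl is made up for illustration, and the sketch assumes the https form of the endpoint rather than the http form shown above.

```php
<?php
// Hypothetical helper (not from the original script): builds the
// prop=images query URL for a single page title.
function buildImagesQueryUrl(string $title, int $limit = 5): string
{
    $params = [
        'action'  => 'query',
        'titles'  => $title,
        'prop'    => 'images',
        'imlimit' => $limit,   // how many image titles to return
        'format'  => 'json',
    ];

    // PHP_QUERY_RFC3986 encodes spaces as %20 instead of "+".
    // Assumption: https endpoint; the article's example uses http.
    return 'https://en.wikipedia.org/w/api.php?'
        . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

echo buildImagesQueryUrl('Albert Einstein'), "\n";
```

Letting http_build_query do the encoding avoids hand-escaping titles that contain spaces or punctuation.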

A little information about the code. Since we are using PHP, I make the request with curl. Wikipedia requires a User-Agent header to be sent with the request; without it, the API returns this error: “Scripts should use an informative User-Agent string with contact information, or they may be IP-blocked without notice.”

You also need to replace spaces in the title with underscores. After the curl call, I decode the JSON response with PHP's json_decode function and then walk the decoded output to pull out just the image information.
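The two steps just described (a curl request with a User-Agent, then json_decode and a walk over the result) might look roughly like the sketch below. The function names, the User-Agent text, and the sample JSON are all illustrative assumptions; the sample is hand-written to mirror the general query/pages/images layout of the response rather than copied from a live call.

```php
<?php
// Hypothetical helper: GET a URL with an explicit User-Agent header.
function fetchWithUserAgent(string $url, string $userAgent): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return body instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);  // informative UA with contact details
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow an http -> https redirect
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('curl error: ' . curl_error($ch));
    }
    return $body;
}

// Hypothetical helper: collect the image titles from a prop=images response.
function extractImageTitles(string $json): array
{
    $data = json_decode($json, true);  // true => associative arrays
    $titles = [];
    foreach ($data['query']['pages'] ?? [] as $page) {
        foreach ($page['images'] ?? [] as $image) {
            $titles[] = $image['title'];
        }
    }
    return $titles;
}

// Hand-written sample in the shape of the API response (illustrative only).
$sampleJson = '{"query":{"pages":{"736":{"title":"Albert Einstein","images":['
    . '{"ns":6,"title":"File:Albert Einstein (Nobel).png"},'
    . '{"ns":6,"title":"File:Example.jpg"}]}}}}';

print_r(extractImageTitles($sampleJson));

// A live call would look something like this (placeholder contact address):
// $json = fetchWithUserAgent($url, 'MyImageScript/1.0 (me@example.com)');
```

Keeping the parsing separate from the network call makes it easy to test the parser against a saved response.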

Now, the image information in the response looks like this:

File:Albert_Einstein_(Nobel).png

This is only a page title, so it needs a domain and path in front of it. Adding those gives:

http://en.wikipedia.org/wiki/File:Albert_Einstein_(Nobel).png
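A minimal sketch of that step, assuming a made-up helper name filePageUrl and the https form of the domain: it swaps spaces for underscores and prepends the wiki path. Titles with other special characters would likely need proper URL encoding, which this sketch skips.

```php
<?php
// Hypothetical helper: turn a "File:..." title into its Wikipedia page URL.
function filePageUrl(string $fileTitle): string
{
    // Wikipedia page URLs use underscores where titles have spaces.
    $path = str_replace(' ', '_', $fileTitle);
    return 'https://en.wikipedia.org/wiki/' . $path;
}

echo filePageUrl('File:Albert Einstein (Nobel).png'), "\n";
```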

The only issue left is that this URL leads to the Wikipedia file page for the image rather than to the image itself. To get the image URL, we need another API call, one for each image title we grabbed. prop=imageinfo, combined with iiprop=url, returns the actual URL, so the call looks like this:

http://en.wikipedia.org/w/api.php?action=query&titles=File:Albert_Einstein_(Nobel).png&prop=imageinfo&iiprop=url&format=json

This gives another JSON response, which I decode and parse to grab the URL, and we are finally finished. So for 5 images you are making 6 API calls: one for the list and one per image. It’s too bad the image URL is not included in the original response; that would be nice and far more efficient.
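For illustration, the second call and its parsing could be sketched as follows. The helper names are invented, the endpoint is assumed to be the https one, and the sample response is hand-written with a placeholder URL (upload.example.org is not a real image host); it only mimics the query/pages/imageinfo layout.

```php
<?php
// Hypothetical helper: build the prop=imageinfo query for one file title.
function buildImageInfoUrl(string $fileTitle): string
{
    $params = [
        'action' => 'query',
        'titles' => $fileTitle,
        'prop'   => 'imageinfo',
        'iiprop' => 'url',      // ask only for the URL fields
        'format' => 'json',
    ];
    return 'https://en.wikipedia.org/w/api.php?'
        . http_build_query($params, '', '&', PHP_QUERY_RFC3986);
}

// Hypothetical helper: return the first image URL in the response, or null.
function extractImageUrl(string $json): ?string
{
    $data = json_decode($json, true);
    foreach ($data['query']['pages'] ?? [] as $page) {
        if (isset($page['imageinfo'][0]['url'])) {
            return $page['imageinfo'][0]['url'];
        }
    }
    return null;
}

// Hand-written sample with a placeholder URL (illustrative only).
$sampleInfoJson = '{"query":{"pages":{"-1":{"title":"File:Albert Einstein (Nobel).png",'
    . '"imageinfo":[{"url":"https://upload.example.org/placeholder.png"}]}}}}';

echo buildImageInfoUrl('File:Albert_Einstein_(Nobel).png'), "\n";
var_dump(extractImageUrl($sampleInfoJson));
```

In a full script, the list of titles from the first call would be looped over, with one buildImageInfoUrl request per title, which is where the 6 calls for 5 images come from.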

Here is the PHP code and demo:
