April 11, 2005
Download a Web Gallery with wget
Ok, let's skip the obvious jokes about downloading pr0n. People also look at galleries of photos that their friends upload to the web. Downloading an entire gallery by hand can be tedious and time-consuming. Enter wget, a great and free application that lets you download files off the web in an automated process. Give it a list of things to download and it will return a directory full of the files you wanted. The following tutorial covers two parts: using a Perl filter to extract the image paths from a web gallery, and downloading the images with wget.
Requirements and Installation
To use this tutorial, and get any mileage out of it, you'll need to download and install wget, and you'll need to be comfortable in the terminal to use it. You'll also need to own BBEdit or be able to modify this Perl script to stand alone.
The easiest way to install wget on a Mac is through Fink, which can be downloaded here. It even comes with a nice GUI for downloading and installing packages called FinkCommander. Installing Fink and downloading wget with it is simple but outside the scope of this article, so refer to the documentation.
Next, copy the script at the bottom of this page into a text file and save it to:
BBEdit > BBEdit Support > Unix Support > Unix Filters
You'll now have a new item in the #! menu of BBEdit under Unix Filters. To run the filter, open a text file, select all, and choose the menu item corresponding to the name of the file you just saved.
Part 1: Getting the Content Ready
First we need to find a page that lists the images and download its source. We'll use that to extract just what we want: the image paths. Some galleries have multiple index pages. In those cases you may want to paste the sources from the multiple pages into one text file, or download them and cat them together. But in this example, which uses a gallery here, there's a thumbnail frame that has everything in one place. Although the index page lists the thumbnails and not the full-size images, we're going to fix that with a little Perl. The script in this example is pre-configured to work with the photo galleries on this site, to serve as an example and help get you started. Although you could test it here, I'd rather you didn't use my bandwidth for the sake of an experiment. But feel free to look around to see why I assigned certain values to the configuration variables at the top.
I am using the $prepend variable to turn a relative path into an absolute one. wget needs absolute URLs, so this is helpful when you encounter a gallery that uses relative links. Because the script prepends $prepend to every match, set it to an empty string when the gallery already uses absolute URLs. The script also provides a simple find-and-replace, which I'm using to change the list of thumbnails into a list of the full-size images. In this case the gallery was created by Photoshop, and the file names are the same in the directories named "images" and "thumbnails." Therefore we can use the thumbnail frame to make a list of the full-size images by changing "thumbnails/" to "images/" in the path as we extract the locations. I used the $find and $replace variables to do this.
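To make the two rewrites concrete, here is a rough shell imitation of what the filter does to a single match. It is not the Perl filter itself, only the same prepend-then-replace steps applied to one path with sed. The three configuration values are copied from the script at the bottom of this page; the file name photo01.jpg is made up for illustration.

```shell
# Values copied from the configuration at the top of the Perl filter.
prepend='http://www.mcfarren.org/photos/latindesign/'
find='thumbnails/'
replace='images/'

# Hypothetical relative path, as it might appear in a src="..." attribute.
src='thumbnails/photo01.jpg'

# Step 1: put the base URL in front of the relative path.
# Step 2: swap the first occurrence of $find for $replace
# (no g flag, so only one substitution is made per path).
url=$(printf '%s\n' "$prepend$src" | sed "s#$find#$replace#")

echo "$url"
```

Run as written, this prints http://www.mcfarren.org/photos/latindesign/images/photo01.jpg, which is the kind of line the filter should leave in your text file.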
Part 2: Downloading the Images
Now that you've extracted the full paths of the images in the gallery, and changed the path from the thumbnails to the full images, you're ready to download them. You should have a text file with one URL per line. It's prudent to paste one into a web browser to check that everything worked before you ask wget to download the whole list.
If everything worked as planned you can save the list as a text file. I usually just call it list. Save it in the directory where you want the images to end up; that simplifies the command you use to call wget. Now open the terminal and change to the directory where you saved the list. Then you can tell wget to download each location named in the file. It wouldn't be a bad idea to read the wget man page and see all the options. In this example we're going to set a couple of switches to make it behave in a certain way:
$ wget -i list -nc --limit-rate=20k
The -i switch tells wget to read the URLs from the file named after it, in this example "list." The -nc switch means no clobber: if a file with the same name already exists in the directory, wget skips the download instead of fetching a second copy. This is helpful if you didn't test the list you made before sending it to wget. Without it, a mistake in the list might leave you with a folder full of index.html, index.html.1, index.html.2, etc. Each one would be an error or index page from the server instead of the image you were looking for. With -nc you'd just get one, and you'd save load on the server and bandwidth. The --limit-rate=20k switch is mostly self-explanatory: we are going to play nice and cap how fast we pull down the gallery, here at about 20 kilobytes per second.
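Alongside the browser spot check, a quick pattern match can flag obvious problems in the list. The following is only a sketch: check_list is a hypothetical helper name, the extensions are the ones the filter is configured for, and the sample URLs are invented so the example runs on its own.

```shell
# check_list FILE: print every line that is NOT an absolute http(s) URL
# ending in one of the image extensions the filter looks for.
check_list() {
    grep -viE '^https?://[^ ]+\.(jpg|gif|png)$' "$1"
}

# Demonstration on a throwaway sample (made-up URLs). The second line is
# deliberately left relative so the check has something to report.
sample=$(mktemp "${TMPDIR:-/tmp}/list.XXXXXX")
printf '%s\n' \
    'http://www.example.com/gallery/images/photo01.jpg' \
    'thumbnails/photo02.jpg' > "$sample"

bad=$(check_list "$sample")
echo "$bad"

rm -f "$sample"
```

On your own file you would run check_list against the saved list; no output means every line at least looks like an image URL. To probe a single URL without saving anything, wget's --spider option is another route worth reading about in the man page.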
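If a run does go wrong in the way described above, the stray pages are easy to spot afterwards. Here is a minimal sketch, assuming the downloads landed in a single directory; list_strays is a hypothetical helper name and the demo files are empty placeholders.

```shell
# list_strays DIR: print files in DIR named index.html, index.html.1, ...
list_strays() {
    find "$1" -maxdepth 1 -type f -name 'index.html*' | sort
}

# Demonstration in a throwaway directory with made-up file names.
demo=$(mktemp -d "${TMPDIR:-/tmp}/gallery.XXXXXX")
touch "$demo/index.html" "$demo/index.html.1" "$demo/photo01.jpg"

# Count the stray pages; tr strips the padding some wc versions add.
strays=$(list_strays "$demo" | wc -l | tr -d ' ')
echo "$strays stray page(s) found"

rm -rf "$demo"
```

In practice you would call list_strays on the directory that holds your list. Anything it prints is probably a page the server sent back in place of an image, which is a good hint to recheck the paths in the list.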
Conclusion
There's more than one way to do it, and I suppose I could have written a script in Perl to spider the website and do all the work without any research or configuration, but I see a few problems with that: 1) I'm too lazy and inexperienced with Perl to write it efficiently; 2) spider programs make it easy to put a heavy load on a web server because they take everything, not just what you want; and 3) there are probably many examples already out there. So although this may not be the easiest way, it's a good compromise, it's polite to the host that your friend used to serve their gallery, and it's customizable enough for use with different gallery types.
#!/usr/bin/perl -w
#
# Used as a unix filter in BBEdit to extract image paths
# from html into a list for downloading with wget
#
# Written by Joshua McFarren 4/11/05
# This work is licensed under a Creative Commons License.
# http://creativecommons.org/licenses/by-sa/2.0/

use strict;

my $leader = 'http://';
my $extensions = 'jpg|gif|png';
my $prepend = 'http://www.mcfarren.org/photos/latindesign/';
my $find = 'thumbnails/';
my $replace = 'images/';

my $input = "";
while (<>) { $input .= $_; }

while ( $input =~ m#src=["'](($leader)?([^/ ]+/)+[^.]+\.($extensions))#gi ) {
    my $temp = $prepend . $1;
    $temp =~ s#$find#$replace#;
    print "$temp\n";
}
This work is licensed under a Creative Commons License.
Posted by joshua at April 11, 2005 8:25 PM
CONFIGURATION FOR OPHOTO:
my $leader = 'http://';
my $extensions = 'jpg|gif|png';
my $prepend = '';
my $find = '_SM';
my $replace = '_ALB';
Posted by: Joshua McFarren at April 14, 2005 12:58 PM