TCM

Figure Extraction

From time to time it is useful to be able to extract figures from documents. Beware of copyright: doing so is not invariably legal (nor is it invariably illegal). Contacting the original author for an original copy may, in many cases, be the best answer to the legal and technical issues.

This page describes some hints and tips for some common scenarios.

Original is PDF

The GUI vector drawing package Inkscape can open PDF reasonably reliably, and extract objects from a PDF file. It can be launched as inkscape from the command line, or is probably on a menu.

Currently (2024) it is probably best to choose the Poppler/Cairo import option, rather than the native Inkscape PDF import. One may first need to ungroup objects (select them, right click, ungroup), then perhaps move the bits you want off the canvas, delete everything left with a rectangular selection, and then press Shift+Ctrl+R to resize the page area around the remaining content. This can then be saved as SVG or PDF, or exported to PNG.

Original is PDF, avoiding Inkscape

PDF, like PostScript, can contain both vector and bitmap figures. The bitmap figures may be stored with DCT (JPEG) compression, or lossless compression. If the figure is stored as a bitmap, the program pdfimages -j will extract it. Typical usage is:

pdfimages -j -f 5 -l 6 file.pdf images
to extract all images from page 5 to page 6 in file.pdf and save them in files called images followed by a number and the extension .jpg, .pbm or .ppm. Omit both the -f and -l to work on the whole document.

Assuming one wants EPS files, bmp2eps will convert all three of these formats to EPS efficiently. It is very unlikely that one wants to keep .pbm or .ppm files, as these are uncompressed and therefore huge. Converting them to .png (pnmtopng) or .gif (ppmtogif) would be a good idea.
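
As a sketch of that last step, the following shell function converts any .pbm or .ppm files it is given to .png, leaving the originals in place. The function name and the PNM_CONVERTER variable are invented for this illustration; pnmtopng is assumed to be on the path and to write the PNG to standard output.

```shell
# Convert each .pbm/.ppm argument to a .png alongside it.
# PNM_CONVERTER is a hypothetical override; the default is pnmtopng.
convert_pnm_to_png() {
    converter=${PNM_CONVERTER:-pnmtopng}
    for f in "$@"; do
        case $f in
            *.pbm|*.ppm)
                out="${f%.*}.png"
                # The converter reads the named file and writes to stdout.
                if ! "$converter" "$f" > "$out"; then
                    echo "failed to convert $f" >&2
                    rm -f "$out"
                fi
                ;;
            *)
                # Anything else (e.g. .jpg) is already compressed: skip it.
                ;;
        esac
    done
}
```

After the pdfimages command above, one might run convert_pnm_to_png images-* and then delete the .pbm and .ppm files once the results have been checked.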

If the original figure is in vector format, life is much harder. The simplest solution is to save the single page from Acroread as PostScript (print to file), and then try the other PostScript section below.

Original is PostScript

If the PostScript is generated by some sane system, such as LaTeX, and the figures are sanely included as EPS files, this is really easy.

One can load the PostScript in one's favourite editor, and search down to find the desired page. It will start with a line beginning:

%%Page
followed by its page number. One can then search on for a line beginning
%!PS-Adobe-x.0 EPSF-y.0
with various possibilities for 'x' and 'y'. This is the start of an EPS file embedded in the PostScript. One can delete everything before this point. One then needs to find the end of the EPS file. Lines starting
%%EOF
or
%%EndDocument
are likely candidates: strictly EOF will be the last line of the EPS file, whereas EndDocument is the first line after the end. Delete from this point to the end of the document, and the job is done.

One caveat: EPS files may have other EPS files embedded in them (especially if created with xfig). In this case, make sure that one finishes the cut at the correct point!
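
The manual procedure above can be sketched as a small shell function. It assumes that the file follows the usual structuring conventions, so that any EPS file nested inside the wanted one is bracketed by %%BeginDocument and %%EndDocument lines, and that the third field of a %%Page line is the page's ordinal number. The name extract_eps is invented here.

```shell
# Print the first EPS file embedded on a given page of a PostScript file.
# Usage: extract_eps PAGE file.ps > figure.eps
extract_eps() {
    awk -v want="$1" '
        copying {
            # Inside the EPS: count nested documents so that only the
            # %%EndDocument matching the outermost EPS stops the copy.
            if ($0 ~ /^%%BeginDocument/) {
                depth++
            } else if ($0 ~ /^%%EndDocument/) {
                if (depth == 0) exit
                depth--
            }
            print
            next
        }
        # Outside any EPS: note whether we are on the wanted page.
        /^%%Page:/ { on_page = ($3 == want); next }
        # The EPS header line marks the start of the figure.
        on_page && /^%!PS-Adobe-[0-9.]+ EPSF-[0-9.]+/ { copying = 1; print }
    ' "$2"
}
```

If the embedded EPS is not followed by a %%EndDocument line, this will copy to the end of the file, and the tail will still need trimming by hand. Files with DOS or Mac line endings would need converting first.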

If the images required are in fact bitmaps, then the psimages command may well extract them as PNG and JPG files.

If this method does not work, try the other PostScript section below, or seek assistance from someone who speaks PostScript.

Original is a nasty form of PostScript

The old solution at this point involves a laser printer and a scanner. This is silly, as one ends up with images of fingerprints, squashed flies, and electrical noise induced in the scanning circuitry by the nearest experimentalist listening to Q103. It is better to do the whole process computationally.

First move to the /scratch disk: this process will create large files which you don't want in your home directory. It will be faster on a local disk too.

Then reduce the PostScript file to a single page: using gv and "Save Marked Pages" is an obvious strategy.

The scanning step is to use eps2gif to convert the PostScript to a bitmap. eps2gif will cope with single-page PostScript files (if they contain a bounding box), and will write formats other than GIF. A possible command line would be:

eps2gif -res 2048 -ppm file.ps > temp.ppm
The resolution specified is the horizontal resolution: the corresponding vertical resolution will be calculated automatically. Antialiasing (with 2x2 oversampling) will happen automatically. One should probably avoid the GIF (limited to 256 colours) and JPEG (lossy compression) output formats at this point.
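
To get a feel for how large the result will be before running the conversion, one can estimate the output size from the bounding box. The sketch below assumes that the figure given to -res is the width of the output in pixels, which is an interpretation of the text above rather than something checked against eps2gif itself, and that the PPM uses three bytes per pixel. The function name is made up for this example.

```shell
# Estimate the bitmap produced for a given horizontal size in pixels.
# Usage: ps_bitmap_size WIDTH file.ps
# Prints: width height approximate-PPM-bytes
ps_bitmap_size() {
    awk -v w="$1" '
        # Take the first %%BoundingBox with numbers (skipping "(atend)").
        /^%%BoundingBox:/ && $2 ~ /^-?[0-9]/ {
            bw = $4 - $2          # box width in points
            bh = $5 - $3          # box height in points
            if (bw <= 0 || bh <= 0) exit 1
            h = int(w * bh / bw + 0.5)   # keep the aspect ratio
            printf "%d %d %d\n", w, h, w * h * 3
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    ' "$2"
}
```

For a file whose bounding box is a full A4 page (0 0 595 842), ps_bitmap_size 2048 file.ps reports a 2048 by 2898 image of roughly 17MB. A non-zero exit status indicates that no usable bounding box was found.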

The resulting enormous image can then be loaded into xv or the gimp or some similar program and cropped as required. Unfortunately xv is very poor at writing EPS files, and the gimp is not always optimal, so it may be best to save as some form of bitmap, and to use bmp2eps to convert to EPS if required.

Remember that if the final document is to be viewed on screen, then resolutions above 1600 are probably unnecessary. If it is intended to end up on a WWW page, then a lower resolution is likely to be better. If it is to be printed, and it takes up a significant amount of an A4 page, then higher resolutions might be a good idea. Real perfectionists might try to trim off textual annotations, and re-insert them in a vector form using something like xfig, as text tends to suffer most obviously from being converted to a bitmap.

Original is DOS/Windows EPS(I)

Certain Windows programs, including CorelDraw!, like including binary previews in their EPS output. To make this saner, one needs to extract using:

dosepsi2eps < old.eps > new.eps
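
Should dosepsi2eps not be to hand, the same effect can probably be had by hand. The sketch below does not claim to reproduce what dosepsi2eps does internally; it relies on the documented layout of the DOS EPS binary header, which is 30 bytes long, starts with the bytes C5 D0 D3 C6, and then gives the offset and length of the PostScript section as little-endian 32-bit integers. The function names are made up for this example.

```shell
# Read a 4-byte little-endian unsigned integer.
# Usage: le32 file byte-offset
le32() {
    bytes=$(od -An -tu1 -j "$2" -N 4 "$1")
    set -- $bytes     # split into the four byte values
    echo $(( $1 + $2 * 256 + $3 * 65536 + $4 * 16777216 ))
}

# Write just the PostScript section of a DOS EPS file to stdout.
# Files without the binary header are passed through unchanged.
strip_dos_eps_header() {
    magic=$(od -An -tx1 -N 4 "$1" | tr -d ' \n')
    if [ "$magic" != "c5d0d3c6" ]; then
        cat "$1"
        return
    fi
    start=$(le32 "$1" 4)      # bytes 4-7: offset of the PostScript
    length=$(le32 "$1" 8)     # bytes 8-11: length of the PostScript
    # tail -c +N counts from 1, hence the +1.
    tail -c +"$((start + 1))" "$1" | head -c "$length"
}
```

Usage would then be strip_dos_eps_header old.eps > new.eps. Reading the bytes one at a time keeps the arithmetic independent of the byte order of the machine doing the work.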

One may also wish to remove DOS-style line endings. Beware that the source is likely to contain a mixture of DOS, UNIX and Mac style line endings, so deleting all carriage-returns will not work. It should be possible to use dos2unix to change DOS's carriage-return/new-line sequence to simply new-line, then

tr '\r' '\n' < new.eps > newer.eps
to change the Mac style line endings to UNIX style.

(If one leaves Mac-style line endings in the file, it is hard to manipulate with those UNIX editors which show that section just as a single line.)
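
The two steps can be combined into a single pipeline. This sketch uses sed in place of dos2unix to strip the carriage-return from each carriage-return/new-line pair, then tr as above to deal with whatever carriage-returns remain; the function name is invented for this illustration.

```shell
# Convert mixed DOS, Mac and UNIX line endings on stdin to UNIX style.
normalise_eol() {
    cr=$(printf '\r')
    # Step 1: drop the CR from every CR/LF pair (DOS style).
    # Step 2: turn any remaining lone CR (Mac style) into LF.
    sed "s/$cr\$//" | tr '\r' '\n'
}
```

Usage would be normalise_eol < new.eps > newer.eps. The order matters: running tr first would turn each DOS line ending into two new-lines, doubling the line spacing.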

The final word

You should, of course, acknowledge your sources.