PDF Explained
An Overview
PDF, or Portable Document Format, is a file format invented by Adobe. Adobe also produce free viewing and printing software for it (acroread), and charged-for writing software (distiller). Other readers and writers exist, such as xpdf for reading, and ghostscript / ghostview for both reading and writing.
PDF borrows much from the structure of PostScript, although PDF is very much simpler, and not designed to be human writable. PDF contains none of PostScript's flow control structure (no loops, no conditional execution), and it does usually contain binary data. Like PostScript, it is a resolution-independent description language. It is very closely based on the intermediate "display lists" that PostScript is traditionally converted to before the final rasterisation.
The significant advantage of PDF over PostScript is the ability to embed hyperlinks in the text, so one can click on items in a table of contents, or index, or simply in the main text, and go directly to the referenced item. The file format makes this particularly easy for viewers to achieve, with a clear structure of which objects are needed for which pages, and a table of byte offsets from the beginning of the file for all objects. This cross-reference table is stored at the end of a PDF file, which means that truncated files are generally completely unreadable: most viewers read this table first. It also makes PDF files hard to edit by hand, for if one changes the length of any object, all subsequent objects need their entries in the cross-reference table updated. Similarly, adding or removing whitespace shifts these byte offsets and so will generally destroy a PDF file.
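The fragility described above can be seen in a few lines of Python. This is only a sketch: it relies on the `startxref` keyword and `%%EOF` marker that close a classic PDF file (details of the format not spelled out in the text above), and the function names and the tiny hand-built file are invented for illustration. Files using the cross-reference streams introduced in PDF 1.5 will not have a plain `xref` keyword at the recorded offset, so the second check does not apply to them.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last 'startxref' keyword.

    Only the final 1024 bytes are searched, on the assumption that the
    trailer sits at the very end of the file.
    """
    tail = data[-1024:]
    matches = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not matches:
        raise ValueError("no startxref / %%EOF found: file may be truncated")
    return int(matches[-1])


def xref_is_where_claimed(data: bytes) -> bool:
    """True if the recorded offset really points at an 'xref' keyword."""
    offset = find_startxref(data)
    return data[offset:offset + 4] == b"xref"


# A tiny hand-built file with one object, for illustration only.
body = b"%PDF-1.2\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
table = (
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"  # object 1 starts 9 bytes into the file
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
)
pdf = body + table + b"startxref\n" + str(len(body)).encode() + b"\n%%EOF\n"

# Inserting a single space after the header shifts everything that follows,
# so the offset recorded after 'startxref' no longer points at the table.
edited = pdf[:9] + b" " + pdf[9:]
```

With these definitions, `xref_is_where_claimed(pdf)` holds for the intact file but not for `edited`, and `find_startxref(pdf[:-20])` raises `ValueError` because the truncated file has lost its trailer.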
PDF versions
Currently (2007), eight versions of PDF have been defined.
Date | Version | Minimum Acroread version |
1993 | 1.0 | 1 |
 | 1.1 | 2 |
 | 1.2 | 3 |
1999 | 1.3 | 4 |
2001 | 1.4 | 5 |
2003 | 1.5 | 6 |
2004 | 1.6 | 7 |
2007 | 1.7 | 8 |
Version 1.1 added support for binary data, 1.2 support for flate (zlib) encoding, 1.3 gradient fills, and 1.4 added transparency (amongst other things in all cases). Since 1993 there has been just one new version of PostScript, and one of LaTeX: PDF, like MS Word, is still struggling towards stability and maturity.
Acroread reached its peak of cross-platform support with version 3, which was available for AIX, HP-UX, Irix, Linux (IA32), MacOS, OS/2, Solaris, Tru64, Win16 and Win32. Version 4 dropped support for OS/2 and Win16, version 5 dropped support for Irix and Tru64. Version 6 dropped AIX, HP-UX, Linux and Solaris. Acroread 7 dropped MacOS pre-X, but regained Linux and Solaris. Therefore one should avoid generating PDF versions newer than 1.2 unless one is gaining something from the extra features of 1.3 or 1.4. The initial version of Acroread was not freely distributable.
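To check which version a given file claims to be, one can look at its first line, which has the form `%PDF-1.x`. The following is a minimal sketch, assuming the header sits at the very start of the file; the function names are invented for illustration. Note that from PDF 1.4 onwards a `/Version` entry in the document catalog may override the header, and this sketch does not look for it.

```python
import re


def pdf_version(data: bytes) -> tuple:
    """Return (major, minor) from a '%PDF-x.y' header at the start of data."""
    match = re.match(rb"%PDF-(\d)\.(\d)", data)
    if match is None:
        raise ValueError("no %PDF-x.y header at start of file")
    return int(match.group(1)), int(match.group(2))


def newer_than(data: bytes, limit: tuple = (1, 2)) -> bool:
    """True if the header claims a version newer than `limit`."""
    return pdf_version(data) > limit
```

For a real file one would pass in the first few bytes, for example `newer_than(open("paper.pdf", "rb").read(16))`, where `paper.pdf` is a placeholder name.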
Acroread is distributed by Adobe, and other PDF viewers include Xpdf (requires X windows, runs on UNIX, VMS and OS/2), gv (UNIX) and GSview (MS Windows, Linux).
The current PDF standard, 1.7, runs to over 1200 pages: far too big!
PDF and Fonts
A PDF file may include fonts, or may request fonts and assume that the reader will have its own copy of those fonts. In this regard it is just like PostScript. Unlike PostScript, there is a well-defined mechanism for font substitution: the reader will always manage to find some font, and will not simply crash because it does not have the correct font. However, documents using unusual fonts should embed those fonts within themselves, otherwise the spacing of letters will look odd, and, for symbol fonts, garbage will result anyway. Most versions of Acroread include Arial, Courier and TimesNewRoman in normal, italic, bold, and bold italic, as well as Symbol and ZapfDingbats - 14 fonts in total, and fewer than the 35 which are 'standard' in PostScript.
(Adobe's current position is that all fonts should be embedded, but readers must be able to perform reasonable substitutions for these 14 for compatibility with the days when Adobe recommended that these fonts should not be embedded.)
Encapsulated PDF
No such thing exists. It is often possible for a single-page PDF file to be included as a figure in a larger PDF document (as shown by various forms of PDF LaTeX or dvipdf(m)). However, there is no clear specification for what a single-page PDF file should do in order to be friendly in this fashion. The bmp2eps program, when asked to produce PDF, does its best in this regard.
PDF to/from PostScript conversion
PDF to PostScript is clearly possible: how else could a PDF document be printed on a PostScript printer? PostScript to PDF usually works reasonably too. However, these are not good ways of generating PostScript or PDF in general, and oscillating the same document between PS and PDF is likely to cause trouble (and excessive file size). Rescuing embedded EPS figures from original PostScript is much easier than rescuing them from PDF or PostScript generated from PDF.
The program ps2pdf, part of the Ghostscript suite, does a good job of converting PS to PDF. By default it does not embed the 14 "standard" fonts. If embedding these is required, try adding `-dPDFSETTINGS=/printer' to the command line. With skill, ps2pdf can be persuaded to do pretty much anything Adobe's commercial Distiller can do.
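For batch conversion, the ps2pdf invocation can be wrapped in a short script. This is a sketch only: it assumes ps2pdf is on the PATH, the function names are invented for illustration, and the option is placed before the file names as ps2pdf expects.

```python
import subprocess


def ps2pdf_command(ps_path: str, pdf_path: str,
                   embed_standard_fonts: bool = True) -> list:
    """Build the argument list for ps2pdf without running it."""
    cmd = ["ps2pdf"]
    if embed_standard_fonts:
        # The setting suggested in the text for embedding the 14 standard fonts.
        cmd.append("-dPDFSETTINGS=/printer")
    cmd += [ps_path, pdf_path]
    return cmd


def convert(ps_path: str, pdf_path: str,
            embed_standard_fonts: bool = True) -> None:
    """Run ps2pdf; raises CalledProcessError if it exits with an error."""
    subprocess.run(ps2pdf_command(ps_path, pdf_path, embed_standard_fonts),
                   check=True)
```

Keeping the command construction separate from the call that runs it makes it easy to inspect exactly what will be executed before converting anything.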
Beware! Distiller likes to nudge one towards producing "screen resolution" PDF files, at which point it may downsample images, producing a small file, but one which prints very poorly.
Figure extraction
See the plagiarism page.
Please send comments/corrections to mjr19.