archiva.net · Paula Petrik

Electronic Researcher · Section

Cameras.

Notes on document-capture cameras for academic research — what to look for in an archive-day camera, the lighting and stand setup that makes the images actually usable, and the workflow from raw capture through OCR.

The problem

For most of the twentieth century, archive research meant either taking notes by hand or paying for photocopies. Both had limits: the notes were only as good as your handwriting and your patience, and the photocopies were expensive, often forbidden, and limited to whatever the archive's single-feed copier could handle. The arrival of usable consumer digital cameras in the early 2000s changed the economics: suddenly you could capture two hundred pages in a morning, keep the images for as long as you needed them, and read the documents at home in better light than the reading room allowed.

The transition was uneven. Many archives took years to formalize permission policies. Many cameras of the period produced images too low-resolution for OCR. The lighting in archive reading rooms is almost universally bad for photography. And there is a substantial gap between "captured the page" and "produced a file you will be able to use ten years from now."

What to look for

  • Resolution sufficient for OCR. A page image captured at the equivalent of 300 DPI is usually OCR-readable. For a typical 8.5" × 11" document, that means 2,550 × 3,300 pixels (about 8.4 megapixels) of actual captured page area. The camera's nominal megapixel count is not the figure that matters, because it includes whatever else is in the frame.
  • Manual focus, manual exposure. Auto-focus and auto-exposure both produce inconsistent captures across a stack of similar pages. The captures need to be consistent for batch OCR.
  • Silent or quiet shutter. Reading rooms are quiet; clattering shutters draw stares.
  • Battery life. An archive day is six to eight hours of continuous low-volume capture. Spare batteries are essential.
  • Tethering or fast-card workflow. Either shoot tethered to a laptop and review immediately, or shoot to a fast SD card and dump-and-review at lunch. Discovering at home that fifty pages were out of focus is the worst outcome.
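
The resolution arithmetic in the first item can be checked with a few lines of Python. This is a minimal sketch, not a calibration tool: the function names are made up for illustration, and the 4,800-pixel figure is a hypothetical example of a page whose long edge fills 80% of a 6,000-pixel-wide frame.

```python
def pixels_needed(width_in, height_in, dpi=300):
    """Pixels the page itself must occupy to reach the target DPI."""
    return round(width_in * dpi), round(height_in * dpi)


def effective_dpi(page_px, page_in):
    """DPI actually achieved along one edge of the page.

    page_px: how many pixels that edge of the page spans in the image
    page_in: the physical length of that edge in inches
    """
    return page_px / page_in


# Letter-size page at the 300 DPI target.
print(pixels_needed(8.5, 11))          # (2550, 3300)

# Hypothetical capture: the 11-inch edge of the page spans 4,800 pixels.
print(round(effective_dpi(4800, 11)))  # 436
```

The useful number is the second one: measure how many pixels the page actually spans in a test shot, divide by the page's size in inches, and compare against 300 before committing to a setup for the day.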

Lighting and stand setup

Archive reading rooms typically have overhead fluorescents designed for reading paper, not photographing it. The consequences for captured images: glare on glossy or coated-paper documents, a green-yellow color cast, and variable exposure across the page. The mitigations:

  • A small folding copy stand if the archive permits it. Most do not.
  • Failing a stand, set the camera to manual exposure calibrated against a sample page at the start of the session, and recheck every twenty minutes.
  • Set white balance manually on a neutral page (the white margin of a typical document) rather than leaving it on auto.
  • Shoot RAW where the camera supports it; the color correction at home is much easier from RAW than from JPEG.
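
To make the white-balance step concrete, here is a rough sketch of what a correction against a neutral sample does, whether the camera does it or you do it at home. It assumes you have already averaged the red, green, and blue values over a patch of white margin; the function names and the sample values are hypothetical, and real RAW converters work on linear sensor data with more care than this.

```python
def white_balance_gains(patch_mean_rgb):
    """Per-channel multipliers that make a neutral patch come out gray.

    Green is used as the reference channel, so its gain is 1.0 and
    red and blue are scaled to match it.
    """
    r, g, b = patch_mean_rgb
    return (g / r, 1.0, g / b)


def apply_gains(pixel_rgb, gains, max_value=255):
    """Scale one RGB pixel by the gains, clipping at the channel maximum."""
    return tuple(min(max_value, round(c * k)) for c, k in zip(pixel_rgb, gains))


# Hypothetical margin sample with a green-yellow cast: green high, blue low.
margin = (200, 220, 180)
gains = white_balance_gains(margin)

# After correction the margin itself becomes neutral.
print(apply_gains(margin, gains))  # (220, 220, 220)
```

The same gains are then applied to every pixel in the image. This is also why RAW is easier to correct: a JPEG has already had a white balance baked in and its values compressed, so the multiplication is only approximate there.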

Workflow from capture to OCR

  1. Capture the document at the highest resolution and best available format (RAW or fine JPEG).
  2. At home, dump the files to a working directory named for the archive, the box, the folder, and the date.
  3. Crop and rotate so each image is a clean rectangle containing only the page.
  4. Convert to TIFF or PDF for the OCR pass.
  5. Run OCR (ABBYY FineReader, Acrobat, or Tesseract for the open-source workflow). Always retain the page image alongside the OCR output — the OCR will be wrong on at least some pages, and you will need to check it against the original.
  6. File the output in a citation-aware way, with the archive, box, and folder recorded alongside each file, so you can re-find it from a footnote a year later.
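
Steps 2 and 5 can be sketched in Python. This is one possible naming scheme, not a standard: the helper names, the underscore-and-hyphen layout, and the archive name are all assumptions made for illustration. The Tesseract command is only built here, not run; it relies on Tesseract's command-line convention of taking an input image and an output base name and writing a text file with that base name.

```python
import re
from datetime import date
from pathlib import Path


def slug(text):
    """Lowercase, with runs of non-alphanumeric characters turned into hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def working_dir(root, archive, box, folder, day):
    """Directory named for the archive, the box, the folder, and the date."""
    name = "_".join([
        slug(archive),
        f"box-{slug(box)}",
        f"folder-{slug(folder)}",
        day.isoformat(),
    ])
    return Path(root) / name


def tesseract_command(image, lang="eng"):
    """Argument list for one OCR pass on one page image.

    The output base is the image path without its extension, so the
    text file lands next to the page image it was read from.
    """
    image = Path(image)
    outbase = image.with_suffix("")
    return ["tesseract", str(image), str(outbase), "-l", lang]


# Hypothetical archive, box, folder, and visit date.
target = working_dir("research", "Example State Archives", "12", "3", date(2024, 5, 14))
print(target.name)

print(tesseract_command("p001.tif"))
```

Keeping the text output beside the image under the same base name means the page is always at hand when the OCR needs checking, and the directory name carries enough of the citation to get back from a footnote to the file. To actually run the OCR pass, the list could be handed to `subprocess.run`, assuming Tesseract is installed.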

Permissions

Always confirm the archive's photography policy before you arrive. Many archives that did not permit photography ten years ago now do, and many that do permit it require registration of the camera serial number, an affidavit on intended use, or a flat per-day fee. The reading-room staff will tell you. Do not assume.