Distant Reading and Middle Reading

Cameron Blevins has a very interesting post about an approach he calls “middle reading.” He has collaboratively developed ImageGrid, a program that places a grid overlay on historical documents (such as newspapers) so that regions of interest can be identified and categorized, which helps establish context for the text in those sections. He explains:

The program tries to tackle one of the fundamental problems facing many digital humanists who analyze text: the gap between manual “close reading” and computational “distant reading.” In my case, I was trying to study the geography within a large corpus of nineteenth-century Texas newspapers. First I wrote Python scripts to extract place-names from the papers and calculate their frequencies. Although I had some success with this approach, I still ran into the all-too-familiar limit of historical sources: their messiness. Namely, nineteenth-century newspapers are extremely challenging to translate into machine-readable text. When performing Optical Character Recognition (OCR), the smorgasbord nature of newspapers poses real problems. Inconsistent column widths, a potpourri of advertisements, vast disparities in text size and layout, stories running from one page to another – the challenges go on and on and on. Consequently, extracting the word “Havana” from OCR’d text is not terribly difficult, but writing a program that identifies whether it occurs in a news story versus an advertisement is much harder. Given the quality of the OCR’d text in my particular corpus, deriving this kind of context proved next-to-impossible.
[…] We realized that this “middle-reading” approach could be readily adapted not just to my project, but to other kinds of humanities research. A cultural historian studying American consumption might use the program to analyze dozens of mail-order catalogs and quickly categorize the various kinds of goods – housekeeping, farming, entertainment, etc. – marketed by companies such as Sears-Roebuck. A classicist could analyze hundreds of Roman mosaics to quantify the average percentage of each mosaic dedicated to religious or military figures and the different colors used to portray each one.
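
To make the idea concrete, here is a minimal Python sketch of what a grid-based “middle reading” tally might look like. It is not ImageGrid's actual code: the grid labels, page dimensions, word coordinates, and function names are all hypothetical, chosen only to illustrate how a hand-labelled grid could supply the context (news story versus advertisement) that OCR alone struggles to provide.

```python
from collections import Counter

# Hypothetical hand-labelled grid for one page: each cell holds a category.
GRID = [
    ["news", "news", "ad"],
    ["news", "ad",   "ad"],
]
PAGE_WIDTH, PAGE_HEIGHT = 300, 200  # assumed page size in pixels

# Hypothetical OCR output: (word, x, y), where (x, y) is the word's centre.
WORDS = [("Havana", 40, 30), ("Havana", 250, 150), ("Galveston", 120, 60)]
PLACE_NAMES = {"Havana", "Galveston"}


def cell_category(x, y, grid=GRID, width=PAGE_WIDTH, height=PAGE_HEIGHT):
    """Return the category of the grid cell containing point (x, y)."""
    rows, cols = len(grid), len(grid[0])
    col = min(int(x * cols / width), cols - 1)   # clamp points on the right edge
    row = min(int(y * rows / height), rows - 1)  # clamp points on the bottom edge
    return grid[row][col]


def area_share(grid=GRID):
    """Fraction of the page's cells assigned to each category."""
    cells = [category for row in grid for category in row]
    return {cat: n / len(cells) for cat, n in Counter(cells).items()}


def place_counts_by_category(words=WORDS):
    """Count each place-name separately for each category it appears in."""
    return Counter(
        (word, cell_category(x, y)) for word, x, y in words if word in PLACE_NAMES
    )


if __name__ == "__main__":
    print(area_share())
    print(place_counts_by_category())
```

With a grid like this, a mention of “Havana” can be counted separately depending on whether it falls in a cell marked as news or as advertising, and the share of the page given to each category falls out of the same labels.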

This work, whose code is openly available for adaptation and use, has immediate applications as well as longer-term implications for thinking about how to structure and identify contextual frames. It seems akin to the supports described in “A framework for contextual information in digital collections” by Christopher A. (Cal) Lee, an excellent article that “sets out to investigate the meaning, role and implications of contextual information associated with digital collections.” Tools and techniques for middle reading seem to have a role both in immediate, specific application to scholarly projects with already identified texts and as a scaffold or framework for the contextual supports that digital collections used in scholarship need more generally.

Blevins mentions that he uses Texas newspapers in his work, and the University of North Texas Libraries (which do absolutely amazing work on digital collections and initiatives in support of public access and scholarship) host the Texas Digital Newspaper Program, so there could be some very exciting future collaborations on ImageGrid or other work.