January 3, 2007

[CSG Winter 2007] MBooks at University of Michigan

- Project partnership with Google publicly announced in December 2004 - scanning 7 million print volumes over 4-6 years. Direct scanning costs are borne by Google.

UM receives a copy of all digital files, including OCR and metadata, which can be used to build services. UM can share, with some restrictions. Each volume page produces 2.01 files on average - will be about 2.2 billion files, 380 TB of data. Sustained rate of 3.16 MB per second for four years.
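The sustained-rate figure can be checked with a little arithmetic. This is a rough sketch, not something from the talk: it assumes the 380 TB is counted in binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes) and that a year is 365.25 days. The function names are made up for illustration.

```python
# Back-of-the-envelope check of the figures quoted in the notes.
# Assumptions (not stated in the notes): binary units, 365.25-day years.

BYTES_PER_TB = 2**40
BYTES_PER_MB = 2**20
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def sustained_rate_mb_per_s(total_tb: float, years: float) -> float:
    """Average ingest rate needed to move total_tb in the given number of years."""
    total_bytes = total_tb * BYTES_PER_TB
    seconds = years * SECONDS_PER_YEAR
    return total_bytes / seconds / BYTES_PER_MB


def files_per_volume(total_files: float, total_volumes: float) -> float:
    """Average number of files produced per scanned volume."""
    return total_files / total_volumes


if __name__ == "__main__":
    rate = sustained_rate_mb_per_s(380, 4)
    print(f"Sustained rate: {rate:.2f} MB/s")
    print(f"Files per volume: {files_per_volume(2.2e9, 7e6):.0f}")
```

Under these assumptions the rate works out to roughly 3.16 MB per second, which matches the quoted number. With decimal units (1 TB = 10^12 bytes) it would be closer to 3.0 MB per second, so the talk's figure was probably computed in binary units.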

Data characteristics - well defined file formats - image files are TIFF or JPEG 2000, OCR files and metadata are UTF-8 text. Indefinite retention. Files are largely static. Much material is in copyright, so requires security practices.

MBooks service - can search and look at books online.

There's interest in using the OCR data for textual analysis research.

Posted by oren at January 3, 2007 3:32 PM
