[CSG Winter 2007] MBooks at University of Michigan


- The project partnership with Google was publicly announced in December 2004: scanning 7 million print volumes over 4-6 years. Direct scanning costs are borne by Google.

UM receives a copy of all digital files, including OCR and metadata, which can be used to build services. UM can share the files, with some restrictions. Each page of a volume produces 2.01 files on average, which will come to about 2.2 billion files and 380 TB of data. That works out to a sustained rate of 3.16 MB per second for four years.
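As a rough check on those figures, here is a minimal Python sketch of the arithmetic. It assumes binary units (1 TB = 2^40 bytes, 1 MB = 2^20 bytes) and 365-day years; neither is stated in the talk notes, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the storage figures quoted above.
# Assumptions (not stated in the notes): binary units and 365-day years.
TIB = 2**40                      # bytes in one "TB", read as a tebibyte
MIB = 2**20                      # bytes in one "MB", read as a mebibyte
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def sustained_rate_mib_per_s(total_tib, years):
    """Average ingest rate needed to land total_tib of data in the given years."""
    return total_tib * TIB / (years * SECONDS_PER_YEAR) / MIB


def implied_pages(total_files, files_per_page):
    """Number of scanned pages implied by the file count and files-per-page average."""
    return total_files / files_per_page


rate = sustained_rate_mib_per_s(380, 4)
pages = implied_pages(2.2e9, 2.01)

print(f"sustained rate: {rate:.2f} MB/s")          # 3.16 under these assumptions
print(f"implied pages:  {pages / 1e9:.2f} billion")
```

Under these assumptions the result matches the quoted 3.16 MB per second. With decimal units (10^12 and 10^6 bytes) the same calculation gives about 3.01, so the quoted rate appears to use binary units.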

Data characteristics - well-defined file formats: image files are TIFF or JPEG 2000, OCR files and metadata are UTF-8 text. Indefinite retention. Files are largely static. Much of the material is in copyright, so it requires security practices.

MBooks service - users can search and view books online.

There's interest in using the OCR data for textual analysis research.


About this Entry

This page contains a single entry by Oren Sreebny published on January 3, 2007 3:32 PM.
