Nov 2010

Using OCR for pro fit and please Ure

One of my personal projects involves reading and revisiting a journey described in a stupidly rare book from the 1920s. I finally found two copies of it online – one as an 80MB PDF file and the other as a scan from Google’s famous book scanning exercise.

This fantastic preamble comes from that Google scanning and OCR…

      This is a digital copy of a book lhal w;ls preserved for general
ions on library shelves before il was carefully scanned by Google
as pari of a project

      to make the world's books discoverable online.

      Il has survived long enough for the copyright to expire and the
book to enter the public domain. A public domain book is one thai
was never subject

      to copy right or whose legal copyright term has expired. Whether
a book is in the public domain may vary country to country.
Public domain books

      are our gateways to the past, representing a wealth of history,
culture and knowledge that's often dillicull lo discover.

      Marks, notations and other marginalia present in the original
volume will appear in this file - a reminder of this book's long
journey from the

      publisher lo a library and linally lo you.

      Usage guidelines

      Google is proud lo partner with libraries lo digili/e public
domain materials and make them widely accessible. Public domain
books belong to the
public and we are merely their custodians. Nevertheless, this
work is expensive, so in order lo keep providing this resource,
we have taken steps to
prevent abuse by commercial panics, including placing Icchnical
restrictions on automated querying.

It’s not a good look when your OCR can’t even process your own boilerplate front page…
