The images are the first batch of "The Commons", a new Internet Archive collection of photographs drawn from more than 600 million book pages that the organization has digitally scanned. The pages amount to over 19 petabytes of data, and more than 14 million images are expected to become accessible online.
So far, Kalev Leetaru has uploaded 2.6 million images to Flickr, where automatically added tags make them searchable. Until now, the images are said to have been difficult to access.
According to Leetaru, digitization projects have so far emphasized words and ignored pictures.
"For all these years all the libraries have been digitizing their books, but they have been putting them up as PDFs or text searchable works," says Leetaru. "They have been focusing on the books as a collection of words. This inverts that."
The most impressive feature of the Internet Archive's project is the amount of detail attached to each image. In addition to Flickr's own descriptions, the Internet Archive adds details such as the book title, where the image came from, the publisher, the year of publication, the author and, where applicable, the subject.
Users who search for an image receive hyperlinks to the page on which it appeared, viewable on the Internet Archive's website. They also get a link to the book's description and to the other scanned images from the same title.
Where available, the Internet Archive also lists any text that accompanies the image.
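As a rough illustration of how such per-image details could be turned into searchable tags, here is a minimal Python sketch. The `BookImage` record, the `to_tags` helper, the `key:value` tag format and the sample book are all hypothetical; the article does not describe the actual tag scheme used on Flickr.

```python
from dataclasses import dataclass, field


@dataclass
class BookImage:
    """Hypothetical record holding the details the article lists for each image."""
    book_title: str
    publisher: str
    year: int
    author: str
    subjects: list = field(default_factory=list)  # subject is optional


def to_tags(image):
    """Build illustrative 'key:value' tags; the format is an assumption."""
    tags = [
        f"booktitle:{image.book_title}",
        f"publisher:{image.publisher}",
        f"year:{image.year}",
        f"author:{image.author}",
    ]
    # Add one tag per subject, only when the book has any.
    tags.extend(f"subject:{s}" for s in image.subjects)
    return tags


# Made-up sample data, not a real catalogue entry.
sample = BookImage(
    book_title="An Example Book",
    publisher="Example Press",
    year=1890,
    author="A. Writer",
    subjects=["Railways"],
)
tags = to_tags(sample)
```

Keeping the subject as an optional list mirrors the article's note that a subject is added only when one applies.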
"The latter is especially powerful, as it allows to keyword search 500 years of images, instantly accessing particular topics or themes," stated Flickr in its blog.
Leetaru started working on the project while researching communications technology at Georgetown University in Washington, D.C. The project is part of a fellowship sponsored by Yahoo, which owns the photo-sharing site Flickr.
The Internet Archive used optical character recognition (OCR) software to analyze each of the 600 million scanned pages and convert the image of each word into searchable text. As part of that process, the software identified which parts of a page were pictures and discarded them. Leetaru instead saves each of those regions as a separate JPEG file. Each JPEG, with its associated text, is then posted to a new Flickr page.
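The extraction step can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the page is a small grid of pixel values, the OCR output is a list of labelled bounding boxes, and the `Region`, `crop` and `extract_pictures` names are hypothetical. It does not reproduce Leetaru's actual code or any real OCR file format, and it stops short of writing JPEG files.

```python
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical OCR region: a labelled bounding box on the page."""
    kind: str        # "text" or "picture" (assumed labels)
    left: int
    top: int
    right: int       # exclusive
    bottom: int      # exclusive
    text: str = ""   # recognized words, empty for pictures


def crop(page, region):
    """Return the pixel rows that fall inside the region's bounding box."""
    return [row[region.left:region.right]
            for row in page[region.top:region.bottom]]


def extract_pictures(page, regions):
    """Pair each picture region's pixels with the text found on the same page."""
    page_text = " ".join(r.text for r in regions if r.kind == "text")
    return [(crop(page, r), page_text)
            for r in regions if r.kind == "picture"]


# Toy 4-row by 6-column page: 0 is background, 1 marks picture pixels.
page = [[0] * 6 for _ in range(4)]
for y in (1, 2):
    for x in (2, 3):
        page[y][x] = 1

regions = [
    Region("text", 0, 0, 6, 1, "Fig. 1. A steam engine"),
    Region("picture", 2, 1, 4, 3),
]
pictures = extract_pictures(page, regions)
```

In a real pipeline, each cropped region would be encoded as a JPEG and uploaded together with its associated text; here the pair is simply kept in memory.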