The Library of Congress (LOC) has launched a new AI-powered tool for searching images in old newspapers, enabling anyone to find historic images from more than 16 million scanned newspaper pages. Called Newspaper Navigator, it builds upon the LOC’s existing Chronicling America project and uses a visual content recognition model capable of finding a variety of images in digitized newspapers, including maps, comics, photographs, illustrations, advertisements and more.
The Chronicling America project is the LOC’s historic newspaper archive. It lets anyone search the text of a vast archive of digitized newspapers dating back to the late 1700s, made searchable with optical character recognition (OCR). Newspaper Navigator builds upon this, introducing the ability to search for images rather than text. The object detection model was trained using annotated newspaper pages from the Chronicling America project, enabling it to extract the visual content from 16,358,041 newspaper pages.
The new tool was created by LOC 2020 Innovator in Residence Benjamin Charles Germain Lee, who detailed the project in a new video. In addition to offering a search tool online, the LOC has released the extracted visual content as prepackaged datasets available to download from GitHub. This prepackaged content is split up by year and includes a variety of metadata alongside the images.
Users can search through more than 1.6 million images sourced from newspapers dated from 1900 to 1963. The results are fairly accurate, though the descriptions of the content, which are extracted using optical character recognition, can be poor when the scanned newspaper text is of low quality.
The interface includes some useful options, such as links for downloading the images, viewing the full newspaper issues, learning more about the newspapers and getting citations for images. These options apply to the online search tool, not to the prepackaged image datasets available for download on GitHub.
According to the full study, Newspaper Navigator is the largest single dataset of extracted visual content from historic newspapers ever assembled. Machine learning technology has produced an unprecedented way to rapidly sort through digitized materials that would otherwise be far too extensive to search manually.
As for using the images found through Newspaper Navigator, the rights and reproduction terms are those of the wider Chronicling America project. According to the project’s About page, the LOC:
…believes that the newspapers in Chronicling America are in the public domain or have no known copyright restrictions. Newspapers published in the United States more than 95 years ago are in the public domain in their entirety. Any newspapers in Chronicling America that were published less than 95 years ago are also believed to be in the public domain, but may contain some copyrighted third party materials. Researchers using newspapers published less than 95 years ago should be alert for modern content (for example, registered and renewed for copyright and published with notice) that may be copyrighted.
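The 95-year threshold in that statement comes down to simple date arithmetic. A minimal sketch in Python follows; the function name is hypothetical and the logic is only an illustration of the quoted rule, not part of Newspaper Navigator or any LOC tool.

```python
from datetime import date

# Threshold taken from the quoted Chronicling America About page.
PUBLIC_DOMAIN_AGE_YEARS = 95


def published_more_than_95_years_ago(publication_date: date, today: date) -> bool:
    """Return True if publication_date is more than 95 years before today.

    Hypothetical helper for illustration only; it encodes nothing beyond
    the "more than 95 years ago" wording of the quoted statement.
    """
    try:
        cutoff = today.replace(year=today.year - PUBLIC_DOMAIN_AGE_YEARS)
    except ValueError:
        # today is Feb 29 and the cutoff year is not a leap year
        cutoff = today.replace(year=today.year - PUBLIC_DOMAIN_AGE_YEARS, day=28)
    return publication_date < cutoff
```

Checked against a date in May 2020, a page published in 1920 passes this test, while one from 1950 does not and so falls into the second group described above: believed to be in the public domain, but possibly containing copyrighted third-party material.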
This new tool joins the LOC’s vast digitized archive of photographs, prints and drawings, all of which are readily accessible through the LOC website. The Library provides a considerable amount of information on most of the digitized images, including everything from photo medium and genre to dates, photographers, location and image descriptions.