WARNING: This is an experimental API and may change in the future.
Here is an example of searching inside a book using the API.
This API depends on data node hosts, which can change (an archive.org item's files live on data hosts, and an item may move between them). To find the data node host of an item, go to archive.org/metadata/{identifier} and change the prefix ia800204 to the value of d1 or d2 accordingly. The path variable in the URL may also have to change to the dir value within the metadata.
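The lookup described above can be sketched in Python. This is a minimal sketch, not a definitive client: the function names are hypothetical, and the second hostname in the sample record is invented purely for illustration. It assumes the metadata record is a JSON object carrying the d1, d2 and dir fields mentioned above.

```python
import json
from urllib.request import urlopen


def fetch_metadata(identifier):
    """Download the metadata record for an item (performs a network request)."""
    with urlopen("https://archive.org/metadata/" + identifier) as resp:
        return json.load(resp)


def locate_item(metadata):
    """Return (candidate hostnames, path) taken from a metadata record.

    The hostnames come from the d1 and d2 fields and the path from dir.
    """
    hosts = [metadata[key] for key in ("d1", "d2") if metadata.get(key)]
    if not hosts:
        raise ValueError("metadata record has neither d1 nor d2")
    return hosts, metadata.get("dir")


# Sample record shaped like the fields described above.
# The d2 value is made up for illustration only.
sample = {
    "d1": "ia800204.us.archive.org",
    "d2": "ia600204.us.archive.org",
    "dir": "/27/items/designevaluation25clin",
}
hosts, path = locate_item(sample)
```

Either hostname can then be tried as the host for the search request, with the dir value used as the path.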
Information you need to search inside a book, with an example from the above search:
- hostname: ia800204.us.archive.org (host where the book is stored)
- item_id: designevaluation25clin (archive.org item ID)
- doc: designevaluation25clin (most times this is the same as the item_id)
- path: /27/items/designevaluation25clin (path of the book on this host)
- q: "library science" (phrase to search for)
- callback: reply (optional callback for JSONP)
You can find the hostname and path using the archive.org locator service.
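As a rough illustration of how the parameters listed above fit together, the sketch below assembles them into a request URL. The function name is hypothetical, and so are the URL scheme and the endpoint path: this page does not state the endpoint, so the caller passes it in and the value used here is only a placeholder. The sketch also assumes the parameters are sent as an ordinary URL-encoded query string.

```python
from urllib.parse import urlencode


def build_search_url(hostname, endpoint, item_id, path, q, doc=None, callback=None):
    """Assemble a search URL from the parameters described above.

    endpoint is the path of the search service on the data host; it is an
    assumption supplied by the caller, not something this sketch knows.
    """
    params = {
        "item_id": item_id,
        "doc": doc or item_id,  # doc is usually the same as item_id
        "path": path,
        "q": q,
    }
    if callback:
        params["callback"] = callback  # optional, for JSONP
    return "https://{}{}?{}".format(hostname, endpoint, urlencode(params))


url = build_search_url(
    hostname="ia800204.us.archive.org",
    endpoint="/search-endpoint",  # placeholder, replace with the real endpoint
    item_id="designevaluation25clin",
    path="/27/items/designevaluation25clin",
    q='"library science"',
    callback="reply",
)
```

Using urlencode takes care of escaping the quotation marks and the space in the search phrase.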
Example of output from API call:
reply( {
"ia": "designevaluation25clin",
"q": "\"library science\"",
"page_count": 224,
"body_length": 475677,
"leaf0_missing": true,
"matches": [
...
]
} )
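When a callback is supplied, the output is wrapped in a function call as shown above, so it has to be unwrapped before it can be parsed as JSON. The following is a minimal sketch of one way to do that; the function name is hypothetical and the sample string is a shortened version of the output above.

```python
import json


def parse_jsonp(body, callback="reply"):
    """Strip a JSONP wrapper such as reply( ... ) and parse the JSON inside."""
    text = body.strip().rstrip(";").strip()
    prefix = callback + "("
    if text.startswith(prefix) and text.endswith(")"):
        text = text[len(prefix):-1]
    return json.loads(text)


# Shortened sample based on the output shown above.
raw = r'reply( {"ia": "designevaluation25clin", "q": "\"library science\"", "page_count": 224, "matches": []} )'
result = parse_jsonp(raw)
```

If no wrapper is present, the body is parsed as plain JSON unchanged.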
The reply includes page_count, which is the number of pages that were passed to the OCR.
Example of a match:
{
"text": "The first Clinic on Library Applications of Data Processing was held at the Illini Union on the Urbana-Champaign campus of the University of Illinois, April 28 - May 1, 1963 under the sponsorship of the University of Illinois Graduate School of {{{Library}}} {{{Science}}}. Writing in the Foreword to the Clinic proceedings, Herbert Goldhor (1964) provides the rationale for sponsoring such a Clinic:",
"par": [
{
"page": 14, "page_width": 2134, "page_height": 3328,
"b": 1090, "t": 700, "r": 2024, "l": 192,
"boxes": [
{ "r": 1560, "b": 957, "t": 899, "l": 1378 },
{ "r": 1767, "b": 957, "t": 899, "l": 1587 }
]
}
]
}
Each match contains a 'text' field, which is usually a complete paragraph. The matched words are surrounded by three braces on either side, like {{{this}}}.
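The triple-brace markers can be picked out with a regular expression. The sketch below shows one possible approach, with hypothetical helper names, for listing the matched words and for producing a clean copy of the text without the markers.

```python
import re

# Matches {{{word}}} and captures the word between the braces.
MARKER = re.compile(r"\{\{\{(.*?)\}\}\}")


def matched_words(text):
    """Return the words that were wrapped in triple braces."""
    return MARKER.findall(text)


def strip_markers(text):
    """Return the text with the triple-brace markers removed."""
    return MARKER.sub(r"\1", text)


snippet = "University of Illinois Graduate School of {{{Library}}} {{{Science}}}."
words = matched_words(snippet)
plain = strip_markers(snippet)
```

The same pattern could be used with a different replacement, for example to wrap each matched word in highlighting markup instead of removing the braces.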
The other field is called par. It contains details of every page that is part of this match, since paragraphs can cross pages. Each par object provides a page number, page width, page height, and coordinates for the paragraph on the page. The boxes field lists the coordinates to draw around each word or part of a word in the match.
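To draw the boxes over a page image that is displayed at a different size, the coordinates need to be scaled. The sketch below is one way to do this under stated assumptions: that l, t, r and b are left, top, right and bottom edges measured from the top-left corner, in the same units as page_width and page_height. The function name and the output keys are hypothetical.

```python
def scale_boxes(par, image_width, image_height):
    """Convert the boxes of one par object into rectangles for a displayed image.

    Assumes the origin is the top-left corner and that box coordinates share
    the units of page_width and page_height.
    """
    sx = image_width / par["page_width"]
    sy = image_height / par["page_height"]
    return [
        {
            "left": box["l"] * sx,
            "top": box["t"] * sy,
            "width": (box["r"] - box["l"]) * sx,
            "height": (box["b"] - box["t"]) * sy,
        }
        for box in par["boxes"]
    ]


# The par object from the example match, rendered at half size.
par = {
    "page": 14, "page_width": 2134, "page_height": 3328,
    "b": 1090, "t": 700, "r": 2024, "l": 192,
    "boxes": [
        {"r": 1560, "b": 957, "t": 899, "l": 1378},
        {"r": 1767, "b": 957, "t": 899, "l": 1587},
    ],
}
rects = scale_boxes(par, image_width=1067, image_height=1664)
```

Because a match can span pages, a caller would typically run this once per par object and group the resulting rectangles by the page number.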
Hyphenation means words can break across lines and across pages.
History
- Created October 22, 2010
- 12 revisions
March 15, 2022 | Edited by Mek | Edited without comment. |
March 15, 2022 | Edited by Mek | Edited without comment. |
October 26, 2016 | Edited by Brenton Cheng | Edited without comment. |
January 7, 2011 | Edited by Edward Betts | host and path of sample book changed |
October 22, 2010 | Created by Edward Betts | started page |