September 12, 2022 - Starting the Cataloging
Keying ISBNs into the Library of Congress search page by hand was just too cumbersome. Time to automate, but we’ll take the advice of an old boss of ours, Greg Stroud, who pointed out that it is just too hard to automate much more than 95% of a job. So we created a LiveCode (successor to Apple’s HyperCard) stack and wrote a script that automatically sends a query to the LOC web server and parses out what we need from the result. We only search by ISBN (the 95%) and make no attempt at a fallback search when that fails. Searching by title and author can return tens of thousands of results, so for those books you have to keep trying various things by hand to find the one you want to classify. At first we forgot to throttle the queries to keep the Library of Congress web people happy. They don’t want to see more than 10 queries per minute, so the loop now waits 10 seconds after each query, which works out to at most 6 per minute and should make sure they don’t block our IP address. Here’s the card script:
-- Use https://catalog.loc.gov/index.html to get the bibliography and Library of Congress classification.
--
-- Modification history:
-- Created 9/12/2022

on SendQuery
   put card field "SearchField" into searchItems
   put "" into card field "Results Field"
   -- Break the search field's contents into ISBNS and Titles. Each line is "title, ISBN". Note that a title can be
   -- "This book by author", or anything else that doesn't have a comma.
   -- As of version 9.5.1, "split searchItems by comma" doesn't work. It fouls out on the first line, putting ONLY the
   -- title in the first item, then the first line's ISBN with the second line's title. So we have to break the lines
   -- up by hand.
   -- Bug reported: https://quality.livecode.com/show_bug.cgi?id=23934
   put "" into ISBNS
   put "" into Titles
   repeat with j = 1 to the number of lines of searchItems
      put line j of searchItems into tempLine
      split tempLine by comma
      put tempLine[1] & return after Titles
      -- Knock out the common debris of spaces and dashes
      put replaceText(tempLine[2], "[ -]", "") into curISBN
      put curISBN & return after ISBNS
   end repeat
   repeat with j = 1 to the number of lines of searchItems
      put card field "BaseURL" into baseURL
      put baseURL & line j of ISBNS into searchURL
      put searchURL into card field "SearchURL"
      put URL (searchURL) into card field "RawResult"
      -- First check for errors; nothing is going to work if we get the "No Connections" error.
      put getLOCitem(card field "RawResult", card field "No Connection Grep Pattern") into NoConnectionError
      if NoConnectionError is not empty then
         put "Got pesky No Connections Available Error" & return after card field "Results Field"
         -- "exit repeat" leaves the loop; "break" only applies to switch structures
         exit repeat
      end if
      put getLOCitem(card field "RawResult", card field "IP Request Form Pattern") into IPRequestFormError
      if IPRequestFormError is not empty then
         put "Got IP Request form Error" & return after card field "Results Field"
         exit repeat
      end if
      put getLOCitem(card field "RawResult", card field "Permalink Grep Pattern") into permLink
      if permLink is empty then
         put "Search Failed for ISBN: " & line j of ISBNS & ", " & line j of Titles & return after card field "Results Field"
         set the foreColor of word 1 to 4 of last line of field "Results Field" to "red"
      else
         put getLOCitem(card field "RawResult", card field "Author Grep Pattern") into Author
         put getLOCitem(card field "RawResult", card field "Book Title Grep Pattern") into Title
         put getLOCitem(card field "RawResult", card field "Classification Grep Pattern") into Classification
         -- Have to put a bar and a space in front of the ISBN so Apple Numbers won't think it's a number and trim off
         -- leading zeros. ISBN should really have been International Standard Book IDENTIFIER! This bug in Numbers has
         -- been around since 2015: https://discussions.apple.com/thread/6826306
         put Author & tab & Title & tab & Classification & tab & "| " & line j of ISBNS & tab & permLink & return after card field "Results Field"
      end if
      -- Make fewer than 10 requests per minute
      wait 10 seconds
   end repeat
   -- Trim the last newline so Numbers won't think the ISBN is a number
   delete the last character of card field "Results Field"
end SendQuery

-- Return the text captured by the first group of grepPattern in the HTML page (empty if there is no match)
function getLOCitem HTML, grepPattern
   put "" into LOCitem
   put matchText(HTML, grepPattern, LOCitem) into theResult
   return LOCitem
end getLOCitem

on closeCard
end closeCard
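For readers who don't use LiveCode, the input handling and throttling in the card script can be sketched in Python. This is only an illustration, not the code the stack runs: the function names parse_search_lines and throttled are made up for this sketch, and it assumes each input line looks like "title, ISBN" with no comma in the title. Unlike the card script, it skips blank lines and pauses between queries rather than after every one.

```python
import re
import time


def parse_search_lines(text):
    """Split each 'title, ISBN' line into a (title, isbn) pair."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines (the card script does not do this)
        # Everything before the first comma is the title, the rest is the ISBN.
        title, _, raw_isbn = line.partition(",")
        # Knock out spaces and dashes, like replaceText(..., "[ -]", "").
        isbn = re.sub(r"[ -]", "", raw_isbn)
        pairs.append((title.strip(), isbn))
    return pairs


def throttled(items, per_minute=6, sleep=time.sleep):
    """Yield items one at a time, pausing between them to stay at per_minute."""
    delay = 60.0 / per_minute  # 6 per minute gives a 10 second pause
    for index, item in enumerate(items):
        if index > 0:
            sleep(delay)
        yield item
```

A query loop would then read something like "for title, isbn in throttled(parse_search_lines(text)):", with the actual web request placed inside the loop body.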
There’s also a tiny script in the “Send Query” Button:
on mouseUp
   SendQuery
end mouseUp
Above is a screenshot of the single-card LiveCode stack. By keeping all the grep patterns and URLs in text fields, we can follow any changes the Library of Congress makes to its web site without changing the code. What this stack does is called “screen scraping”: we grep the returned HTML that a browser would display instead of using an API (application programming interface), which the LOC provides as an extremely expensive subscription service. Even if we did subscribe, we’d still have to write code to handle its results.
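The idea of keeping the patterns as data, separate from the code, can be sketched in Python as well. Everything specific here is an assumption made for illustration: the patterns are invented and do not match the real LOC pages (the stack's actual patterns live in its text fields and are not shown in this post), and the names PATTERNS, get_item, scrape_record and format_row are hypothetical. Only get_item corresponds to something in the stack, namely getLOCitem.

```python
import re

# Invented patterns standing in for the stack's pattern fields. Each one must
# contain exactly one capture group, which holds the value we want.
PATTERNS = {
    "author": r'<span class="author">(.*?)</span>',
    "title": r'<h1 class="title">(.*?)</h1>',
    "classification": r'<span class="call">(.*?)</span>',
    "permalink": r'<a class="permalink" href="([^"]+)"',
}


def get_item(html, pattern):
    """Return the first capture group of pattern in html, or "" if no match."""
    match = re.search(pattern, html, re.DOTALL)
    return match.group(1) if match else ""


def scrape_record(html, patterns=PATTERNS):
    """Apply every named pattern to the page and collect the results."""
    return {name: get_item(html, pattern) for name, pattern in patterns.items()}


def format_row(record, isbn):
    """Build one tab-separated output row in the same field order as the stack."""
    fields = [
        record["author"],
        record["title"],
        record["classification"],
        "| " + isbn,  # bar and space keep a spreadsheet from reading it as a number
        record["permalink"],
    ]
    return "\t".join(fields)
```

When the site's markup changes, only the entries in PATTERNS need editing, which is the same benefit the text fields give the stack. An empty "permalink" value plays the role of the stack's "Search Failed" case.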