===== Chapter 16: Scraping the Individual Entry Page =====
[[http://niche-canada.org/member-projects/zotero-guide/chapter16.html|HWZT chapter 16 (Scraping the Individual Entry Page)]]: As in previous chapters, the tutorial example works but is presented rather casually, so here are step-by-step instructions to reproduce the result of the tutorial section:
- open the single-result translator
- extend the single-result translator
- test the single-result translator on single-result pages
- test the single-result translator on a search-results page
To open the single-result translator:
- Close any running Scaffold 2.0 instances.
- Ensure the [[http://niche-canada.org/member-projects/zotero-guide/sample1.html|first sample page]] is open in your browser and has focus.
- Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog="Zotero Scaffold".
- Hit icon=Load (at the upper left of the dialog). This should pop up dialog="Load Translator" displaying a single table mapping "Label" to "Creator".
- Scroll through the "Load Translator" table until you see Label=How to Write a Zotero Translator (single result), then hit button=OK.
- You will return to the main dialog="Zotero Scaffold". Check to see that tab=Metadata is properly populated.
- Click icon="Run detectWeb" (the eye) to ensure its current code still works: since the first sample page is a single-item page, in the Test Frame you should get results like 12:00:00 detectWeb returned type "book"
- Leave the current instance of Scaffold 2.0 open, since we'll use the same translator in the next section.
To extend the single-result translator: add a doWeb function to the JavaScript in tab=Code. This can be implemented in several ways. The following code has been tested and is somewhat more modular than that provided by HWZT. To use it:
- replace the current contents of tab=Code with the code below:
  function detectWeb(doc, url) {
      if (doc.title.match("Single Item")) {
          return "book";
      } else if (doc.title.match("Search Results")) {
          return "multiple";
      }
  }
  // The function used to save well formatted data to Zotero
  function associateData (newItem, items, field, zoteroField) {
      if (items[field]) {
          newItem[zoteroField] = items[field];
      }
  }
  function scrape(doc, url) {
      // variable declarations
      var newItem = new Zotero.Item('book');
      newItem.url = doc.location.href;
      newItem.title = "No Title Found";
      var items = new Object();
      var tagsContent = new Array();
      // scrape page data, save to Zotero
      getItems(doc, items, tagsContent);
      getAuthors(newItem, items);
      getImprints(newItem, items);
      getTags(newItem, items, tagsContent);
      saveToZotero(newItem, items);
  }
  function getItems(doc, items, tagsContent) {
      // namespace code
      var namespace = doc.documentElement.namespaceURI;
      var nsResolver = namespace ?
          function(prefix) {
              if (prefix == 'x') return namespace; else return null;
          } : null;
      // populate "items" Object and save tags to an Array
      var blankCell = "temp";
      var headersTemp;
      var headers;
      var contents;
      var myXPathObject = doc.evaluate('//td[1]', doc, nsResolver, XPathResult.ANY_TYPE, null);
      var myXPathObject2 = doc.evaluate('//td[2]', doc, nsResolver, XPathResult.ANY_TYPE, null);
      while (headers = myXPathObject.iterateNext()) {
          headersTemp = headers.textContent;
          if (!headersTemp.match(/\w/)) {
              headersTemp = blankCell;
              blankCell = blankCell + "1";
          }
          contents = myXPathObject2.iterateNext().textContent;
          if (headersTemp.match("temp")) {
              tagsContent.push(contents);
          }
          items[headersTemp.replace(/\s+/g, '')]=contents.replace(/^\s*|\s*$/g, '');
      }
  }
  function getAuthors(newItem, items) {
      //Formatting and saving "Author" field
      if (items["PrincipalAuthor:"]) {
          var author = items["PrincipalAuthor:"];
          if (author.match("; ")) {
              var authors = author.split("; ");
              for (var i in authors) {
                  newItem.creators.push(Zotero.Utilities.cleanAuthor(authors[i], "author"));
              }
          } else {
              newItem.creators.push(Zotero.Utilities.cleanAuthor(author, "author"));
          }
      }
  }
  function getImprints(newItem, items) {
      // Format and save "Imprint" fields
      if (items["Imprint:"]) {
          items["Imprint:"] = items["Imprint:"].replace(/\s\s+/g, '');
          if (items["Imprint:"].match(":")) {
              var colonLoc = items["Imprint:"].indexOf(":");
              newItem.place = items["Imprint:"].substr(1, colonLoc-1);
              var commaLoc = items["Imprint:"].lastIndexOf(",");
              var date1 =items["Imprint:"].substr(commaLoc + 1);
              newItem.date = date1.substr(0, date1.length-1);
              newItem.publisher = items["Imprint:"].substr(colonLoc+1, commaLoc-colonLoc-1);
          } else {
              newItem.publisher = items["Imprint:"];
          }
      }
  }
  function getTags(newItem, items, tagsContent) {
      if (items["Subjects:"]) {
          tagsContent.push(items["Subjects:"]);
      }
      for (var i = 0; i < tagsContent.length; i++) {
          newItem.tags[i] = tagsContent[i];
      }
  }
  function saveToZotero(newItem, items) {
      // Associate and save well-formed data to Zotero
      associateData (newItem, items, "Title:", "title");
      associateData (newItem, items, "ISBN-10:", "ISBN");
      associateData (newItem, items, "Collection:", "extra");
      associateData (newItem, items, "Pages:", "pages");
      newItem.repository = "NiCHE";
      newItem.complete();
  }
  function doWeb(doc, url) {
      // namespace code
      var namespace = doc.documentElement.namespaceURI;
      var nsResolver = namespace ? function(prefix) {
          if (prefix == 'x') return namespace; else return null;
      } : null;
      // variable declarations
      var articles = new Array();
      var items = new Object();
      var nextTitle;
      // If Statement checks if page is a Search Result, then saves requested Items
      if (detectWeb(doc, url) == "multiple") {
          var titles = doc.evaluate('//td[2]/a', doc, nsResolver, XPathResult.ANY_TYPE, null);
          while (nextTitle = titles.iterateNext()) {
              items[nextTitle.href] = nextTitle.textContent;
          }
          items = Zotero.selectItems(items);
          for (var i in items) {
              articles.push(i);
          }
      } else {
          //saves single page items
          articles = [url];
      }
      // process everything, calling function=scrape to do the heavy lifting
      Zotero.Utilities.processDocuments(articles, scrape, function(){Zotero.done();});
      Zotero.wait();
  }
- Click icon=Save (second from left): your translator should save silently.
- Close all running Scaffold 2.0 instances.
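The Imprint handling in getImprints is the least obvious part of the code above. The following standalone sketch reproduces its string slicing outside Zotero, so it can be tried in any JavaScript console. The function name parseImprint and the sample string are illustrative assumptions, not part of HWZT; in particular, the bracketed form "[Place: Publisher, Year]" is only inferred from the way the translator skips the first and last characters. Unlike the translator, the sketch trims each piece explicitly.

```javascript
// Hypothetical helper mirroring the slicing done in getImprints.
// Assumes an imprint of the form "[Place: Publisher, Year]".
function parseImprint(imprint) {
  // Remove runs of two or more whitespace characters, as the translator does.
  var text = imprint.replace(/\s\s+/g, '');
  var colonLoc = text.indexOf(':');
  if (colonLoc === -1) {
    // No colon: the translator stores the whole string as the publisher.
    return { publisher: text };
  }
  var commaLoc = text.lastIndexOf(',');
  var date1 = text.substr(commaLoc + 1);
  return {
    // Start at index 1 to skip the assumed opening bracket.
    place: text.substr(1, colonLoc - 1).trim(),
    publisher: text.substr(colonLoc + 1, commaLoc - colonLoc - 1).trim(),
    // Drop the last character (the assumed closing bracket).
    date: date1.substr(0, date1.length - 1).trim()
  };
}

var result = parseImprint('[Toronto: Nelson Canada, 2009]');
console.log(result.place, '|', result.publisher, '|', result.date);
// prints: Toronto | Nelson Canada | 2009
```

If a real page's imprint lacks the surrounding brackets, the same slicing would cut off the first letter of the place and the last digit of the year, so the offsets need adjusting per site.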
To test the translator on a single-result page:
- Ensure the [[http://niche-canada.org/member-projects/zotero-guide/sample1.html|first sample page]] is open in your browser and has focus.
- Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog="Zotero Scaffold".
- Hit icon=Load (at the upper left of the dialog). This should pop up dialog="Load Translator" displaying a single table mapping "Label" to "Creator".
- Scroll through the "Load Translator" table until you see Label=How to Write a Zotero Translator (single result), then hit button=OK.
- You will return to the main dialog="Zotero Scaffold". Check to see that tab=Metadata is properly populated.
- Click icon="Run doWeb" (the thunderbolt) to test the new code: in the Test Frame you should get results like
  12:00:00 Returned item:
    'itemType' => "book"
    'creators' ...
      '0' ...
        'firstName' => "Alan"
        'lastName' => "MacEachern"
        'creatorType' => "author"
      '1' ...
        'firstName' => "William J."
        'lastName' => "Turkel"
        'creatorType' => "author"
    'notes' ...
    'tags' ...
      '0' => "History"
      '1' => "Methodology"
      '2' => "Tables."
      '3' => "Environment"
    'seeAlso' ...
    'attachments' ...
    'url' => "http://niche-canada.org/member-projects/zotero-guide/sample1.html"
    'title' => "Method and Meaning in Canadian Environmental History"
    'place' => "Toronto"
    'date' => "2009"
    'publisher' => "Nelson Canada"
    'ISBN' => "0176441166"
    'extra' => "None"
    'pages' => "573"
    'libraryCatalog' => "NiCHE"
    'complete' => function(...){...}
  14:36:30 Translation successful
- Close all running Scaffold 2.0 instances.
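The two creators in this output come from a single "Principal Author:" cell, which getAuthors splits on "; " before handing each name to Zotero.Utilities.cleanAuthor. The sketch below imitates that step outside Zotero. The helper splitName is a hypothetical, much-simplified stand-in for cleanAuthor (it only splits at the last space), parseAuthors is likewise an invented name, and the input string is an assumption inferred from the output above.

```javascript
// Hypothetical, simplified stand-in for Zotero.Utilities.cleanAuthor:
// everything after the last space is treated as the last name.
function splitName(name, creatorType) {
  var trimmed = name.trim();
  var lastSpace = trimmed.lastIndexOf(' ');
  if (lastSpace === -1) {
    return { firstName: '', lastName: trimmed, creatorType: creatorType };
  }
  return {
    firstName: trimmed.slice(0, lastSpace),
    lastName: trimmed.slice(lastSpace + 1),
    creatorType: creatorType
  };
}

// Mirrors getAuthors: one creator per "; "-separated name.
function parseAuthors(authorField) {
  return authorField.split('; ').map(function (name) {
    return splitName(name, 'author');
  });
}

var creators = parseAuthors('Alan MacEachern; William J. Turkel');
console.log(JSON.stringify(creators, null, 2));
```

Because split returns a one-element array when the separator is absent, this sketch does not need the separate single-author branch that getAuthors uses.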
To test the translator on another single-result page:
- Ensure the [[http://niche-canada.org/member-projects/zotero-guide/sample2.html|second sample page]] is open in your browser and has focus.
- Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog="Zotero Scaffold".
- Hit icon=Load (at the upper left of the dialog). This should pop up dialog="Load Translator" displaying a single table mapping "Label" to "Creator".
- Scroll through the "Load Translator" table until you see Label=How to Write a Zotero Translator (single result), then hit button=OK.
- You will return to the main dialog="Zotero Scaffold". Check to see that tab=Metadata is properly populated.
- Click icon="Run doWeb" (the thunderbolt) to test the new code: in the Test Frame you should get results like
  12:00:00 Returned item:
    'itemType' => "book"
    'creators' ...
      '0' ...
        'firstName' => "David Freeland"
        'lastName' => "Duke"
        'creatorType' => "author"
    'notes' ...
    'tags' ...
      '0' => "Canada"
      '1' => "Environment"
      '2' => "Bibliography"
      '3' => "Tables."
      '4' => "History"
    'seeAlso' ...
    'attachments' ...
    'url' => "http://niche-canada.org/member-projects/zotero-guide/sample2.html"
    'title' => "Canadian Environmental History: Essential Readings"
    'place' => "Toronto"
    'date' => "2006"
    'publisher' => "Canadian Scholars Press"
    'ISBN' => "1551303108"
    'extra' => "None"
    'pages' => "392"
    'libraryCatalog' => "NiCHE"
    'complete' => function(...){...}
  18:52:01 Translation successful
- Close all running Scaffold 2.0 instances.
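The tags in these outputs are gathered by getItems and getTags. In getItems, a row whose first cell contains no word characters gets a placeholder key ("temp", "temp1", and so on) and its content is queued as a tag. The sketch below replays that loop over plain arrays instead of XPath results. The function name collectRows and the sample rows are invented for illustration and do not reproduce the actual sample pages.

```javascript
// Hypothetical replay of the getItems loop over [header, content] pairs
// instead of XPath iterators.
function collectRows(rows) {
  var items = {};
  var tagsContent = [];
  var blankCell = 'temp';
  rows.forEach(function (row) {
    var header = row[0];
    var contents = row[1];
    if (!/\w/.test(header)) {
      // Unlabeled row: use a placeholder key, then lengthen it for the next one.
      header = blankCell;
      blankCell = blankCell + '1';
    }
    if (header.match('temp')) {
      tagsContent.push(contents);
    }
    // Keys lose all whitespace; values lose leading and trailing whitespace.
    items[header.replace(/\s+/g, '')] = contents.replace(/^\s*|\s*$/g, '');
  });
  return { items: items, tagsContent: tagsContent };
}

var parsed = collectRows([
  ['Principal Author:', ' Jane Doe '],
  ['Subjects:', 'Environment'],
  ['', 'History'],
  [' ', 'Methodology']
]);
console.log(Object.keys(parsed.items));
console.log(parsed.tagsContent);
```

One caveat follows directly from the match on "temp": any labeled header that happens to contain that substring (for example "Attempts:") would also be queued as a tag.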
To test the translator on a search-results page:
- Ensure the [[http://niche-canada.org/member-projects/zotero-guide/searchresults1.html|first sample search page]] is open in your browser and has focus.
- Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog="Zotero Scaffold".
- Hit icon=Load (upper left of the dialog) to pop up dialog="Load Translator".
- Scroll through the "Load Translator" table until you see Label=How to Write a Zotero Translator (single result), then hit button=OK.
- You will return to the main dialog="Zotero Scaffold". Check to see that tab=Metadata is properly populated.
- Click icon="Run doWeb" (the thunderbolt): a dialog="Select Items" should pop up, with a selection area containing 10 items, corresponding to the 10 items in the [[http://niche-canada.org/member-projects/zotero-guide/searchresults1.html|sample search page]]. Check that the titles of the items in the dialog match the titles of the items on the sample search page, then click button=Cancel on dialog="Select Items".
- TODO: test adding an item to your library.
- Leave the current instance of Scaffold 2.0 open, since we'll use the same translator in the next section.
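For a search-results page, doWeb builds a map from each result link's href to its title, passes it to Zotero.selectItems, and scrapes only the URLs that come back. Here is a minimal sketch of that flow with plain objects. The names buildItemMap, pickArticles and selectFirst, and the example.org URLs, are all hypothetical; the real Zotero.selectItems shows the "Select Items" dialog instead, and treating a falsy return value as "nothing selected" is an assumption of this sketch.

```javascript
// Hypothetical stand-ins for the DOM links matched by '//td[2]/a'.
var links = [
  { href: 'http://example.org/item1.html', textContent: 'First Title' },
  { href: 'http://example.org/item2.html', textContent: 'Second Title' }
];

// Mirrors the while loop in doWeb: href -> title.
function buildItemMap(linkList) {
  var items = {};
  linkList.forEach(function (link) {
    items[link.href] = link.textContent;
  });
  return items;
}

// Mirrors the for...in loop in doWeb: the keys of the selected map
// are the URLs to scrape.
function pickArticles(items, selectItems) {
  var selected = selectItems(items) || {}; // assumption: falsy means nothing chosen
  return Object.keys(selected);
}

// Stand-in for the dialog: pretend the user ticked only the first entry.
function selectFirst(items) {
  var firstKey = Object.keys(items)[0];
  var chosen = {};
  chosen[firstKey] = items[firstKey];
  return chosen;
}

console.log(pickArticles(buildItemMap(links), selectFirst));
```

Keying the map by href is what lets doWeb hand the selected keys straight to Zotero.Utilities.processDocuments as the list of pages to scrape.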
Since it is now clear that our "single-result translator" also handles search results properly, we can save it as just "How to Write a Zotero Translator" with no suffix:
- Switch to tab=Metadata.
- In field=Label, remove the "(single result)" suffix.
- Click icon=Save (second from left): your translator should save silently.
- TODO: delete now-unnecessary translator="How to Write a Zotero Translator (search results)"
- Close all running Scaffold 2.0 instances.
**Next**: [[dev/How to Write a Zotero Translator, 2nd Edition/Chapter 17|Chapter 17: Common Problems when Scraping an Individual Entry Page]]