===== Chapter 16: Scraping the Individual Entry Page =====

[[http://niche-canada.org/member-projects/zotero-guide/chapter16.html|HWZT chapter 16 (Scraping the Individual Entry Page)]]: As in previous chapters, the tutorial example works but is presented rather casually, so here are step-by-step instructions to reproduce the result of the tutorial section:
  - open the single-result translator
  - extend the single-result translator
  - test the single-result translator on single-result pages
  - test the single-result translator on a search-results page

To open the single-result translator:
  - Close any running Scaffold 2.0 instances.
  - Ensure the [[http://niche-canada.org/member-projects/zotero-guide/sample1.html|first sample page]] is open in your browser and has focus.
  - Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog="Zotero Scaffold".
  - Hit icon=Load (at the upper left of the dialog). This should pop up dialog="Load Translator", displaying a single table mapping "Label" to "Creator".
  - Scroll through the "Load Translator" table until you see Label=How to Write a Zotero Translator (single result), then hit button=OK.
  - You will return to the main dialog="Zotero Scaffold". Check that tab=Metadata is properly populated.
  - Click icon="Run detectWeb" (the eye) to ensure its current code still works: in the Test Frame you should get results like <code>12:00:00 detectWeb returned type "book"</code>
  - Leave the current instance of Scaffold 2.0 open, since we'll use the same translator in the next section.

To extend the single-result translator: add a doWeb function to the JavaScript in tab=Code. This can be implemented in several ways. The following code has been tested, and is somewhat more modular than that provided by HWZT.
To use it, just:
  - replace the current contents of tab=Code with the code below:<code javascript>
function detectWeb(doc, url) {
    if (doc.title.match("Single Item")) {
        return "book";
    } else if (doc.title.match("Search Results")) {
        return "multiple";
    }
}

// The function used to save well formatted data to Zotero
function associateData(newItem, items, field, zoteroField) {
    if (items[field]) {
        newItem[zoteroField] = items[field];
    }
}

function scrape(doc, url) {
    // variable declarations
    var newItem = new Zotero.Item('book');
    newItem.url = doc.location.href;
    newItem.title = "No Title Found";
    var items = new Object();
    var tagsContent = new Array();

    // scrape page data, save to Zotero
    getItems(doc, items, tagsContent);
    getAuthors(newItem, items);
    getImprints(newItem, items);
    getTags(newItem, items, tagsContent);
    saveToZotero(newItem, items);
}

function getItems(doc, items, tagsContent) {
    // namespace code
    var namespace = doc.documentElement.namespaceURI;
    var nsResolver = namespace ? function(prefix) {
        if (prefix == 'x') return namespace; else return null;
    } : null;

    // populate "items" Object and save tags to an Array
    var blankCell = "temp";
    var headersTemp;
    var headers;
    var contents;
    var myXPathObject = doc.evaluate('//td[1]', doc, nsResolver, XPathResult.ANY_TYPE, null);
    var myXPathObject2 = doc.evaluate('//td[2]', doc, nsResolver, XPathResult.ANY_TYPE, null);
    while (headers = myXPathObject.iterateNext()) {
        headersTemp = headers.textContent;
        if (!headersTemp.match(/\w/)) {
            headersTemp = blankCell;
            blankCell = blankCell + "1";
        }
        contents = myXPathObject2.iterateNext().textContent;
        if (headersTemp.match("temp")) {
            tagsContent.push(contents);
        }
        items[headersTemp.replace(/\s+/g, '')] = contents.replace(/^\s*|\s*$/g, '');
    }
}

function getAuthors(newItem, items) {
    // Formatting and saving "Author" field
    if (items["PrincipalAuthor:"]) {
        var author = items["PrincipalAuthor:"];
        if (author.match("; ")) {
            var authors = author.split("; ");
            for (var i in authors) {
                newItem.creators.push(Zotero.Utilities.cleanAuthor(authors[i], "author"));
            }
        } else {
            newItem.creators.push(Zotero.Utilities.cleanAuthor(author, "author"));
        }
    }
}

function getImprints(newItem, items) {
    // Format and save "Imprint" fields
    if (items["Imprint:"]) {
        items["Imprint:"] = items["Imprint:"].replace(/\s\s+/g, '');
        if (items["Imprint:"].match(":")) {
            var colonLoc = items["Imprint:"].indexOf(":");
            newItem.place = items["Imprint:"].substr(1, colonLoc-1);
            var commaLoc = items["Imprint:"].lastIndexOf(",");
            var date1 = items["Imprint:"].substr(commaLoc + 1);
            newItem.date = date1.substr(0, date1.length-1);
            newItem.publisher = items["Imprint:"].substr(colonLoc+1, commaLoc-colonLoc-1);
        } else {
            newItem.publisher = items["Imprint:"];
        }
    }
}

function getTags(newItem, items, tagsContent) {
    if (items["Subjects:"]) {
        tagsContent.push(items["Subjects:"]);
    }
    for (var i = 0; i < tagsContent.length; i++) {
        newItem.tags[i] = tagsContent[i];
    }
}

function saveToZotero(newItem, items) {
    // Associate and save well-formed data to Zotero
    associateData(newItem, items, "Title:", "title");
    associateData(newItem, items, "ISBN-10:", "ISBN");
    associateData(newItem, items, "Collection:", "extra");
    associateData(newItem, items, "Pages:", "pages");
    newItem.repository = "NiCHE";
    newItem.complete();
}

function doWeb(doc, url) {
    // namespace code
    var namespace = doc.documentElement.namespaceURI;
    var nsResolver = namespace ? function(prefix) {
        if (prefix == 'x') return namespace; else return null;
    } : null;

    // variable declarations
    var articles = new Array();
    var items = new Object();
    var nextTitle;

    // If Statement checks if page is a Search Result, then saves requested Items
    if (detectWeb(doc, url) == "multiple") {
        var titles = doc.evaluate('//td[2]/a', doc, nsResolver, XPathResult.ANY_TYPE, null);
        while (nextTitle = titles.iterateNext()) {
            items[nextTitle.href] = nextTitle.textContent;
        }
        items = Zotero.selectItems(items);
        for (var i in items) {
            articles.push(i);
        }
    } else {
        // saves single page items
        articles = [url];
    }

    // process everything, calling function=scrape to do the heavy lifting
    Zotero.Utilities.processDocuments(articles, scrape, function(){Zotero.done();});
    Zotero.wait();
}
</code>
  - Click icon=Save (second from left): your translator should save silently.
  - Close all running Scaffold 2.0 instances.

To test the translator on a single-result page:
  - Ensure the [[http://niche-canada.org/member-projects/zotero-guide/sample1.html|first sample page]] is open in your browser and has focus.
  - Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog="Zotero Scaffold".
  - Hit icon=Load (at the upper left of the dialog). This should pop up dialog="Load Translator", displaying a single table mapping "Label" to "Creator".
  - Scroll through the "Load Translator" table until you see Label=How to Write a Zotero Translator (single result), then hit button=OK.
  - You will return to the main dialog="Zotero Scaffold". Check that tab=Metadata is properly populated.
  - Click icon="Run doWeb" (the thunderbolt) to test the new code: in the Test Frame you should get results like<code>
12:00:00 Returned item:
           'itemType' => "book"
           'creators' ...
             '0' ...
               'firstName' => "Alan"
               'lastName' => "MacEachern"
               'creatorType' => "author"
             '1' ...
               'firstName' => "William J."
               'lastName' => " Turkel"
               'creatorType' => "author"
           'notes' ...
           'tags' ...
             '0' => "History"
             '1' => "Methodology"
             '2' => "Tables."
             '3' => "Environment"
           'seeAlso' ...
           'attachments' ...
           'url' => "http://niche-canada.org/member-projects/zotero-guide/sample1.html"
           'title' => "Method and Meaning in Canadian Environmental History"
           'place' => "Toronto"
           'date' => "2009"
           'publisher' => "Nelson Canada"
           'ISBN' => "0176441166"
           'extra' => "None"
           'pages' => "573"
           'libraryCatalog' => "NiCHE"
           'complete' => function(...){...}
14:36:30 Translation successful
</code>
  - Close all running Scaffold 2.0 instances.

To test the translator on another single-result page:
  - Ensure the [[http://niche-canada.org/member-projects/zotero-guide/sample2.html|second sample page]] is open in your browser and has focus.
  - Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog="Zotero Scaffold".
  - Hit icon=Load (at the upper left of the dialog). This should pop up dialog="Load Translator", displaying a single table mapping "Label" to "Creator".
  - Scroll through the "Load Translator" table until you see Label=How to Write a Zotero Translator (single result), then hit button=OK.
  - You will return to the main dialog="Zotero Scaffold". Check that tab=Metadata is properly populated.
  - Click icon="Run doWeb" (the thunderbolt) to test the new code: in the Test Frame you should get results like<code>
12:00:00 Returned item:
           'itemType' => "book"
           'creators' ...
             '0' ...
               'firstName' => "David Freeland"
               'lastName' => "Duke"
               'creatorType' => "author"
           'notes' ...
           'tags' ...
             '0' => "Canada"
             '1' => "Environment"
             '2' => "Bibliography"
             '3' => "Tables."
             '4' => "History"
           'seeAlso' ...
           'attachments' ...
           'url' => "http://niche-canada.org/member-projects/zotero-guide/sample2.html"
           'title' => "Canadian Environmental History: Essential Readings"
           'place' => "Toronto"
           'date' => "2006"
           'publisher' => "Canadian Scholars Press"
           'ISBN' => "1551303108"
           'extra' => "None"
           'pages' => "392"
           'libraryCatalog' => "NiCHE"
           'complete' => function(...){...}
18:52:01 Translation successful
</code>
  - Close all running Scaffold 2.0 instances.

To test the translator on a search-results page:
  - Ensure the [[http://niche-canada.org/member-projects/zotero-guide/searchresults1.html|first sample search page]] is open in your browser and has focus.
  - Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog="Zotero Scaffold".
  - Hit icon=Load (upper left of the dialog) to pop up dialog="Load Translator".
  - Scroll through the "Load Translator" table until you see Label=How to Write a Zotero Translator (single result), then hit button=OK.
  - You will return to the main dialog="Zotero Scaffold". Check that tab=Metadata is properly populated.
  - Click icon="Run doWeb" (the thunderbolt): a dialog="Select Items" should pop up, with a selection area containing 10 items, corresponding to the 10 items in the [[http://niche-canada.org/member-projects/zotero-guide/searchresults1.html|sample search page]]. Check that the titles of the items in the dialog match the titles of the items on the sample search page, then click button=Cancel on dialog="Select Items".
  - TODO: test adding an item to your library.
  - Leave the current instance of Scaffold 2.0 open, since we'll use the same translator in the next section.

Since it is now clear that our "single-result translator" also handles search results properly, we can save it as just "How to Write a Zotero Translator" with no suffix:
  - Switch to tab=Metadata.
  - In field=Label, remove the "(single result)" suffix.
  - Click icon=Save (second from left): your translator should save silently.
  - TODO: delete now-unnecessary translator="How to Write a Zotero Translator (search results)"
  - Close all running Scaffold 2.0 instances.

**Next**: [[dev/How to Write a Zotero Translator, 2nd Edition/Chapter 17|Chapter 17: Common Problems when Scraping an Individual Entry Page]]
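
As an aside to this chapter's listing, the key-building loop in getItems and the author split in getAuthors can be tried outside Scaffold. The following is only a minimal sketch, not part of the translator: it applies the same string handling to a hand-written array instead of XPath results. The ''rows'' data and the names ''buildItems'' and ''splitAuthors'' are hypothetical, invented for illustration; the real sample pages may contain different cells, and Zotero.Utilities.cleanAuthor is deliberately not called because it is only available inside Zotero.

```javascript
// Hypothetical stand-in for the two table columns that getItems reads with
// '//td[1]' and '//td[2]'. These values are invented for illustration.
var rows = [
    ["Principal Author:", " Doe, Jane; Roe, Richard "],
    ["Title:", " An Example Title "],
    ["Subjects:", "History"],
    [" ", "Methodology"],   // blank header cell
    ["", "Tables."]         // another blank header cell
];

// Mirrors the loop in getItems: blank headers get placeholder keys
// ("temp", "temp1", ...) and their contents are collected as tags.
function buildItems(rows) {
    var items = {};
    var tagsContent = [];
    var blankCell = "temp";
    for (var i = 0; i < rows.length; i++) {
        var header = rows[i][0];
        var contents = rows[i][1];
        if (!header.match(/\w/)) {
            header = blankCell;
            blankCell = blankCell + "1";
        }
        if (header.match("temp")) {
            tagsContent.push(contents);
        }
        // Keys lose all whitespace; values lose leading and trailing whitespace.
        items[header.replace(/\s+/g, "")] = contents.replace(/^\s*|\s*$/g, "");
    }
    return { items: items, tagsContent: tagsContent };
}

// Mirrors the split in getAuthors, stopping before the cleanAuthor call.
function splitAuthors(items) {
    return items["PrincipalAuthor:"] ? items["PrincipalAuthor:"].split("; ") : [];
}

var result = buildItems(rows);
console.log(Object.keys(result.items));   // the normalised keys
console.log(splitAuthors(result.items));  // one entry per author
console.log(result.tagsContent);          // contents of the blank-header rows
```

Run under Node.js or a browser console, this shows why the translator looks up ''items["PrincipalAuthor:"]'' rather than ''"Principal Author:"'': the header text has its whitespace stripped before it is used as a key.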