Differences

This shows you the differences between two versions of the page.

--- dev:translators:coding [2020/04/22 18:15] – [detectWeb] dstillman
+++ dev:translators:coding [2023/08/04 01:14] (current) – [Search Translators] dstillman
@@ Line 1: / Line 1: @@
 ====== Writing Translator Code ======
-Below we will describe how the ''detect*'' and ''do*'' functions of Zotero [[dev/translators]] can and should be coded. If you are unfamiliar with JavaScript, make sure to check out a [[https://developer.mozilla.org/en/JavaScript/A_re-introduction_to_JavaScript|JavaScript tutorial]] to get familiar with the syntax. In addition to the information on this page, it can often be very informative to look at existing translators to see how things are done. A [[https://www.mediawiki.org/wiki/Citoid/Creating_Zotero_translators|particularly helpful guide]] with up-to-date recommendation on best coding practices is provided by the wikimedia foundation, whose tool Citoid uses Zotero translators.
+Below we will describe how the ''detect*'' and ''do*'' functions of Zotero [[dev/translators]] can and should be coded. If you are unfamiliar with JavaScript, make sure to check out a [[https://developer.mozilla.org/en/JavaScript/A_re-introduction_to_JavaScript|JavaScript tutorial]] to get familiar with the syntax. In addition to the information on this page, it can often be very informative to look at existing translators to see how things are done. A [[https://www.mediawiki.org/wiki/Citoid/Creating_Zotero_translators|particularly helpful guide]] with up-to-date recommendations on best coding practices is provided by the Wikimedia Foundation, whose tool Citoid uses Zotero translators.
 While translators can be written with any text editor, the built-in [[dev/translators/scaffold|Translator Editor]] can make writing them much easier, as it provides the option to test and troubleshoot translators relatively quickly.
+New web translators should use the Translator Editor's web translator template as a starting point. The template can be inserted in the Code tab: click the green plus dropdown and choose "Add web translator template".
 ====== Web Translators ======
@@ Line 10: / Line 11: @@
 ===== detectWeb =====
-''detectWeb'' is run to determine whether item metadata can indeed be retrieved from the webpage. The return value of this function should be the detected item type (e.g. "journalArticle", see the [[https://aurimasv.github.io/z2csl/typeMap.xml|overview of Zotero item types]]), or, if multiple items are found, "multiple".
+''detectWeb'' is run to determine whether item metadata can indeed be retrieved from the webpage. The return value of this function should be the detected item type (e.g. "journalArticle", see the [[https://aurimasv.github.io/z2csl/typeMap.xml|overview of Zotero item types]]), or, if multiple items are found, "multiple". If no item(s) can be detected on the current page, return false.
-''detectWeb'' receives two arguments, the webpage document object and URL (typically named ''doc'' and ''url''). In some cases, the URL provides all the information needed to determine whether item metadata is available, allowing for a simple ''detectWeb'' function, e.g. (example from ''Cell Press.js''):
+''detectWeb'' receives two arguments: the webpage document object and URL (typically named ''doc'' and ''url''). In some cases, the URL provides all the information needed to determine whether item metadata is available, allowing for a simple ''detectWeb'' function, e.g. (example from ''Cell Press.js''):
 <code javascript>function detectWeb(doc, url) {
-	if (url.indexOf("search/results") != -1) {
+	if (url.includes("search/results")) {
 		return "multiple";
 	}
-	else if (url.indexOf("content/article") != -1) {
+	else if (url.includes("content/article")) {
 		return "journalArticle";
 	}
@@ Line 26: / Line 27: @@
 ===== doWeb =====
-''doWeb'' is run when a user, wishing to save one or more items, activates the selected translator. Sidestepping the retrieval of item metadata, we'll first focus on how ''doWeb'' can be used to save retrieved item metadata (as well as attachments and notes) to your Zotero library.
+''doWeb'' is run when a user, wishing to save one or more items, activates the selected translator. It can be seen as the entry point of the translation process.
+The signature of ''doWeb'' should be
+<code javascript>doWeb(doc, url)</code>
+Here ''doc'' refers to the DOM object of the web page that the user wants to save as a Zotero item, and ''url'' is the page's URL as a string.
+In this section, we will describe the common tasks in the translation workflow started by ''doWeb()''.
 ==== Saving Single Items ====
+=== Scraping for metadata ===
+"Scraping" refers to the act of collecting information from the web page that can be used to populate Zotero item fields. Such information typically includes the title, creators, permanent URL, and source of the work being saved (for example, the title/volume/pages of a journal).
+Having identified what information to look for, you need to know where to look. The best way to do this is to use the web inspection tools that come with the browser ([[https://firefox-source-docs.mozilla.org/devtools-user/page_inspector/|Firefox]], [[https://developer.chrome.com/docs/devtools/dom/|Chromium-based]], and [[https://webkit.org/web-inspector/elements-tab/|Webkit/Safari]]). They are indispensable for locating the DOM node / HTML element -- by visual inspection, searching, or browsing the DOM tree.
+To actually retrieve information from the nodes in your translator code, you should be familiar with the use of [[https://developer.mozilla.org/en-US/docs/Web/API/Document_object_model/Locating_DOM_elements_using_selectors|selectors]], in the way they are used with the JavaScript API function ''[[https://developer.mozilla.org/en-US/docs/Web/API/Element/querySelectorAll|querySelectorAll()]]''.
+Most often, you will do the scraping using the helper functions ''text()'' and ''attr()'', for retrieving text content and attribute values, respectively. In fact, these two actions are performed so often that ''text()'' and ''attr()'' are available to the translator script as top-level functions.
+<code javascript>function text(parentNode, selector[, index])
+function attr(parentNode, selector, attributeName[, index])</code>
+  * ''text()'' finds the descendant of ''parentNode'' (which can also be a document) that matches ''selector'', and returns the text content (i.e. the value of the ''textContent'' property) of the selected node, with leading and trailing whitespace trimmed. If the selector doesn't match, the empty string is returned.
+  * ''attr()'' similarly uses the selector to locate a descendant node. However, it returns the value of the HTML attribute ''attributeName'' on that element. If the selector doesn't match, or if the element has no such attribute, the empty string is returned.
+Optionally, a number ''index'' (zero-based) can be used to select a specific node when the selector matches multiple nodes. If the index is out of range, the return value of both functions will be the empty string.
+Another less-used helper function ''innerText()'' has the same signature as ''text()'', but it differs from the latter by returning the selected node's ''[[https://developer.mozilla.org/en-US/docs/Web/API/HTMLElement/innerText|innerText]]'' value, which is affected by how the node's content would have been rendered.
+In addition, you can always use the API functions ''querySelector'' and ''querySelectorAll'' directly, but the helper functions should be preferred when they are adequate for the job.
+In some older translator code, you are likely to encounter node-selection expressed by XPath. Although XPath has its uses, for the most common types of scraping the selector-based functions should be preferred because of the simpler syntax of selectors.
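The behavior of ''text()'' and ''attr()'' described above can be illustrated with a small sketch. The functions ''textSketch'' and ''attrSketch'' below are hypothetical re-implementations written only from the description on this page, and ''makeFakeNode'' / ''makeFakeElement'' are stand-ins for a real DOM so that the code can run outside Zotero. In a translator you would simply call the built-in ''text()'' and ''attr()'', whose actual implementation may differ in detail.

```javascript
// Hypothetical re-implementation of the described behavior of text()
function textSketch(parentNode, selector, index) {
	// With a non-zero index, pick that match; otherwise take the first match
	const node = index
		? parentNode.querySelectorAll(selector)[index]
		: parentNode.querySelector(selector);
	return node ? node.textContent.trim() : "";
}

// Hypothetical re-implementation of the described behavior of attr()
function attrSketch(parentNode, selector, attributeName, index) {
	const node = index
		? parentNode.querySelectorAll(selector)[index]
		: parentNode.querySelector(selector);
	if (!node || !node.hasAttribute(attributeName)) return "";
	return node.getAttribute(attributeName);
}

// Stand-ins for DOM nodes (assumptions for illustration only)
function makeFakeElement(textContent, attributes = {}) {
	return {
		textContent,
		hasAttribute: name => name in attributes,
		getAttribute: name => (name in attributes ? attributes[name] : null)
	};
}
function makeFakeNode(matches) {
	return {
		querySelector: selector => (matches[selector] || [])[0] || null,
		querySelectorAll: selector => matches[selector] || []
	};
}

const fakeDoc = makeFakeNode({
	"h1.title": [makeFakeElement("  An Example Title \n")],
	"a.author": [
		makeFakeElement("Ada Lovelace", { href: "/people/1" }),
		makeFakeElement("Charles Babbage", { href: "/people/2" })
	]
});

const title = textSketch(fakeDoc, "h1.title");
const secondAuthor = textSketch(fakeDoc, "a.author", 1);
const secondAuthorLink = attrSketch(fakeDoc, "a.author", "href", 1);
const missing = textSketch(fakeDoc, "a.author", 5);
```

Returning the empty string for a failed match, rather than throwing, means that results can be assigned to item fields without first checking whether the node exists.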
 === Metadata ===
@@ Line 54: / Line 88: @@
 Attachments may be saved alongside item metadata via the item object's ''attachments'' property. Common attachment types are full-text PDFs, links and snapshots. An example from "Pubmed Central.js":
-<code javascript>var linkurl = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC" + ids[i] + "/";
+<code javascript>var linkURL = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC" + ids[i] + "/";
 newItem.attachments = [{
-	url: linkurl,
+	url: linkURL,
 	title: "PubMed Central Link",
 	mimeType: "text/html",
@@ Line 62: / Line 96: @@
 }];
-var pdfurl = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC" + ids[i] + "/pdf/" + pdfFileName;
+var pdfURL = "http://www.ncbi.nlm.nih.gov/pmc/articles/PMC" + ids[i] + "/pdf/" + pdfFileName;
 newItem.attachments.push({
-	title:"Full Text PDF",
+	title: "Full Text PDF",
-	mimeType:"application/pdf",
+	mimeType: "application/pdf",
-	url:pdfurl
+	url: pdfURL
 });</code>
@@ Line 74: / Line 108: @@
 <code javascript>
 newItem.attachments.push({
-title:"Snapshot",
+	title: "Snapshot",
-document:doc});</code>
+	document: doc
+});</code>
+When ''document'' is set, the MIME type will be set automatically.
 Zotero will automatically use proxied versions of attachment URLs returned from translators when the original page was proxied, which allows translators to construct and return attachment URLs without needing to know whether proxying is in use. However, some sites expect unproxied PDF URLs at all times, causing PDF downloads to potentially fail if requested via a proxy. If a PDF URL is extracted directly from the page, it's already a functioning link that's proxied or not as appropriate, and a translator should include ''proxy: false'' in the attachment metadata to indicate that further proxying should not be performed:
@@ Line 81: / Line 118: @@
 <code javascript>
 item.attachments.push({
-	url:realpdf,
+	url: realPDF,
 	title: "EBSCO Full Text",
-	mimeType:"application/pdf",
+	mimeType: "application/pdf",
 	proxy: false
 });
@@ Line 90: / Line 127: @@
 === Notes ===
-Notes are saved similarly to attachments. The content of the note, which should consist of a string, should be stored in the ''note'' property of the item's ''notes'' property. A title, stored in the ''title'' property, is optional. E.g.:
+Notes are saved similarly to attachments. The content of the note, which should consist of a string, should be stored in the ''note'' property of the item's ''notes'' property. E.g.:
-<code javascript>bbCite = "Bluebook citation: " + bbCite + ".";
+<code javascript>let note = "Bluebook citation: " + bbCite + ".";
-newItem.notes.push({note:bbCite});</code>
+newItem.notes.push({ note: note });</code>
 === Related ===
@@ Line 102: / Line 139: @@
 When the item objects are saved via ''item.complete()'', the relationships will be established. The following code illustrates a simple seeAlso relationship:
 <code javascript>function doWeb(doc, url) {
-	var item, items, i, ilen, j, jlen;
 	Zotero.debug("Simple example of setting seeAlso relations");
-	items = [];
+	let items = [];
 	// Real data acquisition would happen here
 	var titles = ["Book A", "Book B"];
-	for (i = 0, ilen = 2; i < ilen; i += 1) {
+	for (let title of titles) {
-		item = new Zotero.Item();
+		let item = new Zotero.Item("book");
-		item.itemType = "book";
+		item.title = title;
-		item.title = titles[i];
 		items.push(item);
 	}
 	// Assign a bogus itemID to each item in the set
-	for (i = 0, ilen = items.length) {
+	for (let i = 0; i < items.length; i++) {
 		items[i].itemID = "" + i;
 	}
@@ Line 124: / Line 158: @@
 	// Set bogus itemIDs in each item's seeAlso
 	// field (skipping the item's own ID)
-	for (i = 0, ilen = items.length; i < ilen; i += 1) {
+	for (let i = 0; i < items.length; i++) {
-		for (j = 0, jlen = items.length; j < jlen; j += 1) {
+		for (let j = 0; j < items.length; j++) {
 			if (i === j) {
 				continue;
@@ Line 134: / Line 168: @@
 	// Save the items
-	for (i = 0, ilen = items.length; i < ilen; i += 1) {
+	for (let item of items) {
-		items[i].complete();
+		item.complete();
 	}
 };</code>
@@ Line 145: / Line 179: @@
 === Item Selection ===
-To present the user with a selection window that shows all the items that have been found on the webpage, a JavaScript object should be created. Then, for each item, an item ID and label should be stored in the object as a property/value pair. The item ID is used internally by the translator, and can be a URL, DOI, or any other identifier, whereas the label is shown to the user (this will usually be the item's title). Passing the object to the ''Zotero.selectItems'' function will trigger the selection window, and the function passed as the second argument will receive an object with the selected items, as in this example from the IMDb translator:
+To present the user with a selection window that shows all the items that have been found on the webpage, a JavaScript object should be created. Then, for each item, an item ID and label should be stored in the object as a property/value pair. The item ID is used internally by the translator, and can be a URL, DOI, or any other identifier, whereas the label is shown to the user (this will usually be the item's title). Passing the object to the ''Zotero.selectItems'' function will trigger the selection window, and the function passed as the second argument will receive an object with the selected items (or ''false'' if the user canceled the operation), as in this example:
-<code javascript>Zotero.selectItems(items, function(items) {
+<code javascript>Zotero.selectItems(getSearchResults(doc, false), function (items) {
-	if(!items) return true;
+    if (items) ZU.processDocuments(Object.keys(items), scrape);
-	for (var i in items) {
-		ids.push(i);
-	}
-	apiFetch(ids);
 });</code>
 Here, ''Zotero.selectItems(..)'' is called with an anonymous function as the callback. As in many translators, the selected items are simply loaded into an array and passed off to a processing function that makes requests for each of them.
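The example above calls a helper named ''getSearchResults'' that is not shown on this page. A minimal sketch of what such a helper might look like follows. The selector ''a.result-title'', the meaning of the second parameter (assumed here to be "only check whether any result exists"), and the stand-in ''fakeDoc'' object are all assumptions made for illustration; a real translator would pass the page's ''doc'' and use a selector suited to the site.

```javascript
// Sketch of a helper that builds the { id: label } object for Zotero.selectItems.
// The selector is hypothetical; adapt it to the site being scraped.
function getSearchResults(doc, checkOnly) {
	const items = {};
	let found = false;
	for (const row of doc.querySelectorAll("a.result-title")) {
		const href = row.href;
		const title = row.textContent.trim();
		// Skip rows that lack either an ID (the URL) or a label (the title)
		if (!href || !title) continue;
		// In checkOnly mode, one usable result is enough
		if (checkOnly) return true;
		found = true;
		items[href] = title;
	}
	return found ? items : false;
}

// Stand-in for a real DOM document so the sketch can run outside Zotero
const fakeDoc = {
	querySelectorAll: () => [
		{ href: "https://example.com/article/1", textContent: " First Result " },
		{ href: "https://example.com/article/2", textContent: "Second Result" },
		{ href: "", textContent: "Row without a link" }
	]
};

const results = getSearchResults(fakeDoc, false);
```

Using the URL as the item ID is what lets the callback pass ''Object.keys(items)'' directly to a function that fetches each selected page.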
@@ Line 157: / Line 187: @@
 === Batch Saving ===
-You will often need to make additional requests to fetch all the metadata needed, either to make multiple items, or to get additional information on a single item. The most common and reliable way to make such requests is with the utility functions ''Zotero.Utilities.doGet'', ''Zotero.Utilities.doPost'', ''Zotero.Utilities.processDocuments''.
+You will often need to make additional requests to fetch all the metadata needed, either to make multiple items, or to get additional information on a single item. The most common and reliable way to make such requests is with the utility functions ''Zotero.Utilities.doGet'', ''Zotero.Utilities.doPost'', and ''Zotero.Utilities.processDocuments''.
 ''Zotero.Utilities.doGet(url, callback, onDone, charset)'' sends a GET request to the specified URL or to each in an array of URLs, and then calls function ''callback'' with three arguments: response string, response object, and the URL. This function is frequently used to fetch standard representations of items in formats like RIS and BibTeX. The function ''onDone'' is called when the input URLs have all been processed. The optional ''charset'' argument forces the response to be interpreted in the specified character set.
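As a rough illustration of the calling convention described above, the following sketch requests two RIS files and records what the callback receives. ''doGetStub'', the canned responses, and the ''example.com'' URLs are hypothetical stand-ins so that the code runs outside Zotero. Inside a translator, ''Zotero.Utilities.doGet'' would be called instead, and its callbacks would run asynchronously rather than immediately.

```javascript
// Hypothetical stand-in for Zotero.Utilities.doGet (synchronous, canned data)
const cannedResponses = {
	"https://example.com/1.ris": "TY  - JOUR\nTI  - First\nER  - ",
	"https://example.com/2.ris": "TY  - JOUR\nTI  - Second\nER  - "
};
function doGetStub(urls, callback, onDone) {
	// Accept either a single URL or an array of URLs
	for (const url of [].concat(urls)) {
		const body = cannedResponses[url];
		// callback receives: response string, response object, URL
		callback(body, { status: 200, responseText: body }, url);
	}
	if (onDone) onDone();
}

const fetched = [];
let finished = false;
doGetStub(
	Object.keys(cannedResponses),
	function (text, response, url) {
		// A real translator would typically hand the text to an import translator
		fetched.push({ url: url, status: response.status, length: text.length });
	},
	function () {
		// Called once, after every URL has been processed
		finished = true;
	}
);
```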
@@ Line 167: / Line 197: @@
 **Note:** The response objects passed to the callbacks above are [[https://developer.mozilla.org/en/XMLHttpRequest|described in detail in the MDN documentation]].
-''Zotero.Utilities.processAsync(sets, callbacks, onDone)'' can be used from translators to make it easier to correctly chain sets of asynchronous callbacks, since many translators that require multiple callbacks do it incorrectly [text from commit message, r4262]
+''Zotero.Utilities.processAsync(sets, callbacks, onDone)'' can be used from translators to make it easier to correctly chain sets of asynchronous callbacks, since many translators that require multiple callbacks do it incorrectly.
 ====== Import Translators ======
 To read in the input text, call ''Zotero.read()'':
 <code javascript>
 var line;
-while((line = Zotero.read()) !== false)) {
+while ((line = Zotero.read()) !== false) {
       // Do something
 }
@@ Line 190: / Line 220: @@
 collection.name = "Test Collection";
 collection.type = "collection";
-collection.children = [{type:"item", id:"my-item-id"}];
+collection.children = [{type: "item", id: "my-item-id"}];
 collection.complete();
 </code>
@@ Line 196: / Line 226: @@
 The children of a collection can include other collections. In this case, ''collection.complete()'' should be called only on the top-level collection.
 ====== Export Translators ======
-Export translators use ''Zotero.nextItem()'' and optionally ''Zotero.nextCollection()'' to iterate through the items selected for export, and generally write their output using ''Zotero.write( text )''. A minimal translator might be:
+Export translators use ''Zotero.nextItem()'' and optionally ''Zotero.nextCollection()'' to iterate through the items selected for export, and generally write their output using ''Zotero.write(text)''. A minimal translator might be:
 <code javascript>
 function doExport() {
@@ Line 211: / Line 241: @@
 If ''configOptions'' in [[dev:translators#metadata|the translator metadata]] has the ''getCollections'' attribute set to ''true'', the ''Zotero.nextCollection()'' call will be available. It provides collection objects like those created on import.
 <code javascript>
-while(collection = Zotero.nextCollection()) {
+while ((collection = Zotero.nextCollection())) {
         // Do something
 }
@@ Line 218: / Line 248: @@
 <code javascript>
 {
-        id : "ABCD1234", // Eight-character hexadecimal key
+        id: "ABCD1234", // Eight-character hexadecimal key
-        children : [ item, item, .. , item ], // Array of Zotero item objects
+        children: [item, item, .., item], // Array of Zotero item objects
-        name : "Test Collection"
+        name: "Test Collection"
 }
 </code>
@@ Line 228: / Line 258: @@
 <code javascript>
 function detectSearch(item) {
-        if(item.itemType === "journalArticle" || item.DOI) {
+        if (item.itemType === "journalArticle" || item.DOI) {
                 return true;
         }
@@ Line 235: / Line 265: @@
 </code>
-''doSearch'' should augment the provided item with additional information and call ''item.complete()'' when done. Since search translators are never called directly, but only by other translators or by the [[:getting_stuff_into_your_library#add_item_by_identifier|Add Item by Identifier]] (magic wand) function, it is common for the information to be further processed an [[#calling_other_translators|''itemDone'' handler]] specified in the calling translator.
+''doSearch'' should augment the provided item with additional information and call ''item.complete()'' when done. Since search translators are never called directly, but only by other translators or by the [[:adding_items_to_zotero#add_item_by_identifier|Add Item by Identifier]] (magic wand) function, it is common for the information to be further processed by an [[#calling_other_translators|''itemDone'' handler]] specified in the calling translator.
 ====== Further Reference ======
 ===== Utility Functions =====
-Zotero provides several [[https://github.com/zotero/zotero/blob/master/chrome/content/zotero/xpcom/utilities.js|utility functions]] for translators to use. Some of them are used for asynchronous and synchronous HTTP requests; those are [[#batch_saving|discussed above]]. In addition to those HTTP functions and the many standard functions provided by JavaScript, Zotero provides:
+Zotero provides several [[https://github.com/zotero/utilities/blob/master/utilities.js|utility functions]] for translators to use. Some of them are used for asynchronous and synchronous HTTP requests; those are [[#batch_saving|discussed above]]. In addition to those HTTP functions and the many standard functions provided by JavaScript, Zotero provides:
   * ''Zotero.Utilities.capitalizeTitle(title, ignorePreference)''\\ Applies English-style title case to the string, if the capitalizeTitles [[/support/hidden_prefs|hidden preference]] is set. If ''ignorePreference'' is true, title case will be applied even if the preference is set to false. This function is often useful for fixing capitalization of personal names, in conjunction with the built-in string method ''text.toLowerCase()''.
   * ''Zotero.Utilities.cleanAuthor(author, creatorType, hasComma)''\\ Attempts to split the given string into firstName and lastName components, splitting on a comma if desired, and performs some clean-up (e.g. removes unnecessary whitespace and punctuation). The creatorType (see the [[http://gimranov.com/research/zotero/creator-types|list of valid creator types]] for each item type) is passed through unchanged. Returns a creator object of the form ''{ lastName: , firstName: , creatorType: }'', which can, for example, be used directly as the argument to ''item.creators.push()''.