Differences

This shows you the differences between two versions of the page.

--- dev:scaffold_tutorial [2008/11/06 04:57] – external edit 127.0.0.1
+++ dev:scaffold_tutorial [2009/06/28 08:19] (current) – Moved Scaffold tutorial to main Scaffold page rmzelle
@@ Line 1: / Line 1: @@
-====== Developing a translator with Scaffold - a Tutorial ======
-===== 1.  Install Scaffold =====
-You can install [[scaffold|Scaffold]] from this link.
-This will provide you with a "Scaffold" item in the Tools menu in Firefox.  Select this item and a window like the one below will appear:
-{{1.png?475|}}
-The five buttons at the top of the window are:
-.  Load from Database
-Clicking this button will show you the complete list of translators available on your system.  This is a useful repository to browse in order to understand the code behind a translator.
-.  Save to Database
-This allows you to save the translator you are currently working on to the database.  Be sure to provide a unique name and Translator ID for your translator.  If you want to modify an existing translator to have a different name, change the "Label" and click the "Generate" button to make a new translator ID.
-.  Copy to Clipboard
-This copies the entire translator to the clipboard as an SQL statement which can be inserted into a Zotero database.
-.  Export
-This doesn't seem to do anything yet?
-.  Execute
-We'll be using this *a lot* later.
-===== 2.  Start writing your translator =====
-Provide some metadata for your translator.  The screenshot below shows the fields that you need to fill in:
-{{2.png|}}
-Once you provide the name (the translator ID will be unique by default if you start a new extension), the next thing to do is to provide the regular expression that matches the URLs the translator should handle.  In the current example we want the translator to initially match any URL starting with <code>http://www.emeraldinsight.com</code>.  This is the string we put in the "Target" text box.
-If we now navigate to the [[http://www.emeraldinsight.com|Emerald Insight home page]] and click on the "Test Regex" button on the right hand side of the "Target" text box, we'll see the following appear in the debug window below:
-<code>
-:54:11 ===>true<===(boolean)
-</code>
-If the regex fails this will appear instead:
-<code>
-:52:02 ===>false<===(boolean)
-</code>
-For more information on regular expressions, [[http://home.cogeco.ca/~ve3ll/jstutore.htm|this tutorial]] appears to be worthwhile, and this [[http://en.wikipedia.org/wiki/Regular_expression|Wikipedia article]] is a good starting point.
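Outside Scaffold, the same true/false check can be sketched in plain JavaScript. This is only an illustration: ''targetMatches'' is a hypothetical helper, not part of Scaffold or Zotero, and the anchored pattern string is just one possible way to write the target described above.

```javascript
// Hypothetical helper (not a Scaffold or Zotero function): tests a
// "Target" pattern string against a URL and returns true or false.
function targetMatches(target, url) {
    return new RegExp(target).test(url);
}

// Assumed target pattern: any URL starting with the Emerald Insight host.
// The dots are escaped so they only match literal dots.
var target = "^http://www\\.emeraldinsight\\.com";

var onSite = targetMatches(target, "http://www.emeraldinsight.com/");
var offSite = targetMatches(target, "http://www.example.org/");
```

Here ''onSite'' is true and ''offSite'' is false. Anchoring with ''^'' and escaping the dots keeps the pattern from matching unrelated URLs that merely contain a similar string.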
-===== 3.  STOP before you code!  Think about your site's construction =====
-The first thing you should do is to look at how your site is constructed.  In this case, we note that the articles we are interested in appear on pages with the title <code>Emerald: Article Request</code>.  This is the simplest way that we can tell that there's an entry we want Zotero to store on this page.
-The second thing we want to do is to find out if there's a "Download Citation" link or similar on the page.  This will greatly simplify the development of our translator.
-It turns out that the only way to get an RIS file from Emerald Insight is to save the article to your "marked list", navigate to the "Marked List" menu then click on the "Download" button which provides you with an RIS record.  Unfortunately this means that the RIS record is difficult to use in this case, so we will use a screen scraper to get the bibliographic record and the fulltext.
-===== 4.  The detectWeb function =====
-The next thing to do is to add a detectWeb function to the "Detect Code" tab in the Scaffold window.  This code tells Zotero whether the current page contains an item it can save, and therefore which icon to show in the Firefox address bar.  Fetching the bibliographic data and downloading the fulltext is done later, in doWeb.
-<code>
-function detectWeb(doc, url) {
-    if(doc.title == "Emerald: Article Request") {
-        return "journalArticle";
-    }
-}
-</code>
-In this instance Zotero checks whether the title of the web page is "Emerald: Article Request" and, if it is, Zotero provides the article icon in the address bar.
-**IMPORTANT**: after you save the code into the database, you will need to restart Firefox to test it.  This seems to be necessary to get Zotero to register the new translator, so that you will see the little "article" icon in the address bar.
-===== A quick digression - debugging =====
-Use the "Zotero.debug" function in any code to get messages about your Zotero functions.  To be able to see the Zotero output you'll need to run Firefox from the command prompt.  Under Windows go to the Start Menu -> Run and type cmd.exe.  On OS X, run Terminal.app, which is located in the folder /Applications/Utilities by default.
-On Windows the command line to run Firefox is <code> "C:\Program Files\Mozilla Firefox\firefox.exe" -console</code> assuming the default installation location for Firefox.  On OS X the command is <code>/Applications/Firefox.app/Contents/MacOS/firefox</code>.  If you use Linux you should be able to just issue the command firefox from the command line (iceweasel on Debian?).  See [[/debug_output|here]] for more details on debug output.
-===== The doWeb function =====
-In the "Code" tab in scaffold enter the following:
-<code>
-function doWeb(doc, url) {
-                Zotero.debug(doc.title);
-}
-</code>
-Now click the "Execute" button in Scaffold.  You'll see the title of the current web page appear in the debug window below.  This is the main way you'll use Zotero.debug to develop your translator.
-===== Building the screen scraper logic =====
-In terms of building a screen scraper, the relevant stuff is held in the following locations in the html:
-pdf fulltext:  link with the text "View PDF"
-html fulltext:  link with the text "View HTML"
-title, author, year, volume, number, pages, issn, doi and abstract fields are all contained in the html given below:
-<code>
-<TR CLASS="tableBodyWhite">
-<TD COLSPAN="2">
-<BR>
-        <b>Title:</b>
-        Title here.  Probably wants html formatting tags stripped.
-<BR>
-<b>Author(s):</b>
-        Firstname1 Surname1, Firstname2 Surname2
-<BR>
-<b>Journal:</b>
-        Journal title
-<BR>
-<b>ISSN:</b>
-        \d{4}-\d{4}
-<BR>
-<b>Year:</b>
-        \d{4}
-        <b> Volume:</b>
-        \d+
-        <b> Issue:</b>
-        \d+
-        <b> Page:</b>
-        \d+
-         -
-        \d+
-<BR>
-<b>DOI:</b>
-        doi-id-here
-<BR>
-<b>Publisher:</b>
-        Emerald Group Publishing Limited
-<BR>
-<b>Abstract:</b>
-        Abstract with some html formatting, <BR/> tags and so on here.
-<BR>
-</code>
-The article URL is contained in the following html snippet:
-<code>
-<b>Article URL:</b>
-  <A HREF="http://www.emeraldinsight.com/(doi_string_here)">http://www.emeraldinsight.com/(doi_string_here)</A>
-<BR>
-</code>
-The easiest way to generate page scrapers is with xpath expressions which tell you about the location of information within an xml document.  There is a short tutorial [[grap_xpaths_with_firebug|here]] on using [[http://www.getfirebug.com|Firebug]] to grab xpaths, although [[http://simile.mit.edu/wiki/Solvent|Solvent]] is also recommended.
-In this case we find that the HTML and PDF download links sit next to each other in the same table cell.
-The absolute xpath for the PDF link is:
-<code>
-/html/body/table/tbody/tr[5]/td[3]/div/table/tbody/tr/td/div/table[3]/tbody/tr[4]/td/a[2]
-</code>
-This can be shortened to give a relative path as below.  The point of the relative path is to provide the smallest unique expression that will match the part of the document that you're looking for.
-<code>
-//div/table/tbody/tr/td/div/table[3]/tbody/tr[4]/td/a[2]
-</code>
-The xpath code for the HTML download link is as follows:
-<code>
-//div/table/tbody/tr/td/div/table[3]/tbody/tr[4]/td/a[1]
-</code>
-To obtain the link to the pdf document we use the @href attribute.  The xpath expression to obtain this is:
-<code>
-//div/table/tbody/tr/td/div/table[3]/tbody/tr[4]/td/a[2]/@href
-</code>
-Note that this expression returns whatever is in the href attribute, so we need to check whether the link is absolute or relative.  Viewing the source and searching for "view pdf" shows that it is a relative link, and that it points to the same document with a slightly different GET parameter at the end.  It turns out that we can dispense with the XPath expressions in this case.  To get the full text pdf all we need to do is request the current url with the string &hdAction=lnkpdf appended, and to get the full text html version we append the string &hdAction=lnkhtml instead.
-The relative XPath expression for the bibliographic data is here:
-<code>
-//div/table/tbody/tr/td/div/table[3]/tbody/tr[6]/td
-</code>
-And the rule seems to be that attributes (title, journal, volume etc) are stored after a block of <b>bold</b> text.  We can get at Title with the following xpath expression:
-<code>
-//div/table/tbody/tr/td/div/table[3]/tbody/tr[6]/td/b
-</code>
-and Author from here
-<code>
-//div/table/tbody/tr/td/div/table[3]/tbody/tr[6]/td/b[2]
-</code>
-**At this stage I'm unsure how to get the html following the bold text up to but not including the next chunk of bold text.  Maybe I'll have to do some regex processing**
-===== the doWeb() and scrape() functions =====
-Here's the doWeb function which is executed when we decide to run the translator:
-<code>
-function doWeb(doc, url) {
-    scrape(doc,url);
-}
-</code>
-All that happens is that it runs the function scrape to get the relevant bibliographic data from the page:
-<code>
-function scrape(doc,url) {
-	var fullTextUrl = url + "&hdAction=lnkhtml";
-	var pdfUrl = url + "&hdAction=lnkpdf";
-	var xpath = "//div/table/tbody/tr/td/div/table[3]/tbody/tr[6]/td";
-	var allRefText = Zotero.Utilities.cleanString(doc.evaluate(xpath, doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent);
-// bib data scraper code here
-// zotero entry creation code here
-// obtaining the pdf and fulltext attachments here
-}
-</code>
-The first two lines are our "cheat" to get the fulltext url and the pdf url by appending the right GET string to the request.
-The next line defines the xpath expression used to get the bibliographic data.  I couldn't see a straightforward way of getting "everything after the bold 'Title: ' text and before the bold 'Author(s): ' text" with XPath, so I decided to use a regular expression approach on the text assigned to the variable ''allRefText'' above.
-The code we use to obtain the title is as follows:
-<code>
-	var titleRe = "Title: (.*?) Author";
-	var title = getItem(allRefText,titleRe);
-</code>
-and to get the raw authors string (this will be post-processed by Zotero) we use this code:
-<code>
-	var authorsRe = "Author.*?: (.*?)Journal";
-	var authors = getItem(allRefText,authorsRe);
-</code>
-Note the use of the ''getItem'' function to apply the regular expression.  Here is the code for this:
-<code>
-function getItem(reftext,re) {
-    var item = reftext.match(re);
-    // Zotero.debug(item[1]);
-    return item[1];
-}
-</code>
-Note that we've commented out the Zotero.debug line, but used it during development to make sure that the regular expressions were returning the correct thing.  The reftext.match(re) line returns an array.  The entire matching string is the first item in the array.  The second item in the array ( item[1] ), which is what we return, is the captured match: the part of the string that is matched by the (.*?) expression in each regular expression.
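The match-and-capture behaviour described above can be tried in plain JavaScript without Zotero. The sketch below is an illustration only: ''getItemSafe'' is a hypothetical variant of getItem that returns null when the pattern does not match (match() returns null in that case, so item[1] would otherwise throw), and the sample text is made up.

```javascript
// Hypothetical variant of getItem: returns the first captured group,
// or null when the pattern does not match at all.
function getItemSafe(reftext, re) {
    var item = reftext.match(re);
    return item ? item[1] : null;
}

// Made-up sample of the cleaned page text, for illustration only.
var sample = "Title: A study of things Author(s): Jane Doe, John Roe " +
    "Journal: Some Journal ISSN: 1234-5678";

var sampleTitle = getItemSafe(sample, "Title: (.*?) Author");
var sampleAuthors = getItemSafe(sample, "Author.*?: (.*?)Journal");
var missing = getItemSafe(sample, "Abstract: (.*?)Keywords");
```

With this sample, ''sampleTitle'' is "A study of things", ''sampleAuthors'' is "Jane Doe, John Roe " (note the trailing space, because that pattern has no space before "Journal"), and ''missing'' is null because the sample contains no abstract.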
-We repeat the call to getItem for each bibliographic item we are interested in:
-<code>
-	var titleRe = "Title: (.*?) Author";
-	var authorsRe = "Author.*?: (.*?)Journal";
-	var journalRe = "Journal: (.*?)ISSN";
-	var issnRe = "ISSN: (.*?)Year";
-	var yearRe = "Year: (.*?) Volume";
-	var volRe = "Volume: (.*?)Issue";
-	var issueRe = "Issue: (.*?)Page";
-	var pageRe = "Page: (.*?)DOI";
-	var doiRe = "DOI: (.*?)Publisher";
-	var publisherRe = "Publisher: (.*?)Abstract";
-	var abstractRe = "Abstract: (.*?)Keywords";
-</code>
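One way to avoid writing a separate call for every field is to keep the patterns in a single object and loop over it. This is a sketch under assumptions, not part of the original translator: ''extractFields'' and the sample text are hypothetical, and trimming the captured values is an added step.

```javascript
// Field name -> pattern string, using the same patterns as listed above.
var patterns = {
    title: "Title: (.*?) Author",
    authors: "Author.*?: (.*?)Journal",
    journal: "Journal: (.*?)ISSN",
    issn: "ISSN: (.*?)Year",
    year: "Year: (.*?) Volume",
    vol: "Volume: (.*?)Issue",
    issue: "Issue: (.*?)Page",
    page: "Page: (.*?)DOI",
    doi: "DOI: (.*?)Publisher",
    publisher: "Publisher: (.*?)Abstract",
    abstract: "Abstract: (.*?)Keywords"
};

// Hypothetical helper: applies every pattern to the text and returns an
// object of trimmed captures (null where a pattern does not match).
function extractFields(reftext, patterns) {
    var fields = {};
    for (var name in patterns) {
        var m = reftext.match(patterns[name]);
        fields[name] = m ? m[1].trim() : null;
    }
    return fields;
}

// Made-up sample text, for illustration only.
var sampleText = "Title: A study of things Author(s): Jane Doe, John Roe " +
    "Journal: Some Journal ISSN: 1234-5678 Year: 2008 Volume: 12 Issue: 3 " +
    "Page: 45 - 67 DOI: 10.0000/example Publisher: Emerald Group Publishing " +
    "Limited Abstract: An abstract. Keywords: things";

var fields = extractFields(sampleText, patterns);
```

Each pattern relies on the next label following immediately, so a page that omits a field (or reorders them) yields null for the neighbouring entry rather than an error.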
-We can also derive the article url from the DOI:
-<code>
-        var articleUrl = "http://www.emeraldinsight.com/"+doi;
-</code>
-==== Getting this into the Zotero database ====
-Once we've obtained this data and verified with Zotero.debug that the XPath and regular expressions are working, we can start passing the data on to Zotero:
-<code>
-	var newArticle = new Zotero.Item('journalArticle');
-	newArticle.title = title;
-	newArticle.publicationTitle = journal; // Zotero's field name for the journal title
-	newArticle.ISSN = issn;
-	newArticle.date = year; // Zotero stores the year in the date field
-	newArticle.volume = vol;
-	newArticle.issue = issue;
-	newArticle.pages = page;
-	newArticle.DOI = doi;
-	newArticle.publisher = publisher;
-	newArticle.abstractNote = abstract;
-	newArticle.url = articleUrl;
-	Zotero.debug(newArticle);
-</code>
-Authors are a little more complex.  Emerald Insight formats authors as "firstname[space]surname" and separates multiple authors with commas.  So we can write the following code to make it easy for Zotero to deal with:
-<code>
-	var aus = authors.split(",");
-	for (var i = 0; i < aus.length; i++) {
-		newArticle.creators.push(Zotero.Utilities.cleanAuthor(aus[i], "author"));
-	}
-</code>
-(If the authors were in the format "Smith, John" we'd use a slightly different call to cleanAuthor: Zotero.Utilities.cleanAuthor(aus[i], "author", true) will automatically swap the surname and forename.)
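To see what the split-and-clean step amounts to, here is a plain JavaScript sketch that runs without Zotero. ''splitName'' is a hypothetical stand-in and is NOT how Zotero.Utilities.cleanAuthor is implemented; it simply assumes the surname is the last space-separated word, and the author string is made up.

```javascript
// Hypothetical stand-in for illustration (not Zotero's cleanAuthor):
// splits "Firstname Surname" at the last space.
function splitName(name, creatorType) {
    var trimmed = name.trim();
    var lastSpace = trimmed.lastIndexOf(" ");
    if (lastSpace === -1) {
        // Single-word name: treat it as the surname.
        return { firstName: "", lastName: trimmed, creatorType: creatorType };
    }
    return {
        firstName: trimmed.substring(0, lastSpace),
        lastName: trimmed.substring(lastSpace + 1),
        creatorType: creatorType
    };
}

// Made-up authors string in the "firstname surname, firstname surname" form.
var creators = "Jane Doe, John Q. Roe".split(",").map(function (name) {
    return splitName(name, "author");
});
```

The trim() call matters because splitting on "," leaves a leading space on every name after the first. Surnames made of several words would need more careful handling than this sketch provides.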
-Finally, the article needs to be saved. This is done with:
-<code>
-        newArticle.complete();
-</code>
-This stores the citation in Zotero's database.
-==== Development tips ====
-Each type of citation has a different set of fields available for it (like the 'title', 'issue', etc. above).  One way of finding out which fields are available for a particular type is to look at https://www.zotero.org/trac/browser/extension/branches/1.0/chrome/locale/en-US/zotero/zotero.properties which lists all of the names used by Zotero.  Types are listed as "itemTypes" entries, and fields are listed as "itemFields".
-==== Getting your translator distributed ====
-Once you've finished your translator, if you want it to be distributed with Zotero, email the code to the Zotero developer list: zotero-dev@googlegroups.com