Scaffold - an IDE for Zotero translators

Scaffold is a Firefox extension developed to simplify writing Zotero translators. In Zotero 1.0.x, translators are stored in a single SQL database. Scaffold makes it easy to extract translators from this database, to edit and test translator code, and to save changes back into the database.

Installation

Current version: 1.0.2 (July 21, 2008)

Please note: As of yet, Scaffold is only compatible with Zotero 1.0.x. Users of Zotero 2.0 can edit translators directly, as in that version translators are stored as individual text files in the 'translators' subdirectory of the Zotero data directory.

Interface

After installation, the Tools menu in Firefox should contain a “Scaffold” item. Select this item to open the main Scaffold window:

Top buttons

Load from Database
Opens the “Load Translator” window. Select one of the currently installed translators, and load the translator metadata and code into the main Scaffold window.

Save to Database
Saves the translator you are currently working on to the database. Be sure to provide a unique label and Translator ID for your translator if you don't want to overwrite the existing translator. Translator IDs can be automatically generated via the “Generate” button.

Copy to Clipboard
This copies the entire translator to the clipboard. The translator is formatted as an SQL statement so it can be inserted into scrapers.sql (the Zotero 1.0.x database containing all translators).

Execute
Saves and runs translator code on the webpage loaded in the most recently selected tab. The exact behavior depends on the selected Scaffold tab. If the “Metadata” tab is selected, the translator is only saved. If the “Detect Code” or “Code” tab is selected, the translator is saved, after which the code in the currently selected tab is executed.

Tabs

Metadata
Here you provide the translator's metadata. Translator IDs can be automatically generated via the “Generate” button. The target regex can be tested with the “Test Regex” button.

Detect Code
The text field of this tab should contain the detectWeb function for web translators, as well as any functions detectWeb calls.

Code
The text field of this tab should contain the doWeb function for web translators, as well as any functions doWeb calls.

Debug Output

One of the strengths of Scaffold is its ability to provide you with immediate feedback, which can dramatically speed up translator development. After a code change, a single click suffices to run the modified translator and generate debug output. For each of the three tabs of Scaffold, a different type of debug output is generated:

Metadata

When the “Test Regex” button in the Metadata tab is clicked, the regular expression in the target field is applied to the webpage loaded in the most recently selected Firefox tab. The debug window at the bottom of the Scaffold window will show whether the regular expression matches (true for a match, false for no match), e.g.:

09:54:11 ===>true<===(boolean)

Detect Code & Code

When the “Execute” button is clicked while the “Detect Code” tab is active, the detectWeb function in the “Detect Code” text box is executed. If the “Code” tab is active, the doWeb function in the “Code” text box is executed.

Debug output for the “Detect Code”-tab will show what type of item is found on the loaded webpage, e.g.:

19:19:43 detectCode returned type "book"

Debug output for the “Code” tab will show all the item data that would be saved if the translator were run by Zotero, e.g.:

19:24:21 Returned item:
             'itemType' => "book"
             'creators' ...
                 '0' ...
                     'firstName' => "Herman"
                     'lastName' => "Melville"
                     'creatorType' => "author"
             'notes' ...
             'tags' ...
             'seeAlso' ...
             'attachments' ...
                 '0' ...
                     'title' => "Google Books Link"
                     'snapshot' => "false"
                     'mimeType' => "text/html"
                     'url' => "http://books.google.com/books?id=cYKYYypj8UAC"
                     'document' => "[object]"
             'date' => "1851"
             'pages' => "504"
             'ISBN' => "1603033742, 9781603033749"
             'publisher' => "Plain Label Books"
             'title' => "Moby Dick"
             'repository' => "Google Books"
             'complete' => function(...){...} 
         
19:24:21 Translation successful

If running detectWeb or doWeb results in an error, the debug window will show an error message.

Note that additional debug output can be generated with the JavaScript call Zotero.debug(string);.

Test Frame

What is the purpose of the Test Frame drop-down menu?

Getting your translator distributed

If your new or modified translator has general appeal, consider posting the translator to the Zotero developers mailing list.

Refer to http://forums.zotero.org/discussion/6533/scaffold-in-firefox-31/?Focus=28305#Comment_28305

Placeholder

All the text below should be relocated to a different page

2. Start writing your translator

Provide some metadata for your translator. The screenshot below shows the fields that you need to fill in:

Once you provide the name (the translator ID will be unique by default if you start a new translator), the next thing to do is to provide the regular expression that matches the URLs the translator should handle. In the current example, we want this translator to initially match any URL starting with:

http://www.emeraldinsight.com

This is the string we put in the “Target” text box.

If we now navigate to the Emerald Insight home page and click on the “Test Regex” button on the right hand side of the “Target” text box, we'll see the following appear in the debug window below:

09:54:11 ===>true<===(boolean)

If the regex fails this will appear instead:

09:52:02 ===>false<===(boolean)
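
The true/false result above can be reproduced outside Scaffold. The sketch below assumes the Target value is applied to the page URL as an ordinary JavaScript regular expression; the helper name targetMatches and the sample URLs are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper (not part of Scaffold or Zotero): applies a Target
// string to a URL, assuming plain JavaScript RegExp semantics.
function targetMatches(target, url) {
    return new RegExp(target).test(url);
}

// The Target string used in this example.
var target = "http://www.emeraldinsight.com";

// Made-up URLs, for illustration only.
var onSite = targetMatches(target, "http://www.emeraldinsight.com/some/article/page");
var offSite = targetMatches(target, "http://www.example.org/");
```

Because the Target is a regular expression, the unescaped dots match any character. Anchoring with ^ and escaping the dots, as in ^http://www\.emeraldinsight\.com, makes the match stricter.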

For more information on regular expressions this tutorial appears to be worthwhile, or this Wikipedia article is a good starting point.

3. STOP before you code! Think about your site's construction

The first thing you should do is to look at how your site is constructed. In this case, we note that the articles we are interested in appear on pages with the title:

Emerald: Article Request

This is the simplest way that we can tell that there's an entry we want Zotero to store on this page.

The second thing we want to do is to find out if there's a “Download Citation” link or similar on the page. This will greatly simplify the development of our translator.

It turns out that the only way to get an RIS file from Emerald Insight is to save the article to your “marked list”, navigate to the “Marked List” menu then click on the “Download” button which provides you with an RIS record. Unfortunately this means that the RIS record is difficult to use in this case, so we will use a screen scraper to get the bibliographic record and the fulltext.

4. The detectWeb function

The next thing to do is to add a detectWeb function to the “Detect Code” tab in the Scaffold window. This code provides the icon in the Firefox address bar and runs the Zotero hooks to get the bibliographic data and download the fulltext.

function detectWeb(doc, url) {
    if(doc.title == "Emerald: Article Request") {
        return "journalArticle";
    }
}

In this instance Zotero checks whether the title of the web page is “Emerald: Article Request” and, if it is, Zotero provides the article icon in the address bar.
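
Because detectWeb only reads doc.title here, its logic can be sanity-checked with plain objects standing in for the page. This is a rough sketch: the stand-in objects and URLs below are made up, and a real run inside Scaffold passes an actual DOM document instead.

```javascript
function detectWeb(doc, url) {
    if (doc.title == "Emerald: Article Request") {
        return "journalArticle";
    }
}

// Stand-ins for real pages (hypothetical values; only the title property matters).
var articlePage = { title: "Emerald: Article Request" };
var otherPage = { title: "Emerald: Home" };

var articleType = detectWeb(articlePage, "http://www.emeraldinsight.com/");
var otherType = detectWeb(otherPage, "http://www.emeraldinsight.com/");
// articleType is "journalArticle"; otherType is undefined because the
// function falls through without returning an item type.
```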

IMPORTANT: after you save the code into the database, you will need to restart Firefox to test it. This seems to be necessary to get Zotero to register the new translator, and so that you will see the little “article” icon in the address bar.

A quick digression - debugging

Use the “Zotero.debug” function in any code to get messages about your Zotero functions. To be able to see the Zotero output you'll need to run Firefox from the command prompt. Under Windows go to the Start Menu → Run and type cmd.exe. On OS X, run Terminal.app, which is located in the folder /Applications/Utilities by default.

On Windows the command line to run Firefox is

 "C:\Program Files\Mozilla Firefox\firefox.exe" -console

assuming the default installation location for Firefox. On OS X the command is

/Applications/Firefox.app/Contents/MacOS/firefox

If you use Linux you should be able to just issue the command firefox from the command line (iceweasel on Debian?). See here for more details on debug output.

The doWeb function

In the “Code” tab in scaffold enter the following:

function doWeb(doc, url) {
    Zotero.debug(doc.title);
}

Now click the “Execute” button in Scaffold. You'll see the title of the current web page appear in the debug window below. This is the main way you'll use Zotero.debug to develop your translator.

Building the screen scraper logic

In terms of building a screen scraper, the relevant data is held in the following locations in the HTML:

PDF fulltext: link with the text “View PDF”
HTML fulltext: link with the text “View HTML”

title, author, year, volume, number, pages, issn, doi and abstract fields are all contained in the html given below:

<TR CLASS="tableBodyWhite">
<TD COLSPAN="2">
<BR>
        <b>Title:</b>
        Title here.  Probably wants html formatting tags stripped.        

<BR>


<b>Author(s):</b>

        Firstname1 Surname1, Firstname2 Surname2

<BR>


<b>Journal:</b>
        Journal title

<BR>


<b>ISSN:</b>
        \d{4}-\d{4}

<BR>


<b>Year:</b>
        \d{4}
        <b> Volume:</b>
        \d+
        <b> Issue:</b>
        \d+

        <b> Page:</b>

        \d+
         -
        \d+

<BR>


<b>DOI:</b>
        doi-id-here

<BR>


<b>Publisher:</b>
        Emerald Group Publishing Limited

<BR>


<b>Abstract:</b>
        Abstract with some html formatting, <BR/> tags and so on here.

<BR>

The article URL is contained in the following HTML snippet:

<b>Article URL:</b>
  <A HREF="http://www.emeraldinsight.com/(doi_string_here)">http://www.emeraldinsight.com/(doi_string_here)</A>
<BR>

The easiest way to generate page scrapers is with XPath expressions, which describe the location of information within an XML document. There is a short tutorial here on using Firebug to grab XPaths, although Solvent is also recommended.

In this case we find the following:

The HTML and PDF download links are stored here:

The xpath for the PDF link is here:

/html/body/table/tbody/tr[5]/td[3]/div/table/tbody/tr/td/div/table[3]/tbody/tr[4]/td/a[2]

This can be shortened to give a relative path as below. The point of the relative path is to provide the smallest unique expression that will match the part of the document that you're looking for.

//div/table/tbody/tr/td/div/table[3]/tbody/tr[4]/td/a[2]

The xpath code for the HTML download link is as follows:

//div/table/tbody/tr/td/div/table[3]/tbody/tr[4]/td/a[1]

To obtain the link to the PDF document we use the @href attribute. The XPath expression for this is:

//div/table/tbody/tr/td/div/table[3]/tbody/tr[4]/td/a[2]/@href

Note that the href may be a relative link, so we need to view the page source to check whether we have an absolute link. Viewing the source and searching for “view pdf” shows us that we have a relative link. However, it is a relative link to the same document with a slightly different GET parameter at the end. It turns out that we can dispense with the XPath expressions in this case. To get the full text PDF all we need to do is request the current URL with the string &hdAction=lnkpdf appended, and to get the full text HTML version, we can append the string &hdAction=lnkhtml to the current URL.

The relative XPath expression for the bibliographic data is here:

//div/table/tbody/tr/td/div/table[3]/tbody/tr[6]/td

And the rule seems to be that attributes (title, journal, volume etc) are stored after a block of <b>bold</b> text. We can get at Title with the following xpath expression:

//div/table/tbody/tr/td/div/table[3]/tbody/tr[6]/td/b

and Author from here

//div/table/tbody/tr/td/div/table[3]/tbody/tr[6]/td/b[2]

At this stage I'm unsure how to get the HTML following the bold text up to but not including the next chunk of bold text. Maybe I'll have to do some regex processing.
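
One possible approach to the regex processing, offered only as a sketch and not the method this tutorial ends up using: treat the cell's markup as a string and capture everything between one bold label and the next. The function name fieldsAfterBoldLabels and the sample markup are hypothetical, and the sketch assumes the markup is available as a string (for example from the cell's innerHTML) and that no field value itself contains a <b> tag.

```javascript
// Hypothetical helper: maps each bold label to the text that follows it,
// up to (but not including) the next <b> tag or the end of the string.
function fieldsAfterBoldLabels(html) {
    var fields = {};
    var re = /<b>\s*([^<]*?):\s*<\/b>([\s\S]*?)(?=<b>|$)/gi;
    var m;
    while ((m = re.exec(html)) !== null) {
        // Strip remaining tags such as <BR>, then collapse and trim whitespace.
        var value = m[2].replace(/<[^>]+>/g, " ").replace(/\s+/g, " ");
        fields[m[1]] = value.replace(/^\s+|\s+$/g, "");
    }
    return fields;
}

// Made-up markup shaped like the snippet shown earlier.
var sampleHtml = "<b>Title:</b> An example title <BR> " +
    "<b>Author(s):</b> Jane Smith, John Doe <BR> " +
    "<b>Year:</b> 2008 <b> Volume:</b> 12";

var fields = fieldsAfterBoldLabels(sampleHtml);
```

Working on the markup keeps the field boundaries explicit, at the cost of depending on the exact tags the site emits.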

The doWeb() and scrape() functions

Here's the doWeb function which is executed when we decide to run the translator:

function doWeb(doc, url) {
    scrape(doc,url);
}

All that happens is that it runs the function scrape to get the relevant bibliographic data from the page:

function scrape(doc,url) {
    var fullTextUrl = url + "&hdAction=lnkhtml";
    var pdfUrl = url + "&hdAction=lnkpdf";
    var xpath = "//div/table/tbody/tr/td/div/table[3]/tbody/tr[6]/td";
    var allRefText = Zotero.Utilities.cleanString(doc.evaluate(xpath, doc, null, XPathResult.ANY_TYPE, null).iterateNext().textContent);

    // bib data scraper code here

    // zotero entry creation code here

    // obtaining the pdf and fulltext attachments here

}

The first two lines are our “cheat” to get the fulltext url and the pdf url by appending the right GET string to the request.

The next line defines the xpath expression used to get the bibliographic data. I couldn't see a straightforward way of getting “everything after the bold 'Title: ' text and before the bold 'Author(s): ' text” with XPath, so I decided to use a regular expression approach on the text assigned to the variable allRefText above.

The code we use to obtain the title is as follows:

	var titleRe = "Title: (.*?) Author";
	var title = getItem(allRefText,titleRe);

and to get the raw authors string (this will be post-processed by Zotero) we use this code:

	var authorsRe = "Author.*?: (.*?)Journal";
	var authors = getItem(allRefText,authorsRe);

Note the use of the getItem function to obtain the text captured by the regular expression. Here is the code for this:

function getItem(reftext,re) {
    var item = reftext.match(re);
    // Zotero.debug(item[1]);
    return item[1];
}

Note that we've commented out the Zotero.debug line, but used it during development to make sure that the regular expressions were returning the correct thing. The reftext.match(re) line returns an array. The entire matching string is the first item in the item array. The second item in the array (item[1]), which is what we return, is the captured match: the part of the string that is matched by the (.*?) expression in each regular expression.
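
To make the array returned by match concrete, here is a small example run on a made-up string shaped like the flattened page text (the sample text is an assumption, not real Emerald output).

```javascript
function getItem(reftext, re) {
    var item = reftext.match(re);
    return item[1];
}

// Made-up sample text, for illustration only.
var sampleText = "Title: An example title Author(s): Jane Smith, John Doe " +
    "Journal: Journal of Examples";

var titleRe = "Title: (.*?) Author";
var match = sampleText.match(titleRe);

// match[0] is the whole matched string, including the "Title: " prefix
// and the " Author" suffix that anchor the pattern.
var whole = match[0];

// match[1], which getItem returns, is only the text captured by (.*?).
var title = getItem(sampleText, titleRe);
```

If the pattern does not match, match returns null and item[1] throws, so an error in the debug window at this point usually means the regular expression needs adjusting.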

We repeat the call to getItem for each bibliographic item we are interested in:

    var titleRe = "Title: (.*?) Author";
    var authorsRe = "Author.*?: (.*?)Journal";
    var journalRe = "Journal: (.*?)ISSN";
    var issnRe = "ISSN: (.*?)Year";
    var yearRe = "Year: (.*?) Volume";
    var volRe = "Volume: (.*?)Issue";
    var issueRe = "Issue: (.*?)Page";
    var pageRe = "Page: (.*?)DOI";
    var doiRe = "DOI: (.*?)Publisher";
    var publisherRe = "Publisher: (.*?)Abstract";
    var abstractRe = "Abstract: (.*?)Keywords";
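
The repeated getItem calls can also be driven from a table of patterns. The sketch below is one way to do that; the patterns object, the extractAll helper and the sample text are hypothetical, and the values are trimmed because most of the patterns leave a trailing space in the capture.

```javascript
function getItem(reftext, re) {
    var item = reftext.match(re);
    return item[1];
}

// Field name -> pattern, using the expressions from this tutorial.
var patterns = {
    title: "Title: (.*?) Author",
    authors: "Author.*?: (.*?)Journal",
    journal: "Journal: (.*?)ISSN",
    issn: "ISSN: (.*?)Year",
    year: "Year: (.*?) Volume",
    vol: "Volume: (.*?)Issue",
    issue: "Issue: (.*?)Page",
    page: "Page: (.*?)DOI",
    doi: "DOI: (.*?)Publisher",
    publisher: "Publisher: (.*?)Abstract",
    abstract: "Abstract: (.*?)Keywords"
};

// Hypothetical helper: runs every pattern and trims each captured value.
function extractAll(reftext) {
    var result = {};
    for (var name in patterns) {
        result[name] = getItem(reftext, patterns[name]).replace(/^\s+|\s+$/g, "");
    }
    return result;
}

// Made-up sample text, for illustration only.
var sampleText = "Title: An example title Author(s): Jane Smith, John Doe " +
    "Journal: Journal of Examples ISSN: 1234-5678 Year: 2008 Volume: 12 " +
    "Issue: 3 Page: 45 - 67 DOI: 10.1108/example Publisher: Emerald Group " +
    "Publishing Limited Abstract: A short abstract. Keywords: example";

var bib = extractAll(sampleText);
```

Each pattern relies on the label of the next field as its end marker, so a page that omits a field (for example one without Keywords) would need its pattern adjusted.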

We can also derive the article url from the DOI:

        var articleUrl = "http://www.emeraldinsight.com/"+doi;

Getting this into the Zotero database

Once we've obtained this data and verified with Zotero.debug that the XPath and regular expressions are working, we can start passing the data on to Zotero:

    var newArticle = new Zotero.Item('journalArticle');

    newArticle.title = title;
    newArticle.publicationTitle = journal; // Zotero's field name for the journal title
    newArticle.ISSN = issn;
    newArticle.date = year; // Zotero stores the year in the date field
    newArticle.volume = vol;
    newArticle.issue = issue;
    newArticle.pages = page;
    newArticle.DOI = doi;
    newArticle.publisher = publisher;
    newArticle.abstractNote = abstract;
    newArticle.url = articleUrl;
    Zotero.debug(newArticle);

Authors are a little more complex. Emerald Insight formats authors as “firstname[space]surname” and splits multiple authors with commas. So we can write the following code to make it easy for Zotero to deal with:

    var aus = authors.split(",");
    for (var i = 0; i < aus.length; i++) {
        newArticle.creators.push(Zotero.Utilities.cleanAuthor(aus[i], "author"));
    }

(If the authors were in the format “Smith, John” we'd use a slightly different call to cleanAuthor: Zotero.Utilities.cleanAuthor(aus[i], "author", true) will automatically swap the surname and forename.)
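
To see what the loop produces without running it inside Zotero, the sketch below uses a simplified stand-in for Zotero.Utilities.cleanAuthor. The stand-in, named cleanAuthorSketch here, is hypothetical and only approximates the real function: it assumes the last word is the surname unless the third argument asks for comma-separated “Surname, Forename” input. The author string is made up.

```javascript
// Hypothetical approximation of Zotero.Utilities.cleanAuthor, for illustration only.
function cleanAuthorSketch(author, type, useComma) {
    author = author.replace(/^\s+|\s+$/g, "");
    var firstName, lastName;
    if (useComma) {
        // "Surname, Forename" input
        var parts = author.split(/,\s*/);
        lastName = parts[0];
        firstName = parts.length > 1 ? parts[1] : "";
    } else {
        // "Forename Surname" input: everything after the last space is the surname
        var spaceIndex = author.lastIndexOf(" ");
        lastName = author.substring(spaceIndex + 1);
        firstName = spaceIndex === -1 ? "" : author.substring(0, spaceIndex);
    }
    return { firstName: firstName, lastName: lastName, creatorType: type };
}

// Made-up raw authors string, as the authors regular expression might capture it.
var authors = "Jane Smith, John Doe ";
var creators = [];
var aus = authors.split(",");
for (var i = 0; i < aus.length; i++) {
    creators.push(cleanAuthorSketch(aus[i], "author"));
}

// The comma form, for a name given as "Surname, Forename".
var swapped = cleanAuthorSketch("Smith, John", "author", true);
```

Each entry in creators has the same firstName, lastName and creatorType keys that appear in the debug output for a returned item.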

Finally, the article needs to be saved. This is done with:

        newArticle.complete();

This stores the citation in Zotero's database.

Development tips

Each type of citation has a different set of fields available for it (like the 'title', 'issue', etc. above). One way of finding out what fields are available for a particular type is to look at https://www.zotero.org/trac/browser/extension/branches/1.0/chrome/locale/en-US/zotero/zotero.properties which lists all of the names used by Zotero. Types are listed as “itemTypes” entries, and fields are listed as “itemFields”.