Translator Framework

The translator framework is a way to build web translators that spares authors most of the boilerplate usually required for new translators, making it possible to write simple content scrapers in just a few lines of JavaScript.

The framework was written and contributed by Erik Hetzner and is licensed under the GPLv3+. It currently resides at http://e6h.org/~egh/hg/zotero-transfw/, but there are plans to include it in Zotero itself.

To use the framework, simply insert the framework code at the beginning of your translator, after the translator information block (JSON header). If you are using Scaffold to develop your translator, you won't see the information block, and you can just insert the framework at the top of the code box. The latest version of the code is here.

You'll start writing beneath the line that reads: /* End generic code */

Example Translator

From APN.ru.js (GPLv3+ licensed):

function detectWeb(doc, url) { return FW.detectWeb(doc, url); }
function doWeb(doc, url) { return FW.doWeb(doc, url); }
 
/** Articles */
FW.Scraper({
itemType         : 'newspaperArticle',
detect           : FW.Xpath('//div[@class="block_div"]/div/*[@class="article_title"]'),
title            : FW.Xpath('//div[@class="block_div"]/div/*[@class="article_title"]').text().trim(),
attachments      : FW.Url().replace(/article/,"print").makeAttachment("text/html", "APN.ru Printable"),
creators         : FW.Xpath('//div[@class="block_div"]/div/a[@class="pub_aname"]').text().cleanAuthor("author"),
date             : FW.Xpath('//div[@class="block_div"]/div/span[@class="pub_date"]').text(),
publicationTitle : "Агенство политических новостей"
});
 
/** Search results */
FW.MultiScraper({
itemType  : "multiple",
detect    : FW.Xpath('//div[@class="search_content"]'),
titles    : FW.Xpath('//div[@class="search_content"]/div/a[@class="searchtitle"]').text(),
urls      : FW.Xpath('//div[@class="search_content"]/div/a[@class="searchtitle"]').key('href').text()
});

This is the functional portion of a real, working web translator using the translator framework. It defines two scrapers, in this case one for newspaper articles and one for multiple result pages.

This is the general model for creating a translator using the framework – define several scrapers that are triggered by different kinds of page content or URLs.

Scrapers

As the example translator above shows, there are two kinds of scrapers in the framework, defined using the functions FW.Scraper() and FW.MultiScraper(). The first kind extracts the metadata for a single item from a single page, while the second kind collects the URLs of item pages listed on a single page and is usually used for things like search results or journal issue tables of contents.

Both kinds of scrapers are defined by passing an object with the scraper's item type (itemType), detect conditions (detect) and other keys to the corresponding function.

Each value can be a string, in which case it is always the same; a function; or a chained series of filters. The last form is the most common. In the example above, for instance, the creators value starts with an XPath expression, which the .text() function then reduces to text only. Finally, the author name is cleaned up using the cleanAuthor function provided by Zotero.
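To make the three value forms concrete, here is a minimal plain-JavaScript sketch of how a constant, a function, and a filter chain could each be resolved to a final value. The Chain class, the resolve helper, and the sample scraper object are hypothetical stand-ins invented for illustration. They are not the framework's code, and the real FW filters work on DOM documents rather than plain strings.

```javascript
// Hypothetical stand-in for a filter chain: each method call records one
// step, and evaluate() runs the recorded steps in order.
class Chain {
  constructor(steps = []) {
    this.steps = steps;
  }
  // Return a new chain with one more step added; the original is unchanged.
  add(step) {
    return new Chain(this.steps.concat([step]));
  }
  trim() {
    return this.add((s) => s.trim());
  }
  prepend(text) {
    return this.add((s) => text + s);
  }
  evaluate(input) {
    return this.steps.reduce((value, step) => step(value), input);
  }
}

// Resolve a scraper value given in any of the three forms.
function resolve(value, input) {
  if (value instanceof Chain) return value.evaluate(input);
  if (typeof value === "function") return value(input);
  return value; // a plain string is always the same
}

// Illustrative scraper definition (names and values are assumptions).
const scraper = {
  publicationTitle: "Example News",
  title: new Chain().trim().prepend("Title: "),
  date: (input) => input.length + " characters",
};

const resolved = {};
for (const key of Object.keys(scraper)) {
  resolved[key] = resolve(scraper[key], "  Hello  ");
}
console.log(resolved);
```

The point of the chained form is that each filter only describes a step; nothing is evaluated until the scraper is actually run against a page.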

FW.Scraper
FW.MultiScraper
  • Required keys: detect, itemType, titles, urls
  • Optional keys: attachments, beforeFilter

The titles and urls keys should be set to expressions that yield sets of corresponding titles and URLs of items to be processed by other scrapers defined in the translator. If set, the attachments key should be set to an expression yielding a corresponding set of attachment objects (that is, one for each title and URL).
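The correspondence between the two sets can be pictured as a positional pairing: the first title belongs to the first URL, and so on. The sketch below assumes both sets arrive as plain arrays; pairItems is a hypothetical helper, and the example.org URLs are placeholders. How the framework combines the sets internally is not described here.

```javascript
// Hypothetical helper: pair the i-th URL with the i-th title.
function pairItems(titles, urls) {
  if (titles.length !== urls.length) {
    throw new Error("titles and urls must have the same number of entries");
  }
  const items = {};
  urls.forEach((url, i) => {
    items[url] = titles[i];
  });
  return items;
}

const items = pairItems(
  ["First article", "Second article"],
  ["http://example.org/article/1", "http://example.org/article/2"]
);
console.log(items);
```

If the two expressions select different numbers of nodes, titles end up attached to the wrong URLs, so it is usually safest to derive both from the same XPath, as the example translator does.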

The beforeFilter key can be set to a function that returns a new URL. The framework will request this new document and run the MultiScraper in the context of the resulting document and URL. This is used in the framework translator for Google Scholar:

beforeFilter : function (doc, url) {
    // Check whether the page already contains links to BibTeX files.
    var haveBibTeXLinks = FW.Xpath('//a[contains(@href, "scholar.bib")]')
                            .evaluate(doc);
    if (!haveBibTeXLinks) {
        // Drop any hl parameter, then go through the preferences URL.
        url = url.replace(/hl\=[^&]*&?/, "");
        url = url.replace("scholar?",
            "scholar_setprefs?hl=en&scis=yes&scisf=4&submit=Save+Preferences&");
    }
    return url;
}

Here the option is used to guarantee that the multiple item page has links to the BibTeX files that the translator uses.
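The effect of the two replace calls is easiest to see on a concrete URL. The sketch below applies the same replacements outside the framework; the sample search URL and the rewriteScholarUrl name are assumptions made for illustration.

```javascript
// Same two replacements as in the beforeFilter above, applied to a sample URL.
function rewriteScholarUrl(url) {
  // Drop any existing hl=... parameter, including a trailing "&".
  url = url.replace(/hl=[^&]*&?/, "");
  // Swap the search path for the preferences path used in the original
  // snippet, keeping the remaining query parameters after it.
  return url.replace(
    "scholar?",
    "scholar_setprefs?hl=en&scis=yes&scisf=4&submit=Save+Preferences&"
  );
}

// Hypothetical search URL.
const sample = "http://scholar.google.com/scholar?hl=de&q=zotero";
const rewritten = rewriteScholarUrl(sample);
console.log(rewritten);
// http://scholar.google.com/scholar_setprefs?hl=en&scis=yes&scisf=4&submit=Save+Preferences&q=zotero
```

Note that the string form of replace only touches the first occurrence of "scholar?", which is why the host name is left alone.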

Delegation

It is possible to have a translator using this framework delegate processing to another translator, by setting the key itemTrans, as in this example from the framework-derived version of the Google Scholar translator:

itemTrans : FW.DelegateTranslator({ translatorType : "import",
                                    translatorId   : "9cb70025-a888-4a29-a210-93ec52da40d4"}),

Functions

FIXME Functions that can be used with the framework.

Main functions

  • FW.PageText ( )
  • FW.Url ( )
  • FW.Xpath ( expression )
  • FW.Scraper ( {..} )
  • FW.MultiScraper ( {..} )

String functions

  • prepend ( text ) Add a string to the beginning of the result.
  • append ( text ) Add a string to the end of the result.
  • remove ( regex, flags ) Remove the text matching the regex. Note that entries left empty are dropped silently, so this can also be used as a filter.
  • trim ()
  • trimInternal ()
  • match ( regex, [ group ] ) Match the regex, and pass on the match group. If no group is specified, the whole match is used.
  • capitalizeTitle ( ) FIXME Should support flag?
  • unescapeHTML ( text )
  • unescape ( text )
  • key ( key )
  • split ( regex ) Split the string into multiple strings on the regex.
  • join ( separator ) Join all the strings into one, using the separator between them.
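
For readers who think in native JavaScript, several of the operations in the list roughly correspond to the built-in string and array calls below. This is only an analogy based on the descriptions above: the framework's versions are chained onto FW expressions, and details such as group handling or the treatment of empty results may differ. The sample input is made up.

```javascript
const raw = "  By Jane Doe; John Smith  ";

// trim: strip leading and trailing whitespace.
const trimmed = raw.trim();

// remove(regex): delete the matching text.
const names = trimmed.replace(/^By /, "");

// split(regex): one string becomes several.
const parts = names.split(/;\s*/);

// join(separator): several strings become one.
const joined = parts.join(" and ");

// match(regex, group): keep only the chosen capture group.
const surname = parts[0].match(/(\w+) (\w+)/)[2];

// prepend / append: add text before / after the result.
const labelled = "Authors: " + joined + ".";

console.log(parts, joined, surname, labelled);
```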

Zotero functions

  • cleanAuthor ( text, useComma )
  • makeAttachment ( type, title )
dev/translator_framework.1302470805.txt.gz · Last modified: 2011/04/10 17:26 by ajlyon