Translator Framework

The translator framework is a way to build web translators that spares translator authors most of the boilerplate usually required for new translators, making it possible to write simple content scrapers in just a few lines of JavaScript.

The framework was written and contributed by Erik Hetzner and is licensed under the GPLv3+. It currently resides at http://e6h.org/~egh/hg/zotero-transfw/, but there are plans to include it in Zotero itself.

To use the framework, simply insert the framework code at the beginning of your translator, after the translator information block (JSON header). If you are using Scaffold to develop your translator, you won't see the information block, and you can just insert the framework at the top of the code box. The latest version of the code is here.

You'll start writing beneath the line that reads: /* End generic code */

Example Translator

From APN.ru.js (GPLv3+ licensed):

function detectWeb(doc, url) { return FW.detectWeb(doc, url); }
function doWeb(doc, url) { return FW.doWeb(doc, url); }
 
/** Articles */
FW.Scraper({
  itemType         : 'newspaperArticle',
  detect           : FW.Xpath('//div[@class="block_div"]/div/*[@class="article_title"]'),
  title            : FW.Xpath('//div[@class="block_div"]/div/*[@class="article_title"]').text().trim(),
  attachments      : FW.Url().replace(/article/,"print").makeAttachment("text/html", "APN.ru Printable"),
  creators         : FW.Xpath('//div[@class="block_div"]/div/a[@class="pub_aname"]').text().cleanAuthor("author"),
  date             : FW.Xpath('//div[@class="block_div"]/div/span[@class="pub_date"]').text(),
  publicationTitle : "Агенство политических новостей"
});
 
/** Search results */
FW.MultiScraper({
  itemType  : "multiple",
  detect    : FW.Xpath('//div[@class="search_content"]'),
  titles    : FW.Xpath('//div[@class="search_content"]/div/a[@class="searchtitle"]').text(),
  urls      : FW.Xpath('//div[@class="search_content"]/div/a[@class="searchtitle"]').key('href').text()
});

This is the functional portion of a real, working web translator using the translator framework. It defines two scrapers, in this case one for newspaper articles and one for multiple result pages.

This is the general model for creating a translator with the framework: define several scrapers, each triggered by a different kind of page content or URL.
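
The dispatch idea can be illustrated without the framework itself. The following is a minimal, self-contained sketch, not the framework's actual code: `scrapers`, `defineScraper` and `detectType` are hypothetical names, and detection is simplified to a plain function of the URL, whereas a real translator would use expressions such as FW.Xpath evaluated against the page. It also assumes that scrapers are tried in the order they were defined and that the first match wins.

```javascript
// Hypothetical stand-in for the framework's list of defined scrapers.
const scrapers = [];

// Register a scraper definition (plays the role of FW.Scraper / FW.MultiScraper).
function defineScraper(definition) {
  scrapers.push(definition);
}

// Return the item type of the first scraper whose detect condition matches,
// or false when no scraper applies (assumption: definition order decides).
function detectType(url) {
  for (const scraper of scrapers) {
    if (scraper.detect(url)) {
      return scraper.itemType;
    }
  }
  return false;
}

// One scraper for single article pages...
defineScraper({
  itemType: "newspaperArticle",
  detect: (url) => /\/article\//.test(url),
});

// ...and one for pages that list several items.
defineScraper({
  itemType: "multiple",
  detect: (url) => /\/search\b/.test(url),
});
```

With these two definitions, `detectType("http://example.org/article/123")` returns `"newspaperArticle"`, a search URL such as `http://example.org/search?q=x` returns `"multiple"`, and any other URL returns `false`.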

Scrapers

As the example translator above shows, there are two kinds of scrapers in the framework, defined using the functions FW.Scraper() and FW.MultiScraper(). The first kind extracts metadata for a single item from a single page, while the second kind identifies item page URLs on a single page and is usually used for things like search results or journal issue tables of contents.

Both kinds of scrapers are defined by passing an object with the scraper's item type (itemType), detect conditions (detect) and other keys to the corresponding function.
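
Because a scraper is just an object with named keys, a definition can be sanity-checked before it is handed to the framework. The sketch below is an illustration only: `missingKeys` is a hypothetical helper that is not part of the framework, the required keys are those listed for FW.MultiScraper in this section, and plain strings stand in for FW.Xpath expressions so that the snippet runs on its own.

```javascript
// Required keys for FW.MultiScraper, as listed in this document.
const MULTI_SCRAPER_REQUIRED = ["detect", "itemType", "titles", "urls"];

// Hypothetical helper: list the required keys that a definition lacks.
function missingKeys(definition, required) {
  return required.filter((key) => !(key in definition));
}

// Plain strings stand in for FW.Xpath(...) expressions in this sketch.
// The urls key has been left out on purpose.
const searchResults = {
  itemType: "multiple",
  detect: '//div[@class="search_content"]',
  titles: '//div[@class="search_content"]/div/a[@class="searchtitle"]',
};

const missing = missingKeys(searchResults, MULTI_SCRAPER_REQUIRED);
if (missing.length > 0) {
  console.log("Missing keys: " + missing.join(", "));
}
```

Here the check reports `urls` as missing; adding that key to the definition makes the list of missing keys empty.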

FW.Scraper

FW.MultiScraper

  • Required keys: detect, itemType, titles, urls
  • Optional keys: attachments

Delegation

A translator using this framework can delegate processing to another translator by setting the key itemTrans, as in this example from the framework-derived version of the Google Scholar translator:

itemTrans : FW.DelegateTranslator({ translatorType : "import",
                                    translatorId   : "9cb70025-a888-4a29-a210-93ec52da40d4"}),

Functions

FIXME Functions that can be used with the framework.

Main functions

  • FW.PageText ( )
  • FW.Url ( )
  • FW.Xpath ( expression )
  • FW.Scraper ( {..} )
  • FW.MultiScraper ( {..} )

String functions

  • prepend ( text )
  • append ( text )
  • remove ( regex, flags ): note that empty entries are dropped silently, so this can also be used to filter
  • trim ()
  • trimInternal ()
  • match ( regex, [ group ] )
  • capitalizeTitle ( ) FIXME Should support flag?
  • unescapeHTML ( text )
  • unescape ( text )
  • key ( key )
  • split ( regex )
  • join ( separator )
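
The string functions are meant to be chained, as in `.text().trim()` in the example translator. The following self-contained mock sketches how such a chain might behave; it is not the framework's implementation. `StringChain`, `addFilter` and `evaluate` are hypothetical names, only a few of the listed functions are mocked, and the behaviour of `trimInternal` (collapse runs of whitespace, then trim) is an assumption. The `remove` mock drops strings that end up empty, following the note on `remove` in the list.

```javascript
// Hypothetical mock of a chainable string source; NOT the framework's code.
class StringChain {
  constructor(source) {
    this.source = source; // function returning an array of strings
    this.filters = [];    // functions applied in order to that array
  }

  // Queue a per-string transformation and return this for chaining.
  addFilter(fn) {
    this.filters.push((values) => values.map(fn));
    return this;
  }

  prepend(text) { return this.addFilter((s) => text + s); }
  append(text) { return this.addFilter((s) => s + text); }
  trim() { return this.addFilter((s) => s.trim()); }

  // Assumption: collapse internal whitespace runs to one space, then trim.
  trimInternal() { return this.addFilter((s) => s.replace(/\s+/g, " ").trim()); }

  // Delete matches, then drop strings that are left empty.
  remove(regex, flags) {
    const re = new RegExp(regex, flags);
    this.filters.push((values) =>
      values.map((s) => s.replace(re, "")).filter((s) => s.length > 0));
    return this;
  }

  // Run the source, then every queued filter in order.
  evaluate(input) {
    return this.filters.reduce((values, filter) => filter(values), this.source(input));
  }
}

// Usage: split a byline into names, clean each one, and label it.
const authors = new StringChain((text) => text.split(";"))
  .trim()
  .remove("^By\\s+", "i")
  .trimInternal()
  .prepend("Author: ");

const result = authors.evaluate("By Jane Doe; ;John  Smith");
// result is ["Author: Jane Doe", "Author: John Smith"]
```

Nothing is evaluated while the chain is being built: each call only queues a filter, and the work happens when `evaluate` is called. The blank middle entry disappears at the `remove` step, which is the filtering effect mentioned in the list.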

Zotero functions

  • cleanAuthor ( type, useComma )
  • makeAttachment ( type, title )
dev/translator_framework.1302022482.txt.gz · Last modified: 2011/04/05 12:54 by ajlyon