Translator Framework

The translator framework is a way to build web translators that spares authors most of the boilerplate new translators usually require, making it possible to write simple content scrapers in just a few lines of JavaScript.

The framework was written and contributed by Erik Hetzner and is licensed under the GPLv3+. It currently resides at http://e6h.org/~egh/hg/zotero-transfw/, but there are plans to include it in Zotero itself.

To use the framework, simply insert the framework code at the beginning of your translator, after the translator information block (JSON header). If you are using Scaffold to develop your translator, you won't see the information block, and you can just insert the framework at the top of the code box. The latest version of the code is here.

You'll start writing beneath the line that reads: /* End generic code */

Example Translator

From APN.ru.js (GPLv3+ licensed):

function detectWeb(doc, url) { return FW.detectWeb(doc, url); }
function doWeb(doc, url) { return FW.doWeb(doc, url); }
 
/** Articles */
FW.Scraper({
  itemType         : 'newspaperArticle',
  detect           : FW.Xpath('//div[@class="block_div"]/div/*[@class="article_title"]'),
  title            : FW.Xpath('//div[@class="block_div"]/div/*[@class="article_title"]').text().trim(),
  attachments      : FW.Url().replace(/article/,"print").makeAttachment("text/html", "APN.ru Printable"),
  creators         : FW.Xpath('//div[@class="block_div"]/div/a[@class="pub_aname"]').text().cleanAuthor("author"),
  date             : FW.Xpath('//div[@class="block_div"]/div/span[@class="pub_date"]').text(),
  publicationTitle : "Агенство политических новостей"
});
 
/** Search results */
FW.MultiScraper({
  itemType : "multiple",
  detect   : FW.Xpath('//div[@class="search_content"]'),
  titles   : FW.Xpath('//div[@class="search_content"]/div/a[@class="searchtitle"]').text(),
  urls     : FW.Xpath('//div[@class="search_content"]/div/a[@class="searchtitle"]').key('href').text()
});

This is the functional portion of a real, working web translator built with the translator framework. It defines two scrapers: one for single newspaper articles and one for pages that list multiple results.
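The scraper definitions above are declarative: each property is either a constant or a chain such as FW.Url().replace(...) that is only evaluated once a page is being processed. The sketch below is a toy model of that deferred-chain pattern in plain JavaScript. It is not the framework's actual implementation; Chain, addStep, fromUrl and evaluate are hypothetical names invented for this illustration.

```javascript
// Toy model of a deferred filter chain (hypothetical; not the framework's code).
class Chain {
  constructor(source, steps = []) {
    this.source = source; // function (doc, url) -> array of strings
    this.steps = steps;   // functions applied to each string, in order
  }
  // Return a new chain with one more step; the original chain is unchanged.
  addStep(fn) {
    return new Chain(this.source, this.steps.concat([fn]));
  }
  trim() { return this.addStep(s => s.trim()); }
  prepend(text) { return this.addStep(s => text + s); }
  replace(regex, replacement) {
    return this.addStep(s => s.replace(regex, replacement));
  }
  // Nothing runs until evaluate() is called with a concrete page.
  evaluate(doc, url) {
    return this.source(doc, url).map(
      s => this.steps.reduce((acc, fn) => fn(acc), s)
    );
  }
}

// Hypothetical stand-in for FW.Url(): the chain starts from the page URL.
function fromUrl() {
  return new Chain((doc, url) => [url]);
}

// Build the chain once, then evaluate it against a made-up article URL.
const printable = fromUrl().replace(/article/, "print");
const result = printable.evaluate(null, "http://example.org/article/123");
console.log(result);
```

Because every filter returns a new chain, a definition can be built when the translator loads and reused for each page it is later asked to scrape.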

Functions

FIXME Functions that can be used with the framework.

Main functions

  • FW.PageText ( )
  • FW.Url ( )
  • FW.Xpath ( expression )
  • FW.Scraper ( {..} )
  • FW.MultiScraper ( {..} )

String functions

  • prepend ( text )
  • append ( text )
  • remove ( regex, flags )
  • trim ( )
  • trimInternal ( )
  • split ( regex )
  • match ( regex, [ group ] )
  • capitalizeTitle ( ) FIXME Should support flag?
  • unescapeHTML ( text )
  • unescape ( text )
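
To give a feel for what several of these filters do, here is a sketch of plain-JavaScript approximations. The behaviour shown is assumed from the function names and has not been checked against the framework source. In the framework the filters are chained onto FW.Xpath, FW.Url or FW.PageText; here they are standalone functions that take the string as their first argument. capitalizeTitle, unescapeHTML and unescape are left out.

```javascript
// Plain-JavaScript approximations of some string filters (assumed semantics).
const prepend = (s, text) => text + s;
const append = (s, text) => s + text;
// remove: delete what the pattern matches; flags are passed on to RegExp.
const remove = (s, pattern, flags) => s.replace(new RegExp(pattern, flags), "");
const trim = s => s.trim();
// trimInternal: collapse runs of whitespace to one space, then trim the ends.
const trimInternal = s => s.replace(/\s+/g, " ").trim();
const split = (s, regex) => s.split(regex);
// match: return the requested capture group (0 = whole match), or null.
const match = (s, regex, group = 0) => {
  const m = s.match(regex);
  return m ? m[group] : null;
};

// Made-up byline text, as it might come out of an XPath query.
const raw = "  By  John   Smith;  Jane Doe \n";
const cleaned = trimInternal(remove(raw, "^\\s*By\\s+", "i"));
const authors = split(cleaned, /\s*;\s*/);
const year = match("Published 2011-04-04", /(\d{4})-(\d{2})/, 1);
const label = append(prepend(trim("  APN.ru  "), "["), "]");
console.log(cleaned, authors, year, label);
```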

Node function

  • key ( key )

Zotero functions

  • cleanAuthor ( type, useComma )
  • makeAttachment ( type, title )
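
In the example translator, cleanAuthor("author") turns scraped name text into a creator, and makeAttachment("text/html", "APN.ru Printable") turns a URL into an attachment. The sketch below shows the shape of the objects such filters produce, using simplified stand-ins. cleanAuthorSketch and makeAttachmentSketch are hypothetical names, and the name-splitting rule is a deliberate simplification rather than Zotero's actual cleanup logic.

```javascript
// Simplified stand-ins (hypothetical) showing the shape of the results.
function cleanAuthorSketch(name, type, useComma) {
  name = name.replace(/\s+/g, " ").trim();
  let firstName, lastName;
  const comma = name.indexOf(",");
  if (useComma && comma !== -1) {
    // "Last, First" form
    lastName = name.slice(0, comma).trim();
    firstName = name.slice(comma + 1).trim();
  } else {
    // "First Last" form: the last word is taken as the family name
    const space = name.lastIndexOf(" ");
    firstName = space === -1 ? "" : name.slice(0, space);
    lastName = space === -1 ? name : name.slice(space + 1);
  }
  return { firstName, lastName, creatorType: type };
}

function makeAttachmentSketch(url, mimeType, title) {
  return { url, mimeType, title };
}

const plain = cleanAuthorSketch("Erik Hetzner", "author", false);
const inverted = cleanAuthorSketch("Hetzner, Erik", "author", true);
const attachment = makeAttachmentSketch(
  "http://example.org/print/123", "text/html", "APN.ru Printable"
);
console.log(plain, inverted, attachment);
```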
dev/translator_framework.1301949397.txt.gz · Last modified: 2011/04/04 16:36 by ajlyon