Differences

This shows you the differences between two versions of the page.

dev:translator_framework [2011/04/10 17:26] – more on FW ajlyon
dev:translator_framework [2011/04/20 17:07] rmzelle
Line 1: Line 1:
-===== Translator Framework ===== +See [[dev/translators/Framework]].
-The translator framework is a way to build web translators that lets authors avoid most of the boilerplate usually required for new translators, making it possible to write simple content scrapers in just a few lines of JavaScript. +
- +
-The framework was written and contributed by Erik Hetzner and is licensed under the GPLv3+. It currently resides at http://e6h.org/~egh/hg/zotero-transfw/, but there are plans to include it in Zotero itself. +
- +
-To use the framework, simply insert the framework code at the beginning of your translator, after the translator information block (JSON header). If you are using [[dev/scaffold|Scaffold]] to develop your translator, you won't see the information block, and you can just insert the framework at the top of the code box. The latest version of the code is [[http://e6h.org/~egh/hg/zotero-transfw/raw-file/tip/framework.js|here]]. +
- +
-You'll start writing beneath the line that reads: +
-''/* End generic code */'' +
- +
-===Example Translator=== +
-From APN.ru.js (GPLv3+ licensed): +
-<code javascript> +
-function detectWeb(doc, url) { return FW.detectWeb(doc, url); } +
-function doWeb(doc, url) { return FW.doWeb(doc, url); } +
- +
-/** Articles */ +
-FW.Scraper({ +
-itemType         : 'newspaperArticle', +
-detect           : FW.Xpath('//div[@class="block_div"]/div/*[@class="article_title"]'), +
-title            : FW.Xpath('//div[@class="block_div"]/div/*[@class="article_title"]').text().trim(), +
-attachments      : FW.Url().replace(/article/,"print").makeAttachment("text/html", "APN.ru Printable"), +
-creators         : FW.Xpath('//div[@class="block_div"]/div/a[@class="pub_aname"]').text().cleanAuthor("author"), +
-date             : FW.Xpath('//div[@class="block_div"]/div/span[@class="pub_date"]').text(), +
-publicationTitle : "Агенство политических новостей" +
-}); +
- +
-/** Search results */ +
-FW.MultiScraper({ +
-itemType  : "multiple", +
-detect    : FW.Xpath('//div[@class="search_content"]'), +
-titles    : FW.Xpath('//div[@class="search_content"]/div/a[@class="searchtitle"]').text(), +
-urls    : FW.Xpath('//div[@class="search_content"]/div/a[@class="searchtitle"]').key('href').text() +
-}); +
-</code> +
- +
-This is the functional portion of a real, working web translator using the translator framework. It defines two scrapers, in this case one for newspaper articles and one for multiple result pages. +
- +
-This is the general model for creating a translator using the framework -- define several scrapers that are triggered by different kinds of page content or URLs. +
-  +
-===Scrapers=== +
-As the example translator above shows, there are two kinds of scrapers in the framework, defined using the functions ''FW.Scraper()'' and ''FW.MultiScraper()''. The first kind identifies the metadata for a single item on a single page, while the second kind identifies item page URLs on a single page and is usually used for things like search results or journal issue tables of contents. +
- +
-Both kinds of scrapers are defined by passing an object with the scraper's item type (''itemType''), detect conditions (''detect'') and other keys to the corresponding function. +
- +
-Each value can be a string (in which case it is always the same), a function, or a chained series of filters. This last form is the most common. In the example above, for instance, the ''creators'' value starts with an XPath expression. The matched nodes are then reduced to their text using the ''.text()'' function. Finally, the author name is cleaned up using the ''cleanAuthor'' function provided by Zotero. +
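
The chaining described above can be pictured with a small stand-alone sketch. This is not the framework's code: ''makeChain'' and its internals are hypothetical names, and only a few of the string filters are mimicked. It is meant only to show how each call might queue a step that runs later, when the chain is evaluated against a page.

```javascript
// Illustrative stand-in for a filter chain; makeChain is a hypothetical
// helper, not part of the translator framework.
function makeChain(source) {
  const steps = [];
  const chain = {
    // Queue a step that transforms every string in the result set.
    map(fn) {
      steps.push(values => values.map(fn));
      return chain; // returning the chain is what allows a.b().c()
    },
    trim() { return chain.map(s => s.trim()); },
    prepend(text) { return chain.map(s => text + s); },
    append(text) { return chain.map(s => s + text); },
    // Run the source first, then each queued step in order.
    evaluate(input) {
      return steps.reduce((values, step) => step(values), source(input));
    }
  };
  return chain;
}

// A source that splits its input on ";" stands in for an XPath lookup.
const authors = makeChain(input => input.split(";")).trim().prepend("By ");
console.log(authors.evaluate(" Jane Doe ; John Roe "));
```

Because nothing runs until ''evaluate'' is called, the same chain can be defined once and applied to any number of inputs, which is presumably why scraper definitions can be written as static objects.
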
- +
-== FW.Scraper == +
-  * Required keys: ''detect'', ''itemType'' +
-  * Optional keys: ''attachments'', all [[http://gsl-nagoya-u.net/http/pub/csl-fields/index.html|Zotero item fields]] +
- +
-== FW.MultiScraper == +
-  * Required keys: ''detect'', ''itemType'', ''titles'', ''urls'' +
-  * Optional keys: ''attachments'', ''beforeFilter'' +
- +
-The ''titles'' and ''urls'' keys should be set to expressions that yield sets of corresponding titles and URLs of items to be processed by other scrapers defined in the translator. If set, the ''attachments'' key should be set to an expression yielding a corresponding set of attachment objects (that is, one for each title and URL). +
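
One way to picture the correspondence between the two sets is the sketch below, which pairs titles and URLs by position. ''pairItems'' is a hypothetical helper written for this illustration and the sample values are invented; the framework's own handling may differ.

```javascript
// Hypothetical helper: pair the i-th URL with the i-th title.
function pairItems(titles, urls) {
  if (titles.length !== urls.length) {
    throw new Error("titles and urls must have the same length");
  }
  const items = {};
  urls.forEach((url, i) => { items[url] = titles[i]; });
  return items;
}

// Invented sample data standing in for the results of two XPath filters.
const items = pairItems(
  ["First article", "Second article"],
  ["http://example.com/article/1", "http://example.com/article/2"]
);
console.log(items);
```

A length mismatch in a check like this would suggest that the two expressions select different sets of nodes.
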
- +
-The ''beforeFilter'' key can be set to a function that returns a new URL. The framework will request this new document and run the MultiScraper in the context of the resulting document and URL. This is used in the framework translator for Google Scholar: +
-<code javascript> +
-beforeFilter : function (doc, url) { +
-                var haveBibTeXLinks = FW.Xpath('//a[contains(@href, "scholar.bib")]') +
-                      .evaluate(doc); +
-                if(!haveBibTeXLinks) { +
-                      url = url.replace (/hl\=[^&]*&?/, ""); +
-                      url = url.replace("scholar?", +
-                         "scholar_setprefs?hl=en&scis=yes&scisf=4&submit=Save+Preferences&"); +
-                 } +
-                 return url; +
-}  +
-</code> +
-Here the option is used to guarantee that the multiple item page has links to the BibTeX files that the translator uses. +
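
To see what the URL rewriting in this ''beforeFilter'' does on its own, the two replacements can be run as a plain function outside the framework. The function name ''forceBibTeXPrefs'' and the sample URL are invented for this illustration.

```javascript
// Same two replacements as in the beforeFilter above, as a plain function.
// forceBibTeXPrefs is a hypothetical name used only for this sketch.
function forceBibTeXPrefs(url) {
  // Drop any existing interface-language parameter.
  url = url.replace(/hl\=[^&]*&?/, "");
  // Route the request through the preferences page; a string pattern
  // replaces only the first occurrence of "scholar?".
  return url.replace("scholar?",
    "scholar_setprefs?hl=en&scis=yes&scisf=4&submit=Save+Preferences&");
}

// Invented sample URL, used only for illustration.
console.log(forceBibTeXPrefs("http://scholar.google.com/scholar?hl=de&q=zotero"));
```

The original query parameters other than ''hl'' are kept after the inserted preference settings.
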
- +
-== Delegation == +
-It is possible to have a translator using this framework delegate processing to another translator, by setting the key ''itemTrans'', as in this example from the framework-derived version of the Google Scholar translator: +
- +
-<code javascript> +
-itemTrans : FW.DelegateTranslator({ translatorType : "import", +
-                                    translatorId   : "9cb70025-a888-4a29-a210-93ec52da40d4"}), +
-</code> +
- +
-==== Functions ==== +
-FIXME Functions that can be used with the framework. +
-=== Main functions === +
-  * ''FW.PageText ( )'' +
-  * ''FW.Url ( )'' +
-  * ''FW.Xpath ( expression )'' +
-  * ''FW.Scraper ( {..} )'' +
-  * ''FW.MultiScraper ( {..} )'' +
-=== String functions === +
-  * ''prepend ( text )'' Add a string to the beginning of the result. +
-  * ''append ( text )'' Add a string to the end of the result. +
-  * ''remove ( regex, flags )'' Remove text matching the regex. Note that entries left empty are dropped silently, so this can also be used as a filter. +
-  * ''trim ()'' +
-  * ''trimInternal ()'' +
-  * ''match ( regex, [ group ] )'' Match the regex, and pass on the match group. If no group is specified, the whole match is used. +
-  * ''capitalizeTitle ( )'' FIXME Should support flag? +
-  * ''unescapeHTML ( text )'' +
-  * ''unescape ( text )'' +
-  * ''key ( key )'' +
-  * ''split ( regex )'' Split the string into multiple strings on the regex. +
-  * ''join ( separator )'' Join all the strings into one, using the separator between them. +
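
The descriptions above correspond roughly to the following plain JavaScript operations on a list of strings. This is an approximation based only on those descriptions, not the framework's implementation; the helper names ''matchGroup'' and ''removeAndDrop'' are hypothetical, and skipping non-matching entries in ''matchGroup'' is an assumption.

```javascript
// match ( regex, [ group ] ): keep the chosen group, or the whole match.
// Assumption: entries that do not match are skipped.
function matchGroup(values, regex, group = 0) {
  return values
    .map(s => s.match(regex))
    .filter(m => m !== null)
    .map(m => m[group]);
}

// remove ( regex, flags ): delete matches, then drop entries left empty.
function removeAndDrop(values, regex, flags) {
  const re = new RegExp(regex, flags);
  return values.map(s => s.replace(re, "")).filter(s => s !== "");
}

// Invented sample strings.
const raw = ["Published: 2011-04-10", "n/a"];
console.log(matchGroup(raw, /(\d{4})-\d{2}-\d{2}/, 1)); // year only
console.log(removeAndDrop(raw, "n/a", "g"));            // the "n/a" entry disappears
console.log("Doe; Roe".split(/;\s*/).join(" and "));    // split, then join
```
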
-=== Zotero functions === +
-  * ''cleanAuthor ( text, useComma )'' +
-  * ''makeAttachment ( type, title )'' +
dev/translator_framework.txt · Last modified: 2017/11/12 19:53 by 127.0.0.1