Differences

This shows you the differences between two versions of the page.

dev:translators:framework [2011/05/27 16:48] – [Scrapers] oops ajlyon
dev:translators:framework [2017/11/18 16:20] (current) – Add legacy notice adamsmith
Line 1: Line 1:
 +**While the Translator Framework still works, new translators using the Framework are no longer accepted, and we are migrating existing translators away from the format.**
 +
 +**This page exists as legacy documentation only.**
 +
 ====== Translator Framework ====== ====== Translator Framework ======
 The translator framework is a way to build web translators that lets translator authors avoid most of the boilerplate that usually is required for new translators, making it possible to write simple content scrapers in just a few lines of JavaScript. The translator framework is a way to build web translators that lets translator authors avoid most of the boilerplate that usually is required for new translators, making it possible to write simple content scrapers in just a few lines of JavaScript.
  
-The framework was written and contributed by Erik Hetzner and is licensed under the GPLv3+. It currently resides at http://e6h.org/~egh/hg/zotero-transfw/, but there are plans to include it in Zotero itself.+The framework was written and contributed by Erik Hetzner and is licensed under the AGPLv3+. It currently resides at https://gitlab.com/egh/zotero-transfw, but there are plans to include it in Zotero itself.
  
-To use the framework, simply insert the framework code at the beginning of your translator, after the translator information block (JSON header). If you are using [[dev/translators/Scaffold]] to develop your translator, you won't see the information block, and you can just insert the framework at the top of the code box. The latest version of the code is [[http://e6h.org/~egh/hg/zotero-transfw/raw-file/tip/framework.js|here]].+To use the framework, simply insert the framework code at the beginning of your translator, after the translator information block (JSON header). If you are using [[dev/translators/Scaffold]] to develop your translator, you won't see the information block, and can click "Uses translator framework" on Scaffold's Metadata tab to automatically include the code.
  
 You'll start writing beneath the line that reads: You'll start writing beneath the line that reads:
 ''/* End generic code */'' ''/* End generic code */''
  
-**Note: The most recent version of the framework has been changed somewhat, and the details below on MultiScraper and attachments are currently outdated. This will soon be fixed.** +**Note:** The latest version of [[dev/translators/Scaffold]], the Zotero translator IDE, can automatically include the framework code.
 ====Example Translator==== ====Example Translator====
 From EurasiaNet.js: From EurasiaNet.js:
Line 55: Line 58:
  
 === FW.Scraper === === FW.Scraper ===
-  * Required keys: ''detect'', ''itemType'' +  * Required keys: ''detect'', ''itemType'' ([[http://aurimasv.github.io/z2csl/typeMap.xml|list of itemType options]])
-  * Optional keys: ''attachments'', all [[http://gsl-nagoya-u.net/http/pub/csl-fields/index.html|Zotero item fields]]+  * Optional keys: ''attachments''
  
 === FW.MultiScraper === === FW.MultiScraper ===
Line 86: Line 89:
 Here the option is used to guarantee that the multiple item page has links to the BibTeX files that the translator uses. Here the option is used to guarantee that the multiple item page has links to the BibTeX files that the translator uses.
  
-== Delegation == +=== Attachments === 
-It is possible to have a translator using this framework delegate processing to another translator, by setting the key ''itemTrans'', as in this example from the framework-derived version of the Google Scholar translator:+To add attachments to an item, specify the ''attachments'' key in the scraper object. The key should be set to an array of attachment objects: ''{ url : ... , title : ... , type : ... }''.
  
 +The keys ''url'', ''title'' and ''type'' can be set to single values (like ''FW.Url()'' or ''"Page Snapshot"'') or to multiple values, as in the XPath constructions below.
 <code javascript> <code javascript>
-itemTrans : FW.DelegateTranslator({ translatorType : "import", + attachments : [ 
-                                    translatorId   : "9cb70025-a888-4a29-a210-93ec52da40d4"}),+  { url : FW.Url(), 
 +    title : FW.Url().match(/[0-9]+$/).prepend("Washington Monthly Snapshot pg. "), 
 +    type : "text/html" }, 
 +  { url : FW.Xpath('//div[@class="pagination"]/a[not(@class)]').key('href').text(),
 +    title : FW.Xpath('//div[@class="pagination"]/a').
 +                        key('href').text().
 +                        match(/[0-9]+$/).
 +                        prepend("Washington Monthly Snapshot pg. "),
 +    type : "text/html" }
 + ]
 </code> </code>
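To make the ''title'' chains above concrete, the sketch below reproduces their ''match'' and ''prepend'' steps in plain JavaScript, outside the framework. The ''snapshotTitle'' helper and the sample URL are hypothetical and exist only for illustration; inside a translator the framework applies these filters for you.

```javascript
// Plain-JavaScript equivalent of .match(/[0-9]+$/).prepend("...") for one URL.
// snapshotTitle is a hypothetical helper, not part of the framework.
function snapshotTitle(url) {
  var m = url.match(/[0-9]+$/);   // trailing digits, taken here as the page number
  if (!m) return null;            // assumption: no match means no title
  return "Washington Monthly Snapshot pg. " + m[0];
}

console.log(snapshotTitle("http://example.com/article?page=3"));
```

Because ''match'' is called without a group argument, the whole match (group 0) is passed on, which is why the sketch uses ''m[0]''.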
- 
-This delegation method can only be used with a ''MultiScraper'' -- each response will be sent to the specified translator to be processed, instead of being matched against the other scrapers defined in the current translator. 
  
 === Post-Processing with Hooks === === Post-Processing with Hooks ===
Line 113: Line 124:
 In the example above, the scraper had been written to save two potential date fields, one as "runningTime" and one as "date". The post-processing function sets the item's "date" property to the valid one of these two choices. It also checks if the author last names are in all-caps and fixes them if they are. Both of these tasks are a little hard to do within the framework. In the example above, the scraper had been written to save two potential date fields, one as "runningTime" and one as "date". The post-processing function sets the item's "date" property to the valid one of these two choices. It also checks if the author last names are in all-caps and fixes them if they are. Both of these tasks are a little hard to do within the framework.
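A post-processing function of the kind described above might look roughly like the following. This is an independent sketch, not the translator's actual hook: the test for a "valid" date (a four-digit year), the removal of the temporary ''runningTime'' value, and the naive recapitalization are all assumptions made for illustration.

```javascript
// Hypothetical "scraperDone"-style function; doc and url are unused here.
function scraperDone(item, doc, url) {
  // Assumption: a usable date string contains a four-digit year.
  function looksLikeDate(s) {
    return typeof s === "string" && /\d{4}/.test(s);
  }
  if (!looksLikeDate(item.date) && looksLikeDate(item.runningTime)) {
    item.date = item.runningTime;
  }
  // Assumption: runningTime only served as temporary storage for a date.
  delete item.runningTime;

  // Fix last names that are entirely upper case (naive: "SMITH" becomes "Smith").
  (item.creators || []).forEach(function (c) {
    if (c.lastName && c.lastName === c.lastName.toUpperCase()) {
      c.lastName = c.lastName.charAt(0) + c.lastName.slice(1).toLowerCase();
    }
  });
  return item;
}

var sample = {
  date: "",
  runningTime: "May 27, 2011",
  creators: [{ firstName: "Adam", lastName: "SMITH" }, { firstName: "Avram", lastName: "Lyon" }]
};
console.log(scraperDone(sample, null, null));
```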
 ===== Functions ===== ===== Functions =====
-To use the framework, just chain together functions from the list below until you get the desired output.+To use the framework, just chain together functions from the list below until you get the desired output. Note that JavaScript functions not in this list will not work within the scrapers.
 === Main functions === === Main functions ===
   * ''FW.PageText ( )'' Provides the HTML source of the current document as a string.   * ''FW.PageText ( )'' Provides the HTML source of the current document as a string.
Line 124: Line 135:
   * ''prepend ( text )'' Add a string to the beginning of the result.   * ''prepend ( text )'' Add a string to the beginning of the result.
   * ''append ( text )'' Add a string to the end of the result.   * ''append ( text )'' Add a string to the end of the result.
-  * ''remove (regex, flags )'' note that empty entries are dropped silently-- can be used to filter+  * ''remove (regex, flags )'' Removes any text that matches the [[dev:technologies#regular_expressions|regular expression]]. Note that empty entries are dropped silently, so this can be used to filter out unwanted results.
   * ''trim ()'' Removes whitespace at the beginning and end of the result.    * ''trim ()'' Removes whitespace at the beginning and end of the result. 
   * ''trimInternal ()'' Like ''trim ()'', but also removes extra whitespace inside the result.   * ''trimInternal ()'' Like ''trim ()'', but also removes extra whitespace inside the result.
-  * ''match ( regex, [ group ] )'' Match the regex, and pass on the match group. If no group is specified, the whole match is used (match group 0).+  * ''match ( regex, [ group ] )'' Match the [[dev:technologies#regular_expressions|regular expression]], and pass on the specified match group. If no group is specified, the whole match is used (match group 0).
   * ''capitalizeTitle ( )'' Capitalizes the string using Zotero's capitalization function   * ''capitalizeTitle ( )'' Capitalizes the string using Zotero's capitalization function
   * ''unescapeHTML ( text )''   * ''unescapeHTML ( text )''
   * ''unescape ( text )''   * ''unescape ( text )''
-  * ''split ( regex )'' Split the string into multiple string on the regex+  * ''split ( regex )'' Split the string into multiple strings on the [[dev:technologies#regular_expressions|regular expression]]
-  * ''join ( separator )'' Join all the strings into one, using the separator between them. +  * ''join ( separator )'' Join all the strings into one, placing the specified separator between them. 
-=== Zotero functions === +  * ''cleanAuthor ( type, [ useComma ] )'' Makes creator objects of the specified type (e.g., ''author'', ''editor'', ''translator'', ''contributor'', ''bookAuthor'', ''director''). If the second argument is true, the input is split into first and last names on a comma, if one is present. See the [[http://gimranov.com/research/zotero/creator-types|list of valid creator types for each item type]]. 
-  ''cleanAuthor ( text, useComma )'' + 
-  ''makeAttachment ( type, title )''+=== Putting things together === 
 +''FW.Xpath()'' and ''FW.Url()'' are the main functions you'll call; they return an object that, when processed by the framework, results in selecting some text from a page or in the current URL. 
 + 
 +You can also call a method on this object, e.g.: 
 + 
 +  FW.Xpath("//xpath/expression").split(/,/)
 + 
 +This modifies the object to include a filter that splits the 
 +text on /,/. You can chain these together: 
 + 
 +  FW.Xpath("//xpath/expression").text().split(/,/).trim().cleanAuthor("author")
 + 
 +This will split the text returned by the XPath into an array and call 
 +the filters (trim, then cleanAuthor) on each member of the array. This 
 +way we can add multiple creators to the item. 
 + 
 +If you want to add an arbitrary filter, this should work: 
 + 
 +  FW.Xpath("//xpath/expression").text().addFilter(function (s) { return s + "HELLO WORLD"; }) 
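The chaining behaviour described above can be modelled in a few lines of plain JavaScript. This is a simplified stand-in, not the framework's actual implementation: the ''Chain'' constructor and its ''run'' method are hypothetical names, and only ''split'', ''trim'' and ''addFilter'' are imitated. It shows how each call records a filter and how, after a split, later filters are applied to every member of the resulting array.

```javascript
// Hypothetical model of a filter chain; not the framework's own code.
function Chain() { this.filters = []; }

// Record a filter and return the chain so that calls can be strung together.
Chain.prototype.addFilter = function (fn) {
  this.filters.push(fn);
  return this;
};

Chain.prototype.split = function (re) {
  return this.addFilter(function (s) { return s.split(re); });
};

Chain.prototype.trim = function () {
  return this.addFilter(function (s) { return s.trim(); });
};

// Apply the filters in order, flattening arrays so that each later
// filter sees one string at a time.
Chain.prototype.run = function (text) {
  return this.filters.reduce(function (values, fn) {
    var out = [];
    values.forEach(function (v) { out = out.concat(fn(v)); });
    return out;
  }, [text]);
};

var authors = new Chain()
  .split(/,/)
  .trim()
  .addFilter(function (s) { return s + "!"; })
  .run("Jane Doe, John Roe");
console.log(authors);
```

Nothing is evaluated while the chain is being built; the filters only run once text is supplied, which mirrors the way a scraper definition is processed later by the framework.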
 + 
 +===== Templates ===== 
 +Just paste the following templates into your translator, fill in the appropriate fields, and delete the unnecessary fields. 
 +==== Boilerplate ==== 
 +Required for every framework-derived translator. 
 +<code javascript> 
 +function detectWeb(doc, url) { return FW.detectWeb(doc, url); } 
 +function doWeb(doc, url) { return FW.doWeb(doc, url); } 
 +</code> 
 +==== FW.MultiScraper ==== 
 +<code javascript> 
 +FW.MultiScraper({ 
 +itemType         : 'multiple',
 +detect           : , 
 +choices          : { 
 +  titles :  ,
 +  urls   :  ,
 +  attachments :  // optional 
 +}
 +}); 
 +</code> 
 +==== FW.Scraper ==== 
 +For possible values of ''itemType'' and the legal fields for each type, see the schema description at http://aurimasv.github.io/z2csl/typeMap.xml . 
 +<code javascript> 
 +FW.Scraper({ 
 +itemType         : , 
 +detect           : , 
 +attachments      : , 
 +creators         : , 
 +hooks            : { 
 +        "scraperDone" : function (item, doc, url) {} 
 +},
 +// All possible fields 
 +abstractNote : , 
 +applicationNumber : , 
 +archive : , 
 +archiveLocation : , 
 +artworkMedium : , 
 +artworkSize : , 
 +assignee : , 
 +audioFileType : , 
 +audioRecordingType : , 
 +billNumber : , 
 +blogTitle : , 
 +bookTitle : , 
 +callNumber : , 
 +caseName : , 
 +code : , 
 +codeNumber : , 
 +codePages : , 
 +codeVolume : , 
 +committee : , 
 +company : , 
 +conferenceName : , 
 +country : , 
 +court : , 
 +date : , 
 +dateDecided : , 
 +dateEnacted : , 
 +dictionaryTitle : , 
 +distributor : , 
 +docketNumber : , 
 +documentNumber : , 
 +DOI : , 
 +edition : , 
 +encyclopediaTitle : , 
 +episodeNumber : , 
 +extra : , 
 +filingDate : , 
 +firstPage : , 
 +forumTitle : , 
 +genre : , 
 +history : , 
 +institution : , 
 +interviewMedium : , 
 +ISBN : , 
 +ISSN : , 
 +issue : , 
 +issueDate : , 
 +issuingAuthority : , 
 +journalAbbreviation : , 
 +label : , 
 +language : , 
 +legalStatus : , 
 +legislativeBody : , 
 +letterType : , 
 +libraryCatalog : , 
 +manuscriptType : , 
 +mapType : , 
 +medium : , 
 +meetingName : , 
 +nameOfAct : , 
 +network : , 
 +number : , 
 +numberOfVolumes : , 
 +numPages : , 
 +pages : , 
 +patentNumber : , 
 +place : , 
 +postType : , 
 +presentationType : , 
 +priorityNumbers : , 
 +proceedingsTitle : , 
 +programTitle : , 
 +programmingLanguage : , 
 +publicLawNumber : , 
 +publicationTitle : , 
 +publisher : , 
 +references : , 
 +reportNumber : , 
 +reportType : , 
 +reporter : , 
 +reporterVolume : , 
 +rights : , 
 +runningTime : , 
 +scale : , 
 +section : , 
 +series : , 
 +seriesNumber : , 
 +seriesText : , 
 +seriesTitle : , 
 +session : , 
 +shortTitle : , 
 +studio : , 
 +subject : , 
 +system : , 
 +thesisType : , 
 +title : , 
 +type : , 
 +university : , 
 +url : , 
 +version : , 
 +videoRecordingType : , 
 +volume : , 
 +websiteTitle : , 
 +websiteType : 
 +})
 +</code>
dev/translators/framework.1306529283.txt.gz · Last modified: 2011/05/27 16:48 by ajlyon