Differences

This shows you the differences between two versions of the page.

dev:technologies [2011/07/23 10:43] – [Regular Expressions] ajlyon
dev:technologies [2017/12/12 04:56] (current) – removed sean
Line 1: Line 1:
-A very brief introduction to some technologies commonly used in translator development.
  
-====== XPath ====== 
-XPath provides a way to refer to specific parts of HTML or XML documents. It's usually the best way to extract data from webpages when writing a translator. 
- 
-An XPath expression is a chain of pieces that specify the path to a node of the document. The main pieces of expressions are: 
- 
-  * ''/'' Separator between parts of the path 
-  * ''*'' match any tag 
-  * ''%%//%%'' any number of levels deeper (matches descendants at any depth)
-  * ''..'' go up one level 
-  * ''[]'' keep only the nodes that satisfy the condition inside the brackets
-  * ''@key'' an attribute named ''key'' 
-  * ''text()'' match a text node 
-  * ''[2]'' match the second matching node  
-  * ''[last()]'' match the last matching node 
-  * ''div[@class="important"]'' match a ''<div>'' with the attribute ''class'', with the value ''important''. 
-  * ''td[contains(text(),"Expect")]'' match a ''<td>'' whose text contains "Expect"
-  * Plus much more. See the [[http://www.w3.org/TR/xpath/|XPath specification]] and the [[https://developer.mozilla.org/en/xpath|XPath documentation]] of the Mozilla Developer Network. 
- 
-The best introduction to XPath for use in translators is Mozilla's [[https://developer.mozilla.org/en/Introduction_to_using_XPath_in_JavaScript|Introduction to using XPath in JavaScript]]. It may be even easier, though, to model your code on existing translators, which use a wide array of XPath techniques to pick apart fussy sites.
- 
-=== Examples === 
-<code html> 
-<html> 
- <body> 
- <div id="names"> 
-   <span class="editor">George Spelvin</span>, 
-   <span class="translator">Andrea Johnson</span> 
- </div> 
- <table> 
-  <tr class="odd"> 
-   <td>Great Expectations</td> 
-   <td>Mediocre Plans</td> 
-  </tr> 
- </table> 
- </body> 
-</html> 
-</code> 
-For the sample document above, these expressions would refer to... 
-  * ''%%//tr[@class="odd"]/td%%'': a result set with the nodes ''<td>Great Expectations</td>'' and ''<td>Mediocre Plans</td>'' 
-  * ''%%//table//td%%'': Same as previous 
-  * ''%%//table//td[last()]%%'': a result set with the single node ''<td>Mediocre Plans</td>'' 
-  * ''%%//span[@class="editor"]%%'': a result set with the single node ''<span class="editor">George Spelvin</span>'' 
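To try expressions like these from JavaScript, the standard DOM method ''document.evaluate()'' can be used. The following is only a minimal sketch: ''xpathTexts'' is a hypothetical helper name, not part of Zotero, and the code assumes it is given a DOM document such as the page being translated.

```javascript
// Hypothetical helper (not part of Zotero): return the trimmed text of every
// node matched by an XPath expression, in document order.
function xpathTexts(doc, expression) {
  // 7 is the numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
  var ORDERED_NODE_SNAPSHOT_TYPE = 7;
  var result = doc.evaluate(expression, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  var texts = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    texts.push(result.snapshotItem(i).textContent.trim());
  }
  return texts;
}
```

Called on a page containing the sample document, ''xpathTexts(document, '%%//tr[@class="odd"]/td%%')'' should yield the text of the two table cells.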
- 
-=== Using Firebug to design XPath expressions === 
- 
-To make it easier to design XPath expressions for Zotero translators, you can use [[http://www.getfirebug.com|the Firefox add-on Firebug]], a general-purpose tool for web developers.
- 
-First, install Firebug and restart Firefox. You'll see a bug icon at the bottom right-hand side of your browser window:
- 
-{{firebug2.png|}} 
- 
-{{firebug1.png|}} 
- 
-Click the icon to open the Firebug pane, which will show lots of details about the current page: 
- 
-{{firebug3.png}} 
- 
-Next, click the "Inspect" button next to the cockroach at the top left-hand side of the Firebug pane, and mouse over the bit of the page you're interested in (the view PDF link in this case):
- 
-{{firebug4.png}} 
- 
-Firebug will highlight the part of the web page's code that corresponds to the location you clicked on. Right-click on that piece of code in the Firebug pane and select "Copy XPath". 
- 
-Then paste the XPath expression into the editor (i.e., the code pane of [[dev:translators:scaffold|Scaffold]]):
- 
-<code> 
-/html/body/table/tbody/tr[5]/td[3]/div/table/tbody/tr/td/div/table[3]/tbody/tr[4]/td/a[2] 
-</code> 
- 
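-An absolute XPath like this, however, is very fragile: any change to the page layout can break it. It's better to latch onto stable identifiers, such as ''id'' or ''class'' attributes, and make the expression relative to them. As a first step, the absolute prefix can be replaced with ''%%//%%'':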
-<code> 
-//div/table/tbody/tr/td/div/table[3]/tbody/tr[4]/td/a[2] 
-</code> 
- 
-====== Regular Expressions ====== 
-Regular expressions are a way of matching and extracting pieces of text from a larger body of text. In translator development, they are very useful for pulling important pieces of text out of parts of a webpage that aren't meaningfully marked up in HTML. They are also often the most practical way to work with imported data that isn't in HTML or XML. Finally, every web translator uses a regular expression as [[:dev:translators#metadata|its target expression]].
- 
-  * ''.'' matches any single character except a line break
-  * ''[a-z01]'' matches any single lowercase English letter or the digit 0 or 1
-  * ''()'' groups a subexpression and captures the text it matches
-  * ''+'' matches one or more of the preceding expression
-  * ''*'' matches zero or more of the preceding expression
-  * ''?'' matches zero or one of the preceding expression
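
The pieces listed above can be combined as in the following sketch. The input string and variable names are made up for illustration; they are not taken from any real translator.

```javascript
// Hypothetical input: a line of citation text with no useful markup.
var line = "Published 2011 in vol. 42, pp. 101-110";

// [0-9]+ matches one or more digits; each pair of parentheses captures them.
var pages = line.match(/pp\. ([0-9]+)-([0-9]+)/);
var firstPage = pages ? pages[1] : null; // "101"
var lastPage = pages ? pages[2] : null;  // "110"

// \.? makes the period optional and " *" allows zero or more spaces,
// so "vol. 42", "vol 42", and "vol.42" would all match.
var volume = line.match(/vol\.? *([0-9]+)/);
var volumeNumber = volume ? volume[1] : null; // "42"

// An unescaped . matches any single character, so this pattern accepts
// any one-character separator between the two numbers.
var anySeparator = /[0-9]+.[0-9]+/.test("101/110"); // true
```

Note that ''match()'' returns ''null'' when nothing matches, which is why each result is checked before a captured group is read.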
- 
-For a complete description of how regular expressions are used in JavaScript, with examples of their use, see Mozilla's [[https://developer.mozilla.org/en/JavaScript/Guide/Regular_Expressions|Regular Expressions Guide]]. 
- 
-There are many regular expression guides on the web, and entire books on the topic. While you can and should consult these guides, keep in mind that the details depend somewhat on the environment the expressions run in. JavaScript does not support everything that you might read about regular expressions in, say, a book for Perl or Java. The Mozilla guide above describes what can be done in JavaScript.
dev/technologies.1311432235.txt.gz · Last modified: 2011/07/23 10:43 by ajlyon