A very brief introduction to some technologies commonly used in translator development.

XPath

XPath provides a way to refer to specific parts of HTML or XML documents. It's usually the best way to extract data from webpages when writing a translator.

An XPath expression is a chain of pieces that specify the path to a node of the document. The main pieces of expressions are:

  • / Separator between parts of the path
  • * match any tag
  • // any number of levels deeper (matches descendants at any depth)
  • .. go up one level
  • [] match a tag that satisfies the condition inside the brackets
  • @key an attribute named key
  • text() match a text node
  • [2] match the second matching node
  • [last()] match the last matching node
  • div[@class="important"] match a <div> whose class attribute has the value important
  • td[contains(text(),"Expect")] match a <td> whose text contains "Expect"
  • Plus much more. See the XPath specification and the XPath documentation on the Mozilla Developer Network.
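
A few of these pieces can be tried outside a browser with Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath (for example, it has no contains()). This is a minimal sketch for experimenting, not how translators evaluate XPath. The document and the variable names are invented for the illustration, and the paths start with a dot because ElementTree expects relative paths on an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this illustration.
DOC = ET.fromstring(
    "<list>"
    '<item key="a">first</item>'
    '<item key="b">second</item>'
    "<note>third</note>"
    "</list>"
)

# * matches any tag: every child of the root, whatever its name.
any_child = [el.tag for el in DOC.findall("./*")]

# [2] picks the second matching node (XPath positions start at 1).
second_item = DOC.find("./item[2]").text

# [@key="b"] keeps only the nodes whose key attribute has that value.
by_attr = DOC.find('.//item[@key="b"]').text

# .. goes up one level, here from <note> back to its parent element.
parent_tag = DOC.find(".//note/..").tag

print(any_child, second_item, by_attr, parent_tag)
```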

The best introduction to XPath for use in translators is Mozilla's Introduction to using XPath in JavaScript, but it may be even easier to model your code on existing translators, which use a wide array of XPath techniques to pick apart fussy sites.

Examples

<div id="names">
  <span class="editor">George Spelvin</span>,
  <span class="translator">Andrea Johnson</span>
</div>
<table>
  <tr class="odd">
    <td>Great Expectations</td>
    <td>Mediocre Plans</td>
  </tr>
</table>

For the sample document above, these expressions would refer to…

  • //tr[@class="odd"]/td: a result set with the nodes <td>Great Expectations</td> and <td>Mediocre Plans</td>
  • //table//td: the same result set as the previous expression
  • //table//td[last()]: a result set with the single node <td>Mediocre Plans</td>
  • //span[@class="editor"]: a result set with the single node <span class="editor">George Spelvin</span>
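
The four expressions above can be checked with Python's standard xml.etree.ElementTree module, which understands a limited subset of XPath. The sketch below is only an illustration, not the way translators run XPath. It assumes the sample is wrapped in a <body> element (added here so the fragment has a single root), it writes each leading // as .// because ElementTree expects relative paths on an element, and the helper name texts is made up for this example.

```python
import xml.etree.ElementTree as ET

# The sample document, wrapped in <body> (an addition made for this sketch)
# so that the fragment parses as XML with a single root element.
SAMPLE = """
<body>
  <div id="names">
    <span class="editor">George Spelvin</span>,
    <span class="translator">Andrea Johnson</span>
  </div>
  <table>
    <tr class="odd">
      <td>Great Expectations</td>
      <td>Mediocre Plans</td>
    </tr>
  </table>
</body>
"""

root = ET.fromstring(SAMPLE.strip())


def texts(expr):
    """Return the text of every element matched by expr, in document order."""
    return [el.text for el in root.findall(expr)]


# Each leading // from the list above is written as .// for ElementTree.
for expr in (
    './/tr[@class="odd"]/td',
    ".//table//td",
    ".//table//td[last()]",
    './/span[@class="editor"]',
):
    print(expr, "->", texts(expr))
```

The first two expressions should list the text of both cells, the third only the last cell, and the fourth only the editor's name.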

Regular Expressions

  • . matches any single character (except, by default, a line break)
  • [a-z01] matches any one lowercase English letter or the digit 0 or 1
  • () groups an expression and captures the text it matches
  • + matches one or more of the preceding expression
  • * matches zero or more of the preceding expression
  • ? matches zero or one of the preceding expression
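
The pieces listed above can be tried out with a short sketch. It uses Python's standard re module purely for illustration; the pieces shown here behave the same way in JavaScript regular expressions. The input strings and variable names are invented for the example.

```python
import re

# . matches any single character, so "c.t" needs exactly one character
# between the c and the t.
dot = [bool(re.fullmatch(r"c.t", w)) for w in ("cat", "c9t", "ct")]

# [a-z01]+ matches a run of one or more characters, each of which is a
# lowercase letter, 0 or 1.
run = re.search(r"[a-z01]+", "ABC abc01 XYZ").group()

# () marks the part to capture; group(1) returns only that part.
year = re.search(r"Published ([0-9]+)", "Published 2011 by ACME").group(1)

# * allows zero or more of the preceding expression.
star = [bool(re.fullmatch(r"ab*", w)) for w in ("a", "abbb", "b")]

# ? allows zero or one of the preceding expression.
optional = [bool(re.fullmatch(r"colou?r", w)) for w in ("color", "colour", "colouur")]

print(dot, run, year, star, optional)
```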
dev/technologies.1303456395.txt.gz · Last modified: 2011/04/22 03:13 by ajlyon