dev:technologies [2011/04/21 14:50] – [XPath] rmzelle | dev:technologies [2017/12/12 04:56] (current) – removed sean |
---|
A very brief introduction to some technologies commonly used in translator development.

====== XPath ======
XPath provides a way to refer to specific parts of HTML or XML documents. It's usually the best way to extract data from webpages when writing a translator.

An XPath expression is a chain of pieces that specify the path to a node of the document. The main pieces of expressions are:

  * ''/'' separates the parts of the path
  * ''*'' matches any tag
  * ''%%//%%'' matches descendants at any depth (one or more levels deeper)
  * ''..'' goes up one level
  * ''[]'' restricts the match to nodes that satisfy the condition inside the brackets
  * ''@key'' refers to an attribute named ''key''
  * ''text()'' matches a text node
  * ''[2]'' matches the second matching node
  * ''[last()]'' matches the last matching node
  * ''div[@class="important"]'' matches a ''<div>'' whose ''class'' attribute has the value ''important''
  * ''td[contains(text(),"Expect")]'' matches a ''<td>'' whose text contains "Expect"
  * Plus much more. See the [[http://www.w3.org/TR/xpath/|XPath specification]] and the [[https://developer.mozilla.org/en/xpath|XPath documentation]] on the Mozilla Developer Network.

The best introduction to XPath for use in translators is Mozilla's [[https://developer.mozilla.org/en/Introduction_to_using_XPath_in_JavaScript|Introduction to using XPath in JavaScript]]. It may be even easier, though, to model your code on existing translators, which demonstrate a wide array of XPath techniques for picking apart fussy sites.

=== Examples ===
<code html>
<div id="names">
  <span class="editor">George Spelvin</span>,
  <span class="translator">Andrea Johnson</span>
</div>
<table>
  <tr class="odd">
    <td>Great Expectations</td>
    <td>Mediocre Plans</td>
  </tr>
</table>
</code>
For the sample document above, these expressions refer to:
  * ''%%//tr[@class="odd"]/td%%'': a result set with the nodes ''<td>Great Expectations</td>'' and ''<td>Mediocre Plans</td>''
  * ''%%//table//td%%'': the same two ''<td>'' nodes
  * ''%%//span[@class="editor"]%%'': a result set with the single node ''<span class="editor">George Spelvin</span>''
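
As a rough sketch of how expressions like these might be evaluated from JavaScript, the helper below wraps the standard DOM method ''evaluate''. The name ''xpathTexts'' is made up for this example and is not part of any translator framework, and the code assumes that ''doc'' is a DOM document such as the one a browser provides.

<code javascript>
// Hypothetical helper (not part of Zotero): returns the trimmed text of every
// node matched by an XPath expression. Assumes `doc` is a DOM Document that
// implements the standard evaluate() method, as browser documents do.
function xpathTexts(doc, expression) {
  // 7 is the numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
  const ORDERED_NODE_SNAPSHOT_TYPE = 7;
  const result = doc.evaluate(expression, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);

  // Copy the text of each matched node, in document order.
  const texts = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    texts.push(result.snapshotItem(i).textContent.trim());
  }
  return texts;
}
</code>

Called in a browser on the sample document, ''%%xpathTexts(document, "//tr[@class='odd']/td")%%'' should return the text of the two table cells.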

====== Regular Expressions ======
Regular expressions describe patterns in text. The main pieces are:

  * ''.'' matches any single character, usually excluding line breaks
  * ''[a-z01]'' matches any one lowercase English letter or the digit 0 or 1
  * ''()'' groups part of an expression and captures what it matches
  * ''+'' matches one or more of the preceding expression
  * ''*'' matches zero or more of the preceding expression
  * ''?'' matches zero or one of the preceding expression
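
The following is a minimal JavaScript sketch of how these pieces combine. The input strings and variable names are invented for illustration only.

<code javascript>
// Illustrative only: the input strings below are invented for this example.

// "." and "+": one or more of any character between "<" and ">".
const tag = "<td>".match(/<(.+)>/);
const tagName = tag[1]; // the parentheses capture "td"

// "[a-z01]" and "*": zero or more characters drawn from a-z, 0 and 1.
const run = "abc01xyz9".match(/[a-z01]*/)[0]; // the match stops before "9"

// "?": the "u" is optional, so both spellings match.
const spellings = ["color", "colour"].map((s) => /^colou?r$/.test(s));
</code>

Here ''tagName'' holds the captured group, ''run'' holds the longest leading run of allowed characters, and ''spellings'' records whether each string matched.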
| |