Differences
This shows you the differences between two versions of the page.
dev:scaffold [2009/06/28 13:27] – rmzelle | dev:scaffold [2018/05/07 13:09] (current) – bwiernik | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Scaffold - an IDE for Zotero translators ====== | + | See [[dev/ |
- | + | ||
- | Scaffold is a Firefox extension developed to simplify writing Zotero translators. In Zotero 1.0.x, translators are stored in a single SQL database. Scaffold makes it easy to extract translators from this database, to edit and test translator code, and to save changes back into the database. | + | |
- | + | ||
- | ===== Installation ===== | + | |
- | + | ||
- | [[/download/ | + | |
- | + | ||
- | **Please note:** As of yet, Scaffold is only compatible with Zotero 1.0.x. Users of Zotero 2.0 can edit translators | + | |
- | + | ||
- | ===== Interface ===== | + | |
- | + | ||
- | After installation, | + | |
- | + | ||
- | {{: | + | |
- | + | ||
- | ==== Top buttons ==== | + | |
- | + | ||
- | {{: | + | |
- | Opens the "Load Translator" | + | |
- | + | ||
- | {{: | + | |
- | Saves the translator you are currently working on to the database. Be sure to provide a unique label and Translator ID for your translator if you don't want to overwrite the existing translator. Translator IDs can be automatically generated via the " | + | |
- | + | ||
- | {{: | + | |
- | This copies the entire translator to the clipboard. The translator is formatted as an SQL statement so it can be inserted into scrapers.sql (the Zotero 1.0.x database containing all translators). | + | |
- | + | ||
- | {{: | + | |
- | Saves and runs translator code on the webpage loaded in the most recently selected tab. The exact behavior depends on the selected Scaffold tab. If the " | + | |
- | + | ||
- | ==== Tabs ==== | + | |
- | + | ||
- | **Metadata** \\ Here you provide metadata of the translator. Translator IDs can be automatically generated via the " | + | |
- | + | ||
- | **Detect Code** \\ The text-field of this tab should contain the detectWeb function for web translators, | + | |
- | + | ||
- | **Code** \\ The text-field of this tab should contain the doWeb function for web translators, | + | |
- | + | ||
- | ==== Debug Output ==== | + | |
- | + | ||
- | One of the strengths of Scaffold is its ability to provide you with immediate feedback, which can dramatically speed up translator development. After a code change, a single click suffices to run the modified translator and generate debug output. For each of the three tabs of Scaffold, a different type of debug output is generated: | + | |
- | + | ||
- | === Metadata === | + | |
- | When the "Test Regex" button in the Metadata tab is clicked, the regular expression in the target field is applied to the webpage loaded in the most recently selected Firefox tab. The debug window at the bottom of the Scaffold window will show whether the regular expression matches ('' | + | |
- | + | ||
- | < | + | |
- | 09:54:11 ===> | + | |
- | </code> | + | |
- | + | ||
- | === Detect Code & Code === | + | |
- | + | ||
- | When the execute button is clicked while the " | + | |
- | + | ||
- | Debug output for the " | + | |
- | + | ||
- | < | + | |
- | 19:19:43 detectCode returned type " | + | |
- | </ | + | |
- | + | ||
- | Debug output for the " | + | |
- | + | ||
- | < | + | |
- | 19:24:21 Returned item: | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | ' | + | |
- | + | ||
- | 19:24:21 Translation successful | + | |
- | </ | + | |
- | + | ||
- | If running detectWeb or doWeb results in an error, the debug window will show an error message. | + | |
- | + | ||
- | Note that additional debug output can be generated by the JavaScript command '' | + |
- | + | ||
- | === Test Frame === | + | |
- | + | ||
- | FIXME What is the purpose of the Test Frame drop-down menu? | + | |
- | + | ||
- | ===== Getting your translator distributed ===== | + | |
- | + | ||
- | If your new or modified translator has general appeal, consider posting the translator to the [[http:// | + | |
- | + | ||
- | refer to http:// | + | |
- | + | ||
- | ====== Placeholder ====== | + | |
- | + | ||
- | **All the text below should be relocated to a different page** | + | |
- | + | ||
- | ==== 2. Start writing your translator ==== | + | |
- | + | ||
- | Provide some metadata for your translator. | + | |
- | + | ||
- | {{2.png|}} | + | |
- | + | ||
- | Once you provide the name (the translator ID will be unique by default if you start a new extension), the next thing to do is to provide the regular expression to match the translator URL. In the case of the current example we want this translator to initially match any URLs starting with < | + |
- | + | ||
- | If we now navigate to the [[http:// | + | |
- | + | ||
- | + | ||
- | < | + | |
- | 09:54:11 ===> | + | |
- | </ | + | |
- | + | ||
- | If the regex fails this will appear instead: | + | |
- | + | ||
- | < | + | |
- | 09:52:02 ===> | + | |
- | </ | + | |
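The match/fail check above can also be reproduced outside Scaffold. The following is a minimal sketch, assuming a hypothetical target pattern and hypothetical example URLs (neither is taken from the Emerald example); `matchesTarget` is an illustrative helper, not part of Zotero or Scaffold.

```javascript
// Hypothetical target regex; the real one belongs in the translator's "target" field.
const target = "^https?://www\\.example\\.com/articles/";

// Returns true when the URL matches the target pattern, false otherwise.
function matchesTarget(targetPattern, url) {
  return new RegExp(targetPattern).test(url);
}

console.log(matchesTarget(target, "http://www.example.com/articles/123")); // true
console.log(matchesTarget(target, "http://www.example.org/about"));        // false
```

Anchoring the pattern with `^` keeps it from matching the same string when it appears later in an unrelated URL.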
- | + | ||
- | For more information on regular expressions | + | |
- | + | ||
- | ==== 3. STOP before you code! Think about your site's construction ==== | + | |
- | + | ||
- | The first thing you should do is to look at how your site is constructed. | + | |
- | + | ||
- | The second thing we want to do is to find out if there' | + | |
- | + | ||
- | It turns out that the only way to get an RIS file from Emerald Insight is to save the article to your " | + | |
- | + | ||
- | + | ||
- | ==== 4. The detectWeb function ==== | + | |
- | + | ||
- | The next thing to do is to add a detectWeb function to the " | + | |
- | + | ||
- | < | + | |
- | function detectWeb(doc, | + | |
- | if(doc.title == " | + | |
- | return " | + | |
- | } | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | In this instance Zotero looks to see if the title of the Web Page is " | + | |
- | + | ||
- | **IMPORTANT**: | + | |
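As a rough illustration of the title check described above, here is a self-contained sketch. The page title, the returned item type (`"journalArticle"`), and the plain objects used in place of a real DOM document are all assumptions made for the example.

```javascript
// Hypothetical page title; substitute the title of the pages you want to match.
const ARTICLE_TITLE = "Example Publisher: Article Request";

// Returns an item type when the page looks like a single article,
// and undefined otherwise.
function detectWeb(doc, url) {
  if (doc.title === ARTICLE_TITLE) {
    return "journalArticle";
  }
}

// Plain objects standing in for the document that Zotero would pass in.
console.log(detectWeb({ title: ARTICLE_TITLE }, "http://www.example.com/a")); // journalArticle
console.log(detectWeb({ title: "Home" }, "http://www.example.com/"));         // undefined
```

Matching on the exact title is brittle; it is used here only because it mirrors the approach taken in this walkthrough.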
- | + | ||
- | + | ||
- | + | ||
- | ==== A quick digression - debugging ==== | + | |
- | + | ||
- | Use the " | + | |
- | + | ||
- | On Windows, the command line to run Firefox is < | + |
- | + | ||
- | + | ||
- | + | ||
- | + | ||
- | ==== The doWeb function ==== | + | |
- | + | ||
- | In the " | + | |
- | + | ||
- | < | + | |
- | function doWeb(doc, url) { | + | |
- | Zotero.debug(doc.title); | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | Now click the " | + | |
- | + | ||
- | ==== Building the screen scraper logic ==== | + | |
- | + | ||
- | When building a screen scraper, the relevant data is held in the following locations in the HTML: | + |
- | + |
- | pdf fulltext: | + | |
- | html fulltext: | + | |
- | + | ||
- | The title, author, year, volume, number, pages, ISSN, DOI and abstract fields are all contained in the HTML given below: | + |
- | + | ||
- | < | + | |
- | <TR CLASS=" | + | |
- | <TD COLSPAN=" | + | |
- | < | + | |
- | < | + | |
- | Title here. Probably wants html formatting tags stripped. | + | |
- | + | ||
- | < | + | |
- | + | ||
- | + | ||
- | < | + | |
- | + | ||
- | Firstname1 Surname1, Firstname2 Surname2 | + | |
- | + | ||
- | < | + | |
- | + | ||
- | + | ||
- | < | + | |
- | Journal title | + | |
- | + | ||
- | < | + | |
- | + | ||
- | + | ||
- | < | + | |
- | \d{4}-\d{4} | + | |
- | + | ||
- | < | + | |
- | + | ||
- | + | ||
- | < | + | |
- | \d{4} | + | |
- | <b> Volume:</ | + | |
- | \d+ | + | |
- | <b> Issue:</ | + | |
- | \d+ | + | |
- | + | ||
- | <b> Page:</ | + | |
- | + | ||
- | \d+ | + | |
- | - | + | |
- | \d+ | + | |
- | + | ||
- | < | + | |
- | + | ||
- | + | ||
- | < | + | |
- | doi-id-here | + | |
- | + | ||
- | < | + | |
- | + | ||
- | + | ||
- | < | + | |
- | Emerald Group Publishing Limited | + | |
- | + | ||
- | < | + | |
- | + | ||
- | + | ||
- | < | + | |
- | Abstract with some html formatting, <BR/> tags and so on here. | + | |
- | + | ||
- | < | + | |
- | + | ||
- | </ | + | |
- | + | ||
- | + | ||
- | The article URL is contained in the following HTML snippet: | + |
- | + | ||
- | < | + | |
- | < | + | |
- | <A HREF=" | + | |
- | < | + | |
- | + | ||
- | </ | + | |
- | + | ||
- | + | ||
- | The easiest way to write page scrapers is with XPath expressions, which describe the location of information within an XML document. | + |
- | + | ||
- | In this case we find the following: | + | |
- | + | ||
- | The HTML and PDF download links are stored here: | + | |
- | + | ||
- | The XPath for the PDF link is: | + |
- | + | ||
- | < | + | |
- | + | ||
- | / | + | |
- | + | ||
- | </ | + | |
- | + | ||
- | This can be shortened to give a relative path as below. | + | |
- | + | ||
- | < | + | |
- | + | ||
- | // | + | |
- | + | ||
- | </ | + | |
- | + | ||
- | The XPath expression for the HTML download link is as follows: | + |
- | + | ||
- | < | + | |
- | + | ||
- | // | + | |
- | + | ||
- | </ | + | |
- | + | ||
- | To obtain the link to the PDF document we select the @href attribute. The XPath expression for this is: | + |
- | + | ||
- | < | + | |
- | // | + | |
- | </ | + | |
- | + | ||
- | Note that this is a relative link; view the page source to check, because we need an absolute link rather than a relative one. | + |
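One way to turn a relative href into an absolute link is to resolve it against the URL of the page it came from. The sketch below uses the standard `URL` constructor found in current browsers and Node.js; whether it is available inside the translator environment of this Zotero version is an assumption, and the paths shown are hypothetical.

```javascript
// Resolves an href (relative or absolute) against the URL of the page it was found on.
function toAbsolute(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

console.log(toAbsolute("/files/paper.pdf", "http://www.example.com/view/article?id=1"));
// http://www.example.com/files/paper.pdf
```

An href that is already absolute passes through unchanged, so the helper can be applied to every scraped link without checking first.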
- | + | ||
- | + | ||
- | The relative XPath expression for the bibliographic data is here: | + | |
- | + | ||
- | < | + | |
- | // | + | |
- | </ | + | |
- | + | ||
- | And the rule seems to be that attributes (title, journal, volume etc) are stored after a block of < | + | |
- | + | ||
- | < | + | |
- | // | + | |
- | </ | + | |
- | + | ||
- | and the author from here: | + |
- | + | ||
- | < | + | |
- | // | + | |
- | </ | + | |
- | + | ||
- | **At this stage I'm unsure how to get the HTML following the bold text up to, but not including, the next chunk of bold text. Maybe I'll have to do some regex processing.** | + |
- | + | ||
- | + | ||
- | + | ||
- | ==== The doWeb() and scrape() functions ==== | + |
- | + | ||
- | Here's the doWeb function which is executed when we decide to run the translator: | + | |
- | + | ||
- | < | + | |
- | function doWeb(doc, url) { | + | |
- | scrape(doc, | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | All that happens is that it runs the function scrape to get the relevant bibliographic data from the page: | + | |
- | + | ||
- | < | + | |
- | function scrape(doc, | + | |
- | var fullTextUrl = url + "& | + | |
- | var pdfUrl = url + "& | + | |
- | var xpath = "// | + | |
- | var allRefText = Zotero.Utilities.cleanString(doc.evaluate(xpath, | + | |
- | + | ||
- | // bib data scraper code here | + | |
- | + | ||
- | // zotero entry creation code here | + | |
- | + | ||
- | // obtaining the pdf and fulltext attachments here | + | |
- | + | ||
- | } | + | |
- | </ | + | |
- | + | ||
- | The first two lines are our " | + | |
- | + | ||
- | The next line defines the xpath expression used to get the bibliographic data. I couldn' | + | |
- | + | ||
- | The code we use to obtain the title is as follows: | + |
- | + | ||
- | < | + | |
- | var titleRe = " | + | |
- | var title = getItem(allRefText, | + | |
- | </ | + | |
- | + | ||
- | To get the raw authors string (this will be post-processed by Zotero) we use this code: | + |
- | + | ||
- | < | + | |
- | var authorsRe = " | + | |
- | var authors = getItem(allRefText, | + | |
- | </ | + | |
- | + | ||
- | note the use of the '' | + | |
- | + | ||
- | < | + | |
- | function getItem(reftext, | + | |
- | var item = reftext.match(re); | + |
- | // Zotero.debug(item[1]); | + |
- | // match() returns null when the pattern is not found, so guard before indexing | + |
- | return item ? item[1] : null; | + |
- | } | + | |
- | </ | + | |
- | + | ||
- | Note that we've commented out the Zotero.debug line, but used it during development to make sure that the regular expressions were returning the correct thing. | + | |
- | + | ||
- | We repeat the call to getItem for each bibliographic item we are interested in: | + | |
- | + | ||
- | < | + | |
- | var titleRe = " | + | |
- | var authorsRe = " | + | |
- | var journalRe = " | + | |
- | var issnRe = "ISSN: (.*? | + | |
- | var yearRe = "Year: (.*?) Volume"; | + | |
- | var volRe = " | + | |
- | var issueRe = " | + | |
- | var pageRe = "Page: (.*? | + | |
- | var doiRe = "DOI: (.*? | + | |
- | var publisherRe = " | + | |
- | var abstractRe = " | + | |
- | </ | + | |
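To see how the label-to-label patterns behave, the approach can be tried on a flattened string outside Zotero. This is a sketch only: the sample text is invented, the `Volume: (.*?) Issue` pattern is an assumed analogue of the intact `Year: (.*?) Volume` pattern, and this version of `getItem` adds a null guard and trimming that may not match the original helper.

```javascript
// Returns the first capture group of the pattern, or null if it does not match.
function getItem(refText, re) {
  const match = refText.match(re);
  return match ? match[1].trim() : null;
}

// Invented sample of what the cleaned-up reference text might look like.
const sample = "ISSN: 1234-5678 Year: 2008 Volume: 12 Issue: 3";

console.log(getItem(sample, "Year: (.*?) Volume"));  // 2008
console.log(getItem(sample, "Volume: (.*?) Issue")); // 12
console.log(getItem(sample, "DOI: (.*?) Publisher")); // null
```

The lazy quantifier `(.*?)` stops at the first occurrence of the following label, which is what keeps each field from swallowing the next one.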
- | + | ||
- | We can also derive the article url from the DOI: | + | |
- | + | ||
- | < | + | |
- | var articleUrl = " | + | |
- | </ | + | |
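As an illustration of deriving a URL from a DOI, the sketch below prefixes the DOI with the public `https://doi.org/` resolver. The resolver prefix and the sample DOI are assumptions for the example and are not necessarily what the original code used.

```javascript
// Builds a resolver URL from a DOI string; returns null for an empty DOI.
function doiToUrl(doi) {
  if (!doi) {
    return null;
  }
  return "https://doi.org/" + encodeURI(doi.trim());
}

console.log(doiToUrl("10.1000/example.123")); // https://doi.org/10.1000/example.123
```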
- | + | ||
- | + | ||
- | === Getting this into the Zotero database === | + |
- | + | ||
- | Once we've obtained this data and verified with Zotero.debug that the XPath and regular expressions are working, we can start passing the data on to Zotero: | + | |
- | + | ||
- | < | + | |
- | var newArticle = new Zotero.Item(' | + | |
- | + | ||
- | newArticle.title = title; | + | |
- | newArticle.publicationTitle = journal; | + |
- | newArticle.ISSN = issn; | + | |
- | newArticle.date = year; | + |
- | newArticle.volume = vol; | + | |
- | newArticle.issue = issue; | + | |
- | newArticle.pages = page; | + |
- | newArticle.DOI = doi; | + | |
- | newArticle.publisher = publisher; | + | |
- | newArticle.abstractNote = abstract; | + | |
- | newArticle.url = articleUrl; | + | |
- | Zotero.debug(newArticle); | + | |
- | </ | + | |
- | + | ||
- | Authors are a little more complex. | + | |
- | + | ||
- | < | + | |
- | var aus = authors.split("," | + | |
- | for (var i=0; i< aus.length ; i++) { | + | |
- | newArticle.creators.push(Zotero.Utilities.cleanAuthor(aus[i], | + | |
- | } | + | |
- | </ | + | |
- | + | ||
- | (if the authors were in the format " | + | |
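The splitting step can be tried without Zotero. In the sketch below, `splitAuthors` is a hypothetical stand-in that only approximates what a helper such as `Zotero.Utilities.cleanAuthor` is used for here: it assumes names arrive as "Firstname Surname" separated by commas and treats the last word of each name as the surname.

```javascript
// Splits "Firstname1 Surname1, Firstname2 Surname2" into creator-like objects.
// Assumption: the last whitespace-separated word of each name is the surname.
function splitAuthors(raw) {
  return raw
    .split(",")
    .map((name) => name.trim())
    .filter((name) => name.length > 0)
    .map((name) => {
      const parts = name.split(/\s+/);
      const lastName = parts.pop();
      return { firstName: parts.join(" "), lastName: lastName, creatorType: "author" };
    });
}

console.log(splitAuthors("Firstname1 Surname1, Firstname2 Surname2"));
```

This simple rule mishandles multi-word surnames, which is one reason to rely on Zotero's own helper inside a real translator.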
- | + | ||
- | Finally, the article needs to be saved. This is done with: | + | |
- | + | ||
- | < | + | |
- | newArticle.complete(); | + | |
- | </ | + | |
- | + | ||
- | This stores the citation in Zotero' | + | |
- | + | ||
- | + | ||
- | ==== Development tips ==== | + | |
- | + | ||
- | Each different type of citation has a different set of fields available for it (like the ' | + | |
- | + |