Differences
This shows you the differences between two versions of the page.
dev:scaffold_tutorial [2008/11/06 04:57] – external edit 127.0.0.1 | dev:scaffold_tutorial [2009/06/28 08:19] (current) – Moved Scaffold tutorial to main Scaffold page rmzelle | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Developing a translator with Scaffold - a Tutorial ====== | ||
- | |||
- | ===== 1. Install Scaffold ===== | ||
- | |||
- | You can install [[scaffold|Scaffold]] from this link. | ||
- | |||
- | This will provide you with a " | ||
- | |||
- | {{1.png? | ||
- | |||
- | The five buttons at the top of the window are: | ||
- | |||
- | 1. Load from Database | ||
- | |||
- | Clicking this button will show you the complete list of translators available on your system. | ||
- | |||
- | 2. Save to Database | ||
- | |||
- | This allows you to save the translator you are currently working on to the database. Be sure to provide a unique name and Translator ID for your translator. | ||
- | |||
- | |||
- | 3. Copy to Clipboard | ||
- | |||
- | This copies the entire translator to the clipboard as an SQL statement which can be inserted into a Zotero database. | ||
- | |||
- | 4. Export | ||
- | |||
- | This doesn' | ||
- | |||
- | 5. Execute | ||
- | |||
- | We'll be using this *a lot* later. | ||
- | |||
- | ===== 2. Start writing your translator ===== | ||
- | |||
- | Provide some metadata for your translator. | ||
- | |||
- | {{2.png|}} | ||
- | |||
- | Once you provide the name (the translator ID will be unique by default if you start a new extension), the next thing to do is to provide the regular expression to match the translator URL. In the case of the current example we want this translator to initially match any URLs starting with < | ||
- | |||
- | If we now navigate to the [[http:// | ||
- | |||
- | |||
- | < | ||
- | 09:54:11 ===> | ||
- | </ | ||
- | |||
- | If the regex fails this will appear instead: | ||
- | |||
- | < | ||
- | 09:52:02 ===> | ||
- | </ | ||
- | |||
- | For more information on regular expressions | ||
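The target check described above can be tried out in plain JavaScript before it goes into Scaffold. This is only a minimal sketch: the pattern and both URLs are hypothetical placeholders (not the site used in this tutorial), and `matchesTarget` is an illustrative helper, not part of Zotero.

```javascript
// Hypothetical target pattern: any URL starting with this prefix.
// Dots are escaped so that they match a literal "." only.
const target = new RegExp("^https?://www\\.example\\.com/articles/");

// Illustrative helper (not a Zotero function): true if the URL matches the target.
function matchesTarget(url) {
  return target.test(url);
}

console.log(matchesTarget("http://www.example.com/articles/123")); // true
console.log(matchesTarget("http://www.example.org/other"));         // false
```

Anchoring the pattern with `^` keeps it from matching the prefix somewhere in the middle of an unrelated URL.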
- | |||
- | ===== 3. STOP before you code! Think about your site's construction ===== | ||
- | |||
- | The first thing you should do is to look at how your site is constructed. | ||
- | |||
- | The second thing we want to do is to find out if there' | ||
- | |||
- | It turns out that the only way to get an RIS file from Emerald Insight is to save the article to your " | ||
- | |||
- | |||
- | ===== 4. The detectWeb function ===== | ||
- | |||
- | The next thing to do is to add a detectWeb function to the " | ||
- | |||
- | < | ||
- | function detectWeb(doc, | ||
- | if(doc.title == " | ||
- | return " | ||
- | } | ||
- | } | ||
- | </ | ||
- | |||
- | In this instance Zotero looks to see if the title of the Web Page is " | ||
- | |||
- | **IMPORTANT**: | ||
- | |||
- | |||
- | |||
- | ===== A quick digression - debugging ===== | ||
- | |||
- | Use the " | ||
- | |||
- | On Windows the command line to run Firefox is < | ||
- | |||
- | |||
- | |||
- | |||
- | ===== The doWeb function ===== | ||
- | |||
- | In the " | ||
- | |||
- | < | ||
- | function doWeb(doc, url) { | ||
- | Zotero.debug(doc.title); | ||
- | } | ||
- | </ | ||
- | |||
- | Now click the " | ||
- | |||
- | ===== Building the screen scraper logic ===== | ||
- | |||
- | In terms of building a screen scraper, the relevant data is held in the following locations in the HTML: | ||
- | |||
- | pdf fulltext: | ||
- | html fulltext: | ||
- | |||
- | The title, author, year, volume, number, pages, ISSN, DOI and abstract fields are all contained in the HTML given below: | ||
- | |||
- | < | ||
- | <TR CLASS=" | ||
- | <TD COLSPAN=" | ||
- | <BR> | ||
- | < | ||
- | Title here. Probably wants html formatting tags stripped. | ||
- | |||
- | <BR> | ||
- | |||
- | |||
- | < | ||
- | |||
- | Firstname1 Surname1, Firstname2 Surname2 | ||
- | |||
- | <BR> | ||
- | |||
- | |||
- | < | ||
- | Journal title | ||
- | |||
- | <BR> | ||
- | |||
- | |||
- | < | ||
- | \d{4}-\d{4} | ||
- | |||
- | <BR> | ||
- | |||
- | |||
- | < | ||
- | \d{4} | ||
- | <b> Volume:</ | ||
- | \d+ | ||
- | <b> Issue:</ | ||
- | \d+ | ||
- | |||
- | <b> Page:</ | ||
- | |||
- | \d+ | ||
- | - | ||
- | \d+ | ||
- | |||
- | <BR> | ||
- | |||
- | |||
- | < | ||
- | doi-id-here | ||
- | |||
- | <BR> | ||
- | |||
- | |||
- | < | ||
- | Emerald Group Publishing Limited | ||
- | |||
- | <BR> | ||
- | |||
- | |||
- | < | ||
- | Abstract with some html formatting, <BR/> tags and so on here. | ||
- | |||
- | <BR> | ||
- | |||
- | </ | ||
- | |||
- | |||
- | The article URL is contained in the following HTML snippet: | ||
- | |||
- | < | ||
- | < | ||
- | <A HREF=" | ||
- | <BR> | ||
- | |||
- | </ | ||
- | |||
- | |||
- | The easiest way to build a page scraper is with XPath expressions, which describe the location of information within an XML document. | ||
- | |||
- | In this case we find the following: | ||
- | |||
- | The HTML and PDF download links are stored here: | ||
- | |||
- | The XPath for the PDF link is here: | ||
- | |||
- | < | ||
- | |||
- | / | ||
- | |||
- | </ | ||
- | |||
- | This can be shortened to give a relative path as below. | ||
- | |||
- | < | ||
- | |||
- | // | ||
- | |||
- | </ | ||
- | |||
- | The XPath expression for the HTML download link is as follows: | ||
- | |||
- | < | ||
- | |||
- | // | ||
- | |||
- | </ | ||
- | |||
- | To obtain the link to the PDF document we use the @href attribute. The XPath expression for this is: | ||
- | |||
- | < | ||
- | // | ||
- | </ | ||
- | |||
- | Note that viewing the source shows this is a relative link, so we need to make sure that we end up with an absolute link rather than a relative one. | ||
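One way to turn a relative link into an absolute one is the standard `URL` constructor, sketched below. This assumes a modern JavaScript runtime where `URL` is available, which may not hold inside an older translator sandbox; the page URL and link values are hypothetical, and `toAbsoluteUrl` is an illustrative helper rather than a Zotero function.

```javascript
// Resolve a possibly relative href against the URL of the page it came from.
// Illustrative helper, not part of Zotero.
function toAbsoluteUrl(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

// Hypothetical values, for illustration only.
const pageUrl = "http://www.example.com/articles/view?id=1";

console.log(toAbsoluteUrl("/downloads/paper.pdf", pageUrl));
// http://www.example.com/downloads/paper.pdf

console.log(toAbsoluteUrl("http://cdn.example.com/paper.pdf", pageUrl));
// http://cdn.example.com/paper.pdf (already absolute, so unchanged)
```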
- | |||
- | |||
- | The relative XPath expression for the bibliographic data is here: | ||
- | |||
- | < | ||
- | // | ||
- | </ | ||
- | |||
- | And the rule seems to be that attributes (title, journal, volume etc) are stored after a block of < | ||
- | |||
- | < | ||
- | // | ||
- | </ | ||
- | |||
- | and Author from here | ||
- | |||
- | < | ||
- | // | ||
- | </ | ||
- | |||
- | **At this stage I'm unsure how to get the HTML following the bold text up to but not including the next chunk of bold text. Maybe I'll have to do some regex processing** | ||
- | |||
- | |||
- | |||
- | ===== The doWeb() and scrape() functions ===== | ||
- | |||
- | Here's the doWeb function which is executed when we decide to run the translator: | ||
- | |||
- | < | ||
- | function doWeb(doc, url) { | ||
- | scrape(doc, | ||
- | } | ||
- | </ | ||
- | |||
- | All that happens is that it runs the function scrape to get the relevant bibliographic data from the page: | ||
- | |||
- | < | ||
- | function scrape(doc, | ||
- | var fullTextUrl = url + "& | ||
- | var pdfUrl = url + "& | ||
- | var xpath = "// | ||
- | var allRefText = Zotero.Utilities.cleanString(doc.evaluate(xpath, | ||
- | |||
- | // bib data scraper code here | ||
- | |||
- | // zotero entry creation code here | ||
- | |||
- | // obtaining the pdf and fulltext attachments here | ||
- | |||
- | } | ||
- | </ | ||
- | |||
- | The first two lines are our " | ||
- | |||
- | The next line defines the xpath expression used to get the bibliographic data. I couldn' | ||
- | |||
- | The code we use to obtain the title is as follows: | ||
- | |||
- | < | ||
- | var titleRe = " | ||
- | var title = getItem(allRefText, | ||
- | </ | ||
- | |||
- | To get the raw authors string (this will be post-processed by Zotero) we use this code: | ||
- | |||
- | < | ||
- | var authorsRe = " | ||
- | var authors = getItem(allRefText, | ||
- | </ | ||
- | |||
- | Note the use of the '' | ||
- | |||
- | < | ||
- | function getItem(reftext, | ||
- | var item = reftext.match(re); | ||
- | // Zotero.debug(item[1]); | ||
- | return item[1]; | ||
- | } | ||
- | </ | ||
- | |||
- | Note that we've commented out the Zotero.debug line, but used it during development to make sure that the regular expressions were returning the correct thing. | ||
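A helper like getItem throws if the regular expression does not match, because `match` then returns `null` and `item[1]` fails. The following is a minimal defensive sketch, not the tutorial's own code: `getItemSafe`, the sample text and the two patterns are all hypothetical and only mimic the "Label: value Label: value" layout of the cleaned reference text.

```javascript
// Illustrative variant of getItem: returns the first captured group,
// or null when the pattern does not match, instead of throwing.
function getItemSafe(reftext, re) {
  const match = reftext.match(re);
  return match && match.length > 1 ? match[1] : null;
}

// Hypothetical flattened reference text, for illustration only.
const sampleText =
  "Title: A Sample Title Author(s): Jane Doe, John Smith Journal: Sample Journal";

console.log(getItemSafe(sampleText, "Title: (.*?) Author"));
// A Sample Title
console.log(getItemSafe(sampleText, "Author\\(s\\): (.*?) Journal"));
// Jane Doe, John Smith
console.log(getItemSafe(sampleText, "DOI: (.*?) Publisher"));
// null
```

Returning `null` lets the calling code skip a missing field rather than abort the whole scrape.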
- | |||
- | We repeat the call to getItem for each bibliographic item we are interested in: | ||
- | |||
- | < | ||
- | var titleRe = " | ||
- | var authorsRe = " | ||
- | var journalRe = " | ||
- | var issnRe = "ISSN: (.*? | ||
- | var yearRe = "Year: (.*?) Volume"; | ||
- | var volRe = " | ||
- | var issueRe = " | ||
- | var pageRe = "Page: (.*? | ||
- | var doiRe = "DOI: (.*? | ||
- | var publisherRe = " | ||
- | var abstractRe = " | ||
- | </ | ||
- | |||
- | We can also derive the article url from the DOI: | ||
- | |||
- | < | ||
- | var articleUrl = " | ||
- | </ | ||
- | |||
- | |||
- | ==== Getting this into the Zotero database ==== | ||
- | |||
- | Once we've obtained this data and verified with Zotero.debug that the XPath and regular expressions are working, we can start passing the data on to Zotero: | ||
- | |||
- | < | ||
- | var newArticle = new Zotero.Item(' | ||
- | |||
- | newArticle.title = title; | ||
- | newArticle.publicationTitle = journal; | ||
- | newArticle.ISSN = issn; | ||
- | newArticle.date = year; | ||
- | newArticle.volume = vol; | ||
- | newArticle.issue = issue; | ||
- | newArticle.pages = page; | ||
- | newArticle.DOI = doi; | ||
- | newArticle.publisher = publisher; | ||
- | newArticle.abstractNote = abstract; | ||
- | newArticle.url = articleUrl; | ||
- | Zotero.debug(newArticle); | ||
- | </ | ||
- | |||
- | Authors are a little more complex. | ||
- | |||
- | < | ||
- | var aus = authors.split("," | ||
- | for (var i=0; i< aus.length ; i++) { | ||
- | newArticle.creators.push(Zotero.Utilities.cleanAuthor(aus[i], | ||
- | } | ||
- | </ | ||
- | |||
- | (if the authors were in the format " | ||
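To see what the author loop has to produce, here is a self-contained sketch that runs outside Zotero. It is a simplified stand-in and not Zotero's cleanAuthor implementation: `toCreator` and `parseAuthors` are hypothetical helpers that assume every name is written "Firstname Surname" and treat the last word as the surname.

```javascript
// Simplified stand-in for a creator-building step (not Zotero's cleanAuthor).
// Assumes "Firstname Surname" order; the last word is taken as the surname.
function toCreator(name, creatorType) {
  const parts = name.trim().split(/\s+/);
  const lastName = parts.pop();
  return { firstName: parts.join(" "), lastName: lastName, creatorType: creatorType };
}

// Split a comma-separated author string and build one creator per name,
// skipping empty fragments such as a trailing comma.
function parseAuthors(authors) {
  return authors
    .split(",")
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return s.length > 0; })
    .map(function (s) { return toCreator(s, "author"); });
}

const creators = parseAuthors("Firstname1 Surname1, Firstname2 Surname2");
console.log(creators.length);      // 2
console.log(creators[0].lastName); // Surname1
```

Inside a real translator the Zotero utility shown above should be preferred, since it handles name formats this sketch ignores.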
- | |||
- | Finally, the article needs to be saved. This is done with: | ||
- | |||
- | < | ||
- | newArticle.complete(); | ||
- | </ | ||
- | |||
- | This stores the citation in Zotero' | ||
- | |||
- | |||
- | ==== Development tips ==== | ||
- | |||
- | Each different type of citation has a different set of fields available for it (like the ' | ||
- | |||
- | ==== Getting your translator distributed ==== | ||
- | |||
- | Once you've finished your translator, if you want it to be distributed with Zotero, email the code to the Zotero developer list: zotero-dev@googlegroups.com |