Chapter 16: Scraping the Individual Entry Page

As in earlier chapters, the tutorial example in HWZT chapter 16 (Scraping the Individual Entry Page) works but is presented rather casually. The following step-by-step instructions reproduce the result of that tutorial section:

  1. open the single-result translator
  2. extend the single-result translator
  3. test the single-result translator on single-result pages
  4. test the single-result translator on a search-results page

To open the single-result translator:

  1. Close any running Scaffold 2.0 instances.
  2. Ensure the first sample page is open in your browser and has focus.
  3. Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog=“Zotero Scaffold”.
  4. Hit icon=Load (at the upper left of the dialog). This should pop up dialog=“Load Translator”, displaying a single table mapping “Label” to “Creator”.
  5. Scroll through the “Load Translator” table until you see Label=“How to Write a Zotero Translator (single result)”, then hit button=OK.
  6. You will return to the main dialog=“Zotero Scaffold”. Check to see that tab=Metadata is properly populated.
  7. Click icon=“Run detectWeb” (the eye) to ensure that the current code still works. Since the first sample page is a single-item page, in the Test Frame you should get results like
    12:00:00 detectWeb returned type "book"
  8. Leave the current instance of Scaffold 2.0 open, since we'll use the same translator in the next section.

To extend the single-result translator, add a doWeb function to the JavaScript in tab=Code. There are several ways to implement this. The following code has been tested and is somewhat more modular than that provided by HWZT. To use it:

  1. replace the current contents of tab=Code with the code below:
    function detectWeb(doc, url) {
      if (doc.title.match("Single Item")) {
        return "book";
      } else if (doc.title.match("Search Results")) {
        return "multiple";
      }
    }
    
    // Copy a scraped value into the matching Zotero field, but only if the page supplied it
    function associateData (newItem, items, field, zoteroField) {
      if (items[field]) {
        newItem[zoteroField] = items[field];
      }
    }
    
    function scrape(doc, url) {
      // variable declarations
      var newItem = new Zotero.Item('book');
      newItem.url = doc.location.href;
      newItem.title = "No Title Found";
      var items = new Object();
      var tagsContent = new Array();
    
      // scrape page data, save to Zotero
      getItems(doc, items, tagsContent);
      getAuthors(newItem, items);
      getImprints(newItem, items);
      getTags(newItem, items, tagsContent);
      saveToZotero(newItem, items);
    }
    
    function getItems(doc, items, tagsContent) {
      // namespace code
      var namespace = doc.documentElement.namespaceURI;
      var nsResolver = namespace ? 
        function(prefix) {
          if (prefix == 'x') return namespace; else return null;
        } : null;
    
      // populate "items" Object and save tags to an Array
      var blankCell = "temp";
      var headersTemp;
      var headers;
      var contents;
      var myXPathObject = doc.evaluate('//td[1]', doc, nsResolver, XPathResult.ANY_TYPE, null);
      var myXPathObject2 = doc.evaluate('//td[2]', doc, nsResolver, XPathResult.ANY_TYPE, null);
      while (headers = myXPathObject.iterateNext()) {
        headersTemp = headers.textContent;
        if (!headersTemp.match(/\w/)) {
          headersTemp = blankCell;
          blankCell = blankCell + "1";
        }
        contents = myXPathObject2.iterateNext().textContent;
        if (headersTemp.match("temp")) {
          tagsContent.push(contents);
        }
        items[headersTemp.replace(/\s+/g, '')]=contents.replace(/^\s*|\s*$/g, '');
      }
    }
    
    function getAuthors(newItem, items) {
      //Formatting and saving "Author" field
      if (items["PrincipalAuthor:"]) {
        var author = items["PrincipalAuthor:"];
        if (author.match("; ")) {
          var authors = author.split("; ");
          for (var i = 0; i < authors.length; i++) {
            newItem.creators.push(Zotero.Utilities.cleanAuthor(authors[i], "author"));
          }
        } else {
          newItem.creators.push(Zotero.Utilities.cleanAuthor(author, "author"));
        }
      }
    }
    
    function getImprints(newItem, items) {
      // Format and save "Imprint" fields
      if (items["Imprint:"]) {
        items["Imprint:"] = items["Imprint:"].replace(/\s\s+/g, '');
        if (items["Imprint:"].match(":")) {
          var colonLoc = items["Imprint:"].indexOf(":");
          newItem.place = items["Imprint:"].substr(1, colonLoc-1);
          var commaLoc = items["Imprint:"].lastIndexOf(",");
          var date1 =items["Imprint:"].substr(commaLoc + 1);
          newItem.date = date1.substr(0, date1.length-1);
          newItem.publisher = items["Imprint:"].substr(colonLoc+1, commaLoc-colonLoc-1);
        } else {
          newItem.publisher = items["Imprint:"];
        }
      }
    }
    
    function getTags(newItem, items, tagsContent) {
      if (items["Subjects:"]) {
        tagsContent.push(items["Subjects:"]);
      }
      for (var i = 0; i < tagsContent.length; i++) {
        newItem.tags[i] = tagsContent[i];
      }
    }
    
    function saveToZotero(newItem, items) {
      // Associate and save well-formed data to Zotero
      associateData (newItem, items, "Title:", "title");
      associateData (newItem, items, "ISBN-10:", "ISBN");
      associateData (newItem, items, "Collection:", "extra");
      associateData (newItem, items, "Pages:", "pages");
      newItem.repository = "NiCHE";
      newItem.complete();
    }
    
    function doWeb(doc, url) {
      // namespace code
      var namespace = doc.documentElement.namespaceURI;
      var nsResolver = namespace ? function(prefix) {
        if (prefix == 'x') return namespace; else return null;
      } : null;
    
      // variable declarations
      var articles = new Array();
      var items = new Object();
      var nextTitle;
    
      // If Statement checks if page is a Search Result, then saves requested Items
      if (detectWeb(doc, url) == "multiple") {
        var titles = doc.evaluate('//td[2]/a', doc, nsResolver, XPathResult.ANY_TYPE, null);
        while (nextTitle = titles.iterateNext()) {
          items[nextTitle.href] = nextTitle.textContent;
        }
        items = Zotero.selectItems(items);
        for (var i in items) {
          articles.push(i);
        }
      } else {
        //saves single page items
        articles = [url];
      }
    
      // process everything, calling function=scrape to do the heavy lifting
      Zotero.Utilities.processDocuments(articles, scrape, function(){Zotero.done();});
      Zotero.wait();
    }
  2. Click icon=Save (second from left): your translator should save silently.
  3. Close all running Scaffold 2.0 instances.
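
The string handling in getImprints is easier to follow outside Scaffold. The following is a minimal standalone sketch of the same steps, not part of the translator itself. The function name parseImprint and the sample imprint string are assumptions made for illustration; the text on the real sample page may be formatted differently.

```javascript
// Standalone sketch of the string handling in getImprints().
// ASSUMPTION: parseImprint and the sample string below are hypothetical;
// the imprint text on the real sample page may differ.
function parseImprint(raw) {
  // Drop runs of two or more whitespace characters, as getImprints does.
  var imprint = raw.replace(/\s\s+/g, '');
  var result = {};
  if (imprint.indexOf(":") === -1) {
    // No colon: treat the whole string as the publisher.
    result.publisher = imprint;
    return result;
  }
  var colonLoc = imprint.indexOf(":");
  var commaLoc = imprint.lastIndexOf(",");
  // Skip the first character (assumed here to be an opening bracket).
  result.place = imprint.substr(1, colonLoc - 1);
  // Text between the first colon and the last comma.
  result.publisher = imprint.substr(colonLoc + 1, commaLoc - colonLoc - 1);
  // Text after the last comma, minus the final character
  // (assumed here to be a closing bracket).
  var tail = imprint.substr(commaLoc + 1);
  result.date = tail.substr(0, tail.length - 1);
  return result;
}

var parsed = parseImprint("[Toronto:Nelson Canada,2009]");
console.log(parsed.place + " | " + parsed.publisher + " | " + parsed.date);
```

Note that the whitespace regex only removes runs of two or more characters, so a single space next to the colon or comma would survive into the place, publisher, or date values.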

To test the translator on a single-result page:

  1. Ensure the first sample page is open in your browser and has focus.
  2. Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog=“Zotero Scaffold”.
  3. Hit icon=Load (at the upper left of the dialog). This should pop up dialog=“Load Translator”, displaying a single table mapping “Label” to “Creator”.
  4. Scroll through the “Load Translator” table until you see Label=“How to Write a Zotero Translator (single result)”, then hit button=OK.
  5. You will return to the main dialog=“Zotero Scaffold”. Check to see that tab=Metadata is properly populated.
  6. Click icon=“Run doWeb” (the thunderbolt) to test the new code: in the Test Frame you should get results like
    12:00:00 Returned item:
        'itemType' => "book"
        'creators' ...
            '0' ...
                'firstName' => "Alan"
                'lastName' => "MacEachern"
                'creatorType' => "author"
            '1' ...
                'firstName' => "William J."
                'lastName' => " Turkel"
                'creatorType' => "author"
        'notes' ...
        'tags' ...
            '0' => "History"
            '1' => "Methodology"
            '2' => "Tables."
            '3' => "Environment"
        'seeAlso' ...
        'attachments' ...
        'url' => "http://niche-canada.org/member-projects/zotero-guide/sample1.html"
        'title' => "Method and Meaning in Canadian Environmental History"
        'place' => "Toronto"
        'date' => "2009"
        'publisher' => "Nelson Canada"
        'ISBN' => "0176441166"
        'extra' => "None"
        'pages' => "573"
        'libraryCatalog' => "NiCHE"
        'complete' => function(...){...} 
             
    14:36:30 Translation successful
  7. Close all running Scaffold 2.0 instances.
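
The order of the tags in this output follows from how getItems and getTags cooperate: rows whose header cell is blank are collected as tags first, and the “Subjects:” value is appended last. The sketch below mirrors that logic on plain arrays instead of XPath results. The function name pairRows and the row data are hypothetical stand-ins, not the actual cells of the sample page.

```javascript
// Sketch of the header/value pairing in getItems() plus the tag step in getTags().
// ASSUMPTION: pairRows and the `rows` data are hypothetical; each row stands in
// for one (first cell, second cell) pair from the page's table.
function pairRows(rows) {
  var items = {};
  var tagsContent = [];
  var blankCell = "temp";
  for (var i = 0; i < rows.length; i++) {
    var header = rows[i][0];
    var contents = rows[i][1];
    if (!/\w/.test(header)) {
      // Blank header: invent a unique key ("temp", "temp1", "temp11", ...).
      header = blankCell;
      blankCell = blankCell + "1";
    }
    if (header.match("temp")) {
      // Rows filed under a "temp" key are treated as tags.
      tagsContent.push(contents);
    }
    // Keys lose all whitespace; values lose leading and trailing whitespace.
    items[header.replace(/\s+/g, '')] = contents.replace(/^\s*|\s*$/g, '');
  }
  // getTags() appends the "Subjects:" value after the blank-header rows.
  if (items["Subjects:"]) {
    tagsContent.push(items["Subjects:"]);
  }
  return { items: items, tags: tagsContent };
}

var demo = pairRows([
  ["Principal Author:", " Alan MacEachern "],
  ["Subjects:", "Environment"],
  ["", "History"],
  [" ", "Methodology"]
]);
console.log(demo.items["PrincipalAuthor:"]);
console.log(demo.tags.join(", "));
```

One consequence of matching on the substring "temp" is that any real header containing those letters would also be collected as a tag, so a stricter check may be worth considering for other sites.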

To test the translator on another single-result page:

  1. Ensure the second sample page is open in your browser and has focus.
  2. Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog=“Zotero Scaffold”.
  3. Hit icon=Load (at the upper left of the dialog). This should pop up dialog=“Load Translator”, displaying a single table mapping “Label” to “Creator”.
  4. Scroll through the “Load Translator” table until you see Label=“How to Write a Zotero Translator (single result)”, then hit button=OK.
  5. You will return to the main dialog=“Zotero Scaffold”. Check to see that tab=Metadata is properly populated.
  6. Click icon=“Run doWeb” (the thunderbolt) to test the new code: in the Test Frame you should get results like
    12:00:00 Returned item:
        'itemType' => "book"
        'creators' ...
            '0' ...
                'firstName' => "David Freeland"
                'lastName' => "Duke"
                'creatorType' => "author"
        'notes' ...
        'tags' ...
            '0' => "Canada"
            '1' => "Environment"
            '2' => "Bibliography"
            '3' => "Tables."
            '4' => "History"
        'seeAlso' ...
        'attachments' ...
        'url' => "http://niche-canada.org/member-projects/zotero-guide/sample2.html"
        'title' => "Canadian Environmental History: Essential Readings"
        'place' => "Toronto"
        'date' => "2006"
        'publisher' => "Canadian Scholars Press"
        'ISBN' => "1551303108"
        'extra' => "None"
        'pages' => "392"
        'libraryCatalog' => "NiCHE"
        'complete' => function(...){...}

    18:52:01 Translation successful
  7. Close all running Scaffold 2.0 instances.

To test the translator on a search-results page:

  1. Ensure the first sample search page is open in your browser and has focus.
  2. Open Scaffold 2.0 from the Firefox main menu with Tools>Scaffold. This should pop up dialog=“Zotero Scaffold”.
  3. Hit icon=Load (at the upper left of the dialog) to pop up dialog=“Load Translator”.
  4. Scroll through the “Load Translator” table until you see Label=“How to Write a Zotero Translator (single result)”, then hit button=OK.
  5. You will return to the main dialog=“Zotero Scaffold”. Check to see that tab=Metadata is properly populated.
  6. Click icon=“Run doWeb” (the thunderbolt): a dialog=“Select Items” should pop up, with a selection area containing 10 items, corresponding to the 10 items on the sample search page. Check that the titles of the items in the dialog match the titles of the items on the sample search page, then click button=Cancel on dialog=“Select Items”.
  7. TODO: test adding an item to your library.
  8. Leave the current instance of Scaffold 2.0 open, since we'll use the same translator in the next section.
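
What happens behind the “Select Items” dialog can be sketched without Zotero. The code below imitates the “multiple” branch of doWeb: it builds a map from link URL to title, hands it to a selection function, and collects the URLs that come back. The names collectArticles, chooseAll, and cancel, and the example.org URLs, are hypothetical; the two small functions merely stand in for Zotero.selectItems, which is assumed here to return a falsy value when the dialog is cancelled.

```javascript
// Sketch of the "multiple" branch of doWeb() with the Zotero calls stubbed out.
// ASSUMPTION: collectArticles, chooseAll, cancel, and the URLs are hypothetical;
// `links` stands in for the <a> elements found by the XPath query.
function collectArticles(links, selectItems) {
  // Map each link URL to its visible title, as doWeb does.
  var items = {};
  for (var i = 0; i < links.length; i++) {
    items[links[i].href] = links[i].textContent;
  }
  // Let the user (here: a stub) choose a subset of the map.
  var chosen = selectItems(items);
  // Keep only the URLs; these are what would be passed on for scraping.
  var articles = [];
  for (var url in chosen) {
    articles.push(url);
  }
  return articles;
}

// Stub that accepts every offered item.
function chooseAll(items) { return items; }
// Stub that imitates a cancelled dialog (assumed to yield a falsy value).
function cancel(items) { return null; }

var links = [
  { href: "http://example.org/a.html", textContent: "First title" },
  { href: "http://example.org/b.html", textContent: "Second title" }
];
console.log(collectArticles(links, chooseAll));
console.log(collectArticles(links, cancel));
```

Under that assumption, cancelling the dialog leaves the list of URLs empty, so nothing is scraped.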

Since it is now clear that our “single-result translator” also handles search results properly, we can save it as just “How to Write a Zotero Translator” with no suffix:

  1. Switch to tab=Metadata.
  2. In field=Label, remove the “(single result)” suffix.
  3. Click icon=Save (second from left): your translator should save silently.
  4. TODO: delete now-unnecessary translator=“How to Write a Zotero Translator (search results)”
  5. Close all running Scaffold 2.0 instances.

Next: Chapter 17: Common Problems when Scraping an Individual Entry Page

dev/how_to_write_a_zotero_translator_2nd_edition/chapter_16.txt · Last modified: 2011/04/03 16:29 by debweb