Opened 10 years ago

Closed 10 years ago

#377 closed defect (fixed)

Problems scraping from Hubmed/PubMed

Reported by: dstillman Owned by: simon
Priority: major Milestone: 1.0 Beta 3
Component: ingester Version: 1.0
Keywords: Cc:

Description

Two or three (depending on whether the first two are related) problems on Hubmed/Pubmed reported by a user:

The first two are "Could not save item" errors that I can't reproduce, but I see them in the notification log:

05d07af9-105a-4572-99f6-a8e231c0daef  2006-10-02 17:00:00  81.151.78.150  Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1) Gecko/20061010 Firefox/2.0
QueryInterface => function QueryInterface() {
    [native code]
}
message => 
result => 2152398858
filename => chrome://zotero/content/xpcom/translate.js
lineNumber => 571
columnNumber => 0
initialize => function initialize() {
    [native code]
}
url => http://www.hubmed.org/display.cgi?uids=17054214
extensions.zotero.cacheTranslatorData => true
extensions.zotero.automaticSnapshots => true
2006-10-27 06:34:26
fcf41bed-0cbc-3704-85c7-8062a0068a7a  2006-10-23 00:23:00  128.32.177.180  Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0
message => missing ) after argument list
fileName => 
lineNumber => 108
stack => doWeb()@:0
@:0
@:0

name => SyntaxError
url => http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?holding=;&db=PubMed&cmd=search&term=takeout
extensions.zotero.cacheTranslatorData => true
extensions.zotero.automaticSnapshots => true
2006-10-27 03:11:30

Here's the third:

In Hubmed there's an Export link below the abstract. Click this and choose the RDF export. For me, the details then show as plain text in the browser window. Clicking the icon to add to Zotero shows "Saving item...", but the item never gets saved, and the "Saving item..." message only goes away after a Firefox restart.

This one I can reproduce. It throws an error at line 225 of translate.js: doc.location has no properties

I can access doc.location's properties via Venkman, so it looks like it may be a sandbox problem.

For what it's worth, I'm also getting "doc.domain has no properties" on line 175 in lots of places, though that's probably unrelated.
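A defensive guard would avoid this class of error. The following is a hypothetical sketch, not the actual translate.js code; it assumes the failure mode is that the document reference passed into the translator is null, or its location is no longer reachable, by the time the save runs (consistent with comment:1's fix of hiding the scrape icon on navigation):

```javascript
// Illustrative guard (not the real translate.js code): check the
// document reference before dereferencing doc.location, instead of
// letting "doc.location has no properties" propagate.
function getDocUrl(doc) {
  if (!doc || !doc.location) {
    return null; // document gone, e.g. the user navigated away
  }
  return doc.location.href;
}

console.log(getDocUrl(null));                                   // null
console.log(getDocUrl({ location: { href: "http://example.org/" } }));
// "http://example.org/"
```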

Change History (4)

comment:1 Changed 10 years ago by simon

(In [844]) addresses #377, Problems scraping from Hubmed/PubMed

makes scrape icon disappear when navigating away from a page

comment:2 Changed 10 years ago by simon

The third is now resolved; I need to figure out how to reproduce the first and second.

comment:3 Changed 10 years ago by dstillman

According to the reporter's feedback, this happens without being logged into My NCBI and without any other extensions installed.

comment:4 Changed 10 years ago by dstillman

  • Resolution set to fixed
  • Status changed from new to closed

(In [883]) Fixes #377, Problems scraping from Hubmed/PubMed
Fixes #381, SIRSI scraper no longer working at William & Mary

And new Amazon scraper. And a few COinS errors. And possibly some others.

It turns out Firefox has a bug in which DOM nodeValues longer than 4096 characters are split into multiple text nodes, so any scrapers pulled from the repository with 'code' fields over 4K were being truncated. We didn't see it while testing the repository code because most scrapers are smaller than that.

Calling normalize() on the node combines the nodes, so future releases won't have the problem regardless of when it's fixed in Firefox. For existing installs, I managed to get PubMed, COinS, SIRSI 2003+, and, with quite a lot of effort, Amazon, under 4096 characters, hopefully without breaking anything. I removed all other scrapers from the repository for now.
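The truncation and the normalize() fix described above can be sketched as follows. This is an illustrative mock of the DOM behavior (the node object and names are invented for the example, since the real bug only manifests inside Firefox's parser):

```javascript
// Minimal mock of a DOM element whose text content arrives split into
// several adjacent text nodes, as with Firefox's >4096-char nodeValue
// bug. makeSplitNode and its shape are illustrative, not a real API.
function makeSplitNode(chunks) {
  const node = {
    childNodes: chunks.map(c => ({ nodeValue: c })),
    get firstChild() { return node.childNodes[0]; },
    // Like the DOM's Node.normalize(): merge adjacent text nodes.
    normalize() {
      node.childNodes = [{ nodeValue: chunks.join("") }];
    },
  };
  return node;
}

// A 4196-char scraper 'code' field split at the 4096-char boundary:
const code = makeSplitNode(["a".repeat(4096), "b".repeat(100)]);

// Reading only firstChild.nodeValue truncates the scraper code:
console.log(code.firstChild.nodeValue.length); // 4096

// Calling normalize() first recovers the full value:
code.normalize();
console.log(code.firstChild.nodeValue.length); // 4196
```

This is why future releases are safe regardless of when Firefox fixes the underlying bug: the merge happens before the 'code' field is read.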
