Opened 10 years ago

Closed 10 years ago

#327 closed enhancement (fixed)

Scrapers should either take snapshots or use URL field

Reported by: dstillman
Owned by: simon
Priority: major
Milestone: 1.0 Beta 3
Component: ingester
Version: 1.0
Keywords:
Cc:

Description

The Link attachment type no longer officially exists, so we need to change scrapers to not use it.

Scrapers should either take snapshots or put the URL into the URL field and set accessDate to "CURRENT_TIMESTAMP" (which is now inserted as the SQL keyword rather than a string by Item.save()).
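To illustrate the difference between inserting CURRENT_TIMESTAMP as an SQL keyword and binding it as a string, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the save_item() helper are hypothetical stand-ins for illustration only; this is not the actual Item.save() code, which is not shown in this ticket.

```python
import sqlite3

# Hypothetical table; the real schema is not shown in this ticket.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (url TEXT, accessDate TEXT)")


def save_item(conn, url, access_date):
    """Insert an item, treating the CURRENT_TIMESTAMP sentinel as an SQL keyword."""
    if access_date == "CURRENT_TIMESTAMP":
        # The keyword is written into the SQL text, so the database
        # evaluates it at insert time instead of storing the literal string.
        conn.execute(
            "INSERT INTO items (url, accessDate) VALUES (?, CURRENT_TIMESTAMP)",
            (url,),
        )
    else:
        # Any other value is bound as an ordinary string parameter.
        conn.execute(
            "INSERT INTO items (url, accessDate) VALUES (?, ?)",
            (url, access_date),
        )


save_item(conn, "http://example.com/a", "CURRENT_TIMESTAMP")
save_item(conn, "http://example.com/b", "2006-10-01 12:00:00")
rows = conn.execute("SELECT url, accessDate FROM items ORDER BY url").fetchall()
```

Had the sentinel been bound as a parameter, the first row would contain the literal text "CURRENT_TIMESTAMP" rather than a date.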

(It also makes a lot more sense to have things that disappear behind an archive wall (like the NYT) use snapshots instead.)

Simon, if you won't have time to do this in the next couple days, reassign it to me, and I'll try to take care of it.

Change History (7)

comment:1 Changed 10 years ago by dstillman

  • Priority changed from critical to major

OK, we're keeping the link attachment type after all, but we still want to make some changes here.

At the very least, we need to use snapshots instead of links for pages that go behind a wall. I'll be adding a pref for automatically taking snapshots when creating new items from pages, which is really intended for the toolbar button but should apply to at least some of the scrapers as well (though the ones that gather PDFs or other large files may need a separate pref).

As for using the URL field, it obviously depends on the scraper. It makes sense to use the URL field rather than an attached link for sites with their own content (NYT, WashPo, etc.). In those cases there's no need for an attached link, though attaching a snapshot based on the pref would make sense. For things like Amazon, it doesn't make sense to use the URL field, since it's not part of the source, so that could probably stay as a link, though whether pages like that (which aren't going to change) should switch to snapshots if the pref is set is up for debate. At the very least, it would be good to use linkFromDocument() instead of linkFromURL() whenever possible (i.e., whenever the page being linked is already loaded), as the latter doesn't run the page through the fulltext indexer.
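The rules in this comment can be sketched as a small decision function. This is only one reading of the proposal, not the translator code: the Page fields and plan_item() are hypothetical names, the returned strings merely label which approach would be used, and two points are assumptions (walled pages always get a snapshot regardless of the pref, and Amazon-style pages stay as links, which the comment leaves open for debate).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Page:
    url: str
    hosts_own_content: bool  # e.g. a newspaper article rather than a catalog page
    goes_behind_wall: bool   # content later moves behind an archive wall
    already_loaded: bool     # the page being linked is already loaded


def plan_item(page: Page, automatic_snapshots: bool) -> Tuple[Optional[str], Optional[str]]:
    """Return (url_field, attachment) for a scraped page."""
    # Sites with their own content put the address in the URL field.
    url_field = page.url if page.hosts_own_content else None

    if page.goes_behind_wall:
        # Assumption: walled content is snapshotted regardless of the pref.
        attachment = "snapshot"
    elif page.hosts_own_content:
        # No attached link is needed; a snapshot follows the pref.
        attachment = "snapshot" if automatic_snapshots else None
    else:
        # Assumption: pages like Amazon stay as links. Prefer the
        # document-based link when the page is already loaded, since only
        # that path runs the page through the fulltext indexer.
        attachment = "linkFromDocument" if page.already_loaded else "linkFromURL"
    return url_field, attachment
```

For example, a walled newspaper article would come out as (its URL, "snapshot"), while an already-loaded catalog page would come out as (None, "linkFromDocument").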

comment:2 Changed 10 years ago by simon

(In [731]) closes #305, add conditionals/quotes to CSL
addresses #327, Scrapers should either take snapshots or use URL field
closes #309, Integration server prevents Zotero from loading in multiple instances of Firefox

comment:3 Changed 10 years ago by dstillman

(In [744]) Added automaticSnapshots pref, and changed Create New Item From Current Page button to obey pref

At least some scrapers (NYT and WashPo, for sure) should be updated to follow this pref

Addresses #327, Scrapers should either take snapshots or use URL field

comment:4 Changed 10 years ago by dstillman

(In [755]) Addresses #327, Scrapers should either take snapshots or use URL field

Use automaticSnapshots pref (which defaults to on and is changeable in the prefs window) rather than downloadAssociatedFiles (which defaults to off and is only settable through about:config at the moment) for now in translate.js

downloadAssociatedFiles should eventually be used for PDFs and other large files, whereas automaticSnapshots will be for HTML and the like. In the meantime, I think it's OK for scrapers to just follow the visible pref for both, since otherwise users would be totally confused when the NIFP button took a snapshot and the scrapers didn't.
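As a rough sketch of the two behaviors described here (not the actual translate.js logic), the pref lookup might look like the following. The function name, the split_by_type flag, and the set of MIME types counted as "HTML and the like" are assumptions made for illustration; only the pref names and their defaults come from the comment above.

```python
# Pref defaults as described in the comment above.
DEFAULT_PREFS = {"automaticSnapshots": True, "downloadAssociatedFiles": False}

# Assumption: which MIME types count as "HTML and the like".
HTML_LIKE = ("text/html", "application/xhtml+xml")


def should_save_attachment(mime_type, prefs, split_by_type=False):
    """Decide whether a scraper saves a local copy of an attachment.

    split_by_type=False models the interim behavior: every attachment
    follows the visible automaticSnapshots pref.
    split_by_type=True models the eventual plan: PDFs and other large
    files follow downloadAssociatedFiles instead.
    """
    if split_by_type and mime_type not in HTML_LIKE:
        return prefs["downloadAssociatedFiles"]
    return prefs["automaticSnapshots"]
```

With the defaults, the interim behavior saves PDFs as well as HTML pages, which is the side effect the next paragraph asks about.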

Simon, if there's any problem I'm not aware of with switching this for now (other than people getting some large PDFs on JSTOR), let me know.

comment:5 Changed 10 years ago by dstillman

Ticket for using downloadAssociatedFiles for PDFs is #351

comment:6 Changed 10 years ago by dstillman

  • Milestone changed from 1.0 Final to 1.0 Beta 3

comment:7 Changed 10 years ago by simon

  • Resolution set to fixed
  • Status changed from new to closed

(In [939])

  • closes #327, scrapers should either take snapshots or use URL field
  • closes #351, scrapers with PDF downloads should use downloadAssociatedFiles instead of automaticSnapshots

there are some problems with snapshot titles. see bug #436.
