Opened 10 years ago

Closed 10 years ago

#312 closed defect (fixed)

Optional subdomains should be optional in scraper regular expressions

Reported by: dstillman
Owned by: simon
Priority: minor
Milestone:
Component: ingester
Version:
Keywords:
Cc:

Description

For example, while Amazon's own links all point to the canonical 'www.amazon.com' host, URLs without 'www.' still work for viewing pages on the store, but the scraper doesn't detect them. The same is true of NYTimes, and possibly a few other sites.

Might want to allow an optional 's' in the scheme as well, since one can also view Amazon pages (and probably other sites) over HTTPS.
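
To illustrate the request (the actual scraper entries aren't shown in this ticket, so the pattern and URLs below are only a sketch), the change amounts to making the 'www.' and the 's' optional in the target regex:

    // Hypothetical sketch only: variable names and example URLs are illustrative,
    // not taken from the ingester's scraper table.

    // Current behavior: only the canonical host over plain HTTP matches.
    const strictTarget: RegExp = /^http:\/\/www\.amazon\.com\//;

    // Requested behavior: optional "www." subdomain and optional "s" in the scheme.
    const relaxedTarget: RegExp = /^https?:\/\/(?:www\.)?amazon\.com\//;

    console.log(strictTarget.test("http://amazon.com/gp/product/0000000000"));        // false
    console.log(relaxedTarget.test("http://amazon.com/gp/product/0000000000"));       // true
    console.log(relaxedTarget.test("https://www.amazon.com/gp/product/0000000000"));  // true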

Change History (2)

comment:1 Changed 10 years ago by simon

The problem is that if you do a search from http://amazon.com/, you get links to http://www.amazon.com/, which completely screws things up, since it trips Mozilla's same-origin security check (which I can't seem to disable without turning off all of the security features, which is not a good idea). The preferred way to deal with this would be to convince the Mozilla people to provide some way of disabling the origin check for sandboxed scripts (which might also be useful for GreaseMonkey), but that hasn't happened yet. That leaves two possibilities for us:

  1. Change the Amazon scraper to scrape the page source (which isn't affected by same-origin security).
  2. Redirect the user from amazon.com to www.amazon.com.

Neither of these has to be done now, since both are possible in pure scraper code.
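
A rough sketch of what option 2 might look like in scraper code (the function name and calling convention are assumptions, not the actual ingester API):

    // Rough sketch of option 2: bounce the user from amazon.com to
    // www.amazon.com so that later requests pass the same-origin check.
    // The function name and how it would be wired into a scraper are illustrative.
    function redirectToCanonicalHost(doc: Document, url: string): boolean {
        const canonical = url.replace(/^(https?:\/\/)amazon\.com\//, "$1www.amazon.com/");
        if (canonical !== url) {
            doc.location.href = canonical;  // page reloads on the www. host
            return true;                    // caller should stop scraping this load
        }
        return false;
    }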

comment:2 Changed 10 years ago by simon

  • Resolution set to fixed
  • Status changed from new to closed

(In [948]) closes #432, Add hidden pref to enable clipboard export on Mac. The pref is extensions.zotero.enableMacClipboard
closes #312, Optional subdomains should be optional in scraper regular expressions.
