Opened 9 years ago

Closed 8 years ago

#743 closed enhancement (fixed)

Support non-EZproxy proxies

Reported by: dstillman
Owned by: simon
Priority: major
Milestone: 1.0.8
Component: ingester
Version: 1.0
Keywords:
Cc: stakats, mikowitz, erazlogo

Description

We should offer more generalized proxy support. It seems many of the library proxy systems Zotero doesn't currently work with do use predictable URLs:

Example 1
Example 2

The first is some sort of (Juniper?) VPN system. I'm not sure what the second proxy server is.

The first one, at least, seems to be a common system that we should add automatic support for. We should also have some way of supporting unknown proxies with consistent URLs, even if it's as hacky as a hidden/advanced pref that takes a regex that we can provide to people on the forums if necessary. If we're going to support more complicated URLs like the VPN one, and not just modified domain names, the pref would need both a regex and a substitution string (that is, the first two parameters of the JS replace() method). And since people might use more than one library, we probably need to support multiple space-delimited sets in the pref.
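
A minimal sketch of how such a pref might be consumed, assuming a format this ticket does not specify: whitespace-delimited tokens read in pairs, each pair being a regex source and a replace() substitution string. The function names `parseProxyPref` and `toOriginalUrl` and the example pref value are hypothetical; the two sample URLs are ones quoted in this ticket's comments.

```javascript
// Sketch only: the pref format and both function names are assumptions.
// The pref is read as whitespace-delimited tokens taken in pairs:
// a regex source followed by a replace() substitution string.
function parseProxyPref(prefValue) {
  const tokens = prefValue.trim().split(/\s+/).filter(Boolean);
  const rules = [];
  for (let i = 0; i + 1 < tokens.length; i += 2) {
    rules.push({ pattern: new RegExp(tokens[i]), substitution: tokens[i + 1] });
  }
  return rules;
}

// Apply the first rule whose pattern matches; otherwise return the URL unchanged.
function toOriginalUrl(url, rules) {
  for (const rule of rules) {
    if (rule.pattern.test(url)) {
      return url.replace(rule.pattern, rule.substitution);
    }
  }
  return url;
}

// Two hypothetical sets: a domain-suffix proxy and a URL-prefix proxy.
const examplePref = [
  "^(https?://[^/]+)\\.offcampus\\.lib\\.washington\\.edu/", "$1/",
  "^https?://www\\.myuu\\.nl/(https?://)", "$1",
].join(" ");

const exampleRules = parseProxyPref(examplePref);
console.log(toOriginalUrl(
  "http://www.jstor.org.offcampus.lib.washington.edu/view/00218723", exampleRules));
// → http://www.jstor.org/view/00218723
console.log(toOriginalUrl(
  "https://www.myuu.nl/http://www.springerlink.com/content/q1j7651n41584r76/", exampleRules));
// → http://www.springerlink.com/content/q1j7651n41584r76/
```

One limitation of pairing tokens this way is that neither the regex nor the substitution string can contain whitespace; a real pref would need an escaping rule or a different delimiter if that ever mattered.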

It may be worth trying to do something quick and inelegant for 1.0. Simon, I could try to work on it if you don't think you'll have time. We could wait until 1.5, but I think this is making Zotero fairly unusable for a pretty large contingent of people.

Change History (19)

comment:1 Changed 9 years ago by dstillman

Looks like the proxy server from Example 2 might be common, too, since another user posted with an identically modified URL:

http://forums.zotero.org/discussion/1232/?Focus=4981#Comment_4981

comment:2 Changed 9 years ago by stakats

  • Milestone changed from 1.0 RC 4 to 1.0.2
  • Priority changed from major to critical

And in some cases, there is nothing (e.g. no "0-") prepended. For example:

http://www.jstor.org.offcampus.lib.washington.edu/view/00218723/di952370/95p00897/0?currentResult=00218723%2bdi952370%2b95p00897%2b0%2c00&searchUrl=http%3A%2F%2Fwww.jstor.org%2Fsearch%2FAdvancedResults%3Fhp%3D25%26si%3D26%26q0%3Dedmund%2Bmorgan%26f0%3Dau%26c0%3DAND%26ar%3Don%26wc%3Don%26sd%3D%26ed%3D%26la%3D

We ought to roll this out ASAP, even if inelegantly. Basically, we need two strings, a prefix and a suffix. Users could manually enter them in the preferences dialog. If we don't want to mess with XUL for the moment, they can be hidden prefs.
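
As a rough illustration of the prefix/suffix idea, the sketch below strips a user-supplied prefix and suffix from a URL's hostname using the standard `URL` class. The function name `stripProxyAffixes` and the `proxy.example.edu` host are hypothetical, and treating the prefix as optional is an assumption based on the note above that some proxies prepend nothing.

```javascript
// Sketch only: stripProxyAffixes is a hypothetical helper, not Zotero code.
// Removes a proxy-specific hostname suffix and, when present, a prefix.
function stripProxyAffixes(url, prefix, suffix) {
  const parsed = new URL(url);
  let host = parsed.hostname;
  if (suffix) {
    // A non-empty suffix must be present, or this is not a proxied URL.
    if (!host.endsWith(suffix)) return url;
    host = host.slice(0, host.length - suffix.length);
  }
  // The prefix is optional: some proxies prepend nothing.
  if (prefix && host.startsWith(prefix)) {
    host = host.slice(prefix.length);
  }
  parsed.hostname = host;
  return parsed.href;
}

console.log(stripProxyAffixes(
  "http://www.jstor.org.offcampus.lib.washington.edu/view/00218723/di952370/95p00897/0",
  "", ".offcampus.lib.washington.edu"));
// → http://www.jstor.org/view/00218723/di952370/95p00897/0
console.log(stripProxyAffixes(
  "http://0-www.jstor.org.proxy.example.edu/search/", "0-", ".proxy.example.edu"));
// → http://www.jstor.org/search/
```

Two plain strings are easier for users to enter than a regex, but they only cover proxies that modify the domain name; URL-rewriting systems like the VPN example would still need something more general.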

comment:3 Changed 9 years ago by simon

For 1.0, I think we should just loosen the regexps. I already loosened the regexps on most of the translators to allow for domain suffixes a while ago. It looks like all that's necessary is to allow domain prefixes as well. Some of the newer translators may also need refinements to their detectCode, if it's not already sufficiently specific.

Sean, does that second link not work? Or is it just an example? The current JSTOR regexp should be matching it.

For 1.5, I'd love to have intelligent proxy support, so that the user can flip a switch and requests to journal sites are automatically routed through his/her library proxy when s/he is off campus. This solution would fix this bug and #604. It would also be useful for links to journals from blogs, etc. (a situation I run into relatively frequently), and would provide another incentive to use Zotero.

comment:4 Changed 9 years ago by simon

On second thought, it looks like there are a lot more translators than there were when I made those changes, and going through each individually might be a lot of work, and not particularly worthwhile if we plan on implementing something more sophisticated in 1.5. We could loosen only the major databases, implement the preference, or some combination of the two. What do you think?

comment:5 Changed 9 years ago by stakats

I think loosening the regexps is a fine idea until we can come up with something more sophisticated. What regexp do you propose to handle the prefix most gracefully? Michael can then just zip through scrapers.sql and update the resources most likely to be proxied (e.g. LexisNexis, JSTOR, MUSE, etc.) with better support for prefixes and suffixes.

comment:6 Changed 9 years ago by stakats

Would the following do the trick or is it going to introduce other problems? For example:

change

^https?://serials\.abc-clio\.com[^/]*/active/go/ABC-Clio-Serials_v4

to

^https?://[^/]*serials\.abc-clio\.com[^/]*/active/go/ABC-Clio-Serials_v4

Please advise.

comment:7 Changed 9 years ago by simon

That should be fine. The odds that you'll introduce problems are low to begin with, since very few URLs besides ABC-CLIO will contain the domain "serials.abc-clio.com," although you should make sure that the detectCode actually does something with the DOM to make sure that it can scrape the page. (In this case, it does.)
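
To make the effect of the proposed change concrete, here is a small check of the two ABC-CLIO patterns quoted above against a direct URL and two proxied forms. The `proxy.example.edu` hosts are made-up examples, not URLs from any report in this ticket.

```javascript
// The two patterns are copied from the comment above; String.raw keeps the
// backslashes exactly as written.
const strictPattern = new RegExp(
  String.raw`^https?://serials\.abc-clio\.com[^/]*/active/go/ABC-Clio-Serials_v4`);
const loosePattern = new RegExp(
  String.raw`^https?://[^/]*serials\.abc-clio\.com[^/]*/active/go/ABC-Clio-Serials_v4`);

// Hypothetical test URLs: direct, suffix-only proxy, prefix-and-suffix proxy.
const directUrl = "http://serials.abc-clio.com/active/go/ABC-Clio-Serials_v4";
const suffixUrl = "http://serials.abc-clio.com.proxy.example.edu/active/go/ABC-Clio-Serials_v4";
const prefixUrl = "http://0-serials.abc-clio.com.proxy.example.edu/active/go/ABC-Clio-Serials_v4";

console.log(strictPattern.test(directUrl), loosePattern.test(directUrl)); // → true true
console.log(strictPattern.test(suffixUrl), loosePattern.test(suffixUrl)); // → true true
console.log(strictPattern.test(prefixUrl), loosePattern.test(prefixUrl)); // → false true
```

The looser pattern would also accept any other host that merely ends in "serials.abc-clio.com", which is why the detectCode check on the DOM matters.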

comment:8 Changed 9 years ago by simon

Also, we should remove the caret from the beginning of the regexps in order to match URLs like:

https://www.myuu.nl/http://www.springerlink.com/content/q1j7651n41584r76/
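
The difference the caret makes can be checked directly. The SpringerLink pattern below is a hypothetical stand-in written in the style of the ABC-CLIO example, not the translator's actual regexp.

```javascript
// Hypothetical SpringerLink patterns, identical except for the leading caret.
const anchoredPattern = new RegExp(
  String.raw`^https?://[^/]*springerlink\.com[^/]*/content/`);
const unanchoredPattern = new RegExp(
  String.raw`https?://[^/]*springerlink\.com[^/]*/content/`);

const proxiedUrl = "https://www.myuu.nl/http://www.springerlink.com/content/q1j7651n41584r76/";
const plainUrl = "http://www.springerlink.com/content/q1j7651n41584r76/";

// With the caret, the host directly after the first "://" must contain
// springerlink.com, so the proxied form is rejected.
console.log(anchoredPattern.test(proxiedUrl));   // → false
// Without it, the pattern can match the embedded URL further along the string.
console.log(unanchoredPattern.test(proxiedUrl)); // → true
console.log(anchoredPattern.test(plainUrl), unanchoredPattern.test(plainUrl)); // → true true
```

Dropping the anchor means the pattern matches any URL that embeds the target URL anywhere, so this again relies on detectCode to confirm the page is really scrapable.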

comment:9 Changed 9 years ago by erazlogo

This still doesn't seem to be fixed for 1.0.2. Concordia's URL is the same as Example 2, and I'm teaching Zotero in my research seminar this semester, so I need this to work for my students. Would you object if I edit some of the main translators to make them work? If there are no objections, how flexible should the new regexp be?

For example, original regexp:
https?://(?:www\.|ocrpdf-sandbox\.)jstor\.org[^/]*/(?:view|browse/[^/]+/[^/]+\?|search/|cgi-bin/jstor/viewitem)

Option 1:
https?://(?:0-www\.|www\.|ocrpdf-sandbox\.)jstor\.org[^/]*/(?:view|browse/[^/]+/[^/]+\?|search/|cgi-bin/jstor/viewitem)

Option 2 (as suggested in various comments above):
https?://[^/]*jstor\.org[^/]*/(?:view|browse/[^/]+/[^/]+\?|search/|cgi-bin/jstor/viewitem)

Thanks!

comment:10 Changed 9 years ago by stakats

  • Cc mikowitz added

No objections here at all. Thanks!

comment:11 Changed 9 years ago by mikowitz

If no one minds/has already done so, I'm going to go ahead and start adding Elena and Simon's change ideas to some of our major databases as a test. If anyone's already doing/done this, let me know.

comment:12 Changed 9 years ago by erazlogo

  • Cc erazlogo added

comment:13 Changed 9 years ago by erazlogo

Michael, that would be great, thanks! Let me know if the more general regexp works; that would be preferable, I think.

comment:14 Changed 9 years ago by mikowitz

(In [2083]) Addresses #743. Pushes first set of regex changes to handle proxies.
Translators updated:
-ABC-Clio Serials
-ACM
-BioMed Central
-Cambridge Journals Online
-Ebscohost
-JSTOR
-Lexis-Nexis

comment:15 Changed 9 years ago by erazlogo

(In [2104]) Addresses #743, sets History Cooperative and Project Muse to better handle other proxies

comment:16 Changed 9 years ago by dstillman

(In [2150]) Addresses #743, Support non-EZproxy proxies

Pushed updated PubMed (r2149) to repo

comment:17 Changed 9 years ago by dstillman

  • Priority changed from critical to major

comment:18 Changed 9 years ago by simon

(In [2463]) addresses #743, Support non-EZproxy proxies

theoretically support Juniper proxies. this needs testing.

comment:19 Changed 8 years ago by simon

  • Resolution set to fixed
  • Status changed from new to closed

(In [3107]) closes #743, Support non-EZproxy proxies
closes #831, transparent EZProxy support
adds a proxy pane to the preferences
asks before saving proxies to the DB (to avoid the potential phishing risk #831 would otherwise pose)
