Opened 10 years ago

Closed 10 years ago

#313 closed enhancement (fixed)

Blacklist known ad sites from scraper detection

Reported by: dstillman
Owned by: simon
Priority: minor
Milestone:
Component: ingester
Version:
Keywords:
Cc:

Description

I noticed in the debug output that Zotero was running detect code (COinS, etc.) on ad.doubleclick.net and its ilk. I don't know how resource-intensive detection is compared with running the regexes needed to blacklist these sites (the blacklist would probably want to go through the same DB so that it is repository-based), but since detection often runs on these sites multiple times per page, it might be wise to do something about it.

Change History (1)

comment:1 Changed 10 years ago by simon

  • Resolution set to fixed
  • Status changed from new to closed

(In [734]) closes #334, Washington Post scraper shouldn't include " - washingtonpost.com" in title
closes #313, Blacklist known ad sites from scraper detection
closes #306, some New York Times ads prevent page from being recognized
closes #308, attachment import bug

Currently, the ad site blacklist is located at the top of ingester/browser.js. At some point, we may want to switch this to a database table.
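
For illustration only, a regex-based blacklist check of the kind described might look roughly like the sketch below. This is not the code committed in [734]: the names AD_SITE_BLACKLIST, isBlacklisted, and shouldRunDetection are hypothetical, and the single pattern shown is an assumption based on the ad.doubleclick.net example in the description.

```javascript
// Hypothetical blacklist of ad-site URL patterns. The real list lives at the
// top of ingester/browser.js and may look quite different.
var AD_SITE_BLACKLIST = [
  // doubleclick.net and any of its subdomains (e.g. ad.doubleclick.net)
  /^https?:\/\/([^\/]+\.)?doubleclick\.net(\/|$)/i
];

// Returns true if the URL matches any blacklisted pattern.
function isBlacklisted(url) {
  for (var i = 0; i < AD_SITE_BLACKLIST.length; i++) {
    if (AD_SITE_BLACKLIST[i].test(url)) {
      return true;
    }
  }
  return false;
}

// Hypothetical guard a caller could use before running detect code
// (COinS, etc.) on a loaded document or frame.
function shouldRunDetection(url) {
  return !isBlacklisted(url);
}
```

Checking the URL before detection means an ad frame costs one pass over a short list of regexes rather than a full detection run, which matters most when several such frames load on one page.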