Opened 8 years ago

Closed 2 years ago

#1146 closed enhancement (fixed)

Check for duplicate items functionality

Reported by: stakats
Owned by: dstillman
Priority: major
Milestone:
Component: data layer
Version: 2.1
Keywords: mellon
Cc:

Description

Ben's distance code should go some way toward addressing this problem.

Attachments (3)

duplicates_v06.patch (42.3 KB) - added by fbennett 6 years ago.
duplicates_v06_-w_option.patch (42.3 KB) - added by fbennett 6 years ago.
duplicates_detection_multilingual_r7712.patch (46.8 KB) - added by fbennett 6 years ago.

Change History (14)

comment:1 Changed 8 years ago by dstillman

(In [4215]) Addresses #1146, Check for duplicate items functionality

Ben's duplicate detection code, with the integration reworked a bit

Very rough, so currently requires creation of a boolean extensions.zotero.debugShowDuplicates pref to view the Actions menu option
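
A minimal sketch of one way to create that pref, assuming the standard Firefox preference mechanisms apply (a new boolean entry in about:config, or a line in a user.js file in the profile directory). Only the pref name comes from this commit message; the user.js route is an assumption about the local setup.

```javascript
// user.js in the Firefox profile directory (assumed setup, not part of the commit).
// Creates the boolean pref that exposes the duplicates option in the Actions menu.
user_pref("extensions.zotero.debugShowDuplicates", true);
```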

comment:2 Changed 8 years ago by dstillman

(In [4216]) Addresses #1146, Check for duplicate items functionality

Actually load duplicate detection code

comment:3 Changed 8 years ago by dstillman

  • Component changed from interface to data layer
  • Milestone changed from 1.5 Final to 1.5 Beta 3

Some things missing:

  • More sophisticated detection logic (using unique identifiers (URL, ISBN, DOI...), levenshtein() (available in Zotero.Utilities and as a DB function), taking creators into account, etc.)
  • UI button to revert to showing all items without switching collections
  • A way to automatically merge child items and collections
  • What about word processor documents that rely on keys? Tough luck?
  • Awareness of duplicate mode in itemTreeView.js notify() (for ignoring certain refresh actions)
  • Anything on ingest/import? Maybe not.
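
As a rough illustration of the first point, detection could check unique identifiers first and fall back to a fuzzy title comparison plus a creator check. This is only a sketch under stated assumptions: the item shape (`title`, `creators` as an array of last names, `DOI`, `ISBN`, `url`), the helper names, and the 0.1 threshold are all hypothetical, and the `levenshtein()` below is a standalone implementation, not the one in Zotero.Utilities.

```javascript
// Hypothetical sketch: none of these names are taken from the Zotero code base.

// Levenshtein edit distance, dynamic programming with two rows.
function levenshtein(a, b) {
  if (a === b) return 0;
  if (a.length === 0) return b.length;
  if (b.length === 0) return a.length;
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// Lowercase, replace punctuation with spaces, collapse whitespace.
function normalizeText(s) {
  return (s || "")
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Identifiers compare case-insensitively; ISBNs also ignore hyphens and spaces.
function normalizeId(field, value) {
  const v = String(value).trim().toLowerCase();
  return field === "ISBN" ? v.replace(/[\s-]/g, "") : v;
}

const ID_FIELDS = ["DOI", "ISBN", "url"]; // assumed field names

function isLikelyDuplicate(a, b, maxRelativeDistance = 0.1) {
  // 1. A shared unique identifier settles the question immediately.
  for (const field of ID_FIELDS) {
    if (a[field] && b[field] &&
        normalizeId(field, a[field]) === normalizeId(field, b[field])) {
      return true;
    }
  }
  // 2. Otherwise require near-identical titles...
  const ta = normalizeText(a.title);
  const tb = normalizeText(b.title);
  if (!ta || !tb) return false;
  const relative = levenshtein(ta, tb) / Math.max(ta.length, tb.length);
  if (relative > maxRelativeDistance) return false;
  // 3. ...and matching first creators (both missing counts as a match).
  const ca = normalizeText(a.creators && a.creators[0]);
  const cb = normalizeText(b.creators && b.creators[0]);
  return ca === cb;
}
```

Checking identifiers before titles keeps the expensive distance computation off the common path; whether that ordering suits real libraries would need testing.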

comment:4 Changed 7 years ago by dstillman

  • Milestone 2.0 Beta 3 deleted
  • Priority changed from critical to major
  • Version changed from 1.5 to 2.1

comment:5 Changed 6 years ago by fbennett

I've put up a patch (readable version is duplicates_v06_-w_option.patch) that covers some of the outstanding issues.

Features include:

  • Duplicates check done in SQL rather than JS (possibly helps performance? not sure, haven't tested);
  • Duplicates view appears on top-level collection context menu, rather than gear menu;
  • Web translators can explicitly set the fields to be checked on a scraped item;
  • Items can be merged with a master (master substituted into collections as appropriate);
  • Duplicates view can be canceled without switching collections;
  • User can mark individual suspect items for checking;
  • Duplicates checking is available on group libraries as well as the main library.
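
To make the merge behavior concrete, here is a minimal in-memory sketch of substituting a master into collections. It is not the code in the patch: the function name and the collection shape (`{ name, itemIDs: Set }`) are assumptions chosen for illustration.

```javascript
// Hypothetical sketch: a collection is modeled as { name, itemIDs: Set<number> }.
// Every collection that held a dropped duplicate ends up holding the master instead.
function mergeIntoMaster(masterID, duplicateIDs, collections) {
  const dropped = new Set(duplicateIDs);
  dropped.delete(masterID); // the master never replaces itself
  for (const collection of collections) {
    let hadDuplicate = false;
    for (const id of dropped) {
      // Set.prototype.delete returns true only if the id was present.
      if (collection.itemIDs.delete(id)) hadDuplicate = true;
    }
    if (hadDuplicate) collection.itemIDs.add(masterID);
  }
  // Return the IDs the caller still has to delete from the library itself.
  return [...dropped];
}
```

Collections that never contained a duplicate are left untouched, so the master only appears where one of the merged items used to be.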

Shortcomings include:

  • Still no provision for mapping item IDs, so merged items will break in word processor documents that depend on the dropped item. On the other hand, a credible substitute will be available in the original location, so it's only half bad.
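
For reference, the missing ID mapping could be as simple as a redirect table from dropped keys to the surviving master, consulted whenever a document asks for a key that no longer exists. The sketch below is purely illustrative: the class, its methods, and the sample keys are hypothetical and nothing like it is in the patch.

```javascript
// Hypothetical sketch of a dropped-key -> master-key redirect table.
class KeyRedirects {
  constructor() {
    this.redirects = new Map();
  }

  // Remember that droppedKey was merged into masterKey.
  record(droppedKey, masterKey) {
    // Refuse mappings that would make a key redirect (directly or indirectly) to itself.
    if (this.resolve(masterKey) === droppedKey) {
      throw new Error("redirect would create a cycle: " + droppedKey);
    }
    this.redirects.set(droppedKey, masterKey);
  }

  // Follow redirects until reaching a key that was never merged away.
  resolve(key) {
    const seen = new Set(); // defensive guard; record() should already prevent cycles
    while (this.redirects.has(key) && !seen.has(key)) {
      seen.add(key);
      key = this.redirects.get(key);
    }
    return key;
  }
}

// Usage: an item merged twice still resolves to the final master.
const redirects = new KeyRedirects();
redirects.record("AAAA1111", "BBBB2222");
redirects.record("BBBB2222", "CCCC3333");
const resolvedKey = redirects.resolve("AAAA1111");
```

Following the chain at lookup time means earlier redirects never need rewriting when a master is itself merged later.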

This is serving me very well locally. Although the UI is a little different than the current "hidden preference" duplicates mechanism, this can be used for the same purpose (by marking all items as suspect, and then opening the duplicates view).

I'll try to keep the patch up to date, but the verbose version just uploaded should apply cleanly against r6339.

comment:6 Changed 6 years ago by fbennett

I've done some tidying up and found a couple of errors in the patches. They have now been corrected; the two versions of the patch are identical and will apply cleanly to the r6339 sources.

comment:7 Changed 6 years ago by fbennett

Patch updated to play nice with trunk revision r6346.

comment:8 Changed 6 years ago by fbennett

Patch fixed to work properly both without the debugShowDuplicates option (i.e. with no duplicates support) and with it. The newly uploaded patch also incorporates some minor bugfixes.

comment:9 Changed 6 years ago by fbennett

I've refashioned my UI/SQL patch for duplicates detection support to work with the current multilingual version. The multilingual branch will be kept up to date with the trunk, and I'll plan to update the patch for each XPI release. The duplicates detection interface will be enabled by default in the multilingual XPI, so people can more easily take it out for a spin and form a view on how well or poorly it fits their own workflow.

comment:10 Changed 5 years ago by stakats

Two critical areas for internal development: merge UI and preservation of merged item keys. The existing distance function and item value comparisons should be abstracted as much as possible in order to provide easy access for outside developers to customize and tweak.
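
One way to picture that abstraction is a finder whose fields, comparison function, and threshold are all supplied by the caller. This is a sketch only: `createDuplicateFinder`, its options, and the item shape are hypothetical names, not an existing Zotero API, and the all-pairs loop is quadratic and meant for illustration rather than large libraries.

```javascript
// Hypothetical sketch: the comparison function and fields are injected, so outside
// developers could swap in their own distance measure without touching the loop.
function createDuplicateFinder({ fields, compare, threshold }) {
  return function findDuplicatePairs(items) {
    const pairs = [];
    for (let i = 0; i < items.length; i++) {
      for (let j = i + 1; j < items.length; j++) {
        // compare() returns a similarity in [0, 1] for one field of two items.
        const scores = fields.map((f) =>
          compare(String(items[i][f] ?? ""), String(items[j][f] ?? ""))
        );
        const mean = scores.reduce((sum, x) => sum + x, 0) / scores.length;
        if (mean >= threshold) pairs.push([items[i].id, items[j].id]);
      }
    }
    return pairs;
  };
}

// Simplest possible comparer: exact match ignoring case and surrounding whitespace.
// Note that two empty values count as a match under this comparer.
const exactIgnoringCase = (a, b) =>
  a.trim().toLowerCase() === b.trim().toLowerCase() ? 1 : 0;

// Usage: require every listed field to match.
const findStrict = createDuplicateFinder({
  fields: ["title", "year"],
  compare: exactIgnoringCase,
  threshold: 1
});
```

A fuzzier comparer (for example one based on an edit distance) could be passed in place of `exactIgnoringCase` without changing the finder.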

comment:11 Changed 2 years ago by dstillman

  • Resolution set to fixed
  • Status changed from new to closed