Opened 9 years ago

Closed 2 years ago

#865 closed defect (duplicate)

Unicode normalization

Reported by: dstillman Owned by: dstillman
Priority: major Milestone:
Component: data layer Version: 2.1
Keywords: Cc: simon, fbennett

Description (last modified by ajlyon)

Normalization is needed for correct search, sorting, and disambiguation.
http://forums.zotero.org/discussion/12684/special-character-search/

Also for BibTeX export:
http://groups.google.com/group/zotero-dev/browse_thread/thread/6f6d5c2eec1cc9ae?hl=en

The forum discussion covers possible solutions.

Change History (7)

comment:2 Changed 9 years ago by simon

Should we be doing this at the DB or ingester level instead? Otherwise, output will have to be normalized in reports, bibliography, etc. (if we care)

comment:3 Changed 9 years ago by dstillman

  • Cc simon added
  • Component changed from export to data layer
  • Owner changed from simon to dstillman
  • Status changed from new to assigned

Good call. Assuming Firefox doesn't do this itself when entering data in a text box (in which case it could be just in the ingester), doing it in the data layer probably makes more sense. Since normalization is supposed to be idempotent, it shouldn't be a problem even if it's run every time a field is changed.

The downside to doing it on-change is that existing data wouldn't be normalized, but 1) the workaround (triggering a re-saving of the field) would be pretty easy and 2) we could even provide a hidden XUL page to run a normalize operation on all data, though it'd take an awfully long time.
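[The idempotence assumption above is easy to check directly with Python's unicodedata module; a minimal sketch, not Zotero code:]

```python
import unicodedata

s = "Ame\u0301lie"  # "Amélie" written with a decomposed "e" + U+0301 COMBINING ACUTE ACCENT
once = unicodedata.normalize("NFC", s)
twice = unicodedata.normalize("NFC", once)

assert once == twice           # normalizing already-normalized text is a no-op
assert once == "Am\u00e9lie"   # NFC yields the precomposed é
```

[So re-running normalization on every field change is safe: a second pass never alters the stored value.]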

comment:4 Changed 6 years ago by ajlyon

  • Description modified (diff)
  • Milestone 2.0 Beta 3 deleted
  • Version changed from 1.5 to 2.1

Reported again at http://forums.zotero.org/discussion/12684/special-character-search/, complete with suggested solutions.

comment:5 Changed 6 years ago by fbennett

  • Cc fbennett added

comment:6 Changed 5 years ago by egh

FYI - a trick I have used to good effect to strip "accents" from an index is (in Python):

import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', unicode(s)) if unicodedata.category(c) != 'Mn')

That is, normalize in NFD, then remove each character in class "Mark, non-spacing".

Of course this is only to add tokens to your index for full text search.

And of course this is a different issue than plain normalization to avoid problems with composed vs. decomposed characters, but it is what was originally raised in the forum post.
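[For the composed-vs-decomposed problem itself, as opposed to accent stripping, the fix is to bring both sides to the same normalization form before comparing; a minimal sketch, not Zotero code:]

```python
import unicodedata

composed = "\u00e9"     # é as a single precomposed code point
decomposed = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT

assert composed != decomposed  # raw code-point comparison sees them as different
assert unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)
```

[Either NFC or NFD works for the comparison, as long as both strings are normalized to the same form.]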

comment:7 Changed 2 years ago by dstillman

  • Resolution set to duplicate
  • Status changed from assigned to closed