Opened 9 years ago
Closed 2 years ago
#865 closed defect (duplicate)
Unicode normalization
| Reported by: | dstillman | Owned by: | dstillman |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | data layer | Version: | 2.1 |
| Keywords: | | Cc: | simon, fbennett |
Description (last modified by ajlyon)
Normalization is needed for correct search, sorting, and disambiguation.
http://forums.zotero.org/discussion/12684/special-character-search/
Also needed for BibTeX export:
http://groups.google.com/group/zotero-dev/browse_thread/thread/6f6d5c2eec1cc9ae?hl=en
The forum discussion covers possible solutions.
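To illustrate the underlying problem (not part of the original report): the same visible character can be stored in composed or decomposed form, and without normalization the two forms compare unequal, which breaks search and sorting. A minimal Python sketch using only the standard library's unicodedata:

```python
import unicodedata

# "é" can be stored composed (U+00E9) or decomposed ("e" + U+0301).
composed = "\u00e9"
decomposed = "e\u0301"

# The two encodings render identically but compare unequal,
# so an unnormalized search for one form misses the other.
assert composed != decomposed

# Normalizing both to a single canonical form makes them equal.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```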
Change History (7)
comment:1 Changed 9 years ago by dstillman
comment:2 Changed 9 years ago by simon
Should we be doing this at the DB or ingester level instead? Otherwise, output will have to be normalized in reports, bibliography, etc. (if we care)
comment:3 Changed 9 years ago by dstillman
- Cc simon added
- Component changed from export to data layer
- Owner changed from simon to dstillman
- Status changed from new to assigned
Good call. Assuming Firefox doesn't do this itself when entering data in a text box (in which case it could be just in the ingester), doing it in the data layer probably makes more sense. Since normalization is supposed to be idempotent, it shouldn't be a problem even if it's run every time a field is changed.
The downside to doing it on-change is that existing data wouldn't be normalized, but 1) the workaround (triggering a re-saving of the field) would be pretty easy and 2) we could even provide a hidden XUL page to run a normalize operation on all data, though it'd take an awfully long time.
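The idempotence claim above is easy to check directly; a small Python sketch (my example, not from the ticket), showing that normalizing already-normalized data is a no-op:

```python
import unicodedata

s = "Ame\u0301lie"  # decomposed "Amélie", as it might arrive from an ingester

once = unicodedata.normalize("NFC", s)
twice = unicodedata.normalize("NFC", once)

# Running normalization on every field change is safe:
# a second pass leaves the data unchanged.
assert once == twice
```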
comment:4 Changed 6 years ago by ajlyon
- Description modified (diff)
- Milestone 2.0 Beta 3 deleted
- Version changed from 1.5 to 2.1
Reported again at http://forums.zotero.org/discussion/12684/special-character-search/ , complete with suggested solutions.
comment:5 Changed 6 years ago by fbennett
- Cc fbennett added
comment:6 Changed 5 years ago by egh
FYI - a trick I have used to good effect to strip "accents" from an index is (in python):
import unicodedata

def strip_accents(s):
    return u''.join(c for c in unicodedata.normalize('NFD', unicode(s))
                    if unicodedata.category(c) != 'Mn')
That is, normalize in NFD, then remove each character in class "Mark, non-spacing".
Of course this is only to add tokens to your index for full text search.
And of course this is a different issue from normalization alone, which just avoids problems with composed vs. decomposed characters, but it is what was originally raised in the forum post.
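To make that distinction concrete, a Python 3 sketch of the two operations side by side (my example, assuming only the standard library's unicodedata; in Python 3, str is already Unicode, so no unicode() call is needed):

```python
import unicodedata

def strip_accents(s):
    # NFD-decompose, then drop combining marks (category "Mn")
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

# Accent stripping changes the text itself: useful for a search index,
# where a query for "resume" should match "résumé".
assert strip_accents("r\u00e9sum\u00e9") == "resume"

# Normalization keeps the text visually identical; it only picks one
# canonical encoding for composed vs. decomposed forms.
assert unicodedata.normalize("NFC", "re\u0301sume\u0301") == "r\u00e9sum\u00e9"
```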
comment:7 Changed 2 years ago by dstillman
- Resolution set to duplicate
- Status changed from assigned to closed
Also: http://www.xulplanet.com/references/xpcomref/ifaces/nsIUnicodeNormalizer.html