Opened 9 years ago
Closed 8 years ago
#874 closed enhancement (fixed)
Improved character set support for RIS
| Reported by: | dstillman | Owned by: | simon |
|---|---|---|---|
| Priority: | major | Milestone: | 1.0.8 |
| Component: | ingester | Version: | 1.0 |
| Keywords: | | Cc: | tjowens |
Description
We really should address this. The relevant thread (or one of the many) is here: http://forums.zotero.org/discussion/1246/
I think emitting UTF-8 RIS is probably a given. It looks like quite a few apps and sites do so, and users can always convert the RIS file to a more limited character set manually if necessary.
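For reference, the manual conversion mentioned above can be very small. The following is a minimal sketch, not part of Zotero: `convert_ris` is a hypothetical helper name, and `cp1252` is Python's codec name for Windows-1252. Characters the target charset cannot represent are replaced with `?`, so the conversion is lossy for anything outside that charset.

```python
def convert_ris(data: bytes, target: str = "cp1252") -> bytes:
    """Re-encode a UTF-8 RIS file into a more limited single-byte charset.

    Hypothetical helper for illustration; characters that do not exist in
    the target charset are replaced with '?'.
    """
    # Decode strictly so that input that is not really UTF-8 fails loudly.
    text = data.decode("utf-8")
    # Encode leniently: unrepresentable characters become '?'.
    return text.encode(target, errors="replace")
```

Passing `target="cp850"` would produce IBM850 output instead.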
For importing, we may be able to use Zotero.File.getCharsetFromFile() for local files and the new responseCharset flag on doGet() and doPost() for RIS files from translators, since those charsets should be known. I think we can remove the IBM850 declaration and default to UTF-8.
For local files, there's also the possibility of adding an import-time option and only running getCharsetFromFile() if it's set to Automatic.
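The precedence described above (an explicit import-time option, then a charset known from the source, then detection, then a UTF-8 default) could look roughly like the sketch below. It is written in Python purely for illustration. `choose_import_charset`, the `AUTOMATIC` label, and the `detect` callback are all assumed names, with `detect` standing in for something like Zotero.File.getCharsetFromFile().

```python
from typing import Callable, Optional

AUTOMATIC = "Automatic"  # hypothetical label for the import-time option


def choose_import_charset(
    option: str,
    declared: Optional[str],
    detect: Callable[[], Optional[str]],
    default: str = "utf-8",
) -> str:
    """Pick the charset used to decode an incoming RIS file."""
    # An explicit user choice always wins and skips detection entirely.
    if option != AUTOMATIC:
        return option
    # A charset reported by the source (e.g. an HTTP response) comes next.
    if declared:
        return declared
    # Only now run the detector; fall back to the default if it has no answer.
    return detect() or default
```

Taking the detector as a callback means it only runs when the option is set to Automatic and no charset is already known, which matches the idea of not calling getCharsetFromFile() unnecessarily.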
We should do something soon, even if more elegant options are pushed back to 1.5.
Change History (5)
comment:1 Changed 9 years ago by dstillman
- Cc tjowens added
comment:2 Changed 8 years ago by dstillman
Here's one site that actually uses IBM850 and doesn't specify a character set:
http://www.ub.uni-freiburg.de/xopac/wwwolix.cgi?db=ubfr&nd=6481130
From http://forums.zotero.org/discussion/1775/#Item_4
Even Firefox's auto-detection doesn't get it right, so we don't have great options here. I'm not actually sure why the characters don't come through via the translator. Does everything over the wire get treated as UTF-8 if it doesn't have a charset specified, even if the RIS translator currently specifies IBM850? Anyhow, it seems like we still need to allow manually setting the charset to IBM850 (and Windows-1252), even if we default to UTF-8 for file imports and even if we use Zotero.File.getCharsetFromFile(). We might also want a manual setting that applies to translator saving, so that people who deal primarily with libraries like this can still save using the translators.
comment:3 Changed 8 years ago by dstillman
From http://forums.zotero.org/discussion/3314/:
The RIS file provided from http://adsabs.harvard.edu/abs/2005NJPh....7..204F is served as UTF-8 when it's actually Windows-1252, so even using the charset from the wire wouldn't help.
Firefox does detect it correctly off the disk. Maybe it's possible to manually clear the charset in a channel and re-trigger the detection process?
Assuming we can't find any new exposed functionality in Fx3, we may need to implement manual detection between IBM850, Windows-1252 and UTF-8 on a byte array.
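One possible shape for manual detection on a byte array is sketched below in Python. This is a rough heuristic under stated assumptions, not what Zotero implements. It assumes Western European text, where accented letters fall mostly in 0x80-0xA5 under IBM850 and in 0xC0-0xFF under Windows-1252. The function name `detect_ris_charset` is hypothetical, and `cp850` and `cp1252` are Python's codec names for the two single-byte charsets.

```python
def detect_ris_charset(data: bytes) -> str:
    """Guess among UTF-8, Windows-1252 (cp1252) and IBM850 (cp850).

    Heuristic sketch only; it can be wrong for short or unusual input.
    """
    # Valid UTF-8 (including plain ASCII) is reported as UTF-8, since
    # single-byte text with high bytes rarely forms valid UTF-8 by accident.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Count bytes in the range where each single-byte charset keeps
    # most of its accented letters.
    cp850_votes = sum(1 for b in data if 0x80 <= b <= 0xA5)
    cp1252_votes = sum(1 for b in data if 0xC0 <= b <= 0xFF)
    if cp850_votes > cp1252_votes:
        return "cp850"
    # Ties go to Windows-1252; that tie-break is an assumption of this sketch.
    return "cp1252"
```

Because it looks only at the bytes, a wrong charset declared on the wire, as in the ADS case, does not affect the result. A manual override would still be needed for inputs where the vote is misleading.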
comment:4 Changed 8 years ago by simon
(In [3066]) addresses #874, Improved character set support for RIS
- uses Zotero.File.getCharsetFromFile to determine charset for local files
- adds export-time charset selection option
closes #1075, RIS translator should parse newline-delimited keywords
fixes a regression in the RIS translator relating to notes with newlines
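To illustrate what parsing newline-delimited keywords involves, here is a minimal Python sketch. It is not the code from [3066]. It assumes the usual RIS line shape (a two-character tag, two spaces, a hyphen, then the value) and treats every untagged line that follows a `KW` line as one more keyword. `extract_keywords` and `TAG_LINE` are hypothetical names.

```python
import re
from typing import List

# A tagged RIS line: two-character tag, two spaces, hyphen, optional space, value.
TAG_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def extract_keywords(ris_text: str) -> List[str]:
    """Collect keywords, accepting both repeated KW tags and one KW tag
    followed by further keywords on their own untagged lines."""
    keywords: List[str] = []
    in_kw = False
    for line in ris_text.splitlines():
        match = TAG_LINE.match(line)
        if match:
            tag, value = match.groups()
            # Any new tag either starts or ends a keyword run.
            in_kw = tag == "KW"
        else:
            # Untagged line: a continuation of whatever field came before.
            value = line
        if in_kw and value.strip():
            keywords.append(value.strip())
    return keywords
```

For example, a record containing `KW  - alpha` followed by the bare lines `beta` and `gamma` yields three keywords rather than one.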
comment:5 Changed 8 years ago by simon
- Resolution set to fixed
- Status changed from new to closed
Trevor, if you have time, while we're on a RIS kick, could you attach a few examples of RIS files containing extended (French, German, Chinese, etc.) characters? It'd be helpful to have examples of a few versions of EndNote (or at least the latest version) and some of the major RIS-exporting websites so that we can decide if defaulting to UTF-8 for import makes sense. Perhaps the large majority of sites are exporting UTF-8 already and we don't even need to bother with detection.