Opened 9 years ago
Last modified 6 years ago
#888 new enhancement
"is-date" should return "true" only if date parses cleanly
| Reported by: | erazlogo | Owned by: | simon |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | styles | Version: | 1.5 |
| Keywords: | | Cc: | bdarcus, simon, fbennett |
Description (last modified by dstillman)
Right now "is-date" seems to return "true" if the field can be parsed into a valid date (regardless of what else it contains), which leads to the following loss of data when moving from the Zotero date field to CSL:
"1750-1754" > "1750";
"[2000?]" > "2000";
"ca. 2000" > "ca 2000";
"n.d." > "n.d"
All of the above should probably evaluate as "false" and just return the whole string.
Change History (25)
comment:1 Changed 9 years ago by erazlogo
- Description modified (diff)
comment:2 Changed 9 years ago by dstillman
- Description modified (diff)
- Summary changed from "is-date" should return "true" only if it's a date with nothing in "parts" to "is-date" should return "true" only if date parses cleanly
comment:3 Changed 9 years ago by dstillman
Or, as Elena suggests on the dev list, just look for certain characters in the string (hyphen, brackets, question mark) that always disable complete parsing.
comment:4 Changed 9 years ago by erazlogo
Yes, I think these three (hyphen, brackets, question mark) would be the only characters that are relevant here. The other examples not yet listed are words:
"autumn" (and other seasons), "n.d.", "forthcoming", "in press", etc.
comment:5 Changed 9 years ago by erazlogo
Also slashes, as in May/June 2001
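A minimal sketch of the character check suggested in the comments above. The helper name `hasDisablingChar` is hypothetical and is not part of Zotero or any CSL processor; it only tests for the characters named so far (hyphen, brackets, question mark, slash).

```javascript
// Hypothetical helper, not an existing Zotero or citeproc API.
// Matches any of: hyphen, square brackets, question mark, slash.
const DISABLING_CHARS = /[-\[\]?\/]/;

// Returns true when the raw date string contains a character that
// should disable structured parsing and force string pass-through.
function hasDisablingChar(raw) {
  return DISABLING_CHARS.test(raw);
}

console.log(hasDisablingChar("1750-1754"));     // true
console.log(hasDisablingChar("[2000?]"));       // true
console.log(hasDisablingChar("May/June 2001")); // true
console.log(hasDisablingChar("August 1972"));   // false
```

Note that a check like this would also flag an ISO-style date such as "2000-01-15" because of the hyphens, and it does not catch word-based cases such as "n.d." or "in press", so it could only be one part of a fuller test.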
comment:6 follow-ups: ↓ 7 ↓ 9 Changed 9 years ago by simon
- Cc bdarcus codec added
Maybe there's a more sophisticated way of doing this? In APA format, it seems like we need some parsing (or user editing) to get the format right in most cases, since it wants:
(2005, Summer)
rather than
(Summer 2005)
One possibility is to alter the way we perform date processing, so we get a text month, day, year and part for CSL and a numeric month, day, year (or a JS date object) for sorting, etc. We could include seasons in the text month, since, in most styles, I believe it's formatted the same way, but separate out "n.d.", "forthcoming" and other such terms so that, when they are included, is-date returns false.
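The split proposed here could look roughly like the following sketch. Everything in it is an assumption made for illustration: the function name `splitDate`, the result shape, and the tiny English-only vocabulary. It handles only "YYYY" and "Word YYYY" inputs.

```javascript
// Hypothetical sketch; names and vocabulary are illustrative assumptions.
const MONTHS = ["january", "february", "march", "april", "may", "june",
  "july", "august", "september", "october", "november", "december"];
const SEASONS = ["spring", "summer", "autumn", "fall", "winter"];
const NON_DATE_TERMS = ["n.d.", "forthcoming", "in press"];

function splitDate(raw) {
  const trimmed = raw.trim();
  const notDate = { isDate: false, text: null, sort: null, part: trimmed };
  // Terms such as "n.d." are separated out so that is-date would be false.
  if (NON_DATE_TERMS.includes(trimmed.toLowerCase())) return notDate;

  // Optional month or season word, followed by a four-digit year.
  const m = /^(?:([A-Za-z]+)\s+)?(\d{4})$/.exec(trimmed);
  if (!m) return notDate;

  const word = m[1] ? m[1].toLowerCase() : null;
  const monthIndex = word ? MONTHS.indexOf(word) : -1;
  if (word && monthIndex === -1 && !SEASONS.includes(word)) return notDate;

  return {
    isDate: true,
    // Text parts for CSL formatting; a season sits in the month slot.
    text: { month: m[1] || null, year: m[2] },
    // Numeric parts for sorting; seasons get no numeric month here.
    sort: { year: Number(m[2]), month: monthIndex === -1 ? null : monthIndex + 1 },
    part: "",
  };
}

console.log(splitDate("Summer 2005").text); // { month: 'Summer', year: '2005' }
console.log(splitDate("n.d.").isDate);      // false
```

How seasons should map to a numeric value for sorting is left open in this sketch, since the thread has not settled it.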
comment:7 in reply to: ↑ 6 ; follow-up: ↓ 8 Changed 9 years ago by bdarcus
Replying to simon:
One possibility is to alter the way we perform date processing, so we get a text month, day, year and part for CSL and a numeric month, day, year (or a JS date object) for sorting, etc.
+1
We could include seasons in the text month, since, in most styles, I believe it's formatted the same way ...
There is some NISO date encoding that uses the convention of encoding the seasons in the month slot, but using the values 41-44 (IIRC).
That's one way; the other is the RIS approach, which is an "other" or "additional" attribute.
but separate out "n.d.", "forthcoming" and other such terms so that, when they are included, is-date returns false.
Hmm ... not sure about this. These kinds of things can and probably should be handled in a structured way; "n.d." --> "date = null", forthcoming --> "status = forthcoming" (since, for example, such a condition often includes a date proper; in that case, it's just a date in the future).
I'm starting to think, BTW, that adding the is-date attribute might not have been a good idea.
comment:8 in reply to: ↑ 7 ; follow-ups: ↓ 10 ↓ 12 Changed 9 years ago by dstillman
Replying to bdarcus:
Replying to simon:
but separate out "n.d.", "forthcoming" and other such terms so that, when they are included, is-date returns false.
Hmm ... not sure about this. These kinds of things can and probably should be handled in a structured way; "n.d." --> "date = null", forthcoming --> "status = forthcoming" (since, for example, such condition often includes a date proper; in that case, it's just a date in the future).
I'm starting to think, BTW, that adding the is-date attribute might not have been a good idea.
It's an admirable goal, but I fear that for every convention that we manage to shoehorn into semantic fields, Elena (or some random user) could probably come up with another example that wouldn't fit.
And localization makes it that much harder. Do we want to account for all the possible ways people might indicate date-related concepts in every language?
But these things aren't mutually exclusive. There's no reason why we can't make the parser and citation processor incrementally smarter while still allowing unparsed fields to pass through. Otherwise we're just frustrating some people in the meantime.
comment:9 in reply to: ↑ 6 Changed 9 years ago by erazlogo
Replying to simon:
Maybe there's a more sophisticated way of doing this? In APA format, it seems like we need some parsing (or user editing) to get the format right in most cases, since it wants:
(2005, Summer)
rather than
(Summer 2005)
One possibility is to alter the way we perform date processing, so we get a text month, day, year and part for CSL and a numeric month, day, year (or a JS date object) for sorting, etc. We could include seasons in the text month, since, in most styles, I believe it's formatted the same way, but separate out "n.d.", "forthcoming" and other such terms so that, when they are included, is-date returns false.
That would be great, although it would be important to account for all variants of seasons in all languages. In English, we should include both "Autumn" and "Fall", as well as dates for double issues ("Winter-Spring 2000-2001," "Winter/Spring 2000/2001," "September-October," etc.). If we are treating seasons as dates, could we also sort properly by seasons, so Summer 2000 sorts before Fall 2000?
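A rough sketch of season-aware sorting, so that Summer 2000 sorts before Fall 2000. The names `seasonKey` and `compareSeasonDates` are hypothetical, the vocabulary is English only, and the rank given to each season within a year (in particular placing Winter last) is an assumption.

```javascript
// Hypothetical ordering of seasons within a year; "autumn" and "fall"
// share a rank. Placing winter last is an assumption of this sketch.
const SEASON_ORDER = new Map([
  ["spring", 1], ["summer", 2], ["autumn", 3], ["fall", 3], ["winter", 4],
]);

// Builds a numeric sort key from "Season YYYY", or null if unrecognized.
function seasonKey(raw) {
  const m = /^([A-Za-z]+)\s+(\d{4})$/.exec(raw.trim());
  if (!m) return null;
  const rank = SEASON_ORDER.get(m[1].toLowerCase());
  if (rank === undefined) return null;
  return Number(m[2]) * 10 + rank;
}

// Comparator for Array.prototype.sort; unrecognized strings sort last.
function compareSeasonDates(a, b) {
  const ka = seasonKey(a) ?? Infinity;
  const kb = seasonKey(b) ?? Infinity;
  if (ka === kb) return 0;
  return ka < kb ? -1 : 1;
}

const dates = ["Fall 2000", "Summer 2000", "Winter 1999", "Spring 2001"];
console.log(dates.sort(compareSeasonDates));
// [ 'Winter 1999', 'Summer 2000', 'Fall 2000', 'Spring 2001' ]
```

Double issues such as "Winter-Spring 2000-2001" are not handled here and would sort last as unrecognized strings.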
comment:10 in reply to: ↑ 8 Changed 9 years ago by erazlogo
Replying to dstillman:
Replying to bdarcus:
Replying to simon:
but separate out "n.d.", "forthcoming" and other such terms so that, when they are included, is-date returns false.
Hmm ... not sure about this. These kinds of things can and probably should be handled in a structured way; "n.d." --> "date = null", forthcoming --> "status = forthcoming" (since, for example, such condition often includes a date proper; in that case, it's just a date in the future).
I'm starting to think, BTW, that adding the is-date attribute might not have been a good idea.
It's an admirable goal, but I fear that for every convention that we manage to shoehorn into semantic fields, Elena (or some random user) could probably come up with another example that wouldn't fit.
Here is one: approximate dates. Approximately 2000 can be expressed by the user as "ca.2000", "circa 2000", "[2000?]", "2000?", "2000(?)" and in some other ways I can't think of at the moment, plus yet more options exist in languages other than English. It's hard to account for all of these in parsing, plus a user may have a preference for using one of these while a structured way of parsing these could only keep one option.
And localization makes it that much harder. Do we want to account for all the possible ways people might indicate date-related concepts in every language?
But these things aren't mutually exclusive. There's no reason why we can't make the parser and citation processor incrementally smarter while still allowing unparsed fields to pass through. Otherwise we're just frustrating some people in the meantime.
This would be my preference.
What about BCE? This should be processed as date, right? In which case it needs to be localized as well.
comment:11 Changed 9 years ago by erazlogo
More variants:
August 1972(?)
August(?) 1972
August 25(?), 1972
These would be parsed as dates, but then the (?) has to be attached to the appropriate part of the date (usually to the most specific part of the date, so that shouldn't be too hard).
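One possible way to pull the "(?)" marker out of these variants is sketched below. The function `extractUncertainty` is hypothetical, and it makes a simplifying assumption: the marker is attached to whichever token it directly follows, which may differ from a "most specific part" rule.

```javascript
// Hypothetical sketch: attaches "(?)" to the token it directly follows.
function extractUncertainty(raw) {
  // A run of letters or digits immediately followed by "(?)".
  const m = /([A-Za-z]+|\d+)\(\?\)/.exec(raw);
  if (!m) return { cleaned: raw, uncertain: null };

  const token = m[1];
  let uncertain = null;
  if (/^\d{4}$/.test(token)) uncertain = "year";
  else if (/^\d{1,2}$/.test(token)) uncertain = "day";
  else if (/^[A-Za-z]+$/.test(token)) uncertain = "month";

  // Remove the first marker so the rest can go to a normal date parser.
  return { cleaned: raw.replace("(?)", ""), uncertain };
}

console.log(extractUncertainty("August 1972(?)"));
// { cleaned: 'August 1972', uncertain: 'year' }
console.log(extractUncertainty("August(?) 1972"));
// { cleaned: 'August 1972', uncertain: 'month' }
console.log(extractUncertainty("August 25(?), 1972"));
// { cleaned: 'August 25, 1972', uncertain: 'day' }
```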
comment:12 in reply to: ↑ 8 ; follow-up: ↓ 13 Changed 9 years ago by bdarcus
Replying to dstillman:
It's an admirable goal, but I fear that for every convention that we manage to shoehorn into semantic fields, Elena (or some random user) could probably come up with another example that wouldn't fit.
That's ALWAYS the case with structured data; that doesn't mean we don't try to make the vast majority of cases (let's say 99%) work reliably and predictably.
And localization makes it that much harder. Do we want to account for all the possible ways people might indicate date-related concepts in every language?
No; I think we (Zotero, bibo, CSL) account for the obvious common cases in a structured way, and dump the other stuff in some "other" structure (bibo:displayDate, or cs:date-part/@type="other", or some JS hash entry). I'm basically agreeing with Simon's suggestion, in other words (though am a little worried about mapping seasons to months).
But these things aren't mutually exclusive. There's no reason why we can't make the parser and citation processor incrementally smarter while still allowing unparsed fields to pass through. Otherwise we're just frustrating some people in the meantime.
I agree. For example, I've run into the desire to store original publication dates. Can't do that now.
comment:13 in reply to: ↑ 12 ; follow-up: ↓ 14 Changed 9 years ago by dstillman
Replying to bdarcus:
No; I think we (Zotero, bibo, CSL) account for the obvious common cases in a structured way, and dump the other stuff in some "other" structure (bibo:displayDate, or cs:date-part/@type="other", or some JS hash entry). I'm basically agreeing with Simon's suggestion, in other words (though am a little worried about mapping seasons to months).
That's fine. I was just defending the general concept of string pass-through, since even if we can parse complex dates, a single errant character should probably invalidate all parsing and pass the string through as is. If there's a better way to do it in CSL, such as using a conditional to test if 'date-part' is empty and manually passing the string through if not, you guys would know best. But while the added structures (for approximate dates, etc.) might be useful to have for other applications, and for CSL if date-part is empty, at least for CSL I don't think the content of date-part would actually be usable in any way other than as a boolean to indicate clean parsing.
I'm also still a little uncomfortable with reformatting some of these more difficult concepts like approximate dates, since it seems there might be many-to-one mappings from possible user input, and except for where there's a clear case of correct and incorrect formattings, we want to make sure we don't prevent a user from having something display in a particular way.
An example would be time, if we handled it. We could easily parse "7 p.m.", "7PM", and "7pm", and if a style guide mandated a particular format, we would want to use that, but otherwise, even if the majority of style guides say "7 p.m.", we would want to make sure users had the ability to use whichever convention they preferred.
Is there a way to achieve that sort of flexibility?
comment:14 in reply to: ↑ 13 ; follow-up: ↓ 15 Changed 9 years ago by bdarcus
Replying to dstillman:
I'm also still a little uncomfortable with reformatting some of these more difficult concepts like approximate dates ....
It is not a "difficult concept." It's totally simple, clear and unambiguous.
It's just the case that most systems don't include the concept, and so users and data entry people are then forced to invent often awkward conventions for indicating this simple concept. The result is it's difficult to parse.
Is there a way to achieve that sort of flexibility?
It would probably help for Zotero to understand the basic date concepts here, and to build on existing ways the UI handles this. For example, if I entered "c1045" in the field, it could indicate that the year is parsed, and it is an approximate date.
I suppose it'd help for CSL to understand these concepts too. It already understands original dates, of course, but not approximate dates.
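As an illustration of what recognizing an approximate date could mean, here is a small sketch covering only the year-only English variants mentioned in this thread ("c1045", "ca.2000", "circa 2000", "[2000?]", "2000?", "2000(?)"). The function name `parseApproximateYear` and the `approximate` flag are assumptions, not an existing Zotero or CSL field.

```javascript
// Hypothetical patterns for approximate years, taken from the examples
// in this thread. Each captures the year in group 1.
const APPROX_PATTERNS = [
  /^(?:ca?\.?|circa)\s*(\d{3,4})$/i, // c1045, ca.2000, ca. 2000, circa 2000
  /^\[(\d{3,4})\?\]$/,               // [2000?]
  /^(\d{3,4})\?$/,                   // 2000?
  /^(\d{3,4})\(\?\)$/,               // 2000(?)
];

// Returns { year, approximate } or null when nothing is recognized.
function parseApproximateYear(raw) {
  const s = raw.trim();
  for (const pattern of APPROX_PATTERNS) {
    const m = pattern.exec(s);
    if (m) return { year: Number(m[1]), approximate: true };
  }
  if (/^\d{3,4}$/.test(s)) return { year: Number(s), approximate: false };
  return null;
}

console.log(parseApproximateYear("c1045"));   // { year: 1045, approximate: true }
console.log(parseApproximateYear("[2000?]")); // { year: 2000, approximate: true }
console.log(parseApproximateYear("2000"));    // { year: 2000, approximate: false }
console.log(parseApproximateYear("n.d."));    // null
```

This shows the many-to-one mapping discussed above: several input conventions collapse to one structure, so the user's original spelling is lost unless the raw string is kept alongside it.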
There's not really any need to test if the date-part is empty or not, since there's an understanding that empty data doesn't get printed (nor any additional formatting associated with that template). Beyond that, I'm not really sure on the details. I guess it's worth further discussion as people work on this.
comment:15 in reply to: ↑ 14 ; follow-up: ↓ 16 Changed 9 years ago by dstillman
Replying to bdarcus:
Replying to dstillman:
I'm also still a little uncomfortable with reformatting some of these more difficult concepts like approximate dates ....
It is not a "difficult concept." It's totally simple, clear and unambiguous.
It's just the case that most systems don't include the concept, and so users and data entry people are then forced to invent often awkward conventions for indicating this simple concept. The result is it's difficult to parse.
OK, that's fair, though I wasn't really talking about difficulty in parsing. We can easily parse any number of different conventions for any concept. I just want to make sure we don't make it impossible for a user to output a particular convention when they deem it necessary to do so, at least when there's no style-mandated convention for the concept.
If Chicago specifies how to include approximate dates, then CSL should support it and we should use that. If another style doesn't, the user should probably be able to use their desired formatting, no? Or do we arbitrarily pick one convention for unspecified cases and tell users they can edit the style files if they want a different one?
There's not really any need to test if the date-part is empty or not, since there's an understanding that empty data doesn't get printed (nor any additional formatting associated with that template).
Except the premise of this ticket, and what I'm arguing, is that if there's anything that didn't parse cleanly, even a single unexpected character, any parsed structure should be ignored as far as CSL is concerned, since we no longer really know what concept the user was conveying. In Zotero this would be indicated in the metadata pane by something like what's described in #887 so that the user could remove the unparsed content if they wanted to use CSL-based formatting. Otherwise they'd have the flexibility of using an arbitrary string.
comment:16 in reply to: ↑ 15 Changed 9 years ago by erazlogo
Replying to dstillman:
Replying to bdarcus:
Replying to dstillman:
I'm also still a little uncomfortable with reformatting some of these more difficult concepts like approximate dates ....
It is not a "difficult concept." It's totally simple, clear and unambiguous.
It's just the case that most systems don't include the concept, and so users and data entry people are then forced to invent often awkward conventions for indicating this simple concept. The result is it's difficult to parse.
OK, that's fair, though I wasn't really talking about difficulty in parsing. We can easily parse any number of different conventions for any concept. I just want to make sure we don't make it impossible for a user to output a particular convention when they deem it necessary to do so, at least when there's no style-mandated convention for the concept.
If Chicago specifies how to include approximate dates, then CSL should support it and we should use that. If another style doesn't, the user should probably be able to use their desired formatting, no? Or do we arbitrarily pick one convention for unspecified cases and tell users they can edit the style files if they want a different one?
Chicago is actually flexible on this issue. See 17.119: "When the publication date of a printed work cannot be ascertained, the abbreviation n.d. takes the place of the year in the publication details. A guessed-at date may either be substituted (in brackets) or added.
Boston, n.d.
Edinburgh, [1750?] or Edinburgh, n.d., ca. 1750"
However, Chicago CSL is already less flexible than CMS in several cases, so in this case it would be consistent to choose one variation over another. If you choose one of these after parsing (in cases when the field parses cleanly), you could just go with [1750?] since it's simpler. But if a significant share of users demand more flexibility in approximate dates, then giving them the option to use their own formatting may make sense.
comment:17 Changed 9 years ago by simon
- Milestone changed from 1.0.4 to 1.5 Alpha 1
comment:18 Changed 8 years ago by stakats
- Cc simon added
- Version changed from 1.0 to 1.5
The arguments in this ticket are all very worthwhile, but the bottom line is that users will need a way to pass dates as strings for edge cases. What exactly constitutes an edge case may narrow considerably if we employ some of the smarter parsing suggested here, but we will always need to be able to fall back and send an unparsed string. May I suggest that we make this functionality an immediate priority?
comment:19 Changed 7 years ago by dstillman
- Cc codec removed
In case there was any doubt that there would always be unparseable edge cases: http://forums.zotero.org/discussion/7093/hebrew-dats/
Frank plans to allow passing of a literal string in the new processor: http://forums.zotero.org/discussion/6802/error-in-date-parsing-for-serials-with-format-yyyyyyyy-or-yyyyyyyy/#Item_5
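A sketch of what such a fallback could look like on the calling side. It assumes a CSL-JSON-style date object with a `date-parts` array for structured dates and a `literal` string for pass-through; the function `toCslDate` is hypothetical and accepts only ISO-style "YYYY", "YYYY-MM" and "YYYY-MM-DD" input as cleanly parsed.

```javascript
// Hypothetical sketch: structured output only for a clean ISO-style date,
// otherwise the untouched string is passed through as a literal.
function toCslDate(raw) {
  const m = /^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$/.exec(raw.trim());
  if (m) {
    // Keep only the parts that were present and convert them to numbers.
    const parts = m.slice(1).filter((p) => p !== undefined).map(Number);
    const [, month = 1, day = 1] = parts;
    if (month >= 1 && month <= 12 && day >= 1 && day <= 31) {
      return { "date-parts": [parts] };
    }
  }
  // Anything else, including a single stray character, is not parsed.
  return { literal: raw };
}

console.log(JSON.stringify(toCslDate("2005-06-15"))); // {"date-parts":[[2005,6,15]]}
console.log(JSON.stringify(toCslDate("1750-1754")));  // {"literal":"1750-1754"}
console.log(JSON.stringify(toCslDate("[2000?]")));    // {"literal":"[2000?]"}
```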
comment:20 Changed 6 years ago by dstillman
- Milestone 2.0 Beta 3 deleted
comment:21 Changed 6 years ago by simon
- Cc fbennett added
Frank, is this something we should still be worrying about with citeproc-js?
comment:22 Changed 6 years ago by fbennett
The is-date conditional was dropped from CSL 1.0, so this specific issue is closed (unless it makes a comeback at some future time).
From the specification, it looks like is-numeric is now a general-purpose test that can be applied to any field. There are no restrictions on the fields that can be tested, and there is no statement of what constitutes a "numeric" field. In the schema, it accepts all variables as arguments. Here's what the specification says:
is-numeric
Tests whether the given variables (Appendix I - Variables) contain numeric data.
Appendix I contains a list of all variables.
So on the current spec, at least, is-numeric could be (and by inference probably should be) dual-purposed to allow a test for a properly parsed date. It doesn't do that at present, and the test code looks pretty ratty. Shall I look at cleaning up the behavior of the attribute and extending it in this direction?
comment:23 Changed 6 years ago by fbennett
I should also note that CSL 1.0 has an is-uncertain-date test, that covers Elena's concerns up-thread.
comment:24 Changed 6 years ago by simon
Yes, that should help, although at the moment it looks like, while citeproc-js will recognize the "circa" attribute on a date, neither the citeproc-js nor the Zotero date parsers will set it. I'm going to send an email about this (and a few other date-related questions) to the citeproc-js list shortly.
comment:25 Changed 6 years ago by fbennett
Aha. It looks like there is a naming mismatch between the citeproc-js manual and the code. The code calls this "fuzzy" at the moment. It's parsed out okay, so renaming the key value returned by the parser should get it working. I'll make that change in the next release.
Actually I guess it's a bit more difficult than checking if "part" is empty, since "[2000?]" parses to { year: 2000, part: undefined }... So parsed numbers might need to be rechecked in their original context to make sure there are no other characters.
Elena, are there other examples that you can think of that might fail at something like this?
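The recheck described above (looking at the parsed values in their original context) could be sketched as follows. The function `parsedCleanly` is hypothetical, and it assumes the parser result holds each value in the same form in which it appeared in the input, following the `{ year, part }` shape mentioned above.

```javascript
// Hypothetical sketch: remove every parsed value from the original string
// and require that only separators (whitespace, commas) are left over.
function parsedCleanly(raw, parsed) {
  let rest = raw;
  for (const key of ["year", "month", "day"]) {
    if (parsed[key] !== undefined) {
      // Assumes the parsed value appears verbatim in the input string.
      rest = rest.replace(String(parsed[key]), "");
    }
  }
  // A non-empty "part" or any leftover character means an unclean parse.
  return !parsed.part && /^[\s,]*$/.test(rest);
}

console.log(parsedCleanly("2000", { year: 2000, part: undefined }));      // true
console.log(parsedCleanly("[2000?]", { year: 2000, part: undefined }));   // false
console.log(parsedCleanly("1750-1754", { year: 1750, part: undefined })); // false
```

With a check like this, "[2000?]" is rejected even though "part" is empty, because the brackets and question mark remain after the year is removed.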