Opened 6 years ago
Last modified 5 years ago
#1808 assigned enhancement
Build snapshots from DOM elements
| Reported by: | fbennett | Owned by: | simon |
|---|---|---|---|
| Priority: | major | Milestone: | |
| Component: | ingester | Version: | 2.1 |
| Keywords: | | Cc: | fbennett, ajlyon |
Description
For many sites, only a few well-defined blocks of content are of interest. Allowing extracted DOM elements to be saved in a separately generated document would reduce network load and simplify translator authoring. The attached patches against the trunk illustrate the approach. The files are:
Diff-enable-dom-object-save.patch
This provides a "document:" slot on the translator attachments object, which accepts a DOM document object.
Diff-dom-translator-utility.patch
This is a utility for use in translators that does the heavy lifting of building a simple web page from the DOM elements passed to the function as content.
Diff-mainichi-translator.patch
This is a sample translator that works against news articles at mainichi.jp (both the English and the Japanese sites).
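As a rough illustration of the idea (not the code in the attached patches), the sketch below builds a fresh document from a list of extracted elements and wraps it in an attachment object. The function names `buildSnapshotDocument` and `makeSnapshotAttachment` and their signatures are hypothetical; only the `document` slot comes from the description above, and the DOM calls used (`createHTMLDocument`, `importNode`, `appendChild`) are the standard ones.

```javascript
// Hypothetical helper: name and signature are illustrative, not taken from the patch.
function buildSnapshotDocument(sourceDoc, elements, title) {
  // createHTMLDocument returns a new, empty HTML document with its own <body>.
  const snapshot = sourceDoc.implementation.createHTMLDocument(title);
  for (const element of elements) {
    // importNode(node, true) deep-copies the node into the new document,
    // so the page being translated is left untouched.
    snapshot.body.appendChild(snapshot.importNode(element, true));
  }
  return snapshot;
}

// Hypothetical wrapper: the "document" key is the slot described in this
// ticket; everything else here is assumed for illustration.
function makeSnapshotAttachment(sourceDoc, elements, title) {
  return {
    title: title,
    document: buildSnapshotDocument(sourceDoc, elements, title)
  };
}
```

In a translator, and assuming the first patch is applied, the result might be used along the lines of `item.attachments.push(makeSnapshotAttachment(doc, [headline, body], "Article snapshot"))`, where `headline` and `body` are elements already located in the page.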
This work follows on from earlier discussions that took place on zotero-dev:
http://groups.google.com/group/zotero-dev/browse_thread/thread/7d4ed9d1710ac897
If it is feasible, it would be very convenient to have this facility available in translators.
Attachments (3)
Change History (7)
Changed 6 years ago by fbennett
Changed 6 years ago by fbennett
Changed 6 years ago by fbennett
comment:1 Changed 6 years ago by ajlyon
comment:2 Changed 6 years ago by ajlyon
Indeed, the potential email import translator I mentioned on #zotero-dev and on the forums would benefit from this. I believe in reserving notes for user content (as reflected in the privacy preferences for library sharing, where notes can easily be hidden), so an imported email should be an HTML attachment.
comment:3 Changed 5 years ago by simon
Just to provide an update (and an explanation for why I haven't taken the patch yet), this will make it in, but I need to figure out how to make it work with the connectors.
comment:4 Changed 5 years ago by simon
- Component changed from translators to ingester
- Owner changed from ajlyon to simon
- Status changed from new to assigned
As I've told Frank privately, this is a patch I would like to see in the trunk. It has the potential to significantly increase the potency of the translator infrastructure, and I, as you know, have high hopes that we might see more translators like Frank's law translation one.
This is also a matter of completeness. We can currently generate note attachments, but there may well be cases where HTML attachments are more appropriate. One example is an import translator that imports from a format with inline content (say, an eBook format...), where we need to attach not referenced files, but files that are part of the file we are processing.