Opened 6 years ago

Last modified 5 years ago

#1808 assigned enhancement

Build snapshots from DOM elements

Reported by: fbennett
Owned by: simon
Priority: major
Milestone:
Component: ingester
Version: 2.1
Keywords:
Cc: fbennett, ajlyon

Description

For many sites, only a few well-defined blocks of content are of interest. Allowing extracted DOM elements to be saved in a separately generated document would reduce network load and simplify translator authoring. The attached patches against the trunk illustrate the approach. The files are:

Diff-enable-dom-object-save.patch

This provides a "document:" slot on the translator attachments object, which accepts a DOM document object.
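As a rough sketch of how a translator might fill such a slot: the helper name makeDocumentAttachment is hypothetical, and the "title" and "mimeType" fields are assumptions about the attachment object rather than details taken from the patch.

```javascript
// Hypothetical helper, not part of the attached patch: wraps a DOM document
// in an attachment object that uses the proposed "document" slot.
function makeDocumentAttachment(doc, title) {
  if (!doc) {
    throw new Error("A DOM document object is required");
  }
  return {
    title: title || "Snapshot", // assumed field, with a fallback label
    mimeType: "text/html",      // assumed field
    document: doc               // the slot proposed in this ticket
  };
}
```

The object returned here would be pushed onto an item's attachments array in place of a URL-based attachment.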

Diff-dom-translator-utility.patch

This is a utility for use in translators that does the heavy lifting of building a simple web page from the DOM elements passed to the function as content.
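A minimal sketch of what a utility of this kind could look like, using only standard DOM calls (createHTMLDocument and importNode). The function name buildSnapshotDocument and its parameters are hypothetical; the attached patch may be structured quite differently.

```javascript
// Hypothetical utility (names are illustrative, not taken from the patch).
// Builds a new, minimal HTML document whose body contains deep copies of
// the given elements from the page being translated.
function buildSnapshotDocument(sourceDoc, elements, title) {
  // createHTMLDocument returns a detached document that already has
  // html, head, title and body elements.
  var newDoc = sourceDoc.implementation.createHTMLDocument(title);
  for (var i = 0; i < elements.length; i++) {
    // importNode(node, true) deep-copies the node into newDoc and leaves
    // the original node in place on the source page.
    newDoc.body.appendChild(newDoc.importNode(elements[i], true));
  }
  return newDoc;
}
```

A translator could select the content blocks it cares about, pass them in as the elements array, and hand the returned document to the "document:" slot on an attachment.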

Diff-mainichi-translator.patch

This is a sample translator that works against news articles at mainichi.jp (both the English and the Japanese sites).

This work follows on from earlier discussions that took place on zotero-dev:

http://groups.google.com/group/zotero-dev/browse_thread/thread/7d4ed9d1710ac897

If it is feasible, it would be very convenient to have this facility available in translators.

Attachments (3)

Diff-enable-dom-object-save.patch (3.6 KB) - added by fbennett 6 years ago.
Diff-dom-translator-utility.patch (4.8 KB) - added by fbennett 6 years ago.
Diff-mainichi-translator.patch (3.4 KB) - added by fbennett 6 years ago.


Change History (7)

Changed 6 years ago by fbennett

Changed 6 years ago by fbennett

Changed 6 years ago by fbennett

comment:1 Changed 6 years ago by ajlyon

As I've told Frank privately, this is a patch I would like to see in the trunk. It has the potential to significantly increase the potency of the translator infrastructure, and I, as you know, have high hopes that we might see more translators like Frank's law translation one.

This is also a matter of completeness-- we can currently generate note attachments, but there may well be cases where HTML attachments are more appropriate, as in the case of an import translator that imports from a format with inline content (say, an eBook format...), and we find ourselves needing to attach not referenced files, but rather files that are part of the file we are processing.

comment:2 Changed 6 years ago by ajlyon

Indeed, the potential email import translator I mentioned on #zotero-dev and on the forums would benefit from this. I believe in reserving notes for user content (as reflected in the privacy preferences for library sharing-- notes can easily be hidden), so an imported email should be an HTML attachment.

comment:3 Changed 5 years ago by simon

Just to provide an update (and an explanation for why I haven't taken the patch yet), this will make it in, but I need to figure out how to make it work with the connectors.

comment:4 Changed 5 years ago by simon

  • Component changed from translators to ingester
  • Owner changed from ajlyon to simon
  • Status changed from new to assigned