Opened 9 years ago

Closed 9 years ago

#766 closed defect (fixed)

Zotero saves text/html URLs with .pdf extensions as PDFs

Reported by: dstillman
Owned by: dstillman
Priority: major
Milestone: 1.0 RC 4
Component: data layer
Version: 1.0
Keywords:
Cc: simon, stakats

Description

So I think my workaround for #460 may have been unwise. It might be better for Zotero to simply refuse to import "PDF"s that return 'text/html' in the HEAD request in importFromURL() rather than forcing them to application/pdf, since with the current behavior it ends up creating broken fake PDFs that are actually login pages or other intermediate HTML pages.
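For illustration, the "refuse on text/html" idea could look roughly like the sketch below. The function name and signature are hypothetical and not part of Zotero; it assumes the caller has already made the HEAD request and passes in the raw Content-Type header value.

```javascript
// Hypothetical helper, not part of Zotero: decide whether a URL that looks
// like a PDF should be refused, given the Content-Type from a HEAD response.
function shouldRefusePdfImport(url, contentTypeHeader) {
  // Ignore the query string and fragment when checking the extension.
  const path = url.split(/[?#]/)[0].toLowerCase();
  const hasPdfExtension = path.endsWith(".pdf");

  // Reduce e.g. "text/html; charset=UTF-8" to "text/html".
  const mimeType = (contentTypeHeader || "")
    .split(";")[0]
    .trim()
    .toLowerCase();

  // Refuse only the specific case described above: a ".pdf" URL served as HTML.
  return hasPdfExtension && mimeType === "text/html";
}
```

Note that a check like this depends entirely on the server answering HEAD requests correctly, which is the weakness raised in #460.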

If we did that, would there be another solution for sites like Oxford Journals? One option might be to hard-code a MIME type override in the translator, with translate.js passing a flag to importFromURL() to ignore the MIME type. But then the problem would still occur if a user tried to save a PDF via a right-click.

Hard-coding some of the bad sites in importFromURL() would be another option, and it'd make sense in that it's not a translator-specific issue, but I'm somewhat loath to start hard-coding sites in the data layer, and the consequence of getting it wrong (downloading the PDF to the desktop or popping up the Firefox save dialog) is probably even worse than creating a broken PDF (which at least has a link back to the HTML page).

The crudest option would be to download the "PDF", inspect it, and delete the created attachment item if the PDF isn't valid (which is easy to test for). Inefficient, but probably the most reliable solution.
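As a rough sketch of why the validity test is easy: PDF files begin with the signature "%PDF-", so the first few bytes of the download are enough to tell a real PDF from an HTML login page. The helper name below is hypothetical, and it assumes the signature sits at offset 0 (some readers tolerate leading junk, which this sketch does not).

```javascript
// Hypothetical helper: true if the bytes begin with the PDF signature "%PDF-".
// Accepts any array-like of byte values (Array, Uint8Array, Buffer).
function looksLikePdf(bytes) {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%", "P", "D", "F", "-"
  if (bytes.length < signature.length) {
    return false; // Too short to contain the signature at all.
  }
  return signature.every((byte, i) => bytes[i] === byte);
}
```

A caller would read the first five bytes of the saved file, run this check, and delete the attachment item if it returns false.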

Anybody have thoughts?

Change History (2)

comment:1 Changed 9 years ago by simon

I'd go for the crude option of inspecting the PDF. If it's just a page notifying the user s/he can't get the PDF, it will only be a few kilobytes of wasted download anyway.

comment:2 Changed 9 years ago by dstillman

  • Resolution set to fixed
  • Status changed from new to closed

(In [1710]) Fixes #766, Zotero saves text/html URLs with .pdf extensions as PDFs
Addresses #460, importFromURL fails when importing PDFs from servers that do not properly support HEAD requests

Now inspects supposed PDFs after download and deletes them if they are not actually in PDF format

Also:

  • Fixed bug when running importFromDocument() on a PDF on Windows that would result in an incomplete or missing (since r1688) attachment item
  • importFromDocument() no longer returns an itemID, since it can be partly asynchronous now
  • Added rudimentary 'text/html' support for Zotero.MIME.sniffForMIMEType()
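
To give a sense of what rudimentary content sniffing for these two types might involve, here is a minimal sketch. It is not the actual Zotero.MIME.sniffForMIMEType() implementation; the function name, the 1024-character window, and the markers it looks for are all assumptions made for illustration.

```javascript
// Hypothetical sniffer, not Zotero's implementation: guess a MIME type from
// the leading text of a file, or return null if nothing is recognized.
function sniffMimeType(head) {
  // PDF files start with the "%PDF-" signature.
  if (head.startsWith("%PDF-")) {
    return "application/pdf";
  }

  // Crude HTML check: skip leading whitespace, then look for a doctype
  // or an opening <html> tag, ignoring case.
  const lowered = head.slice(0, 1024).trimStart().toLowerCase();
  if (lowered.startsWith("<!doctype html") || lowered.startsWith("<html")) {
    return "text/html";
  }

  return null;
}
```

With something like this, a download that was expected to be a PDF but sniffs as 'text/html' can be treated as an intermediate page rather than a valid attachment.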