Text conversion problem?

Bug #120011 reported by Richard H.
Affects           Status     Importance  Assigned to  Milestone
Document Library  Confirmed  Undecided   Unassigned

Bug Description

I replaced a PDF in the library with a submission that was identical, except that this time I also ticked 'generate plain text'. This worked fine (once 'update' was selected in Silva (manage_update)) and gave me a text link. However, the resulting text file has an Â after every word and at the end of every line.

Richard H. (richard-hewison) wrote :

This is a browser issue. The text file is fine when viewed in Windows, but if viewed within the browser then you see the Â after every word and at the end of every line!

Jasper Op de Coul (jasper-infrae) wrote : Re: [Bug 120011] Re: Text conversion problem?

Hi Richard,

The file is encoded as Unicode UTF-8. When you switch your browser to
Unicode UTF-8, the file looks fine.
This might be a bug though. The software should use the encoding you
have set in your browser; maybe the script is returning Unicode.
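
A minimal sketch of how this symptom can arise. It assumes the stray
character comes from a non-breaking space (U+00A0) in the converted text,
which is a guess and not confirmed anywhere in this report. The sample
string is made up for illustration.

```python
# Assumption: the converted text contains non-breaking spaces (U+00A0).
NBSP = "\u00a0"
text = "backup" + NBSP + "procedures"

# In UTF-8 a non-breaking space is stored as the two bytes C2 A0.
utf8_bytes = text.encode("utf-8")

# A browser that assumes Latin-1 reads each byte as one character,
# so C2 shows up as "Â" in front of the (invisible) non-breaking space.
misread = utf8_bytes.decode("latin-1")

# A browser switched to UTF-8 reads the same bytes correctly.
correct = utf8_bytes.decode("utf-8")

print(repr(misread))
print(repr(correct))
```

If that assumption holds, the bytes in the file are fine and only the
encoding the browser picks is wrong.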

Richard H. wrote:
> This is a browser issue. The text file is fine when viewed in windows,
> but if viewed within the browser then you see the Â after every word and
> at the end of every line!
>
> ** Attachment added: "uob-data-backup-procs.txt"
> http://launchpadlibrarian.net/8061201/uob-data-backup-procs.txt
>

Kit Blake (kitblake) wrote :

Do users (Librarians, etc.) often view the text files? It should be in UTF-8, for the indexing, but we can probably add a header when the file is served that declares it to be UTF-8.
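
A rough sketch of what declaring the encoding when the file is served could look like. It uses only the Python standard library (a tiny WSGI app) as a stand-in; it is not the Document Library's or Silva's actual serving code. The names `serve_text`, `call_app` and `PLAIN_TEXT` are hypothetical.

```python
from wsgiref.util import setup_testing_defaults

# Hypothetical sample text; the real converted file is not reproduced here.
PLAIN_TEXT = "backup\u00a0procedures\n"


def serve_text(environ, start_response):
    """Minimal WSGI app that serves the text and declares it as UTF-8."""
    body = PLAIN_TEXT.encode("utf-8")
    headers = [
        # The charset parameter tells the browser how to decode the bytes.
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(app):
    """Call the app once in-process; return (status, headers dict, body)."""
    environ = {}
    setup_testing_defaults(environ)
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response))
    return captured["status"], captured["headers"], body


status, headers, body = call_app(serve_text)
print(status, headers["Content-Type"])
```

With the charset in the Content-Type header, the browser should no longer need to guess the encoding or be switched by hand.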

Changed in documentlibrary:
importance: Undecided → Wishlist
status: Unconfirmed → Confirmed
Richard H. (richard-hewison) wrote :

I'm unsure what's going on, because I have had two different PDFs in the DL 'generate plain text' today: one displays wrongly (with the Â) whilst the other displays perfectly fine in the browser. This suggests a difference between the PDFs that is somehow affecting the DL's text generation?
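
One possible explanation, not confirmed in this report: text that contains only ASCII characters looks the same whether it is read as UTF-8 or as Latin-1, so only files with non-ASCII characters would show the problem. The sketch below illustrates that idea; the function name `looks_same_in_both` and both sample strings are hypothetical.

```python
def looks_same_in_both(text):
    """Return True if the UTF-8 bytes of text read the same as Latin-1."""
    data = text.encode("utf-8")
    return data.decode("latin-1") == text


# Hypothetical samples: one plain ASCII, one with non-breaking spaces.
ascii_only = "Data backup procedures"
with_nbsp = "Data\u00a0backup\u00a0procedures"

print(looks_same_in_both(ascii_only))
print(looks_same_in_both(with_nbsp))
```

Under that assumption, the PDF that displays fine would simply be one whose extracted text happens to contain no non-ASCII characters.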

Kit Blake (kitblake) wrote :

The PDF to plaintext conversion will never be 100% reliable, as unfortunately not all PDFs are convertible into plaintext. It does look like UTF-8 solves the weird character discussed in this issue. I think we intend to generate UTF-8 in our conversion, so we should be able to add an encoding header to the plaintext response to make sure the browser knows it too. Keeping this on wishlist.

Kit Blake (kitblake) wrote :

Sorry, demoting to Undecided, as it's not estimated yet.

Changed in documentlibrary:
importance: Wishlist → Undecided
Richard H. (richard-hewison) wrote :

Just to illustrate the point, attached is a screenshot of a text file being displayed in a browser but pulled out of the DL on arana. This is clearly a bit of a problem, so we might have to get this added to the quote for the 'enhancements 2' project?
