Document Library

Text conversion problem?

Bug #120011 reported by Richard H. on 2007-06-12

Affects		Status	Importance	Assigned to	Milestone
	Document Library	Confirmed	Undecided	Unassigned

Bug Description

I decided to replace a PDF in the library with a submission that was identical except that this time I also ticked 'generate plain text'. This worked fine (once 'update' was selected in Silva (manage_update)) and gave me a text link. However, the resulting text file has an Â after every word and at the end of every line.

Richard H. (richard-hewison) wrote on 2007-06-12:

#1

uob-data-backup-procs.txt (12.8 KiB, text/plain)

This is a browser issue. The text file is fine when viewed in Windows, but when viewed in the browser you see the Â after every word and at the end of every line!

Jasper Op de Coul (jasper-infrae) wrote on 2007-06-12: Re: [Bug 120011] Re: Text conversion problem?

#2

Hi Richard,

The file is encoded in UTF-8. When you switch your browser to
UTF-8, the file looks fine.
This might be a bug, though: the software should use the encoding you
have set in your browser. Maybe the script is returning unicode.

Richard H. wrote:
> This is a browser issue. The text file is fine when viewed in windows,
> but if viewed within the browser then you see the Â after every word and
> at the end of every line!
>
> ** Attachment added: "uob-data-backup-procs.txt"
> http://launchpadlibrarian.net/8061201/uob-data-backup-procs.txt
>
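
The symptom described above is consistent with UTF-8 bytes being decoded as Latin-1 by the browser. A minimal sketch, assuming the stray character comes from non-breaking spaces (U+00A0) in the converted text, which the thread does not confirm; the sample string is made up for illustration:

```python
# Assumption: the converted text separates words with non-breaking spaces
# (U+00A0). The sample string is hypothetical, not taken from the attachment.
text = "data\u00a0backup\u00a0procedures"

# UTF-8 encodes U+00A0 as the two bytes 0xC2 0xA0.
raw = text.encode("utf-8")

# A browser that guesses Latin-1 shows 0xC2 as "Â" and 0xA0 as a
# non-breaking space, so every separator gains a visible "Â".
wrong = raw.decode("latin-1")

# Decoding with the encoding the file was written in gives the original text.
right = raw.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

Switching the browser to UTF-8, as suggested in this comment, corresponds to the second decode.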

Kit Blake (kitblake) wrote on 2007-06-12:

#3

Do users (Librarians, etc.) often view the text files? It should be in UTF-8, for the indexing, but we can probably add a header when the file is served that declares it to be UTF-8.
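
The idea of declaring the encoding when the file is served could look roughly like the sketch below. It uses a plain WSGI callable as a stand-in, since the Document Library's actual serving code is not shown in this report; the function name `serve_plain_text` and the sample text are hypothetical.

```python
# Hypothetical sample text; the real content would come from the stored file.
TEXT = "data\u00a0backup"


def serve_plain_text(environ, start_response):
    """Minimal WSGI app that serves UTF-8 text and says so in the header."""
    body = TEXT.encode("utf-8")
    headers = [
        # The charset parameter tells the browser how to decode the bytes,
        # so it does not have to guess.
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

With the charset declared, the browser should not need to fall back on its default encoding for this response.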

Changed in documentlibrary:
importance:	Undecided → Wishlist
status:	Unconfirmed → Confirmed

Richard H. (richard-hewison) wrote on 2007-06-12:

#4

I'm unsure what's going on because I have had two different PDFs in the DL 'generate plain text' today, and one displays wrongly (with the Â) whilst the other displays (in the browser) perfectly fine. This suggests a difference between the PDFs that is affecting the DL's text generation somehow?
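
One possible explanation, not confirmed in this report, is that the second PDF produced only ASCII characters. ASCII text has the same bytes in UTF-8 and Latin-1, so a wrong guess by the browser would go unnoticed. A small sketch with a hypothetical helper, `survives_latin1_guess`:

```python
def survives_latin1_guess(text):
    """Return True if UTF-8 bytes of `text` look unchanged when read as Latin-1."""
    raw = text.encode("utf-8")
    return raw.decode("latin-1") == text


# Pure ASCII: identical bytes in both encodings, so it displays fine.
print(survives_latin1_guess("plain ascii text"))

# Contains a non-breaking space (U+00A0): the display picks up an extra "Â".
print(survives_latin1_guess("non\u00a0breaking"))
```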

Kit Blake (kitblake) wrote on 2007-11-12:

#5

The PDF to plaintext conversion will never be 100% reliable, as unfortunately not all PDFs are convertible into plaintext. It does look like UTF-8 solves the weird character discussed in this issue. I think we intend to generate UTF-8 in our conversion, so we should be able to add an encoding header to the plaintext response to make sure the browser knows it too. Keeping this on wishlist.

Kit Blake (kitblake) wrote on 2007-11-12:

#6

Sorry, demoting to Undecided, as it's not estimated yet.

Changed in documentlibrary:
importance:	Wishlist → Undecided

Richard H. (richard-hewison) wrote on 2007-11-14:

#7

text-output.png (59.7 KiB, image/png)

Just to illustrate the point, attached is a screenshot of a text file being displayed in a browser but pulled out of the DL on arana. This is clearly a bit of a problem, so we might have to get this added to the quote for the 'enhancements 2' project?
