suboptimal handling of multi-page documents

Bug #707700 reported by Jakub Wilk
This bug affects 1 person
Affects              Status     Importance  Assigned to  Milestone
Cuneiform for Linux  New        Undecided   Unassigned
cuneiform (Debian)   Confirmed  Unknown

Bug Description

If Cuneiform is linked with ImageMagick, users might be tempted to run it on documents consisting of multiple pages (e.g. PDFs). In such a situation, all the pages are rendered and loaded into memory (though they usually won't fit, so Cuneiform will die at this point...), but then, if I understand correctly, only the first one is actually OCR-ed.
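
A possible workaround until this is fixed is to render and OCR one page at a time, so that only a single page is ever held in memory. The following is a minimal sketch, not a tested recipe: it assumes the ImageMagick `convert` and `cuneiform` binaries are on the PATH, that the page count is already known, and that `convert` honours the `file[N]` page selector for the input format. The helper names `page_commands` and `ocr_document`, the 300 dpi density and the output file naming are all made up for illustration.

```python
import subprocess
from pathlib import Path


def page_commands(document, page_index, workdir="."):
    """Build the render and OCR commands for one page (hypothetical helper)."""
    stem = Path(document).stem
    image = str(Path(workdir) / f"{stem}-{page_index:04d}.png")
    text = str(Path(workdir) / f"{stem}-{page_index:04d}.txt")
    # "file[N]" asks ImageMagick to read only page N (0-based).
    render = ["convert", "-density", "300", f"{document}[{page_index}]", image]
    # cuneiform writes plain text to the file given with -o by default.
    ocr = ["cuneiform", "-o", text, image]
    return render, ocr, text


def ocr_document(document, page_count, workdir="."):
    """OCR the pages one by one and return the concatenated text."""
    parts = []
    for index in range(page_count):
        render, ocr, text = page_commands(document, index, workdir)
        subprocess.run(render, check=True)  # render a single page to PNG
        subprocess.run(ocr, check=True)     # OCR just that page
        parts.append(Path(text).read_text(encoding="utf-8", errors="replace"))
    return "\n".join(parts)
```

Calling `ocr_document("scan.pdf", 12)` would then process twelve pages sequentially. This sidesteps the memory problem but does nothing about multi-page recognition inside Cuneiform itself.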

Changed in cuneiform (Debian):
status: Unknown → Confirmed
Yury V. Zaytsev (zyv) wrote :

Yes, you are totally correct. And apparently there is no obvious way to fix it apart from re-writing the kernel completely. See https://bugs.launchpad.net/cuneiform-linux/+bug/705500 for the details.

Jussi Pakkanen (jpakkane) wrote :

Well, to be fair, Cuneiform is capable of multi-page recognition. It's just that it got broken at some point during the port to *nix.

Yury V. Zaytsev (zyv) wrote :

What I meant is that the effort to find out where exactly it got broken seems, right now, to be comparable to a full rewrite...

julien (julien-aubert) wrote : Re: [Bug 707700] Re: suboptimal handling of multi-page documents

In case anyone knows:
Does Cuneiform use any adaptive techniques across multiple pages?
Specifically, would performance increase if one PDF with multiple pages
were sent, compared to sending the pages one by one and then assembling the output?

2011/1/26 Yury V. Zaytsev <email address hidden>

> What I meant is that the effort to find out where exactly it got broken
> seems right now to be comparable with a full rewrite...
