Bug #458999 “No longer retries hpmud_open_device” : Bugs : HPLIP

Revision history for this message

Tim Waugh (twaugh) wrote on 2009-10-23:

#1

hplip-retry-open.patch (624 bytes, text/x-diff)

David Suffield (david-suffield) wrote on 2009-10-23:

#2

I am not convinced we need this patch. If the user wants retries then the CUPS print queue error policy should be set to retry-job. The patch would lock out the user from setting their own error policy.

Tim Waugh (twaugh) wrote on 2009-10-27:

#3

hplip-retry-open.patch (with CLASS handling) (1.2 KiB, text/plain)

Compare with CUPS' own socket backend: if httpAddrConnect() fails, it acts as follows:

1. If CLASS environment variable is set, sleep 5 seconds and exit with CUPS_BACKEND_FAILED.
2. Otherwise, for ECONNREFUSED, EHOSTDOWN and EHOSTUNREACH, retry after 5 seconds (backing off to 30 seconds)
3. For other errors, retry after 30 seconds
4. Give up after a week (adjustable with URI option 'contimeout')
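
The four rules above can be sketched as a small decision function. This is only an illustration of the list, not the CUPS source: the function name, the return shape, and the doubling used for "backing off to 30 seconds" are assumptions.

```python
import errno

CUPS_BACKEND_FAILED = 1        # exit status from cups/backend.h
ONE_WEEK = 7 * 24 * 60 * 60    # default give-up time in seconds

# Errors that get the short retry delay (rule 2).
# EHOSTDOWN is not defined on every platform, hence getattr.
FAST_RETRY = {errno.ECONNREFUSED, errno.EHOSTUNREACH,
              getattr(errno, "EHOSTDOWN", None)}


def on_connect_failure(err, class_set, elapsed, delay=5, contimeout=ONE_WEEK):
    """Decide what to do after one failed connection attempt.

    Returns (action, sleep_seconds, next_delay). For "exit" the caller
    sleeps and then exits with CUPS_BACKEND_FAILED.
    """
    if class_set:
        # Rule 1: the queue is part of a class, let cupsd try another member.
        return ("exit", 5, delay)
    if elapsed >= contimeout:
        # Rule 4: stop trying after the connection timeout.
        return ("give-up", 0, delay)
    if err in FAST_RETRY:
        # Rule 2: short delay that grows up to 30 s (doubling is assumed).
        return ("retry", delay, min(delay * 2, 30))
    # Rule 3: any other error waits 30 s.
    return ("retry", 30, delay)
```

A caller would loop on this, feeding `next_delay` back in as `delay` and adding each sleep to `elapsed`.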

Here's a patch to take the CLASS environment variable into account.

David Suffield (david-suffield) wrote on 2009-10-28:

#4

Thanks, I added the CLASS environment variable check.

I did not change the device error functionality because I have users who did not like the forever loop. So I prefer to rely on the user settable error-policy.

Johannes Meixner (jsmeix) wrote on 2009-10-28:

#5

It seems you confuse an internal retry of the backend
while it tries to establish the communication
with its recipient with an overall retry of the whole print job
by the cupsd.

Why do those users think a long running (a week is really not "forever")
loop of retries with reasonable sleeps in between is bad?

If you like to please all your users, make those settings adjustable
via optional additional parameters to your device URI.
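
As a rough sketch of what such URI parameters could look like: the code below reads retry settings from the query part of a device URI. The option name `contimeout` is borrowed from the CUPS socket backend mentioned earlier in this thread; `retrydelay`, the function name, and the example URI are made up for illustration and are not existing "hp" backend options.

```python
from urllib.parse import urlsplit, parse_qs

ONE_WEEK = 7 * 24 * 60 * 60  # seconds


def retry_options(device_uri, default_timeout=ONE_WEEK, default_delay=30):
    """Read optional retry settings from the query part of a device URI."""
    query = parse_qs(urlsplit(device_uri).query)

    def as_int(name, default):
        # Fall back to the default for missing, non-numeric or
        # non-positive values instead of failing the job.
        try:
            value = int(query[name][0])
        except (KeyError, ValueError):
            return default
        return value if value > 0 else default

    return {"contimeout": as_int("contimeout", default_timeout),
            "retrydelay": as_int("retrydelay", default_delay)}


# Hypothetical URI: give up after one hour instead of one week.
print(retry_options("hp:/net/Example_Printer?ip=192.0.2.10&contimeout=3600"))
```

Users who dislike a long retry loop could then set a short `contimeout` on their own queue without changing the behaviour for everyone else.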

For more info on what I think about the whole issue, have a look at
https://bugzilla.novell.com/show_bug.cgi?id=337794#c16
in particular:
----------------------------------------------------------
As long as the backend cannot establish a communication with its
recipient, it cannot cause damage when it loops infinitely and
retries again and again (with a reasonable sleep time between
each retry).

In contrast if a working communication with its recipient crashes,
it should not re-try the same job (which means print it again from
the beginning).
But the backend can also not simply wait in an infinite loop until
it can re-establish the communication and then send the rest of
the job because usually this doesn't continue the job where it was
interrupted (which happens usually somewhere within a page but the
backend doesn't know about pages - it just sends a stream of bytes).

The more I think about it, the less I like the idea to "solve" it
via another single default CUPS ErrorPolicy.
----------------------------------------------------------

David Suffield (david-suffield) wrote on 2009-10-28:

#6

Ok let me try to explain how the next version of the "hp" backend handles errors. There are two error conditions I will expand on.

1. Device connect errors.
a. Device communicates, but is busy.
b. Device fails to communicate.

2. Device write errors.
a. User recoverable error, out-of-paper, lid-open, paper-jam, etc...
b. Non recoverable hardware error.

Some errors will exit and some will loop forever until cleared.

Exit codes used:
      0 = ok
      1 = job failed, use error-policy
      4 = job failed, stop queue

Device connect error definitions:

exit code | condition a. | condition b. | CLASS env.
          | (busy)       | (fails)      | set
----------+--------------+--------------+-----------
    1     |      x       |              |     x
   loop   |      x       |              |
    1     |              |      x       |     x
    1     |              |      x       |

Device write error definitions:

As you can see, only device connect errors will have the option to retry the job. Non-recoverable write errors will stop the queue. Recoverable errors will loop forever with 30-70s delays.
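
Read as code, the scheme described in this comment might look like the sketch below. It only restates the exit codes, the connect-error table, and the summary sentence above; the function and constant names are hypothetical and this is not the actual "hp" backend code.

```python
EXIT_OK = 0       # ok
EXIT_FAILED = 1   # job failed, use error-policy
EXIT_STOP = 4     # job failed, stop queue
LOOP = "loop"     # stay in the backend and keep retrying


def connect_error_action(busy, class_set):
    """Device connect errors.

    busy=True  -> condition a. (device communicates, but is busy)
    busy=False -> condition b. (device fails to communicate)
    """
    if busy and not class_set:
        return LOOP
    # Every other combination hands the job back to cupsd,
    # which then applies the queue's error-policy.
    return EXIT_FAILED


def write_error_action(recoverable):
    """Device write errors."""
    if recoverable:
        # Out-of-paper, lid-open, paper-jam: wait until cleared
        # (30-70 s between checks according to the comment above).
        return LOOP
    return EXIT_STOP
```

Note that in this scheme a device that fails to communicate (condition b.) never loops inside the backend, which is the point the following comments argue about.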

David Suffield (david-suffield) wrote on 2009-10-28:

#7

hp_error_defs.txt (1.3 KiB, text/plain)

Sorry about the formatting. Try my attachment...

Tim Waugh (twaugh) wrote on 2009-10-29:

#8

Don't you distinguish between e.g. a hostname lookup failure (which is not recoverable), and a connection failure (which is)?

The question is: what happens if I send a job to a queue for a network printer which is switched off, and then some time later that day it is switched on again?

The answer I'm hoping for is: "The job gets printed when the device is switched on." -- that's what currently happens with all the other CUPS network backends.

Johannes Meixner (jsmeix) wrote on 2009-10-29:

#9

Strictly speaking even hostname lookup failures are recoverable
but may need special admin actions (/etc/hosts or DNS setup) to recover.

Or in a home-network a hostname lookup failure might be even
only a temporary failure because the local machine which acts
as DNS server might be currently down (e.g. one of those small
router boxes for home or small local network use).

Therefore it is my personal opinion to loop for some time even
in case of errors like hostname lookup failure, provided:

a)
The backend outputs meaningful error messages via stderr like
"Hostname lookup 'does.not.exist' failed. Retry in 5 minutes."
so that the user who submitted the print job is informed
and can decide if he likes to cancel the whole job or
if he likes to contact the admin (who is perhaps he himself
in a home network) to get the issue fixed.

b)
There is a reasonable sleep time between each retry
for this particular case of error (e.g. 5 minutes here).

c)
Optionally there could be a limited number of retries.
E.g. after one week it does probably make no sense
to retry regardless what the reason of the failure is.
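
Points a) to c) could be combined roughly as follows. This is a minimal sketch under stated assumptions: `attempt` stands for whatever the backend does to look up or open the device, all names are hypothetical, and the "INFO:"/"ERROR:" prefixes follow the usual convention for CUPS backend messages on stderr.

```python
import sys
import time


def _to_stderr(message):
    print(message, file=sys.stderr)


def retry_with_messages(attempt, describe, delay=300,
                        max_total=7 * 24 * 3600,
                        sleep=time.sleep, emit=_to_stderr):
    """Call attempt() until it succeeds or the time budget is used up.

    attempt   -- callable returning True on success (hypothetical)
    describe  -- short text naming the step, used in the messages (a)
    delay     -- seconds to sleep between retries (b)
    max_total -- total seconds of waiting before giving up (c)
    """
    waited = 0
    while True:
        if attempt():
            return True
        if waited + delay > max_total:
            emit(f"ERROR: {describe} failed. Giving up.")
            return False
        emit(f"INFO: {describe} failed. Retry in {delay // 60} minutes.")
        sleep(delay)
        waited += delay
```

Because the backend keeps running between attempts, the user sees the message for his own job and can still cancel it at any time.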

David Suffield (david-suffield) wrote on 2009-10-29:

#10

If you send a job and the printer is switched off, the CUPS queue error-policy will determine what happens. The user has the following choices:

abort-job
retry-current-job
retry-job
stop-printer

Yes, this is different from the current socket backend, but it is valid functionality for backends.

Typically the default error-policy is stop-printer. If retry-job is preferred, then the default error-policy should be changed.

Johannes Meixner (jsmeix) wrote on 2009-10-30:

#11

Caution: Long answer!
;-)

I like to explain some more background why I think that
a backend should retry in any possible case
when it is about to establish the communication.

From my point of view the crucial idea behind this is to
keep the control with the user as long as possible.

Initially the "hp" backend has the control, which means
that you (both "you" as the author of the "hp" backend and
"you" as the HP company which makes the printer device
and its driver software) have the control over what happens
with your customer's (the one who bought your printer)
print job in case of a problem.

As long as you let your backend run, you can communicate
with your customer via stderr messages in case of a problem.

As long as you let your backend run, your customer
has the control what should happen with his print job
in case of a problem because he can decide for each
individual print job to keep the backend retrying or to
cancel the whole job.

In contrast when you give up (i.e. exit the backend)
you withdraw the individual job control from the user
and plunk down the whole stuff and let the cupsd
do whatever it will do with your user's job.

But the cupsd is generic software which can only
act based on generic general rules (e.g. error policy)
which are set up by an admin but out of the control
of the individual user who had submitted the print job
and in the end the result is likely an annoyed HP customer.

For example think about a travelling user with a laptop
who submits a job to a print queue for a network printer
(which is of course not accessible at all while travelling).

Here "travelling" also means when the user walks with his laptop
through a bigger company building using wireless networking
where this or that network printer may become available or not.

When your backend just keeps retrying everything is
perfectly o.k. because:

If the user submitted the job by mistake to this queue
he can easily cancel it.

When the user submitted the job intentionally
to this queue to get it printed whenever the connection
to the printer can be established, your backend must
keep retrying for a long time (e.g. for one week).

In contrast the company admin who may have set up
the laptop cannot decide in advance via whatever
setting for the cupsd what should happen with each
individual print job.

By the way regarding "If retry-job is preferred, then the
default error-policy should be changed":

Usually no.

A default error-policy "retry" is usually not possible
because it results in endless printing of the same job
again and again for certain issues with network printers
which fail to correctly communicate a "success" at the
end of a job.

Only a queue specific error policy "retry" may help here
but why should the cupsd retry the whole job
(i.e. also re-run all the filters and so on) only because
the backend cannot establish the communication?

If the error during communication startup is really fatal
(only the backend can decide this), the backend must
stop the queue (because job retry does not make sense)
and in any other case the backend must retry.