Concoct some scheme for monitoring mailin/mailout

Bug #386097 reported by Paul Everitt
This bug affects 1 person
Affects | Status       | Importance | Assigned to | Milestone
KARL3   | Fix Released | Medium     | LarsN       |

Bug Description

Historically, email-in and email-out have been brittle, mysterious operations. When they fail, we're the last to know.

This task is purposely vague: implement some ways for us to know when things are wedged, when messages bounced, and when they succeeded. Suggestions:

- Provide a URL that shows the last N processed and bounced. Show some minimal forensics: time, from, to, subject, reason for bounce.
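One way to picture the "last N" suggestion is a bounded in-memory log that a status page could render. This is only a sketch: `MailEvent`, `RecentMailLog`, and the field names are hypothetical, not part of KARL, and a real version would need to persist events, since mail-in and the web app run as separate processes.

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MailEvent:
    """Minimal forensics for one message (hypothetical record type)."""
    time: datetime
    sender: str
    recipient: str
    subject: str
    bounce_reason: Optional[str] = None  # None means the message was processed

    @property
    def bounced(self) -> bool:
        return self.bounce_reason is not None


class RecentMailLog:
    """Keeps only the most recent `limit` events; older ones fall off."""

    def __init__(self, limit: int = 50):
        self._events = deque(maxlen=limit)

    def record(self, event: MailEvent) -> None:
        self._events.append(event)

    def render(self) -> str:
        """Return a plain-text report, newest event first."""
        lines = []
        for e in reversed(self._events):
            status = f"BOUNCED ({e.bounce_reason})" if e.bounced else "OK"
            lines.append(
                f"{e.time.isoformat()}  {e.sender} -> {e.recipient}  "
                f"{e.subject!r}  {status}"
            )
        return "\n".join(lines)
```

The `deque(maxlen=...)` does the "last N" trimming on its own, so the recording code never has to prune.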

Changed in karl3:
assignee: nobody → Chris Rossi (chris-archimedeanco)
milestone: m18 → m19
Chris Rossi (chris-archimedeanco) wrote :

I think this should be something compatible with whatever Six Feet Up uses to monitor. One idea would be to have a couple of WSGI apps that we wire in that check the status of mailin and mailout and return a simple text response ("Ok" or "Error") along with the corresponding status code (200, 500). Six Feet Up can wire their monitoring app to poll the URLs of these two apps.

As far as what the monitoring apps actually do, they can probably just watch for the accumulation of mail in the corresponding maildirs--if mail is starting to stack up, then signal a problem.
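A minimal sketch of such a status app, assuming a standard maildir layout with `new/` and `cur/` subdirectories. The names `count_messages` and `make_status_app` and the default threshold of 20 are illustrative assumptions, not anything from KARL.

```python
import os


def count_messages(maildir):
    """Count message files in the maildir's new/ and cur/ subdirectories."""
    total = 0
    for sub in ("new", "cur"):
        path = os.path.join(maildir, sub)
        if os.path.isdir(path):
            total += sum(1 for entry in os.scandir(path) if entry.is_file())
    return total


def make_status_app(maildir, threshold=20):
    """Build a WSGI app that reports "Error" once mail stacks up.

    `threshold` is an assumed tuning knob: the number of waiting
    messages above which the queue is treated as wedged.
    """
    def app(environ, start_response):
        backlog = count_messages(maildir)
        if backlog > threshold:
            status, body = "500 Internal Server Error", b"Error"
        else:
            status, body = "200 OK", b"Ok"
        start_response(status, [
            ("Content-Type", "text/plain"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app
```

For a quick local check, an app built with `make_status_app("/path/to/Maildir")` can be served with the standard library's `wsgiref.simple_server.make_server`. One app per maildir gives the monitoring system one URL each for mailin and mailout.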

Paul Everitt (paul-agendaless) wrote :

Since we got wedged again today, I'm increasing the importance to medium.

Chris, have we decided that there is no realistic way to detect stderr vs. stdout and send an email when there's a problem?

Changed in karl3:
importance: Low → Medium
milestone: m19 → m20
Chris Rossi (chris-archimedeanco) wrote :

Probably won't get to this this week.

Changed in karl3:
milestone: m20 → m21
Chris Rossi (chris-archimedeanco) wrote :

Mail-in will now generate a .error file when it is wedged, which can potentially be monitored by Six Feet Up. Assigning to Lars for wiring up the monitoring and notification. (This could land back in my pile if Six Feet Up decides they need a URL and not just a file--a quick CGI or WSGI script is pretty easy if needed.)

Changed in karl3:
assignee: Chris Rossi (chris-archimedeanco) → LarsN (lars-sixfeetup)
milestone: m21 → m22
status: New → In Progress
Chris Rossi (chris-archimedeanco) wrote :

Error file is at /home/zope/Maildir/.error

Its presence indicates there is an error. It must be deleted manually to clear the error state.
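A monitoring check for this file can be very small. The sketch below assumes the monitoring system follows the common Nagios-style plugin convention (exit code 0 for OK, 2 for critical); `check_error_file` is a hypothetical helper, not part of KARL.

```python
import os

ERROR_FILE = "/home/zope/Maildir/.error"  # path given in the comment above


def check_error_file(path=ERROR_FILE):
    """Return (exit_code, message): 2 if the error file exists, else 0."""
    if os.path.exists(path):
        return 2, f"CRITICAL: {path} exists; mail-in is wedged"
    return 0, "OK: no mail-in error file"
```

A wrapper script would print the message and pass the code to `sys.exit`. The check only reads the file's presence and never clears it, since the error state is meant to be cleared manually.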

Chris Rossi (chris-archimedeanco) wrote :

I've just spent today implementing a quarantine for mail-in so that bad messages won't wedge mail-in. This impacts monitoring, because now we can get rid of the somewhat kludgy .error file and just inspect the contents of the quarantine to see if there's anything there that needs our attention.

At the moment I have a sample shell script in kdi-dev (/home/zope/show_quarantine.sh) that can be used to determine whether there are any quarantined messages. Alternatively, OSI might like a web UI of some sort for viewing the quarantine, in which case that could be used. Stay tuned, as we figure this out.

LarsN (lars-sixfeetup) wrote :

The following script is being run every 60 seconds. This is a bandaid until a more robust solution is engineered.

#!/bin/sh
# Mail the IDs of any quarantined mail-in messages.
tempfile=/home/zope/quarantine.out

# Start from a clean slate; -f ignores a missing file.
rm -f "$tempfile"

/usr/bin/sqlite3 /home/zope/Maildir/pending.db \
    'select message_id from pending where quarantined=1;' > "$tempfile"

# Send mail only when the query returned at least one row.
if [ -s "$tempfile" ]; then
        mail -s "KARL Quarantine Output" <email address hidden> < "$tempfile"
fi
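The same query can be run from Python with the standard `sqlite3` module, which may be easier to fold into a monitoring hook than a temp file plus `mail`. This is a sketch that assumes only what the query above shows: a `pending` table with `message_id` and `quarantined` columns. The function name is hypothetical.

```python
import sqlite3


def quarantined_message_ids(db_path):
    """Return the message_id of every quarantined row in pending.db."""
    # Note: sqlite3.connect creates an empty database if db_path is missing,
    # in which case the query below raises sqlite3.OperationalError.
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "select message_id from pending where quarantined=1"
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]
```

An empty list corresponds to the script's "nothing to mail" case; a non-empty list is what would go into the alert.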

Changed in karl3:
status: In Progress → Fix Committed
LarsN (lars-sixfeetup) wrote :

I believe this process is now deprecated by the monitoring Andrew put in place, which is integrated with Zenoss.

Changed in karl3:
status: Fix Committed → Fix Released