feat: add tool to analyze deferred queue #946
910a8e5 to b54ad4f
This currently prints reports for destinations, with the most common at the end (so you don't need to scroll), e.g.:

More important than the script is adding documentation, so operators can do something about it. E.g. for domains it is possible to set up a reverse proxy, even for someone else's server, and then add it to /etc/hosts to try to fix the problem. Or set up a tunnel. We can also add some examples of destination misconfiguration (e.g. domain mismatch resulting in TLS failures, link-local addresses in IPv6 etc.); these can be solved if there is contact with the destination operator.

It is probably better to wait for the filtermail deployment, because error messages will look different, as most errors displayed here are

The other related problem I noticed is that most of the "deferred" queue size consists of destinations that keep failing because they are misconfigured and are unlikely to recover, e.g. an MX record pointing to a host that refuses to accept messages for the domain. Messages for broken destinations that keep failing with temporary errors stay in the queue undelivered and easily use hundreds of MBs. The script can estimate this with the "message_size" column if it is useful, but it is unclear what the operator can do about it and, if we count the queue size, how to show that some messages are counted twice because they have multiple failing destinations. There is no configuration in postfix to limit queue size per sender or per destination.

Maybe we should clean some destinations out of the queue if they use too much space, or block commonly failing destinations with permanent errors in filtermail-transport, or use NDNs on the client: if some destinations keep bouncing and the recipient has 5 relays, we select 3 of them, but only from those that have not resulted in NDNs recently.
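To make the "message_size" estimate concrete, here is a minimal sketch, not the script from this PR. It assumes the input is the line-per-message JSON printed by `postqueue -j`, and the function name `size_by_destination` is made up for illustration. It also shows the double-counting problem: a message with several failing destinations is added once per destination.

```python
import json
from collections import defaultdict


def size_by_destination(lines):
    """Sum message_size per recipient domain for deferred messages.

    `lines` is an iterable of JSON lines as printed by `postqueue -j`.
    A message with several failing destinations is counted once per
    destination, so the totals can add up to more than the queue size.
    """
    totals = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        if msg.get("queue_name") != "deferred":
            continue
        # Deduplicate domains so a message with two recipients on the
        # same domain is only counted once for that domain.
        domains = {
            rcpt["address"].rpartition("@")[2].lower()
            for rcpt in msg.get("recipients", [])
        }
        for domain in domains:
            totals[domain] += msg.get("message_size", 0)
    return dict(totals)
```

Feeding it `subprocess.run(["postqueue", "-j"], ...)` output split into lines would give per-domain byte counts; the sum over all domains is an upper bound, not the real queue size.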
def deferred():
    """Run postqueue -j and yield parsed JSON lines for deferred mails"""
This is literally just saying what the next few lines of code do. I thought you weren't a fan of such docstrings or comments?
I want to have docstrings for everything, but now that the script has boiled down to counting reasons from the deferred queue, maybe it should be moved into the documentation itself, together with snippets to count active users etc.
At this point the whole script can probably be turned into some jq plus perl^Wawk oneliner. The original idea was to be smart and suggest something like "this destination is broken, talk to the operator to fix DNS", "this destination has likely run out of space" and "this destination times out, need to reroute", but in practice raw errors are probably more useful: we can't predict how things will break for others, so it is better to write a troubleshooting FAQ.
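For reference, counting raw reasons could look roughly like the sketch below. It is not the code from this PR: the optional `lines` argument is a hypothetical addition so the function can be fed canned input, and `count_reasons` is an illustrative name. It relies on `postqueue -j` printing one JSON object per message, with a `delay_reason` on each deferred recipient.

```python
import json
import subprocess
from collections import Counter


def deferred(lines=None):
    """Yield parsed postqueue -j records that sit in the deferred queue."""
    if lines is None:
        # postqueue -j prints one JSON object per queued message.
        out = subprocess.run(
            ["postqueue", "-j"], capture_output=True, text=True, check=True
        ).stdout
        lines = out.splitlines()
    for line in lines:
        if line.strip():
            msg = json.loads(line)
            if msg.get("queue_name") == "deferred":
                yield msg


def count_reasons(messages):
    """Count how often each raw delay_reason occurs across recipients."""
    return Counter(
        rcpt.get("delay_reason", "unknown")
        for msg in messages
        for rcpt in msg.get("recipients", [])
    )
```

Iterating over `reversed(count_reasons(deferred()).most_common())` prints the most common reason last, so there is no need to scroll.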
d902186 to b164c4d
With filtermail-transport, errors now look like this:

Some errors are still useful:

But "Failed to connect to any mail server" could be more informative, to distinguish between timeouts and connection refusal. I wonder what postfix reports here if there are multiple MX records and IPs, just the last error?

Errors from the host are still relayed, but without saying which MX server returned the permanent error:

E.g. previously we had errors that included the MX name:
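Until the error message itself says more, an operator could tell a timeout from a refusal by probing the destination by hand. The sketch below is only that kind of manual probe, not how filtermail-transport works; `describe_connect_error` and `probe` are hypothetical helper names, and port 25 and the 10 second timeout are assumptions.

```python
import socket


def describe_connect_error(exc):
    """Map a connection exception to a short, operator-readable cause."""
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timed out"
    if isinstance(exc, ConnectionRefusedError):
        return "connection refused"
    if isinstance(exc, socket.gaierror):
        return "DNS lookup failed"
    return f"other: {exc}"


def probe(host, port=25, timeout=10.0):
    """Try every resolved address of host and report the outcome per IP."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except OSError as exc:
        return {host: describe_connect_error(exc)}
    results = {}
    for family, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results[sockaddr[0]] = "connected"
        except OSError as exc:
            results[sockaddr[0]] = describe_connect_error(exc)
    return results
```

Calling `probe()` on each MX host of a failing domain gives one result per IP address instead of a single combined error.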
b164c4d to c2c73fc
It prints all destinations with the number of recipients and all the reasons. The operator can then try to fix the problems for destinations, e.g. by manually adding reverse proxy addresses to /etc/hosts for failing domains or by routing IP addresses to another interface.
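A report of that shape could be produced along the lines of the sketch below. This is an illustration under assumptions, not the script in this PR: `report` is a made-up name, the input is assumed to be `postqueue -j` JSON lines, and the output layout is invented.

```python
import json
from collections import defaultdict


def report(lines):
    """Return one block per destination domain, most common last."""
    recipients = defaultdict(set)
    reasons = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        msg = json.loads(line)
        if msg.get("queue_name") != "deferred":
            continue
        for rcpt in msg.get("recipients", []):
            address = rcpt["address"].lower()
            domain = address.rpartition("@")[2]
            recipients[domain].add(address)
            reasons[domain].add(rcpt.get("delay_reason", "unknown"))
    out = []
    # Sort ascending by recipient count so the busiest destination is
    # printed last and stays visible without scrolling.
    for domain in sorted(recipients, key=lambda d: (len(recipients[d]), d)):
        out.append(f"{domain}: {len(recipients[domain])} recipient(s)")
        out.extend(f"    {reason}" for reason in sorted(reasons[domain]))
    return "\n".join(out)
```

Reasons are kept verbatim, so whatever postfix or filtermail-transport reported ends up in front of the operator unchanged.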
c2c73fc to f896ce6