feat: add tool to analyze deferred queue #946

Draft
link2xt wants to merge 1 commit into main from link2xt/analyze-queue

@link2xt link2xt commented May 2, 2026

It prints all destinations with the number of recipients and all the reasons for deferral. The operator can then try
to fix the problems for individual destinations,
e.g. by manually adding reverse proxy
addresses to /etc/hosts for failing domains
or by routing IP addresses to another interface.

@link2xt link2xt force-pushed the link2xt/analyze-queue branch from 910a8e5 to b54ad4f Compare May 2, 2026 05:40

link2xt commented May 2, 2026

This currently prints a report per destination, with the most common destinations at the end (so you don't need to scroll), e.g.:

cm.example.net (1 recipients)
  1: conversation with cm.example.net[192.168.30.50] timed out while receiving the initial server greeting
[192.168.50.30] (8 recipients)
  8: connect to 192.168.50.30[192.168.50.30]:25: Connection timed out
example.org (30 recipients)
  20: connect to example.org[192.168.1.2]:25: Connection timed out
  10: delivery temporarily suspended: connect to example.org[192.168.1.2]:25: Connection timed out
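
For readers who want to reproduce a report of this shape, here is a minimal sketch, not the code from this PR. It assumes the JSON-lines output of `postqueue -j`, where each entry has a `queue_name` and a `recipients` list whose items carry `address` and, for deferred mail, `delay_reason`. The function names are hypothetical.

```python
import json
import subprocess
from collections import Counter, defaultdict


def read_deferred(lines):
    """Yield parsed queue entries that sit in the deferred queue."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        if entry.get("queue_name") == "deferred":
            yield entry


def count_reasons(entries):
    """Map destination -> Counter of delay reasons, one count per recipient."""
    report = defaultdict(Counter)
    for entry in entries:
        for rcpt in entry.get("recipients", []):
            # Everything after the last "@" is the destination,
            # which may also be an address literal such as [192.168.50.30].
            domain = rcpt["address"].rpartition("@")[2].lower()
            report[domain][rcpt.get("delay_reason", "(no reason)")] += 1
    return report


def format_report(report):
    """Render the report with the busiest destination last."""
    out = []
    by_total = sorted(report.items(), key=lambda kv: sum(kv[1].values()))
    for domain, reasons in by_total:
        out.append(f"{domain} ({sum(reasons.values())} recipients)")
        for reason, n in reasons.most_common():
            out.append(f"  {n}: {reason}")
    return "\n".join(out)


def main():
    """Run postqueue -j and print the report (needs access to the Postfix queue)."""
    proc = subprocess.run(
        ["postqueue", "-j"], capture_output=True, text=True, check=True
    )
    print(format_report(count_reasons(read_deferred(proc.stdout.splitlines()))))
```

Calling `main()` on the relay prints the report; the other functions work on any iterable of JSON lines, so they can be tried on a saved dump as well.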

More important than the script is adding documentation, so operators can do something about it. E.g. for domains it is possible to set up a reverse proxy even for someone else's server and then add it to /etc/hosts to try to fix the problem. Or set up a tunnel.

We can also add some examples of destination misconfiguration (e.g. domain mismatch resulting in TLS failures, link-local addresses in IPv6 etc.); these can be solved if there is contact with the destination operator.

It is probably better to wait for the filtermail deployment, because most errors displayed here are smtp transport errors and the messages will look different with filtermail-transport. E.g. "timed out while receiving the initial server greeting" comes from postfix smtp here and will look different with filtermail:
https://github.com/vdukhovni/postfix/blob/6a5744630d0cf07931c3a67a6de76b6060d5c848/postfix/src/smtp/smtp_trouble.c#L483-L484
It actually means that the connection succeeds but gets filtered out later.
The simpler "connection timed out" means that the TCP connection failed, likely because the SYN-ACK did not arrive back.
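
The distinction between the two timeouts could be sketched like this. It is only an illustration that matches the Postfix smtp wording quoted in this thread; the function name is hypothetical, and the strings are expected to differ once filtermail-transport is deployed.

```python
def timeout_stage(reason):
    """Return 'greeting', 'connect', or None for a Postfix smtp delay reason.

    'greeting': TCP connected, but no SMTP banner arrived in time.
    'connect':  the TCP connection itself timed out.
    """
    text = reason.lower()
    if "timed out while receiving the initial server greeting" in text:
        return "greeting"
    if "connection timed out" in text:
        return "connect"
    return None
```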

The other related problem I noticed is that most of the "deferred" queue size comes from a few destinations that keep failing because they are misconfigured and are unlikely to recover, e.g. an MX record pointing to a host that refuses to accept messages for the domain. Messages for these broken destinations stay in the queue, keep failing with temporary errors, and easily use hundreds of MBs. The script can estimate this with the "message_size" column if it is useful, but it is unclear what the operator can do about it and, if we count the queue size, how to show that some messages are accounted for twice because they have multiple failing destinations. There is no configuration in postfix to limit queue size per sender or per destination. Maybe we should clean some destinations out of the queue if they use too much space, or block commonly failing destinations with permanent errors in filtermail-transport, or use NDNs on the client: if some destinations keep bouncing and the recipient has 5 relays, we select 3 of them, but only from those that have not resulted in NDNs recently.
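
One possible way to estimate the per-destination size and make the double counting visible, as a sketch only. It assumes entries already parsed from `postqueue -j` and filtered to the deferred queue, with `message_size` in bytes; the function name is hypothetical.

```python
from collections import defaultdict


def size_by_destination(entries):
    """Return (bytes per destination, total bytes with each message counted once)."""
    per_dest = defaultdict(int)
    total = 0
    for entry in entries:
        size = entry.get("message_size", 0)
        total += size
        # A message with recipients at several destinations is charged
        # in full to each of them, but only once per destination.
        domains = {
            rcpt["address"].rpartition("@")[2].lower()
            for rcpt in entry.get("recipients", [])
        }
        for domain in domains:
            per_dest[domain] += size
    return dict(per_dest), total
```

The difference `sum(per_dest.values()) - total` is then the number of bytes that show up under more than one destination, which could be printed next to the report so the per-destination sizes are not read as additive.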

Comment thread chatmaild/src/chatmaild/deferred.py Outdated


def deferred():
    """Run postqueue -j and yield parsed JSON lines for deferred mails"""
Contributor


this is literally just saying what the few next code lines do. I thought you weren't a fan of such docstrings or comments?

Contributor Author


I want to have docstrings for everything, but now that the script has boiled down to counting reasons from the deferred queue, maybe it should be moved into the documentation itself, together with snippets to count active users etc.

Contributor Author


At this point the whole script can probably be turned into some jq plus perl^Wawk oneliner. The original idea was to be smart and suggest something like "this destination is broken, talk to the operator to fix DNS", "this destination has likely run out of space" and "this destination times out, need to reroute", but in practice raw errors are probably more useful. We can't predict how things will break for others, so it is better to write a troubleshooting FAQ.

@link2xt link2xt force-pushed the link2xt/analyze-queue branch 2 times, most recently from d902186 to b164c4d Compare May 5, 2026 01:04

link2xt commented May 5, 2026

> Probably better wait for filtermail deployment, because error messages will look different as most errors displayed here are smtp transport errors and they will be different for filtermail-transport

With filtermail-transport errors now look like this:

  7: host 127.0.0.1[127.0.0.1] said: 421 Failed to connect to any mail server (in reply to end of DATA command)

Some errors are still useful:

chat-mail.nl.eu.org (15 recipients)
  7: host 127.0.0.1[127.0.0.1] said: 421 DNS resolution failed for chat-mail.nl.eu.org (in reply to end of DATA command)

But "Failed to connect to any mail server" could be more informative, to distinguish between timeouts or connection refusal. Wonder what postfix reports here if there are multiple MX records and IPs, just the last error?

Errors from the remote host are still relayed, but without saying which MX server returned the error:

  652: host 127.0.0.1[127.0.0.1] said: 451 4.3.0 Error: queue file write error (in reply to end of DATA command)

E.g. previously we had errors that included MX name:

t-online.de (2 recipients)
  1: host mx03.t-online.de[194.25.134.73] refused to talk to me: 554 IP=x.x.x.x - Dialup/transient IP not allowed. Use a mailgateway or contact toda@rx.t-online.de if obsolete. (DIAL)
  1: host mx00.t-online.de[194.25.134.8] refused to talk to me: 554 IP=x.x.x.x - Dialup/transient IP not allowed. Use a mailgateway or contact toda@rx.t-online.de if obsolete. (DIAL)
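
To compare the old and new message formats, the reporting host and reply code can be pulled out of a reason string. This is a rough sketch that only covers the two phrasings quoted in this thread ("said:" and "refused to talk to me:"); the function name and regex are illustrative, not part of the PR.

```python
import re

# Matches e.g. "host mx03.t-online.de[194.25.134.73] refused to talk to me: 554 ..."
# and "host 127.0.0.1[127.0.0.1] said: 421 ...".
_HOST_REPLY = re.compile(
    r"host (?P<name>[^\s\[]+)\[(?P<ip>[^\]]+)\] "
    r"(?:said|refused to talk to me): (?P<code>\d{3})\b"
)


def parse_host_reply(reason):
    """Return {'name', 'ip', 'code'} for a host reply, or None if the reason has no such part."""
    m = _HOST_REPLY.search(reason)
    if m is None:
        return None
    return {"name": m["name"], "ip": m["ip"], "code": int(m["code"])}
```

With filtermail-transport the `name` comes back as `127.0.0.1` for every destination, which is exactly the loss of MX information described above.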

@link2xt link2xt force-pushed the link2xt/analyze-queue branch from c2c73fc to f896ce6 Compare May 12, 2026 17:24