E-Mail problems: a detective story

A few weeks ago, the company I work for changed its internet provider. Faster access, lower price, you know the drill.

But problems started to appear, namely DNS issues. They took some time to solve, but in the end everything worked OK.

Then complaints started to come in: people had sent e-mails to us, and they were never received or answered…

A quick check of the Postfix logs showed that those customers' servers were always being disconnected. The SMTP transaction looked something like this:

connect from smtp.domain.com[X.Y.Z.L]
lost connection after DATA (0 bytes) from smtp.domain.com[X.Y.Z.L]
disconnect from smtp.domain.com[X.Y.Z.L]

So it looked like those customers' servers connected and then disconnected after a few seconds.
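If you want to see how many peers are affected, the log can be scanned for that pattern. Here is a minimal Python sketch, assuming the log lines look like the excerpt above; the function name and the regular expression are mine, not something Postfix provides.

```python
import re
from collections import Counter

# Matches lines such as:
#   lost connection after DATA (0 bytes) from smtp.domain.com[X.Y.Z.L]
# The pattern is an assumption based on the log excerpt in this post.
LOST_AFTER_DATA = re.compile(
    r"lost connection after DATA(?: \(\d+ bytes\))? "
    r"from (?P<host>\S+?)\[(?P<addr>[^\]]+)\]"
)

def count_lost_after_data(lines):
    """Count 'lost connection after DATA' events per client host name."""
    counts = Counter()
    for line in lines:
        match = LOST_AFTER_DATA.search(line)
        if match:
            counts[match.group("host")] += 1
    return counts

sample = [
    "connect from smtp.domain.com[X.Y.Z.L]",
    "lost connection after DATA (0 bytes) from smtp.domain.com[X.Y.Z.L]",
    "disconnect from smtp.domain.com[X.Y.Z.L]",
]

for host, hits in count_lost_after_data(sample).items():
    print(host, hits)
```

In practice you would pass an open log file instead of the sample list; the path depends on your system, so I leave it out.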

A quick search on Google (:-) ) showed that it might be a firewall issue, an MTU issue or a Postfix bug (!).

We checked and double-checked the firewall. It was OK, and anyway it wouldn’t explain why other servers had no problem connecting and transferring mail.

The MTU issue: OK, we changed the internet provider, but all our firewall interfaces are on Ethernet. We changed the MTU to 1492 anyway: ifconfig eth0 mtu 1492. It didn’t solve the problem. Back to square one.

Postfix bug: It had been working fine for the last 6 months, so that would be very strange… We disabled PostGrey (the GLD Daemon). Not a greylisting issue. Bummer.

So we enabled the Postfix debugging features by adding the line: debug_peer_list = smtp.domain.com (instructions here: http://www.postfix.org/DEBUG_README.html) and waited for a connection.

Anyway, to cut a long story short, we found out that the whole SMTP transaction was OK until the DATA portion, where it looked like the other e-mail server disconnected. What we did see was that we had a long list of RBLs to check, and it was taking too long to check every one of them, so the other e-mail server apparently gave up after a time-out period, even in the middle of the e-mail transfer. I think that is a bug on the other peer, but the issue was ours…
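For anyone wondering what a single RBL check involves: each list is queried over DNS, by reversing the octets of the connecting IPv4 address and appending the list's zone. Every list in the configuration means at least one more DNS lookup per connection. A minimal Python sketch of how the query name is built; the zone names are hypothetical placeholders, not the lists we were using, and no lookup is actually sent.

```python
import ipaddress

def dnsbl_query_name(client_ip, zone):
    """Build the DNS name that is looked up for an IPv4 client in a DNSBL zone.

    The octets are reversed and the zone is appended, so
    192.0.2.10 checked against rbl-one.example becomes
    10.2.0.192.rbl-one.example
    """
    # IPv4Address validates the input and raises ValueError on garbage.
    octets = str(ipaddress.IPv4Address(client_ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone

# Hypothetical zones, for illustration only.
zones = ["rbl-one.example", "rbl-two.example"]

for zone in zones:
    print(dnsbl_query_name("192.0.2.10", zone))
```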

So we cut our RBL list in half, keeping only the NJABL, Spamcop and Spamhaus lists, and the servers with problems started to connect and immediately deliver their queued messages to us.

Moral of the story: the longer the RBL list, the longer it takes to process incoming mail. Some e-mail servers will just “barf” at these long delays.
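To put rough numbers on that moral, here is a back-of-the-envelope Python sketch. Everything in it is an assumption for illustration: that the lookups run one after another, that a slow list can cost up to 5 seconds, that the sending server gives up after 20 seconds, and that our list went from about six entries to three. We never measured any of these values.

```python
def worst_case_delay(zone_count, per_lookup_timeout_s):
    """Upper bound on the delay if every lookup runs in sequence and times out."""
    return zone_count * per_lookup_timeout_s

PER_LOOKUP_TIMEOUT_S = 5.0   # assumed cost of one slow or dead list
PEER_PATIENCE_S = 20.0       # assumed time before the sending server gives up

for zone_count in (6, 3):    # roughly before and after trimming the list
    delay = worst_case_delay(zone_count, PER_LOOKUP_TIMEOUT_S)
    if delay > PEER_PATIENCE_S:
        verdict = "peer may give up"
    else:
        verdict = "within the peer's patience"
    print(f"{zone_count} lists: up to {delay:.0f}s, {verdict}")
```

The exact figures do not matter much; the point is that the bound grows linearly with the number of lists, so trimming the list is the cheapest fix.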

Right now: Zero problems.

