E-Mail problems: a detective story

A few weeks ago, the company I work for changed its internet provider. Faster access, lower price, you know the drill.

Then problems started to appear, namely DNS issues, which took some time to solve, but in the end everything worked OK.

Anyway, complaints started to come in that people had sent us e-mails that were never received, let alone answered…

A quick check of the Postfix logs showed that those customers were always being disconnected:

The SMTP transaction looked something like this:

connect from smtp.domain.com[X.Y.Z.L]
lost connection after DATA (0 bytes) from smtp.domain.com[X.Y.Z.L]
disconnect from smtp.domain.com[X.Y.Z.L]
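To see how often this happens and to which peers, log lines like the ones above can be counted with a few lines of Python. This is only a sketch: the regular expression is written against the sample lines shown here, and the function name count_lost_connections is made up for the illustration. Real log lines carry a timestamp and process prefix, which the search simply skips over.

```python
import re
from collections import Counter

# Written against the sample line above, e.g.
# "lost connection after DATA (0 bytes) from smtp.domain.com[X.Y.Z.L]"
LOST_RE = re.compile(
    r"lost connection after (\w+)(?: \(\d+ bytes\))? from (\S+)\[([^\]]+)\]"
)

def count_lost_connections(lines):
    """Count 'lost connection' events per (client host, SMTP stage)."""
    counts = Counter()
    for line in lines:
        match = LOST_RE.search(line)
        if match:
            stage, host, _addr = match.groups()
            counts[(host, stage)] += 1
    return counts

# The three sample lines from the log excerpt above.
sample = [
    "connect from smtp.domain.com[X.Y.Z.L]",
    "lost connection after DATA (0 bytes) from smtp.domain.com[X.Y.Z.L]",
    "disconnect from smtp.domain.com[X.Y.Z.L]",
]

lost_counts = count_lost_connections(sample)
for (host, stage), n in lost_counts.most_common():
    print(f"{host}: {n} lost after {stage}")
```

Feeding it the lines of the real mail log instead of the sample list gives a quick ranking of which peers drop the connection, and at which stage.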

So it looked like those customers’ servers connected and then disconnected after a few seconds.

A quick search on Google (:-) ) showed that it might be a firewall issue, an MTU issue or a Postfix bug (!).

We checked and double-checked the firewall. It was OK, and anyway it wouldn’t explain why other servers had no problem connecting and transferring mail.

The MTU issue: OK, we changed the internet provider, but all our firewall interfaces are on Ethernet. We changed the MTU to 1492 anyway: ifconfig eth0 mtu 1492. It didn’t solve the problem. Back to square one.
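For reference, 1492 is the value usually suggested when the link runs over PPPoE: the standard Ethernet MTU of 1500 minus 8 bytes of PPPoE overhead. Whether PPPoE was involved here is an assumption, but the arithmetic itself is easy to sketch (the constant and function names are just for illustration):

```python
ETHERNET_MTU = 1500    # standard Ethernet payload size, in bytes
PPPOE_OVERHEAD = 8     # 6-byte PPPoE header + 2-byte PPP protocol field
IPV4_HEADER = 20       # IPv4 header without options
TCP_HEADER = 20        # TCP header without options

def tcp_mss(mtu):
    """Largest TCP payload that fits in one packet of the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER

pppoe_mtu = ETHERNET_MTU - PPPOE_OVERHEAD
print(pppoe_mtu, tcp_mss(pppoe_mtu))   # 1492 1452
```

If a link somewhere on the path has a smaller MTU than the endpoints assume and the ICMP "fragmentation needed" messages are filtered, full-size packets can vanish silently, which is why MTU shows up on the suspect list for mail that stalls at DATA.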

Postfix bug: It had been working fine for the last 6 months, so very strange… We disabled PostGrey (the GLD daemon). Not a greylisting issue. Bummer.

So we enabled the Postfix debugging features by adding the line debug_peer_list = smtp.domain.com to main.cf (instructions here: http://www.postfix.org/DEBUG_README.html ) and waited for a connection.

Anyway, to cut a long story short, we found that the whole SMTP transaction was OK until the DATA stage, where the other e-mail server appeared to disconnect. What we did see was that we had a long list of RBLs to check, and checking every one of them took too long, so the other e-mail server apparently gave up after a time-out period, even though the transfer was already under way. I think that is a bug on the other side, but the problem was ours to solve…

So we cut our RBL list in half, keeping only the NJABL, Spamcop and Spamhaus lists, and the servers with problems started to connect and immediately delivered their backlog of queued messages to us.

Moral of the story: the longer the RBL list, the longer it takes to process incoming mail. Some e-mail servers will just “barf” at these long delays.
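To get a feel for how much each list costs, the lookups can be timed outside of Postfix. An RBL check is just a DNS query for the client address with its octets reversed, followed by the list's zone name. The sketch below assumes the lists are queried one after another; the function names and the injectable resolve parameter are made up for the illustration.

```python
import socket
import time

def dnsbl_query_name(ip, zone):
    """Build the lookup name: reversed IPv4 octets followed by the zone."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def time_dnsbl_lookups(ip, zones, resolve=socket.gethostbyname):
    """Query each zone in turn and record how long every lookup took."""
    results = []
    for zone in zones:
        start = time.monotonic()
        try:
            resolve(dnsbl_query_name(ip, zone))
            listed = True     # any answer means the address is listed
        except OSError:
            listed = False    # no record, or the resolver failed
        results.append((zone, listed, time.monotonic() - start))
    return results
```

Calling time_dnsbl_lookups("127.0.0.2", ["zen.spamhaus.org", "bl.spamcop.net"]) from the mail server gives one timing per list. The zone names are my assumption for the Spamhaus and Spamcop lists mentioned above, and 127.0.0.2 is the conventional test address that lists report as listed. A slow or dead list shows up immediately as a lookup that takes seconds instead of milliseconds, and the total is roughly what a sending server has to sit through.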

Right now: Zero problems.

Linux mail gateway

For 4 years I’ve run a Mandrake-based firewall with Postfix and MailScanner where I work. I really, really liked MailScanner, but for my colleagues the setup was “too complicated”. So I moved to EFW, Endian Firewall Community Edition. What it gains in ease of use it loses in flexibility.

Finally my prayers were heard, and I’m going to move again, this time to a custom-built, full-fledged mail gateway with MailScanner. Check out: this howto.

Linux firewalls

Where I work, despite being a (small) Windows shop, nobody trusts ISA Server as a firewall… 🙂 so we have had Linux running non-stop as a firewall/proxy since 2003, with Postfix, MailScanner, SpamAssassin and iptables, and it has been doing a fine job.
So far so good, but I thought that after 5 years of non-stop service I should look for something easier to manage for my Linux-challenged colleagues 🙂

I basically looked at two solutions, IPCop and Endian Firewall:

IPCop: It is basically aimed at the home user. Mail processing is done through an SMTP proxy that doesn’t look too solid, and that proxy is an add-on to the basic IPCop system.

Endian Firewall: It looks like it’s IPCop-based, but mail processing is done with Postfix, Amavisd and SpamAssassin. It also scans mail with ClamAV.

Both solutions have web-based interfaces, traffic graphs, and almost no need to go into a shell. I do prefer MailScanner to Amavisd for mail filtering. In MailScanner, blocked e-mails can be unblocked and delivered to the user without too much trouble. In Amavisd you must feed them into the system again because the “blocked” format is raw, so if you really need that blocked e-mail, the only way I know (so far) is to open it in Outlook Express and forward it.

Both systems lack basic tools like wget, nslookup, dig and whois that can help with debugging your internet connection. You need to add them after installing, and that can be quite a challenge.

Also, the clamd daemon doesn’t seem too solid. It has a habit of crashing without leaving any trace or any bit of information in the log files… In my original firewall system we used McAfee for Linux and it always worked, but we were also paying for it…

So until clamd started crashing constantly last week I had a good impression of EFW, but I’ll switch the virus scanner to the command-line clamscan instead of the clamd daemon. The clamd people must sort out its instability issues as soon as possible. It’s not EFW’s fault.
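For the curious, calling clamscan directly is simple enough to sketch. The snippet below is not how EFW wires in its scanner, just a minimal illustration that relies on the exit codes documented in the clamscan man page (0 for no virus found, 1 for virus found, 2 for an error). The scan_file name and the injectable run parameter are made up for the example.

```python
import subprocess

# Exit codes as documented in the clamscan man page.
CLAMSCAN_RESULTS = {0: "clean", 1: "infected", 2: "error"}

def scan_file(path, run=subprocess.run):
    """Scan one file with the clamscan command-line tool (no clamd needed)."""
    completed = run(
        ["clamscan", "--no-summary", path],
        capture_output=True,
        text=True,
    )
    # Treat any undocumented exit code as an error as well.
    status = CLAMSCAN_RESULTS.get(completed.returncode, "error")
    return status, completed.stdout.strip()
```

The price for not depending on the daemon is speed: clamscan loads the whole signature database on every run, so each scan is much slower than handing the file to a running clamd. On a small mail gateway that may be an acceptable trade for something that doesn’t crash.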