Botched Config

Posted 5 years ago.

Today's post is a quick tale about the time I broke all mail in my organisation for about 15 minutes.

At work, we use a system called EXIM as the main email router for all staff accounts (domain.com). Students are on a different subdomain (student.domain.com) and are passed through Office365, so we don't have to worry about them. EXIM is hooked into something called SpamAssassin, which was used to filter spam. However, the Bayesian rules are all gummed up and Office365 does a better job anyway, so when we migrated all (see: most) of our staff to Office365, we turned off client access to the SpamAssassin frontend and ran a script to disable it on all user accounts.

The way that EXIM knows whether or not a user has the service enabled on their account is pretty simple. Each user has a file called "spamconf" which contains the following lines:

# this is the level of filtering we want, where 1 is harsh and 5 is lenient
spamlevel: 5

# these are the user's whitelisted addresses and domains, in regular expression
whitelist: friend@friends-isp\.com|@domain\.ac\.uk

# these are the user's blacklisted addresses and domains, in regular expression
blacklist: @spammers\.net|@spamisp\.com|spammer@aol\.com|\.faith|\.loan|\.date

# is SpamAssassin enabled for this user?
autospam: no

# when SpamAssassin catches something, should the user be notified?
notify: no

Basically, when we moved an account over to Office365 we ran a command similar to this:

nat@mailserv:$ sed -i 's/autospam: yes/autospam: no/g' /path/to/user/spamconf
nat@mailserv:$ sed -i 's/notify: yes/notify: no/g' /path/to/user/spamconf

Just use sed to change all of the yeses to nos for AutoSpam and the notifications that a user would get about it. Seems like a simple fix, right?

Wrong.

When I said that all users have a spamconf file, I was lying. Okay, I wasn't lying. All users should have a spamconf file, but some don't. I don't know why, and the system is hella legacy so I can't be bothered to check. What I do know is that every now and then some users get a SpamAssassin notification when they really shouldn't. "Okay", we say to ourselves, "let's take a look and see why that happens."

nat@mailserv:$ cat /path/to/user/spamconf
cat: /path/to/user/spamconf: No such file or directory
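A missing file like this is easy to overlook when the sed commands run in bulk. As a minimal sketch only, a guard around the same two substitutions could report accounts that have no spamconf at all. The function name disable_autospam is hypothetical, and GNU sed is assumed because of the -i flag used earlier.

```shell
# Hypothetical helper (not part of the original migration script):
# disable SpamAssassin and its notifications in one user's spamconf file.
# Returns 1 and prints a warning if the file does not exist, so a bulk
# run can collect the accounts that need a closer look.
disable_autospam() {
    conf="$1"
    if [ ! -f "$conf" ]; then
        echo "missing spamconf: $conf" >&2
        return 1
    fi
    # Same substitutions as the two sed commands above, anchored to the
    # start of the line so only the intended keys are touched.
    sed -i \
        -e 's/^autospam: yes/autospam: no/' \
        -e 's/^notify: yes/notify: no/' \
        "$conf"
}

# Example call, using the placeholder path from this post:
#   disable_autospam /path/to/user/spamconf || echo "needs attention"
```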

Here comes watlady to explain my emotions at this stage: "wat"

Thanks watlady. So, my colleagues and I start digging around config files to figure out why these notifications are even being sent out in the first place. We find our answers in the main EXIM configuration file, called "configure". Helpful naming is helpful, as always.

Scrolling through the config file, I run across the router settings. Each of these routers has a name, some flags on what to accept, and what to do if those conditions are met. The format looks something like this:

router_name:
  driver = accept|reject
  # make sure they exist
  check_local_user
  condition = "${if \
               and{\
                  {thing_one}\
                  {thing_two}\
               } {1} {0}}"
  domains = domain.com
  transport = vacation_reply|blocked_user_bounce|remote_smtp

So, logically the right way to fix this would be to add a router to the top of the stack that ignores all of the autospam stuff below, right? I crafted the following:

user_autospam_override:
  driver = accept
  check_local_user
  # the condition expands to 1 (true) or 0 (false)
  condition = "${if \
               and{\
                  # first_time.jpg
                  {first_delivery}\
                  # is_this_spam.jpg
                  {eq{$h_X-Spam-Flag:}{YES}}\
                  # does the user NOT have a spamconf file?
                  {!exists {SPAMCONF_FILE}}\
               } {1} {0}}"
  # run on our domain
  domains = domain.com
  # forward to exchange
  transport = remote_smtp
Boom, looks good, right? I checked it over with a colleague who is vastly more knowledgeable in this area than I am, and then we submitted a change request. A few questions were raised during the meeting, such as how we were going to roll this out to all of our EXIM mail servers (we have more than one) and how we would roll it back (just swap the new config file with the old one). I took them into consideration and made the appropriate changes in Git so that they could get pushed out via Puppet, but as I was doing so I came across a different but very promising line in the configuration file:

USER_AUTOSPAM = ${lookup{autospam}lsearch{SPAMCONF_FILE}{$value}{NULL:}}

Basically, look up the word "autospam" in the recipient's spamconf file and get its value. If you can't get a value, just use NULL. Okay, looks happy. What if I just change it to the following line:

USER_AUTOSPAM = ${lookup{autospam}lsearch{SPAMCONF_FILE}{$value}no}

So here all we're doing is changing the "NULL" value to just the string "no", which is what spamconf files with SpamAssassin disabled have anyways.
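The intended behaviour, a lookup with a fallback, can be sketched in shell rather than in EXIM's expansion language. This is only an approximation of what lsearch does: the function name lookup_autospam is hypothetical, and unlike a real lookup it treats an empty value the same as a missing key.

```shell
# Hypothetical stand-in for the lookup: print the value stored under
# "autospam" in a spamconf file, or a default (second argument, "no" if
# omitted) when the file or the key is missing.
lookup_autospam() {
    conf="$1"
    default="${2:-no}"
    value=""
    if [ -f "$conf" ]; then
        # Take the text after "autospam:" on the first matching line.
        value=$(sed -n 's/^autospam:[[:space:]]*//p' "$conf" | head -n 1)
    fi
    if [ -n "$value" ]; then
        echo "$value"
    else
        echo "$default"
    fi
}

# Example calls, using the placeholder path from this post:
#   lookup_autospam /path/to/user/spamconf          # value from the file, else "no"
#   lookup_autospam /path/to/user/spamconf NULL:    # old fallback string
```

With a fallback of "no", a user without a spamconf file looks the same as a user who has SpamAssassin switched off, which is the whole point of the change.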

I got more approval from my colleagues and was cleared to make the change. I change the files over, merge the branch in Git and go about my day.

Until I decide to check in on the panic.log file on the main EXIM server about 5 minutes later, that is...

1970-01-01 00:00:00 MESSAGE-ID failed to expand condition
"${
  if
  and{
    {first_delivery}
    {eq{$h_X-Spam-Flag:}{YES}}
    {!eq
      ${lookup{autospam}lsearch{/path/to/user/$local_part/spamconf}{$value}no}
    }{no}}
  } {1}{0}
}" for user_autospamconf router:
syntax error in "lookup" item - "fail" expected inside "and{...}" condition

💩, what did I do?!

How fragged is this?

Okay, don't panic. Just think. Go back over that line, see where you went wrong.

What if "no" needs to be wrapped in braces, just like {NULL:}? I was running on the assumption that the NULL operator in this configuration relies on braces and everything else doesn't, but maybe that's not it.

I quickly scanned over some other lines of the config until I found the word "yes" wrapped in braces. I guess that's what I was missing then?

I opened up the config file and changed the USER_AUTOSPAM variable:

USER_AUTOSPAM = ${lookup{autospam}lsearch{SPAMCONF_FILE}{$value}{no}}

I save the config and tail the panic.log. I wait 30 seconds. While before I was getting dozens of errors a second, now I'm getting none. Result!
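The error message fits EXIM's documented lookup syntax: both result strings go in braces, and the only bare word allowed in place of the second one is fail. A change like this can be tried out before it goes anywhere near Git, because exim -be expands a string given on the command line and prints the result. The wrapper below is a sketch under assumptions: expansion_ok and EXIM_BIN are hypothetical names, a failed expansion is assumed to be reported as a line starting with "Failed:", and a literal path is used on the assumption that config macros such as SPAMCONF_FILE are not substituted into command-line strings.

```shell
# Hypothetical pre-deploy check. EXIM_BIN lets the binary be overridden.
EXIM_BIN="${EXIM_BIN:-exim}"

# Expand one string with "exim -be". Prints the result and returns 0 on
# success; prints the error to stderr and returns 1 when the output looks
# like a failed expansion (assumed to start with "Failed:").
expansion_ok() {
    out=$("$EXIM_BIN" -be "$1" 2>&1) || return 1
    case "$out" in
        Failed:*)
            echo "$out" >&2
            return 1
            ;;
    esac
    echo "$out"
}

# Example call, using the placeholder path from this post:
#   expansion_ok '${lookup{autospam}lsearch{/path/to/user/spamconf}{$value}{no}}'
```

Running the unbraced version through a check like this first should surface the same syntax error without filling panic.log.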

I log back into Git and make a new merge request to update the botched config, then merge it and let Puppet do its work. I sign into all of the Puppet-controlled boxes one at a time and check 'em to make sure that the config has been pulled. If it hasn't, I quickly make the adjustments manually and hope for Puppet to override them when it gets round to it.

Anywho, that's how I broke email for 15 minutes.