So What Makes a Good Spam Filter Anyway?
By Alan Hearnshaw
Spam filters. Most of us know we need one. Some of us know we need a better one, but how many stop to think about what actually makes a good spam filter in the first place?
This is not just a rhetorical question. It is a question that many users (and many developers) do not ask, and that consequently goes unanswered.
Perhaps the question is best answered by defining the qualities of the perfect spam filter. We’ll call our perfect spam filter “SpamSplatter 3000”. Here are some of the defining qualities of “SpamSplatter 3000”:
1. It requires zero interaction from the user.
2. It produces zero false positives (good messages identified as bad) and zero false negatives (bad messages identified as good).
3. It is transparent: you only ever see good messages and never need even be aware that spam exists.
That’s it. Not much of a shopping list, is it? Of course, “SpamSplatter 3000” hasn’t been invented yet (and if it ever is, I want a piece of the action), but it does give us a frame of reference when looking for the best filter we can find.
Let’s take each point in turn:
It requires zero interaction from the user
There are two kinds of filters that currently come near to this ideal: Bayesian filters and community filters. Bayesian filters strip messages down to small “word bites”, or tokens, and maintain a database recording how often each token has appeared in good and in bad messages. When a new message is encountered, the filter strips it down to tokens, looks them up in the database, and applies a formula based on the probability theorem of British mathematician Thomas Bayes. Over time, a Bayesian filter “learns” the characteristics of spam messages.
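The token-and-database idea can be sketched in a few lines of Python. This is only an illustration of the general naive Bayes approach, not the method of any particular product: the class name `TinyBayesFilter`, the tokenizing rule, the add-one smoothing, and the training messages are all assumptions made for the example.

```python
import math
import re
from collections import Counter


class TinyBayesFilter:
    """Hypothetical minimal naive Bayes spam filter (illustrative only)."""

    def __init__(self):
        self.spam_tokens = Counter()   # token -> number of spam messages containing it
        self.good_tokens = Counter()   # token -> number of good messages containing it
        self.spam_messages = 0
        self.good_messages = 0

    @staticmethod
    def tokenize(message):
        # Strip the message down to a set of lowercase "word bites".
        return set(re.findall(r"[a-z0-9$']+", message.lower()))

    def train(self, message, is_spam):
        # This is the "learning" step: the user (or a button click) labels a message.
        tokens = self.tokenize(message)
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_messages += 1
        else:
            self.good_tokens.update(tokens)
            self.good_messages += 1

    def spam_probability(self, message):
        # Combine per-token evidence as log-odds, using add-one smoothing so
        # that tokens never seen before do not produce a division by zero.
        log_odds = math.log((self.spam_messages + 1) / (self.good_messages + 1))
        for token in self.tokenize(message):
            p_spam = (self.spam_tokens[token] + 1) / (self.spam_messages + 2)
            p_good = (self.good_tokens[token] + 1) / (self.good_messages + 2)
            log_odds += math.log(p_spam / p_good)
        return 1 / (1 + math.exp(-log_odds))


spam_filter = TinyBayesFilter()
spam_filter.train("Cheap pills, buy now", is_spam=True)
spam_filter.train("Win money now", is_spam=True)
spam_filter.train("Meeting agenda for Monday", is_spam=False)
spam_filter.train("Lunch on Monday?", is_spam=False)

print(spam_filter.spam_probability("Buy cheap pills"))   # above 0.5: looks like spam
print(spam_filter.spam_probability("Monday meeting"))    # below 0.5: looks good
```

A real filter would train on thousands of messages and use more careful tokenizing and scoring, but the shape is the same: count tokens, then weigh the evidence.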
Community filters work on a simple voting system: every user who receives a spam message “votes” it as spam. These votes are stored on a central server, and when enough votes have been received, the message is blocked for all users in the community.
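The voting scheme can be sketched as follows. This is a minimal sketch under stated assumptions: the class name `CommunityFilter`, the threshold of three votes, and the choice of a SHA-256 hash of the normalized text as the message fingerprint are all invented for illustration, and a real service would need a fingerprint that survives the small variations spammers add to each copy.

```python
import hashlib
from collections import defaultdict


class CommunityFilter:
    """Hypothetical central vote store for a community spam filter."""

    def __init__(self, vote_threshold=3):
        self.vote_threshold = vote_threshold   # assumed value, not from any real service
        self.votes = defaultdict(set)          # fingerprint -> ids of users who voted

    @staticmethod
    def fingerprint(message):
        # Normalize case and whitespace so trivially different copies match.
        normalized = " ".join(message.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def report_spam(self, user_id, message):
        # Storing user ids in a set means each user gets only one vote per message.
        self.votes[self.fingerprint(message)].add(user_id)

    def is_blocked(self, message):
        voters = self.votes.get(self.fingerprint(message), ())
        return len(voters) >= self.vote_threshold


server = CommunityFilter(vote_threshold=3)
server.report_spam("alice", "WIN   a free cruise")
server.report_spam("bob", "win a free cruise")
server.report_spam("alice", "win a free cruise")   # repeat vote, not counted twice
print(server.is_blocked("Win a free cruise"))      # False: only two distinct voters
server.report_spam("carol", "win a free cruise")
print(server.is_blocked("Win a free cruise"))      # True: threshold reached
```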
As can be seen, user interaction with these types of filters is mainly limited to a two-button operation (correcting wrongly identified messages), and the more accurate the filter, the less those buttons are used.
OK, so that’s pretty good. Not exactly zero interaction, but if the filter is accurate enough, then it should be pretty near. That brings us to point two:
It produces zero false positives or negatives
This is the area on which most spam filter development is concentrating, and things are getting pretty good nowadays. It is not at all unusual to see an efficient modern filter achieve an accuracy of 96% or better. It is, of course, far better to have a false negative than a false positive if you are ever going to tear yourself away from the killed mail folder!
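A headline accuracy figure does not say which kind of mistake a filter makes, so it is worth counting the two separately. The sketch below uses invented results for two hypothetical filters, each tested on 100 messages (60 spam, 40 good); the function name `summarize` and all of the counts are assumptions for illustration, not measurements of any real product.

```python
def summarize(results):
    """Summarize (is_spam, flagged_as_spam) pairs from a filter test run."""
    results = list(results)
    # False positive: a good message wrongly flagged as spam.
    false_positives = sum(1 for is_spam, flagged in results if flagged and not is_spam)
    # False negative: a spam message that slipped through unflagged.
    false_negatives = sum(1 for is_spam, flagged in results if is_spam and not flagged)
    accuracy = (len(results) - false_positives - false_negatives) / len(results)
    return {
        "accuracy": accuracy,
        "false_positives": false_positives,
        "false_negatives": false_negatives,
    }


# Invented test runs: filter A lets 4 spam messages through,
# filter B kills 4 good messages instead.
filter_a = [(True, True)] * 56 + [(True, False)] * 4 + [(False, False)] * 40
filter_b = [(True, True)] * 60 + [(False, False)] * 36 + [(False, True)] * 4

print(summarize(filter_a))
print(summarize(filter_b))
```

Both invented filters score 96%, but only filter B forces you to go digging through the killed mail folder for lost messages.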