Response to Gordon Cormack’s Study of Spam Detection – John Zdziarski (DSPAM) follows up the recent /. study. IMO, these are two of the most significant weaknesses of Cormack’s study:
What’s more, there was no archive window, because Cormack didn’t perform any initial training before taking measurements. Statistical filters know nothing until you train them. Therefore, if you’re going to measure their accuracy, you need to train them first. If you start measuring before you’ve taught the filter anything, then you’re going to end up with some pretty mediocre results.
and:
SpamAssassin is immediately eliminated from the credibility of these results because the test corpus was classified by SpamAssassin (twice) and the test was ultimately a product of SpamAssassin’s decisions.
Zdziarski goes on in some detail about many of the weaknesses in Cormack’s study. Some arguments are stronger than others, but the response is well worth reading if you have an interest in the area (especially when you compare Cormack’s study to those presented at the MIT Spam Conferences). Interestingly, Cormack is a full-fledged professor at the University of Waterloo. I suppose the question to ask is whether the test results are reflective of real-world performance (and if not, then it is a disservice to all the people who glance at the /. headlines and take them at face value).
(Note: these days, I run CRM114 with very good (99%+) results (on a 7-year-old email account), alongside Mail.app’s filter (~95%) and Brightmail (~90%), so I’m probably biased about this. But given the results Cormack is getting, I’d tend to think that he’s doing something very, very wrong. At some point, when I have time and a better-organized site, I should write more about my setup.)
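Zdziarski’s point about initial training is easy to demonstrate with a toy word-count Bayesian classifier (a minimal sketch of my own, not the actual algorithm used by DSPAM, CRM114, or SpamAssassin — the `NaiveBayes` class and its training data here are made up for illustration). Before any training, both classes score identically, so the filter’s output is essentially arbitrary; after even a couple of training messages, it starts making sensible decisions:

```python
from collections import Counter
import math


class NaiveBayes:
    """Toy naive Bayes spam filter: word counts with Laplace smoothing.

    A sketch to illustrate why measuring accuracy before training is
    meaningless -- not a real filter implementation.
    """

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, label, text):
        # Record word frequencies per class and count training messages.
        for word in text.lower().split():
            self.counts[label][word] += 1
        self.totals[label] += 1

    def scores(self, text):
        # Log-probability of the text under each class; with zero training
        # the smoothed terms are identical for both classes, so the filter
        # literally "knows nothing" and ties every message.
        result = {}
        for label in ("spam", "ham"):
            n = sum(self.counts[label].values())
            prior = (self.totals[label] + 1) / (sum(self.totals.values()) + 2)
            logp = math.log(prior)
            for word in text.lower().split():
                logp += math.log((self.counts[label][word] + 1) / (n + 2))
            result[label] = logp
        return result

    def classify(self, text):
        s = self.scores(text)
        return max(s, key=s.get)  # ties (e.g. untrained) break arbitrarily


f = NaiveBayes()
untrained = f.scores("cheap pills now")  # identical scores: a coin flip
f.train("spam", "buy cheap pills now")
f.train("ham", "meeting notes attached for review")
print(untrained["spam"] == untrained["ham"])  # True before training
print(f.classify("cheap pills now"))          # "spam" after training
```

The untrained tie is the whole point: any accuracy numbers collected during that phase measure the tie-breaking rule, not the filter.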