Marketing & SEO Discussion List - LED Digest

Techniques for Fighting Spam - Part 1 of 2
Written by Tom Aman   
Wednesday, 01 February 2006

[click here for Part 2 of this article]

Hi Adam

The recent posts re Internet scams reminded me that I had intended to share my experiences in figuring out how to deal with SPAM.  This is a very long post, but I feel the info might be of use to many LEDers.

First, a definition of SPAM.  Individuals often disagree, but the many SPAM fighting sites generally agree that not all UCE (Unsolicited Commercial Email) is SPAM. SPAM is considered to be UBE (Unsolicited Bulk Email), meaning email that is sent by the thousands, automatically, to one or more large email lists.  For example, if I visit your site, discover a lot of broken links and, as a result, email you suggesting you try my software, that is UCE but it is not SPAM.  On the other hand, if I used one or more of the many lists available and blindly sent a message flogging my software to thousands and thousands of people without ever seeing their sites or knowing if they might find it useful, that is UBE and would definitely be SPAM.

This all started because I have an email account that received so much SPAM that I was considering abandoning it.  That made it a good one to use for some serious research.  If things got worse, I could drop the account; if they got better, I could keep it active (my first choice).

First, I decided to check out some SPAM filtering software to see how good a job it would do.  That exercise in itself was revealing.  One problem is that the trial period for many of the filters is too short to make a good judgement call. Most of the filters claim accuracy rates in excess of 99% but, with a single exception, I never found one that achieved that level of accuracy.  On doing more checking, the reason for the difference between the filters' claims and my actual experience became relatively clear.  During development, the typical way to test the accuracy of a filter is to have 4 or 5 thousand emails available, about half of them SPAM.  The filter is then designed to filter this sample set with a very high degree of accuracy and with few or no "false positives" (good email identified wrongly as SPAM).  The problem with this approach is that the accuracy depends on the user actually receiving SPAM that is similar to the mix used in the test set.  If the SPAM a user receives is significantly different, the success rate will be different and very likely lower.
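To make the test-set point concrete, here is a rough Python sketch.  The SPAM categories, the per-category catch rates and both mixes are invented purely for illustration; they are not measurements from any filter.  The only idea being shown is that the overall catch rate is a weighted average, so it shifts when the mix of SPAM shifts.

```python
def overall_catch_rate(catch_rate_by_type, mix):
    """Weighted average of per-type catch rates.

    catch_rate_by_type: fraction of each SPAM type the filter catches.
    mix: how many messages of each type arrive (any consistent unit).
    """
    total = sum(mix.values())
    caught = sum(catch_rate_by_type[kind] * count for kind, count in mix.items())
    return caught / total


# Hypothetical numbers, chosen only to illustrate the effect.
catch_rate_by_type = {"pharmacy": 0.999, "stock_tips": 0.995, "image_only": 0.60}

# Mix resembling the developer's test set (assumed).
developer_mix = {"pharmacy": 60, "stock_tips": 38, "image_only": 2}

# Mix one particular user might actually receive (assumed).
user_mix = {"pharmacy": 20, "stock_tips": 20, "image_only": 60}

developer_rate = overall_catch_rate(catch_rate_by_type, developer_mix)
user_rate = overall_catch_rate(catch_rate_by_type, user_mix)

print(f"catch rate on developer mix: {developer_rate:.1%}")
print(f"catch rate on user mix:      {user_rate:.1%}")
```

The same filter, with the same per-category behaviour, scores much lower on the second mix simply because the category it handles badly makes up more of the mail.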

Filters try to identify SPAM and, depending on the filter, can either delete it or direct it to a specific folder so it can easily be reviewed.  I found one exception to this, described below.

One filter I tested extensively over a long period claimed an accuracy rate of 98% at the outset, rising to 99.8+% after training, and a "false positive" rate of less than 0.01% (1 in 10,000).  The first question is whether "less than 1 in 10,000" means 10,000 emails of all kinds, SPAM included, or 10,000 good emails.  It makes a big difference.  Over the test period of 65 days, I received 12,953 emails, of which 170 were actually good (12,783 SPAM).  The false positive rate (8 emails) was 0.06% based on the full 12,953 count, but an alarming 4.7% based on the 170 good emails.  And the SPAM identification rate averaged out at 78.5% (nowhere near the 99.8+% claimed).
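The difference between the two denominators is easy to check.  The short Python sketch below uses only the counts quoted above; the function name is my own and is not part of any filter.

```python
def false_positive_rates(total_emails, good_emails, false_positives):
    """Return the false positive rate two ways, as percentages.

    First value:  false positives as a share of ALL email received.
    Second value: false positives as a share of GOOD email only.
    """
    share_of_all = 100.0 * false_positives / total_emails
    share_of_good = 100.0 * false_positives / good_emails
    return share_of_all, share_of_good


# Counts from the 65-day test described above.
total_emails = 12953
good_emails = 170
false_positives = 8

of_all, of_good = false_positive_rates(total_emails, good_emails, false_positives)
print(f"{of_all:.2f}% of all email")   # 0.06% of all email
print(f"{of_good:.1f}% of good email") # 4.7% of good email
```

When nearly all of the incoming mail is SPAM, a rate quoted against total email makes the filter look far safer than it is for the mail you actually care about.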

The best filter I found claimed an accuracy rate of 99.9+% after training. This filter is the one exception mentioned above.  It does its thing with the email, you correct any errors it may have made, then you tell it to process. The email is still on the mail server at this point, and the filter deletes the messages marked as SPAM by you or by the filter itself.  Then you start your email client and download the good stuff that is left.  Even after you enter all your "friends" (from your address book), this filter is all over the place the first few days, but it learns very rapidly, usually in 3 or 4 days.  I actually achieved an accuracy of 98.7%, close to that advertised, during the test period.  However, because it is more aggressive in its filtering, the false positive rate was a bit higher, coming out at 10 emails (0.077% of 12,953, 5.88% of 170).  I have continued to use this filter and am now running at 99.7% with a false positive rate of about 0.007% (or about 0.15% of good emails).  Much better, but still not perfect.  This filter is called a Markovian filter, one step beyond the more common Bayesian filter, and can be found at http://www.spamrip.com. I am using Version 1; the current download is Version 2, and I will be downloading that shortly.  It is free.

My conclusion after all of this was that I would never, ever trust any filter (Brightmail included) to just delete SPAM without me having a chance to review its choices.  Even the best will occasionally register a "false positive" and, with my luck, that would be the most important one I would ever receive.
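As a rough illustration of "review before delete", here is a minimal Python sketch.  The Message class, the spam_score field and the 0.5 threshold are all hypothetical and are not taken from any of the filters I tested; the point is only that suspected SPAM is set aside for review and nothing is deleted automatically.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: str
    subject: str
    spam_score: float  # hypothetical score from some filter: 0.0 = good, 1.0 = SPAM


def route(messages, threshold=0.5):
    """Split messages into (inbox, review) without deleting anything.

    Messages scoring at or above the threshold go to a review folder,
    where a person can rescue any false positives before deletion.
    """
    inbox, review = [], []
    for msg in messages:
        if msg.spam_score >= threshold:
            review.append(msg)
        else:
            inbox.append(msg)
    return inbox, review


# Made-up sample messages for illustration only.
sample = [
    Message("friend@example.com", "Lunch on Friday?", 0.02),
    Message("bulk@example.net", "Cheap software!!!", 0.97),
    Message("client@example.org", "Invoice query", 0.61),  # a likely false positive
]

inbox, review = route(sample)
print([m.subject for m in inbox])
print([m.subject for m in review])
```

The third message is the kind of email this approach protects: it lands in the review folder rather than being deleted, so it can still be rescued.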

Tom Aman
Aman Software

