Spam filtering using statistical filters

The problem

A computer is good at doing mathematics but doesn't understand much of the message itself of an incoming email. The email might come from a friend, which you (the reader) easily understand the contents of. You will also easily understand if an incoming message is a spam email, and might delete it since it is of no value to you. The principle is: You read the message, and then decide if it's a spam email or a good email (aka ham).

The mathematics of spam filtering

A computer program which handles email has to be based on statistical means to decide whether an email is likely to be spam or ham. Since a filtering program is based on statistics on words (ham and spam words), it can compute a likelihood of spam or ham mail, but not 100 % correctly all the time. An email might be a false positive, for example.

To make a spam filter program, you will need to know the probabilities of spam words and ham words. A filter program must therefore be "trained" to separate good and bad words in some way. Most of today's spam filter program do this by using Bayesian filtering, but serveral candidates have evolved from this theory, which seem to work more accurate.

For example, the word "buy" (viagra, medicine, holiday etc) might be found in 30 ham emails and in 66 spam emails, occurring respectively 40 and 93 times in each category of ham and spam. This fact in my example shows that an email containing the word "buy" probably is a spam mail.

The filtering program should also consider the email sender. You can put your friends in a friends list (whitelist). If the sender is in your whitelist, the program doesn't have to examine the probabilities of the spam at all. Just let this email pass, because it's from a friend.

On the other hand, if the sender is unknown, the spam filter program has to examine every word in the message to determine the final spam probability.

One of the statistical means available is Bayes Theorem, which is a foundation of the Bayesian Filtering. This theorem is stating that

 (B1, B2,..., Bk forms a partition of a set S, and A is any event in S)

I'll adapt and apply this theorem later to compute the spam probability for a whole email, below..

At this stage it's obvious that you will need a database of words (tokens), together with their spam and ham probabilities, in order to compute an email's spam score. A spam score is nothing else than the probability value of an email to be spam given it include spam words.

Make a word database

Make a word database which you populate with words coming from spam and ham emails. This database has at least three fields, a Word text field, a HamWord number field and a SpamWord number field. You will need at least one spam corpus and one ham corpus. A corpus is a set of emails, for example a folder in Outlook (Express). About 500 emails or so should do fine.

Now, start with one of the corpora, for example the ham corpus - taking one email at the time. However, first you have to define a set of word delimiters. These are Delim in {'&', '/', '(', ')', ',', 'Space'}.

1. Take the next ham email.
2. Read one new line at a time (a line consists of several words, ending with a CR/LF).
3. Make a function which counts the number of tokens (words) in this line.
4. For each of the tokens, if it's not in your database - add it.
5. In any case, if it's there or not, also accumulate the number of occurrences of this token by adding 1 to the HamWord number field.
6. If there are more lines left to read, goto point 2), otherwise goto point 7)
7. If there are more emails in this corpus, goto point 1)
8. End.

Do the same with the spam corpus, adding each spam word to your database's SpamWord number field.

Your database now ought to count several thousands of records.

Checking an email for spam

For computing the spam score, we will adapt the Bayes Theorem in this way:

 For one word (this spam value is element a1): P(spam) = a1 / (a1 + (1-a1)) For two words with corresponding occurrences a1 and a2 : P (spam) = (a1*a2) / (a1*a2 + ((1-a1)* (1-a2)) and so on

Once you have your database with tens of thousands words in it, together with each word's ham and spam occurrences, you can start using it to compute an email's spam score. The mathematics of a spam score is something like this:

1. Defines: S is a string to hold a line. aWord is a string to contain one word, Count is an integer to keep the numbers of words in a line. I is a loop integer for dealing with each aWord. a is an array of real numbers which must be a dynamic length array (we don't know how many words an email contains, but the Count variable will tell. Also define a variable L of type Real or Double, and likewise a real variable Numerator. P is a real number to compute the spam scor (probability)
2. For each email left do
3. Set Numerator to the value of 1.0
4. Read one new line S
5. Set the number of elements of the real array a to the number of words in S
6. Set the value of L to 1.0
7. Let Count be the number of words in this line
8. Loop on I from I = 1 to Count
1. Extract word number I from S
2. Let a [I] be
the number of spam occurrences from your database for this word or 1, divided with
(the number of spam occurrences or 1+ the number of ham occurrences or 1) for this very same word
3. if a [I] is equal to 0 (no such word in your table), make a guess for this value, for example 0.5
4. Let Numerator equal to (Numerator* a [I] )
5. Let L be equal to (L * (1 - a [I] ) )

9. End of loop on I

10. If there are more lines in the email, goto 4)

11. If there are more emails to process, goto 2)

12. Compute the spam score from the formula P = Numerator / (Numerator + L)

13. End

P is now the spam score.

Testing emails for spam

This system is implemented on this web server, go to the spam computing page. This spam computing web page is a standalone server, which might incidentally be down. (You might drop a notification to tore [at] aasli.com if the server should be down)

To test for spam, try these words and notice how the bayesian spam filter is reacting

Test 1 : Buy
Test 2: Buy viagra
Test 3: Buy now
Test 4: Buy now!
Test 5: Buy viagra now!

etc, and watch how the filter is returning an increasing spam score! You might also test some of your ham and spam emails if you like.