Machine Learning Basics
When we approach spam filtering from a machine learning perspective, we treat it as a classification problem: we aim to classify an email as spam or not spam (ham) based on its features. In our case, the features are the counts of each word in the email. (More on this in the pre-processing section.)
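To make the idea of word-count features concrete, here is a minimal sketch. The function name word_counts and the simple regular-expression tokenizer are assumptions for illustration only; the actual pre-processing used in the project may differ.

```python
import re
from collections import Counter

def word_counts(email_text):
    """Return a Counter mapping each lowercase word to how often it occurs."""
    # Assumption: a "word" is a run of letters or apostrophes.
    words = re.findall(r"[a-z']+", email_text.lower())
    return Counter(words)

features = word_counts("Win money now! Win a FREE prize now.")
print(features["win"])   # count of "win" in the email
print(features["free"])  # count of "free" in the email
```

A Counter returns 0 for words that never appear, which is convenient when comparing emails with different vocabularies.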
A machine learning system operates in two modes: training and testing.
Training
During training, the machine learning system is given labeled data from a training data set. In our project, the labeled training data are a large set of emails, each labeled spam or not spam (ham). During training, the classifier (the part of the machine learning system that actually predicts the labels of future emails) learns from the training data by determining the connections between the features of an email and its label.
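One simple way a classifier can capture connections between features and labels is to tally how often each word appears under each label. The sketch below assumes this counting approach (the first step of a Naive Bayes style classifier), which is an illustrative choice and not necessarily the classifier used in the project; the function train and the tiny data set are hypothetical.

```python
from collections import Counter

def train(labeled_emails):
    """Tally emails per label and word occurrences per label.

    labeled_emails: list of (list_of_words, label) pairs,
    where label is "spam" or "ham".
    """
    class_counts = Counter()
    word_counts = {"spam": Counter(), "ham": Counter()}
    for words, label in labeled_emails:
        class_counts[label] += 1
        word_counts[label].update(words)
    return class_counts, word_counts

# Hypothetical toy training set, for illustration only.
training_data = [
    (["win", "money", "now"], "spam"),
    (["free", "prize", "win"], "spam"),
    (["meeting", "at", "noon"], "ham"),
    (["lunch", "meeting", "tomorrow"], "ham"),
]
class_counts, word_counts = train(training_data)
print(word_counts["spam"]["win"], word_counts["ham"]["win"])
```

Here "win" is seen only in spam emails, so a classifier built on these tallies would treat it as evidence for the spam label.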
Testing
During testing, the machine learning system is given unlabeled data. In our case, these data are emails without the spam/ham label. Based on the features of each email, the classifier predicts whether the email is spam or ham. Each prediction is then compared to the email's true label to measure performance.
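The comparison of predictions against true labels can be sketched as an accuracy computation. The toy_classifier below is a deliberately crude placeholder rule, not the project's classifier, and the four test emails are made up for illustration; accuracy is only one of several possible performance measures.

```python
def accuracy(classifier, test_emails):
    """Fraction of test emails whose predicted label matches the true label."""
    correct = sum(
        1 for text, true_label in test_emails
        if classifier(text) == true_label
    )
    return correct / len(test_emails)

def toy_classifier(text):
    # Placeholder rule (assumption): any email containing "free" is spam.
    return "spam" if "free" in text.lower() else "ham"

# Hypothetical test emails paired with their true labels.
test_emails = [
    ("Claim your FREE prize", "spam"),
    ("Meeting moved to noon", "ham"),
    ("Cheap loans available", "spam"),
    ("Free lunch on Friday", "ham"),
]
score = accuracy(toy_classifier, test_emails)
print(score)
```

The classifier only ever sees the email text; the true labels are held back and used solely for scoring.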
For our project, we used a training data set of 800 emails and a testing data set of 200 emails from the Text Retrieval Conference (TREC) 2007 corpus.
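As a rough sketch of how 1,000 labeled emails could be divided into 800 for training and 200 for testing: the source does not say how the split was chosen, so the random shuffle, the seed, and the function split_dataset are all assumptions, and the email list is synthetic stand-in data, not the TREC corpus.

```python
import random

def split_dataset(emails, n_train, n_test, seed=0):
    """Shuffle a copy of the emails and return disjoint train and test lists."""
    if n_train + n_test > len(emails):
        raise ValueError("not enough emails for the requested split")
    shuffled = list(emails)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]

# Synthetic stand-in for 1,000 labeled emails (not real corpus data).
emails = [(f"email body {i}", "spam" if i % 2 == 0 else "ham") for i in range(1000)]
train_set, test_set = split_dataset(emails, 800, 200)
print(len(train_set), len(test_set))
```

Keeping the two sets disjoint matters because testing on emails the classifier has already seen would overstate its performance.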