Email Categorization Advisor for Help Desk
Keywords:
data mining, clustering, unstructured data, unsupervised learning, email foldering, CSV file, stopwords
Abstract
Data is any collection of recorded values, such as words, that can be analyzed. Digital data falls into three categories: structured, semi-structured, and unstructured. Structured data conforms to a pre-defined data model with a specific structure and schema, so computer programs can read it easily. Only about 10% of the world's data is structured. Unstructured data does not conform to a pre-defined data model, has no specific structure or schema, and cannot be read easily by computer programs; it makes up about 90% of the world's data. In this project, we deal with unstructured data in the form of emails. With email communication on the rise, it is important to sort through this data quickly and extract only the relevant information. Email data mining and analysis serves several purposes, such as spam and ham detection and subject classification. Here we use a large set of personal emails for the purpose of email categorization. We apply machine learning algorithms that perform clustering on this large text collection, and we compare several clustering methods to find the one with the best accuracy. The sample dataset is the Enron Corpus, which contains about 0.5 million emails from 150 users.
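To make the idea of clustering emails concrete, the following is a minimal sketch and not the pipeline used in this project. It assumes one plausible approach: stopword removal, TF-IDF weighting, and a small spherical k-means written with the Python standard library. The stopword list, the function names (tokenize, tfidf_vectors, kmeans), and the four sample emails are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "to", "of", "and", "is", "in",
             "for", "on", "my", "i", "please", "it"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def tfidf_vectors(docs):
    """Return one unit-length sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed inverse document frequency, so no term gets zero weight.
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def kmeans(vectors, k, iterations=10):
    """Spherical k-means with deterministic farthest-first seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the vector least similar to every centroid chosen so far.
        far = min(vectors, key=lambda v: max(cosine(v, c) for c in centroids))
        centroids.append(far)
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each email joins its most similar centroid.
        labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                  for v in vectors]
        # Update step: each centroid becomes the normalised sum of its members.
        for j in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == j:
                    for t, val in v.items():
                        total[t] += val
            if total:
                norm = math.sqrt(sum(x * x for x in total.values())) or 1.0
                centroids[j] = {t: x / norm for t, x in total.items()}
    return labels

# Hypothetical help-desk emails, invented for this sketch.
emails = [
    "Please reset my password for the VPN account",
    "I forgot my password and cannot log in to my account",
    "The printer on floor two is out of paper",
    "Printer jam again, the printer will not print",
]
labels = kmeans(tfidf_vectors(emails), k=2)
print(labels)
```

On these four sample messages, the two password emails fall into one cluster and the two printer emails into the other. A corpus the size of Enron would normally call for an optimized library implementation and a much more careful choice of the number of clusters.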