Guest post: Sentiment analysis using the Dutch Netlog Corpus
[This guest post was written by Sarah Schrauwen, a Master's student in computational linguistics at the University of Antwerp, who wrote her Master's thesis on sentiment analysis in collaboration with Mollom.]
Since 2009, Mollom has been protecting the 4 million messages posted daily by more than 40 million Netlog users (in more than 25 languages) by analyzing them for spam and unwanted content. This collaboration between Mollom and one of the largest social networking websites in Europe opens up many interesting research opportunities, and it formed the basis for my Master's thesis.
My thesis, bearing the bulky title “Machine Learning Approaches to Sentiment Analysis Using the Dutch Netlog Corpus”, was written under the supervision of Prof. Walter Daelemans at the Computational Linguistics Department (CLiPS) at the University of Antwerp.
To give an overview of what this study is about, we first have to explain its subject: sentiment analysis. Sentiment analysis deals with the computational treatment of opinion, which essentially means trying to ‘teach’ computers how to distinguish between different kinds of human opinion or emotion. For example, to extract the general opinion expressed in a movie review, we can ask the following questions: is the writer positive about the film, did the writer dislike it, or does the writer not have a strong opinion about it at all?
Numerous applications of sentiment analysis exist today, and they have become indispensable for online services. Mining customer reviews or feedback for opinions about a given product (e.g. a digital camera, car or dishwasher) can tell companies whether their customers are happy and satisfied, or whether they are disgruntled. For customers, this information is also very valuable when deciding whether or not to buy the product. Opinion mining also proves very useful for moderation: a website's moderation team should be able to react quickly and efficiently to messages on forums or discussion boards in which dissatisfied clients report product deficiencies, and to any “heated debates” or “flame wars” in progress. Furthermore, sentiment analysis makes it possible to track emotion or opinion over time and to follow (mood) trends online, which is interesting data for marketing research, trend watchers and recommendation systems.
The machine learning approaches used for sentiment analysis in this study require an annotated corpus on which to train and test the classifiers. We built a manually annotated corpus from Dutch data extracted from Netlog: the Dutch Netlog Corpus (DNC). It has been annotated on three levels: one level for sentiment analysis (called ‘valence’, with five classes: ‘positive’, ‘negative’, ‘both’, ‘neutral’ and ‘n/a’) and two levels to evaluate the language performance of the writers (with three classes each: ‘standard’, ‘dialect’ and ‘n/a’ for the ‘performance’ level, and ‘chat’, ‘non-chat’ and ‘n/a’ for the ‘chat’ level). The majority of the data in the DNC is written in dialect and chat Dutch, which differs greatly from standard and non-chat Dutch in that it is not uniform: its orthography and lexicon are constantly changing and evolving. The World Wide Web abounds with chat language, and dealing with these forms of language computationally is becoming more and more relevant, since the Web is currently the largest resource of freely available (user-generated) data. In this study, we found that sentiment analysis can be done on dialect and chat text, which had not been examined before.
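To make the three annotation levels concrete, here is a minimal sketch of how one annotated message could be represented. The label sets are the ones listed above; the class name `AnnotatedMessage`, the field names and the sample message are my own assumptions for illustration and do not reflect the actual file format of the DNC.

```python
from dataclasses import dataclass

# Label sets as listed in the post. All names below are hypothetical.
VALENCE_LABELS = {"positive", "negative", "both", "neutral", "n/a"}
PERFORMANCE_LABELS = {"standard", "dialect", "n/a"}
CHAT_LABELS = {"chat", "non-chat", "n/a"}


@dataclass(frozen=True)
class AnnotatedMessage:
    """One message with a label on each of the three annotation levels."""
    text: str
    valence: str
    performance: str
    chat: str

    def __post_init__(self):
        # Reject any label that is not part of the annotation scheme.
        if self.valence not in VALENCE_LABELS:
            raise ValueError(f"unknown valence label: {self.valence!r}")
        if self.performance not in PERFORMANCE_LABELS:
            raise ValueError(f"unknown performance label: {self.performance!r}")
        if self.chat not in CHAT_LABELS:
            raise ValueError(f"unknown chat label: {self.chat!r}")


# Invented example message, not taken from the corpus.
example = AnnotatedMessage(
    text="echt supermooi!!",
    valence="positive",
    performance="standard",
    chat="chat",
)
```

Keeping the three levels as separate fields means each one can serve as the target of its own classification task.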
In the experiments, we used three classifiers: Naïve Bayes, Maximum Entropy and Decision Tree. We found that the Naïve Bayes classifier delivers the best results for valence and performance classification, while the Decision Tree classifier achieves the best results for chat classification.
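For readers unfamiliar with Naïve Bayes, the sketch below shows the general idea on a handful of invented Dutch sentences. It is not the setup used in the thesis: the whitespace tokenizer, the add-one smoothing, the function names and the toy data are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercased whitespace tokens: a deliberately naive assumption.
    return text.lower().split()


def train_naive_bayes(examples):
    """Count labels and per-label tokens from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # log P(label) + sum of log P(token | label), add-one smoothed.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # skip tokens never seen during training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy data, not taken from the Dutch Netlog Corpus.
toy_training_data = [
    ("super leuk feestje", "positive"),
    ("echt mooi gedaan", "positive"),
    ("wat een saaie film", "negative"),
    ("echt slecht en saai", "negative"),
]
model = train_naive_bayes(toy_training_data)
prediction = classify(model, "leuk en mooi")
```

The same two functions could in principle be trained on the ‘performance’ or ‘chat’ labels instead of valence, since only the label set changes.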
Mollom is currently integrating the results of this thesis into its service.