Towards Social Media Analysis for Low-Resource Languages
Modern web-based social networks have become platforms where individuals can express personal views and discuss relevant issues in real time. The possibility of analysing this massive aggregation of thoughts and opinions has applications in several domains, ranging from finance and marketing to the social sciences. However, the development of social media analysis tools is still slow and expensive, as most current approaches depend on hand-crafted lexicons, extensive feature engineering and large amounts of labeled data. This is even more problematic for languages other than English, where annotated corpora and linguistic resources are either scarce or non-existent. This talk addresses the issue of building sentiment analysis systems for social media with limited resources. We present two methods that leverage word embeddings computed from raw text to reduce the manual effort required to develop these applications. The first trains a predictive model to induce large-scale lexicons, using embeddings as features and pre-existing lexicons as labeled data. The second jointly learns a classifier and task-specific features from unsupervised embeddings when only small and noisy labeled datasets are available. To do so, we estimate a projection to a small embedding subspace that captures the most relevant information for the task. This allows us to adapt all the word representations, even those of words that do not occur in the labeled data. At the same time, we reduce the number of parameters of the model, thus reducing the risk of overfitting. We used these methods in the 2015 edition of the SemEval Twitter sentiment analysis benchmarks, attaining state-of-the-art results. We will present our participation in this challenge and report on additional experiments that attest to the adequacy of the proposed approaches.
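The subspace idea in the second method can be illustrated with a minimal sketch. The abstract does not specify the model architecture, so everything below is an assumption made for illustration: a message is represented by the sum of its word embeddings, a single matrix `S` projects that sum into a small subspace, and a softmax layer `Y` classifies the result. The names `E`, `S`, `Y`, `predict` and `n_trainable`, the dimensions, and the toy vocabulary are all hypothetical. The pre-trained embeddings `E` stay fixed and only `S` and `Y` would be trained, so a change to `S` alters the projected representation of every word, including words never seen in the labeled data.

```python
import math
import random


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(tokens, E, S, Y):
    """Class probabilities for a tokenised message (assumed architecture).

    E: fixed, pre-trained embeddings (word -> list of d floats)
    S: trainable s x d projection into the task subspace
    Y: trainable k x s classifier weights
    """
    d = len(S[0])
    pooled = [0.0] * d
    for tok in tokens:
        if tok in E:  # words without an embedding are skipped
            pooled = [p + e for p, e in zip(pooled, E[tok])]
    # The projection is linear, so projecting the sum equals summing
    # the projected word vectors.
    h = matvec(S, pooled)
    return softmax(matvec(Y, h))


def n_trainable(d, s, k):
    """Trainable parameters when only S and Y are learned."""
    return s * d + k * s


random.seed(0)
d, s, k = 50, 5, 3  # embedding size, subspace size, number of classes
vocab = ["good", "bad", "awesome", "meh"]  # toy vocabulary
E = {w: [random.gauss(0, 1) for _ in range(d)] for w in vocab}
S = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(s)]
Y = [[random.gauss(0, 0.1) for _ in range(k * 0 + s)] for _ in range(k)]

probs = predict(["awesome", "unknownword", "good"], E, S, Y)
print(probs)
print(n_trainable(d, s, k))
```

Under these assumptions the trainable parameter count is s·d + k·s, which does not depend on the vocabulary size, whereas fine-tuning the embeddings directly would add one d-dimensional vector per word. This is one way to read the abstract's claim that the projection reduces the number of parameters and hence the risk of overfitting.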