In language, codeswitching occurs when a speaker uses two or more languages within a single conversation. The motivations for switching are multifaceted: a person may switch languages to hide certain information from listeners, to express themselves better when a word has no good equivalent in a given language, or to accommodate the person with whom they are speaking. Although some linguistic theories of codeswitching exist, in our work we aim to understand the mechanisms of codeswitching using machine learning approaches.

In this talk we will describe our ongoing efforts to this end. Specifically, we have identified two problems of interest: clustering and prediction. First, we hope to combine clustering algorithms such as k-means with data visualization to understand different codeswitching styles. Second, we plan to train a classifier to predict whether codeswitching will occur in a given utterance; we have identified Naive Bayes as the most promising avenue of attack. This builds on our work during the CSoI workshop, where we identified the most common words that precede a codeswitching event and visualized the dynamics of conversations in which multiple codeswitching styles could be observed.

A common challenge for both of these problems is identifying appropriate word embedding approaches. Most machine learning techniques assume real-valued features, whereas in natural language processing features are often categorical (e.g., words), so constructing suitable feature vectors is a crucial step in a variety of natural language processing tasks.
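To make the two problems concrete, the following is a minimal sketch (not our actual pipeline) of how clustering and prediction could be set up with scikit-learn: bag-of-words counts stand in for the feature vectors discussed above, k-means groups utterances into candidate styles, and a Naive Bayes classifier predicts whether a switch occurs. The utterances and labels here are entirely hypothetical, invented for illustration.

```python
# Minimal sketch of the two problems described above.
# Data is hypothetical; real work would use transcribed conversations.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans
from sklearn.naive_bayes import MultinomialNB

# Hypothetical bilingual utterances; label 1 = a codeswitch occurs.
utterances = [
    "how are you doing today",
    "I went to the tienda yesterday",
    "see you later amigo",
    "the weather is nice",
    "vamos to the park now",
    "that movie was great",
]
labels = [0, 1, 1, 0, 1, 0]

# Turn categorical word features into real-valued count vectors,
# the feature-construction step noted as crucial above.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(utterances)

# Problem 1: cluster utterances into candidate "codeswitching styles".
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignments:", kmeans.labels_)

# Problem 2: Naive Bayes classifier predicting whether a switch occurs.
clf = MultinomialNB().fit(X, labels)
pred = clf.predict(vectorizer.transform(["hasta luego my friend"]))
print("switch predicted:", bool(pred[0]))
```

In practice, bag-of-words counts could be swapped for denser word embeddings; the surrounding clustering and classification code would stay the same.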
Event Link: https://csoi.adobeconnect.com/education