Last semester, I took a seminar on “Deep Learning”. Trapit Bansal, Jun Wang, and I did our final project on sentiment analysis for Twitter (i.e. SemEval-2016 Task 4). Here’s a quick summary of what we did.
First of all, there are several great resources on understanding recurrent neural networks (RNNs) and long short-term memory (LSTM) networks:
- The Unreasonable Effectiveness of Recurrent Neural Networks. Andrej Karpathy
- Visualizing and Understanding Recurrent Networks. Andrej Karpathy, Justin Johnson, Li Fei-Fei
- Understanding LSTM Networks. Christopher Olah
So, I won’t go into the details here, since those resources explain the model a lot better than I can. The important thing is that LSTMs are a type of RNN that learns what to forget from past observations. This makes them somewhat more robust to noise, and better able to capture “long-term dependencies” in a sequence. In NLP, when people say “long-term dependency”, they usually mean something like a longer-than-bigram dependency.
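To make the “learning to forget” part concrete, here’s a minimal sketch of a single LSTM step in NumPy. The weight shapes, gate ordering, and random initialization are all illustrative assumptions, not our actual trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,).
    Assumed gate order in the stacked matrices: forget, input, output, candidate."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:H])        # forget gate: how much of the old cell state to keep
    i = sigmoid(z[H:2*H])      # input gate: how much new information to write
    o = sigmoid(z[2*H:3*H])    # output gate: how much of the cell state to expose
    g = np.tanh(z[3*H:4*H])    # candidate cell contents
    c = f * c_prev + i * g     # the "forgetting" happens here: f near 0 erases memory
    h = o * np.tanh(c)         # hidden state passed on to the next step
    return h, c

# Tiny demo with random weights (D = 3 input dims, H = 2 hidden units).
rng = np.random.default_rng(0)
D, H = 3, 2
W, U, b = rng.normal(size=(4*H, D)), rng.normal(size=(4*H, H)), np.zeros(4*H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):   # run over a length-5 input sequence
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)  # (2,) (2,)
```

Because the forget gate `f` multiplies the previous cell state at every step, the network can learn to hold onto a signal (like an early negation) across many timesteps, or to drop it when it stops being useful.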
> He wouldn't say that he hated the movie, per se.
How did ‘he’ feel about the movie? If I were reading this sentence using e.g. a trigram-based model, I would probably say that he hated the movie. Using a model with a longer memory allows me to capture the effect of “He wouldn’t say that…” on the semantics.
In our project, we compared word-level models vs. character-level models. In word-level models, the objective is to learn a distributed representation of words in a lexicon. Think word2vec. While these kinds of models give you the famous “vector(‘king’) - vector(‘man’) + vector(‘woman’) = vector(‘queen’)” behavior, there are a couple of problems. Word-level models use a fixed word lookup table at test time to get a learned embedding for a word. As Ling et al. (2015) point out, “This means that word lookup tables cannot generate representations for previously unseen words, such as Frenchification, even if the components French and -ification are observed in other contexts.” These lookup tables also have to store all of the morphological variations of each word – moreover, the vectors corresponding to these variations are independent. These two problems are particularly relevant in the context of Twitter, where people make up new words all the time and spell them however they want.
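Here’s a hypothetical sketch of why a word lookup table fails on unseen words. The tiny vocabulary, the `<unk>` convention, and the random vectors are all made up for illustration; a real model would learn its embeddings during training:

```python
import numpy as np

# Hypothetical word lookup table. In a trained model these rows would be
# learned embeddings; here they are random vectors just for illustration.
vocab = {"<unk>": 0, "french": 1, "-ification": 2, "movie": 3}
emb = np.random.default_rng(0).normal(size=(len(vocab), 4))  # 4-dim vectors

def lookup(word):
    # Any word outside the table collapses to the single <unk> vector,
    # even when its pieces ("french", "-ification") were seen in training.
    return emb[vocab.get(word.lower(), vocab["<unk>"])]

# "Frenchification" and pure gibberish get the exact same representation.
print(np.array_equal(lookup("Frenchification"), lookup("zxqw")))  # True
```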
Character-level models, on the other hand, store representations for each character in whatever character vocabulary you decide to use. We compared ASCII (English) to UTF-8 (lots of languages, including English). The “lookup tables” are more compact but there are more parameters to learn during training. Our thinking was, this tradeoff is okay: tweets can’t be longer than 140 characters anyway. Also, as a sidenote, I think it’s kind of neat that the UTF-8 character model is implicitly multi-lingual.
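To show the vocabulary tradeoff concretely, here’s a rough sketch of the two encodings. I’m assuming the byte-level view of UTF-8 (256 possible values) versus the 128 ASCII code points; the exact encoding scheme our models used may have differed in details:

```python
# ASCII model: indexes the 128 ASCII code points; anything else is unknown.
def ascii_encode(tweet):
    return [ord(ch) if ord(ch) < 128 else None for ch in tweet]  # None = unknown

# Byte-level "UTF-8" model: indexes the 256 possible byte values, so text
# in any language encodes without an unknown symbol.
def utf8_encode(tweet):
    return list(tweet.encode("utf-8"))  # every value is in [0, 256)

tweet = "I love Montréal"
print(ascii_encode(tweet))   # the 'é' falls outside the ASCII table
print(utf8_encode(tweet))    # 'é' becomes two bytes, 195 and 169
```

The byte-level sequence can be longer than the character count (multi-byte characters), but with tweets capped at 140 characters the sequences stay short either way.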
Here are some visuals showing our ASCII model’s confidence (as measured by the categorical cross-entropy) over the course of a character sequence.
Once the model has read “I lov”/”I hat”, its sentiment prediction for the two sentences diverges.
In the first two cases, the model correctly predicts “positive” sentiment, regardless of the variation “likes” vs. “does like”. The second two cases show that it picks up on the effect of negation: it ends up in the “negative” region after reading “does not like” and “doesn’t like”. The last case shows that the model is sensitive to morphology – it still picked up on the negation, even though “not” was encoded as an affix on “does” rather than its own lexical item. Note: here, the y-axis should read “confidence in ‘positive’ prediction”.
Here are visualizations of some actual tweets:
(Note: the y-axis should read “confidence in ‘negative’ prediction”)