Online conversations, especially around contentious topics, are complex and dynamic. Mapping them is not just a matter of gathering enough data and applying sophisticated algorithms. It's critical to adjust the map to the questions you want to answer; as with models in general, no map is true, but some are more useful than others in some contexts.
This post shows a quick example of a type of map I've found useful in practice. It shows the key semantics of tweets around the Irish abortion referendum this past May, based on data collected by Justin Littman. This is how the main topics of conversation looked on May 3 — the size of each "star" in this constellation represents the relative weight of mentions of each key term (click on each graph to enlarge):
Points close to each other represent semantically similar terms (e.g., Dublin, Ireland, and irish), and the size of each "star" is proportional to the relative weight of each term in the set of key terms.
Generating a map like this takes multiple steps, and each step requires choices that can make the map relevant or irrelevant to a specific use case. In this case, I wanted to know how the focus of the debate shifted over time, and to represent that shift graphically. This final goal shaped the process:
- I applied a key-term extraction algorithm to large sets of tweets from May 3, treating the entire discussion as a single text; this is inappropriate, of course, if you want to see how different groups talk, or to compare and contrast hashtags, but it's compatible with my specific goal of drawing a map of the general debate's main semantic contents.
- I then used the well-known GloVe embeddings to represent each key term in a high-dimensional vector space — for a map's geometry to make sense, geometric relationships between vectors have to be meaningful, and this is one of the key benefits of this kind of representation.
- After that, I performed an isometric embedding of those key vectors into a two-dimensional plane — a disastrous step if we were to pass the data on to another algorithm, but one that allows the drawing of a human-readable map where, if nothing else, nearness between words represents semantic nearness.
- Using those projected vectors as a reference grid, we can finally plot the "constellation" representing how much each key concept was mentioned during that day, in a way that's hopefully more semantically meaningful than a word cloud or list of mentions.
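The steps above can be sketched end to end. Everything here is a stand-in: the token list and vectors are hypothetical (real GloVe vectors are 50–300 dimensional and loaded from the published files), a plain word-frequency count stands in for a proper key-term extractor, and classical MDS stands in for the isometric embedding.

```python
import numpy as np
from collections import Counter

# Step 1: key-term weights (a frequency count standing in for a real
# key-term extractor; the tokens are hypothetical).
tokens = "dublin ireland irish vote home dublin vote repeal home home".split()
weights = Counter(tokens)

# Step 2: one vector per key term (random stand-ins for GloVe lookups).
rng = np.random.default_rng(0)
vectors = {term: rng.normal(size=8) for term in weights}

# Step 3: distance-preserving projection to 2-D via classical MDS:
# double-center the squared-distance matrix and keep the top two axes.
terms = sorted(weights)
X = np.array([vectors[t] for t in terms])
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
n = len(terms)
J = np.eye(n) - 1.0 / n                              # centering matrix
B = -0.5 * J @ D2 @ J                                # Gram matrix
vals, vecs = np.linalg.eigh(B)                       # ascending eigenvalues
coords = vecs[:, -2:] * np.sqrt(np.maximum(vals[-2:], 0.0))

# Step 4: each term now has (x, y, size) — ready to plot as a "star".
stars = {t: (coords[i, 0], coords[i, 1], weights[t]) for i, t in enumerate(terms)}
```

With real data, the only changes are swapping in the actual extractor output for `weights` and the actual GloVe lookups for `vectors`; the geometry steps stay the same.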
Every one of these choices, in a very real sense, is about what information to discard, and what possible insights to lose. We retained the information relevant to what we wanted to do — a two-dimensional representation of the main concepts, so we could later see how they changed — but this same pipeline would be worse than useless, for example, for the early detection of new topics.
Even choosing to encode individual words is a potentially questionable choice. We could, for example, use something like Facebook's InferSent or Google's Universal Sentence Encoder to encode the full text of each tweet, and then use standard dimensionality reduction algorithms like PCA or a narrow autoencoder to turn those points into a map. That's potentially very useful for some operations, but it turns out to be less than effective at creating a global map of the debate.
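That alternative path can be sketched the same way, again with stand-ins: random vectors play the role of InferSent or Universal Sentence Encoder outputs (the real ones are 512+ dimensional), and PCA via SVD plays the role of the dimensionality reduction.

```python
import numpy as np

# One row per encoded tweet; random stand-ins for sentence-encoder output.
rng = np.random.default_rng(1)
tweet_vectors = rng.normal(size=(200, 64))

# PCA via SVD on the mean-centered matrix; keep the top two components.
centered = tweet_vectors - tweet_vectors.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
points_2d = centered @ Vt[:2].T  # each tweet becomes a 2-D point
```

Note the unit here is the tweet, not the key term — which is exactly why this map answers a different question than the constellation above.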
This is what the tweets above look like through the eyes of an autoencoder projection to two dimensions, applied to each tweet encoded using Google's deep learning sentence encoder:
As we can see, the structure this process picks up is almost entirely one-dimensional — and, poking at the individual tweets, shows no obvious semantic pattern. Complex models (and sentence encoding models are quite complex, if easy to use once trained) capture a lot of patterns. Blindly applied — asking "what do you think of this?" instead of a specific question — they aren't likely to return anything of immediate use. (This isn't to say that having a human ask the questions is always the right choice; you want a robot in charge of a system to have access to the full picture, not one filtered and simplified enough for a human to be able to follow. Humans viewing a chart and strategic software reacting to a situation have vastly different capabilities, and the last thing you want to do is code for the lowest common denominator — but that's a different discussion.)
Back to our less powerful process, better tailored to the human-readable map we want to make: this is how the semantic constellation looked on May 24, one day before the referendum, with each key term/star's previous size overlaid as a thin red circle for reference:
The weight of most key terms is relatively unchanged, but we see how terms like Dublin, tomorrow, and home gained relative weight (while others like baby, great, and even 8th lost it); this is compatible with both the timing of these tweets (the day before the vote), and the visibility of multiple initiatives to "bring home" people specifically for the referendum.
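The comparison behind the red reference circles can be sketched with hypothetical counts — the numbers below are made up to illustrate the computation, not taken from the dataset:

```python
# Hypothetical mention counts for a handful of key terms on the two dates.
may3  = {"dublin": 120, "baby": 90, "home": 40, "great": 60}
may24 = {"dublin": 210, "baby": 50, "home": 150, "great": 30}

def relative(counts):
    """Normalize raw counts into relative weights that sum to 1."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

before, after = relative(may3), relative(may24)

# Terms whose relative weight grew between the two snapshots; in the
# plot, these are the stars that outgrew their red reference circles.
gained = {t for t in after if after[t] > before.get(t, 0)}
```

Comparing relative rather than raw weights matters: overall volume changes as the vote approaches, so a term can gain mentions while losing share of the conversation.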
The point isn't that this particular combination of processing (keyword extraction), geometrical semantics (isometric embedding of a larger GloVe embedding), and change representation (overlapped circles) should be used in every or even most cases. The graph above is one possible answer to one specific question ("how did the things talked about the most in the Twitter debate over the referendum change between those two dates?"), and we have seen how every stage of the processing had to be tailored towards answering it. Different questions — differences between groups of users, fast-emerging topics, links between sentiment and topic — will demand different approaches.
We aren't yet at a point where we can answer everything we want to know about an online conversation with a single modeling approach. This is not for lack of data or processing power, but perhaps because we have yet to work out the right concepts for thinking about the problem — as if we had the astronomical observations but had not yet derived the physics. If so, we can look forward to a radical change in marketing, politics, sociological analysis, and every other area dealing with large-scale discourse once we do, and the work we do now, ad hoc as it is, is bit by bit getting us closer to that.