Extracting topics from Twitter and representing them using Wikipedia page titles


Last year, we published a research article that we are very excited about. In it, we examined whether Wikipedia page titles can be used to represent the topics that are talked about on Twitter, and we proposed an approach for doing so.

In contrast to existing topic extraction methods that extract topics from a single tweet, our approach extracts topics from multiple posts. We assume that the elements of a topic that a crowd talks about are distributed across multiple tweets by multiple users.

Our approach also differs in how it represents topics. Other approaches represent a topic as a set of words (as in LDA), as phrases, or as representative tweets. Ours represents topics using Wikipedia page titles.

The computation is conceptually simple: we compare the content of the tweets with the content of Wikipedia pages using cosine similarity. In practice it is not cheap, as Wikipedia has over four million articles. The comparison also needs to account for how distinctive each word in the tweets is. To measure the distinctiveness of words, the inverse document frequency (idf) of each word has to be computed.
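To make the idea concrete, here is a minimal Python sketch of an idf-weighted cosine similarity between a group of tweets and a handful of pages. It is not the implementation from the paper. The tokenizer, the plain log(N / df) idf formula, and all function names (tokenize, idf_table, tfidf_vector, cosine, rank_pages) are assumptions made for illustration, and it works on a toy collection rather than on all of Wikipedia.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphabetic tokens are good enough for a sketch.
    return re.findall(r"[a-z]+", text.lower())


def idf_table(documents):
    # idf(w) = log(N / df(w)); words that occur in every document get weight 0.
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    return {word: math.log(n / count) for word, count in df.items()}


def tfidf_vector(text, idf):
    # Sparse vector: term frequency times idf, ignoring words without an idf.
    tf = Counter(tokenize(text))
    return {word: count * idf[word] for word, count in tf.items() if word in idf}


def cosine(a, b):
    dot = sum(value * b[word] for word, value in a.items() if word in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_pages(tweets, pages):
    # Treat all tweets in the group as one document, then score every page.
    idf = idf_table(list(pages.values()))
    tweet_vec = tfidf_vector(" ".join(tweets), idf)
    scores = {
        title: cosine(tweet_vec, tfidf_vector(text, idf))
        for title, text in pages.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data, invented for this example.
pages = {
    "Big Bird": "big bird is a character on sesame street broadcast by pbs",
    "Tax": "a tax is a compulsory charge imposed by a government",
    "Abortion": "abortion is the ending of a pregnancy",
}
tweets = ["Romney wants to cut PBS", "save big bird!"]
ranking = rank_pages(tweets, pages)
```

Joining the tweets into one document before scoring reflects the assumption stated earlier that a topic is spread over many tweets. At the scale of the full Wikipedia, a sparse index would be needed instead of this pairwise loop.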



For the details of the computation, you can see the published paper. For now, I want to show some interesting results here.

We tested our method on sets of tweets gathered during the 2012 US election debates. Debate tweets are interesting because they span a wide range of topics, such as women's issues, taxes, unemployment, and social services.

The following figure shows the scores of two topics over the course of the debates. The first plot (a) is for the Wikipedia page "Big Bird", and the second plot is for "Christianity and abortion".

Big Bird received high scores around the 28th minute of the first presidential debate. Mitt Romney (the Republican Party's candidate in that election) said something like: I will cut the subsidy to PBS, even though I love Big Bird. It received a quick response on Twitter.

Abortion is a critical subject in the United States. Whoever the candidates are, they need to say something about abortion. The moderator asked both vice presidential candidates about their opinion on abortion, as both are Catholic. Since the public is sensitive to this matter, it also received a quick response. Note that no one used the phrase "Christianity and abortion" in their tweets. Our approach revealed this topic by considering the aggregate of what was being talked about.

The following picture is a heat map of several topics over the course of the debates.
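The data behind a heat map like this can be thought of as a topic-by-time matrix of scores. The sketch below shows one way such a matrix could be built. The one-minute window, the function names (bucket_by_minute, score_matrix), and the pluggable score function are assumptions for illustration and are not taken from the paper; any scorer, such as a cosine similarity against the Wikipedia page, could be passed in.

```python
from collections import defaultdict


def bucket_by_minute(timed_tweets, start):
    # Group tweet texts by whole minutes elapsed since `start` (in seconds).
    buckets = defaultdict(list)
    for timestamp, text in timed_tweets:
        buckets[int((timestamp - start) // 60)].append(text)
    return buckets


def score_matrix(timed_tweets, start, topics, score):
    # One row per topic, one column per minute, from minute 0 to the last
    # minute that has tweets. `score(texts, topic)` is supplied by the caller.
    buckets = bucket_by_minute(timed_tweets, start)
    minutes = range(max(buckets) + 1) if buckets else range(0)
    return {
        topic: [score(buckets.get(minute, []), topic) for minute in minutes]
        for topic in topics
    }


# Toy data and a placeholder scorer that counts tweets mentioning the title.
timed_tweets = [(0, "hello"), (30, "big bird"), (125, "Big Bird and PBS")]


def mention_count(texts, topic):
    return sum(topic.lower() in text.lower() for text in texts)


matrix = score_matrix(timed_tweets, 0, ["Big Bird"], mention_count)
```

Each row of the resulting matrix can then be drawn as one line of the heat map, with darker cells for higher scores.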

The third debate was about foreign affairs. Our approach gave high scores to related topics such as Foreign relations of China, Iran-United States relations, Israel-United States relations, and Foreign policy of the United States. The darkness of the cells in the heat map shows that, in the second half of the third debate, the topics talked about were, in order: Israel-United States relations, Iran-United States relations, Israel-United States relations again, Osama bin Laden, and China.

Other topics are also meaningful. For instance, Obamacare (the Patient Protection and Affordable Care Act) was mostly an issue in the first presidential debate, as can be seen from the heat map.

We have also run experiments on the 2016 debates between Donald Trump and Hillary Clinton. The results will be published soon.

Stay tuned!! :)