Interesting papers from WWW2016

Posted on September 28, 2016 • Tagged with interesting papers

I have been a big fan of the World Wide Web conference for a long time. Here is my pick of the top 5 papers from this year's conference (full proceedings):

The QWERTY Effect on the Web: How Typing Shapes the Meaning of Words in Online Human-Computer Interaction

A previous psycholinguistics study [1] found evidence for the QWERTY effect: words with more letters from the right side of the QWERTY keyboard tend to be associated with more positive valence. In this paper, the authors examine several large datasets of user ratings and reviews collected from various web services, and find that the QWERTY effect also exists on the web, in both text interpretation / decoding (user ratings) and text creation / encoding (user reviews). full paper
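To make the idea concrete, here is a minimal sketch of a right-minus-left letter count for a word. The function name right_side_advantage and the exact left/right split of the keys are my assumptions for illustration, not code from the paper.

```python
# Assumed split of the letter keys between the two sides of a QWERTY keyboard.
LEFT_KEYS = set("qwertasdfgzxcvb")
RIGHT_KEYS = set("yuiophjklnm")


def right_side_advantage(word):
    """Return (# right-side letters) - (# left-side letters) for a word."""
    letters = [c for c in word.lower() if c.isalpha()]
    right = sum(c in RIGHT_KEYS for c in letters)
    left = sum(c in LEFT_KEYS for c in letters)
    return right - left


for w in ["pull", "taxes", "lollipop"]:
    print(w, right_side_advantage(w))
```

A positive score means the word leans towards the right side of the keyboard. The QWERTY effect is the claim that such scores correlate with valence across many words, not that any single word must follow the pattern.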

Disinformation on the Web: Impact, Characteristics, and Detection of Wikipedia Hoaxes

What does a hoax Wikipedia article look like, and what impact do such articles have?

The authors of this paper answer these questions by studying more than 20,000 hoax articles on Wikipedia. They find that even though Wikipedia is very efficient at identifying and removing hoax articles, a small number of them still survive for a long time, receive lots of page views, and even get cited by external sources.

The authors also find that it is relatively easy to make a hoax article appear legitimate by adding lots of text, wiki markup, or links. However, since the outlinks of legitimate articles are much more coherent than those of hoax articles, it is not easy to create a "realistic" network fingerprint for a hoax article. Using such "beyond the surface" features, the authors build machine learning models that outperform humans at detecting hoax articles. full paper

Social Networks Under Stress

In this paper, the authors study how the structure and communication patterns of a social network are affected by external events. To do so, they examine a dataset containing millions of instant messages sent among decision makers in a hedge fund, and between those decision makers and their clients. They find that, when facing "stress" (changes in stock price), the decision makers in the studied network tend to "turtle up" their communications (stronger ties and a higher clustering coefficient) instead of opening up. They also find that the cognition and affect levels of communications in the network change as the stock price fluctuates. full paper
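For readers who have not met the clustering coefficient before, here is a minimal sketch of the average local clustering coefficient on a toy undirected graph. The graph and the function name are made up for illustration, and the paper may compute the measure differently (e.g., with edge weights or over time windows).

```python
from itertools import combinations


def average_clustering(adj):
    """Average local clustering coefficient of an undirected graph.

    adj maps each node to the set of its neighbours.
    """
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes with fewer than 2 neighbours contribute 0
        # Count pairs of neighbours that are themselves connected.
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


# Toy graph: a triangle a-b-c plus a pendant node d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(average_clustering(toy))
```

A higher value means that a node's contacts also tend to talk to each other, which is the "turtling up" pattern described above.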

Detecting Good Abandonment in Mobile Search

For a long time, abandoned search queries have been considered an indication of user dissatisfaction. However, there are cases where users are satisfied without clicking on anything on the search result page. Using labeled datasets collected from a small-scale user study and from crowdsourcing, the authors find a statistically significant correlation between certain user gestures on mobile phones (e.g., swipes) and user satisfaction with abandoned searches. full paper

The Lifecycle and Cascade of WeChat Social Messaging Groups

As the largest messaging service in China, WeChat has more than 600 million Monthly Active Users (as of April 2016). Compared with other large-scale online communities (e.g., Facebook and Twitter), little is known about the group dynamics of WeChat. In this paper, the authors examine a dataset that covers ~500 thousand groups and ~250 million users in WeChat. They find a strong dichotomy in group lifetime: many groups are active for only a few days, while many others remain active for more than 30 days. The long-lived and short-lived groups differ in various ways, including triad count [2], edge density, Wiener index [3], etc. In addition, the authors study the membership cascade process, i.e., how existing group members invite new users to join a group. full paper
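As a rough illustration of the structural features mentioned above, here is a minimal sketch that computes edge density, the number of closed triads (triangles), and the Wiener index for a small undirected graph. The toy group and function names are made up, the graph is assumed to be connected, and the paper's exact definitions (for example, which kinds of triads are counted) may differ.

```python
from collections import deque
from itertools import combinations


def edge_density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2.0 * m / (n * (n - 1)) if n > 1 else 0.0


def closed_triads(adj):
    """Number of triangles; each one is seen once from each of its 3 nodes."""
    seen = 0
    for nbrs in adj.values():
        seen += sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return seen // 3


def wiener_index(adj):
    """Sum of shortest-path lengths over all unordered node pairs."""
    total = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:  # breadth-first search from src
            cur = queue.popleft()
            for nxt in adj[cur]:
                if nxt not in dist:
                    dist[nxt] = dist[cur] + 1
                    queue.append(nxt)
        total += sum(dist.values())
    return total // 2  # every pair was counted from both ends


# Toy group: a triangle a-b-c plus a member d who only knows c.
group = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(edge_density(group), closed_triads(group), wiener_index(group))
```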

  1. The QWERTY Effect: How typing shapes the meanings of words. Jasmin, K. & Casasanto, D. Psychon Bull Rev (2012) 19: 499. doi:10.3758/s13423-012-0229-7 

  2. Triad. Wikipedia 

  3. Wiener Index. Wikipedia 

Plotting like xkcd

Posted on September 10, 2016 • Tagged with python, visualization

I have been using ggplot2 for years, and it has been my favourite plotting system. The one thing I had been missing was an xkcd theme, until I recently discovered that ggplot, the plotting system for Python based on ggplot2, comes with one.

Applying the xkcd theme with ggplot doesn't need much explanation: it's exactly the same as applying themes like theme_bw(). Let's just plot something using Fisher's Iris data set [1]:

from ggplot import *  # assumes iris_data is a DataFrame that is already loaded
(ggplot(iris_data, aes(x='sepal_length', colour='class')) +
                stat_density() +
                theme_xkcd())


(ggplot(iris_data, aes(x='class', y='petal_width')) +
                geom_boxplot() +
                theme_xkcd())


So, when will The Oatmeal theme be added?

  1. Fisher's Iris data set is provided by UCI Machine Learning Repository, University of California, Irvine. 

Interesting papers from KDD2016

Posted on September 07, 2016 • Tagged with interesting papers

Skinny-dip: Clustering in a Sea of Noise

This paper introduces SkinnyDip, a clustering algorithm that leverages the dip test of unimodality. SkinnyDip is robust to noise and can detect clusters of varying shape and density. In addition, SkinnyDip does not perform pair-wise distance calculations, and its run-time grows linearly with the size of the data. link

Smart Reply: Automated Response Suggestion for Email

This paper describes the system architecture and algorithms used to build Google's Smart Reply feature in Inbox. link

“Why Should I Trust you?” Explaining the Predictions of Any Classifier

It is always challenging to interpret complex models like random forests and neural networks. This paper introduces a novel technique that explains the predictions of any complex classifier by training simple, interpretable models (e.g., linear models) locally around the predictions. link
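The core idea can be sketched in a few lines: perturb the instance, query the black box, and fit a distance-weighted linear model to the responses. What follows is a simplified illustration in that spirit, not the authors' implementation; black_box, explain_locally, and all parameter values are made up for the example, and the paper's method additionally works with interpretable representations and sparse linear models.

```python
import math
import random


def black_box(x):
    """Stand-in for a complex classifier: returns P(class = 1)."""
    return 1.0 / (1.0 + math.exp(-(x[0] ** 2 - 3.0 * x[1])))


def solve(a, b):
    """Solve a small linear system a * w = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [p - factor * q for p, q in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def explain_locally(predict, x0, n_samples=2000, scale=0.3, width=0.5, seed=0):
    """Fit a weighted linear surrogate to `predict` around the point x0.

    Returns (intercept, slopes); the slopes act as local feature importances.
    """
    rng = random.Random(seed)
    d = len(x0)
    # Accumulate the weighted normal equations (X^T W X) w = X^T W y.
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x0]  # perturbed sample
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x0))
        weight = math.exp(-dist2 / width ** 2)  # closer samples count more
        row = [1.0] + [zi - xi for zi, xi in zip(z, x0)]
        y = predict(z)
        for i in range(d + 1):
            xtwy[i] += weight * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += weight * row[i] * row[j]
    coefs = solve(xtwx, xtwy)
    return coefs[0], coefs[1:]


intercept, slopes = explain_locally(black_box, [1.0, 0.0])
print(intercept, slopes)
```

Around the point (1, 0), the surrogate should assign a positive slope to the first feature and a negative slope to the second, matching how the toy black box behaves in that neighbourhood.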

Overcoming Key Weaknesses of Distance-based Neighbourhood Methods using a Data Dependent Dissimilarity Measure

This paper proposes the mass-based dissimilarity, a dissimilarity measure that captures the key property of dissimilarity perceived by humans, i.e., two instances in a dense region are less similar to each other than two instances of the same pair-wise distance in a sparse region. link
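To see why a data-dependent measure behaves this way, here is a deliberately simplified one-dimensional sketch. It assumes the dissimilarity of two points is the fraction of the data that falls in the smallest interval covering both. The function name and toy data are made up; as far as I understand, the paper estimates such region masses with randomised tree partitions rather than plain intervals.

```python
def mass_dissimilarity_1d(x, y, data):
    """Fraction of `data` lying in the smallest interval covering x and y."""
    lo, hi = min(x, y), max(x, y)
    return sum(lo <= v <= hi for v in data) / len(data)


# Toy data: a dense cluster near 0 and two isolated points.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 5.0, 9.0]

# Both pairs are 0.4 apart in plain distance.
dense_pair = mass_dissimilarity_1d(0.1, 0.5, data)
sparse_pair = mass_dissimilarity_1d(5.0, 5.4, data)
print(dense_pair, sparse_pair)
```

The pair in the dense region gets the larger dissimilarity even though both pairs have the same pair-wise distance, which is the property described above.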

XGBoost: A Scalable Tree Boosting System

In this paper, the authors of XGBoost explain in detail the design of XGBoost and why it works well. link

Just One More: Modeling Binge Watching Behavior

Nowadays, many people watch several episodes or even a whole season of a TV show in a single session, which is referred to as "binge watching". This paper introduces a statistical mixture model that characterizes such binge watching behavior in a real-world Video-on-Demand service. link