Sometimes (Data) Science

Interesting papers from KDD2016

Skinny-dip: Clustering in a Sea of Noise

This paper introduces SkinnyDip, a clustering algorithm that leverages the dip test of unimodality. SkinnyDip is robust to noise and can detect clusters of varying shape and density. In addition, SkinnyDip does not perform pair-wise distance calculations, and its run-time grows linearly with the size of the data. link

Smart Reply: Automated Response Suggestion for Email

This paper describes the system architecture and algorithms used to build Google's Smart Reply feature in Inbox. link

“Why Should I Trust You?” Explaining the Predictions of Any Classifier

It is always challenging to interpret complex models such as random forests and neural networks. This paper introduces a novel technique that explains the predictions of any classifier by training simple, interpretable models (e.g., linear models) locally around each prediction. link
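The idea of a local surrogate can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's full method: it perturbs the instance with Gaussian noise, weights samples with an exponential proximity kernel, and fits a weighted linear model. The names `explain_locally` and `black_box`, the noise scale, and the kernel width are all hypothetical choices made for this example.

```python
import numpy as np

def explain_locally(predict_fn, x, n_samples=2000, scale=0.5, kernel_width=1.0, seed=0):
    """Fit a weighted linear surrogate to predict_fn around the instance x."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Perturb the instance with Gaussian noise to probe the model nearby.
    samples = x + rng.normal(0.0, scale, size=(n_samples, x.size))
    preds = predict_fn(samples)
    # Weight each sample by its proximity to x (exponential kernel).
    dists = np.linalg.norm(samples - x, axis=1)
    weights = np.exp(-(dists ** 2) / (kernel_width ** 2))
    # Weighted least squares: scale rows by sqrt(weight), then solve.
    design = np.hstack([np.ones((n_samples, 1)), samples - x])
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * sw[:, None], preds * sw, rcond=None)
    return coef[0], coef[1:]  # intercept and per-feature local effects

def black_box(X):
    # Hypothetical non-linear classifier score standing in for a real model.
    return 1.0 / (1.0 + np.exp(-(3.0 * X[:, 0] - 2.0 * X[:, 1] ** 2)))

intercept, effects = explain_locally(black_box, [0.0, 1.0])
print("local effects:", effects)
```

Around the point (0, 1), the surrogate's coefficients indicate that the first feature pushes the score up and the second pushes it down, which is the kind of per-prediction explanation the paper is after.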

Overcoming Key Weaknesses of Distance-based Neighbourhood Methods using a Data Dependent Dissimilarity Measure

This paper proposes mass-based dissimilarity, a data-dependent dissimilarity measure that captures a key property of dissimilarity as perceived by humans: two instances in a dense region are less similar to each other than two instances separated by the same pair-wise distance in a sparse region. link
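A toy one-dimensional example may help make the dense-versus-sparse property concrete. The sketch below is a deliberate simplification and an assumption of this example, not the paper's estimator: it scores a pair by the fraction of data points lying in the interval the pair spans. The function name `mass_dissimilarity_1d` and the synthetic data are hypothetical.

```python
import numpy as np

def mass_dissimilarity_1d(data, x, y):
    """Fraction of data points in the closed interval spanned by x and y."""
    lo, hi = min(x, y), max(x, y)
    data = np.asarray(data, dtype=float)
    return float(np.mean((data >= lo) & (data <= hi)))

rng = np.random.default_rng(0)
# Hypothetical data: a dense cluster near 0 and a sparse one near 10.
data = np.concatenate([rng.normal(0.0, 0.5, 900), rng.normal(10.0, 2.0, 100)])

# Both pairs are exactly 0.5 apart in ordinary distance.
dense_pair = mass_dissimilarity_1d(data, -0.25, 0.25)
sparse_pair = mass_dissimilarity_1d(data, 9.75, 10.25)
print(dense_pair, sparse_pair)
```

Although the two pairs are equally far apart geometrically, the pair in the dense cluster covers far more data mass and therefore receives the larger dissimilarity.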

XGBoost: A Scalable Tree Boosting System

In this paper, the authors of XGBoost explain in detail how XGBoost is designed and why it works well. link

Just One More: Modeling Binge Watching Behavior

Nowadays, many viewers watch several episodes, or even a whole season, of a TV show in a single session, a behavior referred to as "binge watching". This paper introduces a statistical mixture model that characterizes binge-watching behavior in a real-world Video-on-Demand service. link