Abstract:
The world's corpora of data grow in size and complexity every day, making it
increasingly difficult for experts to make sense of their data. Although
machine learning offers algorithms for finding patterns in data
automatically, these algorithms often require algorithm-specific parameters, such as an
appropriate distance function, which are outside the purview of a domain
expert. We present a system that allows an expert to interact directly with a
visual representation of the data to define an appropriate distance function,
thus avoiding direct manipulation of abstruse model parameters. Adopting an
iterative approach, our system first assumes a uniformly weighted Euclidean
distance function and projects the data into a two-dimensional scatterplot
view. The user can then move incorrectly positioned data points to locations
that reflect his or her understanding of the similarity of those data points
relative to the other data points. Based on this input, the system performs
an optimization to learn a new distance function and then re-projects the
data to redraw the scatterplot. We illustrate empirically that with only a
few iterations of interaction and optimization, a user can achieve a
scatterplot view and its corresponding distance function that reflect the
user's knowledge of the data. In addition, we evaluate how our system scales
with data size and dimensionality, and show that it is
computationally efficient and can provide an interactive or near-interactive
user experience.
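
The interaction loop described above can be illustrated with a minimal sketch.
It is not the system's actual optimization: the objective (a least-squares
match between squared weighted distances and the squared on-screen distances
among the moved points), the projected gradient descent, and the names
weighted_distance and learn_weights are all assumptions made for illustration.
The projection and redraw steps are omitted.

```python
import math


def weighted_distance(a, b, w):
    """Weighted Euclidean distance: sqrt(sum_k w_k * (a_k - b_k)^2)."""
    return math.sqrt(sum(wk * (ak - bk) ** 2 for wk, ak, bk in zip(w, a, b)))


def learn_weights(points, targets, w, steps=200, lr=0.05):
    """Hypothetical weight update (an assumed objective, not the paper's).

    points  : high-dimensional rows for the data points the user moved
    targets : the 2D positions the user dragged those points to
    w       : current per-dimension weights (uniform on the first iteration)

    Minimizes the squared mismatch between squared weighted distances and
    squared 2D target distances, clipping weights at zero after each step.
    The step size lr is data dependent and was chosen for the toy data below.
    """
    n = len(points)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    # Per-pair squared coordinate differences in the original space.
    sq = [[(a - b) ** 2 for a, b in zip(points[i], points[j])] for i, j in pairs]
    # Per-pair squared distances in the user's 2D layout.
    goal = [sum((a - b) ** 2 for a, b in zip(targets[i], targets[j]))
            for i, j in pairs]
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, g in zip(sq, goal):
            resid = sum(wk * s for wk, s in zip(w, row)) - g
            for k, s in enumerate(row):
                grad[k] += 2.0 * resid * s
        # Gradient step followed by projection onto non-negative weights.
        w = [max(0.0, wk - lr * gk) for wk, gk in zip(w, grad)]
    return w


# Toy data: the user arranges four points so that only dimension 0 matters.
points = [[0, 0, 0], [1, 0, 1], [0, 1, 1], [1, 1, 0]]
targets = [[p[0], 0.0] for p in points]
weights = learn_weights(points, targets, [1 / 3] * 3)
```

In this toy case the learned weights concentrate on the first dimension, so
points that differ only in the other two dimensions end up at a weighted
distance near zero. A full loop would re-project the data with the new weights
and wait for the next round of user input.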