Abstract:
Performing exhaustive searches over a large number of text documents can be
tedious, since it is very hard to formulate search queries or define filter
criteria that capture an analyst's information need adequately.
Classification through machine learning has the potential to improve search
and filter tasks that involve either complex or very specific information
needs. Unfortunately, analysts who are knowledgeable in their
field are typically not machine learning specialists. Most classification
methods, however, require a certain degree of expertise in their parametrization
to achieve good results. Supervised machine learning algorithms, in contrast,
rely on labeled data, which analysts can provide. However, the labeling
effort can be very high, so the problem merely shifts from composing
complex queries or defining accurate filters to another laborious task, and
analysts additionally have to judge the quality of the trained classifier. We
therefore compare three approaches for interactive classifier training in a
user study. All of the approaches are potential candidates for the
integration into a larger retrieval system. They incorporate active learning
to various degrees in order to reduce the labeling effort as well as to
increase effectiveness. Two of them include interactive visualization that
lets users explore the status of the classifier in the context of the labeled
documents and judge the classifier's quality in iterative feedback loops. We
see our work as a step towards introducing user-controlled
classification methods, in addition to text search and filtering, to
increase recall in analytics scenarios involving large corpora.