Comparing Clusterings Using Bertin's Idea

Alexander Pilhofer, Alexander Gribov, Antony Unwin
Classifying a set of objects into clusters can be done in numerous ways, producing different results. They can be visually compared using contingency tables [27], mosaicplots [13], fluctuation diagrams [15], tableplots [20], (modified) parallel coordinates plots [28], Parallel Sets plots [18] or circos diagrams [19]. Unfortunately, the interpretability of all these graphical displays decreases rapidly as the number of categories and the number of clusterings grow. In his famous book A Semiology of Graphics [5], Bertin writes: "the discovery of an ordered concept appears as the ultimate point in logical simplification since it permits reducing to a single instant the assimilation of series which previously required many instants of study." In more everyday language, a good ordering lets you see immediately results that might take a lot of effort to find under other orderings. This is also related to the idea of effect ordering [12]: data should be organised to reflect the effect you want to observe. This paper presents an efficient algorithm, based on Bertin's idea and concepts related to Kendall's τ [17], which finds informative joint orders for two or more nominal classification variables. We also show how these orderings improve the various displays and how groups of corresponding categories can be detected using a top-down partitioning algorithm. Different clusterings based on data on the environmental performance of cars sold in Germany are used for illustration. All presented methods are available in the R package extracat, which is used to compute the optimized orderings for the example dataset.
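To make the link between category orderings and Kendall-type concordance concrete, the following is a minimal sketch in Python. It is not the algorithm of this paper and not the extracat implementation. It assumes that a cross-tabulation of two clusterings is scored by its number of discordant observation pairs (pairs where one observation lies in a later row but an earlier column than the other), and it searches all row and column permutations by brute force, which is only feasible for very small tables. The function names and the example counts are hypothetical.

```python
from itertools import permutations


def discordant_pairs(table):
    """Count discordant observation pairs for the table's current ordering.

    A pair is discordant when one observation lies in an earlier row but a
    later column than the other. Fewer discordant pairs means the counts
    concentrate more closely along the main diagonal.
    """
    n_rows, n_cols = len(table), len(table[0])
    total = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):   # rows below cell (i, j)
                for l in range(j):           # columns to its left
                    total += table[i][j] * table[k][l]
    return total


def reorder(table, row_order, col_order):
    """Return a copy of the table with rows and columns permuted."""
    return [[table[r][c] for c in col_order] for r in row_order]


def best_joint_order(table):
    """Brute-force search for a joint ordering with the fewest discordant pairs.

    Illustration only: the search space grows factorially with the number of
    categories, so this does not scale beyond a handful of clusters.
    """
    best = None
    for rows in permutations(range(len(table))):
        for cols in permutations(range(len(table[0]))):
            score = discordant_pairs(reorder(table, rows, cols))
            if best is None or score < best[0]:
                best = (score, rows, cols)
    return best


# Hypothetical cross-tabulation of two clusterings (rows: clustering A,
# columns: clustering B); the counts are invented for illustration.
table = [
    [0, 20, 1],
    [15, 0, 2],
    [1, 3, 30],
]

score, rows, cols = best_joint_order(table)
print("discordant pairs before:", discordant_pairs(table))
print("discordant pairs after: ", score)
print("row order:", rows, "column order:", cols)
```

In the reordered table the large counts sit near the diagonal, so corresponding clusters in the two clusterings can be read off directly. An exhaustive search like this one is exactly what an efficient algorithm has to avoid once the number of categories grows.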