Jennifer G. Dy

Professor
Department of Electrical and Computer Engineering
Northeastern University

Clustering High-Dimensional Data:

Clustering is the process of grouping "similar" objects or samples together. "Similarity" is typically defined by a metric or a probability model, either of which depends heavily on the features (descriptors) representing each sample. Many clustering algorithms assume that the relevant features have already been determined by domain experts. But not all features are important: some may be redundant, some may be irrelevant, and some can even mislead the clustering. In addition, reducing the number of features improves interpretability and alleviates the tendency of some algorithms to break down on high-dimensional data. Research on clustering high-dimensional data addresses these problems.
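
As a toy illustration of how an irrelevant feature can mislead a distance-based notion of similarity, consider the minimal sketch below. It is not taken from any of the publications listed on this page. The three points, their values, and the helper names `euclidean` and `nearest` are all hypothetical and chosen only for illustration. Feature 0 is assumed to carry the cluster structure, and feature 1 is assumed to be noise.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(name, points, features):
    """Return the name of the point closest to `name`,
    measuring distance only over the given feature indices."""
    query = [points[name][i] for i in features]
    others = {
        k: [v[i] for i in features]
        for k, v in points.items()
        if k != name
    }
    return min(others, key=lambda k: euclidean(query, others[k]))

# Hypothetical data: (relevant feature, irrelevant noisy feature).
# A and B are close on the relevant feature; C is far away.
points = {
    "A": (0.0, 0.0),
    "B": (0.1, 9.0),
    "C": (5.0, 0.5),
}

relevant_only = nearest("A", points, features=[0])
all_features = nearest("A", points, features=[0, 1])

print("nearest to A using the relevant feature only:", relevant_only)
print("nearest to A using both features:", all_features)
```

With the relevant feature alone, A's nearest neighbor is B. Once the noisy feature is included, the large difference on that feature dominates the distance and C becomes the nearest neighbor, so a metric-based clustering would be more likely to group A with the wrong point.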

I have explored several different ways of addressing clustering in high dimensions. See the publications below for more details. This research is supported by NSF CAREER grant No. IIS-0347532.

Publications in Clustering:

Y. Cui, X. Fern, and J. G. Dy, "Non-Redundant Multi-View Clustering Via Orthogonalization," Proceedings of the IEEE International Conference on Data Mining, Omaha, NE, October 2007, to appear. (pdf version).

T. Su and J. G. Dy, "In Search of Deterministic Methods for Initializing K-Means and Gaussian Mixture Clustering," Intelligent Data Analysis, Vol. 11, No. 4, pp. 319-338, 2007. (pre-press pdf version).

J. G. Dy, "Unsupervised Feature Selection," invited book chapter in Computational Methods of Feature Selection, edited by Huan Liu and Hiroshi Motoda, Chapman and Hall/CRC Press, to appear 2007.

K. Sanghai, T. Su, J. G. Dy, and D. Kaeli, "A Multinomial Clustering Model for Fast Simulation of Computer Architecture Designs," Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August, 2005, Chicago, Illinois. (pdf version).

J. G. Dy and C. E. Brodley, "Feature Selection for Unsupervised Learning," Journal of Machine Learning Research, Volume 5, pp. 845-889, August, 2004.
JMLR version (pdf), (ps).
Technical Report version (with the extended appendix of experimental results).

T. Su and J. G. Dy, "Automated Hierarchical Mixtures of Probabilistic Principal Component Analyzers," Proceedings of the 21st International Conference on Machine Learning, pages 775-782, July, 2004, Banff, Alberta, Canada. (pdf version).

T. Su and J. G. Dy, "A Deterministic Method for Initializing K-means Clustering," Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence, pages 784-786, November, 2004, Boca Raton, Florida. (pdf version).

J. G. Dy and C. E. Brodley, "Visualization and Interactive Feature Selection for Unsupervised Data," Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 360-364, August 20-23, 2000, Boston, MA. Abstract, (ps version), (pdf version).

J. G. Dy and C. E. Brodley, "Feature Subset Selection and Order Identification for Unsupervised Learning," Proceedings of the Seventeenth International Conference on Machine Learning, June 29-July 2, 2000, Stanford University, CA. Abstract, (ps version), (pdf version).

Students working on this project are:

Ting Su
Ying Cui