Tutorial 4: On the Power of Ensemble: Supervised and Unsupervised Methods Reconciled


Ensemble methods have emerged as a powerful approach for improving the robustness and accuracy of both supervised and unsupervised solutions. Moreover, as enormous amounts of data are continuously generated from different views, it is important to consolidate different concepts for intelligent decision making. In the past decade, there have been numerous studies on the problem of combining competing models into a committee, and the success of ensemble techniques has been observed in multiple disciplines, including recommendation systems, anomaly detection, stream mining, and web applications. Ensemble techniques have mostly been studied separately in the supervised and unsupervised learning communities. However, they share the same basic principle: combining diversified base models strengthens weak models. Also, when both supervised and unsupervised models are available for a single task, merging all of the results leads to better performance. Therefore, there is a need for a systematic introduction to and comparison of ensemble techniques that combines the views of both supervised and unsupervised learning ensembles. In this tutorial, we will present an organized picture of ensemble methods with a focus on the mechanism used to merge the results. We start with the description and applications of ensemble methods. Through reviews of well-known and state-of-the-art ensemble methods, we show that supervised learning ensembles usually "learn" this mechanism from the labels available in the training data, whereas unsupervised ensembles simply combine multiple clustering solutions based on "consensus". We end the tutorial with a systematic approach to combining both supervised and unsupervised models.
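
The contrast between a "learned" and a "consensus" merging mechanism can be illustrated with a minimal Python sketch. It is not the method presented in the tutorial: the function names (learn_weights, weighted_vote, consensus_clusters), the use of held-out accuracy as vote weights, and the 0.5 co-association threshold are all assumptions chosen for illustration.

```python
from itertools import combinations


def learn_weights(predictions, labels):
    """Supervised side: weight each base model by its accuracy on labeled data.

    predictions: one list of predicted labels per base model.
    labels: the true labels for the same instances.
    """
    return [
        sum(p == y for p, y in zip(model_preds, labels)) / len(labels)
        for model_preds in predictions
    ]


def weighted_vote(predictions, weights):
    """Combine base-model predictions using the learned weights."""
    combined = []
    for votes in zip(*predictions):
        score = {}
        for label, w in zip(votes, weights):
            score[label] = score.get(label, 0.0) + w
        combined.append(max(score, key=score.get))
    return combined


def consensus_clusters(clusterings, threshold=0.5):
    """Unsupervised side: no labels, so merge by agreement only.

    Two objects end up in the same consensus cluster when more than
    `threshold` of the base clusterings place them together
    (a co-association rule, assumed here for illustration).
    """
    n = len(clusterings[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        together = sum(1 for c in clusterings if c[i] == c[j])
        if together / len(clusterings) > threshold:
            parent[find(j)] = find(i)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


if __name__ == "__main__":
    # Toy data, made up for this example.
    held_out_preds = [[1, 0, 1, 1], [1, 1, 0, 1], [0, 0, 1, 1]]
    held_out_labels = [1, 0, 1, 1]
    weights = learn_weights(held_out_preds, held_out_labels)
    print(weighted_vote([[0, 1], [1, 1], [1, 0]], weights))

    # Cluster IDs are arbitrary, so the second clustering agrees with the first.
    print(consensus_clusters([[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]))
```

Note that the clustering side never compares cluster IDs across base solutions, only whether pairs of objects are grouped together, which is why it needs no labels.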

Tutors' Biographies:

  • Jing Gao received the BEng and MEng degrees, both in Computer Science, from Harbin Institute of Technology, China, in 2002 and 2004, respectively. She is currently working toward the Ph.D. degree in the Department of Computer Science, University of Illinois at Urbana-Champaign. She is broadly interested in data and information analysis with a focus on data mining and machine learning. In particular, her research interests include ensemble methods, transfer learning, mining data streams, and anomaly detection. She has published more than 20 papers in refereed journals and conferences, including KDD, NIPS, ICDCS, ICDM, and SDM.

  • Wei Fan received his PhD in Computer Science from Columbia University in 2001 and has been working at IBM T.J. Watson Research since 2000. He has published more than 60 papers in top data mining, machine learning, and database conferences, such as KDD, SDM, ICDM, ECML/PKDD, SIGMOD, VLDB, ICDE, AAAI, and ICML. Dr. Fan has served as Area Chair or Senior PC member of SIGKDD'06, SDM'08 and ICDM'08/09, sponsorship co-chair of SDM'09, award committee member of ICDM'09, as well as PC member of several prestigious conferences in the area, including KDD'09/08/07/05, ICDM'07/06/05/04/03, SDM'09/07/06/05/04, CIKM'09/08/07/06, ECML/PKDD'07/06, ICDE'04, AAAI'07, PAKDD'09/08/07, EDBT'04, WWW'09/08/07, etc. He is on the advisory board of KD2U. Dr. Fan was invited to speak at ICMLA'06. He served as a US NSF panelist in 2007/08. His main research interests and experience are in various areas of data mining and database systems, such as risk analysis, high performance computing, extremely skewed distributions, cost-sensitive learning, data streams, ensemble methods, easy-to-use nonparametric methods, graph mining, predictive feature discovery, feature selection, sample selection bias, transfer learning, novel applications, and commercial data mining systems. He is particularly interested in simple, unconventional, but effective methods for solving difficult problems. His thesis work on intrusion detection has been licensed by a start-up company since 2001. His co-teamed submission that uses Random Decision Tree won the ICDM'08 Contest Crown Awards. His co-authored paper in ICDM'06 that uses "Randomized Decision Tree" to predict skewed ozone days won the best application paper award. His co-authored paper in KDD'97 on the distributed learning system "JAM" won the runner-up best application paper award.

  • Jiawei Han (Ph.D., Univ. of Wisconsin at Madison) is a professor in the Department of Computer Science, University of Illinois at Urbana-Champaign. He has been working on research into data mining, data warehousing, stream data mining, spatial and multimedia data mining, and bio-medical data mining, with over 300 conference and journal publications. He has chaired or served on over 100 program committees of international conferences and workshops, including ACM SIGKDD Conferences (2001 best paper award chair, 2002 student award chair, 1996 PC co-chair), SIAM-Data Mining Conferences (2001 and 2002 PC co-chair), ACM SIGMOD Conferences (2000 exhibit program chair), International Conferences on Data Engineering (2004 and 2002 PC vice-chair), International Conferences on Data Mining (2005 PC co-chair), and International Conference on Very Large Data Bases (2006 VLDB Americas Chair). He has also served or is serving as EIC of ACM Transactions on Knowledge Discovery from Data and on the editorial boards of Data Mining and Knowledge Discovery, IEEE Transactions on Knowledge and Data Engineering, Journal of Intelligent Information Systems, and Journal of Computer Science and Technology. Jiawei has received the Outstanding Contribution Award at the 2002 International Conference on Data Mining, the ACM Service Award (1999), the ACM SIGKDD Innovations Award (2004), and the IEEE CS Technical Achievement Award (2005). He is an ACM and IEEE Fellow. He is the first author of the textbook "Data Mining: Concepts and Techniques", 2nd ed. (Morgan Kaufmann, 2006).