EECE 5698: Parallel Processing for Data Analytics

Course Syllabus

This course covers the fundamentals of parallel machine learning algorithms, tailored specifically to learning tasks involving large datasets. The course reviews methods for dealing with both large and high-dimensional datasets, emphasizing distributed implementations. Beyond covering the theory behind statistical data analysis, the course also offers a hands-on approach, using Spark as a development platform for parallel learning and the Massachusetts Green High Performance Computing Cluster (MGHPCC) as a programming environment. In detail, the course will cover:

  • Apache Spark fundamentals, multi-threaded/cluster execution.
  • Resilient distributed datasets (RDDs), map-reduce operations, persistence and iterative algorithms, lazy evaluation.
  • Working with key-value pairs, joins.
  • Convex sets and functions, convex optimization, gradient descent.
  • Linear regression, Gauss-Markov theorem, generalized linear models, ridge and lasso regularization.
  • Feature selection, cross-validation, bias-variance trade-off.
  • Classification, logistic regression, loss functions. ROC curves and AUC.
  • Stochastic gradient descent. Matrix and tensor factorization.
  • Graph-parallel algorithms & sparsity.
  • Perceptron algorithm & deep neural networks.

Grading

There will be four homework assignments, each with a programming component, as well as a midterm exam and a final course project. The grade breakdown is as follows:

  • Homework: 40%
  • Midterm exam: 30%
  • Course project: 30%

Reference Textbooks

  • Karau, H., Konwinski, A., Wendell, P., and Zaharia, M. (2015). Learning Spark: Lightning-Fast Big Data Analysis. O'Reilly Media. Available online at the NEU library.
  • Boyd, S., and Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press. Available online.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning. Springer. Available online.

Programming

All homework assignments use Apache Spark, with the Discovery Cluster as the computing environment. Knowledge of Python is recommended but not strictly required; the first few lectures cover as much Python as the rest of the course requires.

Prerequisites

EECE 5644 Introduction to Machine Learning and Pattern Recognition or equivalent; the prerequisite can be waived with permission from the instructor.

Blackboard

Students enrolled in the class can find additional information on the course's Blackboard website.