This course covers the fundamentals of parallel machine learning algorithms, tailored specifically to learning tasks involving large datasets. The course reviews methods for dealing with both large and high-dimensional datasets, emphasizing distributed implementations. Beyond covering the theory behind statistical data analysis, the course also offers a hands-on approach, using Spark as a development platform for parallel learning and the Massachusetts Green High Performance Computing Cluster (MGHPCC) as a programming environment. In detail, the course will cover: