AMS 598, Big Data Analysis

This course introduces the application of supercomputing to statistical data analysis, particularly of big data. Implementations of various statistical methodologies within a parallel computing framework are demonstrated throughout the lectures. The course covers (1) parallel computing basics, including interconnection network architectures, communication methodologies, algorithms, and performance measurement, and (2) their applications to modern data mining techniques, including variable selection and dimension reduction, linear and logistic regression, tree-based classification methods, kernel-based methods, nonlinear statistical models, and model inference and resampling methods.

Prerequisite: AMS 572, AMS 573, and AMS 578
3 credits, ABCF grading 


"Applied Parallel Computing", by Yuefan Deng; 2012; World Scientific Publishing Company; ISBN: 9789814307604 (recommended/optional)

"The Elements of Statistical Learning: Data Mining, Inference, and Prediction", by Trevor Hastie, Robert Tibshirani, and Jerome Friedman; Second Edition; 2011; Springer Series in Statistics; ISBN: 9780387848570 (hardcover) (recommended/optional)

"Mining of Massive Datasets", by Jure Leskovec, Anand Rajaraman, and Jeffrey David Ullman; Second Edition; 2014; Cambridge University Press; ISBN: 9781107077232 (recommended/optional)

Offered every Fall semester, beginning Fall 2016

Learning Outcomes:

Demonstrate knowledge of parallel computing basics:

  • Node architecture, central processing units, and accelerators;
  • Distributed- and shared-memory architectures.

Demonstrate skills with software architecture and R:

  • Communication patterns and protocols;
  • Process creation and management;
  • MapReduce framework;
  • Hadoop in R.

Demonstrate mastery of basic tools for big data analysis:

  • Linear regression
  • Logistic regression
  • Dimension reduction

Demonstrate understanding of advanced methods for big data analysis:

  • Classification and regression trees
  • Random forest
  • Gradient boosting
  • Support vector machine
  • Neural network

Demonstrate understanding of model selection and performance evaluation:

  • Best subset; forward selection; backward selection
  • Cross-validation
  • Bootstrap