Martin-Luther-Universität Halle-Wittenberg

DS_Spaltenbild

Weiteres

Login für Redakteure

Data Science I & II

Course Description

Part I runs in the winter semester and pursues three main goals. First, we build up the R skills required to investigate all subsequent topics in R in a hands-on fashion. No previous programming experience is required. Second, we cover the first steps of the data processing pipeline in data collection, preparation and visualization – honing our R skills along the way. Third and finally, we turn our attention to modelling with a focus on supervised learning. This part includes basic regression and classification models, feature engineering and model evaluation.

Data Science I - Introductory Modelling

Data Science I - Introductory Modelling

Part II in the summer semester builds on part I. First, we enhance our model toolkit with more advanced models and feature engineering methods. Second, we take a step back and investigate the notion of generalization, i.e. under which circumstances we can hope to learn anything at all about the world from our data. We will discover strategies to achieve generalization and learn to select models based on these strategies.

Data Science II - Intermediate Modelling

Data Science II - Intermediate Modelling

Course Structure

The hands-on nature of these modules requires us to build hard skills in part I of the sequence. The winter semester thus consists of three weekly sessions: a lecture, the discussion of a complementary R Script, and an exercise where we discuss the solutions to the weekly R assignments. You may opt to work through the R script yourself and then complete the assignment but especially beginners greatly benefit from the extra session in R that puts them on track to do the assignments.

Participants of part II in the summer semester are expected to bring sound command of R (or the willingness and effort to catch up). There is a weekly lecture and a complementary R session in which we revisit the content.

Content

This is a rough sketch of the course content. Especially the content of part II is still tentative. Current topics at the end of the course are subject to change and updated regularly.

Part I
Week 1Introduction
Week 2Data
Week 3Data Collection
Week 4Data Preparation
Week 5Visualization
Week 6Introduction to Modelling
Week 7Linear Regression
Week 8k-NN, Tree Regression, Smoothing
Week 9Logistic Regression
Week 10MNL Regression, Decision Trees
Week 11Model Evaluation
Week 12Features I: Engineering
Week 13Guest lecture: Applications in Business/Economics
Week 14Features II: Clustering
Week 15Summary, Outlook, Exam preparation
Part II
Week 1Recap, Empirical Loss Minimization
Week 2Features III: PCA
Week 3Features IV: Kernels & Adaptive Basis Functions
Week 4TBD (SVM)
Week 5Neural Networks I
Week 6Neural Networks II
Week 7Generalization
Week 8Validation, Information Criteria
Week 9Regularization, Model Selection
Week 10Ensembles: Bagging
Week 11Ensembles: Boosting
Week 12Probabilistic Models I: Maximum Likelihood
Week 13Probabilistic Models II: Mixture Models
Week 14Guest Lecture
Week 15Current Topics

Coursework

Part I

For part I, you will conduct your first small data analysis of a dataset of your choice. You will pick a regression or classification question, explore a suitable dataset, engage in feature engineering, run and evaluate at least 2 different models and hand in your work as a reproducible R notebook. In an oral exam, we will discuss your notebook and other topics covered in the course.

Data Science I - Coursework
Coursework_DataScience_I.pdf (113,2 KB)  vom 09.03.2023

Examples

To meet popular demand, here are some (slightly modified) examples of previous submissions. They highlight different aspects of the coursework. These examples are not free of errors and not meant to be copy/paste templates. But they are excellent examples of students demonstrating their ability to independently inquire into their chosen datasets using R; employing, adapting and extending the R code from the course scripts; and putting in considerable effort. Good scripts are also characterized by a description of the thought process as the inquiry unfolds. They do not hide/remove parts of the script that did not improve the model but state the conclusions drawn from such parts instead. This coursework is not a Kaggle competition and predictive performance is not graded.

Example - Data Collection
StudentExample_Data_Collection.html (676,7 KB)  vom 09.03.2023

Example - Data Preparation
StudentExample_Data_Preparation.html (1,7 MB)  vom 23.03.2023

StudentExample_Data_Description_ Model_Evaluation
StudentExample_DataDescription_ModelEvaluation.html (1 MB)  vom 23.03.2023


Part II

For part II, you will once again pick a regression or a classification  problem, run both basic and advanced models, evaluate and compare their  performance, engage in model selection based on model generalization and  hand in your work as a reproducible R notebook. Your work will be  presented and discussed in class.

Examples

Examples will follow later as the format of the coursework has changed.

Zum Seitenanfang