Weiteres

Schrift: größer , kleiner

Data Science I & II

Course Description

Part I runs in the winter semester and pursues three main goals. First, we build up the R skills required to investigate all subsequent topics in R in a hands-on fashion. No previous programming experience is required. Second, we cover the first steps of the data processing pipeline in data collection, preparation and visualization – honing our R skills along the way. Third and finally, we turn our attention to modelling with a focus on supervised learning. This part includes basic regression and classification models, feature engineering and model evaluation.

Data Science I - Introductory Modelling

Part II in the summer semester builds on part I. First, we enhance our model toolkit with more advanced models and feature engineering methods. Second, we take a step back and investigate the notion of generalization, i.e. under which circumstances we can hope to learn anything at all about the world from our data. We will discover strategies to achieve generalization and learn to select models based on these strategies.

Data Science II - Intermediate Modelling

Course Structure

The hands-on nature of these modules requires us to build hard skills in part I of the sequence. The winter semester thus consists of three weekly sessions: a lecture, the discussion of a complementary R Script, and an exercise where we discuss the solutions to the weekly R assignments. You may opt to work through the R script yourself and then complete the assignment but especially beginners greatly benefit from the extra session in R that puts them on track to do the assignments.

Participants of part II in the summer semester are expected to bring sound command of R (or the willingness and effort to catch up). There is a weekly lecture and a complementary R session in which we revisit the content.

Content

This is a rough sketch of the course content. Especially the content of part II is still tentative. Current topics at the end of the course are subject to change and updated regularly.

Part I
Week 1	Introduction
Week 2	Data
Week 3	Data Collection
Week 4	Data Preparation
Week 5	Visualization
Week 6	Introduction to Modelling
Week 7	Linear Regression
Week 8	k-NN, Tree Regression, Smoothing
Week 9	Logistic Regression
Week 10	MNL Regression, Decision Trees
Week 11	Model Evaluation
Week 12	Features I: Engineering
Week 13	Guest lecture: Applications in Business/Economics
Week 14	Features II: Clustering
Week 15	Summary, Outlook, Exam preparation

Part II
Week 1	Recap, Empirical Loss Minimization
Week 2	Features III: PCA
Week 3	Features IV: Kernels & Adaptive Basis Functions
Week 4	TBD (SVM)
Week 5	Neural Networks I
Week 6	Neural Networks II
Week 7	Generalization
Week 8	Validation, Information Criteria
Week 9	Regularization, Model Selection
Week 10	Ensembles: Bagging
Week 11	Ensembles: Boosting
Week 12	Probabilistic Models I: Maximum Likelihood
Week 13	Probabilistic Models II: Mixture Models
Week 14	Guest Lecture
Week 15	Current Topics

Coursework

Part I

For part I, you will conduct your first small data analysis of a dataset of your choice. You will pick a regression or classification question, explore a suitable dataset, engage in feature engineering, run and evaluate at least 2 different models and hand in your work as a reproducible R notebook. In an oral exam, we will discuss your notebook and other topics covered in the course.

Data Science I - Coursework
Coursework_DataScience_I.pdf (113,2 KB) vom 09.03.2023

Examples

To meet popular demand, here are some (slightly modified) examples of previous submissions. They highlight different aspects of the coursework. These examples are not free of errors and not meant to be copy/paste templates. But they are excellent examples of students demonstrating their ability to independently inquire into their chosen datasets using R; employing, adapting and extending the R code from the course scripts; and putting in considerable effort. Good scripts are also characterized by a description of the thought process as the inquiry unfolds. They do not hide/remove parts of the script that did not improve the model but state the conclusions drawn from such parts instead. This coursework is not a Kaggle competition and predictive performance is not graded.

Example - Data Collection
StudentExample_Data_Collection.html (676,7 KB) vom 09.03.2023

Example - Data Preparation
StudentExample_Data_Preparation.html (1,7 MB) vom 23.03.2023

StudentExample_Data_Description_ Model_Evaluation
StudentExample_DataDescription_ModelEvaluation.html (1 MB) vom 23.03.2023

Part II

For part II, you will once again pick a regression or a classification problem, run both basic and advanced models, evaluate and compare their performance, engage in model selection based on model generalization and hand in your work as a reproducible R notebook. Your work will be presented and discussed in class.

Examples

Examples will follow later as the format of the coursework has changed.

Zum Seitenanfang

Martin-Luther-Universität Halle-Wittenberg

Themennavigation

Hauptnavigation

Die Universität

Fakultäten

International Office

Zentrale Einrichtungen

Graduierten-Akademie

Wissenschaftliche Zentren

An-Institute

Universitätsklinikum

Weiteres

Data Science I & II

Course Description

Course Structure

Content

Coursework

Part I

Examples

Part II

Examples

Fußnavigation

Letzte Aktualisierung: