BIO4101 Introduction to Biostatistics and Machine Learning

Start semester

- Course code
  BIO4101
- Number of credits
  10
- Teaching semester
  2024 Autumn
- Language of instruction and examination
  English
- Campus
  Hamar
- Required prerequisite knowledge
  Recommended prior knowledge: basic knowledge of Linux, basic statistics and bioinformatics.

Course content

Machine Learning (ML) is an umbrella term for computational algorithms that aim to solve (large-scale) problems without explicit instructions on how to find the solutions. The different algorithms (e.g., Perceptron, Self-Organizing Maps, Support Vector Machines, Random Forests, Autoencoders, Transformers) have in common that they scale up otherwise relatively simple statistical or mathematical concepts. It is, however, important to understand the inner workings of ML and to interpret the results in light of exactly how the problem was formulated. A common way to set up ML experiments is to use existing libraries, such as TensorFlow, and to configure the algorithms using the Python programming language. Python is therefore a necessary tool for performing such experiments.

The following topics are covered:

Basic statistical concepts and terminology such as sampling, variation, probability, modeling, inference, etc.
Reading data in various file formats, such as Excel and comma-separated text
Data manipulation, graphics, use of the R environment for descriptive and exploratory data analysis.
Null hypothesis, scientific testing, z-values of the normal distribution, t-distribution and t-tests, multiple testing
Linear models, OLS, logistic regression, residuals, model validation and model selection (multiple regression)
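
As a rough illustration of the OLS and residuals topics above, the following minimal sketch fits a straight line by ordinary least squares using only the Python standard library. The function name fit_simple_ols and the dose/response values are hypothetical and chosen purely for illustration; the course itself uses R for this kind of analysis.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Return (intercept, slope) that minimise the sum of squared residuals."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of x, and sum of cross-products of x and y.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical measurements, invented for this example only.
dose = [1.0, 2.0, 3.0, 4.0, 5.0]
response = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_simple_ols(dose, response)

# Residuals are the observed values minus the fitted values.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(dose, response)]

print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

With an intercept in the model, the residuals of an OLS fit sum to zero (up to rounding), which is a quick sanity check before looking at residual plots.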

ML topics covered include:

The Linux/Unix command line shell (refresher)
Basic programming concepts
Programming in Python
Algorithms in ML and their applications
Using ML libraries in Python
Parsing and loading data
Processing data using ML algorithms
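
To give a flavour of the ML topics listed above, here is a minimal sketch of the classic perceptron learning rule in plain Python, without any ML library. The function names, the learning rate, the number of epochs and the toy data (the logical AND of two binary features) are assumptions made for this example, not course material.

```python
def train_perceptron(samples, labels, epochs=20, learning_rate=1.0):
    """Train a single perceptron with the classic error-driven update rule."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(samples, labels):
            activation = bias + sum(w * f for w, f in zip(weights, features))
            predicted = 1 if activation > 0 else 0
            error = target - predicted  # -1, 0 or +1
            # Weights and bias only change when the prediction was wrong.
            bias += learning_rate * error
            weights = [w + learning_rate * error * f
                       for w, f in zip(weights, features)]
    return weights, bias


def predict(weights, bias, features):
    """Return 1 if the weighted sum is positive, otherwise 0."""
    activation = bias + sum(w * f for w, f in zip(weights, features))
    return 1 if activation > 0 else 0


# Toy, linearly separable data: the label is 1 only if both features are 1.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, s) for s in samples]
print(predictions)
```

A single perceptron can only separate classes with a straight line (or hyperplane), so it would not learn a pattern such as XOR; this is one example of why the limitations of an algorithm matter when interpreting results.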

Learning outcome

Upon passing the course, students have achieved the following learning outcomes:

Knowledge

Students

have an understanding of statistical concepts and terminology, as well as the syntax and conventions for describing and interpreting statistical models in R.
have knowledge of how to write and run Python scripts.
have thorough knowledge of how to set up different Machine Learning systems using existing ML libraries.
have knowledge of evaluating the results of an ML system.
have knowledge of detecting and avoiding over-fitting and batch effects.
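
One common way to detect over-fitting is to compare performance on the training data with performance on held-out data. The sketch below illustrates this with a 1-nearest-neighbour classifier, which memorises its training set. The simulated data, the 100/100 split and all function names are assumptions made for this example.

```python
import random


def nearest_neighbour_label(train_x, train_y, query):
    """Predict the label of the closest training point (1-nearest neighbour)."""
    closest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - query))
    return train_y[closest]


def accuracy(train_x, train_y, eval_x, eval_y):
    """Fraction of evaluation points whose predicted label is correct."""
    hits = sum(nearest_neighbour_label(train_x, train_y, x) == y
               for x, y in zip(eval_x, eval_y))
    return hits / len(eval_x)


# Simulated data: two overlapping classes with means 0 and 1 (assumption).
random.seed(0)
labels = [i % 2 for i in range(200)]
values = [random.gauss(float(label), 1.0) for label in labels]

# Hold out the second half of the data; it is never used for training.
train_x, train_y = values[:100], labels[:100]
test_x, test_y = values[100:], labels[100:]

train_accuracy = accuracy(train_x, train_y, train_x, train_y)
test_accuracy = accuracy(train_x, train_y, test_x, test_y)
print(f"training accuracy: {train_accuracy:.2f}")
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Each training point is its own nearest neighbour, so the training accuracy looks perfect even though the classes overlap. A clear gap between training and held-out accuracy is the warning sign of over-fitting.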

Skills

Students

can apply statistical models and interpret model outcomes and predictions, for example by using the lm() function in R for linear regression.
can present statistical results in scientific reports and identify the most important findings.
can use documentation and help texts for R packages that implement statistical methods that they are familiar with.
are able to analyze a problem in bioinformatics, define requirements for data collection, and design ML experiments to test hypotheses.
understand the inner workings of ML algorithms, as well as their limitations, in the context of biological data analysis.

General competence

Students

can analyze, visualize, and manipulate data and present their analysis results in a clear and scientific form, using text and graphics.
can write and run Python programs to solve simple biological tasks, including data quality control and formatting.
can interface Python programs with existing ML libraries.
can apply statistical methods in R to datasets that they meet later in studies and work.
can import data into R, validate the quality of the data, run appropriate analyses, and interpret and present the results in a form that is useful to the end user.

Working and teaching methods

Lectures and exercises/assignments based on examples from biological systems such as biomedicine and biotechnology.

Lectures: Detailed lectures are given to provide an overview of the main topics included in the course.
Computer lab: Practical exercises throughout the semester in which students read and analyze data in R with RStudio and write Python programs to conduct a series of experiments.
Compulsory assignment: Individual written reports on assigned problems, one using R/RStudio to analyze a provided dataset and one applying ML algorithms.

Normally, evaluation of all courses must be carried out. Time/date and method are decided in consultation with student representatives. The course coordinator is responsible for ensuring that the evaluation is carried out.

Compulsory activities

At least 80% attendance at scheduled lectures and computer lab exercises.

Examination

Form of assessment	Grading scale	Grouping	Duration of assessment	Support materials	Proportion	Comments
Written assignment	ECTS - A-F	Individual			50%
Written examination with supervision	ECTS - A-F	Individual	4 hours		50%

Form of assessment

Individual written report (50%)
4-hour individual written school exam (50%)

Performance is assessed using a grading scale from A to F, where E is the lowest passing grade. All examinations must be passed for the course to be assessed as passed.

Reading list

No reading list available for this course

Faculty

Faculty of Applied Ecology, Agricultural Sciences and Biotechnology

Department

Department of Biotechnology

Area of study

Mathematics and natural sciences/informatics

Programme of study

Master's Degree in Applied and Commercial Biotechnology

Course level

Second degree level (500-HN)