BIO4101 Introduction to Biostatistics and Machine Learning

    • Course code
      BIO4101
    • Number of credits
      10
    • Teaching semester
      2024 Autumn
    • Language of instruction
      English
    • Campus
      Hamar
    • Required prerequisite knowledge

      Recommended prior knowledge: basic knowledge of Linux, basic statistics and bioinformatics. 

Course content

Machine Learning (ML) is an umbrella term for computational algorithms that aim to solve (large-scale) problems without precise instructions on how to find solutions. The different algorithms (e.g., Perceptron, Self-Organizing Maps, Support Vector Machines, Random Forests, Autoencoders, and Transformers) have in common that they can radically scale up otherwise relatively simple statistical or mathematical concepts. It is, however, important to understand the inner workings of ML and to interpret the results according to exactly how the problem is formulated. A common way to set up ML experiments is to use existing libraries, such as TensorFlow, and configure the algorithms using the Python programming language. Python is thus a necessary tool for performing experiments.

The following topics are covered: 

  • Basic statistical concepts and terminology such as sampling, variation, probability, modeling, and inference
  • Reading data in various file formats, such as Excel and comma-separated text
  • Data manipulation, graphics, and use of the R environment for descriptive and exploratory data analysis
  • Null hypothesis, scientific testing, z-values of the normal distribution, t-distribution and t-tests, multiple testing
  • Linear models, OLS, logistic regression, residuals, model validation and model selection (multiple regression) 

ML topics covered include: 

  • The Linux/Unix command line shell (refresher) 
  • Basic programming concepts
  • Programming in Python
  • Algorithms in ML and their applications 
  • Using ML libraries in Python 
  • Parsing and loading data 
  • Processing data using ML algorithms 
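
As an informal illustration of the last topics in this list (parsing data and processing it with an ML algorithm), the following minimal sketch trains a perceptron, one of the algorithms named in the course content, using only the Python standard library. The dataset, the column names (gene_a, gene_b, label), and the function names are made up for this example and are not course material; real exercises would normally rely on an existing ML library.

```python
import csv
import io

# Hypothetical toy data: two made-up features and a binary label.
CSV_TEXT = """gene_a,gene_b,label
0.2,0.1,0
0.4,0.3,0
0.1,0.5,0
0.9,0.8,1
0.7,0.9,1
0.8,0.6,1
"""


def load_rows(text):
    """Parse comma-separated text into feature vectors and integer labels."""
    features, labels = [], []
    for row in csv.DictReader(io.StringIO(text)):
        features.append([float(row["gene_a"]), float(row["gene_b"])])
        labels.append(int(row["label"]))
    return features, labels


def predict(weights, bias, x):
    """Return 1 if the weighted sum exceeds zero, else 0."""
    total = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if total > 0 else 0


def train_perceptron(features, labels, epochs=100, rate=0.1):
    """Perceptron rule: adjust weights only on misclassified samples."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(features, labels):
            delta = y - predict(weights, bias, x)
            if delta != 0:
                errors += 1
                bias += rate * delta
                weights = [w + rate * delta * xi for w, xi in zip(weights, x)]
        if errors == 0:  # every training sample is classified correctly
            break
    return weights, bias


features, labels = load_rows(CSV_TEXT)
weights, bias = train_perceptron(features, labels)
predictions = [predict(weights, bias, x) for x in features]
```

Because the toy data are linearly separable, the training loop stops once all six samples are classified correctly. Evaluating on the training data, as done here, says nothing about over-fitting; a held-out test set would be needed for that.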

Learning Outcome

Upon passing the course, students have achieved the following learning outcomes:

Knowledge

Students

  • have an understanding of statistical concepts and terminology, and of the syntax and conventions for describing and interpreting statistical models in R.
  • have knowledge of how to write and run Python scripts. 
  • have thorough knowledge of how to set up different Machine Learning systems using existing ML libraries. 
  • have knowledge of evaluating the results of an ML system. 
  • have knowledge of how to detect and avoid over-fitting and batch effects.

Skills

Students

  • can apply statistical models, for example with the lm() function for linear regression, and interpret model outcomes and predictions.
  • can present statistical results in scientific reports and identify the most important findings. 
  • can use documentation and help texts for R packages that implement statistical methods that they are familiar with. 
  • are able to analyze a problem in bioinformatics, define requirements for data collection, and design ML experiments to test hypotheses. 
  • understand the inner workings of ML algorithms, as well as their limitations, in the context of biological data analysis.

General competence

Students

  • can analyze, visualize, and manipulate data and present their analysis results in a clear and scientific form, using text and graphics. 
  • can write and run Python programs to solve simple biological tasks, including data quality control and formatting. 
  • can interface Python programs with existing ML libraries. 
  • can apply statistical methods in R to datasets that they encounter later in their studies and work.
  • can transfer data into R, validate the quality of the data, run appropriate analyses, and interpret and present the results in a form that is useful to the end user.

Teaching and working methods

Lectures and exercises/assignments based on examples from biological systems such as biomedicine and biotechnology. 

  • Lectures: Detailed lectures are given to provide an overview of the main topics included in the course. 
  • Computer lab: Practical exercises throughout the semester where students will read and analyze data in R with RStudio and write Python programs to conduct a series of experiments.
  • Compulsory assignment: Individual written reports on assigned problems, one using R/RStudio to analyze a provided dataset and one applying ML algorithms.

Normally, evaluation of all courses must be carried out. Time/date and method are decided in consultation with student representatives. The course coordinator is responsible for ensuring that the evaluation is carried out. 

Required coursework
  • Attendance of at least 80% for all scheduled lectures and computer lab exercises. 
Assessments
Form of assessment | Grading scale | Grouping | Duration of assessment | Support materials | Proportion | Comment
Written assignment | ECTS - A-F | Individual | - | - | 50% | -
Written examination with invigilation | ECTS - A-F | Individual | 4 hours | - | 50% | -
Form of assessment
  • Individual written report (50%)
  • 4-hour individual written school exam (50%)

 

Performance is assessed using a grading scale from A to F, where E is the lowest passing grade. All examinations must be passed for the course to be assessed as passed.

Faculty
Faculty of Applied Ecology, Agricultural Sciences and Biotechnology
Department
Department of Biotechnology
Area of study
Mathematics and natural sciences/informatics
Programme of study
Master's Degree in Applied and Commercial Biotechnology
Course level
Second degree level (500-HN)