COMP 6960 - 090 Programming for BioMedical Data Science

University of Utah

Semester: Fall 2024

Time: Self-Paced

Location: Online

Instructor: Rebecca Barter (rebecca.barter@hsc.utah.edu)

Faculty Coordinator: Jeff Phillips (jeff.m.phillips@utah.edu)

General Information

This course provides an introduction to programming in R, with topics and pace designed for biomedical students interested in data science. Prior programming experience is not required. Students will learn how to write code for handling data, focusing on data frame representations. Using these common representations, students will learn to prepare data for analysis starting from various formats, visualize its contents, and perform basic analysis to evaluate its veracity. This course is structured as a series of stackable short courses, where students need to select and complete 4 short courses in the semester to fulfill requirements for this credit-earning course.

This course is 100% self-paced and online.

Short courses

This is a composite course consisting of the four self-paced online short courses listed below.

You should complete all four of these courses sequentially, starting with Introduction to R for Data Analysis, and ending with Statistics with R.

If you finish a course early, you are welcome to start the next course (you do not need to wait until after the previous course’s due date).

1. Introduction to R for Data Analysis

Registration: Click here to register for this course on Headlamp.

Length: 5.5 hours of pre-recorded lectures, 3 projects

Due date: September 18

Description: This course introduces the R programming language and is designed for beginners who are new to R and coding. Specific topics covered include using RStudio; writing documents with Quarto; basic coding principles; defining variables, vectors, and data frames; pipes; data manipulation with dplyr; and data visualization with ggplot2.

Course Learning Objectives:

Perform operations with character, numeric, and logical (Boolean) type objects in R
Use the dplyr library functions select, mutate, filter, summarize, and group_by to manipulate and summarize tabular data
Use the ggplot2 library to create customizable data visualizations, including histograms, bar charts, and scatterplots

2. Advanced R for Data Analysis

Length: 4 hours of pre-recorded lectures, 4 projects

Due date: October 9

Description: You’ve learned the basics of R and the tidyverse, and now you’re ready to conduct more sophisticated data manipulation and analysis. This course is designed to elevate your R programming expertise, building on the foundations laid in the Introduction to R for Data Analysis course. You’ll learn advanced techniques and powerful tools that will transform your data workflows and analytical capabilities, such as creating custom R functions to automate your data science pipeline, reducing code redundancy with map functions, refactoring column variables, joining multiple data frames, and reshaping data. This course utilizes the R programming language, and it is assumed that learners have taken our “Introduction to R for Data Analysis” course or have equivalent experience.

Course Learning Objectives:

Write your own custom R functions
Reduce redundancy in your code using iteration techniques with the purrr package
Refactor column variables
Reshape your data frames
Join multiple data frames together

3. Data Cleaning with R

Length: 6 hours of pre-recorded lectures, 6 projects

Due date: November 7

Description: Real-world data can never perfectly capture the real world, and it rarely arrives in a ready-for-analysis format. Before your data is ready for analysis, it is critical that it is formatted appropriately, and you have ensured that it is representing the reality it was originally designed to capture to the greatest extent possible. The process of molding a dataset into a format that satisfies these criteria is known as “data cleaning”. While it may seem like a boring process, data cleaning is arguably the most important stage of the entire data science life cycle, since it ensures that you have a clear understanding of how real-world information is represented in your data as well as its limitations. Since every dataset is “messy” in its own way, the process of data cleaning will be unique to every dataset. This course introduces a series of steps that can be used to help you to understand your data and create a custom data cleaning procedure for any dataset, with a focus on biomedical data, such as electronic health records and health survey data. Specific topics covered include identifying and addressing missing values, handling data quality issues, such as invalid and inconsistent values, reshaping data into a “tidy” format, and creating a custom function in R that provides a reproducible and modifiable data cleaning pipeline for your project. This course utilizes the R programming language, and it is assumed that learners have taken our “Introduction to R for Data Analysis” course, or equivalent. It is recommended that students have also taken our “Advanced R for Data Analysis” course, but this is not a requirement.

Course Learning Objectives:

Data collection procedure and data dictionaries
Loading and pre-formatting data in R
Identifying and handling missing values
Identifying and handling invalid and inconsistent values
Converting data to a tidy data format
Creating and implementing a reproducible data cleaning pipeline

4. Statistics with R

Length: TBD

Due date: December 6

Description: TBD

Course Learning Objectives: TBD

Grading

Grade: Each short course consists of a number of projects that are each graded out of 4 points. For any given project, scores are assigned as follows:

1 point: you did not correctly complete the majority of the required tasks
2 points: you correctly completed the majority, but not all of, the required tasks
3 points: you correctly completed all of the required tasks
4 points: you correctly completed all of the required tasks AND your code and figures were tidy and polished, where relevant.

Your grade for each short course will be based on the average of your individual project scores.

Your grade for the entire composite course will correspond to the average score you receive across the four short courses.

Late policy: If you do not complete a short course by its due date (see above), you will lose 2% of your grade for that short course per day late, unless instructor permission is granted.
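
The grading arithmetic described above can be sketched as follows. This is an illustration only, written in Python for convenience (the course itself uses R), and it is not the official grade calculation. It assumes that the late penalty removes 2% of the earned short-course average for each day late and that a grade cannot fall below zero; the syllabus does not spell out either detail. The function names and all example scores are hypothetical.

```python
from statistics import mean

# "2% ... per day" from the late policy.
LATE_PENALTY_PER_DAY = 0.02


def short_course_grade(project_scores, days_late=0):
    """Average the project scores (each out of 4), then apply the late penalty.

    Assumption: the penalty removes 2% of the earned average per day late
    and never pushes the grade below zero.
    """
    average = mean(project_scores)
    penalty = min(1.0, LATE_PENALTY_PER_DAY * days_late)
    return average * (1.0 - penalty)


def composite_grade(short_course_grades):
    """Average the grades received across the short courses."""
    return mean(short_course_grades)


# Hypothetical project scores. The project counts for the first three
# courses (3, 4, 6) come from the course listings; the fourth course's
# count is listed as TBD, so the two scores used here are a placeholder.
intro = short_course_grade([4, 3, 4])
advanced = short_course_grade([3, 3, 4, 4], days_late=5)
cleaning = short_course_grade([4, 4, 3, 4, 4, 3])
stats = short_course_grade([4, 4])

print(round(composite_grade([intro, advanced, cleaning, stats]), 3))
```

Under these assumptions, a short course submitted five days late keeps 90% of its project average. If instructor permission is granted, the equivalent call would simply use days_late=0.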

Final exam: There is no final exam for this course.

Contact

Please email Jeff Phillips (jeff.m.phillips@utah.edu) and Rebecca Barter (rebecca.barter@hsc.utah.edu) with questions or for more information.