2 Variables

2.1 Introduction

A variable is anything that can be measured or counted. In general, we think of a dataset as a rectangle, with a column for each variable and a row for each observation (there are some exceptions when discussing long versus wide formatted data, but that is outside the scope of this course).

We care about four attributes of the variables:

  • the variable’s type (statistical relevance)
  • the different values the variable can take (statistical relevance)
  • whether the variable is clean (i.e. ready to use in an analysis) (statistical relevance)
  • the name of the variable (useful for us)

We can use the fourth attribute to help us remember the first three.
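
As a rough illustration of how a name can carry the other attributes, the sketch below looks up a variable's type from the prefixes introduced in the sections that follow (con_, is_, cat_, cen_, cou_). The mapping and the function name variable_type are hypothetical helpers written for this example, not part of any package.

```python
# Hypothetical mapping from the naming prefixes used in this chapter
# to the variable type that each prefix denotes.
PREFIX_TO_TYPE = {
    "con_": "continuous",
    "is_": "binary",
    "cat_": "categorical",
    "cen_": "censored",
    "cou_": "count",
}


def variable_type(name):
    """Return the type implied by a variable's prefix, or None if no prefix matches."""
    for prefix, var_type in PREFIX_TO_TYPE.items():
        if name.startswith(prefix):
            return var_type
    # No recognised prefix: treat the variable as not yet cleaned.
    return None
```

Under this convention, a name without a recognised prefix (such as sex) signals that the variable has not been cleaned yet.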

2.2 Continuous Variables

A variable is continuous if there is a meaningful “distance” between values.

For example:

  • Temperature
  • Weight
  • Height
  • BMI
  • Blood pressure

Clean continuous variables can be given the prefix con_ to denote that they are clean. For example, temperature could be called con_temperature after it has been cleaned and is ready for analysis.
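
A minimal sketch of what cleaning a continuous variable might involve, assuming the raw temperatures arrive as text with stray spaces, decimal commas, or unusable entries. The raw values and the function name clean_temperature are made up for illustration.

```python
def clean_temperature(raw_value):
    """Convert a raw temperature entry to a float, or None if it cannot be used."""
    # Assumption: raw entries may use a decimal comma and carry stray spaces.
    text = str(raw_value).strip().replace(",", ".")
    try:
        return float(text)
    except ValueError:
        return None  # unusable entry, recorded as missing


# Made-up raw rows; the unclean variable is simply called "temperature".
raw_rows = [
    {"temperature": "37.2"},
    {"temperature": " 38,5 "},
    {"temperature": "unknown"},
]

# The cleaned variable gets the con_ prefix.
clean_rows = [
    {"con_temperature": clean_temperature(row["temperature"])} for row in raw_rows
]
```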

2.3 Binary Variables

A variable is binary if it can only hold two values.

For example:

  • 0 or 1
  • True or false
  • Male or female
  • Sick or healthy
  • Born in Norway vs Born outside of Norway

Clean binary variables can be given the prefix is_ to denote that they are clean, binary, and reference the “active state”. For example, an unclean variable called sex could be recoded as 0 for female and 1 for male, then called is_male to denote that it is clean (ready for analysis), binary, and “male” is the active state when is_male=1.
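
The recoding described above could be sketched as follows, assuming the unclean sex variable is stored as the text "male" or "female" (possibly with inconsistent capitalisation). The function name recode_is_male is hypothetical.

```python
def recode_is_male(sex):
    """Recode an unclean sex value to 1 (male), 0 (female), or None (unusable)."""
    # Assumption: raw values are text such as "Male" or " female ".
    value = str(sex).strip().lower()
    if value == "male":
        return 1
    if value == "female":
        return 0
    return None  # anything else is treated as missing


# Made-up raw values, recoded into the clean is_male variable.
raw_sex = ["Male", "female", " FEMALE ", "not recorded"]
is_male = [recode_is_male(value) for value in raw_sex]
```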

2.4 Categorical Variables

A variable is categorical if there is no meaningful “distance” between values.

For example:

  • Sick or healthy
  • Born in Norway vs Born outside of Norway
  • Cancer stage (I, II, III, or IV)
  • BMI category (underweight, normal, or overweight)

Clean categorical variables can be given the prefix cat_ to denote that they are clean. For example, BMI category could be called cat_bmi after it has been cleaned and is ready for analysis.
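
One way cat_bmi might be created from a clean continuous BMI variable is sketched below. The cut-offs of 18.5 and 25 are an assumption for this example (they follow the commonly used WHO thresholds), and the function name bmi_category is hypothetical.

```python
def bmi_category(con_bmi):
    """Map a clean BMI value to one of the three categories used in this section."""
    if con_bmi is None:
        return None  # missing BMI stays missing
    # Assumed cut-offs: below 18.5 is underweight, 25 or above is overweight.
    if con_bmi < 18.5:
        return "underweight"
    if con_bmi < 25.0:
        return "normal"
    return "overweight"


# Made-up clean BMI values and the categorical variable derived from them.
con_bmi = [17.0, 22.4, 31.0, None]
cat_bmi = [bmi_category(value) for value in con_bmi]
```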

2.5 Censored Variables

Censored variables are a subset of continuous variables. They are artificially cut off (“censored”) at some point.

For example:

  • Height – if everyone over 175cm is recorded as “175+”
  • Age – if everyone aged 10 years or younger is recorded as “<=10”
  • Time alive since receiving an illness diagnosis – if there is loss to follow-up (e.g. we know only that the person lived at least 4 years before we lost track of them)

Clean censored variables can be given the prefix cen_ to denote that they are clean. For example, time alive since receiving illness diagnosis could be called cen_time_alive after it has been cleaned and is ready for analysis.
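
A common way to keep a censored variable usable is to store the recorded number alongside an indicator of whether it was censored. The sketch below does this for the height example, assuming raw values are text such as "168" or "175+". The names parse_censored_height, cen_height, and is_censored are hypothetical.

```python
def parse_censored_height(raw_value):
    """Split a raw height entry into (height in cm, 1 if censored else 0)."""
    text = str(raw_value).strip()
    # Assumption: a trailing "+" marks a value recorded at the cut-off.
    if text.endswith("+"):
        return float(text[:-1]), 1
    return float(text), 0


# Made-up raw heights in cm.
raw_heights = ["168", "175+", "160.5"]
parsed = [parse_censored_height(value) for value in raw_heights]

cen_height = [height for height, _ in parsed]
is_censored = [flag for _, flag in parsed]
```

Keeping the indicator means an analysis can tell a true height of 175cm apart from a height that is only known to be 175cm or more.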

2.6 Count Variables

Count variables are a subset of continuous variables. They can only have non-negative integer values (e.g. 0, 1, 2, 3).

For example:

  • Number of cars that use the parking lot in a day
  • Number of influenza patients who use the hospital every day
  • Number of tuberculosis patients who are screened every year

Clean count variables can be given the prefix cou_ to denote that they are clean. For example, number of cars could be called cou_num_cars after it has been cleaned and is ready for analysis.
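
Because a count can only be a whole number that is zero or larger, cleaning a count variable often amounts to checking exactly that. A minimal sketch, assuming the raw counts arrive as text; the function name clean_count and the raw values are made up for illustration.

```python
def clean_count(raw_value):
    """Return the raw entry as a non-negative integer, or None if it is not a valid count."""
    try:
        number = float(str(raw_value).strip())
    except ValueError:
        return None  # not a number at all
    # A valid count is a whole number that is zero or larger.
    if number < 0 or not number.is_integer():
        return None
    return int(number)


# Made-up raw daily car counts, cleaned into cou_num_cars.
raw_num_cars = ["12", "3.0", "2.5", "-1", "many"]
cou_num_cars = [clean_count(value) for value in raw_num_cars]
```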

2.7 Independent Versus Dependent Variables

An independent variable is often called an exposure or predictor variable. In an experiment, this variable is manipulated by the researcher.

A dependent variable is often called the outcome. In research, we generally want to see if (the following all mean the same thing):

  • The dependent variable is dependent on the independent variable
  • The predictor variable predicts the outcome
  • The exposure affects the outcome

For ease of understanding, we will use the terms “outcome” and “exposure” for the rest of this course.

2.8 Dataset Workflow (Pipeline)

  • We begin with a raw dataset (this is never altered)
  • We clean the raw dataset and create new variables as needed
  • We save a “clean” dataset (all variables have one of the prefixes con_, is_, cat_, cen_, or cou_)
  • We ONLY run analyses on the clean dataset

We do all of this in “do files” that allow us to recreate the clean dataset from the raw dataset.

Think of our analysis as making dinner, with the do files as our recipe and the raw dataset as our raw ingredients. The recipe tells us how to turn the raw ingredients (raw dataset) into prepared ingredients (clean dataset) and how to cook (analyse) the prepared ingredients to produce the food (results).

All we need are raw ingredients and the recipe! The prepared ingredients and the food are downstream by-products!

This means that if our code is written correctly, we can delete our clean datasets and results without any worry, because the raw dataset and do files are sufficient.
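
The idea can be sketched in a few lines. Python is used here purely for illustration (in this course the same steps would live in do files), and the dataset, the function names clean and analyse, and the choice of analysis are all made up for the example.

```python
import copy
from statistics import mean

# The raw dataset: our raw ingredients. It is never altered.
RAW_DATASET = [
    {"sex": "male", "temperature": "37.0"},
    {"sex": "female", "temperature": "38.0"},
    {"sex": "female", "temperature": "39.0"},
]


def clean(raw_rows):
    """Build a new clean dataset from the raw rows without modifying them."""
    clean_rows = []
    for row in raw_rows:
        clean_rows.append({
            # Assumption: in this toy dataset sex is always "male" or "female".
            "is_male": 1 if row["sex"] == "male" else 0,
            "con_temperature": float(row["temperature"]),
        })
    return clean_rows


def analyse(clean_rows):
    """Run the analysis on the clean dataset only."""
    temperatures = [row["con_temperature"] for row in clean_rows]
    return {"mean_temperature": mean(temperatures)}


raw_before = copy.deepcopy(RAW_DATASET)

# First run of the recipe.
results = analyse(clean(RAW_DATASET))

# "Delete" the clean dataset and results, then recreate them from the raw data.
rerun_results = analyse(clean(RAW_DATASET))
```

Because clean builds new rows instead of editing the raw ones, running the recipe again from the raw dataset gives the same results, which is what makes the clean dataset and the results safe to delete.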