2.1 Introduction
A variable is anything that can be measured or counted. In general, we think of our datasets as a rectangle, with a column for each variable and a row for each observation (there are some exceptions when discussing long/wide formatted data, but that is out of the scope of this course).
We care about four attributes of the variables:
- the variable’s type (statistical relevance)
- the different values the variable can take (statistical relevance)
- whether the variable is clean (i.e. ready to use in an analysis) (statistical relevance)
- the name of the variable (useful for us)
We can use the fourth attribute to help us remember the first three.
2.2 Continuous Variables
A variable is continuous if there is a meaningful “distance” between values.
For example:
- Temperature
- Weight
- Height
- BMI
- Blood pressure
Clean continuous variables can be given the prefix con_ to denote that they are clean. For example, temperature could be called con_temperature after it has been cleaned and is ready for analysis.
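As a minimal sketch of what cleaning a continuous variable might involve, the following Python snippet converts raw text entries to numbers. The raw values, the clean_temperature helper, and the choice to treat blank entries as missing are all assumptions made for illustration; they are not part of the course material.

```python
# Hypothetical raw column: temperatures typed in as text, with one blank entry.
raw_temperature = ["36.6", "38.2", "", "37.0"]


def clean_temperature(value):
    """Convert one raw text value to a number; blanks become missing (None)."""
    text = value.strip()
    if text == "":
        return None
    return float(text)


# The cleaned column gets the con_ prefix to mark it as clean and continuous.
con_temperature = [clean_temperature(v) for v in raw_temperature]

# Because the variable is continuous, the distance between two values is meaningful.
difference = con_temperature[1] - con_temperature[0]
```

Keeping the raw column untouched and writing the cleaned values to a new, prefixed name makes it obvious which version is safe to analyse.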
2.3 Binary Variables
A variable is binary if it can only hold two values.
For example:
- 0 or 1
- True or false
- Male or female
- Sick or healthy
- Born in Norway vs Born outside of Norway
Clean binary variables can be given the prefix is_ to denote that they are clean, binary, and reference the “active state”. For example, an unclean variable called sex could be recoded as 0 for female and 1 for male, then called is_male to denote that it is clean (ready for analysis), binary, and “male” is the active state when is_male=1.
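The recoding described above could be sketched in Python as follows. The raw spellings, the SEX_CODES lookup, and the recode_is_male helper are hypothetical; only the 0 for female / 1 for male coding comes from the text.

```python
# Hypothetical raw column with inconsistent spellings and one blank entry.
raw_sex = ["female", "Male", "FEMALE", "male", ""]

# Coding taken from the text: 0 for female, 1 for male.
SEX_CODES = {"female": 0, "male": 1}


def recode_is_male(value):
    """Return 1 for male, 0 for female, and None for anything unrecognised."""
    return SEX_CODES.get(value.strip().lower())


# The is_ prefix marks the result as clean and binary, with "male" as the active state.
is_male = [recode_is_male(v) for v in raw_sex]
```

Unrecognised entries are kept as missing rather than silently assigned to one of the two groups, which is one possible design choice among several.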
2.4 Categorical Variables
A variable is categorical if there is no meaningful “distance” between values.
For example:
- Sick or healthy
- Born in Norway vs Born outside of Norway
- Cancer stage (I, II, III, or IV)
- BMI category (underweight, normal, or overweight)
Clean categorical variables can be given the prefix cat_ to denote that they are clean. For example, BMI category could be called cat_bmi after it has been cleaned and is ready for analysis.
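One way a cat_bmi variable might be derived is by grouping a clean continuous BMI variable, sketched here in Python. The cut-offs of 18.5 and 25 are commonly used thresholds but are an assumption here, since the text does not state them; the bmi_category helper and the example values are likewise hypothetical.

```python
def bmi_category(bmi):
    """Map a BMI value to one of the three categories named in the text.

    The 18.5 and 25 cut-offs are assumed for illustration.
    """
    if bmi is None:
        return None  # missing BMI stays missing
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "normal"
    return "overweight"


# Hypothetical clean continuous variable, including one missing value.
con_bmi = [17.9, 22.4, 31.0, None]

# The cat_ prefix marks the derived variable as clean and categorical.
cat_bmi = [bmi_category(b) for b in con_bmi]
```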
2.5 Censored Variables
Censored variables are a subset of continuous variables. They are artificially cut off (“censored”) at some point.
For example:
- Height – if everyone over 175cm is recorded as “175+”
- Age – if everyone aged 10 years or younger is recorded as “<=10”
- Time alive since receiving illness diagnosis if there is loss to follow-up (i.e. we know that the person has lived at least 4 years before we lost track of them)
Clean censored variables can be given the prefix cen_ to denote that they are clean. For example, time alive since receiving illness diagnosis could be called cen_time_alive after it has been cleaned and is ready for analysis.
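A censored value only makes sense alongside the information that it was censored. The Python sketch below stores that information in a companion binary variable. The raw records, the status labels, and the is_lost_to_followup name are assumptions for illustration and do not come from the course material.

```python
# Hypothetical raw follow-up records: observed time and why observation stopped.
raw_followup = [
    {"years_observed": 2.5, "status": "died"},
    {"years_observed": 4.0, "status": "lost"},  # lost to follow-up after 4 years
    {"years_observed": 6.1, "status": "died"},
]

# Clean censored variable: the time we actually observed each person alive.
cen_time_alive = [r["years_observed"] for r in raw_followup]

# Companion binary variable: 1 means the recorded time is only a lower bound.
is_lost_to_followup = [1 if r["status"] == "lost" else 0 for r in raw_followup]


def describe(time, lost):
    """Spell out what the recorded time tells us about the true survival time."""
    return f"at least {time} years" if lost else f"{time} years"


descriptions = [
    describe(t, lost) for t, lost in zip(cen_time_alive, is_lost_to_followup)
]
```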
2.6 Count Variables
Count variables are a subset of continuous variables. They can only have integer values (e.g. 0, 1, 2, 3).
For example:
- Number of cars that use the parking lot in a day
- Number of influenza patients who use the hospital every day
- Number of tuberculosis patients who are screened every year
Clean count variables can be given the prefix cou_ to denote that they are clean. For example, number of cars could be called cou_num_cars after it has been cleaned and is ready for analysis.
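Cleaning a count variable might include checking that every value really is a whole number, as in this Python sketch. The raw values and the clean_count helper are hypothetical, and rejecting negative numbers is an added assumption (a count of cars cannot be below zero).

```python
# Hypothetical raw column: some entries are not valid counts.
raw_num_cars = ["12", "0", "7", "3.5", "-2"]


def clean_count(value):
    """Return a non-negative integer, or None if the value is not a valid count."""
    try:
        number = int(value)  # raises ValueError for text such as "3.5"
    except ValueError:
        return None
    return number if number >= 0 else None


# The cou_ prefix marks the result as a clean count variable.
cou_num_cars = [clean_count(v) for v in raw_num_cars]
```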
2.7 Independent Versus Dependent Variables
An independent variable is often called an exposure or predictor variable. In an experiment, this variable is manipulated by the researcher.
A dependent variable is often called the outcome. In research, we generally want to see if (the following all mean the same thing):
- The dependent variable is dependent on the independent variable
- The predictor variable predicts the outcome
- The exposure affects the outcome
For ease of understanding, we will use the terms “outcome” and “exposure” for the rest of this course.
2.8 Dataset workflow (pipeline)
- We begin with a raw dataset (this is never altered)
- We clean the raw dataset and create new variables as needed
- We save a “clean” dataset (all variables have one of the prefixes con_, cat_, cen_, cou_, or is_)
- We ONLY run analyses on the clean dataset
We do all of this in “do files” that allow us to recreate the clean dataset from the raw dataset.
Think of our analysis as making dinner, with the do files as our recipe and the raw dataset as our raw ingredients. The recipe tells us how to prepare the raw ingredients (producing the clean dataset) and how to cook (analyse) the prepared ingredients (clean dataset) to produce the food (results).
All we need are the raw ingredients and the recipe! The prepared ingredients and the food are downstream by-products!
This means that if our code is written correctly, we can delete our clean datasets and results without any worry, because the raw dataset and do files are sufficient.
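The workflow can be illustrated with a small, self-contained Python sketch in which two functions play the role of the do files. The dataset, the clean and analyse functions, and the choice of analysis (mean temperature among males) are all hypothetical; the course's own do files would express the same steps in their own syntax.

```python
# Raw dataset: stored as an immutable tuple and never altered.
RAW_DATASET = (
    {"sex": "female", "temperature": "36.6"},
    {"sex": "male", "temperature": "38.2"},
    {"sex": "male", "temperature": "37.0"},
)


def clean(raw_rows):
    """Build the clean dataset from the raw one without modifying the raw rows."""
    return [
        {
            "is_male": 1 if row["sex"] == "male" else 0,
            "con_temperature": float(row["temperature"]),
        }
        for row in raw_rows
    ]


def analyse(clean_rows):
    """Mean temperature among males, computed only from clean variables."""
    temps = [r["con_temperature"] for r in clean_rows if r["is_male"] == 1]
    return sum(temps) / len(temps)


# First run: raw dataset -> clean dataset -> results.
clean_dataset = clean(RAW_DATASET)
result = analyse(clean_dataset)

# Deleting the by-products is harmless: rerunning the "recipe" recreates them.
del clean_dataset, result
clean_dataset = clean(RAW_DATASET)
result = analyse(clean_dataset)
```

Because both steps are plain functions of the raw data, running them again always reproduces the same clean dataset and the same result.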