4  Simple regression (fixed effects)

4.1 Regression in general

Regression is the explicit modelling of a parametric association between an outcome and an exposure.

One such parametric association might be the following:

\[\text{outcome} = 3 + 2 \times \text{exposure}\]
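Read as a recipe, this says: take the exposure, multiply it by 2, add 3, and you get the expected outcome. A minimal Python sketch (the exposure values below are made up purely for illustration):

```python
# Evaluating the parametric association: outcome = 3 + 2 * exposure
for exposure in [0, 1, 5, 10]:          # made-up exposure values
    outcome = 3 + 2 * exposure
    print(f"exposure = {exposure:>2} -> outcome = {outcome}")
```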

Depending on the type of outcome, different types of regression will need to be used.

For all regressions, the exposure can be:

  • Continuous
  • Binary (0 or 1)
  • Categorical (0, 1, 2, …)
  • Count data

Regressions can both:

  • Perform hypothesis testing (same as the previous tests we have learned about)
  • Estimate numerically the effect size of the association between outcome and exposure (new!)

4.2 Linear regression

In the most basic form, we have:

\[\text{outcome} = \beta_0 + \beta_1 \times \text{exposure} + \text{error}\]

Where we aim to estimate values for \(\beta_0\) and \(\beta_1\).

For example, if we run an ice cream shop:

\[\text{number of ice creams sold} = 5 + 3 \times \text{temperature} + \text{error}\]

If today’s temperature is 30°C, we can expect our shop to sell \(5 + 3 \times 30 = 95\) ice creams. Because \(\beta_1\) (\(=3\)) was not zero, we have a significant association between temperature and number of ice creams sold.

Another example, if we work as a midwife:

\[\text{Child's birthweight} = 3 + 0 \times \text{temperature at day of delivery} + \text{error}\]

If today’s temperature is 30°C, we can expect that children born today will weigh (on average) \(3 + 0 \times 30 = 3\) kg. If tomorrow’s temperature is 10°C, we can expect that children born tomorrow will weigh (on average) \(3 + 0 \times 10 = 3\) kg. Because \(\beta_1\) was zero, we do not have a significant association between temperature and birthweight.
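Here is a minimal sketch in Python of how such a linear regression could be fitted in practice (this assumes the numpy, pandas, and statsmodels packages; the ice cream data are simulated, not real sales figures):

```python
# Sketch: fitting "ice creams sold = b0 + b1 * temperature + error" on simulated data
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=1)
n = 100
temperature = rng.uniform(10, 35, size=n)               # degrees Celsius
sold = 5 + 3 * temperature + rng.normal(0, 5, size=n)   # true b0 = 5, b1 = 3, plus error

df = pd.DataFrame({"temperature": temperature, "sold": sold})
fit = smf.ols("sold ~ temperature", data=df).fit()

print(fit.params)    # estimates of b0 ("Intercept") and b1 ("temperature")
print(fit.pvalues)   # p-values for H0: coefficient = 0
print(fit.predict(pd.DataFrame({"temperature": [30]})))  # expected sales at 30 degrees
```

The estimated coefficients will be close to (but not exactly) 5 and 3, because the model has to estimate them from noisy data.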

4.2.1 Aim/Outcome/Exposure/Parametric/Dependencies

Aim: Hypothesis testing and estimating the effect size of the association between outcome and exposure

Outcome: Continuous variable

Exposure: Continuous, Binary, Categorical, Count variable

Parametric assumptions: Residuals follow a Normal distribution

Dependencies: None (all observations independent)

4.2.2 Example 1

\(\rightarrow\) Testing if average birth weight (continuous outcome) is associated with parents’ income (continuous exposure)

\[\text{birth weight} = \beta_0 + \beta_1 \times \text{parents' income} + \text{error}\]

\[\text{H}_0: \beta_1 = 0\] \[\text{H}_1: \beta_1 \ne 0\]

4.2.3 Example 2

\(\rightarrow\) Testing if average birth weight (continuous outcome) is associated with child’s sex (binary exposure)

\[\text{birth weight} = \beta_0 + \beta_1 \times \text{is boy} + \text{error}\]

\[\text{H}_0: \beta_1 = 0\] \[\text{H}_1: \beta_1 \ne 0\]

4.2.4 Example 3

\(\rightarrow\) Testing if average BMI levels (continuous outcome) differ across Scandinavia (categorical exposure)

\[\text{bmi} = \beta_0 + \beta_1 \times \text{is Norway} + \beta_2 \times \text{is Sweden} + \text{error}\]

\[\text{H}_0: \beta_1 = \beta_2 = 0\] \[\text{H}_1: \beta_1 \ne 0 \text{ and/or } \beta_2 \ne 0\]
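A minimal sketch in Python of Example 3 (assuming numpy, pandas, and statsmodels; the BMI values and country labels are simulated). `C(country)` asks statsmodels to create the 0/1 dummy variables, and the regression F-test is the joint test of \(\beta_1 = \beta_2 = 0\):

```python
# Sketch: "bmi = b0 + b1 * is_Norway + b2 * is_Sweden + error", Denmark as reference
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=2)
country = rng.choice(["Denmark", "Norway", "Sweden"], size=300)
bmi = 25 + 0.5 * (country == "Norway") + 1.0 * (country == "Sweden") + rng.normal(0, 3, size=300)

df = pd.DataFrame({"country": country, "bmi": bmi})
fit = smf.ols("bmi ~ C(country)", data=df).fit()   # Denmark (alphabetically first) is the reference

print(fit.params)    # b1, b2: mean differences from the reference country
print(fit.f_pvalue)  # joint test of H0: b1 = b2 = 0
```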

4.2.5 Example 4 (student fills in)

\(\rightarrow\)


\(H_0:\)


\(H_1:\)


4.2.6 Example 5 (student fills in)

\(\rightarrow\)


\(H_0:\)


\(H_1:\)


4.3 Similarities between t-tests, ANOVA, and linear regression

4.3.1 Example 1

Two-sample unpaired t-test:

\(\rightarrow\) Testing if average birth weight (continuous outcome) is different in female children versus male children

\[\text{H}_0: \mu_{\text{boys}} = \mu_{\text{girls}}\] \[\text{H}_1: \mu_{\text{boys}} \ne \mu_{\text{girls}}\]

ANOVA:

\(\rightarrow\) Testing if average birth weight (continuous outcome) is different in female children versus male children

\[\text{H}_0: \mu_{\text{boys}} = \mu_{\text{girls}}\] \[\text{H}_1: \mu_{\text{boys}} \ne \mu_{\text{girls}}\]

Linear regression:

\(\rightarrow\) Testing if the effect of child’s sex on average birth weight (continuous outcome) is different from zero

\[\text{birth weight} = \beta_0 + \beta_1 \times \text{is boy} + \text{error}\] \[\text{H}_0: \beta_1 = 0\] \[\text{H}_1: \beta_1 \ne 0\]

Conclusion:

  • Two-sample unpaired t-tests are ANOVAs with only two groups
  • Two-sample unpaired t-tests are linear regressions with a binary (0/1) exposure
  • ANOVA is a linear regression with a categorical exposure
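This equivalence is easy to check numerically. A minimal sketch in Python (simulated birth weights; assumes numpy, pandas, scipy, and statsmodels), where all three approaches return the same p-value:

```python
# Sketch: t-test, one-way ANOVA, and OLS with a binary exposure give the same p-value
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(seed=3)
girls = rng.normal(3.5, 0.5, size=200)   # simulated birth weights (kg)
boys = rng.normal(3.6, 0.5, size=200)

t_p = stats.ttest_ind(boys, girls).pvalue   # two-sample unpaired t-test (equal variances)
f_p = stats.f_oneway(boys, girls).pvalue    # ANOVA with two groups

df = pd.DataFrame({
    "weight": np.concatenate([girls, boys]),
    "is_boy": [0] * 200 + [1] * 200,
})
reg_p = smf.ols("weight ~ is_boy", data=df).fit().pvalues["is_boy"]

print(t_p, f_p, reg_p)   # identical (up to rounding)
```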

4.3.2 Example 2

Two-sample unpaired t-test:

\(\rightarrow\) Testing if average number of hours sleep (continuous outcome) is different in adults who are parents versus those who are childless

\[\text{H}_0: \mu_{\text{parents}} = \mu_{\text{childless}}\] \[\text{H}_1: \mu_{\text{parents}} \ne \mu_{\text{childless}}\]

ANOVA:

\(\rightarrow\) Testing if average number of hours sleep (continuous outcome) is different in adults who are parents versus those who are childless

\[\text{H}_0: \mu_{\text{parents}} = \mu_{\text{childless}}\] \[\text{H}_1: \mu_{\text{parents}} \ne \mu_{\text{childless}}\]

Linear regression:

\(\rightarrow\) Testing if the effect of being a parent on average number of hours sleep (continuous outcome) is different from zero

\[\text{hours of sleep} = \beta_0 + \beta_1 \times \text{is parent} + \text{error}\] \[\text{H}_0: \beta_1 = 0\] \[\text{H}_1: \beta_1 \ne 0\]

Conclusion:

  • Two-sample unpaired t-tests are ANOVAs with only two groups
  • Two-sample unpaired t-tests are linear regressions with a binary (0/1) exposure
  • ANOVA is a linear regression with a categorical exposure

4.3.3 Example 3 (student fills in)

Two-sample unpaired t-test:

\(\rightarrow\)


\(H_0:\)


\(H_1:\)


ANOVA:

\(\rightarrow\)


\(H_0:\)


\(H_1:\)


Linear regression:

\(\rightarrow\)


\(H_0:\)


\(H_1:\)


4.4 Similarities between ANOVA and linear regression

4.4.1 Example 1

ANOVA:

\(\rightarrow\) Testing if average birth weight (continuous outcome) differs between Scandinavian countries

\[\text{H}_0: \mu_{\text{Norway}} = \mu_{\text{Denmark}} = \mu_{\text{Sweden}}\] \[\text{H}_1: \mu_{\text{Norway}} \ne \mu_{\text{Denmark}} \text{ and/or } \mu_{\text{Norway}} \ne \mu_{\text{Sweden}} \text{ and/or } \mu_{\text{Denmark}} \ne \mu_{\text{Sweden}}\]

Linear regression:

\(\rightarrow\) Testing if the effect of country on average birth weight (continuous outcome) is different from zero

\[\text{birth weight} = \beta_0 + \beta_1 \times \text{is Norway} + \beta_2 \times \text{is Denmark} + \text{error}\] \[\text{H}_0: \beta_1 = \beta_2 = 0\] \[\text{H}_1: \beta_1 \ne 0 \text{ and/or } \beta_2 \ne 0\]

Conclusion:

  • ANOVA is a linear regression with a categorical exposure
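The same check works with three groups. A minimal sketch in Python (simulated birth weights for the three countries; assumes numpy, pandas, scipy, and statsmodels):

```python
# Sketch: one-way ANOVA equals the F-test from OLS with a categorical exposure
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

rng = np.random.default_rng(seed=4)
groups = {c: rng.normal(3.5, 0.5, size=150) for c in ["Norway", "Denmark", "Sweden"]}

anova_p = stats.f_oneway(*groups.values()).pvalue

df = pd.DataFrame({
    "weight": np.concatenate(list(groups.values())),
    "country": np.repeat(list(groups.keys()), 150),
})
reg_p = smf.ols("weight ~ C(country)", data=df).fit().f_pvalue

print(anova_p, reg_p)   # the two p-values are the same
```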

4.4.2 Example 2 (student fills in)

ANOVA:

\(\rightarrow\)


\(H_0:\)


\(H_1:\)


Linear regression:

\(\rightarrow\)


\(H_0:\)


\(H_1:\)


4.5 Logistic regression models

Logistic regression is essentially the same as linear regression, but it is used when:

  • You have a binary (0/1) outcome
  • You are doing a case-control study [case-control studies can ONLY be analysed using logistic regression]

4.5.1 Aim/Outcome/Exposure/Parametric/Dependencies

Aim: Hypothesis testing and estimating the effect size of the association between outcome and exposure

Outcome: Binary variable

Exposure: Continuous, Binary, Categorical, Count variable

Parametric assumptions: No

Dependencies: None (all observations independent)

4.5.2 Example 1

\(\rightarrow\) Testing if the percentage of women (binary outcome) differs across the bydeler (city districts) of Oslo (categorical exposure)

\[\text{log}\left(\frac{\text{Pr(Is woman)}}{\text{Pr(Is man)}}\right) = \beta_0 + \beta_1 \times \text{bydel}_1 + \beta_2 \times \text{bydel}_2 + \beta_3 \times \text{bydel}_3 + \text{error}\]

\[\text{H}_0: \beta_1 = \beta_2 = \beta_3 = 0\] \[\text{H}_1: \beta_1 \ne 0 \text{ and/or } \beta_2 \ne 0 \text{ and/or } \beta_3 \ne 0\]

4.5.3 Example 2

\(\rightarrow\) Testing if risk of unemployment (binary outcome) is associated with parents’ income (continuous exposure)

\[\text{log}\left(\frac{\text{Pr(Is unemployed)}}{\text{Pr(Is employed)}}\right) = \beta_0 + \beta_1 \times \text{parents' income} + \text{error}\]

\[\text{H}_0: \beta_1 = 0\] \[\text{H}_1: \beta_1 \ne 0\]
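A minimal sketch in Python of Example 2 (assumes numpy, pandas, and statsmodels; the income and unemployment data are simulated). The coefficients are on the log-odds scale, so exponentiating them gives odds ratios:

```python
# Sketch: logistic regression of unemployment (0/1) on parents' income, simulated data
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=5)
n = 1000
income = rng.normal(500, 100, size=n)        # parents' income (arbitrary units)
log_odds = 0.5 - 0.003 * income              # true b0 = 0.5, b1 = -0.003
p_unemployed = 1 / (1 + np.exp(-log_odds))
unemployed = rng.binomial(1, p_unemployed)   # binary outcome

df = pd.DataFrame({"income": income, "unemployed": unemployed})
fit = smf.logit("unemployed ~ income", data=df).fit()

print(fit.params)              # b0, b1 on the log-odds scale
print(np.exp(fit.params))      # exponentiated: odds ratios
print(fit.pvalues["income"])   # test of H0: b1 = 0
```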

4.5.4 Example 3

\(\rightarrow\) Testing if risk of smoking (binary outcome) is associated with parents’ smoking status (binary exposure)

\[\text{log}\left(\frac{\text{Pr(Is smoker)}}{\text{Pr(Is not smoker)}}\right) = \beta_0 + \beta_1 \times \text{parents are smokers} + \text{error}\]

\[\text{H}_0: \beta_1 = 0\] \[\text{H}_1: \beta_1 \ne 0\]

4.5.5 Example 4 (student fills in)

\(\rightarrow\)


\(H_0:\)


\(H_1:\)


4.5.6 Example 5 (student fills in)

\(\rightarrow\)


\(H_0:\)


\(H_1:\)


4.6 Poisson/negative-binomial regression models

Poisson/negative-binomial regression is essentially the same as linear regression, but it is used when:

  • You have a count outcome

Negative-binomial regression is a more flexible version of Poisson regression. Poisson regression requires that the residual variance (after fitting the model) is equal to the expected mean. This is quite often not the case (the data are then said to be overdispersed). Negative-binomial regression fits the variance and the mean separately, removing this problem. It is therefore recommended that you always use a negative-binomial regression instead of a Poisson regression. The only exception is if you encounter statistical errors with the negative-binomial regression (i.e. it won’t converge/run); then a Poisson regression is your only option.
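A minimal sketch in Python of the difference (assumes numpy, pandas, and statsmodels' formula interface; the counts are simulated so that the variance is larger than the mean):

```python
# Sketch: Poisson vs negative-binomial regression on simulated, overdispersed counts
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=6)
n_obs = 500
exposure = rng.binomial(1, 0.5, size=n_obs)            # e.g. a 0/1 period indicator
mu = np.exp(2.0 + 0.4 * exposure)                      # expected count
counts = rng.negative_binomial(n=5, p=5 / (5 + mu))    # variance > mean (overdispersion)

df = pd.DataFrame({"counts": counts, "exposure": exposure})

poisson_fit = smf.poisson("counts ~ exposure", data=df).fit()
negbin_fit = smf.negativebinomial("counts ~ exposure", data=df).fit()

# Similar coefficient estimates, but the Poisson model understates the uncertainty
print(poisson_fit.params["exposure"], poisson_fit.bse["exposure"])
print(negbin_fit.params["exposure"], negbin_fit.bse["exposure"])
```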

4.6.1 Aim/Outcome/Exposure/Parametric/Dependencies

Aim: Hypothesis testing and estimating the effect size of the association between outcome and exposure

Outcome: Count variable

Exposure: Continuous, Binary, Categorical, Count variable

Parametric assumptions for Poisson: Mean equals variance

Parametric assumptions for negative-binomial: No

Dependencies: None (all observations independent)

4.6.2 Example 1

\(\rightarrow\) Testing if average number of influenza cases (count outcome) is different between 2000-2009 and 2010-2015 (binary exposure) in Norway

\[\text{log}\left(\text{yearly number of influenza cases}\right) = \beta_0 + \beta_1 \times \text{is 2010 to 2015} + \text{error}\]

\[\text{H}_0: \beta_1 = 0\] \[\text{H}_1: \beta_1 \ne 0\]

4.6.3 Example 2 (student fills in)

\(\rightarrow\)


\(H_0:\)


\(H_1:\)


4.7 Cox regression models

Cox regression models should be used when your outcome is “time-to-event”.

The most common example of this is when you are following a cohort of people over time, trying to observe an event (e.g. sickness, death, response to treatment). Your outcome is “length of time until person X gets disease Y”. However, a number of your participants stop responding at some point, so you only know “person X was healthy up until 200 days, when we lost contact”. Thus person X’s outcome has been censored at day 200.

4.7.1 Aim/Outcome/Exposure/Parametric/Dependencies

Aim: Hypothesis testing and estimating the effect size of the association between outcome and exposure

Outcome: Censored variable (time-to-event)

Exposure: Continuous, Binary, Categorical, Count variable

Parametric assumptions: Proportional hazards

Dependencies: None (all observations independent)

4.7.2 Example 1

\(\rightarrow\) Testing if time-to-death (outcome) is associated with having a hospital-acquired-infection after hip surgery (binary exposure)

\[\lambda(t | X_i) = \lambda_0(t) \times \text{exp}(\beta_1 \times \text{had HAI})\]

Where \(\lambda(t | X_i)\) is the hazard rate of dying at time \(t\) for subject \(i\).

\[\text{H}_0: \beta_1 = 0\] \[\text{H}_1: \beta_1 \ne 0\]
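A minimal sketch in Python of Example 1 (this assumes the lifelines package for the Cox model, plus numpy and pandas; the follow-up data are simulated, including random censoring):

```python
# Sketch: Cox regression with a binary exposure (hospital-acquired infection), simulated data
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(seed=7)
n = 400
had_hai = rng.binomial(1, 0.3, size=n)                            # 0/1 exposure
true_time = rng.exponential(scale=1000 / np.exp(0.7 * had_hai))   # exposed die sooner on average
censor_time = rng.uniform(0, 1500, size=n)                        # loss to follow-up

time = np.minimum(true_time, censor_time)                         # observed follow-up (days)
died = (true_time <= censor_time).astype(int)                     # 1 = death observed, 0 = censored

df = pd.DataFrame({"time": time, "died": died, "had_hai": had_hai})

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="died")
cph.print_summary()   # exp(coef) for had_hai is the hazard ratio; its p-value tests H0: beta1 = 0
```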

4.7.3 Example 2 (student fills in)

\(\rightarrow\)


\(H_0:\)


\(H_1:\)