## 4.1 Aim

We are given a dataset containing counts of diseases from multiple geographical areas. We want to identify:

• Is there a general yearly trend (i.e. increasing or decreasing from year to year?)
• Is variable `x` associated with the outcome?

## 4.2 Creating the data

The data for this chapter is available at: https://www.csids.no/longitudinal-analysis-for-surveillance/data/chapter_4.csv

``````library(data.table)
library(lme4)``````
``Loading required package: Matrix``
``````set.seed(4)

fylkeIntercepts <- data.table(fylke=1:20,fylkeIntercepts=rnorm(20))

d <- data.table(fylke=rep(1:20,each=100))
d <- merge(d,fylkeIntercepts,by="fylke")
d[,mainIntercept:=3]
d[,x:=runif(.N)]
d[,year:=sample(c(1950:2018),.N,replace=T)]
d[,mu := exp(mainIntercept + fylkeIntercepts + 3*x)]
d[,y:=rpois(.N,mu)]``````

## 4.3 Investigating the data

We can see from the data that we have 20 geographical areas (`fylke`) with 100 observations for each fylke, but the sampling did not happen consistently (some years have multiple measurements, other years have no measurements).

This means we have:

• multiple geographical areas
• multiple observations in each geographical area
• not panel data
``print(d)``
``````      fylke fylkeIntercepts mainIntercept          x year        mu   y
1:     1       0.2167549             3 0.93831909 1961 416.42739 407
2:     1       0.2167549             3 0.24217109 1960  51.58692  41
3:     1       0.2167549             3 0.56559453 1999 136.12022 132
4:     1       0.2167549             3 0.18089910 1963  42.92490  37
5:     1       0.2167549             3 0.90449929 2001 376.24959 423
---
1996:    20      -0.2834446             3 0.89237059 1958 220.00872 230
1997:    20      -0.2834446             3 0.80522348 1959 169.39375 173
1998:    20      -0.2834446             3 0.59989167 1974  91.49007  95
1999:    20      -0.2834446             3 0.04148228 1950  17.13293  19
2000:    20      -0.2834446             3 0.77673920 1971 155.51980 164``````

## 4.4 Regression

For this scenario, we use the `lme4::glmer` function in R. We need to introduce a `(1|fylke)` term to identify the geographical areas (i.e. clusters). In STATA we use the `meglm` function and introduce a `|| fylke:` term to identify the geographical areas (i.e. clusters).

``````// STATA CODE STARTS
insheet using "chapter_5.csv", clear

gen yearMinus2000 = year-2000
meglm y x yearMinus2000 || fylke:, family(poisson)
// STATA CODE ENDS``````
``````# R CODE
d[,yearMinus2000:=year-2000]
summary(fit <- lme4::glmer(y~x + yearMinus2000 + (1|fylke),data=d,family=poisson()))``````
``````Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: poisson  ( log )
Formula: y ~ x + yearMinus2000 + (1 | fylke)
Data: d

AIC      BIC   logLik deviance df.resid
15502.5  15524.9  -7747.2  15494.5     1996

Scaled residuals:
Min      1Q  Median      3Q     Max
-3.3647 -0.6802 -0.0153  0.6681  4.3638

Random effects:
Groups Name        Variance Std.Dev.
fylke  (Intercept) 0.6114   0.7819
Number of obs: 2000, groups:  fylke, 20

Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept)   3.375e+00  1.749e-01  19.298   <2e-16 ***
x             3.002e+00  6.000e-03 500.407   <2e-16 ***
yearMinus2000 8.884e-05  7.270e-05   1.222    0.222
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
(Intr) x
x           -0.024
yearMns2000  0.006  0.033
optimizer (Nelder_Mead) convergence code: 0 (OK)
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?
Model is nearly unidentifiable: large eigenvalue ratio
- Rescale variables?``````

You can see that the format of the results is the same as an ordinary regression.