4  Not panel data: Multiple areas

4.1 Aim

We are given a dataset containing counts of diseases from multiple geographical areas. We want to identify:

  • Is there a general yearly trend (i.e. increasing or decreasing from year to year?)
  • Is variable x associated with the outcome?

4.2 Creating the data

The data for this chapter is available at: https://www.csids.no/longitudinal-analysis-for-surveillance/data/chapter_4.csv

library(data.table)
library(lme4)
Loading required package: Matrix
set.seed(4)

fylkeIntercepts <- data.table(fylke=1:20,fylkeIntercepts=rnorm(20))

d <- data.table(fylke=rep(1:20,each=100))
d <- merge(d,fylkeIntercepts,by="fylke")
d[,mainIntercept:=3]
d[,x:=runif(.N)]
d[,year:=sample(c(1950:2018),.N,replace=T)]
d[,mu := exp(mainIntercept + fylkeIntercepts + 3*x)]
d[,y:=rpois(.N,mu)]

4.3 Investigating the data

We can see from the data that we have 20 geographical areas (fylke) with 100 observations for each fylke, but the sampling did not happen consistently (some years have multiple measurements, other years have no measurements).

This means we have:

  • multiple geographical areas
  • multiple observations in each geographical area
  • not panel data
print(d)
      fylke fylkeIntercepts mainIntercept          x year        mu   y
   1:     1       0.2167549             3 0.93831909 1961 416.42739 407
   2:     1       0.2167549             3 0.24217109 1960  51.58692  41
   3:     1       0.2167549             3 0.56559453 1999 136.12022 132
   4:     1       0.2167549             3 0.18089910 1963  42.92490  37
   5:     1       0.2167549             3 0.90449929 2001 376.24959 423
  ---                                                                  
1996:    20      -0.2834446             3 0.89237059 1958 220.00872 230
1997:    20      -0.2834446             3 0.80522348 1959 169.39375 173
1998:    20      -0.2834446             3 0.59989167 1974  91.49007  95
1999:    20      -0.2834446             3 0.04148228 1950  17.13293  19
2000:    20      -0.2834446             3 0.77673920 1971 155.51980 164

4.4 Regression

For this scenario, we use the lme4::glmer function in R. We need to introduce a (1|fylke) term to identify the geographical areas (i.e. clusters). In STATA we use the meglm function and introduce a || fylke: term to identify the geographical areas (i.e. clusters).

// STATA CODE STARTS
insheet using "chapter_5.csv", clear

gen yearMinus2000 = year-2000
meglm y x yearMinus2000 || fylke:, family(poisson)
// STATA CODE ENDS
# R CODE
d[,yearMinus2000:=year-2000]
summary(fit <- lme4::glmer(y~x + yearMinus2000 + (1|fylke),data=d,family=poisson()))
Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: poisson  ( log )
Formula: y ~ x + yearMinus2000 + (1 | fylke)
   Data: d

     AIC      BIC   logLik deviance df.resid 
 15502.5  15524.9  -7747.2  15494.5     1996 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.3647 -0.6802 -0.0153  0.6681  4.3638 

Random effects:
 Groups Name        Variance Std.Dev.
 fylke  (Intercept) 0.6114   0.7819  
Number of obs: 2000, groups:  fylke, 20

Fixed effects:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)   3.375e+00  1.749e-01  19.298   <2e-16 ***
x             3.002e+00  6.000e-03 500.407   <2e-16 ***
yearMinus2000 8.884e-05  7.270e-05   1.222    0.222    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
            (Intr) x     
x           -0.024       
yearMns2000  0.006  0.033
optimizer (Nelder_Mead) convergence code: 0 (OK)
Model is nearly unidentifiable: very large eigenvalue
 - Rescale variables?
Model is nearly unidentifiable: large eigenvalue ratio
 - Rescale variables?

You can see that the format of the results is the same as an ordinary regression.