# R CODE
fit1 <- glm(y~yearMinus2000 + dailyrainfall + sin365 + cos365, data=d, family=poisson())
residuals(fit1, type = "response")
1.1 Introduction
There are two important definitions in this course:
- Panel data
- Autocorrelation
Panel data is a set of data in which measurements are repeated at equally spaced points in time. For example, weight data recorded every day, every week, or every year would be considered panel data. Three weight measurements recorded at random times during 2018 would not be considered panel data.
When you have panel data, autocorrelation is the correlation between observations that are a fixed number of time steps apart. For example, if you have daily observations, then the 1-day autocorrelation is the correlation between observations 1 day apart, and likewise the 2-day autocorrelation is the correlation between observations 2 days apart.
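The definition of autocorrelation can be checked by hand. The following is a minimal sketch on a simulated series (the seed and the series are illustrative assumptions, not course data): it correlates the series with a shifted copy of itself and compares the result with what acf reports.

```r
# Illustrative series only: simulated AR(1) values, not course data
set.seed(1)
y <- as.numeric(arima.sim(model = list(ar = 0.5), n = 1000))
n <- length(y)

# Lag-1 autocorrelation: correlate each observation with the one before it
lag1 <- cor(y[-1], y[-n])
# Lag-2 autocorrelation: correlate observations two steps apart
lag2 <- cor(y[-(1:2)], y[-((n - 1):n)])

# acf() reports nearly the same quantities (element 1 of $acf is lag 0)
a <- acf(y, plot = FALSE)$acf
print(c(lag1 = lag1, acf_lag1 = a[2], lag2 = lag2, acf_lag2 = a[3]))
```

The two estimates differ slightly because acf uses the overall mean and variance of the series rather than those of the two shifted pieces.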
In this course we will consider 5 scenarios where we have multiple observations for each geographical area:
- Panel data: One geographical area, no autocorrelation
- Panel data: One geographical area, with autocorrelation
- Not panel data: Multiple geographical areas
- Panel data: Multiple geographical areas, no autocorrelation
- Panel data: Multiple geographical areas, with autocorrelation
Note, the following scenario can be covered by standard regression models:
- Multiple geographical areas, one time point/observation per geographical area
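The five scenarios differ in three respects: whether the data are panel data, how many geographical areas there are, and whether autocorrelation is present. As a rough sketch, this can be written as a small lookup function. The function identify_scenario and its argument names are hypothetical and are not part of the course code.

```r
# Hypothetical helper (not part of the course code): maps three answers to a scenario
identify_scenario <- function(is_panel, n_areas, has_autocorrelation = FALSE) {
  areas <- if (n_areas > 1) "Multiple geographical areas" else "One geographical area"
  if (!is_panel) {
    # Only the multi-area case is covered when the data are not panel data
    if (n_areas > 1) return("Not panel data: Multiple geographical areas")
    stop("Not covered by the five scenarios in this course")
  }
  ac <- if (has_autocorrelation) "with autocorrelation" else "no autocorrelation"
  paste0("Panel data: ", areas, ", ", ac)
}

identify_scenario(is_panel = TRUE, n_areas = 1, has_autocorrelation = TRUE)
identify_scenario(is_panel = FALSE, n_areas = 19)
```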
1.2 Method summary
1.2.1 Panel data: One geographical area, no autocorrelation
// STATA CODE
glm y yearminus2000 dailyrainfall cos365 sin365, family(poisson)
1.2.2 Panel data: One geographical area, with autocorrelation
// STATA CODE
glm y yearminus2000 cos365 sin365, family(poisson) vce(robust)
# R CODE
fit <- MASS::glmmPQL(y~yearMinus2000+sin365 + cos365, random = ~ 1 | ID,
                     family = poisson, data = d,
                     correlation=nlme::corAR1(form=~dayOfSeries|ID))
# Normalized residuals take the fitted correlation structure into account
r <- residuals(fit, type = "normalized")
pacf(r)
1.2.3 Not panel data: Multiple geographical areas
// STATA CODE
meglm y x yearMinus2000 || fylke:, family(poisson)
# R CODE
fit <- lme4::glmer(y~x + yearMinus2000 + (1|fylke),data=d,family=poisson())
1.2.4 Panel data: Multiple geographical areas, no autocorrelation
// STATA CODE
meglm y yearminus2000 cos365 sin365 || fylke:, family(poisson)
# R CODE
fit <- MASS::glmmPQL(y~yearMinus2000+sin365 + cos365, random = ~ 1 | fylke,
                     family = poisson, data = d)
# Check the residuals for remaining autocorrelation
r <- residuals(fit, type = "normalized")
pacf(r)
1.2.5 Panel data: Multiple geographical areas, with autocorrelation
// STATA CODE
meglm y yearminus2000 cos365 sin365 || fylke:, family(poisson) vce(robust)
# R CODE
fit <- MASS::glmmPQL(y~yearMinus2000+sin365 + cos365, random = ~ 1 | fylke,
                     family = poisson, data = d,
                     correlation=nlme::corAR1(form=~dayOfSeries|fylke))
# Normalized residuals take the fitted correlation structure into account
r <- residuals(fit, type = "normalized")
pacf(r)
1.3 Identifying your scenario
1.3.1 Step 1: Do you have panel data?
This step should be fairly simple. If your observations are equally spaced in time, you have panel data.
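One way to check for equal spacing is to look at the gaps between consecutive time points. The sketch below uses made-up dates and a hypothetical helper, is_equally_spaced, and assumes the dates come from a single geographical area.

```r
# Hypothetical example dates; replace with the date column of your own data
daily     <- as.Date("2018-01-01") + 0:9
irregular <- as.Date(c("2018-01-03", "2018-04-20", "2018-11-02"))

# Equally spaced observations have a single unique gap between time points
is_equally_spaced <- function(dates) {
  gaps <- diff(sort(dates))
  length(unique(as.numeric(gaps))) == 1
}

is_equally_spaced(daily)
is_equally_spaced(irregular)
```

For monthly or yearly data the number of days between observations varies, so it is safer to apply the same check to a period index (for example, month number since the start of the series) than to calendar dates.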
1.3.2 Step 2: Do you have multiple geographical areas?
Again, this is fairly simple: check whether your data contain more than one geographical area.
1.3.3 Step 3: Do you have autocorrelation?
First, fit a model that assumes there is no autocorrelation.
You then inspect the residuals from that model to see whether autocorrelation exists. This is done with two statistical procedures: pacf (for autoregressive models, the most common type of autocorrelation) and acf (for moving average models, a less common type of autocorrelation).
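The two-stage procedure can be sketched as follows. The data here are simulated purely for illustration (independent Poisson counts with a seasonal pattern); the column names mirror those used in the method summary, and the 95% bound is the one that pacf and acf draw by default.

```r
# Simulated data for illustration only; column names mirror the course code
set.seed(4)
n <- 730
d <- data.frame(dayOfSeries = 1:n)
d$sin365 <- sin(2 * pi * d$dayOfSeries / 365)
d$cos365 <- cos(2 * pi * d$dayOfSeries / 365)
d$yearMinus2000 <- 10 + (d$dayOfSeries - 1) %/% 365
# Independent Poisson counts: no autocorrelation is built in
d$y <- rpois(n, lambda = exp(1 + 0.3 * d$sin365 + 0.2 * d$cos365))

# Stage 1: fit the model as if there were no autocorrelation
fit0 <- glm(y ~ yearMinus2000 + sin365 + cos365, data = d, family = poisson())

# Stage 2: inspect the residuals
r <- residuals(fit0, type = "response")
p <- pacf(r, plot = FALSE)
a <- acf(r, plot = FALSE)

# Lags whose (partial) autocorrelation falls outside the approximate 95% bounds
bound <- qnorm(0.975) / sqrt(length(r))
which(abs(p$acf) > bound)
which(abs(a$acf[-1]) > bound)
```

Because the bounds are 95% limits, roughly one lag in twenty can exceed them by chance even when no autocorrelation is present, so isolated significant lines at high lags are usually not a concern.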
1.3.4 AR(1) data
y <- round(as.numeric(arima.sim(model=list("ar"=c(0.5)), rand.gen = rnorm, n=1000)))
With autoregressive data, a pacf plot contains a number of sharp significant lines, indicating how many preceding observations each observation is directly correlated with. That is, if one line is significant, each observation is directly correlated only with its preceding observation (AR(1)). If two lines are significant, each observation is directly correlated with its two preceding observations (AR(2)). The following plot represents AR(1) data.
pacf(y)
With autoregressive data, an acf plot contains a number of gradually decreasing lines. The following acf plot represents some sort of AR data. Note that the acf plot displays lag 0 (which is always 1 and can be ignored), while the pacf plot does not.
acf(y)
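The patterns in the two plots can be compared with the theoretical values for an AR(1) process. This sketch uses ARMAacf from the stats package with the same coefficient as the simulation; the variable names are my own.

```r
phi <- 0.5  # AR(1) coefficient used in the simulation above

# Theoretical ACF at lags 0 to 5: decays geometrically as phi^k
theo_acf <- ARMAacf(ar = phi, lag.max = 5)
# Theoretical PACF at lags 1 to 5: phi at lag 1, zero afterwards
theo_pacf <- ARMAacf(ar = phi, lag.max = 5, pacf = TRUE)

print(round(theo_acf, 4))
print(round(theo_pacf, 4))
```

The sample plots will not match these values exactly, both because of sampling variation and because the simulated series is rounded to integers.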
1.3.5 AR(2) data
y <- round(as.numeric(arima.sim(model=list("ar"=c(0.5,0.4)), rand.gen = rnorm, n=1000)))
The following pacf plot represents AR(2) data. This means that each observation is directly correlated with its two preceding observations.
pacf(y)
The following acf plot represents some sort of AR data:
acf(y)
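The number of significant lines can also be read off numerically instead of from the plot. A minimal sketch, assuming a fixed seed (my addition, so the numbers are reproducible) and the default 95% bounds:

```r
# Simulated AR(2) series, as in the example above but with a fixed seed
set.seed(2)
y <- round(as.numeric(arima.sim(model = list(ar = c(0.5, 0.4)), n = 1000)))

p <- pacf(y, plot = FALSE)
# Approximate 95% bounds, matching the dashed lines that pacf() draws by default
bound <- qnorm(0.975) / sqrt(length(y))

# Partial autocorrelations at lags 1 to 5, and whether each exceeds the bound
data.frame(lag = 1:5,
           pacf = round(p$acf[1:5], 3),
           significant = abs(p$acf[1:5]) > bound)
```

For AR(2) data the first two lags should clearly exceed the bound; later lags may occasionally do so by chance.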
1.3.6 MA(1) data
y <- round(as.numeric(arima.sim(model=list("ma"=c(0.9)), rand.gen = rnorm, n=1000)))
With moving average data, a pacf plot contains a number of decreasing lines. The following pacf plot represents some sort of MA data:
pacf(y)
With moving average data, an acf plot contains a number of sharp significant lines, indicating how many preceding observations each observation is correlated with. That is, if one line is significant, each observation is correlated only with its preceding observation. If two lines are significant, each observation is correlated with its two preceding observations. The following plot represents MA(1) data. Note that the acf plot displays lag 0 (which is always 1 and can be ignored), while the pacf plot does not.
acf(y)
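For comparison, the theoretical values for an MA(1) process with the same coefficient can be computed with ARMAacf. This is a sketch with variable names of my own; the closed form for the lag-1 autocorrelation, theta / (1 + theta^2), is the standard result for an MA(1) process.

```r
theta <- 0.9  # MA(1) coefficient used in the simulation above

# Theoretical ACF at lags 0 to 5: non-zero at lag 1 only
theo_acf <- ARMAacf(ma = theta, lag.max = 5)
# Closed form for the lag-1 autocorrelation of an MA(1) process
rho1 <- theta / (1 + theta^2)

# Theoretical PACF at lags 1 to 5: shrinks in size, alternating in sign
theo_pacf <- ARMAacf(ma = theta, lag.max = 5, pacf = TRUE)

print(round(theo_acf, 4))
print(round(rho1, 4))
print(round(theo_pacf, 4))
```

Note that the "decreasing lines" in the pacf plot of MA data decrease in absolute size; with a positive coefficient their signs alternate.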
1.3.7 MA(2) data
y <- round(as.numeric(arima.sim(model=list("ma"=c(0.9,0.6)), rand.gen = rnorm, n=1000)))
The following pacf plot represents some sort of MA data:
pacf(y)
The following acf plot represents MA(2) data. This means that each observation is correlated with its two preceding observations.
acf(y)