1. Data Inspection

From year 1966 to year 1975, the number of boston armed robberies is increasing overtime. Though the number flucuates, the general trend is increasing. The data does not show evident seasonal component, which requires further investigation. The data shows equal variance.

2. Data Transformation.

The data shows an increasing trend. Therefore the data is not stationary. The differenced data shows that there is fluctuation, and the fluctuation becomes more and more intense overtime. Also, the differenced data shows unequal variance.

The sqrt, log and \(Y^{1/3}\) transformed data show an upward trend. 1/Y transformed shows a downward trend. The log transformed data shows equal variance. I would choose log transformed data for the remaining steps.

3. Preliminary Guess for ARIMA model

Sample ACF plot shows that the transformed data are not stationary. However, sample acf is decreasing with larger lag. Sample pacf plot shows that, though pacf with h = 1 is greater than 0.2, with h greater than 1, pacf values drop lower than 0.2 and become stable. The cut-off point for pacf is 1. The pattern looks like a AR process. I think the differenced data can provide more evidence about which model to use, ARIMA or ARMA.

The plots of raw and smoothed periodogram both show that the domain of the frequency is very low. There is basically no periodic component in the log transformed data.

Now let’s look at the differenced logY. The differenced data seems like stationary, which indicates that ARIMA model is better than ARMA. In the ACF plot, the cut off point is 1, which indicated that q = 1. Based on the pacf plot of logY, p = 1, q = 1 with differencing might be appropriate parameter values for the ARIMA model.

Based on those plots, I think arima(1,1,1) might be an appropriate model.

4. Fit Preliminary model

The fitted model is ARIMA(1,1,1) with drift, \(\psi_1 = 0.3699\) and \(\theta_1 = -0.6814\).

\[Y_t = \triangledown Xt\] \[Y_t - 0.3699Y_{t-1} = \varepsilon_t -0.6814\varepsilon_{t-1}\] and \(\{\varepsilon_t\}\) is white noise.

From the plot of fitted vs. observed logY, I can see that the fitted values and the observed values are very close. Arima(1,1,1) model fits the observed values.

## 
## Call:
## arima(x = BAR$logy, order = c(1, 1, 1))
## 
## Coefficients:
##          ar1      ma1
##       0.3699  -0.6814
## s.e.  0.1747   0.1324
## 
## sigma^2 estimated as 0.03851:  log likelihood = 24.42,  aic = -42.83

5. Residual Examination

To test whether the residuals of the Arima(1,1,1) model is iid, I performed a Box-Ljung test. The result of Box-Ljung test shows that the null hypothesis of iid variance cannot be rejected (p = 0.4216 > 0.05). The residuals are iid distributed. The histogram and qq plot of the residuals shows that the residuals are normally distributed with equal variance. The acf and pacf plot of the residuals also shows that the residuals are stationary. The residuals resemble Gaussian white noise.

## 
##  Box-Ljung test
## 
## data:  fit1$residuals
## X-squared = 10.218, df = 10, p-value = 0.4216

6. Model Selection

The model is selected using auto.arima() with AICc criterion. The selected model is ARIMA(1,1,1) with drift. The paramter estimates are \(\psi_1 = 0.5608\), \(\theta_1 = -0.9190\), drift = 0.0212. The standard errors of the parameter estimates are \(s.e\{\psi_1\} =0.1300\), \(s.e\{\theta_1\} = 0.0786\), \(s.e\{drift\} = 0.0036\)

\[Y_t = \triangledown Xt\] \[Y_t - 0.5608Y_{t-1} = \varepsilon_t -0.9190\varepsilon_{t-1}\] and \(\{\varepsilon_t\}\) is white noise.

## Series: BAR$logy 
## ARIMA(1,1,1) with drift         
## 
## Coefficients:
##          ar1      ma1   drift
##       0.5608  -0.9190  0.0212
## s.e.  0.1300   0.0786  0.0036
## 
## sigma^2 estimated as 0.03663:  log likelihood=28.57
## AIC=-49.13   AICc=-48.78   BIC=-38.08

7. Inspect the Residuals of Selected Model

As arima(1,1,1) model is selected based on AICc criterion, I would like to test whether the residuals of the Arima(1,1,1) model is iid. Box-Ljung test is performed and the result shows that the null hypothesis of iid variance cannot be rejected (p = 0.6041 > 0.05). The residuals are iid distributed.

The histogram and qq plot of the residuals shows that the residuals are normally distributed with equal variance. The acf and pacf plot of the residuals also shows that the residuals are stationary. The residuals resemble Gaussian white noise.

## 
##  Box-Ljung test
## 
## data:  fit2$residuals
## X-squared = 8.2532, df = 10, p-value = 0.6041

8. The (theoretical) Spectral Density of the Selected Model

The theoretical spectral density of model arma(1,1) shows that the high frequencies have larger spectrum. The spectrum approaches 0.05 as frequency becomes larger. There is basically no periodic component in theoretical spectral density. The theoretical spectrum density looks similar to the smooth periodogram of the differenced data, which indicates that the arima (1,1,1) model is appropriate.

The smoothing method here is modified Daniell’s method. The span value L is selected by using criterion Q. The equation for Q is : \[Q = \sum(pgrm.smooth - pgrm.raw) ^ 2 + \sum\frac{pgrm.raw^2}{L-1}\] where pgrm.smooth is the smoothed periodogram with span L. When Q has the minimum, L is the best span. L is 39 for the differenced y and 9 for the log transformed y.

9. Model Forecasting

Model Selection

The model in this part is fitted using the first 108 observations after log transformed. The chosen model is arima(0,1,2), \(\theta_1 = -0.3774\), \(\theta_2 = -0.2596\) and drift = 0.022.

\[Y_t = \triangledown Xt\] \[Y_t = \varepsilon_t -0.3774\varepsilon_{t-1} -0.2596\varepsilon_{t-2}\] and \(\{\varepsilon_t\}\) is white noise.

## Series: BAR1$logy 
## ARIMA(0,1,2) with drift         
## 
## Coefficients:
##           ma1      ma2  drift
##       -0.3774  -0.2596  0.022
## s.e.   0.0945   0.1025  0.007
## 
## sigma^2 estimated as 0.0388:  log likelihood=23.31
## AIC=-38.62   AICc=-38.23   BIC=-27.93

Residual Examination

To evaluate the residuals of arima(0,1,2) model, the Box-Ljung test is performed to test the null hypothesis that the residuals are iid distributed. The result of Box-Ljung test shows that the null hypothesis cannot be rejected (p = 0.8258 > 0.05). The residuals of arima(0,1,2) model are iid distributed.

The histogram and qq plot of the residuals shows that the residuals are normally distributed with equal variance. The acf and pacf plot of the residuals also shows that the residuals are stationary. The residuals resemble Gaussian white noise.

## 
##  Box-Ljung test
## 
## data:  fit3$residuals
## X-squared = 5.8734, df = 10, p-value = 0.8258

Model Forecasting

The Points Forecasts from Jan 1975 to Oct 1975 (continued below)
Month Jan Feb Mar Aprl May June July Aug
PointsForecasts 414.65 434.01 443.66 453.52 463.61 473.91 484.45 495.22
Month Sept Oct
PointsForecasts 506.23 517.48

The plot of the observed vs. forecasted shows that the observed values are within the 95% prediction interval. Some points forecast are smaller than the observed data. Overal, the model is able to forecast the data.

Appendix: R Code

knitr::opts_chunk$set(echo = TRUE)
BAR = read.table("~/Google Drive/Winter 2017/STA 137/project/bostonArmedRobberies.txt")
names(BAR) = c("time",'y')
library(forecast)
library(tseries)
library(pander)
library(astsa)

BAR$index = c(1:118)
series = c(0:9)*12+0.5
quartz()
plot(BAR$y ~BAR$index, type = 'l', xaxt ='n',
     main = 'Boston Armed Robberies', 
     ylab = 'Number of Robberies', xlab = 'Time')
axis(side = 1, at = series , labels = FALSE)
text(series, labels = c('1966','1967','1968','1969','1970',
                        '1971','1972','1973','1974','1975'), 
     par("usr")[3]-20, srt = 45, pos = 1, 
     xpd = TRUE)  
BAR$logy = log(BAR$y)
par(mfrow = c(2,2))
acf(BAR$logy, lag.max = 40); pacf(BAR$logy, lag.max = 40)



n = 118
m = floor(n/2)
# get the raw periodogram values at the Fourier frequencies
pgrm.raw = spec.pgram(BAR$logy, plot=F, log = 'no')$spec

# vector of candidate L values for smoothing
spans = (1:(m-1))*2+1

# vector to store criterion values for each L
Q = numeric(length(spans))

# go through the L values and compute Q for each
for(j in 1:length(spans)){
  L = spans[j]
  pgrm.smooth = spec.pgram(BAR$logy, spans=L,
                           plot=F, log = 'no')$spec
  Q[j] = sum((pgrm.smooth - pgrm.raw) ^ 2) + sum((pgrm.raw)^2)/(L-1)
}

# plot the values
#plot(x=spans, y=Q, type='b')

# figure out which L is best
L1 = spans[which.min(Q)]
spec.pgram(BAR$logy, log = 'no'); spec.pgram(BAR$logy, spans=L1, 
                                             main = paste0('Span ', L1), log = 'no')

par(mfrow = c(2,2))
acf(diff(BAR$logy), lag.max = 40); pacf(diff(BAR$logy), lag.max = 40)


n = 118
m = floor(n/2)
# get the raw periodogram values at the Fourier frequencies
pgrm.raw = spec.pgram(diff(BAR$logy), plot=F, log = 'no')$spec

# vector of candidate L values for smoothing
spans = (1:(m-1))*2+1

# vector to store criterion values for each L
Q = numeric(length(spans))

# go through the L values and compute Q for each
for(j in 1:length(spans)){
  L = spans[j]
  pgrm.smooth = spec.pgram(diff(BAR$logy), spans=L, plot=F, log = 'no')$spec
  Q[j] = sum((pgrm.smooth - pgrm.raw) ^ 2) + sum((pgrm.raw)^2)/(L-1)
}

# plot the values
#plot(x=spans, y=Q, type='b')

# figure out which L is best
L2 = spans[which.min(Q)]

spec.pgram(diff(BAR$logy), log = 'no'); spec.pgram(diff(BAR$logy), log = 'no', 
                                                   spans = L2 , main = paste0('Span ', L2))

fit1 = arima(BAR$logy, order = c(1,1,1))
fit1
quartz()
plot((BAR$logy - fit1$residuals), ylab = 'logY',
     main = 'Fitted and Observed Values')
points(BAR$logy,type = 'l', lty = 2)
legend('bottomright', c('fitted','observed'), 
       lty = c(1, 2))

Box.test(fit1$residuals, lag=10, type="Ljung")
par(mfrow = c(2,2))
ts.plot(fit1$residuals, ylab = 'residuals', main='Residual ts plot')
hist(fit1$residuals, main = "Histogram of Residuals")
qqnorm(fit1$residuals);qqline(fit1$residuals)
Acf(fit1$residuals); Pacf(fit1$residuals) 

fit2 = auto.arima(BAR$logy,max.p = 8, max.q = 8, max.d = 2, ic ='aicc')
fit2

Box.test(fit2$residuals, lag=10, type="Ljung")
par(mfrow = c(2,2))
ts.plot(fit2$residuals, ylab = 'residuals', main='Residual ts plot')
hist(fit2$residuals, main = "Histogram of Residuals")
qqnorm(fit2$residuals);qqline(fit2$residuals)
Acf(fit2$residuals); Pacf(fit2$residuals) 


par(mfrow = c(2,2))

spec.pgram(BAR$logy, spans=L1, log = 'no')
spec.pgram(diff(BAR$logy), spans=L2, log = 'no')
arma.spec(ar=c(0.56079663), ma=c(-0.91897129), var.noise = 0.03663,
          main='Spectral Density', log = 'no')

BAR1 = BAR[1:108,]
fit3 = auto.arima(BAR1$logy, ic ='aicc')
fit3

Box.test(fit3$residuals, lag=10, type="Ljung")
par(mfrow = c(2,2))
ts.plot(fit3$residuals, ylab = 'residuals', main='Residual ts plot')
hist(fit3$residuals,main = "Histogram of Residuals")
qqnorm(fit3$residuals);qqline(fit3$residuals)
Acf(fit3$residuals); Pacf(fit3$residuals) 

Xf = forecast(fit3, h=10)
pander(rbind( Month = c('Jan','Feb','Mar','Aprl','May','June',
                                      'July','Aug','Sept','Oct'),
        PointsForecasts =round(exp(Xf$mean[1:10]), 2)), 
       caption = "The Points Forecasts from Jan 1975 to Oct 1975")

n = nrow(BAR1)
quartz()
par(mfrow = c(1,1))
plot(x = 0, y = 0,type = 'l', ylab =  'Y' , xaxt ='n',
     xlab ='Time' , main =  'Observed vs. Forecast' , 
     ylim = c(0,900), xlim = c(1,120))
axis(side = 1, at = series , labels = FALSE)
text(series, labels = c('1966','1967','1968','1969','1970',
                        '1971','1972','1973','1974','1975'), 
     par("usr")[3]-20, srt = 45, pos = 1, xpd = TRUE)
polygon(c(109:118, 118:109), 
        c(exp(Xf$upper[,2]), exp(Xf$lower[,2])), col = 'gray', border = NA)
points(x = 109:118, exp(Xf$mean), 
       type = 'l' , col =  'blue', lty = 2)
points(x = 1:118, BAR$y, type = 'l')
legend( 'bottomright' , bty = "n",
        c( 'observed' , 'forecast','95% C.I'), 
        col = c( 'black' ,'blue', NA),
        fill = c(NA, NA, 'gray'), 
        lty = c(1,2), cex = 0.7,
        border = c(NA,NA,"gray"),
        x.intersp=c(2,2,0.5))


plot(x = 1:10, BAR$y[109:118], type = 'l',ylab =  'Y' ,xaxt ='n',
     xlab ='Month of 1975' , main =  'Observed vs. Forecast' , ylim = c(0,950))
axis(side = 1, at = 1:10 , labels = c('Jan','Feb','Mar','Aprl','May','June',
                                      'July','Aug','Sept','Oct'))
polygon(c(1:10, 10:1), c(exp(Xf$upper[,2]), exp(Xf$lower[,2])), 
        col = 'gray', border = NA)
points(x = 1:10, exp(Xf$mean), 
       type = 'l' , col =  'blue', lty=2)
points(x = 1:10, BAR$y[109:118], type = 'l')
legend( 'bottomright' , bty = "n",
        c( 'observed' , 'forecast','95% C.I'), 
        col = c( 'black' ,'blue', NA),
        fill = c(NA, NA, 'gray'), 
        lty = c(1,2), cex = 0.7,
        border = c(NA,NA,"gray"),
        x.intersp=c(2,2,0.5))