I want to create a code for plotting the ACF and PACF from time-series data, like the plots generated by Minitab (below).
I have tried searching for the formula, but I still don't understand it well. Would you mind telling me the formula and how to use it? What are the horizontal red lines on the ACF and PACF plots above, and what is their formula?
Thanks,
Answers:
Autocorrelation

The correlation between two variables $X$ and $Y$ is defined as:

$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y},$$

where $E$ is the expectation operator, $\mu_X$ and $\mu_Y$ are the means of $X$ and $Y$, respectively, and $\sigma_X$ and $\sigma_Y$ are their standard deviations.

In the context of a single variable (i.e. autocorrelation), $X$ is the original series and $Y$ is a lagged version of it. Given the definition above, the sample autocorrelation of order $k$ can be obtained for an observed series $y_1, y_2, \dots, y_n$ by computing the following expression:

$$r_k = \frac{\sum_{t=k+1}^{n} (y_t - \bar{y})(y_{t-k} - \bar{y})}{\sum_{t=1}^{n} (y_t - \bar{y})^2},$$

where $\bar{y}$ is the sample mean of the data.
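As an illustration (not part of the original answer), the expression above translates directly into NumPy; the function name `sample_acf` is my own:

```python
import numpy as np

def sample_acf(y, k):
    """Sample autocorrelation of order k: sum of lagged cross products of
    the demeaned series, divided by its total sum of squares."""
    y = np.asarray(y, dtype=float)
    yd = y - y.mean()  # y_t - ybar
    num = np.sum(yd[k:] * yd[:-k]) if k > 0 else np.sum(yd * yd)
    den = np.sum(yd * yd)
    return num / den

# quick check on a small periodic series
y = [2.0, 4.0, 6.0, 4.0, 2.0, 4.0, 6.0, 4.0]
print(sample_acf(y, 0))  # 1.0 -- lag 0 is always 1
print(sample_acf(y, 2))  # -0.75 for this toy series
```

Note the sum in the denominator runs over all $n$ terms while the numerator only has $n-k$; this is the conventional (biased) estimator that `acf()` in R also uses.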
Partial autocorrelation

The partial autocorrelation measures the linear dependence of one variable after removing the effect of the other variable(s) that affect both variables. For example, the partial autocorrelation of order 2 measures the effect (linear dependence) of $y_{t-2}$ on $y_t$ after removing the effect of $y_{t-1}$ on both $y_t$ and $y_{t-2}$.

Each partial autocorrelation can be obtained as a series of regressions of the form:

$$\tilde{y}_t = \phi_{21} \tilde{y}_{t-1} + \phi_{22} \tilde{y}_{t-2} + e_t,$$

where $\tilde{y}_t$ is the original series minus the sample mean, $y_t - \bar{y}$. The estimate of $\phi_{22}$ gives the value of the partial autocorrelation of order 2. Extending the regression with additional lags, the estimate of the last term $\phi_{kk}$ gives the partial autocorrelation of order $k$.
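As a concrete sketch of this regression approach (my own illustration; the helper name `pacf_by_regression` is hypothetical), regress the demeaned series on its first $k$ lags without an intercept and keep the last coefficient:

```python
import numpy as np

def pacf_by_regression(y, k):
    """Partial autocorrelation of order k via OLS: regress the demeaned
    series on its first k lags (no intercept) and return the last slope."""
    y = np.asarray(y, dtype=float)
    yd = y - y.mean()
    n = len(yd)
    # design matrix: column j holds the series lagged by j + 1 positions
    X = np.column_stack([yd[k - j - 1 : n - j - 1] for j in range(k)])
    target = yd[k:]  # y_t for t = k .. n-1
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coefs[-1]  # estimate of phi_kk
```

For $k = 1$ this reduces to the ordinary through-the-origin slope of $\tilde{y}_t$ on $\tilde{y}_{t-1}$, so the order-1 partial autocorrelation coincides (up to the end-effect terms) with the lag-1 autocorrelation.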
An alternative way to compute the sample partial autocorrelations is by solving, for each order $k$, the following system:

$$\begin{pmatrix} r_0 & r_1 & \cdots & r_{k-1} \\ r_1 & r_0 & \cdots & r_{k-2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{k-1} & r_{k-2} & \cdots & r_0 \end{pmatrix} \begin{pmatrix} \phi_{k1} \\ \phi_{k2} \\ \vdots \\ \phi_{kk} \end{pmatrix} = \begin{pmatrix} r_1 \\ r_2 \\ \vdots \\ r_k \end{pmatrix},$$

where $r_i$ are the sample autocorrelations (with $r_0 = 1$). This mapping between the sample autocorrelations and the sample partial autocorrelations is known as the Durbin-Levinson recursion. This approach is relatively easy to implement for illustration purposes. For example, in the R software, we can obtain the partial autocorrelation of order 5 as follows:
# sample data
x <- diff(AirPassengers)
# autocorrelations
sacf <- acf(x, lag.max = 10, plot = FALSE)$acf[,,1]
# solve the system of equations
res1 <- solve(toeplitz(sacf[1:5]), sacf[2:6])
res1
# [1] 0.29992688 -0.18784728 -0.08468517 -0.22463189 0.01008379
# benchmark result
res2 <- pacf(x, lag.max = 5, plot = FALSE)$acf[,,1]
res2
# [1] 0.30285526 -0.21344644 -0.16044680 -0.22163003 0.01008379
all.equal(res1[5], res2[5])
# [1] TRUE
Confidence bands

The confidence bands can be computed as the values of the sample autocorrelations $\pm \frac{z_{1-\alpha/2}}{\sqrt{n}}$, where $z_{1-\alpha/2}$ is the $1-\alpha/2$ quantile of the standard normal distribution, e.g. $z_{0.975} \approx 1.96$ for 95% bands.
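A minimal sketch of this computation in Python (assuming the usual large-sample white-noise approximation, with the 95% quantile 1.96 hard-coded):

```python
import math

def acf_confidence_band(n, z=1.96):
    """Half-width of the approximate 95% confidence band for the sample
    autocorrelations of a white-noise series of length n."""
    return z / math.sqrt(n)

# For n = 100 observations the band is +/- 1.96 / 10 = +/- 0.196
print(acf_confidence_band(100))  # 0.196
```

These are the horizontal red lines the question asks about: a spike outside the band is (approximately) significantly different from zero at the 5% level.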
"I want to create a code for plotting ACF and PACF from time-series data".
Although the OP is a bit vague, it is possibly aimed more at a "recipe"-style coding formulation than at a linear-algebra model formulation.
The ACF is rather straightforward: we have a time series, and basically make multiple "copies" (as in "copy and paste") of it, understanding that each copy is going to be offset by one entry from the prior copy, because the initial data contains $n$ data points, while the shifted copy that excludes the last data point has length only $n - 1$. We can make virtually as many copies as there are rows. Each copy is correlated to the original, keeping in mind that we need identical lengths, and to this end, we'll have to keep on clipping the tail end of the initial data series to make them comparable. For instance, to correlate the initial data to the copy lagged by $k$, we'll need to get rid of the last $k$ data points of the original time series (the first $k$ chronologically).
Example:
We'll concoct a time series with a cyclical sine pattern superimposed on a trend line, plus noise, and plot the R-generated ACF. I got this example from an online post by Christoph Scherber, and just added the noise to it:
x = seq(pi, 10 * pi, 0.1)
y = 0.1 * x + sin(x) + rnorm(length(x))
y = ts(y, start = 1800)
Ordinarily we would have to test the data for stationarity (or just look at the plot above), but we know there is a trend in it, so let's skip this part, and go directly to the de-trending step:
model=lm(y ~ I(1801:2083))
st.y = y - predict(model)
Now we are ready to tackle this time series by first generating the ACF with the acf()
function in R, and then comparing the results to the makeshift loop I put together:
ACF = 0 # Starting an empty vector to capture the auto-correlations.
ACF[1] = cor(st.y, st.y) # The first entry in the ACF is the correlation with itself (1).
for(i in 1:30){ # Took 30 points to parallel the output of `acf()`
lag = st.y[-c(1:i)] # Introducing lags in the stationary ts.
clipped.y = st.y[1:length(lag)] # Compensating by reducing length of ts.
ACF[i + 1] = cor(clipped.y, lag) # Storing each correlation.
}
acf(st.y) # Plotting the built-in function (left)
plot(ACF, type="h", main="ACF Manual calculation"); abline(h = 0) # and my results (right).
OK. That was successful. On to the PACF. Much more tricky to hack... The idea here is to again clone the initial ts a bunch of times, and then select multiple time points. However, instead of just correlating with the initial time series, we put together all the lags in-between, and perform a regression analysis, so that the variance explained by the previous time points can be excluded (controlled). For example, if we are focusing on the PACF at lag 4, we keep $y_t$, $y_{t-1}$, $y_{t-2}$ and $y_{t-3}$, as well as $y_{t-4}$, and we regress $y_t$ through the origin on all of them, keeping only the coefficient for $y_{t-4}$:
PACF = 0                        # Starting up an empty storage vector.
for(j in 2:25){                 # Picked 25 lag points to parallel R `pacf()` output.
  cols = j
  rows = length(st.y) - j + 1   # To end up with equal-length vectors we clip.
  lag = matrix(0, rows, j)      # The storage matrix for the groups of lagged vectors.
  for(i in 1:cols){
    lag[, i] = st.y[i:(i + rows - 1)]  # Clipping progressively to get lagged ts's.
  }
  lag = as.data.frame(lag)
  fit = lm(lag$V1 ~ . - 1, data = lag) # Running an OLS through the origin for every group.
  PACF[j] = coef(fit)[j - 1]           # Getting the slope for the last lagged ts.
}
And finally plotting again side-by-side, R-generated and manual calculations:
That the idea is correct, besides probable computational issues, can be seen by comparing PACF to pacf(st.y, plot = FALSE).
code here.
Well, in practice we encounter error (noise), and the confidence bands help you figure out whether a spike can be regarded as mere noise (because about 95% of the time a pure-noise spike will fall inside the bands).
Here is Python code to compute the ACF:

import numpy as np

def shift(x, b):
    # Shift the series forward by b positions, zero-padding the start.
    x = np.asarray(x, dtype=float)
    if b <= 0:
        return x.copy()
    d = np.zeros_like(x)
    d[b:] = x[:-b]
    return d

# One way of doing it using bare bones numpy:
# divide by the first element to normalize, because corr(x, x) = 1.
x = np.arange(0, 10)
xo = x - x.mean()
cors = np.array([np.correlate(xo, shift(xo, i))[0] for i in range(len(x))])
print(cors / cors[0])

# -- Here is another way - again divide by the first element to normalize.
n = len(x)
cors = np.correlate(xo, xo, 'full')[n - 1:]
print(cors / cors[0])
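To complement the ACF code above, the PACF can be obtained the same way the earlier R example did: solve the Yule-Walker (Durbin-Levinson) system built from the sample autocorrelations and keep the last coefficient. A NumPy-only sketch (the function names are my own):

```python
import numpy as np

def sample_acf_vec(y, max_lag):
    """Sample autocorrelations r_1 .. r_max_lag of a series."""
    y = np.asarray(y, dtype=float)
    yd = y - y.mean()
    den = np.sum(yd * yd)
    return np.array([np.sum(yd[k:] * yd[:-k]) / den
                     for k in range(1, max_lag + 1)])

def pacf_yule_walker(y, k):
    """PACF of order k: solve the k x k Toeplitz system of sample
    autocorrelations (r_0 = 1 on the diagonal) and return phi_kk."""
    r = sample_acf_vec(y, k)
    # Toeplitz matrix with 1 on the diagonal and r_1 .. r_{k-1} off it
    R = np.array([[1.0 if i == j else r[abs(i - j) - 1] for j in range(k)]
                  for i in range(k)])
    phi = np.linalg.solve(R, r)
    return phi[-1]

y = np.array([2.0, 4.0, 6.0, 4.0, 2.0, 4.0, 6.0, 4.0])
print(pacf_yule_walker(y, 2))  # -0.75 for this toy series
```

This mirrors the R call `solve(toeplitz(sacf[1:5]), sacf[2:6])` from the first answer, only with the Toeplitz matrix built by hand to avoid a SciPy dependency.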