What Time-Series Forecasting Models Are There? Informer, a New Starting Point for Time-Series Models

抖帅宫 1033 2023-10-29


Source: Toutiao. Author: 无远不往

“Time-series analysis problems are everywhere in daily life. Time is continuous, and every second brings new changes.”

—

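AAAI 2021: Informer, a new time-series model

A recent paper from the AAAI 2021 conference, "Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting", brings new promise to time-series forecasting. A long sequence forecasting model needs two things: a strong ability to align long-range dependencies, and efficient handling of long sequence inputs and outputs.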

—

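Background and Related Problems

In recent years, the Transformer model proposed by Google in 2017 has handled long sequences far better than traditional RNN models such as GRU and LSTM. Its advantage is a short signal propagation path, which avoids the recurrent structure of RNN-family networks. However, the model consumes large amounts of GPU compute and server memory, so training it requires heavy hardware investment, and that makes it impractical for real-world long sequence forecasting. This efficiency cost is the bottleneck in applying the Transformer to the long sequence time-series forecasting (LSTF) problem. The paper's research question is: can Transformer models be improved to be computation, memory, and architecture efficient, as well as maintain higher prediction capacity?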

—

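Current Challenges and Solutions

The abstract first notes that long sequence forecasting is an important task that becomes harder as the horizon grows, with prediction accuracy gradually degrading. Effective long-horizon forecasting would be a major step forward for current research. For the currently popular Transformer model, the challenges and limitations can be summarized in three points: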

1. The quadratic computation of self-attention. The atom operation of self-attention mechanism, namely canonical dot-product, causes the time complexity and memory usage per layer to be $O(L^2)$.
2. The memory bottleneck in stacking layers for long inputs. The stack of J encoder/decoder layers makes total memory usage to be $O(J \cdot L^2)$, which limits the model scalability on receiving long sequence inputs.
3. The speed plunge in predicting long outputs. The dynamic decoding of vanilla Transformer makes the step-by-step inference as slow as RNN-based model.
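To make the first two limitations concrete, here is a small arithmetic sketch. It only counts attention-score entries (one per query-key pair) and ignores batch size, head count, and constant factors; the sequence lengths and the depth of three layers are hypothetical values chosen for illustration.

```python
def score_entries(seq_len: int, num_layers: int = 1) -> int:
    """Attention scores held by a stack: one per query-key pair, per layer."""
    return num_layers * seq_len * seq_len


for seq_len in (96, 192, 384):  # hypothetical input lengths
    one_layer = score_entries(seq_len)
    stack = score_entries(seq_len, num_layers=3)  # J = 3 is an assumed depth
    print(f"L={seq_len}: one layer {one_layer}, three layers {stack}")
```

Doubling the input length quadruples the count, which is why long inputs hit the memory limit well before they look long in absolute terms.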

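Recent work has focused mainly on the first problem, namely the computational complexity and memory usage of self-attention. To address all three problems together, the paper proposes a new forecasting model. Its main contributions are: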

1. We propose Informer to successfully enhance the prediction capacity in the LSTF problem, which validates the Transformer-like model's potential value to capture individual long-range dependency between long sequence time-series outputs and inputs.
2. We propose ProbSparse Self-attention mechanism to efficiently replace the canonical self-attention and it achieves the $O(L \log L)$ time complexity and $O(L \log L)$ memory usage.
3. We propose Self-attention Distilling operation to privilege dominating attention scores in J-stacking layers and sharply reduce the total space complexity to be $O((2 - \epsilon) L \log L)$.
4. We propose Generative Style Decoder to acquire long sequence output with only one forward step needed, simultaneously avoiding cumulative error spreading during the inference phase.

The proposed model framework is shown in Figure 1.

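Figure 1. Overview of the Informer model. On the left, the encoder receives massive long sequence inputs (green series). The canonical self-attention is replaced with the proposed ProbSparse self-attention. The blue trapezoid is the self-attention distilling operation, which extracts the dominating attention and sharply reduces the network size. The layer stacking replicas improve robustness. On the right, the decoder receives long sequence inputs, pads the target elements with zeros, measures the weighted attention composition of the feature map, and instantly predicts the output elements (orange series) in a generative style.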

—

Solution and Model Architecture

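The widely used way to compute self-attention takes the tuple (query, key, value) as input and computes a weighted value for each query. The weighted value for the i-th query is:

$$\mathcal{A}(q_i, K, V) = \sum_j \frac{k(q_i, k_j)}{\sum_l k(q_i, k_l)} v_j = \mathbb{E}_{p(k_j \mid q_i)}[v_j]$$

where $p(k_j \mid q_i) = k(q_i, k_j) / \sum_l k(q_i, k_l)$ and $k(q_i, k_j) = \exp(q_i k_j^\top / \sqrt{d})$.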
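The per-query weighted value described above can be sketched in plain Python as follows. This is a minimal illustration of scaled dot-product attention for a single query, not the paper's implementation; the function name `attend` and the toy keys and values are made up for the example.

```python
import math


def attend(query, keys, values):
    """Return (probabilities, output) for one query under scaled dot-product attention."""
    d = len(query)
    # One scaled dot product per key: q_i . k_j / sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(scores)  # subtract the max before exponentiating for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]  # attention probability over the keys
    dim_v = len(values[0])
    # Output is the probability-weighted average of the value vectors
    output = [sum(p * v[c] for p, v in zip(probs, values)) for c in range(dim_v)]
    return probs, output


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy keys (illustrative)
values = [[1.0], [2.0], [3.0]]               # toy values (illustrative)
probs, out = attend([1.0, 0.0], keys, values)
print(probs, out)
```

Every query touches every key here, so running this for all queries costs one dot product per query-key pair.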

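Here self-attention costs $O(L_Q L_K)$ memory and a quadratic number of dot products, which is the main drawback of the vanilla Transformer. The paper then evaluates this mechanism and finds that the self-attention scores are sparse and follow a long-tail distribution: a few dot-product pairs contribute most of the attention, while the others contribute very little and can be ignored. Distinguishing the dominant pairs is therefore crucial. The sparsity measurement for the i-th query is based on the KL divergence and is computed as:

$$M(q_i, K) = \ln \sum_{j=1}^{L_K} e^{\frac{q_i k_j^\top}{\sqrt{d}}} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}$$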

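where the first term is the Log-Sum-Exp (LSE) of $q_i$ over all keys, and the second term is their arithmetic mean. If the i-th query obtains a larger $M(q_i, K)$, its attention probability $p$ is more "diverse" and is likely to contain the dominant dot-product pairs in the head of the long-tail self-attention distribution.

Based on this measurement, ProbSparse self-attention lets each key attend only to the dominant queries:

$$\mathcal{A}(Q, K, V) = \mathrm{Softmax}\left(\frac{\bar{Q} K^\top}{\sqrt{d}}\right) V$$

where $\bar{Q}$ is a sparse matrix of the same size as $Q$ that contains only the Top-$u$ queries under the sparsity measurement $M(q, K)$. Controlled by a sampling factor $c$, we set $u = c \cdot \ln L_Q$, so ProbSparse self-attention only needs to compute $O(\ln L_Q)$ dot products for each query-key lookup, and the memory usage stays at $O(L_K \ln L_Q)$.

However, computing $M(q_i, K)$ for every query requires every dot-product pair, i.e. $O(L_Q L_K)$, and the LSE operation has a potential numerical stability issue. Motivated by this, the paper proposes an approximation of the query sparsity measurement, the max-mean measurement:

$$\bar{M}(q_i, K) = \max_j \left\{ \frac{q_i k_j^\top}{\sqrt{d}} \right\} - \frac{1}{L_K} \sum_{j=1}^{L_K} \frac{q_i k_j^\top}{\sqrt{d}}$$

The encoder architecture is shown in Figure 2.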

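Figure 2. The architecture of Informer's encoder. (1) Each horizontal stack stands for an individual encoder replica; (2) the upper stack is the main stack, which receives the whole input sequence, while the second stack takes half of the input; (3) the red layers are the dot-product matrices of the self-attention mechanism, and they shrink in cascade as self-attention distilling is applied on each layer; (4) the feature maps of the two stacks are concatenated as the encoder's output.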
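Returning to the query sparsity measurement discussed before Figure 2, the sketch below compares the LSE-based score with the max-mean approximation and then picks the Top-u queries. It is a simplified illustration: it scores each query against all keys rather than a sample of them, it rounds u up to an integer, and the function names and toy vectors are assumptions made for the example.

```python
import math


def scaled_scores(query, keys):
    """Scaled dot products q . k_j / sqrt(d) for one query against every key."""
    d = len(query)
    return [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]


def sparsity_lse(query, keys):
    """Log-sum-exp of the scores minus their arithmetic mean."""
    s = scaled_scores(query, keys)
    top = max(s)
    lse = top + math.log(sum(math.exp(x - top) for x in s))  # stable log-sum-exp
    return lse - sum(s) / len(s)


def sparsity_max_mean(query, keys):
    """Max-mean approximation: maximum score minus the arithmetic mean."""
    s = scaled_scores(query, keys)
    return max(s) - sum(s) / len(s)


def top_u_queries(queries, keys, c=1.0):
    """Indices of the u = ceil(c * ln L_Q) queries with the largest max-mean score."""
    u = min(len(queries), max(1, math.ceil(c * math.log(len(queries)))))
    ranked = sorted(range(len(queries)),
                    key=lambda i: sparsity_max_mean(queries[i], keys),
                    reverse=True)
    return sorted(ranked[:u])


keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]    # toy keys
queries = [[0.0, 0.0], [4.0, 0.0], [0.1, 0.1], [0.0, 3.0]]   # toy queries
print(top_u_queries(queries, keys))
```

Because a log-sum-exp always lies between the maximum and the maximum plus ln L_K, the max-mean score never exceeds the LSE-based score and falls short of it by at most ln L_K. It also needs no exponentials, which sidesteps the stability concern.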

4.1 Model Input

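Figure 3. The input representation of Informer. The input embedding consists of three separate parts: a scalar projection, the local time stamp (Position), and global time stamp embeddings (Minutes, Hours, Week, Month, Holiday, etc.).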
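As a rough illustration of the three parts named in the caption, the sketch below builds one embedded vector from a scalar value, its position in the window, and its calendar time stamp. Several things here are assumptions rather than details from the text: the three parts are combined by summation, the local position uses the standard sinusoidal encoding, random lookup tables stand in for whatever learned embeddings a real model would use, and holidays are omitted. All names (`D_MODEL`, `PROJ`, `STAMP_TABLES`, `embed`) are hypothetical.

```python
import math
import random
from datetime import datetime

D_MODEL = 8  # hypothetical embedding width


def positional_encoding(pos, d_model=D_MODEL):
    """Sinusoidal encoding of the local position inside the input window."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]


def make_table(size, seed):
    """Random stand-in for a learned lookup table with `size` rows."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(D_MODEL)] for _ in range(size)]


PROJ = make_table(1, seed=0)[0]  # stand-in weights for the scalar projection
STAMP_TABLES = {
    "minute": make_table(60, seed=1),
    "hour": make_table(24, seed=2),
    "weekday": make_table(7, seed=3),
    "month": make_table(12, seed=4),
}


def embed(value, pos, ts):
    """Sum of scalar projection, local position, and global time stamp parts."""
    vec = [value * w + p for w, p in zip(PROJ, positional_encoding(pos))]
    stamps = {"minute": ts.minute, "hour": ts.hour,
              "weekday": ts.weekday(), "month": ts.month - 1}
    for name, index in stamps.items():
        vec = [v + e for v, e in zip(vec, STAMP_TABLES[name][index])]
    return vec


a = embed(21.5, 0, datetime(2023, 10, 29, 8, 0))
b = embed(21.5, 0, datetime(2023, 10, 29, 9, 0))
print(a != b)  # same value and position, different hour
```

The point of the global part is visible in the last line: two inputs with the same value and the same window position still get different vectors when their clock time differs.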

4.2 Encoder

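The encoder is designed to extract robust long-range dependencies from long sequence inputs. Self-attention distilling: as a natural consequence of the ProbSparse self-attention mechanism, the encoder's feature map contains redundant combinations of the value V. The distilling operation privileges the superior combinations with dominating features and produces a focused self-attention feature map in the next layer. It sharply trims the time dimension of the input, as the n-heads weight matrices (overlapping red squares) of the attention blocks in Figure 2 show. The distilling procedure forwards from the j-th layer into the (j + 1)-th layer as

$$X^t_{j+1} = \mathrm{MaxPool}\left(\mathrm{ELU}\left(\mathrm{Conv1d}\left([X^t_j]_{AB}\right)\right)\right)$$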

where $[\cdot]_{AB}$ contains the Multi-head ProbSparse self-attention and the essential operations in the attention block, and Conv1d(·) performs a 1-D convolutional filter (kernel width = 3) on the time dimension with the ELU(·) activation function. We add a max-pooling layer with stride 2 and down-sample $X^t$ into its half slice after stacking a layer, which reduces the whole memory usage to be $O((2 - \epsilon) L \log L)$, where $\epsilon$ is a small number. To enhance the robustness of the distilling operation, we build halving replicas of the main stack and progressively decrease the number of self-attention distilling layers by dropping one layer at a time, like a pyramid in Fig. 2, such that their output dimension is aligned.
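The distilling step can be sketched on a single-channel sequence as follows. This is a simplified illustration that skips the attention block and works on plain lists. The kernel width of 3 and the stride of 2 come from the text above; the convolution weights, the zero padding, and the pooling window of 3 with padding 1 are assumptions made so that the length halves exactly.

```python
import math


def conv1d_same(x, kernel):
    """1-D convolution over time, zero-padded so the length is unchanged."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w * padded[t + i] for i, w in enumerate(kernel))
            for t in range(len(x))]


def elu(v, alpha=1.0):
    """ELU activation: identity for positive inputs, smooth saturation below zero."""
    return v if v > 0 else alpha * (math.exp(v) - 1.0)


def max_pool(x, size=3, stride=2, pad=1):
    """Max pooling over time; stride 2 halves the sequence length."""
    padded = [-math.inf] * pad + list(x) + [-math.inf] * pad
    return [max(padded[s:s + size])
            for s in range(0, len(padded) - size + 1, stride)]


def distill(x, kernel=(0.25, 0.5, 0.25)):  # illustrative weights, learned in practice
    """Conv1d (width 3) -> ELU -> MaxPool (stride 2) on one channel."""
    return max_pool([elu(v) for v in conv1d_same(x, kernel)])


series = [float(t % 5) for t in range(96)]  # toy input of length 96
once = distill(series)
twice = distill(once)
print(len(series), len(once), len(twice))
```

Halving the length at every layer means the per-layer costs form a geometric series, L + L/2 + L/4 + ..., which stays below 2L. That is consistent with the $(2 - \epsilon)$ factor in the memory bound quoted above.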

4.3 Decoder

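We use a standard decoder structure in Figure 1, composed of a stack of two identical multi-head attention layers. For long-horizon prediction, however, generative inference is used to alleviate the speed plunge. We feed the decoder the following vectors:

$$X^t_{\text{feed\_de}} = \mathrm{Concat}\left(X^t_{\text{token}}, X^t_0\right) \in \mathbb{R}^{(L_{\text{token}} + L_y) \times d_{\text{model}}}$$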

where $X^t_{\text{token}} \in \mathbb{R}^{L_{\text{token}} \times d_{\text{model}}}$ is the start token, and $X^t_0 \in \mathbb{R}^{L_y \times d_{\text{model}}}$ is a placeholder for the target sequence (set scalar as 0). Masked multi-head attention is applied in the ProbSparse self-attention computing by setting masked dot-products to $-\infty$.

4.4 Generative Inference

Start token is an efficient technique in NLP's "dynamic decoding", and we extend it into a generative way. Instead of choosing a specific flag as the token, we sample an $L_{\text{token}}$-long sequence in the input sequence, which is an earlier slice before the output sequence. Take predicting 168 points as an example (7-day temperature prediction) in Fig. 1(b): we will take the known 5 days before the target sequence as "start-token", and feed the generative-style inference decoder with $X_{\text{feed\_de}} = \{X_{5d}, X_0\}$. The $X_0$ contains the target sequence's time stamp, i.e. the context at the target week. Note that our proposed decoder predicts all the outputs by one forward procedure and is free from the time consuming "dynamic decoding" transaction in the trivial encoder-decoder architecture. A detailed performance comparison is given in the computation efficiency section.
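The decoder input for the 168-point example can be sketched as below. It assumes hourly sampling (168 points over 7 days), so the 5 known days become 120 points. It shows only the scalar channel: in the scheme described above, the zero placeholder would still be accompanied by the target time stamps. The names `build_decoder_input`, `label_len`, and `pred_len` are hypothetical.

```python
def build_decoder_input(history, label_len, pred_len):
    """Start token (last `label_len` known points) followed by `pred_len` zeros."""
    if label_len > len(history):
        raise ValueError("label_len cannot exceed the available history")
    start_token = list(history[-label_len:])  # earlier slice just before the target
    placeholder = [0.0] * pred_len            # target values are unknown, set to 0
    return start_token + placeholder


# Toy hourly temperature history covering 10 days (illustrative values)
history = [20.0 + 0.5 * (hour % 24) for hour in range(240)]
decoder_input = build_decoder_input(history, label_len=5 * 24, pred_len=7 * 24)
print(len(decoder_input))
```

All 168 target positions are present in the input at once, which is what lets the decoder fill them in a single forward pass instead of one step at a time.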

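That concludes this paper walkthrough. Thank you all! Next time we will continue to share academic papers, related intelligent algorithms, and legal knowledge.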