Abstract
Similarity search is a fundamental operation in time series analysis. Most existing techniques, however, require users to supply a precise sequence of values (typically an entire time series object) as the query input. This rigid requirement limits real-world applications, where users instead want to express patterns, trends, or value ranges. Flexible, pattern-based search has been explored in text retrieval and complex event processing, but remains underexplored for large-scale distributed time series.
To close this gap, we propose TSseek, a regular-expression-powered search framework for distributed time series datasets. TSseek's query language enables users to compose patterns encompassing trends, value ranges, and wildcard segments. We show that conventional approximation techniques (e.g., PAA and SAX) and their index structures are ill-suited for such queries because they cannot operate on regular-expression query constructs.
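As a rough illustration of the kind of pattern such a query language targets, the sketch below encodes a series as a string of trend symbols and matches it with Python's standard `re` module. The symbols `U`, `D`, and `F`, the helper `trend_string`, and the pattern syntax are assumptions made for this example; they are not TSseek's actual query language, and the sketch covers only trends and wildcards, not value ranges.

```python
import re


def trend_string(values, eps=0.0):
    """Encode each consecutive step as 'U' (up), 'D' (down), or 'F' (flat).

    Hypothetical encoding for illustration only.
    """
    symbols = []
    for prev, cur in zip(values, values[1:]):
        delta = cur - prev
        if delta > eps:
            symbols.append("U")
        elif delta < -eps:
            symbols.append("D")
        else:
            symbols.append("F")
    return "".join(symbols)


series = [1.0, 2.0, 3.5, 3.5, 2.0, 1.0, 4.0]
encoded = trend_string(series)

# Assumed pattern: one or more rises, any single step (wildcard),
# then one or more falls.
pattern = re.compile(r"U+.D+")

# Does the pattern describe the entire series?
whole_match = pattern.fullmatch(encoded) is not None

# Does the pattern occur anywhere inside the series?
sub_match = pattern.search(encoded)
```

A plain symbol string like this discards the value information, which is one reason a richer representation is needed once queries also constrain value ranges.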
In TSseek, we map the time series objects and the query constructs into the same space by approximating time series objects as sequences of line segments that retain both trend (slope direction) and value range, and translating query constructs into bounding rectangles. To support efficient processing, we build TSseek-X, a distributed spatial index over the time series segments. TSseek supports two fundamental query types, namely whole-matching queries (over entire series) and subsequence-matching queries (over arbitrary windows within a series).
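The mapping described above can be sketched in a few lines, under several assumptions that the abstract does not confirm: segments use fixed-width windows that share endpoints, a segment's trend is the sign of its end-to-end change, and a query construct is a rectangle reduced here to a value interval plus an optional trend. All names (`Segment`, `Construct`, `approximate`, `whole_match`, `subsequence_offsets`) are hypothetical, and the linear scan stands in for the distributed index, which is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    trend: int    # +1 rising, -1 falling, 0 flat
    vmin: float   # lowest value covered by the segment
    vmax: float   # highest value covered by the segment


@dataclass(frozen=True)
class Construct:
    trend: Optional[int]  # None means any trend is accepted
    lo: float             # lower edge of the rectangle's value extent
    hi: float             # upper edge of the rectangle's value extent


def approximate(values: List[float], width: int) -> List[Segment]:
    """Cut the series into windows of `width` steps that share endpoints."""
    segments = []
    for start in range(0, len(values) - 1, width):
        window = values[start:start + width + 1]
        delta = window[-1] - window[0]
        trend = (delta > 0) - (delta < 0)  # sign of the end-to-end change
        segments.append(Segment(trend, min(window), max(window)))
    return segments


def matches(seg: Segment, c: Construct) -> bool:
    """A segment matches when its trend agrees and its values fit the rectangle."""
    trend_ok = c.trend is None or c.trend == seg.trend
    return trend_ok and c.lo <= seg.vmin and seg.vmax <= c.hi


def whole_match(segs: List[Segment], query: List[Construct]) -> bool:
    """The query must cover the entire segment sequence."""
    return len(segs) == len(query) and all(map(matches, segs, query))


def subsequence_offsets(segs: List[Segment], query: List[Construct]) -> List[int]:
    """Return every segment offset at which the query matches a window."""
    n, m = len(segs), len(query)
    return [i for i in range(n - m + 1)
            if all(map(matches, segs[i:i + m], query))]


series = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0]
segs = approximate(series, width=3)

rise_then_fall = [Construct(+1, 0.0, 5.0), Construct(-1, 0.0, 5.0)]
is_whole = whole_match(segs, rise_then_fall)

falling_only = [Construct(-1, 0.0, 5.0)]
offsets = subsequence_offsets(segs, falling_only)
```

Because both the data and the query end up as trend-plus-interval records, the same containment test serves whole matching and subsequence matching; a spatial index would replace the scan in `subsequence_offsets` with a lookup over the rectangles.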
Across benchmark and real-world datasets, full-scan, model-based, and SAX-based baselines each sacrifice either accuracy or speed, whereas TSseek returns exact answers efficiently. For subsequence workloads, TSseek also achieves significant speedups over state-of-the-art subsequence matching engines.