Martingale Doppelgänger-Eval: An Identification Framework for Auditing Candlestick Understanding in Vision-Language Models
Ziyao Wang
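AI summary: Proposes the Martingale Doppelgänger-Eval benchmark, which uses controlled experiments to identify whether VLMs base their judgments on candlestick evidence rather than on trend extrapolation, and finds that models either ignore candlestick semantics or respond opposite to them.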
We introduce Martingale Doppelgänger-Eval, a public shadow-market benchmark for auditing whether vision-language models (VLMs) use candlestick evidence rather than extrapolate past trends. The central difficulty is identification: on real market histories, chart evidence and trend are strongly coupled, so an observational score cannot determine whether a fluent technical-analysis narrative is grounded in local visual evidence. We prove this limitation formally: no evaluation functional computed from observational chart–label data can distinguish a grounded responder from a trend-shortcut responder under strong coupling, whereas matched evidence interventions separate the same responders at an exponential rate and trend–label swaps provide an independent shortcut stress test. The benchmark therefore evaluates frozen VLMs on rendered OHLCV charts under four controlled mechanisms: a martingale-null market, injected-alpha counterfactual pairs, trend-confounder swaps, and regime shifts. A structural behavioral model identifies null-market bias, trend sensitivity, evidence sensitivity, prompt/renderer fragility, and evidence faithfulness; the accompanying statistical toolkit provides minimum detectable effects, block-aware sequential testing for metered APIs, and an overlap-weighted artifact check. Across frozen commercial and open VLMs, the identified regression assigns large positive coefficients to past trend but evidence coefficients that are zero or opposite to the rule-implied sign. Matched-pair analyses show that models either ignore injected candlestick semantics or move opposite to the rule-implied direction conditional on responding. The benchmark isolates a failure mode that standard observational chart benchmarks cannot detect and gives a reusable audit template for time-series imagery with controllable label mechanisms.
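The identification argument in the abstract can be illustrated with a toy simulation. The sketch below is not the benchmark's code: the two responders, the ±1 encoding of trend and evidence, and all function names are hypothetical, and perfect coupling (`coupling=1.0`) stands in for the "strong coupling" regime. It shows only that a grounded responder and a trend-shortcut responder agree on every observational sample when evidence always matches trend, while a matched pair that flips evidence with trend held fixed separates them.

```python
import random

# Hypothetical responders: each maps (trend, evidence) signs in {-1, +1} to a call.
def grounded(trend, evidence):
    """Follows the candlestick evidence sign and ignores the trend."""
    return evidence

def trend_shortcut(trend, evidence):
    """Extrapolates the past trend sign and ignores the evidence."""
    return trend

def observational_sample(rng, n, coupling=1.0):
    """Draw (trend, evidence) rows; evidence equals trend with probability `coupling`."""
    rows = []
    for _ in range(n):
        trend = rng.choice([-1, 1])
        evidence = trend if rng.random() < coupling else -trend
        rows.append((trend, evidence))
    return rows

def matched_pairs(rng, n):
    """For each drawn trend, emit both evidence signs so only the evidence varies."""
    pairs = []
    for _ in range(n):
        trend = rng.choice([-1, 1])
        pairs.append(((trend, 1), (trend, -1)))
    return pairs

def agreement(responder_a, responder_b, rows):
    """Fraction of rows on which the two responders give the same call."""
    same = sum(responder_a(t, e) == responder_b(t, e) for t, e in rows)
    return same / len(rows)

def evidence_effect(responder, pairs):
    """Mean change in the call when evidence flips from -1 to +1, trend held fixed."""
    total = sum(responder(*up) - responder(*down) for up, down in pairs)
    return total / len(pairs)

rng = random.Random(0)
obs = observational_sample(rng, 1000, coupling=1.0)  # assumption: perfect coupling
pairs = matched_pairs(rng, 1000)

obs_agreement = agreement(grounded, trend_shortcut, obs)
effect_grounded = evidence_effect(grounded, pairs)
effect_shortcut = evidence_effect(trend_shortcut, pairs)

print(obs_agreement, effect_grounded, effect_shortcut)
```

Under these assumptions, any score computed from the observational rows is identical for the two responders, whereas the matched-pair evidence effect is nonzero only for the grounded one. Lowering `coupling` below 1.0 lets the observational data separate them again, which is why the strength of the coupling matters for the impossibility claim.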