Our proposed SRAM quantifies uncertainty, thus allowing the model to say "I do not know" in scenarios beyond its handling capacity.

Existing Video Temporal Grounding (VTG) models excel in accuracy but often overlook open-world challenges posed by open-vocabulary queries and untrimmed videos. This leads to unreliable predictions on noisy, corrupted, and out-of-distribution data. Adapting VTG models to dynamically estimate uncertainties based on user input can address this issue. To this end, we introduce SRAM, a robust network module that benefits from a two-stage cross-modal alignment task. More importantly, it integrates Deep Evidential Regression (DER) to explicitly and thoroughly quantify uncertainty during training, thus allowing the model to say "I do not know" in scenarios beyond its handling capacity. However, the direct application of traditional DER theory and its regularizer reveals structural flaws, leading to unintended constraints in VTG tasks. In response, we develop a simple yet effective Geom-regularizer that enhances the uncertainty learning framework from the ground up. To the best of our knowledge, this marks the first successful application of DER to VTG. Our extensive quantitative and qualitative results affirm the effectiveness, robustness, and interpretability of our modules and the uncertainty learning paradigm in VTG tasks. The code will be made available.

Overall architecture of SRAM. In the first stage, an untrimmed video and a masked query are encoded with a frozen encoder, and SRAM reconstructs the masked query tokens. In the second stage, SRAM performs temporal grounding on the video using the user's complete query. SRAM includes RFF blocks, an evidential head, a VTG head, and a Masked Language Model (MLM) head. The MLM head, enclosed by the dashed box, is trained only during the first stage.

Temporal continuity in videos often causes adjacent frames to share similar semantics, complicating precise boundary delineation and introducing subjective biases in annotations. To mitigate this, we model semantic boundary moments using Gaussian distributions. Specifically, the start and end moments of a video-query pair \( (V, Q) \) are each governed by distinct Gaussian distributions. Observations of the same type (either all starts or all ends) are assumed to be i.i.d. Without loss of generality, we formulate this as follows:

\[ b \sim\mathcal{N}(\mu,\sigma^2), \]

where \( b \in \mathbb{R}^{1 \times \mathcal{H}}\) represents the start or end of a moment observed \(\mathcal{H}\) times. The mean \(\mu\) and variance \(\sigma^2\) of this Gaussian follow a Normal-Inverse-Gamma (\( NIG \)) prior:

\[ \begin{aligned} p(\mu,\sigma^2\mid\underbrace{\gamma,\upsilon,\alpha,\beta}_{\boldsymbol{\varphi }}) &= \mathcal{N}(\mu\mid\gamma,\sigma^2 \upsilon^{-1})\, \Gamma^{-1}(\sigma^2\mid\alpha,\beta) \\ &=\frac{\beta^\alpha\sqrt{\upsilon}}{\Gamma(\alpha)\sqrt{2\pi\sigma^2}}\left(\frac{1}{\sigma^2}\right)^{\alpha+1} \exp\left\{-\frac{2\beta+\upsilon(\gamma-\mu)^2}{2\sigma^2}\right\}. \end{aligned} \]

Here \(\boldsymbol{\varphi}=(\gamma, \upsilon, \alpha, \beta)\) are the parameters of the \( NIG \) prior. They are derived from the video content and the user query and serve as conditionals for the Gaussian estimate of each observation \( b_i \), with \(\gamma \in \mathbb{R}, \upsilon > 0, \alpha > 1, \beta > 0\). The gamma function is denoted by \( \Gamma(\cdot)\). We use a linear evidential predictor to estimate \(\boldsymbol{\varphi }\) and train it to maximize the likelihood. Marginalizing over \(\mu\) and \(\sigma^2\) gives the likelihood of \(b_i\):

\[ p(b_i \mid \boldsymbol{\varphi }) = \int_{\sigma^2=0}^\infty \int_{\mu=-\infty}^\infty p(b_i \mid \mu, \sigma^2)\, p(\mu, \sigma^2 \mid \boldsymbol{\varphi })\, d\mu\, d\sigma^2 = \mathrm{St}\left(b_i; \gamma, \frac{\beta(1 + \upsilon)}{\upsilon \alpha}, 2\alpha\right). \]

Since this likelihood takes the form of a Student-t distribution \((\mathrm{St})\), we minimize the negative log-likelihood (NLL):

\[ \mathcal{L}^{\mathrm{NLL}}_{i}=-\log p(b_i\mid\boldsymbol{\varphi })= -\log\left(\mathrm{St}\left(b_i;\gamma,\frac{\beta(1+\upsilon)}{\upsilon\alpha},2\alpha\right)\right). \]

The original DER also proposes a heuristic regularizer, which aims to mitigate overconfidence by suppressing evidence, particularly for samples with high error:

\[ \mathcal{L}^\mathrm{R}_i(\boldsymbol{\vartheta})=\Delta\cdot\Phi, \]

where \( \Delta = |b_i-\gamma| \) represents the error, \(\Phi = 2\upsilon+\alpha\) denotes the evidence, \(\boldsymbol{\vartheta}\) are the model parameters, and \(b_i\) is the ground truth.
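The two loss terms above can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' released code: the function names are hypothetical, and the second argument of \(\mathrm{St}\) is assumed to be a squared scale. Under that assumption, the closed-form NLL below follows from expanding the Student-t log-density, and the script compares it numerically against a direct evaluation.

```python
import math


def student_t_logpdf(b, loc, scale_sq, df):
    """Log-density of a location-scale Student-t; `scale_sq` is the squared scale."""
    z_sq = (b - loc) ** 2 / scale_sq
    return (
        math.lgamma((df + 1.0) / 2.0)
        - math.lgamma(df / 2.0)
        - 0.5 * math.log(df * math.pi * scale_sq)
        - (df + 1.0) / 2.0 * math.log1p(z_sq / df)
    )


def nig_nll(b, gamma, upsilon, alpha, beta):
    """Closed form of -log St(b; gamma, beta(1+upsilon)/(upsilon*alpha), 2*alpha)."""
    omega = 2.0 * beta * (1.0 + upsilon)
    return (
        0.5 * math.log(math.pi / upsilon)
        - alpha * math.log(omega)
        + (alpha + 0.5) * math.log(upsilon * (b - gamma) ** 2 + omega)
        + math.lgamma(alpha)
        - math.lgamma(alpha + 0.5)
    )


def vanilla_regularizer(b, gamma, upsilon, alpha):
    """Delta * Phi with Delta = |b - gamma| and Phi = 2*upsilon + alpha."""
    return abs(b - gamma) * (2.0 * upsilon + alpha)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    b, gamma, upsilon, alpha, beta = 0.42, 0.30, 1.5, 2.0, 0.8
    scale_sq = beta * (1.0 + upsilon) / (upsilon * alpha)
    direct = -student_t_logpdf(b, gamma, scale_sq, 2.0 * alpha)
    print("NLL (direct Student-t):", direct)
    print("NLL (closed form):     ", nig_nll(b, gamma, upsilon, alpha, beta))
    print("vanilla regularizer:   ", vanilla_regularizer(b, gamma, upsilon, alpha))
```

In a training setup these would typically be written with tensor operations and averaged over a batch; the scalar form is used here only to keep the correspondence with the equations easy to check.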

Unfortunately, the vanilla regularizer tends to excessively suppress the evidence, as shown in the visualized gradient field.
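One way to see this suppression without the figure is to differentiate \(\Delta\cdot\Phi\) by hand: with \(\Phi = 2\upsilon+\alpha\), the partial derivatives with respect to \(\upsilon\) and \(\alpha\) are \(2\Delta\) and \(\Delta\), both non-negative. The sketch below is an illustration under simplifying assumptions, not a reproduction of the paper's gradient field: it considers the regularizer in isolation (no NLL term, no parameter constraints), and the helper names are hypothetical.

```python
def vanilla_reg_grads(b, gamma, upsilon, alpha):
    """Analytic partial derivatives of Delta * (2*upsilon + alpha) w.r.t. upsilon and alpha."""
    delta = abs(b - gamma)
    # Neither derivative depends on upsilon or alpha; both are >= 0.
    return 2.0 * delta, delta


def descent_step(b, gamma, upsilon, alpha, lr=0.1):
    """One plain gradient-descent step on the regularizer alone (assumed setup)."""
    grad_upsilon, grad_alpha = vanilla_reg_grads(b, gamma, upsilon, alpha)
    return upsilon - lr * grad_upsilon, alpha - lr * grad_alpha


if __name__ == "__main__":
    upsilon, alpha = 1.5, 2.0
    for error in (0.01, 0.5):  # a small and a large error, illustrative values
        new_upsilon, new_alpha = descent_step(error, 0.0, upsilon, alpha)
        before = 2.0 * upsilon + alpha
        after = 2.0 * new_upsilon + new_alpha
        print(f"error={error}: evidence {before} -> {after}")
```

Because the gradient never changes sign, this term can only push evidence down, even for accurate predictions; any increase in evidence has to come from the NLL term.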

To overcome these limitations, we introduce Geom-regularization, promoting the principle that "accurate predictions should have high evidence, while inaccurate ones should have low evidence": \[ \mathcal{L}^i(\boldsymbol{w})=\|\overline{\Phi}+\overline{\Delta}-1\|^2_2. \]
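A per-sample sketch of this objective shows how it differs from the vanilla term. Note the assumptions: the text here does not define \(\overline{\Phi}\) and \(\overline{\Delta}\), so the code assumes they are the evidence and error rescaled to \([0, 1]\), and it treats the squared norm as a plain square of a scalar. The function names are hypothetical.

```python
def geom_regularizer(phi_bar, delta_bar):
    """(phi_bar + delta_bar - 1)^2 for one sample; inputs assumed to lie in [0, 1]."""
    return (phi_bar + delta_bar - 1.0) ** 2


def geom_grad_phi(phi_bar, delta_bar):
    """Partial derivative of the regularizer with respect to phi_bar."""
    return 2.0 * (phi_bar + delta_bar - 1.0)


if __name__ == "__main__":
    # Illustrative (phi_bar, delta_bar) pairs, not values from the paper.
    cases = {
        "accurate, low evidence": (0.2, 0.1),
        "inaccurate, high evidence": (0.8, 0.9),
        "on the line phi_bar + delta_bar = 1": (0.3, 0.7),
    }
    for name, (phi_bar, delta_bar) in cases.items():
        loss = geom_regularizer(phi_bar, delta_bar)
        grad = geom_grad_phi(phi_bar, delta_bar)
        print(f"{name}: loss={loss:.4f}, dL/dphi_bar={grad:+.2f}")
```

Under these assumptions the gradient with respect to \(\overline{\Phi}\) is negative when \(\overline{\Phi}+\overline{\Delta}<1\), so a descent step raises evidence for accurate predictions, and positive when \(\overline{\Phi}+\overline{\Delta}>1\), so evidence is lowered for inaccurate ones. The loss vanishes on the line \(\overline{\Phi}+\overline{\Delta}=1\).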

Please refer to our paper for more analysis.

```
@article{ma2024beyond,
title={Beyond Uncertainty: Evidential Deep Learning for Robust Video Temporal Grounding},
author={Ma, Kaijing and Huang, Haojian and Chen, Jin and Chen, Haodong and Ji, Pengliang and Zang, Xianghao and Fang, Han and Ban, Chao and Sun, Hao and Chen, Mulin and others},
journal={arXiv preprint arXiv:2408.16272},
year={2024}
}
```