Building on the PIs' extensive experience in concept-based video indexing and cross-modal zero-example search, this project aims to close the gap between the current state of the art in video recounting and in video captioning, and to apply the result to web video surveillance. An example search scenario is finding videos of "A man shouting while holding a flag", where the expected result recounts each candidate video in natural-language (English and Chinese) sentences covering who (person name), what (audio-visual objects such as shouting and a flag), and where and when (the location and time of the event). The associated challenges fall into three areas. Attention: how to dynamically select query-relevant fragments from a long video for recounting. Captioning: how to generate sentences that explain the query and contrast the visual content of the retrieved candidates by filling the sentences with named entities. Indexing: what processing is required to enable real-time, interactive, large-scale video search. The academic value of this proposal lies in equipping video recounting with query-aware captioning, a new topic not previously addressed in the literature. The proposal also has significant translational value: by generating textual snippets that recount the relevancy and diversity of videos, it shortens the time needed to filter false alarms in forensic and web monitoring applications. A system prototype will be built to demonstrate the proposed work for web surveillance of online videos.
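This project will build a new query-aware video recounting model. The model analyzes, extracts, and presents the specific semantic needs of a user query, generates multilingual (Chinese and English) video-text snippets containing 4W (who, what, where, when) details, and effectively increases the relevancy and diversity of the presented video results. Drawing on the research team's long-standing work in video retrieval, semantic indexing, caption extraction, video summarization, and interactive retrieval, the project will ultimately produce a real-time, runnable video retrieval prototype system. Its scientific value lies in filling the gap between traditional video content analysis and video recounting, so that the related research forms a complete closed loop of semantic indexing, video query, result presentation, and user interaction. Its application value is that the project's outcomes will effectively reduce the false alarm rate in video retrieval and web video surveillance systems and improve the efficiency of retrieval and filtering.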
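This project built a query-aware framework for recounting multimodal data. The framework works on two fronts: attention and interpretability. In contrast to conventional attention mechanisms, which are driven by loss optimization and learn the weights directly, we design the attention distribution function from the query object and let loss optimization learn that function's parameters. This approach turns expert knowledge about the query into a formal functional expression, thereby injecting that knowledge, and it avoids the randomness of learning attention weights directly. In contrast to the qualitative approach of conventional methods, which explain a result by where the resulting attention falls on the target object, we propose a knowledge-transformation method that converts attention into decision trees (forests), producing logical structures that human experts can read directly. Combining the two mechanisms significantly improved both the performance and the practicality of the framework.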
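The idea of learning the parameters of an attention function, rather than one free weight per position, can be illustrated with a minimal sketch. The project summary does not specify the form of the function, so the Gaussian shape below is an assumption chosen only for illustration, and every name (`gaussian_attention`, `center`, `width`, `attend`) is hypothetical. In a trained model, `center` and `width` would presumably be derived from the query and tuned by the loss; here they are fixed by hand.

```python
import math

def gaussian_attention(num_segments, center, width):
    """Attention weights over video segments from a parametric function.

    Assumption for illustration: the attention distribution is a Gaussian
    bump over segment indices. Only `center` and `width` would be learned,
    instead of one independent weight per segment.
    """
    scores = [-((i - center) ** 2) / (2.0 * width ** 2)
              for i in range(num_segments)]
    # Softmax over the scores, shifted by the maximum for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(segment_features, weights):
    """Weighted sum of per-segment feature vectors."""
    dim = len(segment_features[0])
    return [sum(w * f[d] for w, f in zip(weights, segment_features))
            for d in range(dim)]

# Toy input: five segments, each with a two-dimensional feature vector.
weights = gaussian_attention(num_segments=5, center=2.0, width=1.0)
features = [[float(i), 1.0] for i in range(5)]
summary = attend(features, weights)
```

Because the weights come from a two-parameter function, they are constrained to a single smooth peak, which is one way prior knowledge about a query could shape the attention and limit the variability of freely learned weights.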
Data last updated: 2023-05-31
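High-resolution three-dimensional imaging for wideband MIMO radar based on Kronecker compressed sensing
Perceived environmental dynamism and entrepreneurial team innovation: a view based on team members' uncertainty-reduction motivation
A study of the effects of instructional video playback speed and difficulty on learning
The effect of teachers' gestures on video learning and its cognitive neural mechanisms
A route recommendation algorithm for vacant taxis based on passenger waiting-point planning
Research on video adaptation based on semantic analysis and visual attention
Research on content-based indexing and query extraction methods for medical PACS images
Multi-cue video semantic extraction based on cognitive computational models and film theory
Human motion feature extraction and semantic modeling of motion processes based on a video fluid model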