Abstract: Videos contain multimodal content, and exploring multi-branch cross-modal interactions with natural language queries can benefit text-video retrieval (TVR). However, recent ...