The Chinese University of Hong Kong
Among the numerous videos shared on the web, well-edited ones tend to attract more attention.
However, it is difficult for inexperienced users to make well-edited videos,
because doing so requires professional expertise and considerable manual labor.
To meet the demands of non-experts, we present Transcript-to-Video -- a weakly-supervised framework
that uses text as input to automatically create video sequences from an extensive collection of shots.
Specifically, we propose a Content Retrieval Module and a Temporal Coherent Module to
learn visual-language representations and model shot sequencing styles, respectively.
For fast inference, we introduce an efficient search strategy for real-time video clip sequencing.
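As a rough illustration of this retrieve-then-sequence idea (not the authors' released implementation), the sketch below assumes precomputed sentence and shot embeddings together with a hypothetical pairwise shot-coherence matrix, and greedily selects one content-relevant, temporally coherent shot per transcript sentence:

```python
import numpy as np

def sequence_shots(sentence_embs, shot_embs, coherence, top_k=5):
    """Hypothetical two-stage inference sketch.

    Stage 1: for each transcript sentence, retrieve the top-k shots whose
    visual embeddings are most similar to the sentence embedding.
    Stage 2: greedily pick one candidate per sentence, favoring candidates
    that are both content-relevant and coherent with the previous shot.
    """
    # Normalize embeddings so dot products act as cosine similarities.
    s = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    v = shot_embs / np.linalg.norm(shot_embs, axis=1, keepdims=True)
    relevance = s @ v.T  # (num_sentences, num_shots)

    sequence, prev = [], None
    for scores in relevance:
        candidates = np.argsort(scores)[-top_k:]  # Stage 1: retrieval
        if prev is None:
            best = candidates[np.argmax(scores[candidates])]
        else:
            # Stage 2: combine relevance with a pairwise coherence score
            # (coherence is an assumed (num_shots, num_shots) matrix).
            combined = scores[candidates] + coherence[prev, candidates]
            best = candidates[np.argmax(combined)]
        sequence.append(int(best))
        prev = best
    return sequence
```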
Quantitative results and user studies demonstrate empirically that the proposed learning framework
retrieves content-relevant shots and creates stylistically plausible video sequences.
In addition, run-time performance analysis shows that our framework can support real-world applications.
We will release our code and models.