Automatic Music Video Generation Based on Simultaneous Soundtrack Recommendation and Video Editing

doi:10.1145/3123266.3123399

Title:	Automatic Music Video Generation Based on Simultaneous Soundtrack Recommendation and Video Editing
Authors:	Lin, Jen-Chun; Wei, Wen-Li; Yang, James; Wang, Hsin-Min; Liao, Hong-Yuan Mark; National Chiao Tung University (published under the NCTU name)
Keywords:	Automatic music video generation; cross-modal media retrieval; deep neural networks
Issue Date:	1-Jan-2017
Abstract:	An automated process that can suggest a soundtrack for a user-generated video (UGV) and turn the UGV into a music-compliant, professional-like video is challenging but desirable. To this end, this paper presents an automatic music video (MV) generation system that conducts soundtrack recommendation and video editing simultaneously. First, a long UGV is divided into a sequence of fixed-length short segments (e.g., 2 seconds), and a multi-task deep neural network (MDNN) is applied to predict the pseudo acoustic (music) features, called the pseudo song, from the visual (video) features of each video segment. In this way, the distance between any pair of video and music segments of the same length can be computed in the music feature space. Second, the sequence of pseudo acoustic (music) features of the UGV and the sequence of acoustic (music) features of each music track in the music collection are temporally aligned by the dynamic time warping (DTW) algorithm with a pseudo-song-based deep similarity matching (PDSM) metric. Third, for each music track, the video editing module selects and concatenates segments of the UGV according to the DTW-aligned result, using the target and concatenation costs given by a pseudo-song-based deep concatenation cost (PDCC) metric, to generate a music-compliant, professional-like video. Finally, all the generated MVs are ranked, and the best MV is recommended to the user. The MDNN for pseudo song prediction and the PDSM and PDCC metrics are trained on an annotated official music video (OMV) corpus. The results of objective and subjective experiments demonstrate that the proposed system performs well and can generate appealing MVs with better viewing and listening experiences.
URI:	http://dx.doi.org/10.1145/3123266.3123399 http://hdl.handle.net/11536/152906
ISBN:	978-1-4503-4906-2
DOI:	10.1145/3123266.3123399
Journal:	PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17)
Start Page:	519
End Page:	527
Appears in Collections:	Conference Papers
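
The alignment step described in the abstract can be illustrated with a minimal sketch of classic dynamic time warping between a sequence of per-segment pseudo-song feature vectors and a sequence of music feature vectors. This is not the authors' implementation: the function names (`dtw_align`, `euclidean`) are hypothetical, the feature vectors are toy values, and plain Euclidean distance is used only as a stand-in for the learned PDSM metric, which the record does not specify.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    """Placeholder distance; the paper uses a learned PDSM metric instead."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(
    video: Sequence[Vector],
    music: Sequence[Vector],
    dist: Callable[[Vector, Vector], float] = euclidean,
) -> Tuple[float, List[Tuple[int, int]]]:
    """Align two feature sequences with classic DTW.

    Returns the accumulated alignment cost and the warping path as a list
    of (video_index, music_index) pairs.
    """
    n, m = len(video), len(music)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")

    # cost[i][j] = best accumulated cost aligning video[:i] with music[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(video[i - 1], music[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j - 1],  # advance both sequences
                cost[i - 1][j],      # advance video only
                cost[i][j - 1],      # advance music only
            )

    # Backtrack from the end to recover the warping path.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates, key=lambda c: c[0])
    path.reverse()
    return cost[n][m], path


# Toy one-dimensional features: three video segments, four music segments.
pseudo_song = [[0.0], [1.0], [2.0]]
music_track = [[0.0], [1.0], [1.0], [2.0]]
total_cost, warping_path = dtw_align(pseudo_song, music_track)
```

Under these assumptions, running `dtw_align` against every track in a collection and comparing the returned costs would be one simple way to rank candidate soundtracks; the ranking criterion actually used by the system is not stated in this record.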