Online Data Organizer: Micro-video Categorization by Structure-guided Multimodal Dictionary Learning
Abstract
Micro-videos have quickly become one of the most dominant trends in social media. Accordingly, how to organize them draws our attention. Distinct from traditional long videos, which may span multiple scenes and tolerate delays, a micro-video: 1) usually records content at one specific venue within a few seconds, where the venues are structured hierarchically according to their category granularity. This geographic nature of micro-videos makes it possible to organize them via their venue structure. 2) demands timely propagation over social circles, so the timeliness of micro-videos calls for effective online processing. However, only 1.22% of micro-videos are labeled with venue information when uploaded from the mobile end. To address this problem, we present a framework to organize micro-videos online. In particular, we first build a structure-guided multi-modal dictionary learning model to learn concept-level micro-video representations by jointly considering the venue structure and modality relatedness. We then develop an online learning algorithm to incrementally and efficiently strengthen our model, as well as categorize micro-videos into a tree structure. Experiments on a real-world dataset validate our model. In addition, we release our code to facilitate other researchers.
Pipeline
Algorithm
In this part, we present the INTIMATE algorithm. Algorithm 1 details the main pipeline of our method, and Algorithm 2 shows the proposed tree-guided multi-modal dictionary learning. All the equations referenced here can be found in our paper. A simplified sketch of the alternating optimization is given below.
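The following is a minimal Python sketch of the kind of alternating optimization used in multi-modal dictionary learning: one dictionary per modality with a shared sparse code matrix, alternating a sparse-coding step and a dictionary-update step. It replaces the paper's tree-guided regularizer with a plain l1 penalty and uses illustrative parameter names; it is not the released MATLAB implementation, only an assumption-laden illustration of the general structure of Algorithm 2.

```python
import numpy as np

def soft_threshold(x, t):
    """Element-wise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def multimodal_dictionary_learning(X, n_atoms=256, lam=0.1, n_iter=50, lr=0.01, seed=0):
    """Learn one dictionary per modality with shared sparse codes A.

    X : dict mapping modality name -> (d_m x n) feature matrix.
    NOTE: a plain l1 penalty stands in for the paper's tree-guided
    sparsity; this is a simplified sketch, not the released code.
    """
    rng = np.random.default_rng(seed)
    n = next(iter(X.values())).shape[1]
    D = {m: rng.standard_normal((Xm.shape[0], n_atoms)) for m, Xm in X.items()}
    D = {m: Dm / np.linalg.norm(Dm, axis=0, keepdims=True) for m, Dm in D.items()}
    A = np.zeros((n_atoms, n))

    for _ in range(n_iter):
        # Sparse-coding step: one ISTA update on the shared codes A.
        grad = sum(Dm.T @ (Dm @ A - X[m]) for m, Dm in D.items())
        step = 1.0 / sum(np.linalg.norm(Dm, 2) ** 2 for Dm in D.values())
        A = soft_threshold(A - step * grad, lam * step)

        # Dictionary-update step: gradient step plus column normalization.
        for m, Dm in D.items():
            Dm -= lr * (Dm @ A - X[m]) @ A.T
            Dm /= np.maximum(np.linalg.norm(Dm, axis=0, keepdims=True), 1e-8)
    return D, A
```

The alternation (fix the dictionaries to update the codes, then fix the codes to update each modality's dictionary) mirrors the overall structure described for Algorithm 2; the specific regularizer and update rules in the paper differ.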
Dataset
In our work, we use the dataset introduced in the paper "Shorter-is-Better: Venue Category Estimation from Micro-Video". The authors crawled micro-videos from Vine through its public API (https://github.com/davoclavo/vinepy). In particular, they first manually chose a small set of active users from Vine as their seed users. They then adopted a breadth-first strategy to expand the user set by gathering these users' followers, terminating the expansion after three layers. For each collected user, they crawled his/her published videos, video descriptions, and venue information if available. In this way, they harvested 2 million micro-videos. Among them, only about 24,000 micro-videos contain Foursquare check-in information. After removing duplicate venue IDs, they further expanded the video set by crawling all videos under each venue ID with the help of the vinepy API. This eventually yielded a dataset of 276,264 videos distributed over 442 Foursquare venue categories. Each venue ID was mapped to a venue category via the Foursquare API (https://developer.foursquare.com/categorytree), which serves as the ground truth. Notably, 99.8% of the videos are shorter than 7 seconds.
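For concreteness, the three-layer breadth-first user expansion described above could be implemented along the following lines. This is only an illustrative sketch: `get_followers` is a hypothetical placeholder for the real API client call, not an actual vinepy function.

```python
from collections import deque

def expand_users(seed_users, get_followers, max_depth=3):
    """Breadth-first expansion of the user set, stopping after max_depth layers.

    seed_users    : iterable of user ids
    get_followers : callable user_id -> list of follower ids
                    (hypothetical placeholder for the real API client)
    """
    visited = set(seed_users)
    frontier = deque((u, 0) for u in seed_users)
    while frontier:
        user, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for follower in get_followers(user):
            if follower not in visited:
                visited.add(follower)
                frontier.append((follower, depth + 1))
    return visited
```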
We adopt a 10-fold validation scheme to test the effectiveness of our proposed model. For each fold, we randomly select 5,396 videos as offline data for the tree-guided multi-modal dictionary learning, 10,807 videos as online learning data for the online dictionary update, and 2,170 videos as the test set.
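A minimal sketch of how one such fold could be drawn is shown below, using the subset sizes stated above over the full 276,264-video collection. The random sampling scheme shown here is an assumption for illustration; only the sizes come from our setup.

```python
import numpy as np

def make_fold_split(n_videos, n_offline=5396, n_online=10807, n_test=2170, seed=0):
    """Randomly split video indices into offline, online, and test subsets.

    Subset sizes follow the experimental setup described above; the
    sampling scheme itself is illustrative.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_videos)
    offline = perm[:n_offline]
    online = perm[n_offline:n_offline + n_online]
    test = perm[n_offline + n_online:n_offline + n_online + n_test]
    return offline, online, test

# One fold over the full collection of 276,264 videos:
offline_idx, online_idx, test_idx = make_fold_split(276264, seed=1)
```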
Download
Our code is available here:
Matlab version: INTIMATE.rar
Dataset: Compelted_dataset.rar
We have filled in the missing data in the raw features.
The code for the baselines is available here: