
Online Data Organizer: Micro-video Categorization by Structure-guided Multimodal Dictionary Learning


Abstract

Micro-videos have rapidly become one of the most dominant trends on social media, which draws our attention to how to organize them. Distinct from traditional long videos, which may contain multiple scenes and tolerate delayed processing, a micro-video: 1) usually records content at one specific venue within a few seconds, where the venues are structured hierarchically by category granularity; this geo-nature of micro-videos makes it possible to organize them via their venue structure; and 2) demands timely propagation over social circles, so the timeliness of micro-videos calls for effective online processing. However, only 1.22% of micro-videos are labeled with venue information when uploaded from the mobile end. To address this problem, we present a framework to organize micro-videos online. In particular, we first build a structure-guided multi-modal dictionary learning model to learn concept-level micro-video representations by jointly considering the venue structure and the modality relatedness. We then develop an online learning algorithm to incrementally and efficiently strengthen our model and to categorize micro-videos into a tree structure. Experiments on a real-world dataset validate our model. In addition, we release our code to facilitate other researchers.


Pipeline

[Figure: overview of the proposed framework]

Algorithm

In this part, we present the INTIMATE algorithm. Algorithm 1 describes the main pipeline of our method in detail, and Algorithm 2 shows the proposed tree-guided multi-modal dictionary learning. All the equations referenced here can be found in our paper.

[Figure: pseudocode of Algorithm 1 and Algorithm 2]
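To make the learning procedure concrete, here is a minimal Python/NumPy sketch of multi-modal dictionary learning with a sparse code shared across modalities. It is an illustration under simplifying assumptions: the tree-guided group-sparsity regularizer over the venue hierarchy is replaced by a plain l1 penalty, and all names, dimensions, and parameter values are hypothetical. The released Matlab code contains the actual INTIMATE implementation.

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def multimodal_dictionary_learning(X_modalities, n_atoms=64, lam=0.1,
                                   n_iters=30, seed=0):
    """Alternate between sparse coding (ISTA) and dictionary updates.

    X_modalities: list of (d_m, n) arrays, one per modality (e.g. visual,
    acoustic, textual). All modalities share one sparse code A, which plays
    the role of the concept-level representation. NOTE: the paper's
    tree-guided group-sparsity regularizer is simplified here to a plain
    l1 penalty; this is an illustrative sketch, not the released code.
    """
    rng = np.random.default_rng(seed)
    n = X_modalities[0].shape[1]
    # One dictionary per modality, random unit-norm atoms to start.
    Ds = [rng.standard_normal((X.shape[0], n_atoms)) for X in X_modalities]
    Ds = [D / np.linalg.norm(D, axis=0, keepdims=True) for D in Ds]
    A = np.zeros((n_atoms, n))

    for _ in range(n_iters):
        # --- Sparse coding step: ISTA on the shared code A. ---
        G = sum(D.T @ D for D in Ds)          # combined Gram matrix
        step = 1.0 / np.linalg.norm(G, 2)     # 1 / Lipschitz constant
        for _ in range(10):
            grad = G @ A - sum(D.T @ X for D, X in zip(Ds, X_modalities))
            A = soft_threshold(A - step * grad, step * lam)
        # --- Dictionary step: per-modality least-squares update. ---
        pinv = np.linalg.pinv(A)
        for m, X in enumerate(X_modalities):
            D = X @ pinv
            Ds[m] = D / np.maximum(np.linalg.norm(D, axis=0, keepdims=True), 1e-8)
    return Ds, A

# Toy usage with three hypothetical modalities and 200 samples.
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = [rng.standard_normal((d, 200)) for d in (128, 64, 100)]
    Ds, A = multimodal_dictionary_learning(X)
    print([D.shape for D in Ds], A.shape)
```

The alternating structure (a sparse coding step followed by per-modality dictionary updates) mirrors standard dictionary learning; the tree-guided variant in the paper changes the regularizer and, accordingly, its proximal operator.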

Dataset

In our work, we use the dataset introduced in the paper "Shorter-is-Better: Venue Category Estimation from Micro-Video". The authors crawled micro-videos from Vine through its public API (https://github.com/davoclavo/vinepy). In particular, they first manually chose a small set of active Vine users as seed users. They then adopted a breadth-first strategy to expand the user set by gathering the users' followers, terminating the expansion after three layers. For each collected user, they crawled his/her published videos, video descriptions, and venue information, if available. In this way, they harvested 2 million micro-videos, of which only about 24,000 contain Foursquare check-in information. After removing duplicate venue IDs, they further expanded the video set by crawling all videos under each venue ID with the help of the vinepy API. This eventually yielded a dataset of 276,264 videos distributed over 442 Foursquare venue categories. Each venue ID was mapped to a venue category via the Foursquare API (https://developer.foursquare.com/categorytree), which serves as the ground truth. Notably, 99.8% of the videos are shorter than 7 seconds.
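For illustration, the three-layer breadth-first expansion described above could be implemented along the following lines. This is a hedged sketch: `get_followers` is a hypothetical placeholder for the real Vine API call (see vinepy for the actual interface), not a function reproduced from their crawler.

```python
from collections import deque

def expand_users_bfs(seed_users, get_followers, max_layers=3):
    """Breadth-first expansion of the seed user set, terminated after
    three layers, mirroring the crawling strategy described above.

    get_followers: caller-supplied function mapping a user id to an
    iterable of follower ids (a hypothetical stand-in for the real
    Vine API call exposed by vinepy).
    """
    visited = set(seed_users)
    frontier = deque((u, 0) for u in seed_users)
    while frontier:
        user, layer = frontier.popleft()
        if layer >= max_layers:
            continue  # stop expanding beyond the third layer
        for follower in get_followers(user):
            if follower not in visited:
                visited.add(follower)
                frontier.append((follower, layer + 1))
    return visited
```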


We adopt 10-fold cross-validation to test the effectiveness of our proposed model. For each fold, we randomly select 5,396 videos as offline data for the tree-guided multi-modal dictionary learning, 10,807 videos as online learning data for the online dictionary update, and 2,170 videos as the test set.
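A minimal NumPy sketch of one such fold is shown below, using the sizes quoted above. The assumption that the three subsets exactly partition an 18,373-video pool (5,396 + 10,807 + 2,170) is ours; the text does not state how the pool is drawn from the full dataset.

```python
import numpy as np

def split_fold(n_videos, rng):
    """Randomly split one fold into offline / online / test subsets,
    using the sizes reported in the text."""
    idx = rng.permutation(n_videos)
    offline = idx[:5396]                          # tree-guided dictionary learning
    online = idx[5396:5396 + 10807]               # online dictionary update
    test = idx[5396 + 10807:5396 + 10807 + 2170]  # held-out evaluation
    return offline, online, test

# One fold; assumes the pool is exactly the union of the three subsets.
rng = np.random.default_rng(42)
offline, online, test = split_fold(5396 + 10807 + 2170, rng)
```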


Download

Our code is available here:

    Matlab version: INTIMATE.rar

    Dataset: Compelted_dataset.rar (we have filled in the missing values of the raw features)

The code of the baselines is available here:

Contact