NUS-MSS Citation

A. Farseev, N. Liqiang, M. Akbari, and T.-S. Chua. Harvesting multiple sources for user profile learning: a Big data study. ACM International Conference on Multimedia Retrieval (ICMR). China. June 23-26, 2015. [PDF], [slides], [bib]


With the rapid growth of multi-source social media resources, comprehensive user profile learning has become a backbone of various application domains. User profile components such as user mobility and user demography describe social media users from different views. However, little research has been done on multi-source multimodal user profile learning. Moreover, no benchmark dataset has been released for user mobility and demographic profiling.

Here we introduce a multi-source dataset created by the Lab for Media Search at the National University of Singapore. The dataset includes six types of features extracted from the collected data: location semantics features, location semantics LDA-based features, text LDA-based features, text LIWC features, sentiment and writing style features, and ImageNet image concept features. It also includes ground-truth data, and covers three geographical regions: Singapore, New York, and London.

In order to cover the most popular data modalities (visual, textual and location data), we incorporate the following social media sources: Foursquare (the largest location-based social network) as a location data source; Twitter (the microblogging service with the largest English-speaking user base) as a textual data source; Instagram (the most popular photo sharing service) as a visual data source; and Facebook as a ground-truth source.

We also provide baseline results for user demographic profiling, obtained by learning from the text, image and location data with an ensemble model. The benchmark results show that it is possible to learn models from these data to improve user profile learning. For more details about user profile learning and the feature descriptions, please see the [slides]. The number of data records in the dataset for each geographical region is presented in the table below:

Region      Number of users   Number of Twitter tweets   Number of Foursquare check-ins   Number of Instagram images
Singapore   7,023             11,732,489                 366,268                          263,530
London      5,503             2,973,162                  127,276                          65,088
New York    7,957             5,263,630                  304,493                          230,752
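
To compare the regions on an equal footing, the counts in the table above can be converted into per-user averages. The following is a minimal sketch of that arithmetic; the names REGION_COUNTS, per_user_averages and totals are illustrative and are not part of the released dataset or its tooling.

```python
# Counts copied from the table above, per region:
# (users, Twitter tweets, Foursquare check-ins, Instagram images).
REGION_COUNTS = {
    "Singapore": (7023, 11732489, 366268, 263530),
    "London": (5503, 2973162, 127276, 65088),
    "New York": (7957, 5263630, 304493, 230752),
}


def per_user_averages(counts):
    """Return the average number of tweets, check-ins and images per user."""
    averages = {}
    for region, (users, tweets, checkins, images) in counts.items():
        averages[region] = {
            "tweets": tweets / users,
            "checkins": checkins / users,
            "images": images / users,
        }
    return averages


def totals(counts):
    """Sum each column over all regions: (users, tweets, check-ins, images)."""
    return tuple(sum(column) for column in zip(*counts.values()))


if __name__ == "__main__":
    for region, avg in per_user_averages(REGION_COUNTS).items():
        print(
            f"{region}: {avg['tweets']:.1f} tweets, "
            f"{avg['checkins']:.1f} check-ins, "
            f"{avg['images']:.1f} images per user"
        )
    print("Totals (users, tweets, check-ins, images):", totals(REGION_COUNTS))
```

These averages only summarize the published counts. They say nothing about how records are distributed across individual users.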

Our dataset can be used for both descriptive and prescriptive research. That is to say, we do not intend to constrain future research on user profile learning, since the available ground truth makes it possible to tackle other contemporary problems. We list some potential research topics that can be explored using our released dataset:

  1. Complete demographic profiling. Researchers are encouraged to learn other demographic attributes, such as occupation, personality and social status.
  2. Extended mobility profiling. In the current study, we focused on category-specific user mobility profiling; it would also be useful to incorporate spatio-temporal factors of users' movement.
  3. Causality patterns extraction. It is important to discover potential causal relationships between events from multiple data sources. For example, the "flower" image concept could be temporally related to flower shop check-ins or tweets about flowers.
  4. Cross-source user identification. The alignment of user accounts across multiple social resources can benefit from user profile compilation.
  5. Cross-region user profiling and community matching. This direction may offer insight into differences and similarities between users' preferences.
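
As an illustration of topic 3, a first step towards finding temporally related events is to count how often an event in one source is followed, within a fixed time window, by an event in another source. The sketch below assumes each source has already been reduced to a list of timestamps. The function name count_temporal_matches, the two-hour window and all timestamps are invented for the example and are not taken from the dataset. Temporal proximity alone does not establish causality; it only flags candidate pairs for further analysis.

```python
from datetime import datetime, timedelta


def count_temporal_matches(source_events, target_events, window):
    """Count source events followed by at least one target event within `window`."""
    matches = 0
    for source_time in source_events:
        # A target event matches if it occurs at or after the source event
        # and no later than `window` afterwards.
        if any(
            timedelta(0) <= target_time - source_time <= window
            for target_time in target_events
        ):
            matches += 1
    return matches


# Toy timestamps, invented for illustration only.
flower_image_times = [
    datetime(2015, 1, 1, 10, 0),
    datetime(2015, 1, 3, 9, 0),
]
flower_shop_checkin_times = [
    datetime(2015, 1, 1, 11, 30),
    datetime(2015, 1, 5, 12, 0),
]

matched = count_temporal_matches(
    flower_image_times, flower_shop_checkin_times, timedelta(hours=2)
)
print(f"{matched} of {len(flower_image_times)} images were followed by a check-in")
```

The nested scan is quadratic in the number of events. For full user timelines, sorting both lists and sweeping them once would be a more suitable design.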


To obtain the anonymized timeline data from the three social networks, please contact us.


For any questions regarding the NUS-MSS dataset, please contact Mr. Aleksandr Farseev.