NUS-MSS
Citation
A. Farseev, N. Liqiang, M. Akbari, and T.-S. Chua. Harvesting multiple sources for user profile learning: a Big data study. ACM International Conference on Multimedia Retrieval (ICMR). China. June 23-26, 2015. [PDF], [slides], [bib]
Description
With the rapid growth of multi-source social media resources, comprehensive user profile learning has become a backbone of various application domains. User profile components such as user mobility and user demography describe social media users from different views. However, little research has been done on multi-source, multimodal user profile learning. Moreover, no benchmark dataset has been released for user mobility and demographic profiling.
Here we introduce a multi-source dataset created by the Lab for Media Search at the National University of
Singapore.
The dataset includes six types of features extracted from these data: location
semantics features, location semantics LDA-based features, text LDA-based features,
text LIWC features, sentiment and writing style features, and ImageNet image concept
features. It also includes ground-truth data from three geographical regions: Singapore, New York, and
London.
To cover the most popular data modalities (visual, textual, and location
data), we incorporate the following social media sources:
Foursquare (the largest location-based social network) as
a location data source; Twitter (the microblogging service with
the biggest English-speaking user base) as a textual data
source; Instagram (the most popular photo-sharing service)
as a visual data source; and Facebook as a ground-truth source.
We also provide baseline results for user demographic profiling, obtained by learning from the text, image, and
location data with an ensemble model.
The benchmark results show that it is possible to learn models from these data that improve user
profile learning. For more details about user profile learning
and the feature descriptions, see the [slides].
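This page does not say how the ensemble combines the three sources. The minimal sketch below shows one common option, weighted late fusion, in which each source produces class probabilities that are then averaged. It is an illustration under stated assumptions, not necessarily the authors' method: the function name `late_fusion`, the source keys, the class labels `class_a` and `class_b`, and the probability values are all hypothetical.

```python
def late_fusion(source_probs, weights=None):
    """Combine per-source class probabilities by weighted averaging.

    source_probs maps a source name to a dict of {label: probability}.
    weights maps a source name to a non-negative weight (default: equal).
    Returns the winning label and the fused probability for every label.
    """
    sources = list(source_probs)
    if weights is None:
        weights = {source: 1.0 for source in sources}
    total_weight = sum(weights[source] for source in sources)
    labels = list(source_probs[sources[0]])
    fused = {
        label: sum(weights[s] * source_probs[s][label] for s in sources) / total_weight
        for label in labels
    }
    best_label = max(fused, key=fused.get)
    return best_label, fused


# Hypothetical per-source predictions for a single user (made-up numbers).
example = {
    "text": {"class_a": 0.60, "class_b": 0.40},
    "image": {"class_a": 0.30, "class_b": 0.70},
    "location": {"class_a": 0.55, "class_b": 0.45},
}

if __name__ == "__main__":
    label, fused = late_fusion(example)
    print(label, fused)
```

Late fusion keeps one model per source, so a source-specific model can be retrained or reweighted without touching the others. Passing `weights` lets a more reliable source count for more.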
The number of data records in the dataset for each geographical region is presented in the table below:
City | Number of users | Number of Twitter tweets | Number of Foursquare check-ins | Number of Instagram images |
---|---|---|---|---|
Singapore | 7,023 | 11,732,489 | 366,268 | 263,530 |
London | 5,503 | 2,973,162 | 127,276 | 65,088 |
New York | 7,957 | 5,263,630 | 304,493 | 230,752 |
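To compare activity levels across regions, the counts in the table can be normalised by the number of users. The short sketch below does this arithmetic; the counts are copied from the table, while the dictionary layout and the function names `per_user_averages` and `totals` are illustrative choices and not part of the dataset.

```python
# Record counts copied from the table above.
COUNTS = {
    "Singapore": {"users": 7023, "tweets": 11732489, "check_ins": 366268, "images": 263530},
    "London": {"users": 5503, "tweets": 2973162, "check_ins": 127276, "images": 65088},
    "New York": {"users": 7957, "tweets": 5263630, "check_ins": 304493, "images": 230752},
}


def per_user_averages(counts):
    """Return the average number of records per user for each city and source."""
    averages = {}
    for city, row in counts.items():
        users = row["users"]
        averages[city] = {
            source: value / users
            for source, value in row.items()
            if source != "users"
        }
    return averages


def totals(counts):
    """Sum every column over all cities."""
    summed = {}
    for row in counts.values():
        for key, value in row.items():
            summed[key] = summed.get(key, 0) + value
    return summed


if __name__ == "__main__":
    for city, averages in per_user_averages(COUNTS).items():
        formatted = ", ".join(f"{src}: {avg:.1f}" for src, avg in averages.items())
        print(f"{city}: {formatted}")
    print("Totals:", totals(COUNTS))
```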
Our dataset can be used for both descriptive and prescriptive research. That is to say, we do not intend to constrain future research to user profile learning, since the available ground truth makes it possible to tackle other contemporary problems. We list some potential research topics that can be pursued with our released dataset:
- Complete demographic profiling. Researchers are encouraged to learn other demographic attributes, such as occupation, personality, and social status.
- Extended mobility profiling. In the current study, we focused on category-specific user mobility profiling; it would also be useful to incorporate spatio-temporal factors of users' movement.
- Causality patterns extraction. It is important to discover potential causal relationships between events from multiple data sources. For example, the "flower" image concept could be temporally related to flower shop check-ins or tweets about flowers.
- Cross-source user identification. The alignment of user accounts across multiple social resources can benefit from user profile compilation.
- Cross-region user profiling and community matching. This direction may offer insight into differences and similarities between users' preferences.
Downloads
To get the anonymized timeline data from three social networks, please contact us.
Contacts
For any questions regarding the NUS-MSS dataset, please contact Mr. Aleksandr Farseev (farseev@u.nus.edu).