Databases

Here are some data sets in this area often used by researchers:

Mark Newman’s Network data

This page contains links to some network data sets: Les Miserables, Word adjacencies, American College football, Dolphin social network, Political blogs, Books about US politics, Neural network, Power grid, Condensed matter collaborations 1999, Condensed matter collaborations 2003, Condensed matter collaborations 2005, Astrophysics collaborations, High-energy theory collaborations, Coauthorships in network science, Internet

Online Social Networks Research

This page contains links to some online social network data sets: Flickr, LiveJournal, Orkut, YouTube

Measurements on the overlay characteristics of PPLive

PPLive is a P2P IPTV streaming system that stands out for its heterogeneous channels and increasing popularity. This project gathers data about the PPLive overlay by crawling the live PPLive network. The traces contain data on the node degree in the PPLive overlay, the overlay structure, the channel population size, and the node session lengths per overlay.

Delft BitTorrent

This resource contains traces of BitTorrent systems. You can find data on many thousands of torrents, including the hosts (by IP) observed during the entire torrent lifecycle. The IP is the anonymized IP address of the observed peer; the messages of the downloaders are also captured.

Jure Leskovec’s Network data

A collection of large network datasets: Social networks, Communication networks, Citation networks, Collaboration networks, Web graphs, Blog and Memetracker graphs, Amazon networks, Internet networks, Road networks, Autonomous systems, Signed networks

Digg 2009
This anonymized data set consists of the voting records for 3553 stories promoted to the front page over a period of a month in 2009. The voting record for each story contains the ID of each voter and the timestamp of the vote. In addition, data about the friendship links of voters was collected from Digg.
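
As a rough illustration of how voting records and friendship links of this kind can be combined, the sketch below counts, for each story, the votes cast by users who list an earlier voter on that story as a friend. The tuple layouts, the direction of the friendship link, and all sample values are assumptions made for this example; they are not the actual format of the Digg files.

```python
from collections import defaultdict

# Hypothetical vote records: (story_id, voter_id, unix_timestamp)
votes = [
    (1, "a", 100),
    (1, "b", 130),
    (1, "c", 160),
    (2, "c", 200),
    (2, "a", 260),
]

# Hypothetical friendship links: (user, friend) means "user lists friend"
friend_links = [("b", "a"), ("c", "b"), ("a", "d")]


def count_in_network_votes(votes, friend_links):
    """For each story, count votes whose voter lists an earlier voter as a friend."""
    friends = defaultdict(set)
    for user, friend in friend_links:
        friends[user].add(friend)

    # Group votes by story so each story can be replayed in time order.
    by_story = defaultdict(list)
    for story, voter, timestamp in votes:
        by_story[story].append((timestamp, voter))

    result = {}
    for story, records in by_story.items():
        records.sort()  # chronological order
        earlier_voters = set()
        count = 0
        for _, voter in records:
            if friends[voter] & earlier_voters:
                count += 1
            earlier_voters.add(voter)
        result[story] = count
    return result


in_network = count_in_network_votes(votes, friend_links)
print(in_network)
```

With real data, the two lists would be read from the downloaded files after checking their documented column order.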

Wrapper maintenance
Wrappers facilitate access to Web-based information sources by providing a uniform querying and data extraction capability. When a wrapper stops working due to changes in the layout of web pages, our task is to automatically reinduce the wrapper. The data sets used for experiments in our JAIR 2003 paper contain web pages downloaded from two dozen sources over a period of a year.

BlogCatalog
An ideal data set for learning tasks with rich social networking information. Especially suitable for prediction and community detection tasks with ground truth in place to verify your hypotheses. It has link information (i.e., friends), content information (e.g., tags, posts), and label information (i.e., user interests).

Flickr: a photo sharing dataset
It includes more than 35,000 users, along with their joined groups and tags. It also includes the friendship and the commentship (i.e., who comments on whose photos) among the set of users. The joined groups can be treated as class labels in classification tasks, or as ground truth for community detection tasks.

TWITTER’S SOCIAL GRAPH
We have been sharing a social graph (follow relationships) of Twitter as of 2009. 41.7M users and 1.47B relationships are available.
You can download the file via torrent, or via HTTP if you cannot use torrent.

METADATA
We have been sharing metadata of videos uploaded to YouTube as of 2006. Metadata for over 2M videos is available.

COMMUNITY IDENTIFICATION ALGORITHMS & NETWORKS
We have been sharing pointers to existing community identification algorithms that maximize modularity. We also provide pointers to well-known network data including Karate, E. coli, WWW, Flickr, Orkut, etc.
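
For readers unfamiliar with modularity, here is a minimal sketch of the quantity such algorithms try to maximize, in its standard Newman-Girvan form for an undirected, unweighted graph without self-loops: Q is the sum over communities of (intra-community edges / m) minus (total degree of the community / 2m) squared, where m is the number of edges. The toy graph and the function name are made up for illustration, and the linked algorithms may use variants of this definition.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman-Girvan modularity of a partition.

    edges: list of (u, v) pairs, undirected, no self-loops or duplicates.
    community: dict mapping each node to its community label.
    """
    m = len(edges)
    intra_edges = defaultdict(int)  # edges with both ends in the community
    degree_sum = defaultdict(int)   # sum of node degrees in the community
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            intra_edges[community[u]] += 1
    return sum(
        intra_edges[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )


# Toy graph (illustrative only): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}

q_two = modularity(edges, two_groups)  # 5/14, roughly 0.357
q_one = modularity(edges, one_group)   # 0 when everything is one community
print(q_two, q_one)
```

Splitting the toy graph into its two triangles scores higher than lumping all nodes together, which is the behavior a modularity-maximizing algorithm exploits.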

Flickr personal taxonomies
This anonymized data set contains personal taxonomies constructed by 7,000+ Flickr users to organize their photos, as well as the tags they associated with the photos. Personal taxonomies are shallow hierarchies (trees) of collections, each of which contains sets (aka photo albums) and other collections.
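
One way to picture such a taxonomy is as a small tree in which collections are internal nodes and sets are leaves. The sketch below is only an in-memory illustration under that assumption; the class name, field names, and sample albums are hypothetical and do not reflect the file format of the data set.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    kind: str  # "collection" or "set" (a set is a photo album)
    children: list = field(default_factory=list)


def depth(node):
    """Number of levels in the tree rooted at node (a lone node has depth 1)."""
    return 1 + max((depth(child) for child in node.children), default=0)


def leaf_sets(node):
    """Names of all sets (photo albums) under node, in left-to-right order."""
    if node.kind == "set":
        return [node.name]
    names = []
    for child in node.children:
        names.extend(leaf_sets(child))
    return names


# Hypothetical taxonomy: a collection holding a sub-collection and a set.
taxonomy = Node("Travel", "collection", [
    Node("Europe", "collection", [
        Node("Paris 2008", "set"),
        Node("Rome 2009", "set"),
    ]),
    Node("Road trip", "set"),
])

print(depth(taxonomy))
print(leaf_sets(taxonomy))
```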

This is an overview of repositories of datasets.

Co-authorship and Citation Networks: DBLP, KDD Cup Dataset

Internet Topology: AS Graphs

Wikipedia: Wikipedia page to page link data, DBPedia

Movie Ratings: IMDB database, User rating data, MovieLens

Who trusts whom data at Trustlet, Trust network datasets

Social network data set (datamob)

University of Zurich datasets: Mobile Web Standards 2011-05 and The Pirate Bay 2008-12

Distributed Artificial Intelligence Laboratory (DAI-Labor)

Jazz musicians network, PGP: Alex Arenas datasets

Center for Advanced Study of Communities and Information CASCI

Networking Group wiki page: Athina Markopoulou datasets

Facebook: Emilio Ferrara

Internet Topology

YouTube / Friendster

Complex Network Resources

Pajek datasets

Tore Opsahl

Complex Networks | Digital Media Lab