Cora Dataset

The Cora dataset is one of the most widely used benchmarks in graph machine learning. It is a citation network of machine-learning publications, and researchers routinely use it to develop and evaluate algorithms for node classification, link prediction, and clustering. This article gives an overview of the dataset's structure, applications, features, and labels, and shows several ways to load it in Python.

Cora Dataset Overview

The Cora dataset is a citation network of 2,708 machine-learning papers, organized into seven distinct classes. These papers are interlinked by 5,429 citations, forming a directed graph that maps out how papers cite each other. Each paper is represented by a binary word vector, derived from a dictionary of 1,433 unique words, indicating the presence or absence of specific words in the paper.

The dataset is distributed as two tab-separated files:

Cora Content File (cora.content): Contains information about the nodes (papers). Each line holds a paper ID, the paper's binary word vector, and its class label.
Cora Citations File (cora.cites): Describes the edges (citations) between nodes. Each line holds two paper IDs, the cited paper followed by the citing paper, and represents a directed edge from the citing paper to the cited one.
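
As a rough illustration of these two layouts, the sketch below parses a few made-up lines in the same tab-separated shape. The paper IDs, the three-word vectors (real rows carry 1,433 flags), and the helper names parse_content and parse_cites are invented for the example, and the cited-then-citing column order is assumed to match the commonly distributed files.

```python
# Toy stand-ins for cora.content and cora.cites (made-up IDs, 3 words instead of 1,433).
CONTENT_SAMPLE = (
    "101\t0\t1\t1\tNeural_Networks\n"
    "102\t1\t0\t0\tTheory\n"
    "103\t0\t0\t1\tNeural_Networks\n"
)
CITES_SAMPLE = "101\t102\n101\t103\n"


def parse_content(text):
    """Return {paper_id: (word_flags, label)} from content-style lines."""
    papers = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        paper_id = int(fields[0])
        flags = [int(value) for value in fields[1:-1]]
        papers[paper_id] = (flags, fields[-1])
    return papers


def parse_cites(text):
    """Return (citing, cited) pairs; each line is assumed to list the cited paper first."""
    edges = []
    for line in text.splitlines():
        if not line.strip():
            continue
        cited, citing = (int(value) for value in line.split("\t"))
        edges.append((citing, cited))
    return edges


papers = parse_content(CONTENT_SAMPLE)
edges = parse_cites(CITES_SAMPLE)
print(len(papers), "papers,", len(edges), "citations")
print(papers[101])
print(edges)
```

Keeping the edges as (citing, cited) pairs makes the direction explicit, which matters when the graph is later treated as directed.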

Usage and Applications

The Cora dataset is extensively used for evaluating graph-based machine learning algorithms. Its applications span several key areas:

Node Classification: Predicting the class of each node (paper) based on its features and the graph structure. Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) are examples of models tested using the Cora dataset.
Link Prediction: Inferring missing links or predicting future citations between nodes. The Cora dataset serves as a benchmark for algorithms that analyze the likelihood of connections within the graph.
Clustering: Grouping nodes into clusters with similar properties. The dataset's community structure is ideal for testing clustering algorithms, helping to identify natural groupings within the network.
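
To make the node classification task concrete, here is a minimal sketch of a neighbour-vote baseline: an unlabeled paper receives the most common label among its labeled neighbours. This is a deliberately simple stand-in, not a GCN or GAT, and the toy graph, the known labels, and the function name predict_label are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy citation graph: (citing, cited) pairs over six papers.
toy_edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (3, 6)]

# Labels that are assumed to be known; papers 3 and 5 are left to predict.
known_labels = {1: "Theory", 2: "Theory", 4: "Rule Learning", 6: "Rule Learning"}

# Ignore edge direction when collecting neighbours.
neighbors = defaultdict(set)
for citing, cited in toy_edges:
    neighbors[citing].add(cited)
    neighbors[cited].add(citing)


def predict_label(node):
    """Return the majority label among labeled neighbours, or None if there are none."""
    votes = Counter(
        known_labels[other] for other in neighbors[node] if other in known_labels
    )
    if not votes:
        return None
    # Break ties alphabetically so the result is deterministic.
    return min(votes.items(), key=lambda item: (-item[1], item[0]))[0]


for paper in (3, 5):
    print(paper, predict_label(paper))
```

A baseline like this uses only the graph structure; models such as GCNs additionally combine it with the word-vector features.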

Features and Labels

Each paper in the Cora dataset is described by a binary word vector, which serves as the feature set for the dataset. The presence (1) or absence (0) of each word from a dictionary of 1,433 unique words is recorded in this vector. This high-dimensional feature space captures the content of each paper, enabling detailed analysis and classification.
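
The idea of a binary word vector can be sketched in a few lines. The five-word vocabulary, the sample text, and the function name binary_word_vector are made up for illustration, and the dataset's actual text preprocessing is not reproduced here.

```python
def binary_word_vector(text, vocabulary):
    """Return a list with 1 where the vocabulary word occurs in the text, else 0."""
    words = set(text.lower().split())
    return [1 if word in words else 0 for word in vocabulary]


# Hypothetical 5-word dictionary; Cora's dictionary has 1,433 entries.
vocabulary = ["graph", "kernel", "network", "policy", "reward"]
abstract = "A neural network policy trained with a sparse reward"

vector = binary_word_vector(abstract, vocabulary)
print(vector)  # [0, 0, 1, 1, 1]
```

Because only presence is recorded, a word that appears ten times produces the same entry as a word that appears once.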

The labels in the Cora dataset correspond to the seven classes of machine learning topics:

Case Based
Genetic Algorithms
Neural Networks
Probabilistic Methods
Reinforcement Learning
Rule Learning
Theory

These labels provide a categorical classification for each paper, which is used as the target variable in various machine learning tasks.
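
Most models expect the target as an integer or a one-hot vector rather than a class name. The sketch below shows one way to encode the seven labels; the alphabetical index order and the helper names encode_labels and one_hot are arbitrary choices made for this example, and a given library may number the classes differently.

```python
CLASS_NAMES = [
    "Case Based",
    "Genetic Algorithms",
    "Neural Networks",
    "Probabilistic Methods",
    "Reinforcement Learning",
    "Rule Learning",
    "Theory",
]

# Map each class name to an integer index (order chosen for this sketch only).
label_to_index = {name: index for index, name in enumerate(CLASS_NAMES)}


def encode_labels(labels):
    """Convert a list of class names to integer indices."""
    return [label_to_index[label] for label in labels]


def one_hot(index, num_classes=len(CLASS_NAMES)):
    """Return a one-hot list for the given class index."""
    return [1 if position == index else 0 for position in range(num_classes)]


encoded = encode_labels(["Theory", "Neural Networks", "Case Based"])
print(encoded)             # [6, 2, 0]
print(one_hot(encoded[1]))  # [0, 0, 1, 0, 0, 0, 0]
```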

Methods to Load Cora Dataset

Below are some of the ways to load the Cora dataset in Python:

Using PyTorch Geometric
Using DGL (Deep Graph Library)
Using NetworkX
Using TensorFlow

1. Using PyTorch Geometric

PyTorch Geometric is a library designed for deep learning on irregular structures such as graphs. It provides a straightforward way to load the Cora dataset.

Install PyTorch Geometric

Install PyTorch Geometric with the following command (PyTorch itself must already be installed):

pip install torch-geometric

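Load the Cora dataset

Python

from torch_geometric.datasets import Planetoid

# Download (on first use) and load the Cora dataset into the given root directory
dataset = Planetoid(root='/tmp/Cora', name='Cora')

# The dataset contains a single graph; access it and print a summary
data = dataset[0]
print(data)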

Output:

Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.x
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.tx
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.allx
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.y
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.ty
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.ally
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.graph
Downloading https://raw.githubusercontent.com/kimiyoung/planetoid/master/data/ind.cora.test.index
Processing...
Done!
Data(x=[2708, 1433], edge_index=[2, 10556], y=[2708], train_mask=[2708], val_mask=[2708], test_mask=[2708])
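
The edge_index in this output has 10,556 columns, roughly twice the number of citation links. That is consistent with this version of the graph being stored as undirected, with each link listed once per direction and duplicates merged. The sketch below illustrates that convention on a made-up edge list. It is not PyTorch Geometric's actual preprocessing code, and the function name to_undirected is used here only for illustration.

```python
def to_undirected(directed_edges):
    """Return a sorted, duplicate-free edge list containing both directions of every link."""
    pairs = set()
    for source, target in directed_edges:
        pairs.add((source, target))
        pairs.add((target, source))
    return sorted(pairs)


# Hypothetical citations: papers 1 and 2 cite each other, and (3, 4) is listed twice.
citations = [(1, 2), (2, 1), (2, 3), (3, 4), (3, 4)]

symmetric = to_undirected(citations)

# Arrange the pairs as two rows (sources, targets), mirroring the edge_index layout.
edge_index = [
    [source for source, _ in symmetric],
    [target for _, target in symmetric],
]
print(len(citations), "directed entries ->", len(symmetric), "stored entries")
print(edge_index)
```

Five directed entries collapse to three distinct links and therefore six stored entries, which is why the stored edge count need not be exactly twice the raw citation count.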

2. Using DGL (Deep Graph Library)

DGL is another powerful library designed for deep learning on graphs. It simplifies the process of building and training graph neural networks.

Installation

First, install DGL:

pip install dgl

Loading the Cora Dataset

Python

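from dgl.data import CoraGraphDataset

# Download (on first use) and load the Cora dataset
dataset = CoraGraphDataset()

# The dataset contains a single graph
graph = dataset[0]

# Node features and labels are stored in the graph's node data
features = graph.ndata['feat']
labels = graph.ndata['label']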

3. Using NetworkX

NetworkX is a library for the creation, manipulation, and study of complex networks of nodes and edges.

Installation

First, install NetworkX:

pip install networkx

Loading the Cora Dataset

NetworkX does not have built-in support for the Cora dataset, but you can build the graph manually from the raw cora.content and cora.cites files. The example below assumes both files are in the current working directory:

Python

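import networkx as nx
import pandas as pd

# Load the data: each cora.cites line lists the cited paper, then the citing paper
edges = pd.read_csv('cora.cites', sep='\t', header=None, names=['target', 'source'])
# Each cora.content line lists the paper ID, the word flags, and the class label
nodes = pd.read_csv('cora.content', sep='\t', header=None)

# Create a directed graph with edges pointing from the citing to the cited paper
G = nx.from_pandas_edgelist(edges, 'source', 'target', create_using=nx.DiGraph())

# Add node attributes; use positional .iloc because the columns have integer labels
for _, row in nodes.iterrows():
    G.add_node(row.iloc[0], feature=row.iloc[1:-1].to_numpy(dtype='int8'), label=row.iloc[-1])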

4. Using TensorFlow

TensorFlow does not provide a built-in loader for the Cora dataset, but the raw files can be read with pandas and then converted into tensors and tf.data datasets manually.

Installation

First, install TensorFlow:

pip install tensorflow

Python

import tensorflow as tf
import pandas as pd

# Load the data (edges and node features/labels)
edges = pd.read_csv('cora.cites', sep='\t', header=None, names=['target', 'source'])
nodes = pd.read_csv('cora.content', sep='\t', header=None)

# Create a dataset of (source, target) paper-ID pairs
edge_dataset = tf.data.Dataset.from_tensor_slices((edges['source'].values, edges['target'].values))

# Split the node table into word-vector features and class labels
features = nodes.iloc[:, 1:-1].to_numpy(dtype='float32')
# Labels are stored as strings, so map each class name to an integer code
labels, class_names = pd.factorize(nodes.iloc[:, -1], sort=True)

# Convert to tensors
features = tf.convert_to_tensor(features, dtype=tf.float32)
labels = tf.convert_to_tensor(labels, dtype=tf.int64)

Conclusion

The Cora dataset remains a standard benchmark for the graph machine learning community. Its manageable size, binary word-vector features, and seven-class labels make it a convenient starting point for testing node classification, link prediction, and clustering methods. It can be loaded directly with PyTorch Geometric or DGL, or built manually from the raw files with NetworkX or TensorFlow.
