This project implements a Graph Convolutional Network (GCN) for node classification on the Cora dataset. The GCN model leverages node features (embeddings) and graph structure (adjacency matrix) to classify nodes into different categories based on their content and citation relationships.
Data for the Cora dataset will be automatically downloaded into a folder named `cora` in the current working directory if it is not already present. The dataset includes papers (nodes), embeddings (features), categories of papers (labels), and citation relationships (edges).
- Adjacency Matrix Generation: Create an adjacency matrix using the citation relationships (edges) among the papers.
- Normalization: Apply normalization to both the features (embeddings) and the adjacency matrix to standardize data scales.
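The two preprocessing steps above can be sketched as follows. This is a minimal illustration with a toy edge list, not the project's actual loading code; the normalization scheme shown (self-loops plus symmetric degree normalization for the adjacency, row normalization for the features) is the common GCN convention and is assumed here rather than confirmed from the source.

```python
import numpy as np

# Toy citation edges (hypothetical paper-id pairs); the real project
# builds these from the Cora citation file.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Adjacency matrix with self-loops: A_hat = A + I.
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
A_hat = A + np.eye(n)

# Symmetric normalization: D^{-1/2} A_hat D^{-1/2}.
deg = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

# Row-normalize the feature matrix so each node's embedding sums to 1.
X = np.random.rand(n, 8)
X_norm = X / X.sum(axis=1, keepdims=True)
```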
The GCN model takes the embeddings and adjacency matrix as inputs. It is defined with two graph convolutional layers:
- The first layer transforms the input features into a hidden layer with 20 neurons, followed by a dropout layer for regularization.
- The second graph convolutional layer outputs the class scores.
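A forward pass through such a two-layer GCN can be sketched in NumPy (the real model is presumably implemented in a deep-learning framework; the shapes and random weights here are stand-ins, with the hidden size of 20 taken from the text and dropout omitted since it is inactive at inference time):

```python
import numpy as np

rng = np.random.default_rng(0)
n, in_dim, hidden, n_classes = 4, 8, 20, 7  # hidden=20 as described above

A_norm = np.eye(n)            # stand-in for the normalized adjacency matrix
X = rng.random((n, in_dim))   # stand-in for the normalized embeddings
W1 = rng.standard_normal((in_dim, hidden)) * 0.1
W2 = rng.standard_normal((hidden, n_classes)) * 0.1

def relu(z):
    return np.maximum(z, 0.0)

# Layer 1: propagate over the graph, transform, apply ReLU.
H = relu(A_norm @ X @ W1)
# Layer 2: propagate again and emit per-node class scores.
scores = A_norm @ H @ W2
```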
The Cora dataset is split into training and testing sets using stratified sampling over 10 splits, which ensures that the label distribution is consistent between the train and test sets; this is crucial given the label imbalance in the dataset. Each split uses 80% of the data for training and 20% for testing.
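A stratified 80/20 split can be sketched as below. This is a self-contained illustration with toy labels (the helper name and class counts are hypothetical, not taken from the project, which may use a library routine such as scikit-learn's instead):

```python
import random
from collections import defaultdict

# Toy labels standing in for the Cora paper categories.
labels = [i % 3 for i in range(30)]

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) preserving per-class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    train, test = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(len(idxs) * test_frac))
        test.extend(idxs[:cut])
        train.extend(idxs[cut:])
    return sorted(train), sorted(test)

train_idx, test_idx = stratified_split(labels)
```

Because each class is sampled separately, every split preserves the overall label proportions, which is what makes the evaluation reliable under label imbalance.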
During training:
- The model is updated using only the training indices in each batch.
- Training history (loss and accuracy on the train and test sets) is saved in the `train_history` folder.
- Plots of training progress are saved in the `train_history_plots` folder.
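The training-index masking mentioned above can be sketched as a masked cross-entropy: the loss (and hence the gradient) is computed only over the training nodes. The score matrix, labels, and indices here are toy stand-ins, not the project's actual data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_classes = 6, 3
scores = rng.standard_normal((n, n_classes))  # stand-in for GCN outputs
labels = rng.integers(0, n_classes, size=n)
train_idx = np.array([0, 2, 4])               # hypothetical training indices

# Softmax cross-entropy evaluated only on the training nodes;
# test nodes contribute nothing to the update.
logits = scores[train_idx]
logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
loss = -log_probs[np.arange(len(train_idx)), labels[train_idx]].mean()
```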
All model and training parameters are defined in the `configs/config.toml` file. This configuration file allows for easy adjustment of parameters.
You can run the project code within two different environments: Conda or Docker.
```sh
conda create -n node_cora python=3.11
conda activate node_cora
pip install --no-cache-dir -r requirements.txt
```
You can execute the project code within the Conda environment using one of two methods: Shell scripts or Python commands.
```sh
chmod +x run.sh
./run.sh
```
If the dataset is not already downloaded in the `cora` folder, you can download and extract it using the following commands:
```sh
# Commands to download the dataset
wget https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz
echo "Extracting the dataset..."
tar -xzvf cora.tgz
echo "Cleaning up downloaded files..."
rm cora.tgz
```
```sh
# Training configurations are stored in configs/config.toml,
# including model hyperparameters.
python train.py --config_path configs/config.toml

# By default, the dataset is split using a stratified method. You can opt for
# k-fold cross-validation by specifying the `cv_method` parameter:
python train.py --config_path configs/config.toml --cv_method "kfold"

# Configuration for predictions, such as the location of the trained model
# and the output file name, is in configs/pred_config.toml.
python predict.py --config_path configs/pred_config.toml
```
```sh
docker-compose up --build
```
After running the project, you can expect the following outputs, which help in evaluating performance:
- Paper Category Predictions: The predicted category for each paper in the dataset is saved in a `prediction.tsv` file. If you only need the prediction file and want to skip the training process, you can use the trained model located in `models/model_split_1` to make predictions. Here are the instructions:
You can make predictions using the trained model within two different environments: Conda or Docker.
```sh
conda create -n node_cora python=3.11
conda activate node_cora
pip install --no-cache-dir -r requirements.txt
```
You can make the predictions within the Conda environment using one of two methods: Shell scripts or Python commands.
```sh
chmod +x run.predict.sh
./run.predict.sh
```
If the dataset is not already downloaded in the `cora` folder, you can download and extract it using the following commands:
```sh
# Commands to download the dataset
wget https://linqs-data.soe.ucsc.edu/public/lbc/cora.tgz
echo "Extracting the dataset..."
tar -xzvf cora.tgz
echo "Cleaning up downloaded files..."
rm cora.tgz
```
```sh
python predict.py --config_path configs/pred_config.toml
```
```sh
docker-compose -f docker-compose.predict.yaml up --build
```
- Model Files: The trained models for each data split are saved in the `models` folder.
- Loss and Accuracy Logs: Detailed logs of training loss and accuracy for both the training and test sets across every split are stored in the `train_history` folder.
- Visualization of Model Training: Plots illustrating the training history, including loss and accuracy over epochs for each split, are saved in the `train_history_plots` folder.