Commit 1677e219 authored by durandtibo

Initial commit
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# IDE
.idea/
.vscode/
# SLAPS-GNN
This repo contains the implementation of the model proposed
in `SLAPS: Self-Supervision Improves Structure Learning for Graph Neural Networks`.
## Datasets
The `ogbn-arxiv` dataset will be loaded automatically, while `Cora`, `Citeseer`, and `Pubmed` are included in the GCN
package, available [here](https://github.com/tkipf/gcn/tree/master/gcn/data). Place the relevant files in the folder
`data_tf`.
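For example, the files can be fetched with something like the following (a sketch; it assumes `git` is available and mirrors the `data_tf/ind.<dataset>.*` naming used by the loader in `citation_networks.py`):
```bash
# Clone the GCN repo and copy the Planetoid files into data_tf/
git clone https://github.com/tkipf/gcn.git /tmp/gcn
mkdir -p data_tf
cp /tmp/gcn/gcn/data/ind.* data_tf/
```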
## Dependencies
* `Python` version 3.7.2
* [`Numpy`](https://numpy.org/) version 1.18.5
* [`PyTorch`](https://pytorch.org/) version 1.5.1
* [`DGL`](https://www.dgl.ai/) version 0.5.2
* [`sklearn`](https://scikit-learn.org/stable/) version 0.21.3
* [`scipy`](https://www.scipy.org/) version 1.2.1
* [`torch-geometric`](https://github.com/rusty1s/pytorch_geometric) version 1.6.1
* [`ogb`](https://ogb.stanford.edu/) version 1.2.3
To train the models, you need a machine with a GPU.
To install the dependencies, it is recommended to use a virtual environment. You can create one and install all the
dependencies with the following command:
```bash
conda env create -f environment.yml
```
The file `requirements.txt` was written for CUDA 9.2 and Linux, so you may need to adapt it to your infrastructure.
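If you prefer plain `pip`, the following sketch mirrors what `environment.yml` does (assuming a Python 3.7 environment and the wheel indexes listed in `environment.yml`):
```bash
# Create and activate a virtual environment, then install the pinned dependencies
python3.7 -m venv .venv && source .venv/bin/activate
pip install --find-links https://download.pytorch.org/whl/torch_stable.html \
    --find-links https://pytorch-geometric.com/whl/torch-1.5.0.html \
    --requirement requirements.txt
```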
## Usage
To run the model, you should define the following parameters (an illustrative invocation follows the list):
- `dataset`: The dataset you want to run the model on
- `ntrials`: number of runs
- `epochs_adj`: number of epochs
- `epochs`: number of epochs for GNN_C (used for the knn_gcn and 2step variants of the model)
- `lr_adj`: learning rate of GNN_DAE
- `lr`: learning rate of GNN_C
- `w_decay_adj`: l2 regularization parameter for GNN_DAE
- `w_decay`: l2 regularization parameter for GNN_C
- `nlayers_adj`: number of layers for GNN_DAE
- `nlayers`: number of layers for GNN_C
- `hidden_adj`: hidden size of GNN_DAE
- `hidden`: hidden size of GNN_C
- `dropout1`: dropout rate for GNN_DAE
- `dropout2`: dropout rate for GNN_C
- `dropout_adj1`: dropout rate on adjacency matrix for GNN_DAE
- `dropout_adj2`: dropout rate on adjacency matrix for GNN_C
- `k`: number of neighbors for the kNN graph used for initialization
- `lambda_`: weight of loss of GNN_DAE
- `nr`: ratio of zeros to ones to mask out for binary features
- `ratio`: ratio of ones to mask out for binary features, and ratio of features to mask out for real-valued features
- `model`: model to run (choices are end2end, knn_gcn, or 2step)
- `sparse`: whether to make the adjacency sparse and run operations in sparse mode
- `gen_mode`: identifies the graph generator (in the commands below, 0 corresponds to FP, 1 to MLP, and 2 to MLP-D)
- `non_linearity`: non-linearity to apply on the adjacency matrix
- `mlp_act`: activation function to use for the mlp graph generator
- `mlp_h`: hidden size of the mlp graph generator
- `noise`: type of noise to add to features (mask or normal)
- `loss`: type of GNN_DAE loss (mse or bce)
- `epoch_d`: `epochs_adj / epoch_d` of the epochs will be used for training GNN_DAE
- `half_val_as_train`: use half of the validation set for training to obtain Cora390 and Citeseer370
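As an illustration, a hypothetical minimal run of the `knn_gcn` baseline might look as follows (flag values are illustrative only, not the tuned settings from the paper, and `main.py` may expect additional flags):
```bash
python main.py -dataset cora -ntrials 1 -epochs 200 -lr 0.01 -w_decay 0.0005 -nlayers 2 -hidden 32 -dropout2 0.5 -k 20 -model knn_gcn -sparse 0
```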
## Reproducing the Results in the Paper
To reproduce the results presented in the paper, run the following commands:
### Cora
#### FP
Run the following command:
```bash
python main.py -dataset cora -ntrials 10 -epochs_adj 2000 -lr 0.001 -lr_adj 0.01 -w_decay 0.0005 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 512 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.5 -dropout_adj2 0.25 -k 30 -lambda_ 10.0 -nr 5 -ratio 10 -model end2end -sparse 0 -gen_mode 0 -non_linearity elu -epoch_d 5
```
#### MLP
Run the following command:
```bash
python main.py -dataset cora -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.001 -w_decay 0.0005 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 512 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.25 -dropout_adj2 0.5 -k 20 -lambda_ 10.0 -nr 5 -ratio 10 -model end2end -sparse 0 -gen_mode 1 -non_linearity relu -mlp_h 1433 -mlp_act relu -epoch_d 5
```
#### MLP-D
Run the following command:
```bash
python main.py -dataset cora -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.001 -w_decay 0.05 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 512 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.25 -dropout_adj2 0.5 -k 15 -lambda_ 10.0 -nr 5 -ratio 10 -model end2end -sparse 0 -gen_mode 2 -non_linearity relu -mlp_act relu -epoch_d 5
```
### Citeseer
#### FP
Run the following command:
```bash
python main.py -dataset citeseer -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.01 -w_decay 0.05 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 1024 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.4 -dropout_adj2 0.4 -k 30 -lambda_ 1.0 -nr 1 -ratio 10 -model end2end -sparse 0 -gen_mode 0 -non_linearity elu -epoch_d 5
```
#### MLP
Run the following command:
```bash
python main.py -dataset citeseer -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.001 -w_decay 0.0005 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 1024 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.25 -dropout_adj2 0.5 -k 30 -lambda_ 10.0 -nr 5 -ratio 10 -model end2end -sparse 0 -gen_mode 1 -non_linearity relu -mlp_act tanh -mlp_h 3703 -epoch_d 5
```
#### MLP-D
Run the following command:
```bash
python main.py -dataset citeseer -ntrials 10 -epochs_adj 2000 -lr 0.001 -lr_adj 0.01 -w_decay 0.05 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 1024 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.5 -dropout_adj2 0.5 -k 20 -lambda_ 10.0 -nr 5 -ratio 10 -model end2end -sparse 0 -gen_mode 2 -non_linearity relu -mlp_act tanh -epoch_d 5
```
### Cora390
#### FP
Run the following command:
```bash
python main.py -dataset cora -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.01 -w_decay 0.0005 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 512 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.25 -dropout_adj2 0.5 -k 20 -lambda_ 100.0 -nr 5 -ratio 10 -model end2end -sparse 0 -gen_mode 0 -non_linearity elu -epoch_d 5 -half_val_as_train 1
```
#### MLP
Run the following command:
```bash
python main.py -dataset cora -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.001 -w_decay 0.0005 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 512 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.25 -dropout_adj2 0.5 -k 20 -lambda_ 10.0 -nr 5 -ratio 10 -model end2end -sparse 0 -gen_mode 1 -non_linearity relu -mlp_h 1433 -mlp_act relu -epoch_d 5 -half_val_as_train 1
```
#### MLP-D
Run the following command:
```bash
python main.py -dataset cora -ntrials 10 -epochs_adj 2000 -lr 0.001 -lr_adj 0.001 -w_decay 0.0005 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 512 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.25 -dropout_adj2 0.5 -k 20 -lambda_ 10.0 -nr 5 -ratio 10 -model end2end -sparse 0 -gen_mode 2 -non_linearity relu -mlp_act relu -epoch_d 5 -half_val_as_train 1
```
### Citeseer370
#### FP
Run the following command:
```bash
python main.py -dataset citeseer -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.01 -w_decay 0.05 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 1024 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.5 -dropout_adj2 0.5 -k 30 -lambda_ 1.0 -nr 1 -ratio 10 -model end2end -sparse 0 -gen_mode 0 -non_linearity elu -epoch_d 5 -half_val_as_train 1
```
#### MLP
Run the following command:
```bash
python main.py -dataset citeseer -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.001 -w_decay 0.0005 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 1024 -dropout1 0.25 -dropout2 0.5 -dropout_adj1 0.25 -dropout_adj2 0.5 -k 30 -lambda_ 10.0 -nr 5 -ratio 10 -model end2end -sparse 0 -gen_mode 1 -non_linearity relu -mlp_act tanh -mlp_h 3703 -epoch_d 5 -half_val_as_train 1
```
#### MLP-D
Run the following command:
```bash
python main.py -dataset citeseer -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.01 -w_decay 0.05 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 1024 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.25 -dropout_adj2 0.5 -k 20 -lambda_ 10.0 -nr 5 -ratio 10 -model end2end -sparse 0 -gen_mode 2 -non_linearity relu -mlp_act tanh -epoch_d 5 -half_val_as_train 1
```
### Pubmed
#### MLP
Run the following command:
```bash
python main.py -dataset pubmed -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.01 -w_decay 0.0005 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 128 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.5 -dropout_adj2 0.5 -k 15 -lambda_ 10.0 -nr 5 -ratio 10 -model end2end -gen_mode 1 -non_linearity relu -mlp_h 500 -mlp_act relu -epoch_d 5 -sparse 1
```
#### MLP-D
Run the following command:
```bash
python main.py -dataset pubmed -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.01 -w_decay 0.0005 -nlayers 2 -nlayers_adj 2 -hidden 32 -hidden_adj 128 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.25 -dropout_adj2 0.25 -k 15 -lambda_ 100.0 -nr 5 -ratio 20 -model end2end -gen_mode 2 -non_linearity relu -mlp_act tanh -epoch_d 5 -sparse 1
```
### ogbn-arxiv
#### MLP
Run the following command:
```bash
python main.py -dataset ogbn-arxiv -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.001 -w_decay 0.0 -nlayers 2 -nlayers_adj 2 -hidden 256 -hidden_adj 256 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.25 -dropout_adj2 0.5 -k 15 -lambda_ 10.0 -nr 5 -ratio 100 -model end2end -gen_mode 1 -non_linearity relu -mlp_h 128 -mlp_act relu -epoch_d 2001 -sparse 1 -loss mse -noise mask
```
#### MLP-D
Run the following command:
```bash
python main.py -dataset ogbn-arxiv -ntrials 10 -epochs_adj 2000 -lr 0.01 -lr_adj 0.001 -w_decay 0.0 -nlayers 2 -nlayers_adj 2 -hidden 256 -hidden_adj 256 -dropout1 0.5 -dropout2 0.5 -dropout_adj1 0.5 -dropout_adj2 0.25 -k 15 -lambda_ 10.0 -nr 5 -ratio 100 -model end2end -gen_mode 2 -non_linearity relu -mlp_act relu -epoch_d 2001 -sparse 1 -loss mse -noise normal
```
# The MIT License
# Copyright (c) 2016 Thomas Kipf
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
import pickle as pkl
import sys
import warnings
import numpy as np
import scipy.sparse as sp
import torch
warnings.simplefilter("ignore")
def parse_index_file(filename):
    """Parse an index file into a list of ints."""
    index = []
    for line in open(filename):
        index.append(int(line.strip()))
    return index


def sample_mask(idx, l):
    """Create a boolean mask of length l that is True at the given indices."""
    mask = np.zeros(l)
    mask[idx] = 1
    return np.array(mask, dtype=bool)  # builtin bool: np.bool is deprecated in newer numpy
def load_citation_network(dataset_str):
    names = ['x', 'y', 'tx', 'ty', 'allx', 'ally', 'graph']
    objects = []
    for i in range(len(names)):
        with open("data_tf/ind.{}.{}".format(dataset_str, names[i]), 'rb') as f:
            if sys.version_info > (3, 0):
                objects.append(pkl.load(f, encoding='latin1'))
            else:
                objects.append(pkl.load(f))

    x, y, tx, ty, allx, ally, graph = tuple(objects)
    test_idx_reorder = parse_index_file("data_tf/ind.{}.test.index".format(dataset_str))
    test_idx_range = np.sort(test_idx_reorder)

    if dataset_str == 'citeseer':
        # Fix citeseer dataset (there are some isolated nodes in the graph)
        # Find isolated nodes, add them as zero-vecs into the right position
        test_idx_range_full = range(min(test_idx_reorder), max(test_idx_reorder) + 1)
        tx_extended = sp.lil_matrix((len(test_idx_range_full), x.shape[1]))
        tx_extended[test_idx_range - min(test_idx_range), :] = tx
        tx = tx_extended
        ty_extended = np.zeros((len(test_idx_range_full), y.shape[1]))
        ty_extended[test_idx_range - min(test_idx_range), :] = ty
        ty = ty_extended

    features = sp.vstack((allx, tx)).tolil()
    features[test_idx_reorder, :] = features[test_idx_range, :]
    labels = np.vstack((ally, ty))
    labels[test_idx_reorder, :] = labels[test_idx_range, :]

    idx_test = test_idx_range.tolist()
    idx_train = range(len(y))
    idx_val = range(len(y), len(y) + 500)

    train_mask = sample_mask(idx_train, labels.shape[0])
    val_mask = sample_mask(idx_val, labels.shape[0])
    test_mask = sample_mask(idx_test, labels.shape[0])

    features = torch.FloatTensor(features.todense())
    labels = torch.LongTensor(labels)
    train_mask = torch.BoolTensor(train_mask)
    val_mask = torch.BoolTensor(val_mask)
    test_mask = torch.BoolTensor(test_mask)

    nfeats = features.shape[1]
    # Isolated citeseer nodes have all-zero one-hot label rows; assign them to
    # class 0 (citeseer has 6 classes) so the one-hot -> index conversion works.
    for i in range(labels.shape[0]):
        sum_ = torch.sum(labels[i])
        if sum_ != 1:
            labels[i] = torch.tensor([1, 0, 0, 0, 0, 0])
    labels = (labels == 1).nonzero()[:, 1]
    nclasses = torch.max(labels).item() + 1

    return features, nfeats, labels, nclasses, train_mask, val_mask, test_mask
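# Sanity-check sketch (hypothetical; assumes the Cora files are in data_tf/):
#     features, nfeats, labels, nclasses, *_ = load_citation_network("cora")
#     # For Cora: features.shape == (2708, 1433), nclasses == 7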
# Copyright (c) 2020-present, Royal Bank of Canada.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
import warnings
import torch
from citation_networks import load_citation_network, sample_mask
warnings.simplefilter("ignore")
def load_ogb_data(dataset_str):
    from ogb.nodeproppred.dataset_pyg import PygNodePropPredDataset
    dataset = PygNodePropPredDataset(dataset_str)
    data = dataset[0]

    features = data.x
    nfeats = data.num_features
    nclasses = dataset.num_classes
    labels = data.y

    split_idx = dataset.get_idx_split()
    train_mask = sample_mask(split_idx['train'], data.x.shape[0])
    val_mask = sample_mask(split_idx['valid'], data.x.shape[0])
    test_mask = sample_mask(split_idx['test'], data.x.shape[0])

    features = torch.FloatTensor(features)
    labels = torch.LongTensor(labels).view(-1)
    train_mask = torch.BoolTensor(train_mask)
    val_mask = torch.BoolTensor(val_mask)
    test_mask = torch.BoolTensor(test_mask)

    return features, nfeats, labels, nclasses, train_mask, val_mask, test_mask


def load_data(args):
    dataset_str = args.dataset
    if dataset_str.startswith('ogb'):
        return load_ogb_data(dataset_str)
    return load_citation_network(dataset_str)
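# Usage sketch (hypothetical): load_data only reads args.dataset, so any
# argparse-style object works, e.g.
#     from types import SimpleNamespace
#     features, nfeats, labels, nclasses, train_mask, val_mask, test_mask = \
#         load_data(SimpleNamespace(dataset="cora"))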
name: slaps
channels:
- anaconda
- conda-forge
- pytorch
dependencies:
- python=3.7
- pip
- pip:
- --find-links https://download.pytorch.org/whl/torch_stable.html
- --find-links https://pytorch-geometric.com/whl/torch-1.5.0.html
- --requirement requirements.txt
# Copyright (c) 2020-present, Royal Bank of Canada.
# All rights reserved.
#
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
#
import dgl
import torch
import torch.nn as nn
import torch.nn.functional as F

from layers import Diag
from utils import *
class FullParam(nn.Module):
    def __init__(self, features, non_linearity, k, knn_metric, i, sparse):
        super(FullParam, self).__init__()
        self.non_linearity = non_linearity
        self.k = k
        self.knn_metric = knn_metric
        self.i = i
        self.sparse = sparse

        # The adjacency itself is the learnable parameter, initialized from a
        # kNN graph whose values are pre-transformed to match the chosen
        # non-linearity.
        if self.non_linearity == "exp":
            self.Adj = nn.Parameter(
                torch.from_numpy(nearest_neighbors_pre_exp(features, self.k, self.knn_metric, self.i)))
        elif self.non_linearity == "elu":
            self.Adj = nn.Parameter(
                torch.from_numpy(nearest_neighbors_pre_elu(features, self.k, self.knn_metric, self.i)))
        elif self.non_linearity == 'none':
            self.Adj = nn.Parameter(torch.from_numpy(nearest_neighbors(features, self.k, self.knn_metric)))
        else:
            raise NameError('No non-linearity has been specified')

    def forward(self, h):
        if not self.sparse:
            if self.non_linearity == "exp":
                Adj = torch.exp(self.Adj)
            elif self.non_linearity == "elu":
                Adj = F.elu(self.Adj) + 1
            elif self.non_linearity == "none":
                Adj = self.Adj
        else:
            if self.non_linearity == 'exp':
                Adj = self.Adj.coalesce()
                # Rebuild the sparse tensor with the transformed values;
                # assigning to Adj.values would only shadow the method instead
                # of updating the tensor.
                Adj = torch.sparse_coo_tensor(Adj.indices(), torch.exp(Adj.values()), Adj.shape)
            elif self.non_linearity == 'elu':
                Adj = self.Adj.coalesce()
                Adj = torch.sparse_coo_tensor(Adj.indices(), F.elu(Adj.values()) + 1, Adj.shape)
            elif self.non_linearity == "none":
                Adj = self.Adj
            else:
                raise NameError('Non-linearity is not supported in the sparse setup')
        return Adj
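# Note (hypothetical usage sketch): FullParam ignores its forward input `h`;
# calling the module simply returns the (transformed) learnable adjacency, e.g.
#     adj = FullParam(features_np, "elu", 30, "cosine", 6, 0)(None)
# where the metric "cosine" and i=6 are illustrative values only.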
class MLP_Diag(nn.Module):
    def __init__(self, nlayers, isize, k, knn_metric, non_linearity, i, sparse, mlp_act):
        super(MLP_Diag, self).__init__()
        self.i = i
        self.layers = nn.ModuleList()
        for _ in range(nlayers):
            self.layers.append(Diag(isize))

        self.k = k
        self.knn_metric = knn_metric
        self.non_linearity = non_linearity
        self.sparse = sparse
        self.mlp_act = mlp_act

    def internal_forward(self, h):
        for i, layer in enumerate(self.layers):
            h = layer(h)
            if i != (len(self.layers) - 1):
                if self.mlp_act == "relu":
                    h = F.relu(h)
                elif self.mlp_act == "tanh":
                    h = torch.tanh(h)
        return h

    def forward(self, features):
        if self.sparse:
            embeddings = self.internal_forward(features)
            rows, cols, values = knn_fast(embeddings, self.k, 1000)
            rows_ = torch.cat((rows, cols))
            cols_ = torch.cat((cols, rows))
            values_ = torch.cat((values, values))
            values_ = apply_non_linearity(values_, self.non_linearity, self.i)
            adj = dgl.graph((rows_, cols_), num_nodes=features.shape[0], device='cuda')
            adj.edata['w'] = values_
            return adj
        else:
            embeddings = self.internal_forward(features)
            embeddings = F.normalize(embeddings, dim=1, p=2)
            similarities = cal_similarity_graph(embeddings)
            similarities = top_k(similarities, self.k + 1)
            similarities = apply_non_linearity(similarities, self.non_linearity, self.i)
            return similarities
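# Note: in sparse mode, forward returns a dgl graph whose edge weights live in
# adj.edata['w']; the edge list is symmetrized by concatenating (rows, cols)
# with (cols, rows). In dense mode, it returns an (n, n) similarity matrix kept
# to the top-k entries via top_k before the non-linearity is applied.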
class MLP(nn.Module):
    def __init__(self, nlayers, isize, hsize, osize, mlp_epochs, k, knn_metric, non_linearity, i, sparse, mlp_act):
        super(MLP, self).__init__()
        self.layers = nn.ModuleList()
        if nlayers == 1:
            self.layers.append(nn.Linear(isize, hsize))
        else:
            self.layers.append(nn.Linear(isize, hsize))
            for _ in range(nlayers - 2):
                self.layers.append(nn.Linear(hsize, hsize))
            self.layers.append(nn.Linear(hsize, osize))