Commit 7571aa56 authored by Megha Khosla

removed redundant files

parent fc99d9c0
Cora citation network, part of the Koblenz Network Collection
===========================================================================

This directory contains the TSV and related files of the subelj_cora network:

This is the Cora citation network. The network is directed. Nodes represent scientific papers. An edge between two nodes indicates that the left node cites the right node.

More information about the network is provided here:
http://konect.uni-koblenz.de/networks/subelj_cora

Files:
    meta.subelj_cora -- Metadata about the network
    out.subelj_cora -- The adjacency matrix of the network in space-separated values format, with one edge per line
        The meaning of the columns in out.subelj_cora is:
            First column: ID of the source ("from") node
            Second column: ID of the target ("to") node
    ent.subelj_cora.class.name -- Contains the attribute `name` of entity `class` of the network
    ent.subelj_cora.id.name -- Contains the attribute `name` of entity `id` of the network
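The two-column edge list described above can be loaded directly with networkx. A minimal sketch on made-up data (the real file would instead be read from disk with networkx.read_edgelist; the sample edges below are illustrative, not taken from the dataset):

```python
import networkx as nx

# Illustrative data in the out.subelj_cora format: one "FROM TO" pair per
# line; "%"-prefixed lines are KONECT metadata and are skipped as comments.
sample = """% asym positive
1 2
1 3
2 3
"""

# directed graph: the left node cites the right one
graph = nx.parse_edgelist(sample.splitlines(), comments='%',
                          create_using=nx.DiGraph(), nodetype=str)
print(graph.number_of_nodes(), graph.number_of_edges())  # 3 3
```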
Complete documentation about the file format can be found in the KONECT handbook, in the section File Formats, available at:
http://konect.uni-koblenz.de/publications

All files are licensed under a Creative Commons Attribution-ShareAlike 2.0 Germany License. For more information concerning the license, visit http://konect.uni-koblenz.de/license.

Use the following references for citation:

@MISC{konect:2016:subelj_cora,
    title = {Cora citation network dataset -- {KONECT}},
    month = sep,
    year = {2016},
    url = {http://konect.uni-koblenz.de/networks/subelj_cora},
}

@inproceedings{konect:dependency4,
    title = {Model of Complex Networks based on Citation Dynamics},
    author = {{\v S}ubelj, Lovro and Bajec, Marko},
    booktitle = {Proceedings of the {WWW} Workshop on Large Scale Network Analysis},
    year = {2013},
    pages = {527--530},
}

@inproceedings{konect,
    title = {{KONECT} -- {The} {Koblenz} {Network} {Collection}},
    author = {Jérôme Kunegis},
    booktitle = {Proc. Int. Conf. on World Wide Web Companion},
    year = {2013},
    pages = {1343--1350},
    url = {http://userpages.uni-koblenz.de/~kunegis/paper/kunegis-koblenz-network-collection.pdf},
    url_presentation = {http://userpages.uni-koblenz.de/~kunegis/paper/kunegis-koblenz-network-collection.presentation.pdf},
}
category: Citation
code: CC
name: Cora citation
description: Paper–paper citations
entity-names: paper
extr: subelj
url: http://lovro.lpt.fri.uni-lj.si/support.jsp
long-description: This is the cora citation network. The network is directed. Nodes represent scientific papers. An edge between two nodes indicates that the left node cites the right node.
relationship-names: citation
cite: konect:dependency4
DBLP network, part of the Koblenz Network Collection
===========================================================================

This directory contains the TSV and related files of the dblp-cite network:

This is the citation network of DBLP, a database of scientific publications such as papers and books. Each node in the network is a publication, and each edge represents a citation of one publication by another. In other words, the directed edge (A → B) denotes that publication A cites publication B. Publications are allowed to cite themselves, so the network contains loops.

More information about the network is provided here:
http://konect.uni-koblenz.de/networks/dblp-cite

Files:
    meta.dblp-cite -- Metadata about the network
    out.dblp-cite -- The adjacency matrix of the network in space-separated values format, with one edge per line
        The meaning of the columns in out.dblp-cite is:
            First column: ID of the source ("from") node
            Second column: ID of the target ("to") node
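Because the network contains loops (self-citations), code consuming this edge list should not assume a simple digraph. A small sketch of detecting loops with networkx, on made-up data in the same two-column format:

```python
import networkx as nx

# Illustrative data in the out.dblp-cite format; publication 3 cites
# itself, producing a self-loop like the ones described above.
sample = """% asym
1 2
2 3
3 3
"""

graph = nx.parse_edgelist(sample.splitlines(), comments='%',
                          create_using=nx.DiGraph(), nodetype=str)
loops = list(nx.selfloop_edges(graph))
print(loops)  # [('3', '3')]
```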
Complete documentation about the file format can be found in the KONECT handbook, in the section File Formats, available at:
http://konect.uni-koblenz.de/publications

All files are licensed under a Creative Commons Attribution-ShareAlike 2.0 Germany License. For more information concerning the license, visit http://konect.uni-koblenz.de/license.

Use the following references for citation:

@MISC{konect:2016:dblp-cite,
    title = {DBLP network dataset -- {KONECT}},
    month = sep,
    year = {2016},
    url = {http://konect.uni-koblenz.de/networks/dblp-cite},
}

@inproceedings{konect:DBLP,
    title = {The {DBLP} Computer Science Bibliography: Evolution, Research Issues, Perspectives},
    author = {Michael Ley},
    booktitle = {Proc. Int. Symposium on String Processing and Information Retrieval},
    year = {2002},
    pages = {1--10},
}

@inproceedings{konect,
    title = {{KONECT} -- {The} {Koblenz} {Network} {Collection}},
    author = {Jérôme Kunegis},
    booktitle = {Proc. Int. Conf. on World Wide Web Companion},
    year = {2013},
    pages = {1343--1350},
    url = {http://userpages.uni-koblenz.de/~kunegis/paper/kunegis-koblenz-network-collection.pdf},
    url_presentation = {http://userpages.uni-koblenz.de/~kunegis/paper/kunegis-koblenz-network-collection.presentation.pdf},
}
category: Citation
code: Pi
name: DBLP
description: Publication–publication citations
url: http://dblp.uni-trier.de/xml/
cite: konect:DBLP
extr: dblp
long-description: This is the citation network of DBLP, a database of scientific publications such as papers and books. Each node in the network is a publication, and each edge represents a citation of a publication by another publication. In other words, the directed edge (A → B) denotes that publication A cites publication B. Publications are allowed to cite themselves, and therefore the network contains loops.
tags: #loop #regenerate
entity-names: publication
relationship-names: citation
#! /usr/bin/python3
import argparse
import itertools
import random

import gensim
import networkx
import numpy
from matplotlib import pyplot

from functions import Similarities


def random_sample_edges(node_list, size):
    # Sample `size` distinct node pairs uniformly at random.
    result = set()
    current = 0
    while current < size:
        current = len(result)
        progress = int(100 * current / size)
        print('{}% [{}/{}]'.format(progress, current, size), end='\r')
        node1 = random.choice(node_list)
        node2 = random.choice(node_list)
        result.add((node1, node2))
    print()
    return result


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('GRAPH', help='A file containing the graph')
    ap.add_argument('AUTH', help='A file containing the auth. embeddings')
    ap.add_argument('HUB', help='A file containing the hub embeddings')
    ap.add_argument('-k', type=int, nargs='+', default=[100], help='k in Precision@k (multiple values possible)')
    ap.add_argument('-sn', '--sample_nodes', type=float, help='Sample a fraction of all nodes')
    ap.add_argument('-se', '--sample_edges', type=float, help='Sample a fraction of all possible edges')
    ap.add_argument('--plot', action='store_true', help='Show a plot with the results')
    args = ap.parse_args()

    if args.sample_nodes and args.sample_edges:
        exit('error: --sample_nodes and --sample_edges are mutually exclusive')

    # read a weighted directed graph from the source file
    print('reading {}...'.format(args.GRAPH))
    orig_graph = networkx.read_edgelist(args.GRAPH, nodetype=str, data=(('weight', int),), create_using=networkx.DiGraph())
    print('reading {}...'.format(args.AUTH))
    auth = gensim.models.KeyedVectors.load_word2vec_format(args.AUTH, binary=False)
    print('reading {}...'.format(args.HUB))
    hub = gensim.models.KeyedVectors.load_word2vec_format(args.HUB, binary=False)

    sims = Similarities(max(args.k))

    if args.sample_nodes:
        sample_size = int(args.sample_nodes * orig_graph.number_of_nodes())
        print('sampling {} out of {} nodes...'.format(sample_size, orig_graph.number_of_nodes()))
        # random.sample needs a sequence, not a NodeView
        sample = random.sample(list(orig_graph.nodes()), sample_size)
        edges = itertools.product(sample, sample)
        total = len(sample) ** 2
    elif args.sample_edges:
        # every ordered node pair is a candidate edge
        num_edges = orig_graph.number_of_nodes() ** 2
        sample_size = int(args.sample_edges * num_edges)
        print('sampling {} out of {} edges...'.format(sample_size, num_edges))
        edges = random_sample_edges(list(orig_graph.nodes()), sample_size)
        total = len(edges)
    else:
        edges = itertools.product(orig_graph.nodes(), orig_graph.nodes())
        total = orig_graph.number_of_nodes() ** 2

    current = 1
    for node1, node2 in edges:
        progress = int(100 * current / total)
        print('{}% [{}/{}]'.format(progress, current, total), end='\r')
        current += 1
        # KeyedVectors support direct indexing; the `.wv` alias is deprecated
        emb1_auth = auth[node1]
        emb1_hub = hub[node1]
        emb2_auth = auth[node2]
        emb2_hub = hub[node2]
        # score the edge u -> v as dot(hub(u), auth(v))
        sims.add(node1, node2, numpy.dot(emb1_hub, emb2_auth))
        sims.add(node2, node1, numpy.dot(emb2_hub, emb1_auth))
    print('done')

    precs = []
    weighted_precs = []
    for k in args.k:
        prec = sims.precision(k, orig_graph)
        weighted_prec = sims.weighted_precision(k, orig_graph)
        print('P@{} = {}'.format(k, prec))
        print('P_w@{} = {}'.format(k, weighted_prec))
        precs.append(prec)
        weighted_precs.append(weighted_prec)

    if args.plot:
        pyplot.xscale('log')
        pyplot.yscale('log')
        pyplot.plot(args.k, precs, label='P@k')
        pyplot.plot(args.k, weighted_precs, label='P_w@k')
        pyplot.legend(loc='best')
        pyplot.show()


if __name__ == '__main__':
    main()
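The scoring rule in the script above rates a candidate edge u → v by the dot product of u's hub embedding and v's authority embedding. A toy numpy check of that rule, with made-up vectors:

```python
import numpy as np

# Hypothetical embeddings for two nodes u and v; the values are arbitrary.
hub_u = np.array([1.0, 0.0, 2.0])   # hub embedding of the citing node u
auth_v = np.array([0.5, 1.0, 1.0])  # authority embedding of the cited node v

# score for the directed edge u -> v
score_uv = float(np.dot(hub_u, auth_v))
print(score_uv)  # 2.5
```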
#! /usr/bin/python3
import argparse
import random

import gensim
import networkx
import numpy
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import cross_val_predict

import functions


def sample_neg_edges(graph, fraction):
    # Sample node pairs that are NOT edges in the graph (negative examples).
    sample_size = int(graph.number_of_edges() * fraction)
    result = set()
    current = 0
    node_list = list(graph.nodes())
    while current < sample_size:
        progress = int(100 * current / sample_size)
        print('{}% [{}/{}]'.format(progress, current, sample_size), end='\r')
        node1 = random.choice(node_list)
        node2 = random.choice(node_list)
        if not graph.has_edge(node1, node2):
            result.add((node1, node2))
            current = len(result)
    return result


def sample_and_remove_pos_edges(graph, fraction):
    # Sample existing edges (positive examples) and remove them from the graph.
    sample_size = int(graph.number_of_edges() * fraction)
    sample = random.sample(list(graph.edges()), sample_size)
    for e in sample:
        graph.remove_edge(*e)
    return sample


def apply_op(op, edge, hub_embeddings, auth_embeddings):
    hub, auth = edge
    return op(hub_embeddings[hub], auth_embeddings[auth])


def shuffle(list1, list2):
    # Shuffle two lists in unison.
    list1_shuf = []
    list2_shuf = []
    index_shuf = list(range(len(list1)))
    random.shuffle(index_shuf)
    for i in index_shuf:
        list1_shuf.append(list1[i])
        list2_shuf.append(list2[i])
    return list1_shuf, list2_shuf


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('GRAPH', help='A file containing the graph')
    ap.add_argument('AUTH', help='A file containing the auth. embeddings')
    ap.add_argument('HUB', help='A file containing the hub embeddings')
    ap.add_argument('-f', '--fraction', type=float, default=0.5, help='Sample a fraction of the edges (positive and negative)')
    ap.add_argument('-op', '--binary_operator', default='avg', help='What binary operator to use (avg, had, wl1, wl2)')
    ap.add_argument('-cv', '--k_fold_cv', type=int, default=5, help='Do k-fold cross-validation')
    ap.add_argument('--comments', default='#', help='A string that indicates the start of a line comment in the graph file')
    args = ap.parse_args()

    # read a weighted directed graph from the source file
    print('reading {}...'.format(args.GRAPH))
    orig_graph = networkx.read_edgelist(args.GRAPH, nodetype=str, data=(('weight', int),), create_using=networkx.DiGraph(), comments=args.comments)
    print('reading {}...'.format(args.AUTH))
    auth = gensim.models.KeyedVectors.load_word2vec_format(args.AUTH, binary=False)
    print('reading {}...'.format(args.HUB))
    hub = gensim.models.KeyedVectors.load_word2vec_format(args.HUB, binary=False)

    # make sure we have all embeddings (`vectors` replaces the deprecated `syn0`)
    assert len(auth.vectors) == len(hub.vectors) == orig_graph.number_of_nodes()

    print('sampling negative edges...')
    orig_number_of_edges = orig_graph.number_of_edges()
    neg_edges = sample_neg_edges(orig_graph, args.fraction)
    pos_edges = sample_and_remove_pos_edges(orig_graph, args.fraction)
    # make sure we have the correct number of edges
    assert len(pos_edges) == len(neg_edges)
    # after sampling positive edges they should be removed from the graph
    assert orig_graph.number_of_edges() == orig_number_of_edges - len(pos_edges)
    print('sampled {} pos. and neg. edges'.format(len(pos_edges)))

    print('applying binary operator...')
    # the binary operator function to use
    bin_op = {'avg': functions.average, 'had': functions.hadamard,
              'wl1': functions.weighted_l1, 'wl2': functions.weighted_l2}[args.binary_operator]
    __map_func = lambda e: apply_op(bin_op, e, hub, auth)
    clf_x = list(map(__map_func, neg_edges)) + list(map(__map_func, pos_edges))
    clf_y = [0] * len(neg_edges) + [1] * len(pos_edges)

    print('shuffling the classifier inputs...')
    clf_x_shuf, clf_y_shuf = shuffle(clf_x, clf_y)

    print('cross-validating...')
    clf = svm.SVC()
    pred = cross_val_predict(clf, clf_x_shuf, clf_y_shuf, cv=args.k_fold_cv)
    print(metrics.classification_report(clf_y_shuf, pred))
    print(metrics.confusion_matrix(clf_y_shuf, pred))


if __name__ == '__main__':
    main()
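The script above selects its binary operator from a `functions` module that is not included in this commit. The definitions below are plausible reconstructions of the four edge-feature operators (avg, had, wl1, wl2), following the usual node2vec-style conventions; the input vectors are made up for illustration:

```python
import numpy as np

# Plausible definitions of the four edge-feature operators selected with
# -op; these are assumptions, not the actual `functions` module.
def average(a, b):
    return (a + b) / 2.0          # element-wise mean

def hadamard(a, b):
    return a * b                  # element-wise product

def weighted_l1(a, b):
    return np.abs(a - b)          # element-wise absolute difference

def weighted_l2(a, b):
    return (a - b) ** 2           # element-wise squared difference

a = np.array([1.0, 2.0])
b = np.array([3.0, 6.0])
print(average(a, b))      # [2. 4.]
print(hadamard(a, b))     # [ 3. 12.]
print(weighted_l1(a, b))  # [2. 4.]
print(weighted_l2(a, b))  # [ 4. 16.]
```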
#! /usr/bin/python3
import argparse
import random

import gensim
import networkx
import numpy

import functions


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument('GRAPH', help='A file containing the graph')
    ap.add_argument('-u', '--unweighted', action='store_true', help='Read an unweighted graph')
    ap.add_argument('-d', '--delimiter', default=' ', help='The delimiter in the edge list (default: space)')
    ap.add_argument('LABELS', help='A file containing the labels of each node ID')
    ap.add_argument('-l', '--label_format', default='cora', help='Specify the label format (cora, blogcat)')
    ap.add_argument('TRAIN_LABELS', help='A file containing the training labels (created using create_ml_class_data.py)')
    ap.add_argument('TEST_LABELS', help='A file containing the testing labels (created using create_ml_class_data.py)')
    ap.add_argument('AUTH', help='A file containing the auth. (or regular) embeddings')
    ap.add_argument('--hub', help='A file containing the hub embeddings (optional)')
    ap.add_argument('-b', '--binary', action='store_true', help='Read the embedding files as binary')
    ap.add_argument('--hope', action='store_true', help='Use the HOPE embedding format (csv)')
    ap.add_argument('--verse', nargs=2, type=int, help='Use the VERSE embedding format (numpy binary). Expects 2 parameters: number of nodes and embedding size (dimensions)')
    ap.add_argument('--comments', default='#', help='A string that indicates the start of a line comment in the graph file')
    args = ap.parse_args()

    if args.hope and args.verse:
        exit('error: --hope and --verse are mutually exclusive')
    if args.label_format not in ('cora', 'blogcat'):
        exit('error: only label formats "cora" and "blogcat" are supported')

    print('reading {}...'.format(args.GRAPH))
    if args.unweighted:
        orig_graph = networkx.read_edgelist(args.GRAPH, nodetype=str, create_using=networkx.DiGraph(), comments=args.comments, delimiter=args.delimiter)
    else:
        orig_graph = networkx.read_edgelist(args.GRAPH, nodetype=str, data=(('weight', int),), create_using=networkx.DiGraph(), comments=args.comments, delimiter=args.delimiter)

    if args.hope:
        print('reading {}...'.format(args.AUTH))
        auth = functions.read_hope_emb(args.AUTH)
        if args.hub:
            print('reading {}...'.format(args.hub))
            hub = functions.read_hope_emb(args.hub)
            # make sure we have all embeddings
            assert len(auth) == len(hub) == orig_graph.number_of_nodes()
            # check if the indices match
            assert auth.keys() == hub.keys()
        else:
            hub = None
    elif args.verse:
        num_nodes, embedding_dim = args.verse
        print('reading {}...'.format(args.AUTH))
        auth = functions.read_verse_emb(args.AUTH, num_nodes, embedding_dim)
        if args.hub:
            print('reading {}...'.format(args.hub))
            hub = functions.read_verse_emb(args.hub, num_nodes, embedding_dim)
            # make sure we have all embeddings
            assert len(auth) == len(hub) == orig_graph.number_of_nodes()
            # check if the indices match
            assert auth.keys() == hub.keys()
        else:
            hub = None
    else:
        print('reading {}...'.format(args.AUTH))
        auth = functions.read_w2v_emb(args.AUTH, args.binary)
        if args.hub:
            print('reading {}...'.format(args.hub))
            hub = functions.read_w2v_emb(args.hub, args.binary)
            # make sure we have all embeddings
            assert len(auth.vocab) == len(hub.vocab) == orig_graph.number_of_nodes()
            # check if the indices match
            assert auth.index2word == hub.index2word
        else:
            hub = None

    print('reading {}...'.format(args.LABELS))
    if args.label_format == 'cora':
        labels, label_list = functions.read_cora_labels(args.LABELS)
    else:
        labels, label_list = functions.read_blogcat_labels(args.LABELS)
    assert len(labels) == orig_graph.number_of_nodes()

    print('reading {}...'.format(args.TRAIN_LABELS))
    train_labels = functions.read_ml_class_labels(args.TRAIN_LABELS)
    print('reading {}...'.format(args.TEST_LABELS))
    test_labels = functions.read_ml_class_labels(args.TEST_LABELS)

    f1_micro, f1_macro = functions.get_f1(train_labels, test_labels, label_list, auth, hub, True)
    print('F1 (micro) = {}'.format(f1_micro))
    print('F1 (macro) = {}'.format(f1_macro))


if __name__ == '__main__':
    main()
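The `functions.get_f1` helper used above is likewise not part of this commit. As a sketch under that assumption, this is how micro- and macro-averaged F1 scores like the ones printed by the script are computed with scikit-learn once predictions exist; the labels below are purely illustrative:

```python
from sklearn import metrics

# Illustrative true and predicted class labels for six nodes.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

# micro: global over all predictions; macro: unweighted mean over classes
f1_micro = metrics.f1_score(y_true, y_pred, average='micro')
f1_macro = metrics.f1_score(y_true, y_pred, average='macro')
print('F1 (micro) = {}'.format(f1_micro))
print('F1 (macro) = {}'.format(f1_macro))
```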