Simplifying miRNA-disease association prediction

To replicate our results, use the following random seeds: [123, 456, 789, 101, 112]

The code was tested on Python 3.7+. All required packages are listed in requirements.txt.

1. To **generate data for evaluation**, run 

`python genFoldsData.py --data_dir data/XXX --save_dir data/XXX/folds --randseed RRR`

where XXX is the dataset (either hmdd2 or hmdd3) and RRR is the random seed.
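
To generate folds for every seed listed at the top of this README, the command can be looped over. The following is a minimal sketch, assuming the script is invoked with `python` from the repository root. The helper name `build_fold_commands` is hypothetical and not part of the repository, and this README does not say whether runs with different seeds overwrite each other in the same `--save_dir`.

```python
import subprocess

SEEDS = [123, 456, 789, 101, 112]  # seeds listed in this README


def build_fold_commands(dataset, seeds=SEEDS):
    """Return one genFoldsData.py command (as an argument list) per seed.

    Hypothetical helper: only the three flags documented in this README are used.
    """
    data_dir = f"data/{dataset}"
    return [
        ["python", "genFoldsData.py",
         "--data_dir", data_dir,
         "--save_dir", f"{data_dir}/folds",
         "--randseed", str(seed)]
        for seed in seeds
    ]


if __name__ == "__main__":
    for cmd in build_fold_commands("hmdd2"):
        print(" ".join(cmd))
        # Uncomment to actually run the script from the repository root:
        # subprocess.run(cmd, check=True)
```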
The script splits the data into five folds according to the given random seed. The training set is balanced (the negative:positive ratio is 1); you can change this ratio with the `--neg_rate` argument. The test set consists of the known associations in the test split plus all other possible miRNA-disease pairs, excluding the known associations in the training data. For each data split, the program also calculates the miRNA functional, miRNA sequence, and miRNA GIP similarities, as well as the disease semantic and disease GIP similarities, from the training data.
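
The splitting scheme described above can be sketched as follows. This is an illustration only, not the code in `genFoldsData.py`: the function name `make_folds`, the use of Python's `random` module, and the list-of-rows matrix format are all assumptions, so the folds it produces will not match the ones generated by the script.

```python
import random


def make_folds(assoc, n_folds=5, neg_rate=1, seed=123):
    """Yield (train_pos, train_neg, test_pairs, test_labels) for each fold.

    assoc is a binary miRNA x disease matrix given as a list of rows
    (1 = known association, 0 = unknown). Hypothetical sketch.
    """
    rng = random.Random(seed)
    pairs = [(i, j) for i, row in enumerate(assoc) for j in range(len(row))]
    pos = [p for p in pairs if assoc[p[0]][p[1]] == 1]
    neg = [p for p in pairs if assoc[p[0]][p[1]] == 0]
    rng.shuffle(pos)

    for k in range(n_folds):
        # Known associations outside fold k are used for training.
        train_pos = [p for idx, p in enumerate(pos) if idx % n_folds != k]
        # Sample neg_rate unknown pairs per training positive (1 = balanced).
        n_neg = min(len(neg), int(neg_rate * len(train_pos)))
        train_neg = rng.sample(neg, n_neg)
        # Test on every pair except the known associations used for training.
        held_out = set(train_pos)
        test_pairs = [p for p in pairs if p not in held_out]
        test_labels = [assoc[i][j] for i, j in test_pairs]
        yield train_pos, train_neg, test_pairs, test_labels
```

Note that in this sketch the test set is far larger and far more imbalanced than the training set, because it contains every unknown pair.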

We use the code provided by the EPMDA authors for the GIP kernel similarity calculation, and the code provided by the DBMDA authors for the miRNA sequence similarity. The code for the miRNA functional and disease semantic similarity calculations is partly adapted from the code released by the MISIM 2.0 authors.
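
For readers unfamiliar with the GIP (Gaussian interaction profile) kernel, the usual definition is K(i, j) = exp(-gamma * ||p_i - p_j||^2), where p_i is the binary association profile of item i and gamma is 1 divided by the mean squared norm of all profiles. The sketch below implements that textbook definition in plain Python. It is not the EPMDA authors' code, which may differ in details such as the bandwidth, and the function name `gip_kernel` is hypothetical.

```python
import math


def gip_kernel(profiles):
    """Gaussian interaction profile kernel between the rows of a binary matrix.

    K[i][j] = exp(-gamma * ||p_i - p_j||^2), with gamma set to 1 divided by
    the mean squared norm of the profiles (a common bandwidth choice).
    """
    n = len(profiles)
    sq_norms = [sum(v * v for v in row) for row in profiles]
    mean_sq_norm = sum(sq_norms) / n
    if mean_sq_norm == 0:
        raise ValueError("all interaction profiles are empty")
    gamma = 1.0 / mean_sq_norm

    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            kernel[i][j] = kernel[j][i] = math.exp(-gamma * dist)
    return kernel
```

Passing the rows of the training association matrix would give a miRNA GIP similarity, and passing its columns (the transposed matrix) would give a disease GIP similarity.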

2. To evaluate the **nimgcn** model, run:

`python eval_nimgcn.py --data_dir data/XXX --fold_dir data/XXX/folds --save_dir data/XXX/results`

Other configurable parameters include:
- _sim_type_: one of ['functional1', 'functional2', 'gip', 'seq']. The default is 'functional2'
- _faulty_: either True or False. True means the faulty calculated similarities are used
- _save_score_: either True or False, indicating whether to save the predicted scores
- _method_: one of ['nimgcn', 'nimgcn1', 'nimgcn2', 'nimgcn3']
- _randseed_: the random seed used to generate the train/test split

2. To evaluate the **dbmda** model, run:

`python eval_dbmda.py --data_dir data/XXX --fold_dir data/XXX/folds --save_dir data/XXX/results`

Other configurable parameters include:
- _sim_type_: one of ['functional1', 'functional2', 'gip', 'seq']. The default is 'functional2'
- _faulty_: either True or False. True means the faulty calculated similarities are used
- _save_score_: either True or False, indicating whether to save the predicted scores
- _use_autoencoder_: whether to use the autoencoder or not
- _use_seq_sim_: whether to use the sequence similarity or not
- _imbalanced_: whether to use imbalanced training data or not
- _randseed_: the random seed used to generate the train/test split

3. For **EPMDA**, since the features take a long time to compute, we provide all calculated features in the epmda/data folder.
Please run `eval_epmda.py` with the corresponding arguments to evaluate EPMDA with the balanced/imbalanced setup.

4. To get the top-K scores for selected diseases, run `getTopRes.py`. The disease indexes given in the script are those used in our evaluation.
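
As an illustration of what a top-K selection involves, the sketch below ranks miRNAs for one disease by predicted score. It is not taken from `getTopRes.py`: the function name `top_k_mirnas` is hypothetical, and it assumes the saved scores can be loaded as a miRNA x disease matrix (the same orientation as m-d.csv).

```python
def top_k_mirnas(scores, disease_idx, k=10):
    """Return the k (miRNA index, score) pairs with the highest score for one disease.

    scores is a miRNA x disease matrix of predicted scores (list of rows).
    Hypothetical helper; the layout of the saved scores is an assumption.
    """
    column = [(mirna_idx, row[disease_idx]) for mirna_idx, row in enumerate(scores)]
    # Sort by score, highest first; ties keep their original miRNA order.
    column.sort(key=lambda item: item[1], reverse=True)
    return column[:k]
```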

For feature calculation, please refer to the `*.py` files in the epmda folder.

4. To run all the models on a **new** dataset, the following files are needed:
- m-d.csv: stores the association matrix, where rows are miRNAs and columns are diseases
- miRNA-disease.txt: stores the miRNA-disease association list
- miRNA_seq.csv: the miRNA sequence similarity matrix
- disease_sim.csv: the disease semantic similarity
- disease_sim2.csv: the disease semantic + phenotype similarity
- disease_not_found_list.txt: list of IDs of diseases that are not found in MeSH
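
Before running the models on a new dataset, it can help to confirm that all of the files listed above are present. The following is a small sketch, assuming the files sit together in one dataset directory (for example the one passed as `--data_dir`). This README does not state the exact location, and the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# File names taken from the list in this README.
REQUIRED_FILES = [
    "m-d.csv",
    "miRNA-disease.txt",
    "miRNA_seq.csv",
    "disease_sim.csv",
    "disease_sim2.csv",
    "disease_not_found_list.txt",
]


def missing_files(data_dir):
    """Return the required file names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

For example, `missing_files("data/my_dataset")` returns an empty list when the directory is complete.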

To calculate disease semantic similarity from the MeSH ontology, disease GIP similarity, or miRNA sequence/functional/GIP kernel similarity, please use the code provided in the data/preparation folder.



If you use the code in your work, please cite the following paper:
Dong, Thi Ngan, and Megha Khosla. "A consistent evaluation of miRNA-disease association prediction models." bioRxiv (2020).