Simplifying miRNA-disease association prediction

To reproduce our results, use the following random seeds: [123, 456, 789, 101, 112]

The code was tested on Python 3.7+. All required packages are listed in requirements.txt.

1. To **generate data for evaluation**, run 

`python genFoldsData.py --data_dir data/XXX --save_dir data/XXX/folds --randseed RRR`

where XXX is the dataset (either hmdd2 or hmdd3) and RRR is the random seed.
The script splits the data into five folds according to the given random seed. The training set is balanced (the negative:positive ratio is set to 1); you can change this ratio with the `--neg_rate` argument. The test set consists of the known associations in the test split plus all other possible miRNA-disease pairs, excluding the known associations in the training data. For each data split, the script also calculates the miRNA functional, miRNA sequence, and miRNA GIP similarities, as well as the disease semantic and disease GIP similarities, from the training data.
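
The split described above can be sketched as follows. This is a minimal illustration using only the standard library, not the code in `genFoldsData.py`: the function name `make_folds`, its arguments, and the index-pair representation are hypothetical, and the similarity calculations are omitted.

```python
import random


def make_folds(positives, n_mirna, n_disease, n_folds=5, neg_rate=1, seed=123):
    """Split known (miRNA, disease) index pairs into folds (illustrative only)."""
    rng = random.Random(seed)
    positives = list(positives)
    rng.shuffle(positives)

    all_pairs = {(m, d) for m in range(n_mirna) for d in range(n_disease)}
    unknown = sorted(all_pairs - set(positives))  # pairs with no known association

    folds = []
    for k in range(n_folds):
        test_pos = positives[k::n_folds]  # every n_folds-th known pair
        held_out = set(test_pos)
        train_pos = [p for p in positives if p not in held_out]
        # Training negatives: neg_rate unknown pairs per known training pair.
        n_neg = min(len(unknown), neg_rate * len(train_pos))
        train_neg = rng.sample(unknown, n_neg)
        # Test set: every pair except the known associations used for training.
        test_pairs = sorted(all_pairs - set(train_pos))
        folds.append({
            "train_pos": train_pos,
            "train_neg": train_neg,
            "test_pos": test_pos,
            "test_pairs": test_pairs,
        })
    return folds
```

Calling `make_folds(pairs, n_mirna, n_disease, seed=123)` twice with the same seed returns the same folds, which is why fixing the seed matters for reproducing a result.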

We use the code given by the EPMDA authors for the GIP kernel similarity calculation. The code for miRNA sequence similarity is the one given by the DBMDA authors. For miRNA functional and disease semantic similarity calculation, the code is partly adapted from the one released by the MISIM 2.0 authors.
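
For orientation, the sketch below shows the commonly used formulation of the Gaussian interaction profile (GIP) kernel, in which the bandwidth is normalized by the mean squared norm of the interaction profiles. It is not the EPMDA code, which may differ in details; `gip_similarity` and `gamma_prime` are hypothetical names.

```python
import math


def gip_similarity(profiles, gamma_prime=1.0):
    """GIP kernel between the rows of a 0/1 association matrix (illustrative only).

    K(i, j) = exp(-gamma * ||p_i - p_j||^2), where
    gamma = gamma_prime / mean_i(||p_i||^2).
    """
    n = len(profiles)
    sq_norms = [sum(v * v for v in row) for row in profiles]
    mean_sq_norm = sum(sq_norms) / n
    if mean_sq_norm == 0:
        raise ValueError("the matrix contains no known associations")
    gamma = gamma_prime / mean_sq_norm

    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(profiles[i], profiles[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim
```

With rows as miRNAs and columns as diseases, `gip_similarity(train_matrix)` gives the miRNA similarity and `gip_similarity(list(zip(*train_matrix)))` gives the disease similarity. Here `train_matrix` stands for the association matrix with the test-split associations set to 0, so that the similarity is derived from the training data only.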

2. To evaluate the **nimgcn** model, run:

`python eval_nimgcn.py --data_dir data/XXX --fold_dir data/XXX/folds --save_dir data/XXX/results`

Other configurable parameters include:
- _sim_type_: should be one of ['functional1', 'functional2', 'gip', 'seq']. The default is 'functional2'
- _faulty_: should be either True or False. True means the faulty (erroneously calculated) similarities are used
- _save_score_: should be either True or False, indicating whether to save the predicted scores
- _method_: should be one of ['nimgcn', 'nimgcn1', 'nimgcn2', 'nimgcn3']
- _randseed_: the random seed used to generate the train/test split
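
To cover every seed and model variant, the command above can be generated programmatically. The sketch below only builds the argument lists and does not run anything. It assumes that the listed parameters are passed as `--name value` flags: this README shows that form for `--data_dir`, `--fold_dir`, and `--save_dir`, but the spelling of `--method` and `--randseed` for this script is an assumption. `build_nimgcn_commands` is a hypothetical helper.

```python
import itertools
import sys

SEEDS = [123, 456, 789, 101, 112]  # seeds listed at the top of this README
METHODS = ["nimgcn", "nimgcn1", "nimgcn2", "nimgcn3"]


def build_nimgcn_commands(dataset, seeds=SEEDS, methods=METHODS):
    """Return one argument list per (method, seed) combination."""
    data_dir = f"data/{dataset}"
    commands = []
    for method, seed in itertools.product(methods, seeds):
        commands.append([
            sys.executable, "eval_nimgcn.py",
            "--data_dir", data_dir,
            "--fold_dir", f"{data_dir}/folds",
            "--save_dir", f"{data_dir}/results",
            "--method", method,        # assumed flag spelling
            "--randseed", str(seed),   # assumed flag spelling
        ])
    return commands
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the folds for the corresponding seed have been generated.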

2. To evaluate the **dbmda** model, run:

`python eval_dbmda.py --data_dir data/XXX --fold_dir data/XXX/folds --save_dir data/XXX/results`

Other configurable parameters include:
- _sim_type_: should be one of ['functional1', 'functional2', 'gip', 'seq']. The default is 'functional2'
- _faulty_: should be either True or False. True means the faulty (erroneously calculated) similarities are used
- _save_score_: should be either True or False, indicating whether to save the predicted scores
- _use_autoencoder_: whether to use the autoencoder or not
- _use_seq_sim_: whether to use the sequence similarity or not
- _imbalanced_: whether to use imbalanced training data or not
- _randseed_: the random seed used to generate the train/test split

3. For **EPMDA**, the features take a long time to compute, so we provide all calculated features in the epmda/data folder.
Please run `eval_epmda.py` with the corresponding arguments to evaluate EPMDA in the balanced/imbalanced setup.
For feature calculation, please refer to the `*.py` files in the epmda folder.

4. To get the topK scores for selected diseases, run `getTopRes.py`. The disease indexes given in the scripts are those used in our paper.
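
As an illustration of what a top-K selection for one disease involves, the sketch below ranks the miRNAs by predicted score. It is not the logic of `getTopRes.py`: the function `top_k_mirnas` is hypothetical, and it assumes the scores are available as a matrix with miRNAs as rows and diseases as columns.

```python
def top_k_mirnas(scores, disease_idx, k=10):
    """Return (miRNA index, score) pairs for the k best-scoring miRNAs of a disease."""
    column = [(row[disease_idx], mirna_idx) for mirna_idx, row in enumerate(scores)]
    # Sort by descending score; ties are broken by the smaller miRNA index.
    ranked = sorted(column, key=lambda item: (-item[0], item[1]))
    return [(mirna_idx, score) for score, mirna_idx in ranked[:k]]
```

If `k` exceeds the number of miRNAs, all miRNAs are returned in ranked order.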

5. To run all the models on a **NEW** dataset, the following files are needed:
- m-d.csv: stores the association matrix, where rows are miRNAs and columns are diseases
- miRNA-disease.txt: stores the miRNA-disease association list
- miRNA_seq.csv: the miRNA sequence similarity matrix
- disease_sim.csv: the disease semantic similarity
- disease_sim2.csv: the disease semantic + phenotype similarity
- disease_not_found_list.txt: list of IDs of diseases that are not found in MESH
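
The association matrix and the association list hold the same information in two layouts. The sketch below shows one way to derive the matrix from the list. It works on in-memory lists because the exact file formats (delimiters, headers, names versus IDs) are not specified here; `associations_to_matrix` and the placeholder names in the usage note are hypothetical.

```python
def associations_to_matrix(pairs, mirnas, diseases):
    """Build a 0/1 matrix (rows: miRNAs, columns: diseases) from association pairs."""
    mirna_idx = {name: i for i, name in enumerate(mirnas)}
    disease_idx = {name: j for j, name in enumerate(diseases)}
    matrix = [[0] * len(diseases) for _ in mirnas]
    for mirna, disease in pairs:
        matrix[mirna_idx[mirna]][disease_idx[disease]] = 1
    return matrix
```

For example, `associations_to_matrix([("mirna_a", "disease_y")], ["mirna_a", "mirna_b"], ["disease_x", "disease_y"])` yields `[[0, 1], [0, 0]]`.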

To calculate the disease semantic similarity from the MESH ontology, the disease GIP similarity, or the miRNA sequence/functional/GIP kernel similarity, please use the code provided in the data/preparation folder.

### Results
1. Effect of the data leakage problem on different models on the HMDD2 dataset.

![erroneous_vs_correct.png](images/erroneous_vs_correct.png)



2. Average AP scores for our studied models on the HMDD2 and HMDD3 datasets. Simpler variants, in many cases, outperform the originally proposed models.

![hmdd2_hmdd3_ap_scores.png](images/hmdd2_hmdd3.png)



3. Effect of balanced versus imbalanced training data on different models on the HMDD2 and HMDD3 datasets.

![balance_vs_imbalance.png](images/balance_vs_imbalance.png)
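
For readers unfamiliar with the AP (average precision) metric reported above, the sketch below computes it for a single ranked list as the mean of the precision values at the rank of each positive. It is an illustration only; the evaluation scripts in this repository may rely on a library implementation that treats tied scores differently. `average_precision` is a hypothetical name.

```python
def average_precision(labels, scores):
    """Mean of precision@rank over the ranks at which positives occur."""
    # Rank by descending score; tied scores keep their input order.
    ranked = sorted(zip(scores, labels), key=lambda item: -item[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("average precision is undefined without positives")
    return precision_sum / hits
```

With labels `[1, 0, 1, 0]` and scores `[0.9, 0.8, 0.7, 0.1]`, the positives sit at ranks 1 and 3, so the AP is (1/1 + 2/3) / 2 = 5/6.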



If you use the code in your work, please cite the following paper:
_Dong, Thi Ngan, and Megha Khosla. "A consistent evaluation of miRNA-disease association prediction models." bioRxiv (2020)._