Evaluate the inferred tree

ScisTree2 provides several metrics to evaluate the results. These include:

  • Genotype Accuracy:
    scistree2.metric.genotype_accuracy(true_genotype, genotype)

  • Tree Accuracy (defined as 1 minus the normalized Robinson-Foulds distance):
    scistree2.metric.tree_accuracy(true_tree, tree)

  • Ancestor-Descendant Error:
    scistree2.metric.ancestor_descendant_error(true_mutation, mutation)

  • Different Lineage Error:
    scistree2.metric.different_lineage_error(true_mutation, mutation)

Usage examples are shown below:

Load the prepared dataset and run inference using SPR, NNI, and NJ respectively.

import scistree2 as s2
import numpy as np 
import pandas as pd


gp = s2.probability.from_csv('./data/toy_raw_reads.csv', source='read')

# SPR local search
caller_spr = s2.ScisTree2(threads=8)
tree_spr, imputed_genotype_spr, likelihood_spr = caller_spr.infer(gp)
# NNI local search
caller_nni = s2.ScisTree2(nni=True, threads=8)
tree_nni, imputed_genotype_nni, likelihood_nni = caller_nni.infer(gp)
# NJ
caller_nj = s2.ScisTree2(nj=True)
tree_nj, imputed_genotype_nj, likelihood_nj = caller_nj.infer(gp)

Load the ground truth, if it is available.

# get groundtruth
true_genotype = np.loadtxt('data/true_genotype.txt') # load true genotype provided by CellCoal
with open('data/true_tree.nwk', 'r') as f: 
    true_tree_nwk = f.readline().strip() # load true tree provided by CellCoal
true_tree = s2.util.from_newick(true_tree_nwk)
print('Newick of true tree', true_tree)
print('True genotype', true_genotype.shape)
Newick of true tree (((((((((cell14,cell26),cell27),cell11),((cell16,cell47),cell2)),cell17),cell30),((((cell36,cell6),cell7),cell48),cell1)),(((((cell10,cell18),(cell12,cell37)),cell35),cell9),(((cell0,cell34),cell45),cell33))),(((((((cell15,cell29),cell44),cell8),((cell3,cell49),cell28)),(((cell13,cell38),(cell20,cell21)),cell42)),(((((cell23,cell31),cell22),cell41),(cell19,cell25)),(((cell24,cell43),cell39),((cell46,cell5),cell40)))),(cell32,cell4)));
True genotype (100, 50)

Evaluate the genotype accuracy (MAPE between imputed genotype and the ground truth).

gacc_spr = s2.metric.genotype_accuarcy(true_genotype, imputed_genotype_spr.values)
gacc_nni = s2.metric.genotype_accuarcy(true_genotype, imputed_genotype_nni.values)
gacc_nj = s2.metric.genotype_accuarcy(true_genotype, imputed_genotype_nj.values)
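
To make the metric concrete, here is a minimal sketch that assumes binary (0/1) genotypes and treats accuracy as the fraction of matrix entries on which the imputed genotype agrees with the truth, which for 0/1 values equals one minus the mean absolute difference. The helper name `toy_genotype_accuracy` is hypothetical and is not part of ScisTree2; the library's own computation may differ.

```python
def toy_genotype_accuracy(true_genotype, genotype):
    """Fraction of entries on which two equally shaped 0/1 matrices agree.

    Hypothetical illustration only; not the ScisTree2 implementation.
    """
    if len(true_genotype) != len(genotype):
        raise ValueError("matrices must have the same number of rows")
    total = 0
    matches = 0
    for true_row, row in zip(true_genotype, genotype):
        if len(true_row) != len(row):
            raise ValueError("rows must have the same length")
        for t, g in zip(true_row, row):
            total += 1
            matches += int(t == g)
    return matches / total


# Toy data: 2 sites x 3 cells, one entry differs.
toy_truth = [[1, 0, 1], [0, 0, 1]]
toy_imputed = [[1, 0, 0], [0, 0, 1]]
toy_acc = toy_genotype_accuracy(toy_truth, toy_imputed)  # 5 of 6 entries agree
```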

Evaluate the tree accuracy as \(1 - RF_{norm}(t_1, t_2)\), where \(RF_{norm}\) is the normalized Robinson-Foulds distance between the two trees \(t_1\) and \(t_2\).

tacc_spr = s2.metric.tree_accuracy(true_tree, tree_spr)
tacc_nni = s2.metric.tree_accuracy(true_tree, tree_nni)
tacc_nj = s2.metric.tree_accuracy(true_tree, tree_nj)
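
As an illustration of the idea, the sketch below computes a clade-based Robinson-Foulds distance on small rooted trees written as nested tuples. It assumes one common normalization, dividing the number of clades found in only one tree by the total number of non-trivial clades in both trees. The names `toy_clades` and `toy_tree_accuracy` are hypothetical; ScisTree2 may treat trees as unrooted or normalize differently.

```python
def toy_clades(tree):
    """Return the non-trivial clades of a nested-tuple tree as frozensets of leaf names."""
    found = set()

    def leaves(node):
        if isinstance(node, str):  # a leaf is a plain string
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree on these leaves
    return found


def toy_tree_accuracy(t1, t2):
    """1 minus the normalized RF distance (assumed normalization: |A ^ B| / (|A| + |B|))."""
    c1, c2 = toy_clades(t1), toy_clades(t2)
    if not c1 and not c2:
        return 1.0
    return 1.0 - len(c1 ^ c2) / (len(c1) + len(c2))


toy_true = ((("a", "b"), "c"), ("d", "e"))
toy_inferred = ((("a", "c"), "b"), ("d", "e"))
# Only the clades {a, b} and {a, c} differ, so RF = 2 out of 6 clades in total.
toy_tacc = toy_tree_accuracy(toy_true, toy_inferred)
```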

Next, we calculate the Ancestor-Descendant Error and Different Lineage Error. Before doing this, we need to get the ancestor-descendant pairs.

mutation_true = s2.metric.get_ancestor_descendant_pairs(true_genotype)
mutations_spr = s2.metric.get_ancestor_descendant_pairs(imputed_genotype_spr.values)
mutations_nni = s2.metric.get_ancestor_descendant_pairs(imputed_genotype_nni.values)
mutations_nj = s2.metric.get_ancestor_descendant_pairs(imputed_genotype_nj.values)

Then, calculate those errors.

ad_err_spr = s2.metric.ancestor_descendant_error(mutation_true, mutations_spr)
ad_err_nni = s2.metric.ancestor_descendant_error(mutation_true, mutations_nni)
ad_err_nj = s2.metric.ancestor_descendant_error(mutation_true, mutations_nj)
dl_err_spr = s2.metric.different_lineage_error(mutation_true, mutations_spr)
dl_err_nni = s2.metric.different_lineage_error(mutation_true, mutations_nni)
dl_err_nj = s2.metric.different_lineage_error(mutation_true, mutations_nj)
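
The following sketch shows one way such pairwise errors can be defined on a binary genotype matrix with mutations as rows and cells as columns. It assumes that mutation i is an ancestor of mutation j when the cells carrying j are a proper subset of the cells carrying i, that two mutations lie on different lineages when they share no cells, and that each error is the fraction of true pairs missing from the inferred pairs. All function names are hypothetical, and ScisTree2's definitions may differ in detail.

```python
from itertools import combinations, permutations


def toy_carriers(genotype):
    """For each mutation (row), the set of cell indices (columns) that carry it."""
    return [frozenset(c for c, v in enumerate(row) if v) for row in genotype]


def toy_ad_pairs(genotype):
    """Ordered pairs (i, j) where mutation i is an assumed ancestor of mutation j."""
    cells = toy_carriers(genotype)
    return {
        (i, j)
        for i, j in permutations(range(len(cells)), 2)
        if cells[j] and cells[j] < cells[i]  # non-empty proper subset
    }


def toy_dl_pairs(genotype):
    """Unordered pairs (i, j), i < j, of mutations that share no cells."""
    cells = toy_carriers(genotype)
    return {
        (i, j)
        for i, j in combinations(range(len(cells)), 2)
        if cells[i] and cells[j] and not (cells[i] & cells[j])
    }


def toy_pair_error(true_pairs, pairs):
    """Fraction of true pairs that are absent from the inferred pairs."""
    if not true_pairs:
        return 0.0
    return len(true_pairs - pairs) / len(true_pairs)


# Toy data: 3 mutations x 4 cells; the imputed matrix has one wrong entry.
toy_truth = [[1, 1, 1, 0], [1, 1, 0, 0], [0, 0, 0, 1]]
toy_imputed = [[1, 1, 1, 0], [1, 1, 0, 1], [0, 0, 0, 1]]
toy_ad_err = toy_pair_error(toy_ad_pairs(toy_truth), toy_ad_pairs(toy_imputed))
toy_dl_err = toy_pair_error(toy_dl_pairs(toy_truth), toy_dl_pairs(toy_imputed))
```

In this toy case a single flipped entry removes the only true ancestor-descendant pair and one of the two true different-lineage pairs, which illustrates how sensitive pairwise errors can be to individual genotype calls.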

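Check the results. On this dataset, SPR local search gives the best genotype accuracy, tree accuracy, and different-lineage error, while NNI has a marginally lower ancestor-descendant error.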

metrics = {
    "Method": ["SPR", "NNI", "NJ"],
    "Genotype Accuracy": [gacc_spr, gacc_nni, gacc_nj],
    "Tree Accuracy": [tacc_spr, tacc_nni, tacc_nj],
    "Ancestor-Descendant Error": [ad_err_spr, ad_err_nni, ad_err_nj],
    "Different Lineage Error": [dl_err_spr, dl_err_nni, dl_err_nj]
}

# Convert to DataFrame
df_metrics = pd.DataFrame(metrics)
df_metrics
  Method  Genotype Accuracy  Tree Accuracy  Ancestor-Descendant Error  Different Lineage Error
0    SPR             0.9826       0.250000                   0.479858                 0.023928
1    NNI             0.9802       0.166667                   0.478591                 0.024925
2     NJ             0.9766       0.083333                   0.503420                 0.024925