Basic usage

Basic usage#

Import ScisTree2 as:

import numpy as np 
import pandas as pd
import scistree2 as s2

Initialize a ScisTree2 caller with 10 threads:

caller = s2.ScisTree2(threads=10)

ScisTree2 takes a GenotypProbability object as input. ScisTree2 provides many ways to construct that GenotypProbability objects from your own data, check the next chapter. Here, we use a pre-calculated probabilities for illustration.

In this example, we provide a small toy dataset where rows represent SNPs and columns represent cells.
Each entry in the matrix denotes the probability of being the wild type (reference).

This format is suitable when you already have probabilistic genotypes derived from upstream processing.

Note

Use 0.5 if it is missing.

# one can generate the input directly from the probability matrix.
probs = np.array([[0.01, 0.6, 0.08, 0.8, 0.7],
                 [0.8, 0.02, 0.7, 0.01, 0.3],
                 [0.02, 0.8, 0.02, 0.8, 0.9],
                 [0.9, 0.9, 0.8, 0.8, 0.02],
                 [0.01, 0.8, 0.01, 0.8, 0.9],
                 [0.05, 0.02, 0.7, 0.05, 0.9]]) 
cell_names = ['cell1', 'cell2', 'cell3', 'cell4', 'cell5']
site_names = ['snp1', 'snp2', 'snp3', 'snp4', 'snp5', 'snp6']
gp = s2.probability.from_probs(probs, cell_names, site_names)
print(gp)
      cell1  cell2  cell3  cell4  cell5
snp1   0.01   0.60   0.08   0.80   0.70
snp2   0.80   0.02   0.70   0.01   0.30
snp3   0.02   0.80   0.02   0.80   0.90
snp4   0.90   0.90   0.80   0.80   0.02
snp5   0.01   0.80   0.01   0.80   0.90
snp6   0.05   0.02   0.70   0.05   0.90

Now run ScisTree2:

tree, imputed_genotype, likelihood = caller.infer(gp) # run Scistree2 inference
print('Imputed genotype from SPR: \n', imputed_genotype)
print('Newick of the SPR tree: ', tree)
print('Likelihood of the SPR tree: ', likelihood)
Imputed genotype from SPR: 
       cell1  cell2  cell3  cell4  cell5
snp1      1      0      1      0      0
snp2      0      1      0      1      0
snp3      1      0      1      0      0
snp4      0      0      0      0      1
snp5      1      0      1      0      0
snp6      1      1      1      1      0
Newick of the SPR tree:  (((cell1,cell3),(cell2,cell4)),cell5);
Likelihood of the SPR tree:  -6.271255186813891

We can also replace SPR (Subtree Prune and Regraft) local search with NNI (Nearest Neighbor Interchange) by setting nni=True.

Note

Using NNI typically speeds up the algorithm, but may result in lower accuracy compared to SPR. NNI is recommended when a faster approximation is needed, especially for large datasets.

caller_nni = s2.ScisTree2(threads=8, nni=True)
tree_nni, imputed_genotype_nni, likelihood_nni = caller_nni.infer(gp)
print('Imputed genotype from NNI: \n', imputed_genotype_nni)
print('Newick of the NNI tree: ', tree_nni)
print('Likelihood of the NNI tree: ', likelihood_nni)
Imputed genotype from NNI: 
       cell1  cell2  cell3  cell4  cell5
snp1      1      0      1      0      0
snp2      0      1      0      1      0
snp3      1      0      1      0      0
snp4      0      0      0      0      1
snp5      1      0      1      0      0
snp6      1      1      1      1      0
Newick of the NNI tree:  (((cell1,cell3),(cell2,cell4)),cell5);
Likelihood of the NNI tree:  -6.271255186813891

We may also invoke ScisTree2 with Neighbor Joining (NJ) by setting nj=True to obtain only the initial tree.

Note

In this toy example, Neighbor Joining performs quite well and produces a tree close to the optimal.

caller_nj = s2.ScisTree2(threads=8, nj=True)
tree_nj, imputed_genotype_nj, likelihood_nj = caller_nj.infer(gp)
print('Imputed genotype from NJ: \n', imputed_genotype_nj)
print('Newick of the NJ tree: ', tree_nj)
print('Likelihood of the NJ tree: ', likelihood_nj)
Imputed genotype from NJ: 
       cell1  cell2  cell3  cell4  cell5
snp1      1      0      1      0      0
snp2      0      1      0      1      0
snp3      1      0      1      0      0
snp4      0      0      0      0      1
snp5      1      0      1      0      0
snp6      1      1      1      1      0
Newick of the NJ tree:  (((cell1,cell3),(cell2,cell4)),cell5);
Likelihood of the NJ tree:  -6.271255186813891

We can also evaluate a random or alternative tree using the genotype probability matrix.
Using the same example as before, we evaluate a random tree structure.

Note

As expected, the likelihood of this alternative tree is lower than that of the optimal tree (−6.27).

random_tree = '(((cell2,cell3),(cell5,cell1)),cell4);'
imputed_genotype_random, likelihood_random = s2.evaluate(gp, random_tree)
# imputed_genotype_random, likelihood_random, tree = s2.evaluate(gp, random_tree, return_tree=True) # or return a tree
print('Imputed genotype from SPR: \n', imputed_genotype_random)
print('Newick of the random tree: ', random_tree)
print('Likelihood of the random tree: ', likelihood_random)
Imputed genotype from SPR: 
       cell1  cell2  cell3  cell4  cell5
snp1      1      1      1      0      1
snp2      1      1      1      1      1
snp3      1      1      1      0      1
snp4      0      0      0      0      1
snp5      1      1      1      0      1
snp6      1      1      1      1      1
Newick of the random tree:  (((cell2,cell3),(cell5,cell1)),cell4);
Likelihood of the random tree:  -18.274574970677588

ScisTree2 operates on a heuristic algorithm, which means its execution time can vary. For users who need to control runtime and potentially stop the program early, the max_iter parameter is available. By setting max_iter, you can define the maximum number of local search steps performed.

caller_early_stop = s2.ScisTree2(threads=8, max_iter=2)
tree_early_stop, imputed_genotype_early_stop, likelihood_early_stop = caller_early_stop.infer(gp)
print('likelihood after 2 iterations:', likelihood_early_stop)
print('the optimal likelihood:', likelihood)
likelihood after 2 iterations: -6.271255186813891
the optimal likelihood: -6.271255186813891