Map mutations on branches


Load our toy example

import scistree2 as s2
import numpy as np 
import pandas as pd
gp = s2.probability.from_csv('./data/toy_probs.csv')

Run inference

caller = s2.ScisTree2(threads=8) # use 8 threads
tree, imputed_genotype, likelihood = caller.infer(gp) # run ScisTree2 inference
print('Imputed genotype from SPR: \n', imputed_genotype)
print('Newick of the SPR tree: ', tree)
print('Likelihood of the SPR tree: ', likelihood)
Imputed genotype from SPR: 
       cell1  cell2  cell3  cell4  cell5
snp1      1      0      1      0      0
snp2      0      1      0      1      0
snp3      1      0      1      0      0
snp4      0      0      0      0      1
snp5      1      0      1      0      0
snp6      1      1      1      1      0
Newick of the SPR tree:  (((cell1,cell3),(cell2,cell4)),cell5);
Likelihood of the SPR tree:  -6.271255186813891

Take a look at the inferred tree.

tree.draw()
           ┌cell5
 78c9455450┤
           │                     ┌cell4
           │          ┌f4144d5050┤
           │          │          └cell2
           └4811a4a3fa┤
                      │          ┌cell3
                      └e396ac3c84┤
                                 └cell1

Next, find out where the mutations are placed on the tree. Each branch is identified by the node it ends at, and that node's mutations attribute lists the mutations placed on the branch.

for node in tree.get_all_nodes():
    print(f'Mutations at branch ending at {tree[node].name}:', tree[node].mutations)
Mutations at branch ending at cell5: ['snp4']
Mutations at branch ending at 4811a4a3fa: ['snp6']
Mutations at branch ending at f4144d5050: ['snp2']
Mutations at branch ending at e396ac3c84: ['snp1', 'snp3', 'snp5']
Mutations at branch ending at cell4: []
Mutations at branch ending at cell2: []
Mutations at branch ending at cell3: []
Mutations at branch ending at cell1: []
Mutations at branch ending at 78c9455450: []
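
To see how these branch labels relate to the imputed genotype, the sketch below collects, for each cell, the mutations on every branch between that cell and the root. It does not call scistree2: the parent and mutations dictionaries are copied by hand from the tree drawing and the output above, and mutations_above is a hypothetical helper name used only for illustration.

```python
# Parent of each node, copied by hand from the tree drawn above.
# The root (78c9455450) has no entry because it has no parent.
parent = {
    "cell5": "78c9455450",
    "4811a4a3fa": "78c9455450",
    "f4144d5050": "4811a4a3fa",
    "e396ac3c84": "4811a4a3fa",
    "cell4": "f4144d5050",
    "cell2": "f4144d5050",
    "cell3": "e396ac3c84",
    "cell1": "e396ac3c84",
}

# Mutations on the branch ending at each node, copied from the printed output.
# Branches with no mutations are simply left out.
mutations = {
    "cell5": ["snp4"],
    "4811a4a3fa": ["snp6"],
    "f4144d5050": ["snp2"],
    "e396ac3c84": ["snp1", "snp3", "snp5"],
}


def mutations_above(node):
    """Collect the mutations on every branch from `node` up to the root."""
    found = set()
    while node is not None:
        found.update(mutations.get(node, []))
        node = parent.get(node)  # returns None once the root is reached
    return found


cells = ["cell1", "cell2", "cell3", "cell4", "cell5"]
carried = {cell: sorted(mutations_above(cell)) for cell in cells}
for cell in cells:
    print(cell, carried[cell])
```

Each list matches the rows set to 1 in that cell's column of the imputed genotype. For example, cell1 carries snp1, snp3, snp5 and snp6, while cell5 carries only snp4.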

You can write per-branch information into the Newick string by passing a custom function to tree.output through the branch_length_func argument. The function receives a node and returns the value to record for the branch ending at that node, such as its number of mutations.

This lets you represent the number of mutations as the branch length. With this Newick string, you can carry out further analysis or visualization in other packages. Libraries such as ete3 are well suited to this, although we won't cover that here.

def get_num_mutations(node):
    return len(node.mutations)

print(tree.output(branch_length_func=get_num_mutations)) # Newick string format: branch lengths represent the number of mutations.

print(tree.output(branch_length_func=lambda x: len(x.mutations))) # or simply, using a lambda expression.
(((cell1:0,cell3:0):3,(cell2:0,cell4:0):1):1,cell5:1):0;
(((cell1:0,cell3:0):3,(cell2:0,cell4:0):1):1,cell5:1):0;
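
The callback pattern behind branch_length_func can be illustrated with a small standalone serializer: a recursive function that writes each node's label followed by whatever value the callback returns for it. This is only a sketch of the idea, not scistree2's implementation. The Node class and the to_newick function are hypothetical, and the toy tree is rebuilt by hand from the output above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a tree node; not the scistree2 class."""
    name: str = ""
    mutations: list = field(default_factory=list)
    children: list = field(default_factory=list)


def to_newick(node, branch_length_func):
    """Serialize the subtree under `node`, using the callback for branch lengths."""
    if node.children:
        inner = ",".join(to_newick(child, branch_length_func) for child in node.children)
        label = f"({inner})"  # internal nodes are written without a name
    else:
        label = node.name
    return f"{label}:{branch_length_func(node)}"


# The toy tree, rebuilt by hand with the mutations reported for each branch.
root = Node(children=[
    Node(mutations=["snp6"], children=[
        Node(mutations=["snp1", "snp3", "snp5"],
             children=[Node("cell1"), Node("cell3")]),
        Node(mutations=["snp2"],
             children=[Node("cell2"), Node("cell4")]),
    ]),
    Node("cell5", mutations=["snp4"]),
])

# The same callback as above: branch length = number of mutations on the branch.
newick = to_newick(root, lambda n: len(n.mutations)) + ";"
print(newick)
```

Swapping in a different callback changes what is written after each colon without touching the traversal, which is why a single function argument is enough to customize the output.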