Prepare your own data

Prepare your own data#

ScisTree2 works with genotype probability data, which it accepts as a GenotypeProbability object. You have several convenient methods to construct this essential object, each catering to different input data formats.

Note

All data and files can be found at GitHub

Method 1: Probability Matrix#

If you’ve already calculated your genotype probabilities, you can input them directly using the from_probs method.

scistree.probability.from_probs(probs, cell_names=None, site_names=None, margin=1e-5)

probs: A NumPy array containing the genotype probabilities.
cell_names (Optional): A list of names for your cells. If you don’t provide this, cells will be automatically named \(c_1\), \(c_2\), up to \(c_N\).
site_names (Optional): A list of names for your genomic sites. If not provided, sites will be named \(s_1\), \(s_2\), up to \(s_N\).
margin: This parameter helps prevent probabilities from becoming either extremely large or extremely small, ensuring numerical stability.

Note

This method is ideal when you have precise genotype probabilities readily available, perhaps from another analysis pipeline or a simulation.

import scistree2 as s2
import numpy as np 

probs = np.array([[0.01, 0.6, 0.08, 0.8, 0.7],
                 [0.8, 0.02, 0.7, 0.01, 0.3],
                 [0.02, 0.8, 0.02, 0.8, 0.9],
                 [0.9, 0.9, 0.8, 0.8, 0.02],
                 [0.01, 0.8, 0.01, 0.8, 0.9],
                 [0.05, 0.02, 0.7, 0.05, 0.9]]) 
cell_names = ['cell1', 'cell2', 'cell3', 'cell4', 'cell5']
site_names = ['snp1', 'snp2', 'snp3', 'snp4', 'snp5', 'snp6']
gp = s2.probability.from_probs(probs, cell_names, site_names)
print(gp)

      cell1  cell2  cell3  cell4  cell5
snp1   0.01   0.60   0.08   0.80   0.70
snp2   0.80   0.02   0.70   0.01   0.30
snp3   0.02   0.80   0.02   0.80   0.90
snp4   0.90   0.90   0.80   0.80   0.02
snp5   0.01   0.80   0.01   0.80   0.90
snp6   0.05   0.02   0.70   0.05   0.90

Method 2: From CSV File#

If you’ve already had a csv file containing probabilites or reads information, you can input it directly using the from_csv method.

scistree.probability.from_csv(csv_filepath, source="probability", **kwargs)

csv_path: The path to your CSV file.
source: Be either “probability” or “read”.
kwargs: Other parameters, such as ado, seqerr, etc.

	cell1	cell2	cell3	cell4	cell5
snp1	0.01	0.6	0.08	0.8	0.7
snp2	0.8	0.02	0.7	0.01	0.3
snp3	0.02	0.8	0.02	0.8	0.9
snp4	0.9	0.9	0.8	0.8	0.02
snp5	0.01	0.8	0.01	0.8	0.9
snp6	0.05	0.02	0.7	0.05	0.9

gp = s2.probability.from_csv('./data/toy_probs.csv')
print(gp)

      cell1  cell2  cell3  cell4  cell5
snp1   0.01   0.60   0.08   0.80   0.70
snp2   0.80   0.02   0.70   0.01   0.30
snp3   0.02   0.80   0.02   0.80   0.90
snp4   0.90   0.90   0.80   0.80   0.02
snp5   0.01   0.80   0.01   0.80   0.90
snp6   0.05   0.02   0.70   0.05   0.90

Method 3: From Read Counts#

For raw read count data, ScisTree2 can automatically convert these into genotype probabilities. This method includes crucial parameters to account for common experimental noise in sequencing data.

scistree.probability.from_reads(reads, ado=0.2, seqerr=0.01, posterior=True, af=None, cell_names=None, site_names=None)

reads: A three-dimensional NumPy array containing the read counts for each cell at each site.
ado: The Allele Dropout (ADO) rate, which accounts for the probability of an allele not being detected in a cell.
seqerr: The sequencing error rate, representing the probability of a base being incorrectly called during sequencing.
posterior: If set to True, ScisTree2 will use af to calculate the posterior genotype probability. If False, af defaults to \(0.5\).
af: The allele frequency.
cell_names (Optional): A list of names for your cells. If you don’t provide this, cells will be automatically named \(c_1\), \(c_2\), up to \(c_N\).
site_names (Optional): A list of names for your genomic sites. If not provided, sites will be named \(s_1\), \(s_2\), up to \(s_N\).

Note

This is your go-to method when you have sequencing read depth information and need to estimate genotype probabilities while accounting for common single-cell sequencing errors.

In this dataset, the input format remains a matrix of shape (num_sites, num_cells).
However, instead of using precomputed genotype probabilities, each entry is a tuple representing raw sequencing read counts.

Note

if the read counts are unknown (missing), just use (0, 0).

Each tuple has the form: (ref_count, alt_count), where:

ref_count is the number of reads supporting the reference (wild type) allele
alt_count is the number of reads supporting the mutation (alternative) allele

Note

This format is suitable when you start from raw read counts rather than inferred genotype probabilities, enabling ScisTree2 to perform probabilistic genotype modeling internally before tree inference.

This dataset contains 50 cells and 100 SNPs.
It was generated using CellCoal with the following command:

cellcoal-1.2.0 -n5 -s10 -l100 -e1000 -b1 -j30 -p0 -D0.5 -B0 0.01 -C5 -E0 -otoys -y3 -v -1 -2 -6 -7 -9 -Y

Preview:

	Cell 1	Cell 2	Cell 3	Cell 4	Cell 5	Cell 6	Cell 7	Cell 8	Cell 9	Cell 10
SNP 1	(4,1)	(4,0)	(5,1)	(4,0)	(1,0)	(4,0)	(6,0)	(6,0)	(5,0)	(5,0)
SNP 2	(4,0)	(3,0)	(7,1)	(1,1)	(3,0)	(2,0)	(10,0)	(11,0)	(1,0)	(9,0)
SNP 3	(4,5)	(11,0)	(4,0)	(3,0)	(4,0)	(3,0)	(7,0)	(0,0)	(2,0)	(4,0)
SNP 4	(8,0)	(2,0)	(6,0)	(1,0)	(2,2)	(3,5)	(5,3)	(1,3)	(2,4)	(8,0)
SNP 5	(5,0)	(9,0)	(4,0)	(3,0)	(4,0)	(3,0)	(7,0)	(4,1)	(4,0)	(5,0)

Note

Only the first 10 cells and 5 SNPs are shown here for illustration. The full dataset contains 100 SNPs × 50 cells. Use (0, 0) if there is any missing.

We calculate the posterior genotype probability with the following settings:

Allelic dropout rate: ado=0.2
Sequencing error rate: seqerr=0.01

When posterior=True (default), the posterior probability \(p(G \mid D)\) is calculated given the allele frequency af.
The parameter af should be a numpy.ndarray with shape (num_sites, 1).

If af is not provided or set to None, it will be automatically estimated from the current samples.

# one can generate the input directly from a reads matrix.
reads = np.load('data/toy_raw_reads.npy', allow_pickle=True) # load simulated reads
cell_names = [f'cell{i}' for i in range(50)]
site_names = [f'snp{i}' for i in range(100)]
gp = s2.probability.from_reads(reads, ado=0.2, seqerr=0.01, posterior=True, af=None, cell_names=cell_names, site_names=site_names)
print(gp)

       cell0  cell1  cell2  cell3  cell4  cell5  cell6  cell7  cell8  cell9  \
snp0    0.97   0.93   0.97   0.97   0.95   0.93   0.50   0.97   0.96   0.74   
snp1    0.98   0.71   0.82   0.98   0.82   0.93   0.96   0.98   0.98   0.97   
snp2    0.44   0.96   0.28   0.95   0.95   0.97   0.96   0.96   0.96   0.95   
snp3    0.74   0.88   0.88   0.01   0.00   0.00   0.34   0.90   0.01   0.88   
snp4    0.01   0.00   0.00   0.89   0.46   0.91   0.04   0.00   0.06   0.00   
...      ...    ...    ...    ...    ...    ...    ...    ...    ...    ...   
snp95   0.88   0.93   0.92   0.83   0.97   0.92   0.58   0.83   0.96   0.97   
snp96   0.87   0.94   0.00   0.94   0.97   0.87   0.96   0.97   0.55   0.92   
snp97   0.00   0.50   0.00   0.85   0.30   0.89   0.00   0.10   0.21   0.00   
snp98   0.99   0.99   0.96   0.98   0.98   0.99   0.98   0.98   0.99   0.98   
snp99   0.98   0.62   0.62   0.97   0.94   0.98   0.90   0.94   0.98   0.96   

       ...  cell40  cell41  cell42  cell43  cell44  cell45  cell46  cell47  \
snp0   ...    0.93    0.97    0.44    0.96    0.28    0.96    0.96    0.44   
snp1   ...    0.98    0.98    0.97    0.98    0.71    0.50    0.98    0.98   
snp2   ...    0.50    0.50    0.28    0.97    0.95    0.28    0.96    0.50   
snp3   ...    0.06    0.12    0.00    0.00    0.00    0.82    0.88    0.65   
snp4   ...    0.86    0.71    0.92    0.89    0.89    0.06    0.92    0.86   
...    ...     ...     ...     ...     ...     ...     ...     ...     ...   
snp95  ...    0.50    0.97    0.97    0.18    0.97    0.50    0.96    0.95   
snp96  ...    0.96    0.94    0.14    0.87    0.97    0.97    0.94    0.01   
snp97  ...    0.79    0.50    0.90    0.88    0.70    0.00    0.92    0.60   
snp98  ...    0.36    0.99    0.99    0.99    0.97    0.99    0.99    0.98   
snp99  ...    0.90    0.06    0.96    0.97    0.85    0.97    0.90    0.06   

       cell48  cell49  
snp0     0.98    0.96  
snp1     0.98    0.98  
snp2     0.98    0.93  
snp3     0.74    0.00  
snp4     0.06    0.92  
...       ...     ...  
snp95    0.97    0.97  
snp96    0.09    0.96  
snp97    0.00    0.70  
snp98    0.98    0.98  
snp99    0.94    0.94  

[100 rows x 50 columns]

Method 4: From VCF File#

ScisTree2 also supports direct input from Variant Call Format (VCF) files, a common format for storing genetic variation data.

scistree.probability.from_vcf(vcf_path, ado=0.2, seqerr=0.01, posterior=True, af=None, key='AD')

vcf_path: The path to your VCF file.
ado: The Allele Dropout (ADO) rate, which accounts for the probability of an allele not being detected in a cell.
seqerr: The sequencing error rate, representing the probability of a base being incorrectly called during sequencing.
posterior: If set to True, ScisTree2 will use af to calculate the posterior genotype probability. If False, af defaults to \(0.5\).
af: The allele frequency.
key: This is the specific tag within the VCF file that stores the read depth information.

vcf_path = './data/example.vcf'
gp = s2.probability.from_vcf(vcf_path, ado=0.2, seqerr=0.01, posterior=True, af=None)
print(gp)

Info: Skipping line because AD not found in FORMAT: CHR[chr1], POS[250]
          CELL_1  CELL_2  CELL_3  CELL_4  CELL_5
chr1:100    0.00    0.88    0.00    0.00    0.88
chr1:150    0.95    0.00    0.50    0.95    0.00
chr1:200    0.00    0.00    0.77    0.00    0.00

Prepare your own data

Contents

Prepare your own data#

Method 1: Probability Matrix#

Method 2: From CSV File#

Method 3: From Read Counts#

Method 4: From VCF File#