Functions
- convert.convert_gene(df, frm, to, species='human', frm_cols=[], verbose=True, bad_genes_col=False)
Convert gene names
Convert T-cell receptor (TCR) gene names between the IMGT, 10X, and Adaptive formats. Determines the columns to convert based on the input format (
frm) unless specified by the user (frm_cols). Returns a modified version of the input data frame with converted gene names while preserving row order.Behavioral Notes:
If a gene name cannot be mapped, it is replaced with
NA, and a warning is issued.If
bad_genes_col=True, appends a ‘bad_genes’ column containing comma-separated gene names that could not be converted for each row.If
frmis'imgt'andfrm_colsis not provided, 10X column names are assumed.Constant (C) genes are set to
NAwhen converting to Adaptive formats, as Adaptive does not capture constant regions.The input does not need to include all gene types; partial inputs (e.g., only V genes) are supported.
If no values in a custom column can be mapped (e.g., a CDR3 column) it is skipped and a warning is raised.
Standard Column Names:
If
frm_colsis not provided, these column names will be used if present:IMGT:
'v_gene','d_gene','j_gene','c_gene'10X:
'v_gene','d_gene','j_gene','c_gene'Adaptive:
'v_resolved','d_resolved','j_resolved'Adaptive v2:
'vMaxResolved','dMaxResolved','jMaxResolved'
- Parameters:
df (DataFrame) – Dataframe containing TCR gene names
frm (str) – Input format of TCR data
['tenx', 'adaptive', 'adaptivev2', 'imgt']to (str) – Output format of TCR data
['tenx', 'adaptive', 'adaptivev2', 'imgt']species (str, optional) – Species name. Defaults to
'human'.frm_cols (list of str, optional) – Custom gene column names.
verbose (bool, optional) – Whether to show all messages. Defaults to
True.bad_genes_col – Whether to add a column of the unconvertable genes. Defaults to
False.
- Returns:
Converted TCR data
- Return type:
DataFrame
- Example:
>>> import pandas as pd >>> import tcrconvert >>> tcr_file = tcrconvert.get_example_path('tenx.csv') >>> df = pd.read_csv(tcr_file)[['barcode', 'v_gene', 'j_gene', 'cdr3']] >>> df barcode v_gene j_gene cdr3 0 AAACCTGAGACCACGA-1 TRAV29/DV5 TRAJ12 CAVMDSSYKLIF 1 AAACCTGAGACCACGA-1 TRBV20/OR9-2 TRBJ2-1 CASSGLAGGYNEQFF 2 AAACCTGAGGCTCTTA-1 TRDV2 TRDJ3 CASSGVAGGTDTQYF 3 AAACCTGAGGCTCTTA-1 TRGV9 TRGJ1 CAVKDSNYQLIW >>> tcrconvert.convert_gene(df, 'tenx', 'adaptive', verbose=False) barcode v_gene j_gene cdr3 0 AAACCTGAGACCACGA-1 TCRAV29-01*01 TCRAJ12-01*01 CAVMDSSYKLIF 1 AAACCTGAGACCACGA-1 TCRBV20-or09_02*01 TCRBJ02-01*01 CASSGLAGGYNEQFF 2 AAACCTGAGGCTCTTA-1 TCRDV02-01*01 TCRDJ03-01*01 CASSGVAGGTDTQYF 3 AAACCTGAGGCTCTTA-1 TCRGV09-01*01 TCRGJ01-01*01 CAVKDSNYQLIW
- convert.choose_lookup(frm, to, species='human', verbose=True)
Choose lookup table
Determine which CSV lookup table to use based on the the input format (
frm) and returns the path to that file.- Parameters:
frm (str) – Input format of TCR data
['tenx', 'adaptive', 'adaptivev2', 'imgt']to (str) – Output format of TCR data
['tenx', 'adaptive', 'adaptivev2', 'imgt']species (str, optional) – Species, defaults to
'human'verbose (bool, optional) – Whether to show all messages, defaults to
True
- Returns:
Path to correct lookup table
- Return type:
str
- Example:
>>> import tcrconvert >>> tcrconvert.convert.choose_lookup('imgt', 'adaptive', verbose=False) '.../tcrconvert/data/human/lookup.csv'
- convert.which_frm_cols(df, frm, frm_cols=[], verbose=True)
Determine input columns to use
Determine the columns that are expected to hold gene name information in the input file based on the input format (
frm). Returns a list of those column names.- Parameters:
df (DataFrame) – Dataframe containing TCR gene names
frm (str) – Input format of TCR data
['tenx', 'adaptive', 'adaptivev2', 'imgt']frm_cols (list of str, optional) – Custom column names to use
verbose (bool, optional) – Whether to show all messages, defaults to
True
- Returns:
Column names to use
- Return type:
list of str
- Example:
>>> import pandas as pd >>> import tcrconvert >>> tcr_file = tcrconvert.get_example_path('tenx.csv') >>> df = pd.read_csv(tcr_file) >>> tcrconvert.convert.which_frm_cols(df, 'tenx') ['v_gene', 'd_gene', 'j_gene', 'c_gene']
- convert.convert_gene_cli(*args: t.Any, **kwargs: t.Any) t.Any
Convert T-cell receptor V/D/J/C gene names.
- Example:
Using custom input columns ‘myVgene’, ‘myDgene’, ‘myJgene’.
$ tcrconvert convert \ --input tcrconvert/examples/customcols.csv \ --output tcrconvert/examples/custom2adapt.tsv \ --frm tenx \ --to adaptive \ -c myVgene \ -c myDgene \ -c myJgene
- build_lookup.parse_imgt_fasta(infile)
Extract gene names from a reference FASTA
Extracts the second element from a “|”-delimited FASTA header, which will be the gene name for IMGT reference FASTAs.
- Parameters:
infile (str) – Path to FASTA file
- Returns:
Gene names
- Return type:
list of str
- Example:
Given a FASTA file containing this header:
>SomeText|TRBV29-1*01|MoreText| >SomeText|TRBV29-1*02|MoreText| >SomeText|TRBV29/OR9-2*01|MoreText|
>>> import tcrconvert >>> fasta = tcrconvert.get_example_path('fasta_dir/test_trbv.fa') >>> tcrconvert.build_lookup.parse_imgt_fasta(fasta) ['TRBV29-1*01', 'TRBV29-1*02', 'TRBV29/OR9-2*01']
- build_lookup.extract_imgt_genes(data_dir)
Extract all gene names from a folder of FASTAs
First run
parse_imgt_fasta()on all FASTA files in a given folder to pull out the gene names. Then return those names in an alphabetically sorted dataframe.- Parameters:
data_dir (str) – Path to directory containing FASTA files
- Returns:
Gene names
- Return type:
DataFrame
- Example:
Given a folder with FASTA files containing these headers:
>SomeText|TRAC*01|MoreText| >SomeText|TRAV1-1*01|MoreText| >SomeText|TRAV1-1*02|MoreText| >SomeText|TRAV1-2*01|MoreText| >SomeText|TRAV14/DV4*01|MoreText| >SomeText|TRAV38-1*01|MoreText| >SomeText|TRAV38-2/DV8*01|MoreText| >SomeText|TRBV29-1*01|MoreText| >SomeText|TRBV29-1*02|MoreText| >SomeText|TRBV29/OR9-2*01|MoreText|
>>> import tcrconvert >>> fastadir = tcrconvert.get_example_path('fasta_dir') >>> tcrconvert.build_lookup.extract_imgt_genes(fastadir) imgt 0 TRAC*01 1 TRAV1-1*01 2 TRAV1-1*02 3 TRAV1-2*01 4 TRAV14/DV4*01 5 TRAV38-1*01 6 TRAV38-2/DV8*01 7 TRBV29-1*01 8 TRBV29-1*02 9 TRBV29/OR9-2*01
- build_lookup.add_dash_one(gene_str)
Add
-01to gene names lacking gene-level infoSome genes just have the IMGT subgroup (e.g. TRBV2) and allele (e.g. *01) designation. The Adaptive format always includes an IMGT gene (e.g. -01) designation, with “-01” as the apparent default. add_dash_one() adds a default gene-level designation if it’s missing.
- Parameters:
gene_str (str) – Gene name
- Returns:
Updated gene name
- Return type:
str
- Example:
>>> import tcrconvert >>> tcrconvert.build_lookup.add_dash_one('TRBV2*01') 'TRBV2-01*01'
- build_lookup.pad_single_digit(gene_str)
Add a
0to single-digit gene-level designationTake a gene name and ensure that any single-digit number following a sequence of letters is padded with a leading zero. This is to match the Adaptive format.
- Parameters:
gene_str (str) – Gene name
- Returns:
Updated gene name
- Return type:
str
- Example:
>>> import tcrconvert >>> tcrconvert.build_lookup.pad_single_digit('TCRBV1-2') 'TCRBV01-2'
- build_lookup.build_lookup_from_fastas(data_dir, species)
Create lookup tables
Process IMGT reference FASTA files in a given folder to generate lookup tables used for making gene name conversions. It extracts all gene names and transforms them into 10X and Adaptive formats following predefined conversion rules. The resulting files are created:
lookup.csv: IMGT gene names and their 10X and Adaptive equivalents.lookup_from_tenx.csv: Gene names aggregated by their 10X identifiers, with one representative allele (*01) for each.lookup_from_adaptive.csv: Adaptive gene names, with or without alleles, and their IMGT and 10X equivalents.
The files are saved in a given subfolder (
species) within the appropriate application folder viaplatformdirs. For example:MacOS:
~/Library/Application Support/<AppName>Windows:
C:\Documents and Settings\<User>\Application Data\Local Settings\<AppAuthor>\<AppName>Linux:
~/.local/share/<AppName>
If a folder named
speciesalready exists in that location, it will be replaced.Key transformations from IMGT:
- 10X:
Remove allele information (e.g.,
*01) and modify/DVoccurrences.
- Adaptive:
Apply renaming rules such as adding gene-level designations and zero-padding single-digit numbers.
Convert constant genes to
'NoData'(Adaptive only captures VDJ) which will becomeNAafter the merge inconvert_gene().
- Parameters:
data_dir (str) – Directory containing FASTA files
species (str) – Name of species that will be used when running TCRconvert with these lookup tables.
- Returns:
Path to the new lookup directory
- Return type:
str
- Example:
>>> import tcrconvert >>> fastadir = tcrconvert.get_example_path('fasta_dir') >>> tcrconvert.build_lookup.build_lookup_from_fastas(fastadir, 'rabbit') '...tcrconvert/rabbit'
- build_lookup.build_lookup_from_fastas_cli(*args: t.Any, **kwargs: t.Any) t.Any
Create lookup tables :Example:
$ tcrconvert build -i tcrconvert/examples/fasta_dir/ -s rabbit