Functions

convert.convert_gene(df, frm, to, species='human', frm_cols=[], verbose=True, bad_genes_col=False)

Convert gene names

Convert T-cell receptor (TCR) gene names between the IMGT, 10X, and Adaptive formats. Determines the columns to convert based on the input format (frm) unless specified by the user (frm_cols). Returns a modified version of the input data frame with converted gene names while preserving row order.

Behavioral Notes:

  • If a gene name cannot be mapped, it is replaced with NA, and a warning is issued.

  • If bad_genes_col=True, appends a ‘bad_genes’ column containing comma-separated gene names that could not be converted for each row.

  • If frm is 'imgt' and frm_cols is not provided, 10X column names are assumed.

  • Constant (C) genes are set to NA when converting to Adaptive formats, as Adaptive does not capture constant regions.

  • The input does not need to include all gene types; partial inputs (e.g., only V genes) are supported.

  • If no values in a custom column can be mapped (e.g., a CDR3 column) it is skipped and a warning is raised.

Standard Column Names:

If frm_cols is not provided, these column names will be used if present:

  • IMGT: 'v_gene', 'd_gene', 'j_gene', 'c_gene'

  • 10X: 'v_gene', 'd_gene', 'j_gene', 'c_gene'

  • Adaptive: 'v_resolved', 'd_resolved', 'j_resolved'

  • Adaptive v2: 'vMaxResolved', 'dMaxResolved', 'jMaxResolved'

Parameters:
  • df (DataFrame) – Dataframe containing TCR gene names

  • frm (str) – Input format of TCR data ['tenx', 'adaptive', 'adaptivev2', 'imgt']

  • to (str) – Output format of TCR data ['tenx', 'adaptive', 'adaptivev2', 'imgt']

  • species (str, optional) – Species name. Defaults to 'human'.

  • frm_cols (list of str, optional) – Custom gene column names.

  • verbose (bool, optional) – Whether to show all messages. Defaults to True.

  • bad_genes_col – Whether to add a column of the unconvertable genes. Defaults to False.

Returns:

Converted TCR data

Return type:

DataFrame

Example:

>>> import pandas as pd
>>> import tcrconvert
>>> tcr_file = tcrconvert.get_example_path('tenx.csv')
>>> df = pd.read_csv(tcr_file)[['barcode', 'v_gene', 'j_gene', 'cdr3']]
>>> df
              barcode        v_gene   j_gene             cdr3
0  AAACCTGAGACCACGA-1    TRAV29/DV5   TRAJ12     CAVMDSSYKLIF
1  AAACCTGAGACCACGA-1  TRBV20/OR9-2  TRBJ2-1  CASSGLAGGYNEQFF
2  AAACCTGAGGCTCTTA-1         TRDV2    TRDJ3  CASSGVAGGTDTQYF
3  AAACCTGAGGCTCTTA-1         TRGV9    TRGJ1     CAVKDSNYQLIW
>>> tcrconvert.convert_gene(df, 'tenx', 'adaptive', verbose=False)
              barcode              v_gene         j_gene             cdr3
0  AAACCTGAGACCACGA-1       TCRAV29-01*01  TCRAJ12-01*01     CAVMDSSYKLIF
1  AAACCTGAGACCACGA-1  TCRBV20-or09_02*01  TCRBJ02-01*01  CASSGLAGGYNEQFF
2  AAACCTGAGGCTCTTA-1       TCRDV02-01*01  TCRDJ03-01*01  CASSGVAGGTDTQYF
3  AAACCTGAGGCTCTTA-1       TCRGV09-01*01  TCRGJ01-01*01     CAVKDSNYQLIW
convert.choose_lookup(frm, to, species='human', verbose=True)

Choose lookup table

Determine which CSV lookup table to use based on the the input format (frm) and returns the path to that file.

Parameters:
  • frm (str) – Input format of TCR data ['tenx', 'adaptive', 'adaptivev2', 'imgt']

  • to (str) – Output format of TCR data ['tenx', 'adaptive', 'adaptivev2', 'imgt']

  • species (str, optional) – Species, defaults to 'human'

  • verbose (bool, optional) – Whether to show all messages, defaults to True

Returns:

Path to correct lookup table

Return type:

str

Example:

>>> import tcrconvert
>>> tcrconvert.convert.choose_lookup('imgt', 'adaptive', verbose=False) 
'.../tcrconvert/data/human/lookup.csv'
convert.which_frm_cols(df, frm, frm_cols=[], verbose=True)

Determine input columns to use

Determine the columns that are expected to hold gene name information in the input file based on the input format (frm). Returns a list of those column names.

Parameters:
  • df (DataFrame) – Dataframe containing TCR gene names

  • frm (str) – Input format of TCR data ['tenx', 'adaptive', 'adaptivev2', 'imgt']

  • frm_cols (list of str, optional) – Custom column names to use

  • verbose (bool, optional) – Whether to show all messages, defaults to True

Returns:

Column names to use

Return type:

list of str

Example:

>>> import pandas as pd
>>> import tcrconvert
>>> tcr_file = tcrconvert.get_example_path('tenx.csv')
>>> df = pd.read_csv(tcr_file)
>>> tcrconvert.convert.which_frm_cols(df, 'tenx')
['v_gene', 'd_gene', 'j_gene', 'c_gene']
convert.convert_gene_cli(*args: t.Any, **kwargs: t.Any) t.Any

Convert T-cell receptor V/D/J/C gene names.

Example:

Using custom input columns ‘myVgene’, ‘myDgene’, ‘myJgene’.


$ tcrconvert convert \
    --input tcrconvert/examples/customcols.csv \
    --output tcrconvert/examples/custom2adapt.tsv \
    --frm tenx \
    --to adaptive \
    -c myVgene \
    -c myDgene \
    -c myJgene
build_lookup.parse_imgt_fasta(infile)

Extract gene names from a reference FASTA

Extracts the second element from a “|”-delimited FASTA header, which will be the gene name for IMGT reference FASTAs.

Parameters:

infile (str) – Path to FASTA file

Returns:

Gene names

Return type:

list of str

Example:

Given a FASTA file containing this header:


>SomeText|TRBV29-1*01|MoreText|
>SomeText|TRBV29-1*02|MoreText|
>SomeText|TRBV29/OR9-2*01|MoreText|
>>> import tcrconvert
>>> fasta = tcrconvert.get_example_path('fasta_dir/test_trbv.fa')
>>> tcrconvert.build_lookup.parse_imgt_fasta(fasta)
['TRBV29-1*01', 'TRBV29-1*02', 'TRBV29/OR9-2*01']
build_lookup.extract_imgt_genes(data_dir)

Extract all gene names from a folder of FASTAs

First run parse_imgt_fasta() on all FASTA files in a given folder to pull out the gene names. Then return those names in an alphabetically sorted dataframe.

Parameters:

data_dir (str) – Path to directory containing FASTA files

Returns:

Gene names

Return type:

DataFrame

Example:

Given a folder with FASTA files containing these headers:


 >SomeText|TRAC*01|MoreText|
 >SomeText|TRAV1-1*01|MoreText|
 >SomeText|TRAV1-1*02|MoreText|
 >SomeText|TRAV1-2*01|MoreText|
 >SomeText|TRAV14/DV4*01|MoreText|
 >SomeText|TRAV38-1*01|MoreText|
 >SomeText|TRAV38-2/DV8*01|MoreText|
 >SomeText|TRBV29-1*01|MoreText|
 >SomeText|TRBV29-1*02|MoreText|
 >SomeText|TRBV29/OR9-2*01|MoreText|
>>> import tcrconvert
>>> fastadir = tcrconvert.get_example_path('fasta_dir')
>>> tcrconvert.build_lookup.extract_imgt_genes(fastadir)
              imgt
0          TRAC*01
1       TRAV1-1*01
2       TRAV1-1*02
3       TRAV1-2*01
4    TRAV14/DV4*01
5      TRAV38-1*01
6  TRAV38-2/DV8*01
7      TRBV29-1*01
8      TRBV29-1*02
9  TRBV29/OR9-2*01
build_lookup.add_dash_one(gene_str)

Add -01 to gene names lacking gene-level info

Some genes just have the IMGT subgroup (e.g. TRBV2) and allele (e.g. *01) designation. The Adaptive format always includes an IMGT gene (e.g. -01) designation, with “-01” as the apparent default. add_dash_one() adds a default gene-level designation if it’s missing.

Parameters:

gene_str (str) – Gene name

Returns:

Updated gene name

Return type:

str

Example:

>>> import tcrconvert
>>> tcrconvert.build_lookup.add_dash_one('TRBV2*01')
'TRBV2-01*01'
build_lookup.pad_single_digit(gene_str)

Add a 0 to single-digit gene-level designation

Take a gene name and ensure that any single-digit number following a sequence of letters is padded with a leading zero. This is to match the Adaptive format.

Parameters:

gene_str (str) – Gene name

Returns:

Updated gene name

Return type:

str

Example:

>>> import tcrconvert
>>> tcrconvert.build_lookup.pad_single_digit('TCRBV1-2')
'TCRBV01-2'
build_lookup.build_lookup_from_fastas(data_dir, species)

Create lookup tables

Process IMGT reference FASTA files in a given folder to generate lookup tables used for making gene name conversions. It extracts all gene names and transforms them into 10X and Adaptive formats following predefined conversion rules. The resulting files are created:

  • lookup.csv: IMGT gene names and their 10X and Adaptive equivalents.

  • lookup_from_tenx.csv: Gene names aggregated by their 10X identifiers, with one representative allele (*01) for each.

  • lookup_from_adaptive.csv: Adaptive gene names, with or without alleles, and their IMGT and 10X equivalents.

The files are saved in a given subfolder (species) within the appropriate application folder via platformdirs. For example:

  • MacOS: ~/Library/Application Support/<AppName>

  • Windows: C:\Documents and Settings\<User>\Application Data\Local Settings\<AppAuthor>\<AppName>

  • Linux: ~/.local/share/<AppName>

If a folder named species already exists in that location, it will be replaced.

Key transformations from IMGT:

  • 10X:
    • Remove allele information (e.g., *01) and modify /DV occurrences.

  • Adaptive:
    • Apply renaming rules such as adding gene-level designations and zero-padding single-digit numbers.

    • Convert constant genes to 'NoData' (Adaptive only captures VDJ) which will become NA after the merge in convert_gene().

Parameters:
  • data_dir (str) – Directory containing FASTA files

  • species (str) – Name of species that will be used when running TCRconvert with these lookup tables.

Returns:

Path to the new lookup directory

Return type:

str

Example:

>>> import tcrconvert
>>> fastadir = tcrconvert.get_example_path('fasta_dir')
>>> tcrconvert.build_lookup.build_lookup_from_fastas(fastadir, 'rabbit') 
'...tcrconvert/rabbit'
build_lookup.build_lookup_from_fastas_cli(*args: t.Any, **kwargs: t.Any) t.Any

Create lookup tables :Example:

$ tcrconvert build -i tcrconvert/examples/fasta_dir/ -s rabbit