FAQ
=====

How does TCRconvert work?
---------------------------

TCRconvert performs a ``merge`` between the input data and a lookup table that
lists each gene's name under every supported nomenclature. These lookup tables are
constructed from IMGT reference FASTA files and account for the specific naming
peculiarities of each platform. The built-in lookup tables are located under
``tcrconvert/data/``. The code that builds the lookup tables, and so demonstrates
the conversion logic, is in the ``build_lookup_from_fastas()`` function of the
``tcrconvert.build_lookup`` module.
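
As a rough illustration of the merge described above, the sketch below joins a
toy input table to a toy lookup table with pandas. The lookup rows and the column
names ``tenx``, ``imgt``, and ``adaptive`` are assumptions made for this example.
They are not a copy of the built-in tables, and this is not TCRconvert's own code.

.. code-block:: python

    import pandas as pd

    # Toy lookup table: one row per gene, one column per naming convention.
    # Rows and column names are illustrative assumptions for this sketch.
    lookup = pd.DataFrame({
        "tenx": ["TRAV1-2", "TRBJ1-1"],
        "imgt": ["TRAV1-2*01", "TRBJ1-1*01"],
        "adaptive": ["TCRAV01-02*01", "TCRBJ01-01*01"],
    })

    # Toy 10X-style input with a single V gene column.
    tcr_df = pd.DataFrame({"v_gene": ["TRAV1-2", "TRAV1-2"]})

    # A left merge keeps every input row, in order, and attaches the
    # matching name from the target convention.
    merged = tcr_df.merge(
        lookup[["tenx", "adaptive"]],
        how="left",
        left_on="v_gene",
        right_on="tenx",
    )
    converted = merged["adaptive"].tolist()

A left merge is used here so that the output has the same number of rows as the
input, whether or not every gene has a match.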


What input columns are required?
----------------------------------

TCRconvert expects at least one V, D, J, and/or C gene column. You can use standard
10X and Adaptive column names or custom names.


What if I have missing genes?
-------------------------------

``NA`` values in the input dataframe will remain ``NA`` in the output. Genes that
are not found in the lookup table (which is based on the IMGT reference) will be
converted to ``NA``. To check which genes are covered, see the built-in lookup
tables under ``tcrconvert/data/``.
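
The behavior described above can be mimicked with a plain dictionary lookup in
pandas. This is a minimal sketch, not the library's implementation: the mapping
``tenx_to_imgt`` and the gene name ``TRBV99-9`` are made up for illustration.

.. code-block:: python

    import numpy as np
    import pandas as pd

    # Hypothetical 10X -> IMGT mapping with a single entry.
    tenx_to_imgt = {"TRBV7-2": "TRBV7-2*01"}

    # One known gene, one missing value, one name absent from the mapping.
    genes = pd.Series(["TRBV7-2", np.nan, "TRBV99-9"])

    # Series.map leaves missing values missing and yields NaN for any
    # name that is not a key in the dictionary.
    converted = genes.map(tenx_to_imgt)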


Are gamma-delta TCRs supported?
----------------------------------

Yes, for human, mouse, and rhesus macaque.


How are alleles added from 10X data?
--------------------------------------

Since 10X does not provide allele-level information, all genes are assigned the allele ``*01``.


How are C genes converted to Adaptive?
----------------------------------------

Adaptive does not capture constant ("C") gene information. If converting to the
Adaptive format, all C genes will be set to ``NA``.


What column names should I use for my IMGT-formatted data?
------------------------------------------------------------

IMGT does not define standard column names, so TCRconvert assumes the 10X names
(``v_gene``, ``d_gene``, ``j_gene``, ``c_gene``). To use other names,
specify them as a list with ``frm_cols`` (library) or ``--column`` (command-line).


Can I input AIRR files?
-------------------------

Yes, just specify the AIRR column names (``v_call``, ``d_call``, ``j_call``, ``c_call``)
using ``frm_cols`` (library) or ``--column`` (command-line). You must still
specify the input naming convention with ``frm``.


What if I have custom column names?
-------------------------------------

If you're using non-standard column names that do not match 10X, Adaptive, or
Adaptive V2 formats, specify them with ``frm_cols`` (library) or
``--column`` (command-line).


What about odd names (e.g. ``TRAV14DV4``, ``TCRAV01-02/12-02``)?
------------------------------------------------------------------

Gene names containing "OR" or "DV" are accounted for in the lookup tables.

Combinations of gene names, like ``TCRAV01-02/12-02``, will be converted to ``NA``
because they are not in the IMGT reference.


What about genes that aren't in the reference?
------------------------------------------------

A warning message always lists all unique genes that could not be converted.

Genes not in the reference are replaced with ``NA`` in the converted columns by default.
If ``bad_genes_col=True``, the original unconverted gene names are retained in a
``bad_genes`` column that contains the comma-separated 'bad' names for each row.
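
To make the shape of such a column concrete, the sketch below builds a
comma-separated ``bad_genes`` column by hand. It only illustrates the idea and is
not how TCRconvert implements it. The set ``known``, the helper ``collect_bad``,
the unrecognized gene names, and the exact separator are assumptions for this
example.

.. code-block:: python

    import pandas as pd

    # Stand-in for the gene names present in the lookup table (assumption).
    known = ["TRAV1-2", "TRBJ1-1"]

    df = pd.DataFrame({
        "v_gene": ["TRAV1-2", "TRAV99"],
        "j_gene": ["TRBJ1-1", "TRBJ9-9"],
    })
    gene_cols = ["v_gene", "j_gene"]

    def collect_bad(row):
        """Return the row's unrecognized names joined by commas, or None."""
        bad = [g for g in row if pd.notna(g) and g not in known]
        return ",".join(bad) if bad else None

    # Record the original names first, then blank them out in the gene columns.
    df["bad_genes"] = df[gene_cols].apply(collect_bad, axis=1)
    df[gene_cols] = df[gene_cols].where(df[gene_cols].isin(known))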


Are non-human species supported?
----------------------------------

Mouse and rhesus macaque are supported out-of-the-box. For other species, see
the pages on building custom lookup tables for library or command-line usage.

The rhesus and mouse lookup tables were built from IMGT reference FASTAs and
gene tables. Mouse genes cover both "Mouse" and "Mouse C57BL/6J" as listed in IMGT.


What if my Adaptive data lacks ``x_resolved`` / ``xMaxResolved`` columns?
---------------------------------------------------------------------------

Create them by combining ``x_gene``/``xGeneName`` and
``x_allele``/``xGeneAllele`` with ``*`` as a separator. Example code:

.. code-block:: python

    import numpy as np
    import pandas as pd

    def create_col(gene_col, allele_col):
        # Join gene and allele with "*"; keep the bare gene name when the
        # allele is missing. Assumes both columns hold strings.
        return np.where(allele_col.notna(), gene_col + "*" + allele_col, gene_col)

    # adaptive_df is your Adaptive data, already loaded as a pandas DataFrame.
    new_df = adaptive_df.copy()

    # Run only the block below that matches the column names in your file.

    # Adaptive
    new_df['v_resolved'] = create_col(new_df['v_gene'], new_df['v_allele'])
    new_df['d_resolved'] = create_col(new_df['d_gene'], new_df['d_allele'])
    new_df['j_resolved'] = create_col(new_df['j_gene'], new_df['j_allele'])

    # Adaptive v2
    new_df['vMaxResolved'] = create_col(new_df['vGeneName'], new_df['vGeneAllele'])
    new_df['dMaxResolved'] = create_col(new_df['dGeneName'], new_df['dGeneAllele'])
    new_df['jMaxResolved'] = create_col(new_df['jGeneName'], new_df['jGeneAllele'])