Talk2data is a computational interface that allows scientists to “talk” with all of their individual single-cell datasets as a whole. While building the system, we have to tackle the following challenges:
Batch effect: Different datasets come from different platforms and use different expression units. We address this problem with a two-step transformation, as described below [1.1. Data transformation].
Inconsistent annotation: Across studies, the same condition (cell type/disease/stage …) can be annotated differently, depending on the authors' interests. To allow cells to be combined accurately for queries, we apply sets of controlled vocabularies across the database [1.2. BioTuring annotation].
Heavy computation: The current BioTuring database contains more than 16 million cells, making up billions of records. This number is growing exponentially, posing challenges for data storage, computational time and resources. To tackle this, we build a proper data structure for effective storage and fast extraction of information, while pre-indexing the data to save computational time and resources.
Real-time calculation: A key feature of the Talk2data system is its flexibility: scientists can ask questions about any subset of the data that interests them (e.g., epithelial cells in tumors of patients who smoke, or immune cells in lymph nodes of the lung). There are too many possible combinations to pre-index them all, so we also optimize the system for efficient real-time querying and calculation.
To deal with the batch effect problem when combining data from different platforms, we transform the data in two steps:
For each cell in each single-cell dataset, we rank all of its genes by expression level. The underlying rationale is this: if gene A is expressed more highly than gene B in a cell, gene A should rank higher than gene B, no matter which technology is used to measure their expression. Because several genes can have the same absolute expression value, tied genes are assigned the average of their original ranks (SciPy, 2021). Assuming that a cell cannot express more than 5000 genes (given the detection limit of current technology), we then convert each rank into 5000 – rank (Zhang, Ntranos, & Tse, 2020). In this setting, the gene with the highest expression value has rank 5000.
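The ranking step above can be sketched as follows. This is a minimal illustration, not the Talk2data implementation. The function names `average_descending_ranks` and `rank_transform` are hypothetical. The sketch assumes that ranks start at 0 for the most highly expressed gene (so that 5000 – rank gives 5000 for the top gene, as the text states) and that converted values are clipped at 0; neither detail is specified in the text. SciPy's `rankdata` with `method="average"` applies the same tie rule; the pure-Python helper is used here only to keep the example dependency-free.

```python
def average_descending_ranks(values):
    """Return 0-based descending ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the whole run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0  # average of positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_transform(values, cap=5000):
    """Convert one cell's expression values into cap - rank (assumed clipped at 0)."""
    return [max(cap - r, 0.0) for r in average_descending_ranks(values)]


# One cell with four genes; the first and third genes are tied for the top value.
cell = [5.0, 1.0, 5.0, 0.0]
print(rank_transform(cell))  # [4999.5, 4998.0, 4999.5, 4997.0]
```

Because only the ordering of genes within a cell matters, the output is the same whether the input is in raw counts or in any monotonically rescaled unit.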
Although the ranking system helps to overcome the differences between units, the perturbation is still too large: a slight change in the original expression values can lead to significant differences in rank values. Therefore, we further transform the data into zeros and ones, where one means the gene is expressed in the cell and zero means it is not. A gene is considered expressed in a cell if it has a positive expression value and ranks among the top 5000 genes of that cell. From here on, when we say gene A is expressed in cell B, we mean that gene A has a rank value smaller than or equal to 5000 in cell B.
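A minimal sketch of this binarization for a single cell is shown below. The function name `binarize_cell` is hypothetical, and the handling of ties at the cutoff (all genes tied with the 5000th value are kept) is an assumption made for simplicity; the text does not say how such ties are resolved.

```python
def binarize_cell(values, top_n=5000):
    """Return 1 for genes with positive expression among the top_n values, else 0.

    Assumption: genes tied with the top_n-th value are all kept, so slightly
    more than top_n genes can be marked as expressed when ties occur.
    """
    if not values:
        return []
    ordered = sorted(values, reverse=True)
    # Expression value of the top_n-th gene (or the lowest value if fewer genes).
    cutoff = ordered[min(top_n, len(ordered)) - 1]
    return [1 if v > 0 and v >= cutoff else 0 for v in values]


cell = [3.0, 0.0, 1.0, 2.0, 0.5]
print(binarize_cell(cell, top_n=2))  # [1, 0, 0, 1, 0]
print(binarize_cell(cell))           # [1, 0, 1, 1, 1]
```

With the default of 5000 and a cell that has fewer than 5000 detected genes, the rule reduces to "expression value greater than zero".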
Cell annotation plays an important role in the analytic process, and it should be consistent across all datasets so that cells with the same condition can be combined accurately. We have therefore invested considerable effort in building sets of controlled vocabularies. We now have our own cell type ontology for making authors’ cell type annotations consistent across the database. Other annotations, such as anatomy, disease, diagnosis, treatment, gender, etc., are curated specifically for each tissue.
Talk2Data is currently built on 2 databases: the BioTuring database and the GTEx database.
Scientists can view all the studies in the BioTuring public single-cell database on BioTuring website: https://bioturing.com/bbrowser/datasets
Talk2data also uses the Genotype-Tissue Expression (GTEx) database version 8 (https://gtexportal.org/home/) to let users study the expression of specific genes in human tissues. Samples were collected from 54 non-diseased tissue sites, and the molecular assays include WGS, WES, and RNA-Seq. To view the database information, select the Info icon in the top right corner.
Talk2Data database information with GTEx V8, as of Feb 2021.
To access Talk2Data, visit talk2data.bioturing.com. Enter your Talk2Data account details, then press Enter or click Login.
Talk2data has also been integrated into the BioTuring Browser (BBrowser) desktop application from version 2.6.15. To access the Talk2data system, you can click on the Talk2data icon below the Home page icon in BBrowser.
The Talk2data dashboard has two main parts: the Search box and the Analytics area.
Talk2Data uses the Jaccard index to find co-expressed genes.
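The Jaccard index between two genes g1 and g2 is calculated as follows: J(g1, g2) = |C(g1) ∩ C(g2)| / |C(g1) ∪ C(g2)|, where C(g) denotes the set of cells in which gene g is expressed.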
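As a minimal sketch of this measure, assume each gene is stored as a 0/1 vector over the same set of cells (the binarized form described in the data transformation section). The function name `jaccard_index` and the choice to return 0.0 when neither gene is expressed in any cell are assumptions made for illustration.

```python
def jaccard_index(gene_a, gene_b):
    """Jaccard index of two equal-length 0/1 expression vectors.

    Counts cells expressing both genes, divided by cells expressing at least one.
    """
    if len(gene_a) != len(gene_b):
        raise ValueError("both genes must be measured over the same cells")
    both = sum(1 for a, b in zip(gene_a, gene_b) if a and b)
    either = sum(1 for a, b in zip(gene_a, gene_b) if a or b)
    # Assumption: define the index as 0.0 when no cell expresses either gene.
    return both / either if either else 0.0


g1 = [1, 1, 0, 0, 1]
g2 = [1, 0, 0, 1, 1]
print(jaccard_index(g1, g2))  # 0.5 (2 shared cells out of 4 expressing either gene)
```

A value of 1.0 means the two genes are expressed in exactly the same cells, and 0.0 means they are never expressed together.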
The Talk2data system supports querying by gene symbol, for both HGNC and HUGO nomenclature. Genes shared between human and mouse are represented in the same way, in uppercase. You can type just the prefix of a gene symbol, and Talk2data will suggest a list of matching gene names. Alternatively, you can paste a list of genes, separated by commas or spaces, from your clipboard.
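The input handling described above can be sketched as follows. This is only an illustration of the behaviour a user sees, not Talk2data's actual code: the function names `parse_gene_query` and `suggest_symbols`, the uppercase normalization, the removal of duplicates, and the suggestion limit are all assumptions.

```python
import re


def parse_gene_query(text):
    """Split pasted text on commas and/or whitespace into unique uppercase symbols."""
    genes = []
    seen = set()
    for token in re.split(r"[,\s]+", text.strip()):
        symbol = token.upper()
        if symbol and symbol not in seen:
            seen.add(symbol)
            genes.append(symbol)
    return genes


def suggest_symbols(prefix, known_symbols, limit=10):
    """Return up to `limit` known symbols starting with the typed prefix."""
    wanted = prefix.upper()
    return sorted(s for s in known_symbols if s.upper().startswith(wanted))[:limit]


print(parse_gene_query("cd3e, CD4  Cd8a,IGKC"))
# ['CD3E', 'CD4', 'CD8A', 'IGKC']
print(suggest_symbols("cd", ["CD4", "IGKC", "CD3E", "CD8A"]))
# ['CD3E', 'CD4', 'CD8A']
```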
After querying gene(s), you will get a dot plot. The sizes of the circles represent the level of co-expression.
The panel on the right side of the Co-expression tab also shows which cell types express the gene of interest and what percentage of the cells of each type express it.
For example, in the image below, IGKC is expressed in 324,397 of the 529,600 B cells in the BioTuring database (61.25%).
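The percentage in this example is simply the number of expressing cells divided by the total number of cells of that type. A minimal sketch of the per-cell-type calculation is given below, assuming one 0/1 expression flag and one cell type label per cell; the function name `percent_expressing` and the input layout are hypothetical.

```python
from collections import defaultdict


def percent_expressing(expressed_flags, cell_types):
    """Map each cell type to (expressing cells, total cells, percentage)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for flag, cell_type in zip(expressed_flags, cell_types):
        totals[cell_type] += 1
        if flag:
            hits[cell_type] += 1
    return {
        ct: (hits[ct], totals[ct], round(100.0 * hits[ct] / totals[ct], 2))
        for ct in totals
    }


flags = [1, 0, 1, 1, 0]
types = ["B cell", "B cell", "T cell", "B cell", "T cell"]
print(percent_expressing(flags, types))
# {'B cell': (2, 3, 66.67), 'T cell': (1, 2, 50.0)}

# The IGKC figures quoted in the text:
print(round(100.0 * 324397 / 529600, 2))  # 61.25
```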
To see the expression of all queried gene(s) across all cell types and subtypes, go to the Cell type tab. The x-axis of the scatter plot shows all the cell types. The y-axis shows the gene expression levels on a rank-based scale (please refer to our section on Data transformation). Each dot represents a cell.
Go to the Studies tab below the Search box to study the expression of queried gene(s) on each dataset in the BioTuring database.
This lets you look for studies in which your gene(s) of interest are expressed.
– Scatter (t-SNE): this is the default visualization. The 1st t-SNE plot shows the data colored by BioTuring – Cell Type metadata. The other t-SNE plots show the expression level of your gene(s). For more information on the expression unit, please refer to Section 10_Expression Unit.
– Violin plot: The violin plot shows the expression of your gene(s) on each cell type.
– Intersection: The intersection shows the composition of cells that express each set of the queried gene(s).
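To make the Intersection view concrete, the sketch below counts how many cells express each exact combination of the queried genes. It is only an illustration under stated assumptions: each cell is represented as a dictionary of gene symbol to 0/1 flag, cells expressing none of the queried genes are skipped, and the function name `intersection_counts` is hypothetical.

```python
from collections import Counter


def intersection_counts(cells, genes):
    """Count cells by the exact subset of queried genes they express."""
    counts = Counter()
    for cell in cells:
        combo = frozenset(g for g in genes if cell.get(g, 0))
        if combo:  # assumption: ignore cells expressing none of the queried genes
            counts[combo] += 1
    return counts


cells = [
    {"CD3E": 1, "CD4": 1},
    {"CD3E": 1, "CD4": 0},
    {"CD3E": 0, "CD4": 0},
    {"CD3E": 1, "CD4": 1},
]
for combo, n in intersection_counts(cells, ["CD3E", "CD4"]).items():
    print(sorted(combo), n)
```

In this toy input, two cells express both CD3E and CD4, one expresses CD3E only, and one expresses neither and is left out.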
The RNA expression data of 54 human tissues from the GTEx database was collected and incorporated into the Talk2data system. This feature allows you to compare the gene expression on the tissue level. You can open this panel by clicking on the GTEx v8 tab as described below.
– t-SNE 2D: We applied the t-SNE algorithm to reduce the expression data of the GTEx samples to 2 dimensions. Each dot represents a sample and is colored by its tissue. In the expression plot, the color of each sample represents the expression level of the gene in that sample, ranging from yellow to red as the expression level goes from low to high.
– Violin plot: The expression of each gene across tissues is shown as violin plots.
To query potential marker genes of a cell type, select the Celltypes option in the drop-down menu at the top left corner, then select a cell type of interest (e.g. T cell).
To query potential marker genes of a cell subtype, select the Celltypes option, then select or search for a subtype of interest (e.g. gamma-delta T cell).
This function helps you search the cell populations across the whole database that match some expression conditions.