TF-IDF (Term Frequency-Inverse Document Frequency) normalization is borrowed from natural language processing to identify features that are highly expressed in specific samples but not widely expressed across the entire dataset.
There are several different implementations that apply log or binarization to different terms. The sub_method = c(1:3) options and the dgCMatrix optimizations are based on the ArchR implementations.
$$\LARGE TF_{i,j} = \frac{x_{i,j}}{\sum_{k} x_{k,j}} $$
$$\LARGE IDF_{i} = \frac{n_{samples}}{\sum_{j} x_{i,j}} $$
$$\LARGE IBDF_{i} = \frac{n_{samples}}{1 + n_{samples \: where \: feature \: i > 0}} $$
Implementations (sub_method):
$$\large (default) \quad TFIDF_{i,j} = TF_{i,j} \times \log(IBDF_{i} + 1) $$
$$\large (1) \quad TFIDF_{i,j} = TF_{i,j} \times \log(IDF_{i} + 1) $$
$$\large (2) \quad TFIDF_{i,j} = \log(TF_{i,j} \times IDF_{i} \times S + 1) \quad $$
$$\large (3) \quad TFIDF_{i,j} = \log(TF_{i,j} + 1) \times \log(IDF_{i} + 1) $$
Where:
(\(x_{i,j}\)) is the raw count for feature \(i\) in sample \(j\)
(\(TF_{i,j}\)) is the term frequency of feature \(i\) in sample \(j\)
(\(IDF_{i}\)) is the inverse document frequency of feature \(i\)
(\(IBDF_{i}\)) is the inverse binarized document frequency of feature \(i\)
(\(TFIDF_{i,j}\)) is the final TF-IDF normalized value
(\(S\)) is a scalefactor (default = 10000)
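The formulas above can be traced on a small dense matrix. The following is a minimal base R sketch, not the Giotto implementation: the toy matrix `x` and all variable names are made up for illustration, and features are assumed to be in rows with samples in columns.

```r
# Toy count matrix (illustrative only): 3 features (rows, i) x 4 samples (columns, j).
x <- matrix(
    c(4, 0, 1, 0,
      2, 2, 2, 2,
      0, 0, 0, 6),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("featA", "featB", "featC"), paste0("s", 1:4))
)
n_samples <- ncol(x)
S <- 10000 # scalefactor, matching the documented default

# TF: each count divided by the total count of its sample (column).
tf <- sweep(x, 2, colSums(x), "/")
# IDF: number of samples divided by the total count of the feature (row).
idf <- n_samples / rowSums(x)
# IBDF: number of samples divided by 1 + the number of samples
# in which the feature is detected.
ibdf <- n_samples / (1 + rowSums(x > 0))

# tf is features x samples and idf / ibdf hold one value per feature, so
# R's column-wise recycling applies the weight of feature i to row i.
tfidf_default <- tf * log(ibdf + 1)
tfidf_1 <- tf * log(idf + 1)
tfidf_2 <- log(tf * idf * S + 1)
tfidf_3 <- log(tf + 1) * log(idf + 1)
```

In all four variants a zero count stays zero, which is why the result can remain sparse.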
normalized object
L2 normalization is commonly performed after TF-IDF normalization
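As a rough illustration of that follow-up step, the base R sketch below scales each sample to unit Euclidean length. It assumes that normalization is applied per sample (per column); this is an assumption for illustration and is not taken from the norm_l2 documentation. The matrix `m` is a made-up stand-in for TF-IDF output.

```r
# Stand-in for a TF-IDF matrix (illustrative only): features in rows,
# samples in columns. matrix() fills column by column.
m <- matrix(
    c(3, 4, 0,
      0, 0, 2,
      1, 2, 2),
    nrow = 3,
    dimnames = list(c("featA", "featB", "featC"), c("s1", "s2", "s3"))
)

# Euclidean (L2) norm of each sample column.
col_norms <- sqrt(colSums(m^2))

# Divide every column by its norm so that each sample has unit length.
# A column that is all zeros would produce NaN here and needs separate handling.
m_l2 <- sweep(m, 2, col_norms, "/")
```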
sub_method | Either numeric 1, 2, or 3, or "default". Determines which set of defaults to use during the TF-IDF calculation. Methods 1-3 map to the same LSIMethod settings in ArchR. See the sub_method section below. |
log_tf | logical (overrides sub_method defaults). Whether to log transform TF values (includes a +1 offset). |
log_idf | logical (overrides sub_method defaults). Whether to log transform IDF values (includes a +1 offset). |
log_tf_idf | logical (overrides sub_method defaults). Whether to log transform the TF-IDF value (also applies a scalefactor (\(S\)) and +1 offset before the log operation). |
binarized_rowsums | logical (overrides sub_method defaults). Whether to calculate IBDF instead of IDF, where the calculation is based on the presence of a feature as opposed to its count. |
scalefactor | numeric (default = 10000). A scalefactor used when log_tf_idf = TRUE. |
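To make the effect of these flags concrete, here is a minimal base R sketch of how they could combine. `tfidf_sketch()` is a hypothetical helper, not a Giotto function. The order in which the flags are applied is an assumption chosen so that the four documented sub_method formulas are reproduced; the behavior for other flag combinations may differ in the actual implementation.

```r
# Hypothetical helper (not part of Giotto): flag-driven TF-IDF on a dense
# matrix with features in rows and samples in columns.
tfidf_sketch <- function(x,
                         log_tf = FALSE,
                         log_idf = FALSE,
                         log_tf_idf = FALSE,
                         binarized_rowsums = FALSE,
                         scalefactor = 10000) {
    # Term frequency: counts divided by per-sample (column) totals.
    tf <- sweep(x, 2, colSums(x), "/")
    if (binarized_rowsums) {
        # IBDF: based on the number of samples where the feature is present.
        idf <- ncol(x) / (1 + rowSums(x > 0))
    } else {
        # IDF: based on the summed counts of the feature.
        idf <- ncol(x) / rowSums(x)
    }
    if (log_tf) tf <- log(tf + 1)
    if (log_idf) idf <- log(idf + 1)
    out <- tf * idf # per-feature weight is recycled down each column
    if (log_tf_idf) out <- log(out * scalefactor + 1)
    out
}

# Toy counts (illustrative only).
x <- matrix(
    c(4, 0, 1, 0,
      2, 2, 2, 2,
      0, 0, 0, 6),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("featA", "featB", "featC"), paste0("s", 1:4))
)

# Flags corresponding to the documented "default" and sub_method 2 settings.
res_default <- tfidf_sketch(x, log_idf = TRUE, binarized_rowsums = TRUE)
res_2 <- tfidf_sketch(x, log_tf_idf = TRUE)
```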
sub_method
sub_method can be one of "default" or any of the other implementations from 1 to 3. These apply some defaults to the way that TF-IDF is calculated. The individual log_* and binarized_rowsums params, when supplied, override these defaults.
"default" - default Giotto implementation: log_idf = TRUE, binarized_rowsums = TRUE
1 - Method introduced in Cusanovich et al. 2018: log_idf = TRUE
2 - Method introduced in Stuart et al. 2021: log_tf_idf = TRUE
3 - Method 3 in ArchR addIterativeLSI(): log_tf = TRUE, log_idf = TRUE
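The way presets and explicit overrides interact can be sketched as a small lookup. `resolve_tfidf_flags()` below is a hypothetical helper written for illustration, not a Giotto function. It assumes that any flag a preset does not list is FALSE, which is consistent with the formulas for each implementation but is not stated outright.

```r
# Hypothetical helper (not part of Giotto): combine a sub_method preset
# with explicitly supplied flags; NULL means "not supplied".
resolve_tfidf_flags <- function(sub_method = "default",
                                log_tf = NULL,
                                log_idf = NULL,
                                log_tf_idf = NULL,
                                binarized_rowsums = NULL) {
    presets <- list(
        "default" = list(log_idf = TRUE, binarized_rowsums = TRUE),
        "1" = list(log_idf = TRUE),
        "2" = list(log_tf_idf = TRUE),
        "3" = list(log_tf = TRUE, log_idf = TRUE)
    )
    key <- as.character(sub_method)
    if (!key %in% names(presets)) {
        stop("sub_method must be \"default\", 1, 2, or 3")
    }
    # Assumption: flags not named by the preset start out FALSE.
    flags <- list(
        log_tf = FALSE, log_idf = FALSE,
        log_tf_idf = FALSE, binarized_rowsums = FALSE
    )
    flags <- modifyList(flags, presets[[key]])
    # Keep only the flags the caller actually supplied, then let them win.
    overrides <- Filter(Negate(is.null), list(
        log_tf = log_tf, log_idf = log_idf,
        log_tf_idf = log_tf_idf, binarized_rowsums = binarized_rowsums
    ))
    modifyList(flags, overrides)
}

# Preset 3 as documented, and the "default" preset with log_idf switched off.
flags_3 <- resolve_tfidf_flags(3)
flags_custom <- resolve_tfidf_flags("default", log_idf = FALSE)
```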
Cusanovich, D., Reddington, J., Garfield, D. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018). https://doi.org/10.1038/nature25981
Stuart, T., Srivastava, A., Madad, S., Lareau, C.A., Satija, R. Single-cell chromatin state analysis with Signac. Nat Methods 18, 1333–1341 (2021). https://doi.org/10.1038/s41592-021-01282-5
Granja, J.M., Corces, M.R., Pierce, S.E., Bagdatli, S.T., Choudhry, H., Chang, H.Y., Greenleaf, W.J. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet 53, 403–411 (2021). https://doi.org/10.1038/s41588-021-00790-6
Other normalization parameters: norm_arcsinh, norm_default, norm_l2, norm_library, norm_log, norm_osmfish, norm_pearson, norm_quantile
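# load a small example expression object
e <- GiottoData::loadSubObjectMini("exprObj")
# "default" Giotto implementation (log_idf = TRUE, binarized_rowsums = TRUE)
processData(e, normParam("tf-idf"))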
#> An object of class exprObj : "normalized"
#> spat_unit : "aggregate"
#> feat_type : "rna"
#> provenance: z0 z1
#>
#> contains:
#> 337 x 462 sparse Matrix of class "dgCMatrix"
#>
#> Mlc1 . . . . . . 0.04024682 . . 0.04674900 .
#> Gprc5b . 0.04891466 . 0.04435909 . . 0.03671471 . 0.04054054 0.04207417 .
#> Gfap 0.2961592 . 0.06344594 0.03339819 . . 0.05997783 . 0.02869596 0.03430300 .
#>
#> Mlc1 . . ......
#> Gprc5b 0.02977132 . ......
#> Gfap . . ......
#>
#> ........suppressing 449 columns and 331 rows
#>
#> Adgrf4 . . . . . . . . . . . . . ......
#> Epha2 . . . . . . . . . . . . . ......
#> Blank-139 . . . . . . . . . . . . . ......
#>
#> First four colnames:
#> 240649020551054330404932383065726870513
#> 274176126496863898679934791272921588227
#> 323754550002953984063006506310071917306
#> 87260224659312905497866017323180367450
# sub_method 1: Cusanovich et al. 2018 (log_idf = TRUE)
processData(e, normParam("tf-idf", sub_method = 1))
#> An object of class exprObj : "normalized"
#> spat_unit : "aggregate"
#> feat_type : "rna"
#> provenance: z0 z1
#>
#> contains:
#> 337 x 462 sparse Matrix of class "dgCMatrix"
#>
#> Mlc1 . . . . . . 0.013323296 . .
#> Gprc5b . 0.01330919 . 0.012069661 . . 0.009989702 . 0.011030674
#> Gfap 0.06768595 . 0.0145003 0.007633016 . . 0.013707681 . 0.006558341
#>
#> Mlc1 0.015475776 . . . ......
#> Gprc5b 0.011447959 . 0.008100477 . ......
#> Gfap 0.007839806 . . . ......
#>
#> ........suppressing 449 columns and 331 rows
#>
#> Adgrf4 . . . . . . . . . . . . . ......
#> Epha2 . . . . . . . . . . . . . ......
#> Blank-139 . . . . . . . . . . . . . ......
#>
#> First four colnames:
#> 240649020551054330404932383065726870513
#> 274176126496863898679934791272921588227
#> 323754550002953984063006506310071917306
#> 87260224659312905497866017323180367450
# sub_method 2: Stuart et al. 2021 (log_tf_idf = TRUE)
processData(e, normParam("tf-idf", sub_method = 2))
#> An object of class exprObj : "normalized"
#> spat_unit : "aggregate"
#> feat_type : "rna"
#> provenance: z0 z1
#>
#> contains:
#> 337 x 462 sparse Matrix of class "dgCMatrix"
#>
#> Mlc1 . . . . . . 5.228877 . . 5.377893 . . .
#> Gprc5b . 5.102927 . 5.005792 . . 4.818045 . 4.916407 4.953272 . 4.610297 .
#> Gfap 6.682494 . 5.146369 4.509905 . . 5.090492 . 4.359961 4.536346 . . .
#>
#> Mlc1 ......
#> Gprc5b ......
#> Gfap ......
#>
#> ........suppressing 449 columns and 331 rows
#>
#> Adgrf4 . . . . . . . . . . . . . ......
#> Epha2 . . . . . . . . . . . . . ......
#> Blank-139 . . . . . . . . . . . . . ......
#>
#> First four colnames:
#> 240649020551054330404932383065726870513
#> 274176126496863898679934791272921588227
#> 323754550002953984063006506310071917306
#> 87260224659312905497866017323180367450
# sub_method 3: ArchR method 3 (log_tf = TRUE, log_idf = TRUE)
processData(e, normParam("tf-idf", sub_method = 3))
#> An object of class exprObj : "normalized"
#> spat_unit : "aggregate"
#> feat_type : "rna"
#> provenance: z0 z1
#>
#> contains:
#> 337 x 462 sparse Matrix of class "dgCMatrix"
#>
#> Mlc1 . . . . . . 0.013184339 . .
#> Gprc5b . 0.01309169 . 0.011890432 . . 0.009866505 . 0.01088072
#> Gfap 0.06138386 . 0.01418047 0.007543147 . . 0.013421407 . 0.00649185
#>
#> Mlc1 0.015288711 . . . ......
#> Gprc5b 0.011286555 . 0.00801922 . ......
#> Gfap 0.007745042 . . . ......
#>
#> ........suppressing 449 columns and 331 rows
#>
#> Adgrf4 . . . . . . . . . . . . . ......
#> Epha2 . . . . . . . . . . . . . ......
#> Blank-139 . . . . . . . . . . . . . ......
#>
#> First four colnames:
#> 240649020551054330404932383065726870513
#> 274176126496863898679934791272921588227
#> 323754550002953984063006506310071917306
#> 87260224659312905497866017323180367450