TF-IDF Normalization

TF-IDF (Term Frequency-Inverse Document Frequency) normalization is a technique borrowed from natural language processing. It identifies features that are highly expressed in specific samples but not widely expressed across the entire dataset.

Several implementations exist, differing in which terms are log-transformed or binarized. `sub_method` values 1 to 3 and the `dgCMatrix` optimizations are based on the ArchR implementations.

$$\LARGE TF_{i,j} = \frac{x_{i,j}}{\sum_{i'} x_{i',j}} $$

$$\LARGE IDF_{i} = \frac{n_{samples}}{\sum_{j} x_{i,j}} $$

$$\LARGE IBDF_{i} = \frac{n_{samples}}{1 + n_{samples \: where \: feature \: i > 0}} $$

Implementations (sub_method):

$$\large (default) \quad TFIDF_{i,j} = TF_{i,j} \times \log(IBDF_{i} + 1) $$

$$\large (1) \quad TFIDF_{i,j} = TF_{i,j} \times \log(IDF_{i} + 1) $$

$$\large (2) \quad TFIDF_{i,j} = \log(TF_{i,j} \times IDF_{i} \times S + 1) $$

$$\large (3) \quad TFIDF_{i,j} = \log(TF_{i,j} + 1) \times \log(IDF_{i} + 1) $$

Where:

- $x_{i,j}$ is the raw count for feature $i$ in sample $j$
- $TF_{i,j}$ is the term frequency of feature $i$ in sample $j$
- $IDF_{i}$ is the inverse document frequency of feature $i$
- $IBDF_{i}$ is the inverse binarized document frequency of feature $i$
- $TFIDF_{i,j}$ is the final TF-IDF normalized value
- $S$ is a scale factor (`scalefactor`, default = 10000)
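
As a rough illustration of the formulas above, the following base R sketch computes each term on a small dense matrix. The toy matrix `x` and all variable names are made up for this example. It is not Giotto's or ArchR's internal code, and it ignores the `dgCMatrix` optimizations mentioned earlier.

```r
# Toy count matrix: 3 features (rows) x 4 samples (columns).
# All names here are illustrative only.
x <- matrix(
    c(2, 0, 1, 0,
      0, 3, 0, 0,
      1, 1, 1, 1),
    nrow = 3, byrow = TRUE,
    dimnames = list(paste0("feat", 1:3), paste0("cell", 1:4))
)
n_samples <- ncol(x)
S <- 10000 # scale factor

# TF: each count divided by the total count of its sample (column).
tf <- sweep(x, 2, colSums(x), "/")

# IDF: number of samples over the summed counts of each feature (row).
idf <- n_samples / rowSums(x)

# IBDF: number of samples over (1 + number of samples where the feature > 0).
ibdf <- n_samples / (1 + rowSums(x > 0))

# A length-nrow vector is recycled down the columns, so `tf * v`
# scales row i of `tf` by v[i].
tfidf_default <- tf * log(ibdf + 1)
tfidf_1 <- tf * log(idf + 1)
tfidf_2 <- log(tf * idf * S + 1)
tfidf_3 <- log(tf + 1) * log(idf + 1)
```

Zero counts stay zero in all four variants, which is why the results can remain sparse. The sketch assumes every sample and every feature has at least one count; an all-zero row or column would need separate handling to avoid division by zero.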

Value

normalized object

Note

L2 normalization is commonly performed after TF-IDF normalization.
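
For orientation, here is a minimal base R sketch of what an L2 step could look like. It assumes that L2 normalization means scaling each sample (column) to unit Euclidean length; the matrix `tfidf` is a made-up stand-in, and this is not Giotto's own L2 implementation.

```r
# Stand-in for a TF-IDF normalized matrix (3 features x 2 samples).
tfidf <- matrix(
    c(3, 4, 0,
      0, 0, 2),
    nrow = 3,
    dimnames = list(paste0("feat", 1:3), paste0("cell", 1:2))
)

# Euclidean (L2) norm of each sample (column).
l2 <- sqrt(colSums(tfidf^2))

# Leave all-zero samples unchanged instead of dividing by zero.
l2[l2 == 0] <- 1

# Divide each column by its norm so every sample has unit length.
tfidf_l2 <- sweep(tfidf, 2, l2, "/")
```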

params

`sub_method`	Either numeric 1, 2, or 3, or "default". Determines which set of defaults to use during the TF-IDF calculation. Methods 1-3 map to the same `LSIMethod` settings in ArchR. See the `sub_method` section below.
`log_tf`	logical (overrides `sub_method` defaults). Whether to log transform TF values (includes a +1 offset).
`log_idf`	logical (overrides `sub_method` defaults). Whether to log transform IDF values (includes a +1 offset).
`log_tf_idf`	logical (overrides `sub_method` defaults). Whether to log transform the TF-IDF value (also applies the scale factor ($S$) and a +1 offset before the log operation).
`binarized_rowsums`	logical (overrides `sub_method` defaults). Whether to calculate IBDF instead of IDF, where the calculation is based on the presence of a feature as opposed to its count.
`scalefactor`	numeric (default = 10000). A scalefactor used when `log_tf_idf = TRUE`.

`sub_method`

`sub_method` can be "default" or one of the numbered implementations 1 to 3. Each value applies a set of defaults to the way TF-IDF is calculated. The individual `log_*` and `binarized_rowsums` params override these defaults.

- "default": default Giotto implementation
  - `log_idf = TRUE`
  - `binarized_rowsums = TRUE`
- 1: method introduced in Cusanovich et al. 2018
  - `log_idf = TRUE`
- 2: method introduced in Stuart et al. 2021
  - `log_tf_idf = TRUE`
- 3: method 3 in ArchR `addIterativeLSI()`
  - `log_tf = TRUE`
  - `log_idf = TRUE`
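
The interaction between presets and overrides can be sketched as follows. `resolve_tfidf_flags()` is a hypothetical helper written for this illustration, not a Giotto function, and it assumes that any flag not listed for a preset defaults to `FALSE`.

```r
# Hypothetical helper (not part of Giotto): resolve the four logical flags
# from a sub_method preset, letting explicitly supplied flags win.
resolve_tfidf_flags <- function(sub_method = "default",
                                log_tf = NULL,
                                log_idf = NULL,
                                log_tf_idf = NULL,
                                binarized_rowsums = NULL) {
    presets <- list(
        "default" = list(log_idf = TRUE, binarized_rowsums = TRUE),
        "1" = list(log_idf = TRUE),
        "2" = list(log_tf_idf = TRUE),
        "3" = list(log_tf = TRUE, log_idf = TRUE)
    )
    key <- as.character(sub_method)
    if (!key %in% names(presets)) {
        stop("sub_method must be \"default\", 1, 2, or 3")
    }

    # Assumption: flags not named by the preset start as FALSE.
    flags <- list(
        log_tf = FALSE, log_idf = FALSE,
        log_tf_idf = FALSE, binarized_rowsums = FALSE
    )
    flags <- utils::modifyList(flags, presets[[key]])

    # Only arguments the caller actually set (non-NULL) override the preset.
    overrides <- list(
        log_tf = log_tf, log_idf = log_idf,
        log_tf_idf = log_tf_idf, binarized_rowsums = binarized_rowsums
    )
    overrides <- Filter(Negate(is.null), overrides)
    utils::modifyList(flags, overrides)
}

# Preset 2 with an explicit override of log_tf.
str(resolve_tfidf_flags(sub_method = 2, log_tf = TRUE))
```

Using `NULL` as the "not supplied" marker keeps an explicit `FALSE` distinguishable from an unset argument, so a caller can also switch a preset's flag off.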

References

Cusanovich, D., Reddington, J., Garfield, D. et al. The cis-regulatory dynamics of embryonic development at single-cell resolution. Nature 555, 538–542 (2018). https://doi.org/10.1038/nature25981

Stuart, T., Srivastava, A., Madad, S., Lareau, C.A., Satija, R. Single-cell chromatin state analysis with Signac. Nat Methods 18, 1333–1341 (2021). https://doi.org/10.1038/s41592-021-01282-5

Granja, J.M., Corces, M.R., Pierce, S.E., Bagdatli, S.T., Choudhry, H., Chang, H.Y., Greenleaf, W.J. ArchR is a scalable software package for integrative single-cell chromatin accessibility analysis. Nat Genet 53, 403–411 (2021). https://doi.org/10.1038/s41588-021-00790-6

Examples

e <- GiottoData::loadSubObjectMini("exprObj")
processData(e, normParam("tf-idf"))
#> An object of class exprObj : "normalized"
#> spat_unit : "aggregate"
#> feat_type : "rna"
#> provenance: z0 z1 
#> 
#> contains:
#> 337 x 462 sparse Matrix of class "dgCMatrix"
#>                                                                                           
#> Mlc1   .         .          .          .          . . 0.04024682 . .          0.04674900 .
#> Gprc5b .         0.04891466 .          0.04435909 . . 0.03671471 . 0.04054054 0.04207417 .
#> Gfap   0.2961592 .          0.06344594 0.03339819 . . 0.05997783 . 0.02869596 0.03430300 .
#>                           
#> Mlc1   .          . ......
#> Gprc5b 0.02977132 . ......
#> Gfap   .          . ......
#> 
#>  ........suppressing 449 columns and 331 rows 
#>                                           
#> Adgrf4    . . . . . . . . . . . . . ......
#> Epha2     . . . . . . . . . . . . . ......
#> Blank-139 . . . . . . . . . . . . . ......
#> 
#>  First four colnames:
#>  240649020551054330404932383065726870513
#>  274176126496863898679934791272921588227
#>  323754550002953984063006506310071917306
#>  87260224659312905497866017323180367450 
processData(e, normParam("tf-idf", sub_method = 1))
#> An object of class exprObj : "normalized"
#> spat_unit : "aggregate"
#> feat_type : "rna"
#> provenance: z0 z1 
#> 
#> contains:
#> 337 x 462 sparse Matrix of class "dgCMatrix"
#>                                                                                 
#> Mlc1   .          .          .         .           . . 0.013323296 . .          
#> Gprc5b .          0.01330919 .         0.012069661 . . 0.009989702 . 0.011030674
#> Gfap   0.06768595 .          0.0145003 0.007633016 . . 0.013707681 . 0.006558341
#>                                          
#> Mlc1   0.015475776 . .           . ......
#> Gprc5b 0.011447959 . 0.008100477 . ......
#> Gfap   0.007839806 . .           . ......
#> 
#>  ........suppressing 449 columns and 331 rows 
#>                                           
#> Adgrf4    . . . . . . . . . . . . . ......
#> Epha2     . . . . . . . . . . . . . ......
#> Blank-139 . . . . . . . . . . . . . ......
#> 
#>  First four colnames:
#>  240649020551054330404932383065726870513
#>  274176126496863898679934791272921588227
#>  323754550002953984063006506310071917306
#>  87260224659312905497866017323180367450 
processData(e, normParam("tf-idf", sub_method = 2))
#> An object of class exprObj : "normalized"
#> spat_unit : "aggregate"
#> feat_type : "rna"
#> provenance: z0 z1 
#> 
#> contains:
#> 337 x 462 sparse Matrix of class "dgCMatrix"
#>                                                                                         
#> Mlc1   .        .        .        .        . . 5.228877 . .        5.377893 . .        .
#> Gprc5b .        5.102927 .        5.005792 . . 4.818045 . 4.916407 4.953272 . 4.610297 .
#> Gfap   6.682494 .        5.146369 4.509905 . . 5.090492 . 4.359961 4.536346 . .        .
#>              
#> Mlc1   ......
#> Gprc5b ......
#> Gfap   ......
#> 
#>  ........suppressing 449 columns and 331 rows 
#>                                           
#> Adgrf4    . . . . . . . . . . . . . ......
#> Epha2     . . . . . . . . . . . . . ......
#> Blank-139 . . . . . . . . . . . . . ......
#> 
#>  First four colnames:
#>  240649020551054330404932383065726870513
#>  274176126496863898679934791272921588227
#>  323754550002953984063006506310071917306
#>  87260224659312905497866017323180367450 
processData(e, normParam("tf-idf", sub_method = 3))
#> An object of class exprObj : "normalized"
#> spat_unit : "aggregate"
#> feat_type : "rna"
#> provenance: z0 z1 
#> 
#> contains:
#> 337 x 462 sparse Matrix of class "dgCMatrix"
#>                                                                                 
#> Mlc1   .          .          .          .           . . 0.013184339 . .         
#> Gprc5b .          0.01309169 .          0.011890432 . . 0.009866505 . 0.01088072
#> Gfap   0.06138386 .          0.01418047 0.007543147 . . 0.013421407 . 0.00649185
#>                                         
#> Mlc1   0.015288711 . .          . ......
#> Gprc5b 0.011286555 . 0.00801922 . ......
#> Gfap   0.007745042 . .          . ......
#> 
#>  ........suppressing 449 columns and 331 rows 
#>                                           
#> Adgrf4    . . . . . . . . . . . . . ......
#> Epha2     . . . . . . . . . . . . . ......
#> Blank-139 . . . . . . . . . . . . . ......
#> 
#>  First four colnames:
#>  240649020551054330404932383065726870513
#>  274176126496863898679934791272921588227
#>  323754550002953984063006506310071917306
#>  87260224659312905497866017323180367450