TF-IDF (Term Frequency-Inverse Document Frequency) normalization is borrowed from natural language processing. It gives high weight to features that are highly expressed in specific samples and low weight to features that are detected in most samples of the dataset.

$$\LARGE TF_{i,j} = \frac{x_{i,j}}{\sum_{k} x_{k,j}} $$

$$\LARGE IDF_{i} = \log\left(1 + \frac{n_{samples}}{1 + n_{samples \: where \: feature \: i > 0}}\right) $$

$$\LARGE TFIDF_{i,j} = TF_{i,j} \times IDF_{i} $$

Where:

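  • \(x_{i,j}\) is the raw count for feature \(i\) in sample \(j\)

  • \(TF_{i,j}\) is the term frequency of feature \(i\) in sample \(j\), i.e. its count divided by the total count of all features in that sample

  • \(n_{samples}\) is the total number of samples, and \(n_{samples \: where \: feature \: i > 0}\) is the number of samples in which feature \(i\) has a nonzero count

  • \(IDF_{i}\) is the inverse document frequency of feature \(i\)

  • \(TFIDF_{i,j}\) is the final TF-IDF normalized value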
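
The formulas above can be sketched in NumPy as follows. This is a minimal illustration, not the library's own implementation: the function name `tfidf_normalize` is hypothetical, the input is assumed to be a dense array laid out as features × samples so that `counts[i, j]` matches \(x_{i,j}\), and leaving samples with a zero total count at zero is an assumption rather than documented behavior.

```python
import numpy as np


def tfidf_normalize(counts):
    """Apply the TF-IDF formulas to a (n_features, n_samples) count array.

    Hypothetical helper: the layout and the handling of empty samples
    are assumptions made for this sketch.
    """
    counts = np.asarray(counts, dtype=float)
    n_samples = counts.shape[1]

    # TF: divide each sample (column) by its total count.
    # Samples whose total is zero are left as all zeros (assumption).
    totals = counts.sum(axis=0, keepdims=True)
    tf = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

    # IDF: log(1 + n_samples / (1 + number of samples where the feature > 0)).
    n_detected = (counts > 0).sum(axis=1, keepdims=True)
    idf = np.log(1.0 + n_samples / (1.0 + n_detected))

    # TF-IDF: elementwise product, with IDF broadcast across samples.
    return tf * idf
```

With this layout, a feature detected in every sample receives the smallest IDF, so its TF values are scaled down relative to features detected in only a few samples.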

Note

L2 normalization is commonly performed after TF-IDF normalization.
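
As a rough sketch of that follow-up step, the snippet below scales each sample to unit Euclidean length. The function name `l2_normalize_samples` is hypothetical, and it assumes a features × samples array of TF-IDF values with normalization applied per sample (per column); the text does not specify the axis, so treat that choice as an assumption.

```python
import numpy as np


def l2_normalize_samples(values):
    """Scale each sample (column) of a (n_features, n_samples) array to unit L2 norm.

    Hypothetical helper: per-sample normalization is an assumption.
    """
    values = np.asarray(values, dtype=float)

    # Euclidean norm of each column, kept as a row vector for broadcasting.
    norms = np.linalg.norm(values, axis=0, keepdims=True)

    # Columns with zero norm are left as zeros instead of dividing by zero.
    return np.divide(values, norms, out=np.zeros_like(values), where=norms > 0)
```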

params

None