categoricalInformationValue

Introduced in: v20.1

Calculates the information value (IV) for categorical features in relation to a binary target variable.

For each category, the function computes: (P(tag = 1) - P(tag = 0)) × (log(P(tag = 1)) - log(P(tag = 0)))

where:

  • P(tag = 1) is the probability that the target equals 1 for the given category
  • P(tag = 0) is the probability that the target equals 0 for the given category

Information Value is a statistic that measures the strength of a categorical feature's relationship with a binary target variable in predictive modeling. Each term is non-negative, because the two factors in the formula always share the same sign, so higher values indicate stronger predictive power.

The result indicates how much each discrete (categorical) feature [category1, category2, ...] contributes to a learning model which predicts the value of tag.
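
As a rough cross-check of the formula above, the following sketch computes the per-category term by hand and places it next to the built-in result. It assumes that P(tag = 1) is the share of tag = 1 rows in which the category is nonzero, and likewise for P(tag = 0). That reading, the inline `values` table, and the column names `category` and `tag` are assumptions made for illustration. `log` is ClickHouse's natural logarithm.

```sql
-- Illustrative data only: 8 rows of (category, tag), invented for this sketch.
SELECT
    -- Assumed reading: share of tag = 1 rows where the category is set (3 of 4).
    countIf(category = 1 AND tag = 1) / countIf(tag = 1) AS p1,
    -- Assumed reading: share of tag = 0 rows where the category is set (1 of 4).
    countIf(category = 1 AND tag = 0) / countIf(tag = 0) AS p0,
    -- The documented formula; for this data 0.5 * ln(3), roughly 0.549.
    (p1 - p0) * (log(p1) - log(p0)) AS iv_manual,
    -- Built-in aggregate over the same rows, for comparison.
    categoricalInformationValue(category, tag) AS iv_builtin
FROM values(
    'category UInt8, tag UInt8',
    (1, 1), (1, 1), (1, 1), (0, 1),
    (1, 0), (0, 0), (0, 0), (0, 0)
);
```

If the two columns disagree on real data, the assumed reading of the probabilities is the first thing to revisit. A category that never occurs together with one of the tag values makes one logarithm's argument zero, so the manual term is no longer finite.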

Syntax

categoricalInformationValue(category1[, category2, ...], tag)

Arguments

  • category1, category2, ... (UInt8): One or more categorical features to analyze. Each category should contain discrete values.
  • tag (UInt8): Binary target variable for prediction. Should contain only the values 0 and 1.
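
Because each category argument is a UInt8 column, a feature with more than two levels has to be turned into numeric form first. One possible approach, sketched below, passes one 0/1 indicator per level. The `region` column, its values, and the inline data are hypothetical, and whether per-level indicators suit a given analysis is a modeling choice rather than something this reference prescribes.

```sql
-- Hypothetical data: a String feature `region` and a binary `tag`.
SELECT categoricalInformationValue(
    toUInt8(region = 'north'),  -- 1 for rows in the north region, else 0
    toUInt8(region = 'south'),  -- 1 for rows in the south region, else 0
    toUInt8(region = 'west'),   -- 1 for rows in the west region, else 0
    tag                         -- binary target, passed last
) AS iv_per_region
FROM values(
    'region String, tag UInt8',
    ('north', 1), ('north', 1), ('north', 0),
    ('south', 1), ('south', 0), ('south', 0),
    ('west', 1),  ('west', 0),  ('west', 0), ('west', 0)
);
```

The sample rows are chosen so that every level occurs with both tag values, which keeps every logarithm in the formula finite.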

Returned value

Returns an array of Float64 values, one information value per category argument. Each value indicates the predictive strength of that feature for the target variable. Type: Array(Float64).
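
Since the result is a bare array, it can help to pair each value with a label. The sketch below uses `arrayZip` and assumes the array follows the order of the category arguments, which is worth confirming on your own data. The feature names, labels, and inline rows are invented for illustration.

```sql
-- Hypothetical data: two 0/1 features and a binary tag.
SELECT arrayZip(
    ['is_new_user', 'is_mobile'],                             -- labels, in argument order
    categoricalInformationValue(is_new_user, is_mobile, tag)  -- one Float64 per feature (order assumed)
) AS labeled_iv
FROM values(
    'is_new_user UInt8, is_mobile UInt8, tag UInt8',
    (1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0),
    (1, 0, 0), (0, 1, 0), (0, 0, 0), (1, 1, 1)
);
```

`arrayZip` requires both arrays to have the same length, so a mismatch between the number of labels and the number of category arguments surfaces as an error rather than as silently misaligned values.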

Examples

Basic usage analyzing age groups vs mobile usage

-- Using the metrica.hits dataset (available on https://sql.clickhouse.com/) to analyze age-mobile relationship
SELECT categoricalInformationValue(Age < 15, IsMobile)
FROM metrica.hits;

Result:

[0.0014814694805292418]

Multiple categorical features with user demographics

SELECT categoricalInformationValue(
    Sex,                 -- 0=male, 1=female
    toUInt8(Age < 25),   -- 0=25+, 1=under 25
    toUInt8(IsMobile)    -- 0=desktop, 1=mobile
) AS iv_values
FROM metrica.hits
WHERE Sex IN (0, 1);

Result:

[0.00018965785460692887,0.004973668839403392]