Chi-Square Test

Last Updated : 14 Apr, 2026

The Chi-squared (χ²) test is a statistical method used to determine whether there is a significant association between two categorical variables or whether observed data fits an expected distribution. In categorical data analysis, the chi-square test compares observed frequencies with expected frequencies under a given hypothesis.

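The test takes two common forms: the test of independence, which asks whether two categorical variables are associated, and the goodness-of-fit test, which asks whether one variable follows an expected distribution.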

This test is widely used in market research, healthcare, social sciences, and more to analyze categorical relationships.


For example, consider two variables. Variable 1: people's favorite color. Variable 2: their preferred ice cream.

Null Hypothesis (H₀): Favorite color and ice cream preference are independent (no relationship).
Alternative Hypothesis (H₁): They are dependent (a relationship exists).

The chi-square test compares the observed survey counts with the frequencies expected if no relationship existed, and summarizes the discrepancy in a test statistic (χ²). If this value is large enough, we reject H₀ and conclude that favorite color and ice cream preference are associated. The test shows association only; it does not show that one preference causes the other.

Formula For Chi-Square Test

\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

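The symbols are as follows, with the sum running over every category (or table cell) i:

O_i: Observed frequency in category i
E_i: Expected frequency in category i under the null hypothesis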
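
The formula translates almost directly into code. The following is a minimal sketch, not a reference implementation: the function name chi_square_statistic and the three-category counts are hypothetical and chosen only to show the arithmetic.

```python
def chi_square_statistic(observed, expected):
    """Sum (O_i - E_i)^2 / E_i over paired observed and expected counts."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts: three categories, 20 expected in each.
observed = [18, 22, 20]
expected = [20, 20, 20]
statistic = chi_square_statistic(observed, expected)
print(f"chi-square = {statistic:.2f}")
```

Each term is non-negative, so the statistic grows as the observed counts move away from the expected ones in either direction.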

Categorical Variables

Categorical variables classify data into distinct, non-numeric groups (e.g., colors, fruit types).

Key Characteristics:

Distinct Groups: No overlap (e.g., hair color: blonde, brunette).
Non-Numerical: No arithmetic meaning (e.g., "apple" ≠ "orange" numerically).
Limited Options: Fixed categories (e.g., traffic lights: red, yellow, green).

Example: "Do you prefer tea, coffee, or juice?" → Categories: tea/coffee/juice.

Steps for Chi-Square Test

The steps below are illustrated with a running example: testing whether a person's sex is associated with the ice cream flavor they choose.

Step 1: Define Hypothesis

Null Hypothesis (H₀): The variables are independent.
Alternative Hypothesis (H₁): The variables are dependent.

Step 2: Gather and Organize Data

Collect data on the two categorical variables: A chi-square test of independence needs, for every subject, one observation of each of the two categorical variables being studied.

In this example, record each respondent's sex (male or female) and favorite flavor (chocolate, vanilla, or strawberry).

Once this information is collected, tabulate the counts in a contingency table with one row per sex and one column per flavor.

Suppose the researcher suspects that men prefer vanilla while women prefer chocolate. The test itself only checks whether sex and flavor preference are independent, so the table records the count for every combination of sex and flavor, not only the two named in that claim.

Here's an example of what a contingency table might look like:

	Chocolate	Vanilla	Strawberry	Total
Male	20	15	10	45
Female	25	20	30	75
Total	45	35	40	120

Step 3: Calculate Expected Frequencies

Expected Frequency: For any cell, the expected frequency is the count we would expect to see if the two variables were independent.
Expected Frequency Calculation: Multiply the cell's row total by its column total, then divide by the grand total of observations in the table.

The observed frequencies are the counts in the contingency table above.

E_{ij} = \frac{\text{Row Total}_i \times \text{Column Total}_j}{\text{Grand Total}}

Male and Chocolate: \frac{45 \times 45}{120} = 16.875
Male and Vanilla: \frac{45 \times 35}{120} = 13.125
Male and Strawberry: \frac{45 \times 40}{120} = 15.0
Female and Chocolate: \frac{75 \times 45}{120} = 28.125
Female and Vanilla: \frac{75 \times 35}{120} = 21.875
Female and Strawberry: \frac{75 \times 40}{120} = 25.0
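
The same expected counts can be computed for a table of any size. This is a small sketch under the assumption that the table is stored as a list of rows; the helper name expected_frequencies is illustrative and not part of any library.

```python
def expected_frequencies(table):
    """Return E_ij = row_total_i * column_total_j / grand_total for each cell."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


# Observed counts from the table above; columns: chocolate, vanilla, strawberry.
observed = [
    [20, 15, 10],  # male
    [25, 20, 30],  # female
]
expected = expected_frequencies(observed)
for label, row in zip(["Male", "Female"], expected):
    print(label, [round(value, 3) for value in row])
```

The expected counts in each row and column add up to the same totals as the observed counts, which is a quick way to check the arithmetic by hand.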

Step 4: Perform Chi-Square Test

Use Chi-Square Formula:

\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}

\chi^2 = \frac{(20 - 16.875)^2}{16.875} + \frac{(15 - 13.125)^2}{13.125} + \frac{(10 - 15)^2}{15} + \frac{(25 - 28.125)^2}{28.125} + \frac{(20 - 21.875)^2}{21.875} + \frac{(30 - 25)^2}{25}

\chi^2 \approx 0.579 + 0.268 + 1.667 + 0.347 + 0.161 + 1.000 \approx 4.02

Step 5: Determine Degrees of Freedom (df)

df = (number of rows - 1) × (number of columns - 1)

df = (r − 1)(c − 1) = (2 − 1)(3 − 1) = 2

Step 6: Find p-value

Compare the calculated χ² value with the critical value from the chi-square distribution table for the given degrees of freedom and significance level.

If the problem specifies a significance level, use it.
Otherwise, α = 0.05 is the conventional choice.

Here, χ² ≈ 4.02 with df = 2:
Critical value at α = 0.05 is 5.991.
Since 4.02 < 5.991, p > 0.05 (p ≈ 0.134).
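
Steps 4 to 6 can be checked by recomputing everything from the contingency table. The sketch below is one possible way to do it with the standard library only: the function name chi_square_independence is hypothetical, the critical value is copied from a chi-square table, and the p-value line uses a closed form that holds only for df = 2.

```python
import math


def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(column_totals) - 1)
    return statistic, df


observed = [
    [20, 15, 10],  # male: chocolate, vanilla, strawberry
    [25, 20, 30],  # female
]
statistic, df = chi_square_independence(observed)

CRITICAL_VALUE = 5.991  # chi-square table value for df = 2, alpha = 0.05
# For df = 2 only, the upper-tail probability has the closed form exp(-x / 2).
p_value = math.exp(-statistic / 2)

print(f"chi-square = {statistic:.3f}, df = {df}, p = {p_value:.3f}")
print("reject H0" if statistic > CRITICAL_VALUE else "fail to reject H0")
```

Comparing the statistic with the critical value and comparing the p-value with α always lead to the same decision; the p-value simply reports how far into the tail the statistic falls.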

Step 7: Interpret Results

If the p-value is less than the chosen significance level α (e.g., 0.05), we reject the null hypothesis and conclude that the two categorical variables are associated. A small p-value signals that an association exists; it does not measure how strong that association is.
If the p-value is greater than α, we fail to reject the null hypothesis: there is insufficient evidence of a relationship between the variables.

In this example p > 0.05, so we fail to reject H₀. The data provide no significant evidence that sex and ice cream preference are associated, and therefore no support for the claim that men prefer vanilla or women prefer chocolate.

Addressing Assumptions and Considerations

The chi-square test assumes that observations are independent of one another: each subject is counted in exactly one cell of the table.
Each cell should have an expected frequency of at least 5 for the chi-square approximation to be reliable. If any expected frequency is below 5, consider Fisher's exact test as an alternative.
A chi-square test identifies an association between variables; it does not establish a causal relationship.
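
The expected-frequency rule is easy to check before running the test. The following is a sketch only: small_expected_cells is a hypothetical helper, the threshold of 5 is the rule of thumb stated above, and the second table is made-up data used to show a failing case.

```python
def small_expected_cells(table, minimum=5):
    """List (row, column, expected) for cells whose expected count is below minimum."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    flagged = []
    for i, r in enumerate(row_totals):
        for j, c in enumerate(column_totals):
            expected = r * c / grand_total
            if expected < minimum:
                flagged.append((i, j, expected))
    return flagged


ice_cream = [[20, 15, 10], [25, 20, 30]]  # table from the worked example
tiny = [[3, 2], [1, 4]]                   # hypothetical small sample

print(small_expected_cells(ice_cream))  # no cell is flagged
print(small_expected_cells(tiny))       # every cell is flagged
```

Note that the check uses expected counts, not observed ones: a cell with a small observed count can still satisfy the rule if its row and column totals are large.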

Goodness-Of-Fit

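A goodness-of-fit test checks whether observed category counts match the counts predicted by a hypothesized distribution. For example, testing whether a die is fair means comparing the observed count of each face with the equal counts a fair die would produce. With k categories and no parameters estimated from the data, df = k − 1.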

Key Aspects:

Purpose: Check how well observed data fits expected data
Data Types: Categorical only
Applications: Compare observed vs. expected frequencies.
Benefits: Identifies model-data mismatch.
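
The die example can be sketched as follows. The roll counts are invented purely for illustration, the function name goodness_of_fit is hypothetical, and the critical value is taken from a chi-square table for df = 5 at α = 0.05.

```python
def goodness_of_fit(observed, probabilities):
    """Return (statistic, df) comparing counts with hypothesized probabilities."""
    total = sum(observed)
    statistic = sum(
        (o - total * p) ** 2 / (total * p)
        for o, p in zip(observed, probabilities)
    )
    return statistic, len(observed) - 1


# Hypothetical counts for faces 1..6 over 120 rolls (not real data).
rolls = [18, 22, 16, 25, 19, 20]
statistic, df = goodness_of_fit(rolls, [1 / 6] * 6)

CRITICAL_VALUE = 11.070  # chi-square table value for df = 5, alpha = 0.05
print(f"chi-square = {statistic:.2f}, df = {df}")
print("reject H0" if statistic > CRITICAL_VALUE else "fail to reject H0")
```

Passing probabilities instead of expected counts keeps the function usable for non-uniform hypotheses, such as a loaded die with known face probabilities.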

Applications of Chi-Square Test in Computer Science

A/B Testing & Feature Evaluation

Compare user engagement (e.g., clicks, conversions) between two website versions (A vs. B).
The chi-square test checks whether observed outcomes (e.g., "Click" vs. "No Click") differ significantly between groups.
Example: Version A receives 120 clicks out of 1,000 views; Version B receives 150 clicks out of 1,000 views. The chi-square test checks whether this difference is statistically significant or plausibly due to chance.
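
The click example above forms a 2 × 2 table (version by click or no click), so df = 1. The sketch below computes the Pearson statistic and, optionally, the version with Yates' continuity correction, which is often applied to 2 × 2 tables. The function names are hypothetical, and the p-value formula is the closed form for df = 1.

```python
import math


def chi_square_2x2(table, yates=False):
    """Chi-square statistic for a 2 x 2 table, optionally with Yates' correction."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / grand_total
            difference = abs(observed - expected)
            if yates:
                difference = max(difference - 0.5, 0.0)
            statistic += difference ** 2 / expected
    return statistic


def p_value_df1(statistic):
    """Upper-tail probability of a chi-square variable with df = 1."""
    return math.erfc(math.sqrt(statistic / 2))


# Rows: version A, version B; columns: click, no click.
views = [[120, 880], [150, 850]]
for yates in (False, True):
    statistic = chi_square_2x2(views, yates=yates)
    print(f"yates={yates}: chi-square = {statistic:.3f}, p = {p_value_df1(statistic):.3f}")
```

For these counts the uncorrected statistic lands just above the df = 1 critical value of 3.841, while the corrected statistic lands just below it. A result this close to the threshold is best reported as borderline, with the choice of correction stated.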

Machine Learning (Feature Selection)

Identify categorical features that are associated with the target variable.
Test for independence between a feature and the target (e.g., "Browser Type" vs. "Purchase Decision") using the chi-square test.
Example: χ² p-value < 0.05 → "Browser Type" is significantly associated with purchases and is a candidate feature to keep.
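
In practice the feature and target arrive as raw records, so the first step is counting them into a contingency table. The sketch below shows that step and then the statistic. The (browser, purchased) records are made-up data, and contingency_table is a hypothetical helper, not a library function.

```python
from collections import Counter


def contingency_table(records):
    """Count (feature, target) pairs into a table plus its row and column labels."""
    counts = Counter(records)
    features = sorted({feature for feature, _ in records})
    targets = sorted({target for _, target in records})
    table = [[counts[(f, t)] for t in targets] for f in features]
    return features, targets, table


# Hypothetical (browser, purchased) records, not real data.
records = (
    [("Chrome", "yes")] * 30 + [("Chrome", "no")] * 70
    + [("Firefox", "yes")] * 15 + [("Firefox", "no")] * 85
)
features, targets, table = contingency_table(records)

row_totals = [sum(row) for row in table]
column_totals = [sum(column) for column in zip(*table)]
grand_total = sum(row_totals)
statistic = sum(
    (table[i][j] - row_totals[i] * column_totals[j] / grand_total) ** 2
    / (row_totals[i] * column_totals[j] / grand_total)
    for i in range(len(features))
    for j in range(len(targets))
)
print(features, targets, table)
print(f"chi-square = {statistic:.3f}")
```

Repeating this for each candidate feature and ranking by the statistic (or its p-value) gives a simple filter-style feature selection, assuming every table meets the expected-frequency rule.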

Database Query Optimization

Assess whether data is evenly distributed across partitions.
A chi-square goodness-of-fit test checks whether the actual row counts per partition match the expected uniform distribution.
Example: A significant χ² result indicates an uneven distribution, which may point to a poor sharding strategy.

Natural Language Processing (NLP)

Evaluate word frequency distributions in texts.
Compare observed word counts (e.g., "error" in logs) to the expected Poisson distribution.
Example: Detects overused terms in spam emails (χ² highlights deviations from normal usage).

Solved Examples

Example 1: A study investigates the relationship between eye color (blue, brown, green) and hair color (blonde, brunette, redhead). The following data is collected:

Eye Color	Blonde	Brunette	Redhead	Total
Blue	30	50	20	100
Brown	40	30	10	80
Green	20	10	10	40
Total	90	90	40	220

Step 1: Hypotheses
H₀: Eye color and hair color are independent
H₁: They are associated
Step 2: Expected Frequencies
Using E = \frac{\text{Row Total} \times \text{Column Total}}{\text{Grand Total}}, the expected counts for (blonde, brunette, redhead) are:
Blue: (40.91, 40.91, 18.18)
Brown: (32.73, 32.73, 14.55)
Green: (16.36, 16.36, 7.27)
Step 3: Chi-Square Calculation
\chi^2 = \sum \frac{(O - E)^2}{E} \approx 12.68
Step 4: Degrees of Freedom
df = (3 − 1)(3 − 1) = 4
Step 5: Decision
Critical value (α = 0.05, df = 4) = 9.488
Since 12.68 > 9.488 → Reject H₀
There is a significant association between eye color and hair color.
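
Example 1 can be re-checked numerically with a few lines of standard-library code. This is a sketch rather than a definitive implementation; the p-value line relies on a closed form that is valid only for df = 4.

```python
import math

observed = [
    [30, 50, 20],  # blue eyes: blonde, brunette, redhead
    [40, 30, 10],  # brown eyes
    [20, 10, 10],  # green eyes
]
row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

statistic = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * column_totals[j] / grand_total
        statistic += (count - expected) ** 2 / expected

df = (len(row_totals) - 1) * (len(column_totals) - 1)
# For df = 4 only, the upper-tail probability is exp(-x / 2) * (1 + x / 2).
p_value = math.exp(-statistic / 2) * (1 + statistic / 2)

print(f"chi-square = {statistic:.2f}, df = {df}, p = {p_value:.4f}")
```

Computing the expected counts at full precision inside the loop avoids the small rounding drift that appears when two-decimal expected values are used by hand.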

Example 2: A coin is flipped 100 times, giving 55 heads and 45 tails. Under the null hypothesis the coin is fair, so heads and tails are equally likely.

Step 1: Hypotheses
H₀: Coin is fair
H₁: Coin is not fair
Step 2: Expected Values
Heads = 50, Tails = 50
Step 3: Chi-Square Calculation
\chi^2 = \frac{(55-50)^2}{50} + \frac{(45-50)^2}{50} = 1
Step 4: Degrees of Freedom
df = k − 1 = 2 − 1 = 1
Step 5: Decision
Critical value (α = 0.05, df = 1) = 3.841
Since 1 < 3.841 → Fail to reject H₀
There is no significant evidence that the coin is biased.
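
The coin example is small enough to verify in a few lines. The sketch below also reports a p-value, using a closed form that applies only when df = 1.

```python
import math

observed = [55, 45]  # heads, tails
expected = [50, 50]  # fair coin over 100 flips

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1
# For df = 1, the upper-tail probability equals erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))

print(f"chi-square = {statistic:.2f}, df = {df}, p = {p_value:.3f}")
```

A p-value this far above 0.05 means a fair coin would produce a split at least as uneven as 55 to 45 quite often, which is why H₀ is not rejected.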


Practice Problems

Q1. Market Research on Beverages

A company conducts a survey to determine whether there's a relationship between age groups and preferred beverages. The data collected is as follows:

Age Group	Coffee	Tea	Soft Drinks	Water
18-25	30	20	25	15
26-35	25	30	20	25
36-45	20	25	30	25
46-55	15	20	25	40

Use a chi-square test to determine if there is an association between age groups and preferred beverages.

Q2. Student Performance

A teacher wants to find out if there is a relationship between study habits and grades. The data collected is as follows:

Study Habits	A	B	C	D	F
Regular	15	20	25	10	5
Occasional	10	15	20	15	10
Rare	5	10	15	20	25

Perform a chi-square test to determine if study habits and grades are associated.

Q3. Gender and Major

A university wants to see if there is an association between gender and chosen major. The data collected is:

Major	Male	Female
Engineering	60	30
Business	40	50
Arts	20	40
Sciences	30	30

Conduct a chi-square test to examine if gender and chosen major are related.

Q4. Voting Preferences

A political analyst wants to know if there is a relationship between gender and voting preference. The data is:

Preference	Male	Female
Candidate A	80	90
Candidate B	70	60
Undecided	50	40

Test the hypothesis that gender and voting preference are independent.
