DEV Community

ram vnet

Statistics - Correlation in Data Science

1️⃣ What is Correlation?

Correlation measures the strength and direction of a relationship between two numerical variables.

👉 It answers questions like:

When X increases, does Y increase or decrease?

How strongly are X and Y related?

📌 Correlation does NOT mean causation.

Example:

Ice cream sales ↑ and temperature ↑ → correlated

Ice cream sales ↑ does NOT cause temperature ↑

2️⃣ Why Correlation is Important in Data Science

Correlation is used in:

✔ Exploratory Data Analysis (EDA)
✔ Feature selection
✔ Detecting multicollinearity
✔ Understanding data patterns
✔ Model simplification
✔ Business insights

Example:

If two features are highly correlated, they carry largely redundant information, so one of them may be removed.

3️⃣ Direction of Correlation

➕ Positive Correlation

Both variables increase together

Example: Height & Weight

📈 Graph: Upward slope

➖ Negative Correlation

One increases, the other decreases

Example: Speed & Travel Time

📉 Graph: Downward slope

⚪ Zero Correlation

No linear relationship

Example: Shoe size & IQ

📊 Graph: Random scatter

4️⃣ Correlation Coefficient (r)

The correlation coefficient measures correlation numerically.

Range:

-1 ≤ r ≤ +1

| Value of r | Meaning |
|---|---|
| +1 | Perfect positive |
| -1 | Perfect negative |
| 0 | No correlation |
| ±0.7 to ±1 | Strong |
| ±0.3 to ±0.7 | Moderate |
| ±0.0 to ±0.3 | Weak |
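As a rough illustration, the rule-of-thumb bands can be turned into a small helper. The function name `describe_correlation` is made up for this sketch, and because the bands share their edges, it assumes that exactly 0.7 counts as strong and exactly 0.3 counts as moderate.

```python
def describe_correlation(r: float) -> str:
    """Label r using the rule-of-thumb bands (0.3 and 0.7 as cut-offs)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no correlation"

    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size >= 0.7:        # assumption: the shared edge 0.7 counts as strong
        strength = "strong"
    elif size >= 0.3:      # assumption: the shared edge 0.3 counts as moderate
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"


print(describe_correlation(0.85))   # strong positive
print(describe_correlation(-0.4))   # moderate negative
print(describe_correlation(0.1))    # weak positive
```

These cut-offs are conventions, not laws. What counts as "strong" depends on the field and the sample size.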

5️⃣ Pearson Correlation (Most Common)

📌 Used for:

Linear relationships

Continuous numerical data

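Formula:

$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$

Assumptions:

✔ Linear relationship
✔ No extreme outliers
✔ Normal distribution (optional but preferred)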

Example:

Study hours & exam marks
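Here is a minimal sketch of Pearson's r computed by hand with NumPy and checked against `np.corrcoef`. The study-hours and exam-marks numbers are made up for illustration, and `pearson_r` is just a helper name chosen for this example.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson's r: sum of products of deviations over the product of their norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()          # deviations of x from its mean
    dy = y - y.mean()          # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Made-up data: hours studied and marks scored by six students
study_hours = [1, 2, 3, 4, 5, 6]
exam_marks = [52, 55, 61, 70, 74, 80]

r = pearson_r(study_hours, exam_marks)
print(round(r, 3))

# Sanity check against NumPy's built-in correlation matrix
print(np.isclose(r, np.corrcoef(study_hours, exam_marks)[0, 1]))
```

In practice you would call `np.corrcoef` directly. The hand-written version is only there to show what the formula does.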

6️⃣ Spearman Rank Correlation

📌 Used for:

Monotonic relationships (linear or non-linear)

Ranked or ordinal data

Key Idea:

Convert values into ranks

Apply Pearson on ranks

Example:

Customer satisfaction rank vs loyalty rank
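The key idea (rank the values, then apply Pearson to the ranks) can be sketched in a few lines of NumPy. This simplified version assumes there are no tied values. With ties, ranks should be averaged, which this sketch does not do. The helper names `to_ranks` and `spearman_rho` are made up for the example.

```python
import numpy as np


def to_ranks(values):
    """Rank values from 1..n (assumes no ties, so no rank averaging is needed)."""
    values = np.asarray(values)
    return values.argsort().argsort() + 1


def spearman_rho(x, y):
    """Spearman's rho as Pearson's r applied to the ranks."""
    return float(np.corrcoef(to_ranks(x), to_ranks(y))[0, 1])


# A monotonic but clearly non-linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.exp(x)

rho = spearman_rho(x, y)
pearson = float(np.corrcoef(x, y)[0, 1])
print(round(rho, 3), round(pearson, 3))
```

Because y always rises when x rises, the ranks match exactly and Spearman reports a perfect relationship, while Pearson comes out lower because the points do not lie on a straight line.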

7️⃣ Kendall's Tau Correlation

📌 Used for:

Small datasets

Ordinal data

Robust to ties

Concept:

Counts concordant & discordant pairs

Example:

Ranking similarity between two judges
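To make the concordant/discordant idea concrete, here is a small pure-Python sketch of the simplest variant (often called tau-a), where tied pairs count as neither. The two judges' rankings are made up, and `kendall_tau` is a name chosen for this example.

```python
from itertools import combinations


def kendall_tau(a, b):
    """Tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (a1, b1), (a2, b2) in combinations(zip(a, b), 2):
        product = (a1 - a2) * (b1 - b2)
        if product > 0:        # both rankings order this pair the same way
            concordant += 1
        elif product < 0:      # the rankings disagree on this pair
            discordant += 1
        # product == 0 means a tie, counted as neither
    n_pairs = len(a) * (len(a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up rankings of five contestants by two judges
judge_1 = [1, 2, 3, 4, 5]
judge_2 = [2, 1, 3, 5, 4]

print(kendall_tau(judge_1, judge_2))   # 0.6
```

Out of 10 pairs, the judges agree on 8 and disagree on 2, which gives (8 - 2) / 10 = 0.6.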

8️⃣ Correlation vs Covariance

| Covariance | Correlation |
|---|---|
| Measures joint variability | Measures strength & direction |
| Units depend on data | Unit-free |
| Hard to interpret | Easy to interpret |
| Range: −∞ to +∞ | Range: −1 to +1 |

📌 Correlation = Normalized covariance
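A quick sketch of what "normalized" means here: dividing the covariance by both standard deviations gives r, and changing the units changes the covariance but not r. The height and weight values below are made up for illustration.

```python
import numpy as np

# Made-up measurements for five people
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_kg = np.array([52.0, 58.0, 69.0, 77.0, 90.0])

# Sample covariance (np.cov divides by n - 1 by default)
cov = np.cov(height_cm, weight_kg)[0, 1]
# Normalize by both sample standard deviations to get r
r = cov / (height_cm.std(ddof=1) * weight_kg.std(ddof=1))

# Same data with height in metres instead of centimetres
height_m = height_cm / 100
cov_m = np.cov(height_m, weight_kg)[0, 1]
r_m = cov_m / (height_m.std(ddof=1) * weight_kg.std(ddof=1))

print(cov, cov_m)   # covariance shrinks by a factor of 100
print(r, r_m)       # correlation stays the same
```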

9️⃣ Correlation Matrix

A correlation matrix shows correlations between multiple variables.

Example:

|   | A | B | C |
|---|---|---|---|
| A | 1 | 0.8 | -0.2 |
| B | 0.8 | 1 | -0.4 |
| C | -0.2 | -0.4 | 1 |

📌 Used in:

Feature selection

Heatmaps

Multivariate EDA
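Here is a minimal sketch of building a correlation matrix with NumPy. The three columns are simulated (not real data): B is deliberately built to follow A, and C is independent noise.

```python
import numpy as np

rng = np.random.default_rng(0)               # fixed seed so the run is repeatable
a = rng.normal(size=200)
b = a + rng.normal(scale=0.5, size=200)      # built to track A closely
c = rng.normal(size=200)                     # unrelated to A and B

data = np.column_stack([a, b, c])            # shape (200, 3): one column per variable
corr = np.corrcoef(data, rowvar=False)       # rowvar=False: columns are the variables

print(np.round(corr, 2))
```

The diagonal is always 1 (each variable against itself) and the matrix is symmetric. With pandas, `DataFrame.corr()` returns the same kind of matrix with column labels attached.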

🔥 10️⃣ Multicollinearity

What is it?

When independent variables (features) are highly correlated with each other

Problems:

❌ Unstable coefficients
❌ Reduced model interpretability
❌ Inflated variance

Detection:

Correlation Matrix

VIF (Variance Inflation Factor)
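As a sketch of the VIF check: the VIF of feature j is 1 / (1 - R²), where R² comes from regressing feature j on the other features, and this equals the j-th diagonal entry of the inverse correlation matrix. The code below uses that shortcut on simulated features. The helper name `vif` is made up for this example, and the common "VIF above 5 or 10" cut-off is only a rule of thumb.

```python
import numpy as np


def vif(features):
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(features, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly a copy of x1
x3 = rng.normal(size=500)                   # independent feature

scores = vif(np.column_stack([x1, x2, x3]))
print(np.round(scores, 1))
```

Here x1 and x2 get very large VIF values because each is almost fully explained by the other, while x3 stays close to 1.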

11️⃣ Correlation ≠ Causation (Very Important)

Correlation does NOT mean one variable causes the other.

Example:

Crime rate & Ice cream sales are correlated

Both depend on temperature

📌 Hidden variable = Confounding factor
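A small simulation can show how a confounder creates correlation. All numbers below are invented: temperature drives both ice cream sales and crime, and neither affects the other. Removing the part of each variable explained by temperature (a simple partial correlation) makes the link disappear.

```python
import numpy as np

rng = np.random.default_rng(42)
temperature = rng.normal(25, 5, size=1000)

# Both variables depend on temperature plus their own independent noise
ice_cream = 2.0 * temperature + rng.normal(0, 5, size=1000)
crime = 1.5 * temperature + rng.normal(0, 5, size=1000)


def residuals(y, x):
    """What is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


raw_r = np.corrcoef(ice_cream, crime)[0, 1]
partial_r = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(crime, temperature),
)[0, 1]

print(round(raw_r, 2))       # clearly positive
print(round(partial_r, 2))   # close to zero once temperature is accounted for
```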

12️⃣ Limitations of Correlation

⚠ Only measures linear relationships (Pearson)
⚠ Sensitive to outliers
⚠ Cannot capture cause-effect
⚠ Misses complex patterns
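Two of these limitations are easy to demonstrate with made-up data. In the first part, y depends completely on x, yet Pearson's r is about zero because the relationship is a curve. In the second part, a single extreme point turns unrelated noise into a "strong" correlation.

```python
import numpy as np

# 1) A perfect but non-linear relationship: y = x^2 on a symmetric range
x = np.linspace(-3, 3, 61)
y = x ** 2
r_curve = np.corrcoef(x, y)[0, 1]
print(round(r_curve, 3))

# 2) Unrelated noise, then the same noise plus one extreme outlier
rng = np.random.default_rng(7)
u = rng.normal(size=200)
v = rng.normal(size=200)
r_clean = np.corrcoef(u, v)[0, 1]

u_out = np.append(u, 50.0)   # one point far away from the rest
v_out = np.append(v, 50.0)
r_outlier = np.corrcoef(u_out, v_out)[0, 1]

print(round(r_clean, 2), round(r_outlier, 2))
```

This is why a scatter plot is worth checking before trusting a single correlation number.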

13️⃣ Correlation in Machine Learning

Used in:

Feature elimination

Dimensionality reduction

Data cleaning

Model diagnostics

Example:

Remove one of two features with |r| > 0.9
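One possible way to apply this rule is a simple greedy filter: walk through the features in order and keep a feature only if it is not too correlated with any feature already kept. The function name, the feature names, and the simulated data are all made up for this sketch. It uses the absolute value of r so that strong negative correlations are caught too.

```python
import numpy as np


def drop_highly_correlated(features, names, threshold=0.9):
    """Greedy filter: keep a column only if |r| <= threshold with every kept column."""
    corr = np.abs(np.corrcoef(features, rowvar=False))
    keep = []
    for j in range(corr.shape[0]):
        if all(corr[j, k] <= threshold for k in keep):
            keep.append(j)
    return [names[j] for j in keep]


rng = np.random.default_rng(3)
area_sqft = rng.normal(1500, 300, size=300)
area_sqm = area_sqft * 0.092903              # same information in another unit
age_years = rng.normal(20, 8, size=300)      # unrelated feature

kept = drop_highly_correlated(
    np.column_stack([area_sqft, area_sqm, age_years]),
    ["area_sqft", "area_sqm", "age_years"],
)
print(kept)   # ['area_sqft', 'age_years']
```

This version always keeps whichever feature comes first. In a real project you might instead keep the one that is more correlated with the target or easier to explain.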

14️⃣ Real-World Example (Data Science)

📌 Dataset: House Prices

| Feature | Correlation with Price |
|---|---|
| Area | +0.85 |
| Distance to city | -0.62 |
| Age of house | -0.40 |
| Bedrooms | +0.70 |

Interpretation:

Area has a strong positive correlation with price

Price tends to be lower as distance to the city increases

15️⃣ Visualizing Correlation

✔ Scatter plots
✔ Heatmaps
✔ Pair plots

16️⃣ Summary (Key Takeaways)

✔ Correlation measures relationship, not causation
✔ Range is from −1 to +1
✔ Pearson → Linear
✔ Spearman → Rank / Monotonic
✔ Used heavily in EDA & ML
✔ Helps detect redundancy in features
