ram vnet

Posted on Jan 2

Statistics - Correlation in Data Science

1️⃣ What is Correlation?

Correlation measures the strength and direction of a relationship between two numerical variables.

👉 It answers questions like:

When X increases, does Y increase or decrease?

How strongly are X and Y related?

📌 Correlation does NOT mean causation.

Example:

Ice cream sales ↑ and temperature ↑ → correlated

Ice cream sales ↑ does NOT cause temperature ↑

2️⃣ Why Correlation is Important in Data Science

Correlation is used in:

✔ Exploratory Data Analysis (EDA)
✔ Feature selection
✔ Detecting multicollinearity
✔ Understanding data patterns
✔ Model simplification
✔ Business insights

Example:

If two features are highly correlated, one may be removed.

3️⃣ Direction of Correlation

➕ Positive Correlation

Both variables increase together

Example: Height & Weight

📈 Graph: Upward slope

➖ Negative Correlation

One increases, the other decreases

Example: Speed & Travel Time

📉 Graph: Downward slope

⚪ Zero Correlation

No relationship

Example: Shoe size & IQ

📊 Graph: Random scatter

4️⃣ Correlation Coefficient (r)

The correlation coefficient measures correlation numerically.

Range:

-1 ≤ r ≤ +1

| Value of r | Meaning |
| --- | --- |
| +1 | Perfect positive |
| -1 | Perfect negative |
| 0 | No correlation |
| ±0.7 to ±1 | Strong |
| ±0.3 to ±0.7 | Moderate |
| ±0.0 to ±0.3 | Weak |
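The interpretation bands above can be turned into a small helper. This is only a sketch: the function name `describe_correlation` is hypothetical, and assigning the boundary values (0.3 counts as moderate, 0.7 as strong) is an assumption, since the bands overlap at their edges.

```python
def describe_correlation(r: float) -> str:
    """Map a correlation coefficient to the labels used in the table above."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no correlation"
    direction = "positive" if r > 0 else "negative"
    size = abs(r)
    if size == 1:
        strength = "perfect"
    elif size >= 0.7:      # assumption: 0.7 itself counts as strong
        strength = "strong"
    elif size >= 0.3:      # assumption: 0.3 itself counts as moderate
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction}"


for value in (1.0, 0.85, -0.62, -0.1, 0.0):
    print(value, "->", describe_correlation(value))
```

These cut-offs are rules of thumb. What counts as "strong" varies by field.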

5️⃣ Pearson Correlation (Most Common)

📌 Used for:

Linear relationships

Continuous numerical data

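Formula:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$

Assumptions:

✔ Linear relationship
✔ No extreme outliers
✔ Normal distribution (optional but preferred)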

Example:

Study hours & exam marks
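Pearson's r is the sum of the products of the deviations from each mean, divided by the square root of the product of the two sums of squared deviations. A minimal sketch of that calculation follows. The function name `pearson_r` and the study-hours data are invented for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed directly from the definition."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 points)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Made-up example data.
study_hours = [1, 2, 3, 4, 5, 6]
exam_marks = [52, 58, 61, 70, 74, 83]

r = pearson_r(study_hours, exam_marks)
print(f"Pearson r = {r:.3f}")
```

In practice a library routine such as `statistics.correlation` (Python 3.10+) gives the same value. The hand-written version is only meant to show the arithmetic.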

6️⃣ Spearman Rank Correlation

📌 Used for:

Monotonic relationships (linear or non-linear)

Ranked or ordinal data

Key Idea:

Convert values into ranks

Apply Pearson on ranks

Example:

Customer satisfaction rank vs loyalty rank
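The key idea (rank the values, then apply Pearson to the ranks) can be sketched as below. The helper names `average_ranks`, `pearson_r`, and `spearman_rho` are hypothetical, ties receive the average of the ranks they span, and the data is made up.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson applied to the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# A monotonic but non-linear relationship: y = x cubed.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

rho = spearman_rho(x, y)
r_linear = pearson_r(x, y)
print(f"Spearman rho = {rho:.3f}, Pearson r = {r_linear:.3f}")
```

Because y always increases with x, the ranks agree perfectly and Spearman reports 1, while Pearson is lower because the relationship is not a straight line.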

7️⃣ Kendall’s Tau Correlation

📌 Used for:

Small datasets

Ordinal data

Data with tied ranks (handles ties well)

Concept:

Counts concordant & discordant pairs

Example:

Ranking similarity between two judges
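Counting concordant and discordant pairs can be sketched as follows. This is the simplest variant (often called tau-a), which ignores tied pairs rather than correcting for them. The function name `kendall_tau_a` and the two judges' rankings are invented for the example.

```python
from itertools import combinations


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 items)")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:        # both judges order this pair the same way
            concordant += 1
        elif product < 0:      # the judges disagree on this pair
            discordant += 1
        # product == 0 means a tie, which tau-a counts as neither
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up rankings of five contestants by two judges.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 4, 3, 5]

tau = kendall_tau_a(judge_a, judge_b)
print(f"Kendall tau-a = {tau:.2f}")  # 8 concordant, 2 discordant, 10 pairs
```

When the data contains many ties, the tie-corrected variant tau-b is usually preferred. That is the default in `scipy.stats.kendalltau`.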

8️⃣ Correlation vs Covariance

| Covariance | Correlation |
| --- | --- |
| Measures joint variability | Measures strength & direction |
| Units depend on data | Unit-free |
| Hard to interpret | Easy to interpret |
| Range: −∞ to +∞ | Range: −1 to +1 |

📌 Correlation = Normalized covariance
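To see what "normalized covariance" means in practice, the sketch below divides the sample covariance by the product of the two standard deviations, then repeats the calculation after changing the units of one variable. Function names and data are made up for illustration.

```python
import math


def sample_covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def sample_std(xs):
    # The variance is the covariance of a variable with itself.
    return math.sqrt(sample_covariance(xs, xs))


def correlation_from_covariance(xs, ys):
    """Correlation = covariance / (std of x * std of y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Made-up data; the same heights expressed in two different units.
height_cm = [150, 160, 165, 172, 180, 188]
height_m = [h / 100 for h in height_cm]
weight_kg = [52, 58, 63, 70, 79, 85]

cov_cm = sample_covariance(height_cm, weight_kg)
cov_m = sample_covariance(height_m, weight_kg)
r_cm = correlation_from_covariance(height_cm, weight_kg)
r_m = correlation_from_covariance(height_m, weight_kg)

print(f"covariance:  {cov_cm:.2f} (cm) vs {cov_m:.4f} (m)")
print(f"correlation: {r_cm:.4f} (cm) vs {r_m:.4f} (m)")
```

The covariance changes by a factor of 100 when centimetres become metres, while the correlation stays the same. That is why correlation is the easier of the two to interpret.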

9️⃣ Correlation Matrix

A correlation matrix shows correlations between multiple variables.

Example:

| | X | Y | Z |
| --- | --- | --- | --- |
| X | 1 | 0.8 | -0.2 |
| Y | 0.8 | 1 | -0.4 |
| Z | -0.2 | -0.4 | 1 |

📌 Used in:

Feature selection

Heatmaps

Multivariate EDA

🔥 10️⃣ Multicollinearity

What is it?

When two or more independent variables (features) are highly correlated with each other

Problems:

❌ Unstable coefficients
❌ Reduced model interpretability
❌ Inflated variance

Detection:

Correlation Matrix

VIF (Variance Inflation Factor)

11️⃣ Correlation ≠ Causation (Very Important)

Correlation does NOT mean one variable causes the other.

Example:

Crime rate & Ice cream sales are correlated

Both depend on temperature

📌 Hidden variable = Confounding factor

12️⃣ Limitations of Correlation

⚠ Only measures linear relationships (Pearson)
⚠ Sensitive to outliers
⚠ Cannot capture cause-effect
⚠ Misses complex patterns

13️⃣ Correlation in Machine Learning

Used in:

Feature elimination

Dimensionality reduction

Data cleaning

Model diagnostics

Example:

Remove one of two features with |r| > 0.9

14️⃣ Real-World Example (Data Science)

📌 Dataset: House Prices

| Feature | Correlation with Price |
| --- | --- |
| Area | +0.85 |
| Distance to city | -0.62 |
| Age of house | -0.40 |
| Bedrooms | +0.70 |

Interpretation:

Area is strongly and positively correlated with price (r = +0.85)

Price tends to fall as distance to the city increases (r = -0.62)
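When reading such a table, it helps to rank features by the absolute value of r, since a strong negative correlation is as informative as a strong positive one. A short sketch using the coefficients from the house price example (the variable names are hypothetical):

```python
# Coefficients taken from the house price example above.
correlation_with_price = {
    "Area": 0.85,
    "Distance to city": -0.62,
    "Age of house": -0.40,
    "Bedrooms": 0.70,
}

# Sort by strength (absolute value), strongest first.
ranked = sorted(
    correlation_with_price.items(),
    key=lambda item: abs(item[1]),
    reverse=True,
)

for feature, r in ranked:
    trend = "rise" if r > 0 else "fall"
    print(f"{feature:<17} r = {r:+.2f}  (price tends to {trend} as it increases)")

ranked_names = [feature for feature, _ in ranked]
```

The ranking describes association only. It does not show that changing a feature would change the price.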

15️⃣ Visualizing Correlation

✔ Scatter plots
✔ Heatmaps
✔ Pair plots

16️⃣ Summary (Key Takeaways)

✔ Correlation measures relationship, not causation
✔ Range is from −1 to +1
✔ Pearson → Linear
✔ Spearman → Rank / Monotonic
✔ Used heavily in EDA & ML
✔ Helps detect redundancy in features

DEV Community
