Linear Regression: A Comprehensive Guide

Learn how to build and evaluate a linear regression model in Python using scikit-learn.

What is Linear Regression?

Imagine a world where we can unlock the secrets hidden within data, where numbers dance to reveal patterns and whisper predictions. This, my friend, is the world of linear regression, a powerful tool that allows us to understand the relationships between variables. At its core, linear regression is a statistical technique that models these relationships using straight lines. Think of it as drawing a line through a scatter plot of data points, aiming to capture the essence of their dance. This line, known as the “regression line,” becomes our guide, allowing us to predict the value of one variable (the dependent variable) based on the other (the independent variable).

Why is Linear Regression Important?

Linear regression isn’t just a fancy party trick for statisticians. It’s a workhorse, a versatile tool used across diverse fields like:

  • Machine Learning: As the foundation of many machine learning algorithms, linear regression lays the groundwork for complex predictive models used in everything from image recognition to spam filtering.
  • Statistics: It’s a cornerstone of statistical analysis, helping researchers understand cause-and-effect relationships, test hypotheses, and draw meaningful conclusions from data.
  • Data Analysis: From business analysts to scientists, linear regression empowers us to explore trends, identify patterns, and make informed decisions based on data.
  • Prediction: From forecasting future sales to predicting stock market trends, linear regression helps us peer into the future, making it a valuable tool for financial analysts and beyond.

Who is this Guide For?

This comprehensive guide welcomes both eager beginners and experienced practitioners looking to solidify their understanding and push toward mastery. Whether you’re a data enthusiast with a thirst for knowledge or a seasoned statistician looking to sharpen your skills, this guide will serve as your compass, navigating you through the fascinating world of linear regression.

Foundations of Linear Regression

Linear regression is a statistical technique that allows us to model the relationship between a dependent variable (also called the response or outcome variable) and one or more independent variables (also called predictors or explanatory variables). The goal of linear regression is to find the best-fitting line or curve that describes how the dependent variable changes as a function of the independent variables.

Dependent and Independent Variables

The dependent variable is the variable that we want to explain or predict using the linear regression model. It is usually denoted by $y$. The independent variables are the variables that we use to explain or predict the dependent variable. They are usually denoted by $x_1, x_2, \dots, x_k$, where $k$ is the number of independent variables.

Assumptions of Linear Regression

Linear regression makes some assumptions about the data and the relationship between the variables. These assumptions are:

  • Linearity: The relationship between the dependent variable and the independent variables is linear, or can be approximated by a linear function.
  • Independence: The observations are independent of each other, meaning that the value of the dependent variable for one observation does not depend on the value of the dependent variable for another observation.
  • Homoscedasticity: The variance of the error term (the difference between the observed and predicted values of the dependent variable) is constant across all values of the independent variables.
  • Normality: The error term follows a normal distribution with mean zero and constant variance.

Model Equation and Interpretation

The general form of the linear regression model equation is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon$$

where $\beta_0$ is the intercept, $\beta_1, \beta_2, \dots, \beta_k$ are the coefficients, and $\epsilon$ is the error term. The intercept is the expected value of the dependent variable when all the independent variables are zero. The coefficients are the slopes of the regression line, indicating how much the dependent variable changes for a unit change in the corresponding independent variable, holding all other independent variables constant. The error term captures the random variation that is not explained by the model.

The interpretation of the model equation depends on the type and scale of the variables involved. For example, if the dependent variable and the independent variables are all continuous, then the interpretation is straightforward: for a unit increase in $x_i$, the dependent variable $y$ changes by $\beta_i$ units, on average, holding all other independent variables constant. However, if the variables are categorical, binary, or have different units, then the interpretation may require some transformations or adjustments.
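
To make this concrete, consider a purely hypothetical fitted model for house price in thousands of dollars, with floor area $x_1$ (in square feet) and age $x_2$ (in years) as predictors (the numbers are made up for illustration):

$$\hat{y} = 50 + 0.12 x_1 - 1.5 x_2$$

Under this model, each additional square foot raises the predicted price by 0.12 (that is, 120 dollars), and each additional year of age lowers it by 1.5 (1,500 dollars), holding the other variable constant.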

Goodness-of-Fit Measures

To evaluate how well the linear regression model fits the data, we can use some measures of goodness-of-fit. These measures quantify how close the predicted values are to the observed values, and how much of the variation in the dependent variable is explained by the model. Some common measures of goodness-of-fit are:

  • R-squared: This is the proportion of the total variation in the dependent variable that is explained by the model. It ranges from 0 to 1, with higher values indicating better fit. It is calculated as:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where $SS_{res}$ is the sum of squared residuals (the sum of the squared differences between the observed and predicted values of the dependent variable), and $SS_{tot}$ is the total sum of squares (the sum of the squared differences between the observed values of the dependent variable and its mean).

  • Adjusted R-squared: This is a modified version of R-squared that adjusts for the number of independent variables in the model. It penalizes the model for adding more variables that do not improve the fit significantly. It is calculated as:

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$

where n is the number of observations and k is the number of independent variables.

  • Root Mean Squared Error (RMSE): This is the square root of the mean of the squared residuals. It measures the average deviation of the predicted values from the observed values. It has the same units as the dependent variable, and lower values indicate better fit. It is calculated as:

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2}$$

where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value of the dependent variable for the $i$-th observation.
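
To make these measures concrete, here is a minimal numerical sketch in Python; the observed and predicted values are made up purely for illustration.

# Import numpy for the calculations
import numpy as np

# Hypothetical observed and predicted values (illustrative only)
y_obs = np.array([3.0, 4.5, 5.0, 6.5, 8.0])
y_hat = np.array([3.2, 4.1, 5.3, 6.4, 7.5])
n, k = len(y_obs), 1  # n observations, k independent variables (assumed)

# Sum of squared residuals and total sum of squares
ss_res = np.sum((y_obs - y_hat) ** 2)
ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)

# R-squared, adjusted R-squared, and RMSE as defined above
r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
rmse = np.sqrt(ss_res / n)

print(f'R2: {r2:.3f}, Adjusted R2: {r2_adj:.3f}, RMSE: {rmse:.3f}')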

Types of Linear Regression

Depending on the number and nature of the independent variables, there are different types of linear regression models. Some of the common types are:

Simple Linear Regression

This is the simplest type of linear regression, where there is only one independent variable. The model equation is:

$$y = \beta_0 + \beta_1 x + \epsilon$$

The coefficients $\beta_0$ and $\beta_1$ can be estimated using the method of ordinary least squares (OLS), which minimizes the sum of squared residuals. The slope $\beta_1$ can also be calculated using the formula:

$$\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}$$

where $\bar{x}$ and $\bar{y}$ are the means of the independent and dependent variables, respectively.
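
As a minimal sketch of these formulas in Python, using a tiny made-up dataset (illustrative only); the intercept comes from the standard OLS identity $\beta_0 = \bar{y} - \beta_1 \bar{x}$.

# Import numpy for the calculations
import numpy as np

# Hypothetical data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Slope from the OLS formula above; intercept from beta_0 = mean(y) - beta_1 * mean(x)
beta_1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta_0 = y.mean() - beta_1 * x.mean()

print(f'Intercept: {beta_0:.3f}, Slope: {beta_1:.3f}')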

Multiple Linear Regression

This is the type of linear regression where there is more than one independent variable. The model equation is:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon$$

The coefficients $\beta_0, \beta_1, \dots, \beta_k$ can be estimated using the method of OLS, which involves solving a system of normal equations. Alternatively, they can be estimated using matrix algebra, as:

$$\boldsymbol{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

where $\boldsymbol{\beta}$ is the vector of coefficients, $\mathbf{X}$ is the matrix of independent variables (with a column of ones for the intercept), and $\mathbf{y}$ is the vector of observed values of the dependent variable.
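
Here is a minimal sketch of the matrix formula in Python with numpy, on a small made-up dataset; np.linalg.solve is used instead of an explicit inverse for numerical stability.

# Import numpy for the matrix algebra
import numpy as np

# Hypothetical data with two independent variables (illustrative only)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([4.0, 5.0, 9.0, 10.0, 13.0])

# Add a column of ones for the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the normal equations (X^T X) beta = X^T y
beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta)  # [intercept, beta_1, beta_2]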

Polynomial Regression

This is the type of linear regression where the relationship between the dependent variable and the independent variable is non-linear but can be approximated by a polynomial function. The model equation is:

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d + \epsilon$$

where $d$ is the degree of the polynomial. The coefficients $\beta_0, \beta_1, \dots, \beta_d$ can be estimated using the same methods as in multiple linear regression, by treating the powers of $x$ as separate independent variables. The degree of the polynomial should be chosen carefully, as a higher degree may lead to overfitting (fitting the noise in the data) and a lower degree may lead to underfitting (missing the true pattern in the data).
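
A minimal sketch of polynomial regression with scikit-learn, assuming a single feature and a degree-2 fit on synthetic data; the PolynomialFeatures transformer generates the powers of x so that an ordinary linear model can be fitted on them.

# Import the necessary modules
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic data with a quadratic pattern plus noise (illustrative only)
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() + 0.5 * x.ravel() ** 2 + rng.normal(scale=0.5, size=100)

# Generate x and x^2 as separate features, then fit an ordinary linear model
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly_model.fit(x, y)

# Estimated coefficients for x and x^2, and the intercept
print(poly_model.named_steps['linearregression'].coef_)
print(poly_model.named_steps['linearregression'].intercept_)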

Building a Linear Regression Model

Linear regression is a supervised learning technique that models the relationship between a continuous target variable and one or more explanatory variables. In this tutorial, we will use the California housing dataset from scikit-learn to build and evaluate a linear regression model. The dataset contains information about the median house value, median income, population, and other features for census block groups in California.

Data Preparation

Before we can train and test our model, we need to prepare the data by importing it, handling missing values and outliers, and performing feature scaling.

Importing data

We can use the fetch_california_housing function from scikit-learn to load the dataset as a Bunch object, which is similar to a dictionary. We can access the data, the target, and the feature names using the keys data, target, and feature_names, respectively.

# Import the necessary modules
from sklearn.datasets import fetch_california_housing
import pandas as pd

# Load the dataset
california = fetch_california_housing()

# Convert the data and target to pandas DataFrame and Series
X = pd.DataFrame(california.data, columns=california.feature_names)
y = pd.Series(california.target, name='MedianHouseValue')

We can print the shape and the first five rows of the data and target to get a sense of the data.

# Print the shape and the first five rows of the data and target
print(X.shape)
print(X.head())
print(y.shape)
print(y.head())

Handling missing values and outliers

We can use the isna and describe methods of pandas to check for missing values and outliers in our data. We can see that there are no missing values in our data, but there are some outliers in some features, such as MedInc and AveOccup. We can use the clip method of pandas to cap the values of these features at a certain percentile, such as the 99th percentile, to reduce the effect of outliers.

# Check for missing values
print(X.isna().sum())

# Check for outliers
print(X.describe())

# Clip the values of MedInc and AveOccup at the 99th percentile
X['MedInc'] = X['MedInc'].clip(upper=X['MedInc'].quantile(0.99))
X['AveOccup'] = X['AveOccup'].clip(upper=X['AveOccup'].quantile(0.99))

# Check the summary statistics after clipping
print(X.describe())

Feature scaling

We can use the StandardScaler class from scikit-learn to perform feature scaling. Feature scaling is the process of standardizing the range of the features, which can improve the performance and convergence of the model. We can use the fit_transform method of the StandardScaler class to apply the scaling to our data.

# Import the necessary module
from sklearn.preprocessing import StandardScaler

# Create a StandardScaler object and fit_transform the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

We can also plot the data distribution before and after feature scaling using the hist method of pandas. We can see that the scaled features are centered at zero with a standard deviation of one, and that the earlier clipping keeps extreme values from dominating the distributions.

# Import matplotlib for plotting
import matplotlib.pyplot as plt

# Plot the data distribution before feature scaling
X.hist(figsize=(10, 10))
plt.suptitle('Data distribution before feature scaling')
plt.show()

# Plot the data distribution after feature scaling
pd.DataFrame(X_scaled, columns=X.columns).hist(figsize=(10, 10))
plt.suptitle('Data distribution after feature scaling')
plt.show()

Output

Histograms of the feature distributions before and after scaling.

Model Training

After preparing the data, we can train and test our model using the LinearRegression class and the train_test_split function from scikit-learn. We can use the fit method of the LinearRegression class to train the model on the training data, and the predict method to make predictions on the test data. We can also use the coef_ and intercept_ attributes of the LinearRegression class to access the coefficients and the intercept of the model.

# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split the data and the target into train and test sets with 80/20 ratio
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Create a LinearRegression object and fit the model on the train set
model = LinearRegression()
model.fit(X_train, y_train)

We can print the coefficients and the intercept of the model to see the estimated parameters.

# Print the coefficients and the intercept of the model
print(model.coef_)
print(model.intercept_)

Output

[ 0.8301113   0.14911065 -0.26162232  0.31449458  0.0427653  -0.26440009
 -0.88210693 -0.80948829]
2.067381918996376

Model Evaluation

To evaluate the performance of our model, we can use various metrics, such as the mean squared error (MSE), the root mean squared error (RMSE), the mean absolute error (MAE), and the coefficient of determination (R2). We can use the mean_squared_error, mean_absolute_error, and r2_score functions from scikit-learn to calculate these metrics.

We can also use the residplot function from seaborn to plot the residuals against the fitted values, and the qqplot function from statsmodels to plot the quantiles of the residuals against the theoretical normal quantiles. These plots can help us check the assumptions of linear regression, such as linearity, homoscedasticity, and normality.

# Import the necessary modules
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import seaborn as sns
import statsmodels.api as sm
import matplotlib.pyplot as plt
import numpy as np

# Make predictions on the test set
y_pred = model.predict(X_test)

# Calculate the MSE, RMSE, MAE, and R2 on the test set
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

# Print the metrics
print(f'MSE: {mse:.3f}')
print(f'RMSE: {rmse:.3f}')
print(f'MAE: {mae:.3f}')
print(f'R2: {r2:.3f}')

Output

MSE: 0.474
RMSE: 0.689
MAE: 0.497
R2: 0.638

We can plot the residuals vs the fitted values to check for homoscedasticity, which means that the variance of the residuals is constant across the range of the fitted values. If the plot shows a random scatter of points around zero, then the assumption is met. If the plot shows a pattern, such as a curved or fanning shape, then the assumption is violated.

# Plot the residuals vs the fitted values
sns.residplot(x=y_pred, y=y_test - y_pred, lowess=True, line_kws={'color': 'red'})
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residual plot')
plt.show()

Output

We can see that the residual plot shows a slight curve, suggesting that the model may be missing a non-linear relationship in the data.

We can plot the quantiles of the residuals vs the theoretical normal quantiles to check for normality, which means that the distribution of the residuals is approximately normal. If the plot shows that the points lie on or near the 45-degree line, then the assumption is met. If the plot shows that the points deviate from the line, especially at the tails, then the assumption is violated.

# Plot the quantiles of the residuals vs the theoretical normal quantiles
sm.qqplot(y_test - y_pred, line='s')
plt.title('Q-Q plot')
plt.show()

Output

We can see that the Q-Q plot shows that the points are close to the line, indicating that the normality assumption is reasonably met.

Cross-validation for generalizability

To check the generalizability of our model, we can use cross-validation, which is a technique that splits the data into multiple folds and trains and tests the model on each fold. We can use the cross_val_score function from scikit-learn to perform cross-validation and obtain the scores for each fold. We can also use the np.mean and np.std functions from numpy to calculate the mean and the standard deviation of the scores.

# Import the necessary modules
from sklearn.model_selection import cross_val_score
import numpy as np

# Perform 10-fold cross-validation with R2 as the scoring metric
scores = cross_val_score(model, X_scaled, y, cv=10, scoring='r2')

# Print the scores for each fold
print(scores)

# Print the mean and the standard deviation of the scores
print(f'Mean: {np.mean(scores):.3f}')
print(f'Std: {np.std(scores):.3f}')

Output

[0.50802019 0.65092848 0.55035332 0.59520576 0.66644271 0.57967816
 0.48014124 0.52635576 0.56914073 0.59347323]
Mean: 0.572
Std: 0.056

We can see that the mean cross-validation score (about 0.57) is somewhat lower than the test-set R2 (0.638) but in the same range, suggesting that the model generalizes reasonably well and is not badly overfitting. The low standard deviation of the scores also indicates that the model's performance is consistent across different folds.

Conclusion

In this tutorial, we have learned how to build and evaluate a linear regression model using the California housing dataset from scikit-learn. We have covered the following steps:

  • Data preparation: importing data, handling missing values and outliers, and feature scaling
  • Model training: splitting the data into train and test sets, and fitting a linear regression model
  • Model evaluation: calculating various metrics, such as MSE, RMSE, MAE, and R2, and plotting the residuals and the Q-Q plot to check for the assumptions of linear regression
  • Cross-validation: performing 10-fold cross-validation to check the generalizability of the model

Some of the key takeaways and benefits of mastering linear regression are:

  • Linear regression is a simple yet powerful technique that can model the relationship between a continuous target variable and one or more explanatory variables
  • Linear regression can provide interpretable coefficients that indicate the direction and magnitude of the effect of each variable on the target
  • Linear regression can be used for various purposes, such as prediction, inference, hypothesis testing, and feature selection

However, linear regression also has some limitations and challenges, such as:

  • Linear regression assumes a linear relationship between the variables, which may not hold in reality
  • Linear regression is sensitive to outliers, multicollinearity, heteroscedasticity, and non-normality, which can affect the accuracy and validity of the model
  • Linear regression may suffer from overfitting or underfitting, which can reduce the performance and generalizability of the model

Therefore, to improve our skills and knowledge in linear regression, we can explore some future directions and advanced topics, such as:

  • Regularization: adding a penalty term to the loss function to reduce the complexity and variance of the model, such as ridge, lasso, and elastic net regression (see the short sketch after this list)
  • Non-linear regression: using non-linear functions or transformations to model the non-linear relationship between the variables, such as polynomial, logarithmic, exponential, and sigmoid regression
  • Generalized linear models: extending the linear regression framework to handle different types of target variables, such as binary, categorical, count, or proportional data, via models like logistic, multinomial, Poisson, and gamma regression.
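
As a small taste of the regularization direction, here is a minimal sketch using scikit-learn's Ridge and Lasso estimators, assuming the X_scaled and y variables prepared earlier in this tutorial; the alpha values are arbitrary starting points, not tuned choices.

# Import the regularized linear models
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import cross_val_score

# Ridge adds an L2 penalty, Lasso an L1 penalty; alpha controls the penalty strength
for name, reg in [('Ridge', Ridge(alpha=1.0)), ('Lasso', Lasso(alpha=0.01))]:
    scores = cross_val_score(reg, X_scaled, y, cv=10, scoring='r2')
    print(f'{name}: mean R2 = {scores.mean():.3f}')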

Frequently Asked Questions

What is the difference between simple and multiple linear regression?

Simple linear regression involves one independent variable and one dependent variable, while multiple linear regression involves two or more independent variables and one dependent variable.

How to check the accuracy of a linear regression model?

One way to check the accuracy of a linear regression model is to calculate the coefficient of determination (R2), which measures how well the model fits the data.

How to handle categorical variables in linear regression?

One way to handle categorical variables in linear regression is to use dummy variables, which are binary variables that represent the presence or absence of a category.
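
For instance, a minimal sketch with pandas, using a made-up categorical column:

# Import pandas for the dummy encoding
import pandas as pd

# Hypothetical categorical feature (illustrative only)
df = pd.DataFrame({'city': ['LA', 'SF', 'SD', 'SF']})

# One binary column per category; drop_first avoids redundancy with the intercept
dummies = pd.get_dummies(df, columns=['city'], drop_first=True)
print(dummies)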

How to interpret the coefficients of a linear regression model?

The coefficients of a linear regression model represent the slope of the line of best fit, which indicates the change in the dependent variable for each unit change in the independent variable.
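
In practice, it helps to pair each coefficient with its feature name; here is a minimal sketch, assuming the fitted model and the feature DataFrame X from this tutorial (note that these coefficients are on the standardized scale, since the features were scaled before training).

# Import pandas to label the coefficients
import pandas as pd

# Pair each coefficient with its feature name and sort by magnitude
coefs = pd.Series(model.coef_, index=X.columns).sort_values(key=abs, ascending=False)
print(coefs)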

How to plot a linear regression line in Python?

One way to plot a linear regression line in Python is to use the seaborn library, which has a regplot function that can create a scatter plot with a fitted linear regression line.
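
For example, a minimal sketch that plots median income against the target with a fitted line, assuming the X and y DataFrames from this tutorial:

# Import the plotting libraries
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plot of MedInc vs the target with a fitted regression line
sns.regplot(x=X['MedInc'], y=y, scatter_kws={'alpha': 0.1}, line_kws={'color': 'red'})
plt.xlabel('Median income')
plt.ylabel('Median house value')
plt.show()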
