Ever wondered how spam filters know your emails are junk, or how doctors predict the risk of a disease? The answer lies in a powerful statistical tool called Logistic Regression. This guide, aimed at beginner and intermediate Python users, will unravel the magic of Logistic Regression and equip you with the skills to unlock its potential in Python.
Logistic Regression in a nutshell:
- What it does: Predicts the probability of an event happening (think “spam/not spam”) based on its features (email content, medical tests).
- Not your everyday regression: Instead of continuous values like house prices, it deals with categorical outcomes like yes/no or pass/fail.
- Think odds, not just numbers: It models the log odds of an event, which keeps predicted probabilities between 0 and 1 and makes the coefficients easy to interpret.
Why choose Logistic Regression in Python?
- Simple yet powerful: Its straightforward logic makes it easy to understand and implement.
- Versatile tool: Handles binary and multi-class classification tasks efficiently.
- Python’s magic: Rich libraries like scikit-learn offer readily available tools for building and tuning your models.
- Interpretable insights: Gain a valuable understanding of how features influence the outcome through coefficients.
- Solid foundation: A stepping stone to more complex machine learning algorithms.
Ready to dive in? This guide will equip you with the knowledge and Python code to:
- Prepare your data: Learn how to format and wrangle data for effective analysis.
- Build your model: Explore different ways to train Logistic Regression models in Python.
- Evaluate and tune: Understand how to assess your model’s performance and fine-tune it for optimal accuracy.
- Make predictions: Use your trained model to predict the probability of future events.
By the end, you’ll be confident to tackle real-world problems with Logistic Regression in Python, from predicting customer churn to classifying financial transactions. So, let’s unlock the power of probabilities and embark on this exciting journey together!
Prerequisites and Setup
Before embarking on our Logistic Regression adventure, let’s gather the essentials:
Python Installation
- Ensure you have Python (version 3.6 or later) installed on your system.
- If not, download it from the official Python website (https://www.python.org/downloads/) and follow the installation instructions.
Essential Libraries
- NumPy: Provides powerful array and matrix operations for efficient data manipulation.
- Pandas: Offers high-performance data structures and tools for data analysis and manipulation.
- scikit-learn: The heart of our journey, containing a vast array of machine learning algorithms, including Logistic Regression.
Installation:
- Using pip (recommended)
pip install numpy pandas scikit-learn
- Using Anaconda:
conda install numpy pandas scikit-learn
Verification:
- Open a Python interpreter or a Jupyter Notebook and try importing the libraries:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression # Importing the model
If no errors occur, you’re all set!
Additional Tools (Optional):
- Jupyter Notebook: For interactive coding and visualization.
- Matplotlib or Seaborn: For creating informative plots and charts to visualize your results.
Ready to Explore?
With your environment set up, get ready to dive into the fascinating world of Logistic Regression in Python. In the next sections, we’ll explore data preparation, model building, evaluation, and prediction, empowering you to make informed decisions based on probabilities!
Understanding Logistic Regression
Logistic regression is a supervised machine learning algorithm used to solve binary classification problems. In this section, we will answer the following questions:
- What is binary classification and the sigmoid function?
- How to build a logistic regression model using the logit function?
- How to interpret the coefficients and their impact on predictions?
- How to evaluate the performance of a logistic regression model using different metrics?
Binary Classification and the Sigmoid Function
Binary classification is a type of machine learning task where the goal is to predict whether an instance belongs to one of two possible classes, such as spam or not spam, positive or negative, etc. To do this, we need a way to map the input features to a probability value between 0 and 1, which represents the likelihood of the instance belonging to the positive class.
One common way to achieve this is to use the sigmoid function, also known as the logistic function. The sigmoid function is defined as:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
The sigmoid function takes any real value as input and outputs a value between 0 and 1. When plotted, its graph has a characteristic “S” shape, as shown below:
import numpy as np
import matplotlib.pyplot as plt
# Define the sigmoid function
def sigmoid(x):
return 1 / (1 + np.exp(-x))
# Plot the sigmoid function
x = np.linspace(-10, 10, 100)
y = sigmoid(x)
plt.plot(x, y)
plt.xlabel("x")
plt.ylabel(r"$\sigma(x)$")
plt.title("Sigmoid Function")
plt.show()
Output: the plot shows the characteristic S-shaped curve of the sigmoid function, rising from 0 towards 1 as x increases.
The sigmoid function has some nice properties that make it suitable for binary classification. For example:
- It is monotonically increasing, meaning that σ(x) always increases as x increases.
- It has a clear threshold at 0.5, meaning that if x is greater than 0, then σ(x) is greater than 0.5, and vice versa.
- It is differentiable, meaning that we can calculate its derivative, which is useful for optimization algorithms.
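As a quick sanity check, the short sketch below (reusing the sigmoid function defined above) verifies two of these properties numerically: the derivative identity σ'(x) = σ(x)(1 - σ(x)) and the fact that σ(0) is exactly 0.5.
import numpy as np
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
x = np.linspace(-5, 5, 11)
# The derivative of the sigmoid can be written in terms of the sigmoid itself
analytic_grad = sigmoid(x) * (1 - sigmoid(x))
# Compare it against a numerical (finite-difference) approximation
eps = 1e-6
numeric_grad = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
print(np.allclose(analytic_grad, numeric_grad))  # True
# The threshold property: the sigmoid of 0 is exactly 0.5
print(sigmoid(0.0))  # 0.5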
Logistic Regression Model and the Logit Function
A logistic regression model is a type of linear model that uses the sigmoid function to map the input features to a probability value. The general form of a logistic regression model is:
$$\hat{y} = \sigma(w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n)$$
where ŷ is the predicted probability, w0,w1,…,wn are the coefficients or weights, and x1,x2,…,xn are the input features.
The term inside the sigmoid function, w0+w1x1+w2x2+…+wnxn, is called the logit function or the log-odds function. It represents the logarithm of the odds, which is the ratio of the probability of the positive class to the probability of the negative class. The logit function can be written as:
$$\text{logit}(\hat{y}) = \log \frac{\hat{y}}{1 - \hat{y}} = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_n x_n$$
The logit function is a linear function of the input features, which means that it assumes a linear relationship between the features and the log-odds of the positive class. This is a simplifying assumption that may not always hold in real-world problems, but it often works well enough for many applications.
The main goal of logistic regression is to find the optimal values of the coefficients that best fit the data. This can be done by using various optimization algorithms, such as gradient descent, that minimize a loss function, such as the cross-entropy loss, that measures the difference between the predicted probabilities and the actual labels.
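To make this concrete, here is a minimal NumPy sketch of gradient descent minimizing the average cross-entropy loss on a tiny made-up dataset with a single feature. scikit-learn's solvers are far more sophisticated, but the principle is the same: repeatedly nudge the coefficients in the direction that reduces the loss.
import numpy as np
# Tiny illustrative dataset: one feature, binary labels
X = np.array([[0.5], [1.5], [2.0], [3.0], [3.5], [4.5]])
y = np.array([0, 0, 0, 1, 1, 1])
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
w = np.zeros(X.shape[1])  # feature weight(s)
b = 0.0                   # intercept
lr = 0.1                  # learning rate
for _ in range(5000):
    p = sigmoid(X @ w + b)           # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)  # gradient of the cross-entropy loss w.r.t. w
    grad_b = np.mean(p - y)          # gradient w.r.t. the intercept
    w -= lr * grad_w
    b -= lr * grad_b
print("weight:", w, "intercept:", b)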
Interpretation of Coefficients and their Impact on Predictions
The coefficients of a logistic regression model can be interpreted as the change in the log-odds of the positive class for a unit change in the corresponding feature, holding all other features constant. For example, if w1 is the coefficient for feature x1, then increasing x1 by one unit will increase the log-odds of the positive class by w1 units, assuming that all other features remain the same.
To understand the impact of the coefficients on the predictions, we can use the following formula:
$$\frac{\partial \hat{y}}{\partial x_i} = \hat{y} (1 - \hat{y}) w_i$$
This formula is the partial derivative of the predicted probability with respect to a feature; it measures how much the predicted probability changes when the feature changes by a small amount. It tells us that the impact of a feature on the prediction depends on two factors:
- The value of the predicted probability itself. The term ŷ(1 - ŷ) is largest when ŷ is close to 0.5 and shrinks towards zero as ŷ approaches 0 or 1, so predictions near the decision boundary are the most sensitive to changes in the features.
- The value of the coefficient for the feature. The larger the coefficient is in absolute value, the more impact the feature has on the prediction.
The feature values themselves influence this sensitivity only indirectly, through their effect on ŷ. The sketch below evaluates the formula on made-up values.
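In the sketch, the coefficients and feature values are hypothetical, chosen purely for illustration.
import numpy as np
def sigmoid(z):
    return 1 / (1 + np.exp(-z))
# Hypothetical coefficients: intercept w0 and weights w1, w2
w0, w1, w2 = -1.0, 0.8, -0.5
# Hypothetical feature values for one instance
x1, x2 = 2.0, 1.0
# Predicted probability for this instance
y_hat = sigmoid(w0 + w1 * x1 + w2 * x2)
# Sensitivity of the prediction to each feature: y_hat * (1 - y_hat) * w_i
sens_x1 = y_hat * (1 - y_hat) * w1
sens_x2 = y_hat * (1 - y_hat) * w2
print(f"y_hat = {y_hat:.3f}, sensitivity to x1 = {sens_x1:.3f}, sensitivity to x2 = {sens_x2:.3f}")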
Evaluation Metrics for Logistic Regression Models
To evaluate the performance of a logistic regression model, we need to compare the predicted probabilities with the actual labels and measure how well they match. Different metrics can be used for this purpose, such as:
Accuracy: This is the proportion of instances that are correctly classified by the model. It is calculated as:
$$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
Accuracy is a simple and intuitive metric, but it can be misleading when the classes are imbalanced, meaning that one class is much more frequent than the other. For example, if we have a dataset where 90% of the instances are negative and 10% are positive, and we have a model that always predicts negative, then the accuracy of the model is 90%, which seems high, but it is not useful at all, as it fails to identify any positive instances.
Precision: This is the proportion of positive predictions that are actually positive. It is calculated as:
$$\text{Precision} = \frac{\text{Number of true positives}}{\text{Number of true positives + Number of false positives}}$$
Precision measures how reliable the model is when it predicts a positive. A high precision means that the model rarely makes false positive errors, meaning that it does not label negative instances as positive. Precision is important when the cost of false positives is high, such as in spam detection or medical diagnosis.
Recall: This is the proportion of positive instances that are correctly predicted by the model. It is calculated as:
$$\text{Recall} = \frac{\text{Number of true positives}}{\text{Number of true positives + Number of false negatives}}$$
Recall measures how complete the model is when it identifies positive instances. A high recall means that the model rarely misses positive instances, meaning that it does not label positive instances as negative. Recall is important when the cost of false negatives is high, such as in fraud detection or cancer screening.
F1-score: This is the harmonic mean of precision and recall. It is calculated as:
$$\text{F1-score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
F1-score is a balanced metric that combines both precision and recall. It gives a higher weight to lower values, meaning that it penalizes the model more if either precision or recall is low. F1-score is useful when there is no clear preference between precision and recall, or when the classes are imbalanced.
Other metrics can also be used to evaluate logistic regression models, such as the ROC curve, the AUC score, and the confusion matrix. We will use the ROC curve and AUC score when evaluating our model later in this tutorial.
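To see how the first four metrics behave in code, here is a short sketch using scikit-learn's metric functions on a handful of made-up labels and predictions (4 true positives, 4 true negatives, 1 false positive, and 1 false negative).
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Made-up ground-truth labels and model predictions for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]
print("Accuracy: ", accuracy_score(y_true, y_pred))   # 8 correct out of 10 = 0.8
print("Precision:", precision_score(y_true, y_pred))  # 4 TP / (4 TP + 1 FP) = 0.8
print("Recall:   ", recall_score(y_true, y_pred))     # 4 TP / (4 TP + 1 FN) = 0.8
print("F1-score: ", f1_score(y_true, y_pred))         # harmonic mean of 0.8 and 0.8 = 0.8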
Building a Logistic Regression Model
In this tutorial, we will build a logistic regression model to predict whether a tumor is malignant or benign based on its features. We will use the breast cancer dataset from scikit-learn, which contains 569 samples with 30 features each. The target variable is binary, indicating whether the tumor is malignant (0) or benign (1).
Data Preparation
Before we can build and train our model, we need to prepare the data for analysis. This involves loading and exploring the dataset, preprocessing the data, and splitting the data into training, validation, and test sets.
Loading and Exploring the Dataset
We can load the dataset using the load_breast_cancer function from scikit-learn. This returns a Bunch object, which is similar to a dictionary, containing the data, the target, and some metadata.
# Import the load_breast_cancer function
from sklearn.datasets import load_breast_cancer
# Load the dataset
data = load_breast_cancer()
# Print the keys of the data object
print(data.keys())
Output
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
We can see that the data object has several keys, such as:
- data: The feature matrix, which is a NumPy array of shape (569, 30).
- target: The target vector, which is a NumPy array of shape (569,).
- target_names: The names of the target classes, which are ‘malignant’ and ‘benign’.
- DESCR: A description of the dataset, which contains information such as the number of samples, the number of features, the meaning of the features, and some summary statistics.
- feature_names: The names of the features, which are the mean, standard error, and worst (largest) values of 10 different measurements of the tumors, such as radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.
We can print the DESCR key to get more details about the dataset.
# Print the description of the dataset
print(data.DESCR)
We can see that the dataset contains 569 samples, 30 features, and 2 classes. The features are numerical and have different scales and ranges. The classes are imbalanced, with 357 benign samples and 212 malignant samples.
We can convert the data object into a pandas dataframe for easier manipulation and visualization.
# Import pandas
import pandas as pd
# Create a dataframe with the feature variables
df = pd.DataFrame(data.data, columns=data.feature_names)
# Add the target variable to the dataframe
df['target'] = data.target
# Print the first five rows of the dataframe
print(df.head())
We can use the info() and describe() methods to get some basic information and statistics about the dataframe.
# Print the information of the dataframe
print(df.info())
# Print the summary statistics of the dataframe
print(df.describe())
We can also use the value_counts() method to check the distribution of the target variable.
# Print the frequency of the target variable
print(df['target'].value_counts())
Output
1 357
0 212
Name: target, dtype: int64
We can see that there are no missing values in the dataframe, and the target variable has 357 samples of class 1 (benign) and 212 samples of class 0 (malignant).
Preprocessing the Data
Before we can feed the data to the logistic regression model, we need to do some preprocessing steps, such as:
- Encoding categorical variables: If the dataset contains any categorical variables, such as gender, color, or country, we need to encode them into numerical values, such as 0 and 1, or one-hot vectors. This can be done using various methods, such as label encoding, one-hot encoding, or ordinal encoding. However, in this case, the dataset does not have any categorical variables, so we can skip this step.
- Scaling numerical features: Since the features have different scales and ranges, we need to standardize them to have a mean of 0 and a standard deviation of 1. This can help the model converge faster and avoid numerical instability. We can use the StandardScaler class from scikit-learn to perform this step.
# Import the StandardScaler class
from sklearn.preprocessing import StandardScaler
# Create an instance of the scaler
scaler = StandardScaler()
# Fit the scaler to the feature variables
scaler.fit(df[data.feature_names])
# Transform the feature variables
df_scaled = scaler.transform(df[data.feature_names])
# Create a new dataframe with the scaled features
df_scaled = pd.DataFrame(df_scaled, columns=data.feature_names)
# Add the target variable to the dataframe
df_scaled['target'] = df['target']
# Print the first five rows of the dataframe
print(df_scaled.head())
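As a quick optional check, we can confirm that each scaled feature now has a mean of approximately 0 and a standard deviation of approximately 1.
# Verify the scaling: means should be close to 0 and standard deviations close to 1
print(df_scaled[data.feature_names].mean().round(2).head())
print(df_scaled[data.feature_names].std().round(2).head())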
Splitting the Data
The final step of data preparation is to split the data into training, validation, and test sets. The training set is used to fit the model, the validation set is used to tune the hyperparameters, and the test set is used to evaluate the final performance of the model. We can use the train_test_split function from scikit-learn to perform this step. We will use 70% of the data for training, 15% for validation, and 15% for testing.
# Import the train_test_split function
from sklearn.model_selection import train_test_split
# Split the data into X and y
X = df_scaled[data.feature_names]
y = df_scaled['target']
# Split the data into training, validation, and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_test, y_test, test_size=0.5, random_state=42)
# Print the shapes of the splits
print(X_train.shape, y_train.shape)
print(X_val.shape, y_val.shape)
print(X_test.shape, y_test.shape)
Output
(398, 30) (398,)
(85, 30) (85,)
(86, 30) (86,)
We can see that the training set has 398 samples, the validation set has 85 samples, and the test set has 86 samples.
Training the Model
Now that we have prepared the data, we can build and train our logistic regression model using scikit-learn. We will use the LogisticRegression class, which implements a regularized logistic regression model with various options for the solver, the penalty, and the regularization strength. We will use the default settings, which are:
- solver='lbfgs': This is the algorithm that optimizes the loss function. LBFGS stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno, a quasi-Newton method that approximates the second-order derivative of the loss function using a limited amount of memory.
- penalty='l2': This is the type of regularization applied to the coefficients to prevent overfitting. L2 regularization, also known as ridge regularization, adds a term to the loss function that is proportional to the squared norm of the coefficients, which shrinks them towards zero.
- C=1.0: This is the inverse of the regularization strength, which controls how much the coefficients are penalized. A smaller value of C means stronger regularization and a larger value of C means weaker regularization (an example of overriding these defaults is shown below).
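These defaults can be overridden when the model is created. For example, a purely illustrative sketch with a different solver, an L1 penalty, and stronger regularization (values not tuned for this dataset) might look like this:
from sklearn.linear_model import LogisticRegression
# Illustrative non-default settings: liblinear solver, L1 penalty, stronger regularization
sparse_model = LogisticRegression(solver='liblinear', penalty='l1', C=0.1, max_iter=1000)
# sparse_model.fit(X_train, y_train) would train it in exactly the same way as below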
For this tutorial we stick with the defaults: we create an instance of the LogisticRegression class and fit it to the training data using the fit method.
# Import the LogisticRegression class
from sklearn.linear_model import LogisticRegression
# Create an instance of the model
model = LogisticRegression()
# Fit the model to the training data
model.fit(X_train, y_train)
We can check the coefficients and the intercept of the model using the coef_ and intercept_ attributes.
# Print the coefficients of the model
print(model.coef_)
# Print the intercept of the model
print(model.intercept_)
Output
[[-0.36022215 -0.37065437 -0.31607433 -0.42392253 -0.17034368 0.60368634
-0.77189069 -1.10278226 0.22958232 0.14278871 -1.21021192 0.16792803
-0.5605131 -0.88607955 -0.1693011 0.60870865 -0.10353489 -0.46753381
0.51643068 0.70446749 -0.8043121 -1.31427119 -0.52803186 -0.77963845
-0.5200606 0.0953213 -0.98070976 -0.81926084 -1.19784114 -0.11189745]]
[0.35820399]
We can see that the model has learned 30 coefficients, one for each feature, and one intercept, which is the bias term. The coefficients represent the change in the log-odds of the positive class for a unit change in the corresponding feature, holding all other features constant. The intercept represents the log-odds of the positive class when all the features are zero.
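Because the coefficients are on the log-odds scale, exponentiating them turns them into odds ratios, which are often easier to communicate. The short sketch below (reusing the fitted model and the feature names) prints the five smallest odds ratios; since the features were standardized, a value below 1 means that increasing the feature by one standard deviation lowers the odds of the positive class.
import numpy as np
# Exponentiate the coefficients to obtain odds ratios per one-standard-deviation increase
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=data.feature_names)
print(odds_ratios.sort_values().head())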
We can use the predict and predict_proba methods to make predictions on new data using the model. The predict method returns the predicted class labels, either 0 or 1, based on a threshold of 0.5. The predict_proba method returns the predicted probabilities of each class, which are the outputs of the sigmoid function.
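As a quick illustration, here is a sketch of both methods on the first few validation rows; the exact values depend on the fitted model.
# Class labels and probabilities for the first three validation samples
sample_pred = model.predict(X_val.iloc[:3])
sample_prob = model.predict_proba(X_val.iloc[:3])
print(sample_pred)           # predicted class labels (0 or 1)
print(sample_prob.round(3))  # one row per sample: [P(class 0), P(class 1)]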
We can compare the predictions with the actual labels to evaluate the performance of the model.
Model Evaluation
After building and training the logistic regression model, we need to evaluate its performance on the validation set, which is a set of data that the model has not seen before. We can use various metrics to measure how well the model predicts the class labels and the probabilities of the validation data. Some of the common metrics are:
- Accuracy
- Precision
- Recall
- F1-Score
- ROC Curve
We can use scikit-learn to calculate and plot these metrics for our logistic regression model on the validation data.
# Import the metrics module
from sklearn import metrics
# Generate the class predictions and probabilities for the validation set
y_pred = model.predict(X_val)
y_prob = model.predict_proba(X_val)
# Calculate the accuracy, precision, recall, and f1-score
accuracy = metrics.accuracy_score(y_val, y_pred)
precision = metrics.precision_score(y_val, y_pred)
recall = metrics.recall_score(y_val, y_pred)
f1_score = metrics.f1_score(y_val, y_pred)
# Print the metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1_score:.4f}")
# Calculate the true positive rate and false positive rate for different thresholds
fpr, tpr, thresholds = metrics.roc_curve(y_val, y_prob[:, 1])
# Calculate the area under the curve
auc = metrics.roc_auc_score(y_val, y_prob[:, 1])
# Plot the ROC curve
import matplotlib.pyplot as plt
plt.plot(fpr, tpr, label=f"AUC = {auc:.4f}")
plt.plot([0, 1], [0, 1], linestyle='--', color='red', label="Random")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
Output
Accuracy: 0.9882
Precision: 0.9796
Recall: 1.0000
F1-score: 0.9897
Analysis and Interpretation of the Results
We can analyze and interpret the results of the model evaluation using the metrics and the ROC curve. Some of the observations are:
- The accuracy of the model is 0.9882, which means that the model correctly classified 98.82% of the validation instances. This is a high accuracy, but it does not tell us how well the model performs on each class separately.
- The precision of the model is 0.9796, which means that out of all the instances that the model predicted as positive, 97.96% were actually positive. This is a high precision, which means that the model is reliable when it predicts positive.
- The recall of the model is 1, which means that out of all the positive instances, the model correctly predicted 100% of them. This is a high recall, which means that the model is complete when it identifies positive instances.
- The f1-score of the model is 0.9897, which is the harmonic mean of precision and recall. This is a high f1-score, which means that the model is balanced between precision and recall.
- The AUC of the model is 0.9977, which is the area under the ROC curve. This is a high AUC, which means that the model performs well across different thresholds. The ROC curve shows that the model has a high true positive rate and a low false positive rate for most of the thresholds, which means that it can distinguish between the two classes well. The ROC curve also shows that the model is much better than a random classifier, whose ROC curve would be the diagonal line from (0, 0) to (1, 1).
Based on these results, we can conclude that the logistic regression model has a good performance on the validation data, and it can predict whether a tumor is malignant or benign with high accuracy, precision, recall, and f1-score. The model also has a high AUC, which means that it can adjust to different thresholds depending on the trade-off between sensitivity and specificity.
However, these results are only based on the validation data, which is a small subset of the data. To get a more reliable estimate of the model’s performance, we need to evaluate it on the test data, which is the final and unseen set of data.
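A sketch of that final check, reusing the metrics module imported earlier (the resulting numbers depend on the fitted model), might look like this:
# Evaluate the final model on the held-out test set
y_test_pred = model.predict(X_test)
y_test_prob = model.predict_proba(X_test)
print("Test accuracy: ", metrics.accuracy_score(y_test, y_test_pred))
print("Test precision:", metrics.precision_score(y_test, y_test_pred))
print("Test recall:   ", metrics.recall_score(y_test, y_test_pred))
print("Test F1-score: ", metrics.f1_score(y_test, y_test_pred))
print("Test AUC:      ", metrics.roc_auc_score(y_test, y_test_prob[:, 1]))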
Using the Model for Prediction
After evaluating the model’s performance on the validation data, we can use it to make predictions on new data points that the model has not seen before. For example, suppose we have a new tumor with the following features:
# Create a new data point
new_data = pd.DataFrame([[15, 20, 100, 800, 0.1, 0.2, 0.15, 0.08, 0.18, 0.06,
0.5, 1, 3, 50, 0.01, 0.03, 0.04, 0.02, 0.02, 0.01,
17, 25, 110, 900, 0.12, 0.25, 0.2, 0.1, 0.2, 0.07]],
columns=data.feature_names)
We can use the predict and predict_proba methods to predict the class label and the probability of belonging to each class for this new data point. Because the model was trained on standardized features, we first transform the new data point with the same scaler that was fitted earlier.
# Scale the new data point with the scaler fitted on the original features
new_data_scaled = pd.DataFrame(scaler.transform(new_data), columns=data.feature_names)
# Predict the class label for the new data point
new_pred = model.predict(new_data_scaled)
# Predict the probabilities for the new data point
new_prob = model.predict_proba(new_data_scaled)
We can print the results and interpret them.
# Print the predicted class label
print(f"The predicted class label is {new_pred[0]}")
# Print the predicted probabilities
print(f"The predicted probabilities are {new_prob[0]}")
The first print shows the predicted class label, where 0 corresponds to a malignant tumor and 1 to a benign tumor in this dataset, and the second print shows the predicted probabilities for class 0 and class 1, in that order (the exact numbers depend on the fitted model). A probability close to 1 for one of the classes means that the model is confident that the new tumor has the characteristics of that class, based on the features it learned from the training data.
Conclusion
In this article, we learned how to build and evaluate a logistic regression model using scikit-learn and the breast cancer dataset. We covered the following steps:
- Data preparation: We loaded and explored the dataset, preprocessed the data, and split the data into training, validation, and test sets.
- Model building and training: We created and fitted a logistic regression model to the training data using the default settings of scikit-learn.
- Model evaluation: We calculated and plotted various metrics to measure the performance of the model on the validation data, such as accuracy, precision, recall, f1-score, and ROC curve. We also interpreted the results and concluded that the model had a good performance on the validation data.
- Prediction: We used the model to predict the class labels and probabilities for a new data point, and interpreted the results.
We hope that this article was helpful and informative for you to understand the basics of logistic regression and how to apply it to a binary classification problem using scikit-learn.
Also read: Linear Regression: A Comprehensive Guide – DataPro
Frequently Asked Questions
Why is it called logistic regression if it’s used for classification?
Logistic regression gets its name from the logistic (sigmoid) function it uses. The model fits a linear function to the log-odds of the outcome and passes it through the logistic function to obtain a probability, which is then thresholded to make a classification.
How do you interpret the coefficients of a logistic regression model?
The coefficients of a logistic regression model represent the change in the log-odds of the positive class for a unit change in the feature.
What is the difference between linear regression and logistic regression?
Linear regression predicts a continuous outcome, while logistic regression predicts a categorical outcome.
How do you evaluate the performance of a logistic regression model?
The performance of a logistic regression model can be evaluated using metrics such as accuracy, precision, recall, f1-score, and ROC curve.
How do you handle categorical features in logistic regression?
Categorical features in logistic regression need to be encoded into numerical values using methods such as label encoding, one-hot encoding, or ordinal encoding.
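For example, here is a minimal sketch of one-hot encoding a hypothetical color column with pandas; the column name and values are made up for illustration.
import pandas as pd
# Hypothetical dataframe with one categorical feature
df_cat = pd.DataFrame({"color": ["red", "green", "blue", "green"]})
# One-hot encode the column: each category becomes its own 0/1 indicator column
df_encoded = pd.get_dummies(df_cat, columns=["color"])
print(df_encoded)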