K-Nearest Neighbors (KNN) Machine Learning Algorithm in Python
Introduction
K-Nearest Neighbors (KNN) is a simple, yet powerful, supervised machine learning algorithm used for both classification and regression tasks. It assumes that similar things exist in close proximity. In other words, similar data points are near to each other.
How KNN Works
KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, then voting for the most frequent label (in classification) or averaging the labels (in regression).
Steps:
- Choose the number K of neighbors
- Calculate the distance between the query-instance and all the training samples
- Sort the distances and take the K training samples with the smallest distances as the nearest neighbors
- Gather the labels of the nearest neighbors
- Vote for the most frequent label (in case of classification) or average the labels (in case of regression); a minimal sketch of these steps follows this list
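These steps translate directly into a few lines of NumPy. The following is a minimal, from-scratch sketch of a KNN classifier using Euclidean distance and a majority vote; the function name knn_predict and the toy data are made up for illustration and are not part of any library.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    # Step 2: Euclidean distance between the query and every training sample
    distances = np.linalg.norm(X_train - query, axis=1)
    # Step 3: indices of the K smallest distances
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: gather the neighbors' labels and return the most frequent one
    # (for regression, replace the vote with y_train[nearest].mean())
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Toy example: two well-separated clusters of 2-D points
X_demo = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])
y_demo = np.array([0, 0, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([1.1, 1.0])))  # expected output: 0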
When to Use KNN
Advantages:
- Simple and easy to understand: KNN is intuitive and easy to implement.
- No training phase: KNN is a lazy learner; it simply stores the training data and defers all computation to prediction time.
- Effective on small datasets: KNN performs well when the dataset is small and the number of input variables is modest.
Disadvantages:
- Computationally expensive: Prediction becomes significantly slower as the number of examples and/or features grows, because distances must be computed to every stored training sample.
- Sensitive to irrelevant features: All features contribute to the distance equally.
- Memory-intensive: Storing all the training data can be costly.
Use Cases:
- Recommender Systems: Suggest items based on similarity to user preferences.
- Image Recognition: Classify images by comparing new images to the training set.
- Finance: Predict credit risk or fraud detection based on historical data.
KNN in Python
Required Libraries
To implement KNN, we need the following Python libraries:
numpy
pandas
scikit-learn
matplotlib (for visualization)
Installation
pip install numpy pandas scikit-learn matplotlib
Example Code
Let's implement a simple KNN classifier using the Iris dataset.
Step 1: Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
Step 2: Load Dataset
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
Step 3: Split Dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 4: Train KNN Model
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
Step 5: Make Predictions
y_pred = knn.predict(X_test)
Step 6: Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
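With the model trained, a single new observation can be classified directly. The values below are illustrative measurements in the Iris feature order (sepal length, sepal width, petal length, petal width, in cm), not drawn from the dataset itself.

# Hypothetical measurements for one new flower (illustrative values only)
new_sample = np.array([[5.1, 3.5, 1.4, 0.2]])
predicted_class = knn.predict(new_sample)[0]
print(f'Predicted species: {iris.target_names[predicted_class]}')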
Visualization (Optional)
# Plotting the decision boundary for visualization (only possible in 2-D),
# so we refit a classifier on just the first two features of the dataset
X_plot = X[:, :2]
knn_2d = KNeighborsClassifier(n_neighbors=3)
knn_2d.fit(X_plot, y)

h = .02  # step size in the mesh
# Color maps for the decision regions and the training points
cmap_light = plt.cm.RdYlBu
cmap_bold = plt.cm.RdYlBu

x_min, x_max = X_plot[:, 0].min() - 1, X_plot[:, 0].max() + 1
y_min, y_max = X_plot[:, 1].min() - 1, X_plot[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict the class of every mesh point and colour the regions accordingly
Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light, shading='auto')
# Plot also the training points
plt.scatter(X_plot[:, 0], X_plot[:, 1], c=y, edgecolor='k', cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = 3)")
plt.show()
Generalization and Considerations
- Choosing K: The choice of K is critical. Smaller values of K produce noisy, high-variance decision boundaries, while larger values smooth the boundary and can underfit the data; K is usually tuned by cross-validation, and odd values help avoid ties in binary voting.
- Feature Scaling: Since KNN relies on distance calculations, features should be scaled (standardized or normalized) to ensure that all features contribute equally to the distance computation.
- Distance Metrics: The choice of distance metric (Euclidean, Manhattan, Minkowski, etc.) can affect the performance of the algorithm; the sketch after this list shows one way to search over K and the metric.
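These considerations can be handled together in scikit-learn by putting a scaler and the classifier into a Pipeline and letting cross-validation choose K and the distance metric. The sketch below, which continues from the train/test split above, is one reasonable setup rather than the definitive one; the grid values are illustrative.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# Scale features before computing distances, then classify
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

# Illustrative grid: odd K values help avoid voting ties; two common metrics
param_grid = {
    'knn__n_neighbors': [1, 3, 5, 7, 9, 11],
    'knn__metric': ['euclidean', 'manhattan']
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
print('Best parameters:', search.best_params_)
print('Test accuracy:', search.score(X_test, y_test))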
In conclusion, KNN is a versatile and easy-to-implement algorithm suitable for various classification and regression tasks, particularly when working with small datasets and well-defined features. However, careful consideration should be given to the choice of K, feature scaling, and distance metrics to optimize its performance.