Mirror of https://github.com/animator/learn-python
Merge branch 'main' into Transformers
commit 43a50a593b

@ -0,0 +1,140 @@
# Ensemble Learning

Ensemble Learning is a powerful machine learning paradigm that combines multiple models to achieve better performance than any individual model. The idea is to leverage the strengths of different models to improve overall accuracy, robustness, and generalization.

## Introduction

Ensemble Learning is a technique that combines the predictions from multiple machine learning models to make more accurate and robust predictions than a single model. It leverages the diversity of different models to reduce errors and improve performance.

## Types of Ensemble Learning

### Bagging

Bagging, or Bootstrap Aggregating, involves training multiple versions of the same model on different subsets of the training data and averaging their predictions. The most common example of bagging is the `RandomForest` algorithm.
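
To make the idea concrete (separately from the Random Forest example later on this page), here is a minimal sketch of bagging an arbitrary base estimator with scikit-learn's `BaggingClassifier`. It assumes scikit-learn >= 1.2 for the `estimator` keyword, and the dataset and hyperparameters are chosen purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: each tree is trained on a bootstrap sample of the training data,
# and their predictions are combined by majority vote.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # `base_estimator` in scikit-learn < 1.2
    n_estimators=50,
    max_samples=0.8,   # fraction of the data drawn (with replacement) for each tree
    bootstrap=True,
    random_state=42,
)

scores = cross_val_score(bagging, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```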

### Boosting

Boosting focuses on training models sequentially, where each new model corrects the errors made by the previous ones. This way, the ensemble learns from its mistakes, leading to improved performance. `AdaBoost` and `Gradient Boosting` are popular examples of boosting algorithms.
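
For a concrete feel of boosting (the AdaBoost example further below covers the other variant mentioned here), a minimal `GradientBoostingClassifier` sketch might look like the following; the parameters are illustrative rather than tuned.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Boosting: trees are added one at a time, each one fitted to the
# errors of the ensemble built so far.
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,  # how strongly each new tree corrects the previous ones
    max_depth=2,        # shallow trees are typical weak learners
    random_state=42,
)

scores = cross_val_score(gb, X, y, cv=5)
print(f"Mean CV accuracy: {scores.mean():.3f}")
```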

### Stacking

Stacking involves training multiple models (the base learners) and a meta-model that combines their predictions. The base learners are trained on the original dataset, while the meta-model is trained on the outputs of the base learners. This approach allows leveraging the strengths of different models.

## Advantages and Disadvantages

### Advantages

- **Improved Accuracy**: Combines the strengths of multiple models.
- **Robustness**: Reduces the risk of overfitting and model bias.
- **Versatility**: Can be applied to various machine learning tasks, including classification and regression.

### Disadvantages

- **Complexity**: More complex than individual models, making interpretation harder.
- **Computational Cost**: Requires more computational resources and training time.
- **Implementation**: Can be challenging to implement and tune effectively.

## Key Concepts

- **Diversity**: The models in the ensemble should be diverse to benefit from their different strengths.
- **Voting/Averaging**: For classification, majority voting is used to combine predictions. For regression, averaging is used.
- **Weighting**: In some ensembles, models are weighted based on their accuracy or other metrics (see the sketch after this list).
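
To make the voting and weighting ideas concrete, here is a small sketch using scikit-learn's `VotingClassifier`; the choice of models and the weights are arbitrary and only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Soft voting averages predicted class probabilities; the optional weights
# let more trusted models contribute more to the combined prediction.
voting = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000)),
        ('knn', KNeighborsClassifier(n_neighbors=5)),
        ('tree', DecisionTreeClassifier(max_depth=3)),
    ],
    voting='soft',
    weights=[2, 1, 1],  # illustrative weights, e.g. based on validation accuracy
)

print(f"Mean CV accuracy: {cross_val_score(voting, X, y, cv=5).mean():.3f}")
```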

## Code Examples

### Bagging with Random Forest

Below is an example of using Random Forest for classification on the Iris dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize Random Forest model
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:\n", classification_report(y_test, y_pred))
```
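
As a brief follow-up to the fitted model above, a Random Forest also exposes a per-feature importance estimate through its `feature_importances_` attribute; this short snippet (reusing the `clf` and `iris` objects from the example) simply prints it.

```python
# Feature importances of the fitted Random Forest (averaged over its trees)
for name, importance in zip(iris.feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```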

### Boosting with AdaBoost

Below is an example of using AdaBoost for classification on the Iris dataset (it reuses the train/test split from the previous example).

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Initialize base model (a decision stump)
base_model = DecisionTreeClassifier(max_depth=1)

# Initialize AdaBoost model
# (the keyword is `estimator` in scikit-learn >= 1.2; older versions used `base_estimator`)
ada_clf = AdaBoostClassifier(estimator=base_model, n_estimators=50, random_state=42)

# Train the model
ada_clf.fit(X_train, y_train)

# Make predictions
y_pred = ada_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:\n", classification_report(y_test, y_pred))
```

### Stacking with Multiple Models

Below is an example of using stacking with multiple models for classification on the Iris dataset (again reusing the train/test split from the first example).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import StackingClassifier

# Define base models
base_models = [
    ('knn', KNeighborsClassifier(n_neighbors=5)),
    ('svc', SVC(kernel='linear', probability=True))
]

# Define meta-model
meta_model = LogisticRegression()

# Initialize Stacking model
stacking_clf = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)

# Train the model
stacking_clf.fit(X_train, y_train)

# Make predictions
y_pred = stacking_clf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:\n", classification_report(y_test, y_pred))
```

## Conclusion

Ensemble Learning is a powerful technique that combines multiple models to improve overall performance. By leveraging the strengths of different models, it provides better accuracy, robustness, and generalization. However, it comes with increased complexity and computational cost. Understanding and implementing ensemble methods can significantly enhance machine learning solutions.

@ -11,6 +11,7 @@
- [Introduction To Convolutional Neural Networks (CNNs)](intro-to-cnn.md)
- [TensorFlow.md](tensorflow.md)
- [PyTorch.md](pytorch.md)
- [Ensemble Learning](ensemble-learning.md)
- [Types of optimizers](types-of-optimizers.md)
- [Logistic Regression](logistic-regression.md)
- [Types_of_Cost_Functions](cost-functions.md)
@ -18,3 +19,4 @@
- [Hierarchical Clustering](hierarchical-clustering.md)
- [Grid Search](grid-search.md)
- [Transformers](transformers.md)
- [K-nearest neighbor (KNN)](knn.md)

@ -0,0 +1,122 @@
# K-Nearest Neighbors (KNN) Machine Learning Algorithm in Python

## Introduction

K-Nearest Neighbors (KNN) is a simple yet powerful supervised machine learning algorithm used for both classification and regression tasks. It is based on the assumption that similar data points exist in close proximity to one another.

## How KNN Works

KNN works by computing the distances between a query point and all the examples in the data, selecting the specified number of examples (K) closest to the query, and then voting for the most frequent label (in classification) or averaging the labels (in regression).

### Steps:

1. **Choose the number K of neighbors**
2. **Calculate the distance** between the query instance and all the training samples
3. **Sort the distances** and determine the nearest neighbors based on the K-th minimum distance
4. **Gather the labels** of the nearest neighbors
5. **Vote for the most frequent label** (in case of classification) or **average the labels** (in case of regression); a minimal from-scratch sketch of these steps follows this list
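
To make the steps above concrete, here is a minimal from-scratch sketch of a KNN classifier in plain NumPy (Euclidean distance, majority vote). It is illustrative only; the scikit-learn implementation used later on this page is what you would normally reach for.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    # Step 2: distances between the query and every training sample
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Step 3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Steps 4-5: gather their labels and vote for the most frequent one
    labels = y_train[nearest]
    return Counter(labels).most_common(1)[0][0]

# Tiny illustrative dataset: two features, two classes
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])
y_train = np.array([0, 0, 1, 1])

print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # expected: 0
```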

## When to Use KNN

### Advantages:

- **Simple and easy to understand:** KNN is intuitive and easy to implement.
- **No training phase:** KNN is a lazy learner, meaning there is no explicit training phase.
- **Effective with small datasets:** KNN can perform well on small datasets with a modest number of input variables.

### Disadvantages:

- **Computationally expensive:** The algorithm becomes significantly slower as the number of examples and/or features (independent variables) increases.
- **Sensitive to irrelevant features:** All features contribute equally to the distance.
- **Memory-intensive:** Storing all the training data can be costly.

### Use Cases:

- **Recommender Systems:** Suggest items based on similarity to user preferences.
- **Image Recognition:** Classify images by comparing new images to the training set.
- **Finance:** Predict credit risk or detect fraud based on historical data.

## KNN in Python

### Required Libraries

To implement KNN, we need the following Python libraries:

- `numpy`
- `pandas`
- `scikit-learn`
- `matplotlib` (for visualization)

### Installation

```bash
pip install numpy pandas scikit-learn matplotlib
```

### Example Code

Let's implement a simple KNN classifier using the Iris dataset.

#### Step 1: Import Libraries

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
```

#### Step 2: Load Dataset

```python
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data
y = iris.target
```

#### Step 3: Split Dataset

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```

#### Step 4: Train KNN Model

```python
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
```

#### Step 5: Make Predictions

```python
y_pred = knn.predict(X_test)
```

#### Step 6: Evaluate the Model

```python
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
```

### Visualization (Optional)

```python
# Plot the decision boundary (only possible in 2D, so we use two features)
h = 0.02  # step size in the mesh

# Create color maps
cmap_light = plt.cm.RdYlBu
cmap_bold = plt.cm.RdYlBu

# For simplicity, we take only the first two features of the dataset
# and fit a separate KNN model on them so a 2D boundary can be drawn
X_plot = X[:, :2]
knn_2d = KNeighborsClassifier(n_neighbors=3)
knn_2d.fit(X_plot, y)

x_min, x_max = X_plot[:, 0].min() - 1, X_plot[:, 0].max() + 1
y_min, y_max = X_plot[:, 1].min() - 1, X_plot[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict the class for every point in the mesh and draw the regions
Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(X_plot[:, 0], X_plot[:, 1], c=y, edgecolor='k', cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = 3)")
plt.show()
```

## Generalization and Considerations

- **Choosing K:** The choice of K is critical. Smaller values of K can lead to noisy models, while larger values make the algorithm computationally expensive and can oversimplify the model.
- **Feature Scaling:** Since KNN relies on distance calculations, features should be scaled (standardized or normalized) so that all features contribute equally to the distance computation; see the sketch after this list.
- **Distance Metrics:** The choice of distance metric (Euclidean, Manhattan, etc.) can affect the performance of the algorithm.
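
The scaling and K-selection points above can be handled together. Below is a sketch using a scikit-learn `Pipeline` with `StandardScaler` and a grid search over `n_neighbors` and the distance metric; the parameter grid is chosen purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scaling happens inside the pipeline, so each CV fold fits the scaler
# on its own training portion only
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier()),
])

param_grid = {
    'knn__n_neighbors': [1, 3, 5, 7, 9, 11],
    'knn__metric': ['euclidean', 'manhattan'],
}

search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Test accuracy: {search.score(X_test, y_test):.3f}")
```
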
In conclusion, KNN is a versatile and easy-to-implement algorithm suitable for various classification and regression tasks, particularly when working with small datasets and well-defined features. However, careful consideration should be given to the choice of K, feature scaling, and distance metrics to optimize its performance.