# Random Forest
Random Forest is a versatile machine learning algorithm capable of performing both regression and classification tasks. It is an ensemble method that operates by constructing a multitude of decision trees during training and outputting the average prediction of the individual trees (for regression) or the mode of the classes (for classification).
## Table of Contents
* Introduction
* How Random Forest Works
* Advantages and Disadvantages
* Hyperparameter Tuning
* Regression Example
* Conclusion
* References
## Introduction
Random Forest is an ensemble learning method used for classification and regression tasks. It is built from multiple decision trees and combines their outputs to improve the model's accuracy and control over-fitting.
## How Random Forest Works

### 1. Bootstrap Sampling
* Random subsets of the training dataset are created with replacement. Each subset is used to train an individual tree.

### 2. Decision Trees
* Multiple decision trees are trained on these subsets.

### 3. Feature Selection
* At each split in the decision tree, a random selection of features is chosen. This randomness helps create diverse trees.

### 4. Voting/Averaging

* For classification, the mode of the classes predicted by individual trees is taken (majority vote).
* For regression, the average of the outputs of the individual trees is taken.
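
To make the aggregation step concrete, the short sketch below reduces a set of hypothetical per-tree predictions by majority vote and by averaging. The prediction arrays are made-up values for illustration, not output from a real model.

```python
import numpy as np

# Hypothetical class predictions from 5 trees for 4 samples
class_preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 2, 2],
    [0, 2, 2, 1],
    [1, 1, 2, 1],
    [0, 1, 1, 1],
])

# Classification: majority vote per sample (most frequent label across trees)
majority_vote = np.array([np.bincount(col).argmax() for col in class_preds.T])
print(majority_vote)  # [0 1 2 1]

# Hypothetical continuous predictions from 3 trees for the same 4 samples
reg_preds = np.array([
    [2.1, 3.0, 5.2, 0.9],
    [1.9, 2.8, 5.0, 1.1],
    [2.0, 3.2, 4.8, 1.0],
])

# Regression: average of the tree outputs per sample
average_prediction = reg_preds.mean(axis=0)
print(average_prediction)  # approximately [2. 3. 5. 1.]
```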
### Detailed Working Mechanism

#### Step 1: Bootstrap Sampling

Each tree is trained on a random sample of the original data, drawn with replacement (bootstrap sample). This means some data points may appear multiple times in a sample while others may not appear at all.

#### Step 2: Tree Construction

Each node in the tree is split using the best split among a random subset of the features. This process adds an additional layer of randomness, contributing to the robustness of the model.

#### Step 3: Aggregation

For classification tasks, the final prediction is based on the majority vote from all the trees. For regression tasks, the final prediction is the average of all the tree predictions.
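
To make these three steps concrete, below is a minimal from-scratch sketch of the mechanism built on scikit-learn's DecisionTreeClassifier. The loop, variable names, and the choice of the Iris data are illustrative assumptions, not how the library implements Random Forest internally.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

n_trees = 25
trees = []

# Steps 1 and 2: fit each tree on a bootstrap sample, considering a random
# subset of features at every split (max_features='sqrt')
for i in range(n_trees):
    sample_idx = rng.integers(0, len(X), size=len(X))  # sampling with replacement
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    tree.fit(X[sample_idx], y[sample_idx])
    trees.append(tree)

# Step 3: aggregate by majority vote across all trees
all_preds = np.stack([tree.predict(X) for tree in trees])  # shape (n_trees, n_samples)
forest_pred = np.array([np.bincount(col).argmax() for col in all_preds.T])

print("Training accuracy of the hand-rolled forest:", (forest_pred == y).mean())
```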

## Advantages and Disadvantages

### Advantages
* Robustness: Reduces overfitting and generalizes well due to the law of large numbers.
* Accuracy: Often provides high accuracy because of the ensemble method.
* Versatility: Can be used for both classification and regression tasks.
* Handles Missing Values: Can handle missing data better than many other algorithms.
* Feature Importance: Provides estimates of feature importance, which can be valuable for understanding the model.

### Disadvantages
* Complexity: More complex than individual decision trees, making interpretation difficult.
* Computational Cost: Requires more computational resources due to multiple trees.
* Training Time: Can be slow to train compared to simpler models, especially with large datasets.

## Hyperparameters

### Key Hyperparameters
* n_estimators: The number of trees in the forest.
* max_features: The number of features to consider when looking for the best split.
* max_depth: The maximum depth of the tree.
* min_samples_split: The minimum number of samples required to split an internal node.
* min_samples_leaf: The minimum number of samples required to be at a leaf node.
* bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
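
The snippet below shows how the hyperparameters listed above map onto scikit-learn's RandomForestClassifier constructor. The specific values are illustrative placeholders, not recommended settings.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,       # number of trees in the forest
    max_features='sqrt',    # features considered when looking for the best split
    max_depth=10,           # maximum depth of each tree
    min_samples_split=2,    # minimum samples required to split an internal node
    min_samples_leaf=1,     # minimum samples required at a leaf node
    bootstrap=True,         # train each tree on a bootstrap sample
    random_state=42,
)
```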

### Tuning Hyperparameters
Hyperparameter tuning can significantly improve the performance of a Random Forest model. Common techniques include Grid Search and Random Search.
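
A Grid Search example appears in the Code Examples section below; Random Search can be sketched with scikit-learn's RandomizedSearchCV, as in the snippet here. The search space and iteration count are illustrative assumptions, not tuned values.

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space; the ranges are assumptions, not recommendations
param_distributions = {
    'n_estimators': randint(100, 500),
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,       # number of random configurations to sample
    cv=5,            # 5-fold cross-validation
    n_jobs=-1,
    random_state=42,
)
# search.fit(X_train, y_train) would then run the search on the training data
```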

## Code Examples

### Classification Example
Below is a simple example of using Random Forest for a classification task with the Iris dataset.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Random Forest classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict on the test set and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
print("Classification Report:\n", classification_report(y_test, y_pred))
```

### Feature Importance
Random Forest provides a way to measure the importance of each feature in making predictions.

```python
import matplotlib.pyplot as plt

# Get feature importances from the trained classifier
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Plot the importances in descending order
plt.figure()
plt.title("Feature Importances")
plt.bar(range(X.shape[1]), importances[indices], align='center')
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()
```

### Hyperparameter Tuning
Using Grid Search for hyperparameter tuning.

```python
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
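# Illustrative grid; the specific values here are assumptions, not tuned recommendations
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5, 10],
}

# Run an exhaustive search over the grid with 5-fold cross-validation
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=5,
    n_jobs=-1,
)
grid_search.fit(X_train, y_train)

print("Best parameters:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)
```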