From a32635c40089e1d82de3709cbc7e04a7050308ac Mon Sep 17 00:00:00 2001 From: Soubeer Koley Date: Mon, 27 May 2024 18:48:59 +0530 Subject: [PATCH 01/11] first --- contrib/machine-learning/Random_Forest.md | 175 ++++++++++++++++++++++ 1 file changed, 175 insertions(+) create mode 100644 contrib/machine-learning/Random_Forest.md diff --git a/contrib/machine-learning/Random_Forest.md b/contrib/machine-learning/Random_Forest.md new file mode 100644 index 0000000..3506a34 --- /dev/null +++ b/contrib/machine-learning/Random_Forest.md @@ -0,0 +1,175 @@ +Random Forest +Random Forest is a versatile machine learning algorithm capable of performing both regression and classification tasks. It is an ensemble method that operates by constructing a multitude of decision trees during training and outputting the average prediction of the individual trees (for regression) or the mode of the classes (for classification). + +Table of Contents +Introduction +How Random Forest Works +Advantages and Disadvantages +Hyperparameters +Code Examples +Classification Example +Feature Importance +Hyperparameter Tuning +Regression Example +Conclusion +References +Introduction +Random Forest is an ensemble learning method used for classification and regression tasks. It is built from multiple decision trees and combines their outputs to improve the model's accuracy and control over-fitting. + +How Random Forest Works +Bootstrap Sampling: +Random subsets of the training dataset are created with replacement. Each subset is used to train an individual tree. +Decision Trees: +Multiple decision trees are trained on these subsets. +Feature Selection: +At each split in the decision tree, a random selection of features is chosen. This randomness helps create diverse trees. +Voting/Averaging: +For classification, the mode of the classes predicted by individual trees is taken (majority vote). +For regression, the average of the outputs of the individual trees is taken. +Detailed Working Mechanism +Step 1: Bootstrap Sampling: Each tree is trained on a random sample of the original data, drawn with replacement (bootstrap sample). This means some data points may appear multiple times in a sample while others may not appear at all. +Step 2: Tree Construction: Each node in the tree is split using the best split among a random subset of the features. This process adds an additional layer of randomness, contributing to the robustness of the model. +Step 3: Aggregation: For classification tasks, the final prediction is based on the majority vote from all the trees. For regression tasks, the final prediction is the average of all the tree predictions. +Advantages and Disadvantages +Advantages +Robustness: Reduces overfitting and generalizes well due to the law of large numbers. +Accuracy: Often provides high accuracy because of the ensemble method. +Versatility: Can be used for both classification and regression tasks. +Handles Missing Values: Can handle missing data better than many other algorithms. +Feature Importance: Provides estimates of feature importance, which can be valuable for understanding the model. +Disadvantages +Complexity: More complex than individual decision trees, making interpretation difficult. +Computational Cost: Requires more computational resources due to multiple trees. +Training Time: Can be slow to train compared to simpler models, especially with large datasets. +Hyperparameters +Key Hyperparameters +n_estimators: The number of trees in the forest. 
+max_features: The number of features to consider when looking for the best split. +max_depth: The maximum depth of the tree. +min_samples_split: The minimum number of samples required to split an internal node. +min_samples_leaf: The minimum number of samples required to be at a leaf node. +bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. +Tuning Hyperparameters +Hyperparameter tuning can significantly improve the performance of a Random Forest model. Common techniques include Grid Search and Random Search. + +Code Examples +Classification Example +Below is a simple example of using Random Forest for a classification task with the Iris dataset. + +python +Copy code +import numpy as np +import pandas as pd +from sklearn.datasets import load_iris +from sklearn.ensemble import RandomForestClassifier +from sklearn.model_selection import train_test_split +from sklearn.metrics import accuracy_score, classification_report + +# Load dataset +iris = load_iris() +X, y = iris.data, iris.target + +# Split dataset +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) + +# Initialize Random Forest model +clf = RandomForestClassifier(n_estimators=100, random_state=42) + +# Train the model +clf.fit(X_train, y_train) + +# Make predictions +y_pred = clf.predict(X_test) + +# Evaluate the model +accuracy = accuracy_score(y_test, y_pred) +print(f"Accuracy: {accuracy * 100:.2f}%") +print("Classification Report:\n", classification_report(y_test, y_pred)) +Feature Importance +Random Forest provides a way to measure the importance of each feature in making predictions. + +python +Copy code +import matplotlib.pyplot as plt + +# Get feature importances +importances = clf.feature_importances_ +indices = np.argsort(importances)[::-1] + +# Print feature ranking +print("Feature ranking:") +for f in range(X.shape[1]): + print(f"{f + 1}. Feature {indices[f]} ({importances[indices[f]]})") + +# Plot the feature importances +plt.figure() +plt.title("Feature importances") +plt.bar(range(X.shape[1]), importances[indices], align='center') +plt.xticks(range(X.shape[1]), indices) +plt.xlim([-1, X.shape[1]]) +plt.show() +Hyperparameter Tuning +Using Grid Search for hyperparameter tuning. + +python +Copy code +from sklearn.model_selection import GridSearchCV + +# Define the parameter grid +param_grid = { + 'n_estimators': [100, 200, 300], + 'max_features': ['auto', 'sqrt', 'log2'], + 'max_depth': [4, 6, 8, 10, 12], + 'criterion': ['gini', 'entropy'] +} + +# Initialize the Grid Search model +grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2) + +# Fit the model +grid_search.fit(X_train, y_train) + +# Print the best parameters +print("Best parameters found: ", grid_search.best_params_) +Regression Example +Below is a simple example of using Random Forest for a regression task with the Boston housing dataset. 
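Note: `load_boston` was deprecated in scikit-learn 1.0 and removed in 1.2 over ethical concerns with the dataset, so the example below fails on current releases. A minimal drop-in sketch, assuming scikit-learn >= 1.2 and using the California housing data as a stand-in (not the dataset used in the original example):

```python
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

# Load the California housing data (fetched and cached on first use)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Mirror the split and model settings of the Boston example below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
regr = RandomForestRegressor(n_estimators=100, random_state=42)
regr.fit(X_train, y_train)

# Evaluate with the same metrics as the original example
y_pred = regr.predict(X_test)
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")
print(f"R^2 Score: {r2_score(y_test, y_pred):.2f}")
```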
+ +python +Copy code +import numpy as np +import pandas as pd +from sklearn.datasets import load_boston +from sklearn.ensemble import RandomForestRegressor +from sklearn.model_selection import train_test_split +from sklearn.metrics import mean_squared_error, r2_score + +# Load dataset +boston = load_boston() +X, y = boston.data, boston.target + +# Split dataset +X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) + +# Initialize Random Forest model +regr = RandomForestRegressor(n_estimators=100, random_state=42) + +# Train the model +regr.fit(X_train, y_train) + +# Make predictions +y_pred = regr.predict(X_test) + +# Evaluate the model +mse = mean_squared_error(y_test, y_pred) +r2 = r2_score(y_test, y_pred) +print(f"Mean Squared Error: {mse:.2f}") +print(f"R^2 Score: {r2:.2f}") +Conclusion +Random Forest is a powerful and flexible machine learning algorithm that can handle both classification and regression tasks. Its ability to create an ensemble of decision trees leads to robust and accurate models. However, it is important to be mindful of the computational cost associated with training multiple trees. + +References +Scikit-learn Random Forest Documentation +Wikipedia: Random Forest +Machine Learning Mastery: Introduction to Random Forest +Kaggle: Random Forest Guide +Towards Data Science: Understanding Random Forests \ No newline at end of file From 9e1a20de440dc20bd465e20fe2b0709241ac3595 Mon Sep 17 00:00:00 2001 From: Soubeer Koley Date: Mon, 27 May 2024 19:19:52 +0530 Subject: [PATCH 02/11] second --- contrib/machine-learning/Random_Forest.md | 97 ++++++++++++----------- 1 file changed, 52 insertions(+), 45 deletions(-) diff --git a/contrib/machine-learning/Random_Forest.md b/contrib/machine-learning/Random_Forest.md index 3506a34..8c62255 100644 --- a/contrib/machine-learning/Random_Forest.md +++ b/contrib/machine-learning/Random_Forest.md @@ -1,7 +1,8 @@ -Random Forest +# Random Forest + Random Forest is a versatile machine learning algorithm capable of performing both regression and classification tasks. It is an ensemble method that operates by constructing a multitude of decision trees during training and outputting the average prediction of the individual trees (for regression) or the mode of the classes (for classification). -Table of Contents +## Table of Contents Introduction How Random Forest Works Advantages and Disadvantages @@ -13,51 +14,54 @@ Hyperparameter Tuning Regression Example Conclusion References -Introduction + +## Introduction Random Forest is an ensemble learning method used for classification and regression tasks. It is built from multiple decision trees and combines their outputs to improve the model's accuracy and control over-fitting. -How Random Forest Works -Bootstrap Sampling: -Random subsets of the training dataset are created with replacement. Each subset is used to train an individual tree. -Decision Trees: -Multiple decision trees are trained on these subsets. -Feature Selection: -At each split in the decision tree, a random selection of features is chosen. This randomness helps create diverse trees. -Voting/Averaging: +## How Random Forest Works +### 1. Bootstrap Sampling: +* Random subsets of the training dataset are created with replacement. Each subset is used to train an individual tree. +### 2. Decision Trees: +* Multiple decision trees are trained on these subsets. +### 3. Feature Selection: +* At each split in the decision tree, a random selection of features is chosen. 
This randomness helps create diverse trees. +### 4. Voting/Averaging: For classification, the mode of the classes predicted by individual trees is taken (majority vote). For regression, the average of the outputs of the individual trees is taken. -Detailed Working Mechanism -Step 1: Bootstrap Sampling: Each tree is trained on a random sample of the original data, drawn with replacement (bootstrap sample). This means some data points may appear multiple times in a sample while others may not appear at all. -Step 2: Tree Construction: Each node in the tree is split using the best split among a random subset of the features. This process adds an additional layer of randomness, contributing to the robustness of the model. -Step 3: Aggregation: For classification tasks, the final prediction is based on the majority vote from all the trees. For regression tasks, the final prediction is the average of all the tree predictions. -Advantages and Disadvantages -Advantages -Robustness: Reduces overfitting and generalizes well due to the law of large numbers. -Accuracy: Often provides high accuracy because of the ensemble method. -Versatility: Can be used for both classification and regression tasks. -Handles Missing Values: Can handle missing data better than many other algorithms. -Feature Importance: Provides estimates of feature importance, which can be valuable for understanding the model. -Disadvantages -Complexity: More complex than individual decision trees, making interpretation difficult. -Computational Cost: Requires more computational resources due to multiple trees. -Training Time: Can be slow to train compared to simpler models, especially with large datasets. -Hyperparameters -Key Hyperparameters -n_estimators: The number of trees in the forest. -max_features: The number of features to consider when looking for the best split. -max_depth: The maximum depth of the tree. -min_samples_split: The minimum number of samples required to split an internal node. -min_samples_leaf: The minimum number of samples required to be at a leaf node. -bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. -Tuning Hyperparameters +### Detailed Working Mechanism +* #### Step 1: Bootstrap Sampling: + Each tree is trained on a random sample of the original data, drawn with replacement (bootstrap sample). This means some data points may appear multiple times in a sample while others may not appear at all. +* #### Step 2: Tree Construction: + Each node in the tree is split using the best split among a random subset of the features. This process adds an additional layer of randomness, contributing to the robustness of the model. +#### Step 3: Aggregation: + For classification tasks, the final prediction is based on the majority vote from all the trees. For regression tasks, the final prediction is the average of all the tree predictions. +### Advantages and Disadvantages +#### Advantages +* Robustness: Reduces overfitting and generalizes well due to the law of large numbers. +* Accuracy: Often provides high accuracy because of the ensemble method. +* Versatility: Can be used for both classification and regression tasks. +* Handles Missing Values: Can handle missing data better than many other algorithms. +* Feature Importance: Provides estimates of feature importance, which can be valuable for understanding the model. +#### Disadvantages +* Complexity: More complex than individual decision trees, making interpretation difficult. 
+* Computational Cost: Requires more computational resources due to multiple trees. +* Training Time: Can be slow to train compared to simpler models, especially with large datasets. +### Hyperparameters +#### Key Hyperparameters +* n_estimators: The number of trees in the forest. +* max_features: The number of features to consider when looking for the best split. +* max_depth: The maximum depth of the tree. +* min_samples_split: The minimum number of samples required to split an internal node. +* min_samples_leaf: The minimum number of samples required to be at a leaf node. +* bootstrap: Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree. +##### Tuning Hyperparameters Hyperparameter tuning can significantly improve the performance of a Random Forest model. Common techniques include Grid Search and Random Search. -Code Examples -Classification Example +### Code Examples +#### Classification Example Below is a simple example of using Random Forest for a classification task with the Iris dataset. -python -Copy code +''' import numpy as np import pandas as pd from sklearn.datasets import load_iris @@ -85,11 +89,14 @@ y_pred = clf.predict(X_test) accuracy = accuracy_score(y_test, y_pred) print(f"Accuracy: {accuracy * 100:.2f}%") print("Classification Report:\n", classification_report(y_test, y_pred)) -Feature Importance + +''' + +#### Feature Importance Random Forest provides a way to measure the importance of each feature in making predictions. -python -Copy code + +''' import matplotlib.pyplot as plt # Get feature importances @@ -108,11 +115,11 @@ plt.bar(range(X.shape[1]), importances[indices], align='center') plt.xticks(range(X.shape[1]), indices) plt.xlim([-1, X.shape[1]]) plt.show() -Hyperparameter Tuning +''' +#### Hyperparameter Tuning Using Grid Search for hyperparameter tuning. -python -Copy code +''' from sklearn.model_selection import GridSearchCV # Define the parameter grid From d138b0e6252f0cf28be7e014f8625352c4e90dd1 Mon Sep 17 00:00:00 2001 From: Soubeer Koley Date: Mon, 27 May 2024 19:23:51 +0530 Subject: [PATCH 03/11] third --- contrib/machine-learning/Random_Forest.md | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/contrib/machine-learning/Random_Forest.md b/contrib/machine-learning/Random_Forest.md index 8c62255..a3672d0 100644 --- a/contrib/machine-learning/Random_Forest.md +++ b/contrib/machine-learning/Random_Forest.md @@ -61,7 +61,7 @@ Hyperparameter tuning can significantly improve the performance of a Random Fore #### Classification Example Below is a simple example of using Random Forest for a classification task with the Iris dataset. -''' +``` import numpy as np import pandas as pd from sklearn.datasets import load_iris @@ -69,6 +69,7 @@ from sklearn.ensemble import RandomForestClassifier from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score, classification_report + # Load dataset iris = load_iris() X, y = iris.data, iris.target @@ -90,13 +91,13 @@ accuracy = accuracy_score(y_test, y_pred) print(f"Accuracy: {accuracy * 100:.2f}%") print("Classification Report:\n", classification_report(y_test, y_pred)) -''' +``` #### Feature Importance Random Forest provides a way to measure the importance of each feature in making predictions. 
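The built-in `feature_importances_` used below are impurity-based (mean decrease in impurity) and can overstate high-cardinality or continuous features. A hedged cross-check, assuming the fitted `clf`, `X_test`, `y_test`, and `iris` from the classification example above, is permutation importance, which measures how much the test score drops when each feature is shuffled:

```python
from sklearn.inspection import permutation_importance

# Shuffle one feature at a time and record the drop in test accuracy
result = permutation_importance(clf, X_test, y_test, n_repeats=10, random_state=42)

# Report features from most to least important
for i in result.importances_mean.argsort()[::-1]:
    print(f"{iris.feature_names[i]}: "
          f"{result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```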
-''' +``` import matplotlib.pyplot as plt # Get feature importances @@ -115,11 +116,11 @@ plt.bar(range(X.shape[1]), importances[indices], align='center') plt.xticks(range(X.shape[1]), indices) plt.xlim([-1, X.shape[1]]) plt.show() -''' +``` #### Hyperparameter Tuning Using Grid Search for hyperparameter tuning. -''' +``` from sklearn.model_selection import GridSearchCV # Define the parameter grid @@ -138,11 +139,11 @@ grid_search.fit(X_train, y_train) # Print the best parameters print("Best parameters found: ", grid_search.best_params_) -Regression Example +``` +#### Regression Example Below is a simple example of using Random Forest for a regression task with the Boston housing dataset. -python -Copy code +``` import numpy as np import pandas as pd from sklearn.datasets import load_boston @@ -171,10 +172,11 @@ mse = mean_squared_error(y_test, y_pred) r2 = r2_score(y_test, y_pred) print(f"Mean Squared Error: {mse:.2f}") print(f"R^2 Score: {r2:.2f}") -Conclusion +``` +### Conclusion Random Forest is a powerful and flexible machine learning algorithm that can handle both classification and regression tasks. Its ability to create an ensemble of decision trees leads to robust and accurate models. However, it is important to be mindful of the computational cost associated with training multiple trees. -References +### References Scikit-learn Random Forest Documentation Wikipedia: Random Forest Machine Learning Mastery: Introduction to Random Forest From 834aade79acb60eb6cfc8dae1731ac93f794c2b6 Mon Sep 17 00:00:00 2001 From: Soubeer Koley Date: Mon, 27 May 2024 19:41:22 +0530 Subject: [PATCH 04/11] third --- contrib/machine-learning/Random_Forest.md | 36 +++++++++++++++-------- 1 file changed, 24 insertions(+), 12 deletions(-) diff --git a/contrib/machine-learning/Random_Forest.md b/contrib/machine-learning/Random_Forest.md index a3672d0..d2d7f5c 100644 --- a/contrib/machine-learning/Random_Forest.md +++ b/contrib/machine-learning/Random_Forest.md @@ -2,18 +2,30 @@ Random Forest is a versatile machine learning algorithm capable of performing both regression and classification tasks. It is an ensemble method that operates by constructing a multitude of decision trees during training and outputting the average prediction of the individual trees (for regression) or the mode of the classes (for classification). -## Table of Contents -Introduction -How Random Forest Works -Advantages and Disadvantages -Hyperparameters -Code Examples -Classification Example -Feature Importance -Hyperparameter Tuning -Regression Example -Conclusion -References + +- [Random Forest](#random-forest) + - [Introduction](#introduction) + - [How Random Forest Works](#how-random-forest-works) + - [1. Bootstrap Sampling:](#1-bootstrap-sampling) + - [2. Decision Trees:](#2-decision-trees) + - [3. Feature Selection:](#3-feature-selection) + - [4. 
Voting/Averaging:](#4-votingaveraging) + - [Detailed Working Mechanism](#detailed-working-mechanism) + - [Step 3: Aggregation:](#step-3-aggregation) + - [Advantages and Disadvantages](#advantages-and-disadvantages) + - [Advantages](#advantages) + - [Disadvantages](#disadvantages) + - [Hyperparameters](#hyperparameters) + - [Key Hyperparameters](#key-hyperparameters) + - [Tuning Hyperparameters](#tuning-hyperparameters) + - [Code Examples](#code-examples) + - [Classification Example](#classification-example) + - [Feature Importance](#feature-importance) + - [Hyperparameter Tuning](#hyperparameter-tuning) + - [Regression Example](#regression-example) + - [Conclusion](#conclusion) + - [References](#references) + ## Introduction Random Forest is an ensemble learning method used for classification and regression tasks. It is built from multiple decision trees and combines their outputs to improve the model's accuracy and control over-fitting. From 0185122fb9ae4aada0feff6cd45f1e688644ad1c Mon Sep 17 00:00:00 2001 From: Soubeer Koley Date: Mon, 27 May 2024 19:48:33 +0530 Subject: [PATCH 05/11] fourth --- .vscode/settings.json | 2 ++ 1 file changed, 2 insertions(+) create mode 100644 .vscode/settings.json diff --git a/.vscode/settings.json b/.vscode/settings.json new file mode 100644 index 0000000..7a73a41 --- /dev/null +++ b/.vscode/settings.json @@ -0,0 +1,2 @@ +{ +} \ No newline at end of file From 91a2b6d4b81bc46098d6c5007a45b0d0df3b572e Mon Sep 17 00:00:00 2001 From: Soubeer Koley Date: Mon, 27 May 2024 21:22:06 +0530 Subject: [PATCH 06/11] Added Random Forest --- contrib/machine-learning/Random_Forest.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/contrib/machine-learning/Random_Forest.md b/contrib/machine-learning/Random_Forest.md index d2d7f5c..59c44ef 100644 --- a/contrib/machine-learning/Random_Forest.md +++ b/contrib/machine-learning/Random_Forest.md @@ -23,8 +23,8 @@ Random Forest is a versatile machine learning algorithm capable of performing bo - [Feature Importance](#feature-importance) - [Hyperparameter Tuning](#hyperparameter-tuning) - [Regression Example](#regression-example) - - [Conclusion](#conclusion) - - [References](#references) + - [Conclusion](#conclusion) + - [References](#references) ## Introduction @@ -185,10 +185,10 @@ r2 = r2_score(y_test, y_pred) print(f"Mean Squared Error: {mse:.2f}") print(f"R^2 Score: {r2:.2f}") ``` -### Conclusion +## Conclusion Random Forest is a powerful and flexible machine learning algorithm that can handle both classification and regression tasks. Its ability to create an ensemble of decision trees leads to robust and accurate models. However, it is important to be mindful of the computational cost associated with training multiple trees. 
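One practical mitigation of that cost: the trees are trained independently, so scikit-learn's `n_jobs` parameter can build and query them on all CPU cores in parallel. A minimal sketch:

```python
from sklearn.ensemble import RandomForestClassifier

# n_jobs=-1 parallelizes fit() and predict() across all available cores
clf_parallel = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
```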
-### References +## References Scikit-learn Random Forest Documentation Wikipedia: Random Forest Machine Learning Mastery: Introduction to Random Forest From bdbe355c485de10dfea943108434b756e624a889 Mon Sep 17 00:00:00 2001 From: Soubeer Koley Date: Fri, 31 May 2024 08:23:22 +0530 Subject: [PATCH 07/11] updated index --- contrib/machine-learning/index.md | 1 + 1 file changed, 1 insertion(+) diff --git a/contrib/machine-learning/index.md b/contrib/machine-learning/index.md index 94ca1e2..073bca9 100644 --- a/contrib/machine-learning/index.md +++ b/contrib/machine-learning/index.md @@ -9,3 +9,4 @@ - [TensorFlow.md](tensorFlow.md) - [PyTorch.md](pytorch.md) - [Types of optimizers](Types_of_optimizers.md) +- [Random Forest](Random_Forest.md) \ No newline at end of file From 33543fd13c21329a4656f360a96d29c6483e8c71 Mon Sep 17 00:00:00 2001 From: Ankit Mahato Date: Sun, 2 Jun 2024 03:15:14 +0530 Subject: [PATCH 08/11] Delete .vscode/settings.json --- .vscode/settings.json | 2 -- 1 file changed, 2 deletions(-) delete mode 100644 .vscode/settings.json diff --git a/.vscode/settings.json b/.vscode/settings.json deleted file mode 100644 index 7a73a41..0000000 --- a/.vscode/settings.json +++ /dev/null @@ -1,2 +0,0 @@ -{ -} \ No newline at end of file From 3c95166575d6e25c2544a0be7d8d8cae4ab9b45e Mon Sep 17 00:00:00 2001 From: Ankit Mahato Date: Sun, 2 Jun 2024 03:15:39 +0530 Subject: [PATCH 09/11] Rename Random_Forest.md to random-forest.md --- contrib/machine-learning/{Random_Forest.md => random-forest.md} | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) rename contrib/machine-learning/{Random_Forest.md => random-forest.md} (99%) diff --git a/contrib/machine-learning/Random_Forest.md b/contrib/machine-learning/random-forest.md similarity index 99% rename from contrib/machine-learning/Random_Forest.md rename to contrib/machine-learning/random-forest.md index 59c44ef..0abd1ab 100644 --- a/contrib/machine-learning/Random_Forest.md +++ b/contrib/machine-learning/random-forest.md @@ -193,4 +193,4 @@ Scikit-learn Random Forest Documentation Wikipedia: Random Forest Machine Learning Mastery: Introduction to Random Forest Kaggle: Random Forest Guide -Towards Data Science: Understanding Random Forests \ No newline at end of file +Towards Data Science: Understanding Random Forests From 86e7c0d80613c236d05b4cd0895f0ee6df7601cb Mon Sep 17 00:00:00 2001 From: Ankit Mahato Date: Sun, 2 Jun 2024 03:16:18 +0530 Subject: [PATCH 10/11] Update index.md --- contrib/machine-learning/index.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/contrib/machine-learning/index.md b/contrib/machine-learning/index.md index 073bca9..3b8c95b 100644 --- a/contrib/machine-learning/index.md +++ b/contrib/machine-learning/index.md @@ -9,4 +9,4 @@ - [TensorFlow.md](tensorFlow.md) - [PyTorch.md](pytorch.md) - [Types of optimizers](Types_of_optimizers.md) -- [Random Forest](Random_Forest.md) \ No newline at end of file +- [Random Forest](random-forest.md) From 0189c285cc983d2a19e306967bfe4d1f0dda43da Mon Sep 17 00:00:00 2001 From: Ankit Mahato Date: Sun, 2 Jun 2024 03:18:23 +0530 Subject: [PATCH 11/11] Update random-forest.md --- contrib/machine-learning/random-forest.md | 37 ++++------------------- 1 file changed, 6 insertions(+), 31 deletions(-) diff --git a/contrib/machine-learning/random-forest.md b/contrib/machine-learning/random-forest.md index 0abd1ab..feaaa7a 100644 --- a/contrib/machine-learning/random-forest.md +++ b/contrib/machine-learning/random-forest.md @@ -2,31 +2,6 @@ Random Forest is a versatile 
machine learning algorithm capable of performing both regression and classification tasks. It is an ensemble method that operates by constructing a multitude of decision trees during training and outputting the average prediction of the individual trees (for regression) or the mode of the classes (for classification). - -- [Random Forest](#random-forest) - - [Introduction](#introduction) - - [How Random Forest Works](#how-random-forest-works) - - [1. Bootstrap Sampling:](#1-bootstrap-sampling) - - [2. Decision Trees:](#2-decision-trees) - - [3. Feature Selection:](#3-feature-selection) - - [4. Voting/Averaging:](#4-votingaveraging) - - [Detailed Working Mechanism](#detailed-working-mechanism) - - [Step 3: Aggregation:](#step-3-aggregation) - - [Advantages and Disadvantages](#advantages-and-disadvantages) - - [Advantages](#advantages) - - [Disadvantages](#disadvantages) - - [Hyperparameters](#hyperparameters) - - [Key Hyperparameters](#key-hyperparameters) - - [Tuning Hyperparameters](#tuning-hyperparameters) - - [Code Examples](#code-examples) - - [Classification Example](#classification-example) - - [Feature Importance](#feature-importance) - - [Hyperparameter Tuning](#hyperparameter-tuning) - - [Regression Example](#regression-example) - - [Conclusion](#conclusion) - - [References](#references) - - ## Introduction Random Forest is an ensemble learning method used for classification and regression tasks. It is built from multiple decision trees and combines their outputs to improve the model's accuracy and control over-fitting. @@ -41,9 +16,9 @@ Random Forest is an ensemble learning method used for classification and regress For classification, the mode of the classes predicted by individual trees is taken (majority vote). For regression, the average of the outputs of the individual trees is taken. ### Detailed Working Mechanism -* #### Step 1: Bootstrap Sampling: +#### Step 1: Bootstrap Sampling: Each tree is trained on a random sample of the original data, drawn with replacement (bootstrap sample). This means some data points may appear multiple times in a sample while others may not appear at all. -* #### Step 2: Tree Construction: +#### Step 2: Tree Construction: Each node in the tree is split using the best split among a random subset of the features. This process adds an additional layer of randomness, contributing to the robustness of the model. #### Step 3: Aggregation: For classification tasks, the final prediction is based on the majority vote from all the trees. For regression tasks, the final prediction is the average of all the tree predictions. @@ -73,7 +48,7 @@ Hyperparameter tuning can significantly improve the performance of a Random Fore #### Classification Example Below is a simple example of using Random Forest for a classification task with the Iris dataset. -``` +```python import numpy as np import pandas as pd from sklearn.datasets import load_iris @@ -109,7 +84,7 @@ print("Classification Report:\n", classification_report(y_test, y_pred)) Random Forest provides a way to measure the importance of each feature in making predictions. -``` +```python import matplotlib.pyplot as plt # Get feature importances @@ -132,7 +107,7 @@ plt.show() #### Hyperparameter Tuning Using Grid Search for hyperparameter tuning. 
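Two caveats on the grid below: `'max_features': 'auto'` was deprecated in scikit-learn 1.1 and removed in 1.3 for random forests (use `'sqrt'`, `'log2'`, or `None` instead), and only Grid Search is shown even though Random Search is mentioned earlier as a common technique. A minimal Random Search sketch, assuming the same `clf`, `X_train`, and `y_train` from the classification example, uses `RandomizedSearchCV` to sample a fixed number of parameter combinations instead of trying them all:

```python
from sklearn.model_selection import RandomizedSearchCV

# Parameter space to sample from (lists are sampled uniformly)
param_dist = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [4, 6, 8, 10, 12],
    'criterion': ['gini', 'entropy']
}

# Try only n_iter=10 random combinations instead of the full 3*3*5*2 grid
random_search = RandomizedSearchCV(estimator=clf, param_distributions=param_dist,
                                   n_iter=10, cv=3, n_jobs=-1, random_state=42)
random_search.fit(X_train, y_train)
print("Best parameters found: ", random_search.best_params_)
```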
-``` +```python from sklearn.model_selection import GridSearchCV # Define the parameter grid @@ -155,7 +130,7 @@ print("Best parameters found: ", grid_search.best_params_) #### Regression Example Below is a simple example of using Random Forest for a regression task with the Boston housing dataset. -``` +```python import numpy as np import pandas as pd from sklearn.datasets import load_boston
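# NOTE: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2;
# on current releases, substitute another regression dataset, e.g.
# from sklearn.datasets import fetch_california_housing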