Merge remote-tracking branch 'origin/main'

pull/1311/head
Divyanshi Singh 2024-06-27 23:07:02 +05:30
parent 3f661bd4a3
commit bafd63c95b
3 changed files with 78 additions and 162 deletions

View file

@@ -1,161 +0,0 @@
# Naive Bayes classifier
A Naive Bayes classifier is a probabilistic machine learning model used for classification tasks. It is based on Bayes' Theorem and assumes that the features (or predictors) are independent of each other given the class. This assumption of independence is called the "naive" assumption, which rarely holds true in real-world data but simplifies the computations involved and often works well in practice.
It is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is assumed to be independent of each other given the class.
## Bayes Theorem
Bayes' Theorem gives the probability of an event occurring given that another event has already occurred. It is stated mathematically as the following equation:
$$ P(C|X) = \frac{P(X|C) \cdot P(C)}{P(X)} $$
where C and X are events and P(X) ≠ 0.
* We are trying to find the probability of event C, given that event X is true.
* Event X is also termed the evidence.
* P(C) is the prior probability of C, i.e. the probability before the evidence is seen.
* P(X) is the marginal probability, i.e. the probability of the evidence.
* P(C|X) is the posterior probability of C, i.e. the probability of the event after the evidence is seen.
* P(X|C) is the likelihood, i.e. the probability of observing the evidence given that the hypothesis C is true.
## Steps in Naive Bayes Classification
In the context of a Naive Bayes classifier, we are interested in finding the most probable class for a given instance X with features 𝑥1, 𝑥2, 𝑥3, ..., 𝑥n.
* Calculate Priors: Calculate the prior probabilities P(C) for each class C.
* Calculate Likelihoods: Calculate the likelihoods P(𝑥𝑖|C) for each feature 𝑥𝑖 given each class C.
* Apply Bayes' Theorem: Use Bayes' Theorem to compute the posterior probability for each class given the feature vector.
* Class Prediction: Predict the class with the highest posterior probability; this decision rule is written out below.
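Because the evidence P(X) is the same for every class, the last two steps reduce to the following decision rule once the naive independence assumption is applied to the likelihood:
$$
\hat{C} = \arg\max_{C} \; P(C) \prod_{i=1}^{n} P(x_i \mid C)
$$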
## Example on Naive Bayes Classifier
Consider a simple example where we want to classify emails as "spam" or "not spam" based on features like the presence of certain keywords.
#### Training Data:
| Email | Keyword1 | Keyword2 | Spam |
|--------|-------------|-------------|---------|
| 1 | Yes | No | Yes |
| 2 | Yes | Yes | Yes |
| 3 | No | Yes | No |
| 4 | Yes | No | No |
##### Calculating Priors:
1. P(Spam) = 2/4 = 0.5
2. P(Not Spam) = 2/4 = 0.5
##### Calculating Likelihoods:
1. For Spam:
* P(Keyword1=Yes | Spam) = 2/2 = 1
* P(Keyword2=Yes | Spam) = 1/2 = 0.5
2. For Not Spam:
* P(Keyword1=Yes | Not Spam) = 1/2 = 0.5
* P(Keyword2=Yes | Not Spam) = 1/2 = 0.5
##### Applying Bayes' Theorem
For a new email with Keyword1 = Yes and Keyword2 = Yes:
1. Calculate the posterior for Spam:
* P(Spam|Keywords) ∝ P(Keywords|Spam) * P(Spam)
* P(Spam|Keywords) ∝ 1.0 * 0.5 * 0.5 = 0.25
2. Calculate the posterior for Not Spam:
* P(Not Spam|Keywords) ∝ P(Keywords|Not Spam) * P(Not Spam)
* P(Not Spam|Keywords) ∝ 0.5 * 0.5 * 0.5 = 0.125
Since P(Spam|Keywords) > P(Not Spam|Keywords), we classify the new email as "Spam".
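The same arithmetic can be checked with a minimal Python sketch that counts directly from the training table above (no smoothing, mirroring the hand computation):
```python
# Training data from the table above: (Keyword1, Keyword2, Spam)
emails = [
    ("Yes", "No", "Yes"),
    ("Yes", "Yes", "Yes"),
    ("No", "Yes", "No"),
    ("Yes", "No", "No"),
]

def posterior(keyword1, keyword2, spam_label):
    # Prior: fraction of training emails with this spam label
    in_class = [e for e in emails if e[2] == spam_label]
    prior = len(in_class) / len(emails)
    # Likelihoods: fraction of emails in the class matching each keyword value
    p_k1 = sum(e[0] == keyword1 for e in in_class) / len(in_class)
    p_k2 = sum(e[1] == keyword2 for e in in_class) / len(in_class)
    return p_k1 * p_k2 * prior  # proportional to P(class | keywords)

print("Spam:    ", posterior("Yes", "Yes", "Yes"))  # 1.0 * 0.5 * 0.5 = 0.25
print("Not Spam:", posterior("Yes", "Yes", "No"))   # 0.5 * 0.5 * 0.5 = 0.125
```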
## Types of Naive Bayes Classifiers
#### 1. Gaussian Naive Bayes
In Gaussian Naive Bayes, continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution. A Gaussian distribution is also called a normal distribution; when plotted, it gives a bell-shaped curve that is symmetric about the mean of the feature values.
* Assumption: Each feature follows a Gaussian distribution.
* Formula: The likelihood of the features given the class is computed using the Gaussian (normal) distribution formula:
$$
P(x_k | C) = \frac{1}{\sqrt{2\pi\sigma_C^2}} \exp\left(-\frac{(x_k - \mu_C)^2}{2\sigma_C^2}\right)
$$
where 𝜇𝐶 and 𝜎𝐶 are the mean and standard deviation of the feature 𝑥𝑘 for class C.
* Python implementation of Gaussian Naive Bayes classifier using scikit-learn:
```python
# Gaussian Naive Bayes on the iris dataset
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load the iris dataset
iris = load_iris()
X = iris.data    # feature matrix
y = iris.target  # target vector

# Split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)

# Train the model on the training set
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = gnb.predict(X_test)

# Compare actual response values (y_test) with predicted values (y_pred)
print("Gaussian Naive Bayes model accuracy (in %):", metrics.accuracy_score(y_test, y_pred) * 100)
```
#### 2. Multinomial Naive Bayes
Feature vectors represent the frequencies with which certain events have been generated by a multinomial distribution.
Typically used for discrete features, especially for text (or document) classification problems like spam detection, where features represent word counts.
* Assumption: Features represent the number of times events (e.g., words) occur.
* Formula: The likelihood of the features given the class is computed using the multinomial distribution formula:
$$
P(x_i | C) = \frac{n(C, x_i) + \alpha}{N(C) + \alpha n}
$$
where n(C, 𝑥𝑖) is the count of feature 𝑥𝑖 in class C, N(C) is the total count of all features in class C, n is the number of features, and 𝛼 is a smoothing parameter (𝛼 = 1 gives Laplace smoothing).
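As an illustration of how this variant is typically applied to text, here is a minimal scikit-learn sketch on word-count features; the toy corpus and labels are made up for demonstration, and `alpha` corresponds to the smoothing parameter 𝛼 above:
```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: word counts are the multinomial features
texts = [
    "win money now", "free prize win", "limited offer win money",       # spam
    "meeting at noon", "project status update", "lunch with the team",  # not spam
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = not spam

# Convert the documents to a term-count matrix
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

# alpha is the smoothing parameter from the likelihood formula above
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

new_docs = ["free money offer", "status of the project"]
print(clf.predict(vectorizer.transform(new_docs)))  # expected: spam, not spam
```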
#### 3. Bernoulli Naive Bayes
In the multivariate Bernoulli event model, features are independent Booleans (binary variables) describing the inputs. Like the multinomial model, this model is popular for document classification tasks, where binary term-occurrence features (i.e. whether a word occurs in a document or not) are used rather than term frequencies (i.e. how often a word occurs in the document).
Used for binary/boolean features, where features represent binary occurrences (e.g., word presence/absence in text).
* Assumption: Features are binary (e.g., word presence/absence).
* Formula: The likelihood of the features given the class is computed using the Bernoulli distribution formula:
$$
P(x_i | C) = {P_{i,C}^{x_i}} (1 - P_{i, C})^{(1 - x_i)}
$$
where P_{i,C} is the probability of feature 𝑥𝑖 taking the value 1 in class C.
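A minimal sketch of Bernoulli Naive Bayes on binary presence/absence features (the feature matrix and labels below are illustrative; scikit-learn's `BernoulliNB` can also binarize count features itself via its `binarize` parameter):
```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Each row: is word 1 / word 2 / word 3 present in the document? (1 = present, 0 = absent)
X = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 0, 1],
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

clf = BernoulliNB()
clf.fit(X, y)

# Predict for a new document that contains only word 1
print(clf.predict([[1, 0, 0]]))
```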
## Advantages of Naive Bayes Classifier
* Easy to implement and computationally efficient.
* Effective in cases with a large number of features.
* Performs well even with limited training data.
* It performs well in the presence of categorical features.
* For numerical features, a simple normal-distribution assumption (as in Gaussian Naive Bayes) is often sufficient.
## Disadvantages of Naive Bayes Classifier
* Assumes that features are independent, which may not always hold in real-world data.
* Can be influenced by irrelevant attributes.
* May assign zero probability to unseen events, leading to poor generalization.
## Applications of Naive Bayes Classifier
* Spam Email Filtering: Classifies emails as spam or non-spam based on features.
* Text Classification: Used in sentiment analysis, document categorization, and topic classification.
* Medical Diagnosis: Helps in predicting the likelihood of a disease based on symptoms.
* Credit Scoring: Evaluates creditworthiness of individuals for loan approval.
* Weather Prediction: Classifies weather conditions based on various factors.
## Conclusion
In conclusion, Naive Bayes classifiers, despite their simplified assumptions, prove effective in various applications, showcasing notable performance in document classification and spam filtering. Their efficiency, speed, and ability to work with limited data make them valuable in real-world scenarios, compensating for their naive independence assumption.

View file

@@ -0,0 +1,77 @@
## TF-IDF (Term Frequency-Inverse Document Frequency)
### Introduction
TF-IDF stands for Term Frequency Inverse Document Frequency. It is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents (corpus). It is often used in information retrieval and text mining to identify the most significant words within a document.
### Terminologies
* Term Frequency (tf): A score for how frequently a word appears in the current document. In document d, it measures the number of instances of a given term t, so a term becomes more relevant the more often it appears in the text. Since the ordering of terms is not significant in the bag-of-words model, we can describe the text with a vector that holds one entry per term, whose value is the term frequency.
$$tf(t,d) = N(t) / t(d)$$
where,
N(t) = Number of times term t appears in document d
t(d) = Total number of terms in document d.
* Inverse Document Frequency (idf): A score for how rare a word is across documents; it measures how informative the term is. The key aim in retrieval is to locate the records that best match a query, and since tf treats all terms as equally significant, term frequency alone is not enough to weight a term within a document. To find idf we first need the document frequency df.
$$idf(t) = log(N/ df(t))$$
where,
df(t) = Number of documents containing term t
N = Total number of documents
* TF-IDF: The product of TF and IDF, providing a balanced measure that accounts for both the frequency of a term in a document and its rarity across the corpus. The tf-idf weight thus consists of two components: the normalized term frequency (tf) and the inverse document frequency (idf).
$$\text{TF-IDF}(t,d,D) = \text{TF}(t,d) \times \text{IDF}(t,D)$$
### Applications of TF-IDF
TF-IDF is widely used across different fields, including:
* Information Retrieval
* Content Tagging
* Text Mining and Analysis
* Text Similarity Measurement
* Document Clustering and Classification
* Natural Language Processing (NLP)
* Recommendation Systems
### Advantages
* Simple to implement.
* Useful for information retrieval.
* Effective in highlighting important words.
* Does not require labeled data.
* Versatile across various applications.
### Disadvantages
* Ignores word order and context.
* Does not capture semantic relationships.
* Not effective for polysemous words.
* Assumes independence of terms.
* Large vocabulary size can increase computational complexity.
### Working
###### Consider a simple example with three documents:
* Document 1: "the cat in the hat"
* Document 2: "the quick brown fox"
* Document 3: "the cat and the mouse"
###### Calculating TF-IDF for the term "cat":
1) TF (cat, Document 1):
* Term Frequency: 1 (appears once)
* Total Terms: 5
* TF: 1/5 = 0.2
2) IDF (cat, All Documents):
* Total Documents: 3
* Documents containing "cat": 2 (Document 1 and Document 3)
* IDF: log(3/2) ≈ 0.176
3) TF-IDF (cat, Document 1):
* TF-IDF: 0.2 × 0.176 = 0.0352
By calculating TF-IDF for all terms across all documents, we can identify the most significant words in each document and understand their importance relative to the entire corpus.
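The calculation above can be reproduced with a short Python sketch that follows the formulas in this article (base-10 logarithm, as in the worked example); note that library implementations such as scikit-learn's `TfidfVectorizer` use slightly different smoothing and normalization conventions, so their numbers will differ:
```python
import math

documents = {
    "doc1": "the cat in the hat",
    "doc2": "the quick brown fox",
    "doc3": "the cat and the mouse",
}

def tf(term, doc):
    # Number of occurrences of the term divided by the total number of terms
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    # log10(N / df), where df is the number of documents containing the term
    df = sum(term in doc.split() for doc in docs.values())
    return math.log10(len(docs) / df)

def tf_idf(term, doc_name, docs):
    return tf(term, docs[doc_name]) * idf(term, docs)

print(tf_idf("cat", "doc1", documents))  # 0.2 * log10(3/2) ≈ 0.035
```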
### Conclusion
TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used technique in text mining and information retrieval for identifying the importance of words in a document relative to a collection of documents. It effectively highlights significant terms by balancing term frequency within a document and the rarity of the term across the corpus.

View file

@@ -1,3 +1,3 @@
# List of sections
- [Naive Bayes Classifiers](Naive_Bayes_Classifiers.md)
- [Term Frequency-Inverse Document Frequency](Tf-IDF)