Merge branch 'main' into main

pull/826/head
Ashita Prasad 2024-06-08 10:30:09 +05:30 committed by GitHub
commit b1a9fc3a32
No known key found for this signature in database
GPG key ID: B5690EEEBB952194
44 changed files with 1980 additions and 14 deletions


@ -0,0 +1,87 @@
# Generators
## Introduction
Generators in Python are a sophisticated feature that enables the creation of iterators without the need to construct a full list in memory. They allow you to generate values on-the-fly, which is particularly beneficial for working with large datasets or infinite sequences. We will explore generators in depth, covering their types, mathematical formulation, advantages, disadvantages, and implementation examples.
## Function Generators
Function generators are created using the `yield` keyword within a function. When invoked, a function generator returns a generator iterator, allowing you to iterate over the values generated by the function.
### Mathematical Formulation
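The values a function generator yields can be described informally with set-builder-style notation (keeping in mind that a generator produces an ordered stream of values rather than a set). The general form is: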
```
{expression | variable in iterable, condition}
```
Where:
- `expression` is the expression to generate values.
- `variable` is the variable used in the expression.
- `iterable` is the sequence of values to iterate over.
- `condition` is an optional condition that filters the values.
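As a rough illustration of how the notation above maps onto code, the sketch below writes `{x**2 | x in range(10), x % 2 == 0}` both as a function generator and as the equivalent generator expression. The name `even_squares` and the chosen expression are hypothetical, picked only for this example.

```python
# Hypothetical example: squares of the even numbers below a limit.
# Set-builder form: {x**2 | x in range(limit), x % 2 == 0}
def even_squares(limit):
    for x in range(limit):      # "variable in iterable"
        if x % 2 == 0:          # "condition"
            yield x**2          # "expression"

# The same description written as a generator expression
as_function = list(even_squares(10))
as_expression = list(x**2 for x in range(10) if x % 2 == 0)

print(as_function)    # [0, 4, 16, 36, 64]
print(as_expression)  # [0, 4, 16, 36, 64]
```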
### Advantages of Function Generators
1. **Memory Efficiency**: Function generators produce values lazily, meaning they generate values only when needed, saving memory compared to constructing an entire sequence upfront.
2. **Lazy Evaluation**: Values are generated on-the-fly as they are consumed, leading to improved performance and reduced overhead, especially when dealing with large datasets.
3. **Infinite Sequences**: Function generators can represent infinite sequences, such as the Fibonacci sequence, allowing you to work with data streams of arbitrary length without consuming excessive memory.
### Disadvantages of Function Generators
1. **Single Iteration**: Once a function generator is exhausted, it cannot be reused. If you need to iterate over the sequence again, you'll have to create a new generator.
2. **Limited Random Access**: Function generators do not support random access like lists. They only allow sequential access, which might be a limitation depending on the use case.
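The single-iteration limitation can be seen in a small sketch. The helper `count_up_to` is a hypothetical generator written just for this illustration.

```python
def count_up_to(limit):
    """Yield the integers 1..limit (hypothetical helper for illustration)."""
    n = 1
    while n <= limit:
        yield n
        n += 1

gen = count_up_to(3)
first_pass = list(gen)              # consumes every value
second_pass = list(gen)             # the generator is already exhausted
fresh_pass = list(count_up_to(3))   # a new generator starts over

print(first_pass)   # [1, 2, 3]
print(second_pass)  # []
print(fresh_pass)   # [1, 2, 3]
```

Calling the generator function again is the usual way to iterate a second time; storing the values in a list is an alternative when the sequence is small enough to fit in memory.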
### Implementation Example
```python
def fibonacci():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Usage
fib_gen = fibonacci()
for _ in range(10):
    print(next(fib_gen))
```
## Generator Expressions
Generator expressions are similar to list comprehensions but return a generator object instead of a list. They offer a concise way to create generators without the need for a separate function.
### Mathematical Formulation
Generator expressions can also be represented mathematically using set-builder notation. The general form is the same as for function generators.
### Advantages of Generator Expressions
1. **Memory Efficiency**: Generator expressions produce values lazily, similar to function generators, resulting in memory savings.
2. **Lazy Evaluation**: Values are generated on-the-fly as they are consumed, providing improved performance and reduced overhead.
### Disadvantages of Generator Expressions
1. **Single Iteration**: Like function generators, once a generator expression is exhausted, it cannot be reused.
2. **Limited Random Access**: Generator expressions, similar to function generators, do not support random access.
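The trade-offs listed above can be observed directly. The following is a minimal sketch, assuming CPython; the exact byte counts reported by `sys.getsizeof` vary between versions, so only the relative sizes matter here.

```python
import itertools
import sys

n = 100_000
squares_list = [x**2 for x in range(n)]   # all values stored up front
squares_gen = (x**2 for x in range(n))    # values produced on demand

list_size = sys.getsizeof(squares_list)
gen_size = sys.getsizeof(squares_gen)
print(list_size > gen_size)  # the list object itself is far larger

# Generators do not support indexing
try:
    squares_gen[4]
    indexing_failed = False
except TypeError:
    indexing_failed = True

# Sequential access still works: skip four items, then take the next one
fifth = next(itertools.islice(squares_gen, 4, 5))
print(fifth)
```

Note that `itertools.islice` consumes the skipped items, so this is sequential access in disguise rather than true random access.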
### Implementation Example
```python
# Generate squares of numbers from 0 to 9
square_gen = (x**2 for x in range(10))

# Usage
for num in square_gen:
    print(num)
```
## Conclusion
Generators offer a powerful mechanism for creating iterators efficiently in Python. By understanding the differences between function generators and generator expressions, along with their mathematical formulation, advantages, and disadvantages, you can leverage them effectively in various scenarios. Whether you're dealing with large datasets or need to work with infinite sequences, generators provide a memory-efficient solution with lazy evaluation capabilities, contributing to more elegant and scalable code.


@ -1,10 +1,12 @@
# List of sections
- [OOPs](oops.md)
- [Decorators/\*args/**kwargs](decorator-kwargs-args.md)
- [Lambda Function](lambda-function.md)
- [Working with Dates & Times in Python](dates_and_times.md)
- [Regular Expressions in Python](regular_expressions.md)
- [JSON module](json-module.md)
- [Map Function](map-function.md)
- [Protocols](protocols.md)
- [Exception Handling in Python](exception-handling.md)
- [Generators](generators.md)


@ -0,0 +1,243 @@
# Protocols in Python
Python uses protocols to define informal interfaces that improve code structure, reusability, and type checking. Protocols can be adopted gradually and are more flexible than the interfaces of languages such as Java, which are strict contracts specifying the methods and attributes a class must implement.
>Before going into depth on this topic, let's cover a prerequisite: the `typing` module.
## Typing Module
The `typing` module in Python:
1. Provides classes, functions, and type aliases for describing types.
2. Allows adding type annotations to our code.
3. Enhances code readability.
4. Helps in catching errors early.
### Type Hints in Python:
Type hints allow you to specify the expected data types of variables, function parameters, and return values. This can improve code readability and help with debugging.
Here is a simple function that adds two numbers:
```python
def add(a, b):
    return a + b

print(add(10, 20))
```
>Output: 30
While this works fine, adding type hints makes the code more understandable and serves as documentation:
```python
def add(a: int, b: int) -> int:
    return a + b

print(add(1, 10))
```
>Output: 11
In this version, `a` and `b` are expected to be integers, and the function is expected to return an integer. This makes the function's purpose and usage clearer.
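It may help to see that these hints are metadata rather than runtime checks. The sketch below reuses the `add` function from above; the string arguments are chosen only to show that the interpreter does not reject them, while a static checker such as mypy would report the call.

```python
def add(a: int, b: int) -> int:
    return a + b

# The hints are stored on the function object
print(add.__annotations__)

# The interpreter does not enforce them: strings are concatenated
result = add("type", "hint")
print(result)  # typehint
```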
#### Let's see another example
The function below takes an iterable (a list, tuple, dict, set, frozenset, string, and so on) and prints its contents on a single line along with its type.
```python
from typing import Iterable, List, Set, Tuple

# Accepts any iterable; the annotation documents that expectation
def print_all(l: Iterable) -> None:
    print(type(l), end=' ')
    for i in l:
        print(i, end=' ')
    print()

l: List[int] = [1, 2, 3, 4, 5]
s: Set[int] = {1, 2, 3, 4, 5}
t: Tuple[int, ...] = (1, 2, 3, 4, 5)  # a tuple of any length holding ints

for iter_obj in [l, s, t]:
    print_all(iter_obj)
```
Output:
> <class 'list'> 1 2 3 4 5
> <class 'set'> 1 2 3 4 5
> <class 'tuple'> 1 2 3 4 5
Now let's try calling `print_all` with a non-iterable argument, an `int`.
```python
a = 10
print_all(a) # This will raise an error
```
Output:
>TypeError: 'int' object is not iterable
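This error occurs because `a` is an `int`, and the `int` class does not define `__iter__` (or any other method that would make it iterable). In other words, `int` does not conform to the `Iterable` protocol. Note that the error comes from the `for` loop at runtime; the type hint itself is not enforced by the interpreter, although a static checker such as mypy would flag the call before the code runs.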
**Benefits of Type Hints**
Using type hints helps in several ways:
1. **Error Detection**: Tools like mypy can catch type-related problems during development, decreasing runtime errors.
2. **Code Readability**: Type hints serve as documentation, making it easy to comprehend what data types are anticipated and returned.
3. **Improved Maintenance**: With unambiguous type expectations, maintaining and updating code becomes easier, especially in huge codebases.
Now that we understand type hints and the `typing` module, let's dive into protocols.
## Understanding Protocols
In Python, protocols define interfaces similar to Java interfaces. They let you specify methods and attributes that an object must implement without requiring inheritance from a base class. Protocols are part of the `typing` module and let static type checkers verify that your classes have the required structure, enhancing type safety and code clarity.
### What is a Protocol?
A protocol specifies one or more method signatures that a class must implement to be considered as conforming to the protocol.
This concept is known as "structural subtyping" (sometimes described as static duck typing): if an object implements the required methods and attributes, it can be treated as an instance of the protocol.
Let's write our own protocol:
```python
from typing import Protocol

# Define a Printable protocol
class Printable(Protocol):
    def print(self) -> None:
        """Print the object"""
        pass

# Book class implements the Printable protocol
class Book:
    def __init__(self, title: str):
        self.title = title

    def print(self) -> None:
        print(f"Book Title: {self.title}")

# print_object function takes a Printable object and calls its print method
def print_object(obj: Printable) -> None:
    obj.print()

book = Book("Python Programming")
print_object(book)
```
Output:
> Book Title: Python Programming
In this example:
1. **Printable Protocol:** Defines an interface with a single method print.
2. **Book Class:** Implements the Printable protocol by providing a print method.
3. **print_object Function:** Accepts any object that conforms to the Printable protocol and calls its print method.
We got this output because the `Book` class conforms to the `Printable` protocol.
Conversely, when you pass an object that does not conform to the `Printable` protocol to `print_object`, an error occurs because the object does not implement the required `print` method.
Let's see an example:
```python
class Team:
    def huddle(self) -> None:
        print("Team Huddle")

c = Team()
print_object(c)  # This will raise an error
```
Output:
>AttributeError: 'Team' object has no attribute 'print'
In this case:
- The `Team` class has a `huddle` method but does not have a `print` method.
- When `print_object` tries to call the `print` method on a `Team` instance, it raises an `AttributeError`.
> This is an important aspect of using protocols: combined with a static type checker, they let you verify that objects provide the necessary methods before the code runs, leading to more predictable and reliable code.
**Ensuring Protocol Conformance**
To avoid such errors, you need to ensure that any object passed to `print_object` implements the `Printable` protocol. Here's how you can modify the `Team` class to conform to the protocol:
```python
class Team:
    def __init__(self, name: str):
        self.name = name

    def huddle(self) -> None:
        print("Team Huddle")

    def print(self) -> None:
        print(f"Team Name: {self.name}")

c = Team("Dream Team")
print_object(c)
```
Output:
>Team Name: Dream Team
The `Team` class now implements the `print` method, so it conforms to the `Printable` protocol and the call no longer raises an error.
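If conformance needs to be checked at runtime rather than only by a static checker, one option is the `runtime_checkable` decorator from `typing`. The sketch below is illustrative: `safe_print` is a hypothetical helper, and `isinstance` against a runtime-checkable protocol only verifies that the required methods exist, not that their signatures match.

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Printable(Protocol):
    def print(self) -> None:
        ...

class Book:
    def __init__(self, title: str):
        self.title = title

    def print(self) -> None:
        print(f"Book Title: {self.title}")

class Team:
    def huddle(self) -> None:
        print("Team Huddle")

def safe_print(obj: object) -> bool:
    """Hypothetical helper: call print() only on conforming objects."""
    if isinstance(obj, Printable):
        obj.print()
        return True
    return False

print(safe_print(Book("Python Programming")))  # prints the title, then True
print(safe_print(Team()))                      # False, no AttributeError
```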
### Protocols and Inheritance:
Protocols can also be combined through inheritance to create more complex interfaces.
We can do that by following these steps:
**Step 1 - Base protocol**: Define a base protocol that specifies a common set of methods and attributes.
**Step 2 - Derived protocols**: Create derived protocols that extend the base protocol with additional requirements.
**Step 3 - Polymorphism**: Objects can then conform to multiple protocols, allowing for polymorphic behavior.
Let's see an example of this as well:
```python
from typing import Protocol

# Base protocol 1
class Printable(Protocol):
    def print(self) -> None:
        """Print the object"""
        pass

# Base protocol 2
class Serializable(Protocol):
    def serialize(self) -> str:
        """Return a string representation of the object"""
        ...

# Derived protocol: Protocol must be listed explicitly, otherwise
# this would be an ordinary class rather than a protocol
class PrintableAndSerializable(Printable, Serializable, Protocol):
    pass

# Class that implements both Printable and Serializable
class Book_serialize:
    def __init__(self, title: str):
        self.title = title

    def print(self) -> None:
        print(f"Book Title: {self.title}")

    def serialize(self) -> str:
        # Return a str so the signature matches the Serializable protocol
        return f"serialize: {self.title}"

# Function accepts any object that implements PrintableAndSerializable
def test(obj: PrintableAndSerializable) -> None:
    obj.print()
    print(obj.serialize())

book = Book_serialize("lean-in")
test(book)
```
Output:
> Book Title: lean-in
> serialize: lean-in
In this example:
**Printable Protocol:** Specifies a `print` method.
**Serializable Protocol:** Specifies a `serialize` method.
**PrintableAndSerializable Protocol:** Combines both `Printable` and `Serializable`.
**Book_serialize Class:** Implements both `print` and `serialize` methods, conforming to `PrintableAndSerializable`.
**test Function:** Accepts any object that implements the `PrintableAndSerializable` protocol.
If you pass an object that does not conform to the `PrintableAndSerializable` protocol to the `test` function, the call fails at runtime (and a static type checker flags it beforehand). Let's see an example:
```python
class Team:
    def huddle(self) -> None:
        print("Team Huddle")

c = Team()
test(c)  # This will raise an error
```
Output:
> AttributeError: 'Team' object has no attribute 'print'
In this case:
The `Team` class has a `huddle` method but does not implement `print` or `serialize` methods.
When test tries to call `print` and `serialize` on a `Team` instance, it raises an `AttributeError`.
**In Conclusion:**
>Python protocols offer a versatile and powerful means of defining interfaces, encouraging the decoupling of code, improving readability, and facilitating static type checking. They are particularly handy for scenarios involving file-like objects, bespoke containers, and any case where you wish to enforce certain behaviors without requiring inheritance from a specific base class. Ensuring that classes conform to protocols reduces runtime problems and makes your code more robust and maintainable.


@ -10,3 +10,6 @@
- [Greedy Algorithms](greedy-algorithms.md)
- [Dynamic Programming](dynamic-programming.md)
- [Linked list](linked-list.md)
- [Stacks in Python](stacks.md)
- [Sliding Window Technique](sliding-window.md)
- [Trie](trie.md)


@ -0,0 +1,249 @@
# Sliding Window Technique
The sliding window technique is a fundamental approach used to solve problems involving arrays, lists, or sequences. It's particularly useful when you need to calculate something over a subarray or sublist of fixed size that slides over the entire array.
Put simply, it often lets you replace a pair of nested loops with a single loop.
## Concept
The sliding window technique involves creating a window (a subarray or sublist) that moves or "slides" across the entire array. This window can either be fixed in size or dynamically resized. By maintaining and updating this window as it moves, you can optimize certain computations, reducing time complexity.
## Types of Sliding Windows
1. **Fixed Size Window**: The window size remains constant as it slides from the start to the end of the array.
2. **Variable Size Window**: The window size can change based on certain conditions, such as the sum of elements within the window meeting a specified target.
## Steps to Implement a Sliding Window
1. **Initialize the Window**: Set the initial position of the window and any required variables (like sum, count, etc.).
2. **Expand the Window**: Add the next element to the window and update the relevant variables.
3. **Shrink the Window**: If needed, remove elements from the start of the window and update the variables.
4. **Slide the Window**: Move the window one position to the right by including the next element and possibly excluding the first element.
5. **Repeat**: Continue expanding, shrinking, and sliding the window until you reach the end of the array.
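The steps above can be sketched for a fixed-size window and compared with the nested-loop approach they replace. Both function names below are hypothetical, chosen for this illustration; the brute-force version recomputes every window sum from scratch, while the sliding version updates the previous sum in constant time.

```python
def window_sums_bruteforce(arr, k):
    # Recompute each window from scratch: O(n * k)
    return [sum(arr[i:i + k]) for i in range(len(arr) - k + 1)]

def window_sums_sliding(arr, k):
    # Reuse the previous window's sum: O(n)
    if k <= 0 or k > len(arr):
        return []
    window_sum = sum(arr[:k])             # 1. initialize the window
    sums = [window_sum]
    for right in range(k, len(arr)):      # 4./5. slide until the end
        window_sum += arr[right]          # 2. expand with the next element
        window_sum -= arr[right - k]      # 3. shrink from the left
        sums.append(window_sum)
    return sums

data = [1, 3, 2, 5, 1]
print(window_sums_bruteforce(data, 3))  # [6, 10, 8]
print(window_sums_sliding(data, 3))     # [6, 10, 8]
```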
## Example Problems
### 1. Maximum Sum Subarray of Fixed Size K
Given an array of integers and an integer k, find the maximum sum of a subarray of size k.
**Steps:**
1. Initialize the sum of the first k elements.
2. Slide the window from the start of the array to the end, updating the sum by subtracting the element that is left behind and adding the new element.
3. Track the maximum sum encountered.
**Python Code:**
```python
def max_sum_subarray(arr, k):
    n = len(arr)
    if n < k:
        return None
    # Compute the sum of the first window
    window_sum = sum(arr[:k])
    max_sum = window_sum
    # Slide the window from start to end
    for i in range(n - k):
        window_sum = window_sum - arr[i] + arr[i + k]
        max_sum = max(max_sum, window_sum)
    return max_sum

# Example usage:
arr = [1, 3, 2, 5, 1, 1, 6, 2, 8, 5]
k = 3
print(max_sum_subarray(arr, k))  # Output: 16
```
### 2. Longest Substring Without Repeating Characters
Given a string, find the length of the longest substring without repeating characters.
**Steps:**
1. Use two pointers to represent the current window.
2. Use a set to track characters in the current window.
3. Expand the window by moving the right pointer.
4. If a duplicate character is found, shrink the window by moving the left pointer until the duplicate is removed.
**Python Code:**
```python
def longest_unique_substring(s):
    n = len(s)
    char_set = set()
    left = 0
    max_length = 0
    for right in range(n):
        while s[right] in char_set:
            char_set.remove(s[left])
            left += 1
        char_set.add(s[right])
        max_length = max(max_length, right - left + 1)
    return max_length

# Example usage:
s = "abcabcbb"
print(longest_unique_substring(s))  # Output: 3
```
### 3. Minimum Size Subarray Sum
Given an array of positive integers and a positive integer `s`, find the minimal length of a contiguous subarray of which the sum is at least `s`. If there isn't one, return 0 instead.
**Steps:**
1. Use two pointers, `left` and `right`, to define the current window.
2. Expand the window by moving `right` and adding `arr[right]` to `current_sum`.
3. If `current_sum` is greater than or equal to `s`, update `min_length` and shrink the window from the left by moving `left` and subtracting `arr[left]` from `current_sum`.
4. Repeat until `right` has traversed the array.
**Python Code:**
```python
def min_subarray_len(s, arr):
    n = len(arr)
    left = 0
    current_sum = 0
    min_length = float('inf')
    for right in range(n):
        current_sum += arr[right]
        while current_sum >= s:
            min_length = min(min_length, right - left + 1)
            current_sum -= arr[left]
            left += 1
    return min_length if min_length != float('inf') else 0

# Example usage:
arr = [2, 3, 1, 2, 4, 3]
s = 7
print(min_subarray_len(s, arr))  # Output: 2 (subarray [4, 3])
```
### 4. Longest Substring with At Most K Distinct Characters
Given a string `s` and an integer `k`, find the length of the longest substring that contains at most `k` distinct characters.
**Steps:**
1. Use two pointers, `left` and `right`, to define the current window.
2. Use a dictionary `char_count` to count characters in the window.
3. Expand the window by moving `right` and updating `char_count`.
4. If `char_count` has more than `k` distinct characters, shrink the window from the left by moving `left` and updating `char_count`.
5. Keep track of the maximum length of the window with at most `k` distinct characters.
**Python Code:**
```python
def longest_substring_k_distinct(s, k):
    n = len(s)
    char_count = {}
    left = 0
    max_length = 0
    for right in range(n):
        char_count[s[right]] = char_count.get(s[right], 0) + 1
        while len(char_count) > k:
            char_count[s[left]] -= 1
            if char_count[s[left]] == 0:
                del char_count[s[left]]
            left += 1
        max_length = max(max_length, right - left + 1)
    return max_length

# Example usage:
s = "eceba"
k = 2
print(longest_substring_k_distinct(s, k))  # Output: 3 (substring "ece")
```
### 5. Maximum Number of Vowels in a Substring of Given Length
Given a string `s` and an integer `k`, return the maximum number of vowel letters in any substring of `s` with length `k`.
**Steps:**
1. Use a sliding window of size `k`.
2. Keep track of the number of vowels in the current window.
3. Expand the window by adding the next character and update the count if it's a vowel.
4. If the window size exceeds `k`, remove the leftmost character and update the count if it's a vowel.
5. Track the maximum number of vowels found in any window of size `k`.
**Python Code:**
```python
def max_vowels(s, k):
    vowels = set('aeiou')
    max_vowel_count = 0
    current_vowel_count = 0
    for i in range(len(s)):
        if s[i] in vowels:
            current_vowel_count += 1
        if i >= k:
            if s[i - k] in vowels:
                current_vowel_count -= 1
        max_vowel_count = max(max_vowel_count, current_vowel_count)
    return max_vowel_count

# Example usage:
s = "abciiidef"
k = 3
print(max_vowels(s, k))  # Output: 3 (substring "iii")
```
### 6. Subarray Product Less Than K
Given an array of positive integers `nums` and an integer `k`, return the number of contiguous subarrays where the product of all the elements in the subarray is less than `k`.
**Steps:**
1. Use two pointers, `left` and `right`, to define the current window.
2. Expand the window by moving `right` and multiplying `product` by `nums[right]`.
3. If `product` is greater than or equal to `k`, shrink the window from the left by moving `left` and dividing `product` by `nums[left]`.
4. For each position of `right`, the number of valid subarrays ending at `right` is `right - left + 1`.
5. Sum these counts to get the total number of subarrays with product less than `k`.
**Python Code:**
```python
def num_subarray_product_less_than_k(nums, k):
    if k <= 1:
        return 0
    product = 1
    left = 0
    count = 0
    for right in range(len(nums)):
        product *= nums[right]
        while product >= k:
            # Integer division keeps the product an exact int
            product //= nums[left]
            left += 1
        count += right - left + 1
    return count

# Example usage:
nums = [10, 5, 2, 6]
k = 100
print(num_subarray_product_less_than_k(nums, k))  # Output: 8
```
## Advantages
- **Efficiency**: Reduces the time complexity from O(n^2) to O(n) for many problems.
- **Simplicity**: Provides a straightforward way to manage subarrays/substrings with overlapping elements.
## Applications
- Finding the maximum or minimum sum of subarrays of fixed size.
- Detecting unique elements in a sequence.
- Solving problems related to dynamic programming with fixed constraints.
- Efficiently managing and processing streaming data or real-time analytics.
By using the sliding window technique, you can tackle a wide range of problems in a more efficient manner.


@ -0,0 +1,116 @@
# Stacks in Python
In Data Structures and Algorithms, a stack is a linear data structure that follows the Last In, First Out (LIFO) rule. It relies on two fundamental operations: **PUSH**, which inserts an element on top of the stack, and **POP**, which removes the topmost element. This concept is similar to a stack of plates in a cafeteria. Stacks are commonly used for handling function calls, expression evaluation, and parsing. They are also efficient for managing memory and tracking program state.
## Points to be Remembered
- A stack is a collection of data items that can be accessed at only one end, called **TOP**.
- Items can be inserted and deleted in a stack only at the TOP.
- The last item inserted in a stack is the first one to be deleted.
- Therefore, a stack is called a **Last-In-First-Out (LIFO)** data structure.
## Real Life Examples of Stacks
- **PILE OF BOOKS** - Suppose a set of books are placed one over the other in a pile. When you remove books from the pile, the topmost book will be removed first. Similarly, when you have to add a book to the pile, the book will be placed at the top of the pile.
- **PILE OF PLATES** - The first plate begins the pile. The second plate is placed on the top of the first plate and the third plate is placed on the top of the second plate, and so on. In general, if you want to add a plate to the pile, you can keep it on the top of the pile. Similarly, if you want to remove a plate, you can remove the plate from the top of the pile.
- **BANGLES IN A HAND** - When a person wears bangles, the last bangle worn is the first one to be removed.
## Applications of Stacks
Stacks are widely used in Computer Science:
- Function call management
- Maintaining the UNDO list for the application
- Web browser *history management*
- Evaluating expressions
- Checking the nesting of parentheses in an expression
- Backtracking algorithms (Recursion)
Understanding these applications is essential for Software Development.
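As one concrete illustration of these applications, checking the nesting of parentheses can be sketched with a plain Python list used as a stack. The function name `is_balanced` and the set of bracket characters it handles are assumptions made for this example.

```python
def is_balanced(expression):
    """Return True if every bracket in the expression is properly nested."""
    pairs = {')': '(', ']': '[', '}': '{'}
    openers = set(pairs.values())
    stack = []
    for ch in expression:
        if ch in openers:
            stack.append(ch)                  # PUSH an opening bracket
        elif ch in pairs:
            # A closing bracket must match the most recent opener (POP)
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack                          # leftovers mean unclosed brackets

print(is_balanced("(a + [b * c])"))  # True
print(is_balanced("(a + b]"))        # False
print(is_balanced("((a)"))           # False
```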
## Operations on a Stack
Key operations on a stack include:
- **PUSH** - The process of inserting a new element on the top of a stack.
- **OVERFLOW** - The situation that arises when an item is pushed onto a stack that is full.
- **POP** - The process of deleting an element from the top of a stack.
- **UNDERFLOW** - The situation that arises when an item is popped from an empty stack.
- **PEEK** - The process of reading the most recent value of the stack *(i.e. the value at the top of the stack)* without removing it.
- **isEMPTY** - A function that returns true if the stack is empty, otherwise false.
- **SHOW** - Displays the stack items.
## Implementing Stacks in Python
```python
def isEmpty(S):
    if len(S) == 0:
        return True
    else:
        return False

def Push(S, item):
    S.append(item)

def Pop(S):
    if isEmpty(S):
        return "Underflow"
    else:
        val = S.pop()
        return val

def Peek(S):
    if isEmpty(S):
        return "Underflow"
    else:
        top = len(S) - 1
        return S[top]

def Show(S):
    if isEmpty(S):
        print("Sorry, No items in Stack")
    else:
        print("(Top)", end=' ')
        t = len(S) - 1
        while t >= 0:
            print(S[t], "<", end=' ')
            t -= 1
        print()

stack = []  # initially stack is empty

Push(stack, 5)
Push(stack, 10)
Push(stack, 15)
print("Stack after Push operations:")
Show(stack)
print("Peek operation:", Peek(stack))
print("Pop operation:", Pop(stack))
print("Stack after Pop operation:")
Show(stack)
```
## Output
```markdown
Stack after Push operations:
(Top) 15 < 10 < 5 <
Peek operation: 15
Pop operation: 15
Stack after Pop operation:
(Top) 10 < 5 <
```
## Complexity Analysis
- **Worst case**: `O(n)`. This occurs with the `Show` operation, which visits every one of the `n` items in the stack.
- **Best case**: `O(1)`. The `isEmpty`, `Push`, `Pop` and `Peek` operations touch only the top of the stack, so they run in constant time (`Push` relies on `list.append`, which is amortized `O(1)`).
- **Average case**: Depends on the mix of operations. Each operation costs `O(1)` as long as `Show` is not called, and each call to `Show` costs `O(n)`.


@ -0,0 +1,152 @@
# Trie
A Trie is a tree-like data structure used for storing a dynamic set of strings. It is also known as a prefix tree or digital tree.
>Trie is a type of search tree, where each node represents a single character of a string.
>Nodes are linked in such a way that they form a tree, where each path from the root to a leaf node represents a unique string stored in the Trie.
## Characteristics of Trie
- **Prefix Matching**: Tries are particularly useful for prefix matching operations. Any node in the Trie represents a common prefix of all strings below it.
- **Space Efficiency**: Tries can be more space-efficient than other data structures like hash tables for storing large sets of strings with common prefixes.
- **Time Complexity**: Insertion, deletion, and search operations in a Trie have a time complexity of `O(m)`, where `m` is the length of the string. This makes Tries very efficient for these operations.
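Prefix matching can be sketched in a few lines. The classes `PrefixNode` and `PrefixTrie` and the method `starts_with` are hypothetical names used only for this illustration, and the sketch assumes all words consist of lowercase letters `a` to `z`.

```python
class PrefixNode:
    def __init__(self):
        self.children = [None] * 26   # one slot per lowercase letter
        self.end_of_word = 0          # how many inserted words end here

class PrefixTrie:
    def __init__(self):
        self.root = PrefixNode()

    def insert(self, word):
        node = self.root
        for char in word:
            index = ord(char) - ord('a')
            if node.children[index] is None:
                node.children[index] = PrefixNode()
            node = node.children[index]
        node.end_of_word += 1

    def starts_with(self, prefix):
        """Return True if any stored word begins with the prefix: O(m)."""
        node = self.root
        for char in prefix:
            index = ord(char) - ord('a')
            if node.children[index] is None:
                return False
            node = node.children[index]
        return True

trie = PrefixTrie()
for word in ("apple", "app", "bat"):
    trie.insert(word)

print(trie.starts_with("ap"))   # True
print(trie.starts_with("ba"))   # True
print(trie.starts_with("cat"))  # False
```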
## Structure of Trie
Trie mainly consists of three parts:
- **Root**: The root of a Trie is an empty node that does not contain any character.
- **Edges**: Each edge in the Trie represents a character in the alphabet of the stored strings.
- **Nodes**: Each node contains a character and possibly additional information, such as a boolean flag indicating if the node represents the end of a valid string.
To implement the nodes of a Trie, we use classes in Python. Each node is an object of the `Node` class.
The `Node` class has two main components:
- *Array of size 26*: Represents the 26 lowercase letters. Initially every entry is `None`; as words are inserted, entries are filled with child `Node` objects.
- *End of word*: A counter that marks the end of a word; it records how many inserted words end at this node.
Code block of the `Node` class:
```python
class Node:
    def __init__(self):
        self.alphabets = [None] * 26
        self.end_of_word = 0
```
Next we implement the Trie itself. We create another class named `Trie` with methods for insertion, searching, and deletion.
**Initialization:** The constructor initializes the Trie with a `root` node.
Code Implementation of Initialization:
```python
class Trie:
    def __init__(self):
        self.root = Node()
```
## Operations on Trie
1. **Insertion**: Inserts the word into the Trie. This method takes `word` as parameter. For each character in the word, it checks if there is a corresponding child node. If not, it creates a new `Node`. After processing all the characters in word, it increments the `end_of_word` value of the last node.
Code Implementation of Insertion:
```python
def insert(self, word):
    node = self.root
    for char in word:
        index = ord(char) - ord('a')
        if not node.alphabets[index]:
            node.alphabets[index] = Node()
        node = node.alphabets[index]
    node.end_of_word += 1
```
2. **Searching**: Searches for the `word` in the Trie. The search starts from the `root` node and processes each character of the `word`. After traversing the whole word, it returns the number of times the word was inserted.
There are two cases in searching:
- *Word not found*: The word is not present in the Trie. This happens if the `alphabets` entry for some character along the path is `None`, or if `end_of_word` is `0` at the node reached after traversing the whole word.
- *Word found*: The word is present in the Trie. This happens when `end_of_word` is greater than `0` at the node reached after traversing the whole word.
Code Implementation of Searching:
```python
def Search(self, word):
    node = self.root
    for char in word:
        index = ord(char) - ord('a')
        if not node.alphabets[index]:
            return 0
        node = node.alphabets[index]
    return node.end_of_word
```
3. **Deletion**: To delete a string, follow the path of the string. If a character along the path is missing, the word is not in the Trie and nothing is changed. If the end node is reached and `end_of_word` is greater than `0`, decrement the value.
Code Implementation of Deletion:
```python
def delete(self, word):
    node = self.root
    for char in word:
        index = ord(char) - ord('a')
        node = node.alphabets[index]
        if node is None:
            # The word is not stored in the Trie; nothing to delete
            return
    if node.end_of_word:
        node.end_of_word -= 1
```
Python Code to implement Trie:
```python
class Node:
    def __init__(self):
        self.alphabets = [None] * 26
        self.end_of_word = 0

class Trie:
    def __init__(self):
        self.root = Node()

    def insert(self, word):
        node = self.root
        for char in word:
            index = ord(char) - ord('a')
            if not node.alphabets[index]:
                node.alphabets[index] = Node()
            node = node.alphabets[index]
        node.end_of_word += 1

    def Search(self, word):
        node = self.root
        for char in word:
            index = ord(char) - ord('a')
            if not node.alphabets[index]:
                return 0
            node = node.alphabets[index]
        return node.end_of_word

    def delete(self, word):
        node = self.root
        for char in word:
            index = ord(char) - ord('a')
            node = node.alphabets[index]
            if node is None:
                # The word is not stored in the Trie; nothing to delete
                return
        if node.end_of_word:
            node.end_of_word -= 1

if __name__ == "__main__":
    trie = Trie()
    word1 = "apple"
    word2 = "app"
    word3 = "bat"
    trie.insert(word1)
    trie.insert(word2)
    trie.insert(word3)
    print(trie.Search(word1))
    print(trie.Search(word2))
    print(trie.Search(word3))
    trie.delete(word2)
    print(trie.Search(word2))
```


@ -0,0 +1,235 @@
# Cost Functions in Machine Learning
Cost functions, also known as loss functions, play a crucial role in training machine learning models. They measure how well the model performs on the training data by quantifying the difference between predicted and actual values. Different types of cost functions are used depending on the problem domain and the nature of the data.
## Types of Cost Functions
### 1. Mean Squared Error (MSE)
**Explanation:**
MSE is one of the most commonly used cost functions, particularly in regression problems. It calculates the average squared difference between the predicted and actual values.
**Mathematical Formulation:**
The MSE is defined as:
$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Where:
- `n` is the number of samples.
- $y_i$ is the actual value.
- $\hat{y}_i$ is the predicted value.
**Advantages:**
- Penalizes large errors heavily because the errors are squared.
- Differentiable and convex, facilitating optimization.
**Disadvantages:**
- Sensitive to outliers, as the squared term amplifies their impact.
**Python Implementation:**
```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    # np.mean already divides by the number of samples n
    return np.mean((y_true - y_pred) ** 2)
```
### 2. Mean Absolute Error (MAE)
**Explanation:**
MAE is another commonly used cost function for regression tasks. It measures the average absolute difference between predicted and actual values.
**Mathematical Formulation:**
The MAE is defined as:
$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|$$
Where:
- `n` is the number of samples.
- $y_i$ is the actual value.
- $\hat{y}_i$ is the predicted value.
**Advantages:**
- Less sensitive to outliers compared to MSE.
- Provides a linear error term, which can be easier to interpret.
**Disadvantages:**
- Not differentiable at zero, which can complicate optimization.
**Python Implementation:**
```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    # np.mean already divides by the number of samples n
    return np.mean(np.abs(y_true - y_pred))
```
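To make the difference in outlier sensitivity concrete, here is a minimal sketch that evaluates both losses on a small, made-up set of values (the arrays below are illustrative assumptions, not data from this guide). The second set of predictions differs only in one badly wrong value.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical values chosen only for illustration
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_clean = np.array([2.5, 5.0, 3.0, 8.0])     # every error is small
y_outlier = np.array([2.5, 5.0, 3.0, 17.0])  # last prediction is off by 10

mse_clean, mae_clean = mse(y_true, y_clean), mae(y_true, y_clean)
mse_outlier, mae_outlier = mse(y_true, y_outlier), mae(y_true, y_outlier)

print(f"MSE: {mse_clean:.3f} -> {mse_outlier:.3f}")
print(f"MAE: {mae_clean:.3f} -> {mae_outlier:.3f}")
```

A single outlier multiplies the MSE by a far larger factor than the MAE, because the error of 10 enters the MSE as 100.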
### 3. Cross-Entropy Loss (Binary)
**Explanation:**
Cross-entropy loss is commonly used in binary classification problems. It measures the dissimilarity between the true and predicted probability distributions.
**Mathematical Formulation:**
For binary classification, the cross-entropy loss is defined as:
$$\text{Cross-Entropy} = -\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]$$
Where:
- `n` is the number of samples.
- $y_i$ is the actual class label (0 or 1).
- $\hat{y}_i$ is the predicted probability of the positive class.
**Advantages:**
- Penalizes confident wrong predictions heavily.
- Suitable for probabilistic outputs.
**Disadvantages:**
- Sensitive to class imbalance.
**Python Implementation:**
```python
import numpy as np

def binary_cross_entropy(y_true, y_pred):
    # y_pred must lie strictly between 0 and 1, otherwise np.log returns -inf
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```
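The formula involves `log(0)` whenever a predicted probability is exactly 0 or 1. One common workaround, sketched below, is to clip the predictions into a small open interval first. The function name `binary_cross_entropy_clipped` and the `eps` parameter are assumptions introduced for this illustration.

```python
import numpy as np

def binary_cross_entropy_clipped(y_true, y_pred, eps=1e-12):
    y_true = np.asarray(y_true, dtype=float)
    # Keep probabilities inside [eps, 1 - eps] so that np.log never sees 0
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

# Hypothetical labels and predictions; the first two predictions are exactly 1 and 0
y_true = np.array([1, 0, 1, 0])
y_pred = np.array([1.0, 0.0, 0.8, 0.3])

loss = binary_cross_entropy_clipped(y_true, y_pred)
print(loss)
```

The value of `eps` is a judgment call: it should be small enough not to distort ordinary predictions, yet large enough to keep the logarithm finite.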
### 4. Cross-Entropy Loss (Multiclass)
**Explanation:**
For multiclass classification problems, the cross-entropy loss is adapted to handle multiple classes.
**Mathematical Formulation:**
The multiclass cross-entropy loss is defined as:
$$\text{Cross-Entropy} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$
Where:
- `n` is the number of samples.
- `C` is the number of classes.
- $y_{i,c}$ is the indicator function for the true class of sample `i`.
- $\hat{y}_{i,c}$ is the predicted probability of sample `i` belonging to class `c`.
**Advantages:**
- Handles multiple classes effectively.
- Encourages the model to assign high probabilities to the correct classes.
**Disadvantages:**
- Requires one-hot encoding for class labels, which can increase computational complexity.
**Python Implementation:**
```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred):
    # y_true is one-hot encoded with shape (n, C); y_pred holds class probabilities
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```
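As a hedged illustration of the one-hot requirement, the sketch below converts integer class labels into one-hot rows and evaluates the loss. The helper `one_hot` and the sample values are assumptions made for this example.

```python
import numpy as np

def one_hot(labels, num_classes):
    # Row i of the identity matrix is the one-hot vector for class i
    return np.eye(num_classes)[labels]

def categorical_cross_entropy(y_true, y_pred):
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Hypothetical labels and predicted probabilities for 3 samples and 3 classes
labels = np.array([0, 2, 1])
y_true = one_hot(labels, num_classes=3)
y_pred = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6],
                   [0.2, 0.5, 0.3]])

loss = categorical_cross_entropy(y_true, y_pred)

# Only the probability assigned to the true class contributes to each term
true_class_probs = y_pred[np.arange(len(labels)), labels]
print(loss, -np.mean(np.log(true_class_probs)))
```

Both printed values agree, which shows that the one-hot mask simply selects the predicted probability of the correct class for each sample.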
### 5. Hinge Loss (SVM)
**Explanation:**
Hinge loss is commonly used in support vector machines (SVMs) for binary classification tasks. It penalizes misclassifications by a linear margin.
**Mathematical Formulation:**
For binary classification, the hinge loss is defined as:
$$\text{Hinge Loss} = \frac{1}{n} \sum_{i=1}^{n} \max(0, 1 - y_i \cdot \hat{y}_i)$$
Where:
- `n` is the number of samples.
- $y_i$ is the actual class label (-1 or 1).
- $\hat{y}_i$ is the predicted score for sample $i$.
**Advantages:**
- Encourages margin maximization in SVMs.
- Robust to outliers due to the linear penalty.
**Disadvantages:**
- Not differentiable at the margin, which can complicate optimization.
**Python Implementation:**
```python
import numpy as np

def hinge_loss(y_true, y_pred):
    # y_true holds labels in {-1, 1}; y_pred holds raw scores, not probabilities
    loss = np.maximum(0, 1 - y_true * y_pred)
    return np.mean(loss)
```
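A short worked example may help to see how the margin behaves. The labels and scores below are made up for illustration.

```python
import numpy as np

def hinge_loss(y_true, y_pred):
    return np.mean(np.maximum(0, 1 - y_true * y_pred))

# Hypothetical labels in {-1, 1} and raw classifier scores
y_true = np.array([1, 1, 1, -1])
scores = np.array([2.0, 0.5, -0.3, -1.5])

# Per-sample losses:
#   sample 0: correct and outside the margin  -> 0
#   sample 1: correct but inside the margin   -> 0.5
#   sample 2: misclassified                   -> 1.3
#   sample 3: correct and outside the margin  -> 0
per_sample = np.maximum(0, 1 - y_true * scores)
loss = hinge_loss(y_true, scores)
print(per_sample, loss)
```

Note that the second sample is classified correctly and still incurs a loss, because its score falls inside the margin.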
### 6. Huber Loss
**Explanation:**
Huber loss is a combination of MSE and MAE, providing a compromise between the two. It is less sensitive to outliers than MSE and provides a smooth transition to MAE for large errors.
**Mathematical Formulation:**
The Huber loss is defined as:
$$\text{Huber Loss} = \frac{1}{n} \sum_{i=1}^{n} \left\{
\begin{array}{ll}
\frac{1}{2} (y_i - \hat{y}_i)^2 & \text{if } |y_i - \hat{y}_i| \leq \delta \\
\delta(|y_i - \hat{y}_i| - \frac{1}{2} \delta) & \text{otherwise}
\end{array}
\right.$$
Where:
- `n` is the number of samples.
- $\delta$ is a threshold parameter.
**Advantages:**
- Provides a smooth loss function.
- Less sensitive to outliers than MSE.
**Disadvantages:**
- Requires tuning of the threshold parameter.
**Python Implementation:**
```python
import numpy as np

def huber_loss(y_true, y_pred, delta):
    error = y_true - y_pred
    loss = np.where(np.abs(error) <= delta,
                    0.5 * error ** 2,
                    delta * (np.abs(error) - 0.5 * delta))
    return np.mean(loss)
```
### 7. Log-Cosh Loss
**Explanation:**
Log-Cosh loss is a smooth approximation of the MAE and is less sensitive to outliers than MSE. It provides a smooth transition from quadratic for small errors to linear for large errors.
**Mathematical Formulation:**
The Log-Cosh loss is defined as:
$$\text{Log-Cosh Loss} = \frac{1}{n} \sum_{i=1}^{n} \log(\cosh(y_i - \hat{y}_i))$$
Where:
- `n` is the number of samples.
**Advantages:**
- Smooth and differentiable everywhere.
- Less sensitive to outliers.
**Disadvantages:**
- Computationally more expensive than simple losses like MSE.
**Python Implementation:**
```python
import numpy as np

def logcosh_loss(y_true, y_pred):
    error = y_true - y_pred
    loss = np.log(np.cosh(error))
    return np.mean(loss)
```
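For very large errors, `np.cosh` overflows before the logarithm is taken. A possible workaround, sketched here under the identity $\log(\cosh(x)) = \log(e^{x} + e^{-x}) - \log 2$, is to use `np.logaddexp`. The function name `logcosh_loss_stable` is an assumption introduced for this example.

```python
import numpy as np

def logcosh_loss_stable(y_true, y_pred):
    error = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # log(cosh(x)) = log((e^x + e^-x) / 2) = logaddexp(x, -x) - log(2)
    return np.mean(np.logaddexp(error, -error) - np.log(2.0))

# Hypothetical inputs: two small errors and one extremely large error
small = logcosh_loss_stable(np.array([0.1, -0.2]), np.array([0.0, 0.0]))
large = logcosh_loss_stable(np.array([1000.0]), np.array([0.0]))
print(small, large)
```

For small errors the result matches the direct formula, while for the large error it stays finite and is close to `|error| - log(2)`, which reflects the linear behaviour described above.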
These implementations provide various options for cost functions suitable for different machine learning tasks. Each function has its advantages and disadvantages, making them suitable for different scenarios and problem domains.


@ -0,0 +1,99 @@
# Hierarchical Clustering
Hierarchical Clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. This README provides an overview of the hierarchical clustering algorithm, including its fundamental concepts, types, steps, and how to implement it using Python.
## Introduction
Hierarchical Clustering is an unsupervised learning method used to group similar objects into clusters. Unlike other clustering techniques, hierarchical clustering does not require the number of clusters to be specified beforehand. It produces a tree-like structure called a dendrogram, which displays the arrangement of the clusters and their sub-clusters.
## Concepts
### Dendrogram
A dendrogram is a tree-like diagram that records the sequences of merges or splits. It is a useful tool for visualizing the process of hierarchical clustering.
### Distance Measure
Distance measures are used to quantify the similarity or dissimilarity between data points. Common distance measures include Euclidean distance, Manhattan distance, and cosine distance (one minus the cosine similarity).
### Linkage Criteria
Linkage criteria determine how the distance between clusters is calculated. Different linkage criteria include single linkage, complete linkage, average linkage, and Ward's linkage.
## Types of Hierarchical Clustering
1. **Agglomerative Clustering (Bottom-Up Approach)**:
- Starts with each data point as a separate cluster.
- Repeatedly merges the closest pairs of clusters until only one cluster remains or a stopping criterion is met.
2. **Divisive Clustering (Top-Down Approach)**:
- Starts with all data points in a single cluster.
- Repeatedly splits clusters into smaller clusters until each data point is its own cluster or a stopping criterion is met.
## Steps in Hierarchical Clustering
1. **Calculate Distance Matrix**: Compute the distance between each pair of data points.
2. **Create Clusters**: Treat each data point as a single cluster.
3. **Merge Closest Clusters**: Find the two clusters that are closest to each other and merge them into a single cluster.
4. **Update Distance Matrix**: Update the distance matrix to reflect the distance between the new cluster and the remaining clusters.
5. **Repeat**: Repeat steps 3 and 4 until all data points are merged into a single cluster or the desired number of clusters is achieved.
## Linkage Criteria
1. **Single Linkage (Minimum Linkage)**: The distance between two clusters is defined as the minimum distance between any single data point in the first cluster and any single data point in the second cluster.
2. **Complete Linkage (Maximum Linkage)**: The distance between two clusters is defined as the maximum distance between any single data point in the first cluster and any single data point in the second cluster.
3. **Average Linkage**: The distance between two clusters is defined as the average distance between all pairs of data points, one from each cluster.
4. **Ward's Linkage**: The distance between two clusters is defined as the increase in the sum of squared deviations from the mean when the two clusters are merged.
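The four criteria above can be computed by hand for two tiny clusters. The sketch below uses made-up 2D points and follows the definitions given in this section; the helper `sse` is an assumption introduced for the illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2D points
cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[3.0, 0.0], [4.0, 0.0]])

# All pairwise Euclidean distances, one point from each cluster
pairwise = cdist(cluster_a, cluster_b)

single = pairwise.min()     # closest pair
complete = pairwise.max()   # farthest pair
average = pairwise.mean()   # mean over all pairs

def sse(points):
    # Sum of squared deviations from the cluster mean
    return np.sum((points - points.mean(axis=0)) ** 2)

# Ward: increase in the sum of squared deviations caused by the merge
merged = np.vstack([cluster_a, cluster_b])
ward = sse(merged) - sse(cluster_a) - sse(cluster_b)

print(single, complete, average, ward)
```

Library implementations may scale the Ward value differently, so this number is meant only to illustrate the definition, not to match the heights in a particular dendrogram.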
## Implementation
### Using Scikit-learn
Scikit-learn is a popular machine learning library in Python that provides tools for hierarchical clustering. The example below also uses SciPy to compute the linkage matrix and plot the dendrogram.
### Code Example
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
# Load dataset
data = pd.read_csv('path/to/your/dataset.csv')
# Preprocess the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Perform hierarchical clustering
Z = linkage(data_scaled, method='ward')
# Plot the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z)
plt.title('Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Distance')
plt.show()
# Perform Agglomerative Clustering
# Ward linkage only works with Euclidean distances, so no distance argument is passed
agg_clustering = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg_clustering.fit_predict(data_scaled)
# Add cluster labels to the original data
data['Cluster'] = labels
print(data.head())
```
## Evaluation Metrics
- **Silhouette Score**: Measures how similar a data point is to its own cluster compared to other clusters.
- **Cophenetic Correlation Coefficient**: Measures how faithfully a dendrogram preserves the pairwise distances between the original data points.
- **Dunn Index**: Ratio of the minimum inter-cluster distance to the maximum intra-cluster distance.
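Two of these metrics are available in common libraries. The following is a minimal sketch, assuming synthetic data made of two well-separated blobs; it computes the silhouette score with scikit-learn and the cophenetic correlation coefficient with SciPy. The Dunn index is left out because neither library ships a function for it.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

# Synthetic data (an assumption for this sketch): two blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(5.0, 0.3, size=(20, 2))])

# Build the hierarchy and cut it into two flat clusters
Z = linkage(X, method='ward')
labels = fcluster(Z, t=2, criterion='maxclust')

# Silhouette score: values close to 1 indicate compact, well-separated clusters
sil = silhouette_score(X, labels)

# Cophenetic correlation: agreement between dendrogram heights and original distances
coph_corr, _ = cophenet(Z, pdist(X))

print(f"Silhouette score: {sil:.3f}")
print(f"Cophenetic correlation: {coph_corr:.3f}")
```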
## Conclusion
Hierarchical clustering is a versatile and intuitive method for clustering data. It is particularly useful when the number of clusters is not known beforehand. By understanding the different linkage criteria and evaluation metrics, one can effectively apply hierarchical clustering to various types of data.


@ -1,17 +1,21 @@
# List of sections # List of sections
- [Binomial Distribution](binomial_distribution.md) - [Introduction to scikit-learn](sklearn-introduction.md)
- [Regression in Machine Learning](Regression.md) - [Binomial Distribution](binomial-distribution.md)
- [Regression in Machine Learning](regression.md)
- [Confusion Matrix](confusion-matrix.md) - [Confusion Matrix](confusion-matrix.md)
- [Decision Tree Learning](Decision-Tree.md) - [Decision Tree Learning](decision-tree.md)
- [Random Forest](random-forest.md) - [Random Forest](random-forest.md)
- [Support Vector Machine Algorithm](support-vector-machine.md) - [Support Vector Machine Algorithm](support-vector-machine.md)
- [Artificial Neural Network from the Ground Up](ArtificialNeuralNetwork.md) - [Artificial Neural Network from the Ground Up](ann.md)
- [Introduction To Convolutional Neural Networks (CNNs)](intro-to-cnn.md) - [Introduction To Convolutional Neural Networks (CNNs)](intro-to-cnn.md)
- [TensorFlow.md](tensorFlow.md) - [TensorFlow.md](tensorflow.md)
- [PyTorch.md](pytorch.md) - [PyTorch.md](pytorch.md)
- [Types of optimizers](Types_of_optimizers.md)
- [Ensemble Learning](ensemble-learning.md) - [Ensemble Learning](ensemble-learning.md)
- [Types of optimizers](types-of-optimizers.md)
- [Logistic Regression](logistic-regression.md) - [Logistic Regression](logistic-regression.md)
- [Types_of_Cost_Functions](cost-functions.md)
- [Clustering](clustering.md) - [Clustering](clustering.md)
- [Hierarchical Clustering](hierarchical-clustering.md)
- [Grid Search](grid-search.md) - [Grid Search](grid-search.md)
- [K-nearest neighbor (KNN)](knn.md)


@ -0,0 +1,122 @@
# K-Nearest Neighbors (KNN) Machine Learning Algorithm in Python
## Introduction
K-Nearest Neighbors (KNN) is a simple, yet powerful, supervised machine learning algorithm used for both classification and regression tasks. It assumes that similar things exist in close proximity. In other words, similar data points are near to each other.
## How KNN Works
KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, then voting for the most frequent label (in classification) or averaging the labels (in regression).
### Steps:
1. **Choose the number K of neighbors**
2. **Calculate the distance** between the query-instance and all the training samples
3. **Sort the distances** and determine the nearest neighbors based on the K-th minimum distance
4. **Gather the labels** of the nearest neighbors
5. **Vote for the most frequent label** (in case of classification) or **average the labels** (in case of regression)
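The steps above can be sketched from scratch with NumPy. This is an illustrative implementation for classification with Euclidean distance, not the method used by scikit-learn; the function `knn_predict` and the toy data are assumptions made for this example.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    # Step 2: distance from the query to every training sample
    distances = np.linalg.norm(X_train - query, axis=1)
    # Step 3: indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Steps 4 and 5: gather the neighbours' labels and take the majority vote
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data: class 0 lies near (1, 1), class 1 lies near (6, 7)
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [7.0, 7.0], [6.5, 8.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.2, 1.3]))
pred_b = knn_predict(X_train, y_train, np.array([6.4, 6.9]))
print(pred_a, pred_b)
```

For regression, the vote in the last step would be replaced by the mean of the neighbours' target values.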
## When to Use KNN
### Advantages:
- **Simple and easy to understand:** KNN is intuitive and easy to implement.
- **No training phase:** KNN is a lazy learner, meaning there is no explicit training phase.
- **Effective with a small dataset:** KNN performs well when the dataset is small and has few input variables.
### Disadvantages:
- **Computationally expensive:** The algorithm becomes significantly slower as the number of examples and/or predictors/independent variables increase.
- **Sensitive to irrelevant features:** All features contribute to the distance equally.
- **Memory-intensive:** Storing all the training data can be costly.
### Use Cases:
- **Recommender Systems:** Suggest items based on similarity to user preferences.
- **Image Recognition:** Classify images by comparing new images to the training set.
- **Finance:** Predict credit risk or fraud detection based on historical data.
## KNN in Python
### Required Libraries
To implement KNN, we need the following Python libraries:
- `numpy`
- `pandas`
- `scikit-learn`
- `matplotlib` (for visualization)
### Installation
```bash
pip install numpy pandas scikit-learn matplotlib
```
### Example Code
Let's implement a simple KNN classifier using the Iris dataset.
#### Step 1: Import Libraries
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
```
#### Step 2: Load Dataset
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
```
#### Step 3: Split Dataset
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
#### Step 4: Train KNN Model
```python
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
```
#### Step 5: Make Predictions
```python
y_pred = knn.predict(X_test)
```
#### Step 6: Evaluate the Model
```python
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
```
### Visualization (Optional)
```python
# Plotting the decision boundary for visualization (for 2D data)
h = .02  # step size in the mesh
# Create color maps
cmap_light = plt.cm.RdYlBu
cmap_bold = plt.cm.RdYlBu
# For simplicity, we take only the first two features of the dataset
X_plot = X[:, :2]
# The classifier above was trained on all four features, so fit a separate
# model on the two plotted features before predicting on the 2D mesh
knn_2d = KNeighborsClassifier(n_neighbors=3)
knn_2d.fit(X_plot, y)
x_min, x_max = X_plot[:, 0].min() - 1, X_plot[:, 0].max() + 1
y_min, y_max = X_plot[:, 1].min() - 1, X_plot[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
Z = knn_2d.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)
# Plot also the training points
plt.scatter(X_plot[:, 0], X_plot[:, 1], c=y, edgecolor='k', cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = 3)")
plt.show()
```
## Generalization and Considerations
- **Choosing K:** The choice of K is critical. Smaller values of K make the model sensitive to noise in the training data, while larger values smooth the decision boundary and might oversimplify the model.
- **Feature Scaling:** Since KNN relies on distance calculations, features should be scaled (standardized or normalized) to ensure that all features contribute equally to the distance computation.
- **Distance Metrics:** The choice of distance metric (Euclidean, Manhattan, etc.) can affect the performance of the algorithm.
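One possible way to act on the first two points is to scale the features inside a pipeline and compare several values of K with cross-validation. The sketch below is an illustration on the Iris dataset; the range of K values and the 5-fold setting are arbitrary assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

scores = {}
for k in range(1, 16, 2):  # odd values of K from 1 to 15
    # The scaler is fitted on each training fold only, which avoids data leakage
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k:2d}  mean accuracy={score:.3f}")
print(f"Best k: {best_k}")
```

The distance metric can be compared in the same way by passing a different `metric` argument to `KNeighborsClassifier`.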
In conclusion, KNN is a versatile and easy-to-implement algorithm suitable for various classification and regression tasks, particularly when working with small datasets and well-defined features. However, careful consideration should be given to the choice of K, feature scaling, and distance metrics to optimize its performance.


@ -0,0 +1,144 @@
# scikit-learn (sklearn) Python Library
## Overview
scikit-learn, also known as sklearn, is a popular open-source Python library that provides simple and efficient tools for data mining and data analysis. It is built on NumPy, SciPy, and matplotlib. The library is designed to interoperate with the Python numerical and scientific libraries.
## Key Features
- **Classification**: Identifying which category an object belongs to. Example algorithms include SVM, nearest neighbors, random forest.
- **Regression**: Predicting a continuous-valued attribute associated with an object. Example algorithms include support vector regression (SVR), ridge regression, Lasso.
- **Clustering**: Automatic grouping of similar objects into sets. Example algorithms include k-means, spectral clustering, mean-shift.
- **Dimensionality Reduction**: Reducing the number of random variables to consider. Example algorithms include PCA, feature selection, non-negative matrix factorization.
- **Model Selection**: Comparing, validating, and choosing parameters and models. Example methods include grid search, cross-validation, metrics.
- **Preprocessing**: Feature extraction and normalization.
## When to Use scikit-learn
- **Use scikit-learn if**:
- You are working on machine learning tasks such as classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
- You need an easy-to-use, well-documented library.
- You require tools that are compatible with NumPy and SciPy.
- **Do not use scikit-learn if**:
- You need to perform deep learning tasks. In such cases, consider using TensorFlow or PyTorch.
- You need out-of-the-box support for large-scale data. scikit-learn is designed to work with in-memory data, so for very large datasets, you might want to consider libraries like Dask-ML.
## Installation
You can install scikit-learn using pip:
```bash
pip install scikit-learn
```
Or via conda:
```bash
conda install scikit-learn
```
## Basic Usage with Code Snippets
### Importing the Library
```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
```
### Loading Data
For illustration, let's create a simple synthetic dataset:
```python
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
```
### Splitting Data
Split the dataset into training and testing sets:
```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```
### Preprocessing
Standardizing the features:
```python
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```
### Training a Model
Train a Logistic Regression model:
```python
model = LogisticRegression()
model.fit(X_train, y_train)
```
### Making Predictions
Make predictions on the test set:
```python
y_pred = model.predict(X_test)
```
### Evaluating the Model
Evaluate the accuracy of the model:
```python
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
```
### Putting it All Together
Here is a complete example from data loading to model evaluation:
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Preprocess data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Train model
model = LogisticRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy * 100:.2f}%")
```
## Conclusion
scikit-learn is a powerful and versatile library that can be used for a wide range of machine learning tasks. It is particularly well-suited for beginners due to its easy-to-use interface and extensive documentation. Whether you are working on a simple classification task or a more complex clustering problem, scikit-learn provides the tools you need to build and evaluate your models effectively.


@ -1,10 +1,11 @@
# List of sections # List of sections
- [Pandas Introduction and Dataframes in Pandas](introduction.md) - [Pandas Introduction and Dataframes in Pandas](introduction.md)
- [Pandas Series Vs NumPy ndarray](pandas_series_vs_numpy_ndarray.md) - [Viewing data in pandas](viewing-data.md)
- [Pandas Descriptive Statistics](Descriptive_Statistics.md) - [Pandas Series Vs NumPy ndarray](pandas-series-vs-numpy-ndarray.md)
- [Group By Functions with Pandas](GroupBy_Functions_Pandas.md) - [Pandas Descriptive Statistics](descriptive-statistics.md)
- [Excel using Pandas DataFrame](excel_with_pandas.md) - [Group By Functions with Pandas](groupby-functions.md)
- [Excel using Pandas DataFrame](excel-with-pandas.md)
- [Working with Date & Time in Pandas](datetime.md) - [Working with Date & Time in Pandas](datetime.md)
- [Importing and Exporting Data in Pandas](import-export.md) - [Importing and Exporting Data in Pandas](import-export.md)
- [Handling Missing Values in Pandas](handling-missing-values.md) - [Handling Missing Values in Pandas](handling-missing-values.md)


@ -0,0 +1,67 @@
# Viewing rows of the frame
## `head()` method
The pandas library in Python provides a convenient method called `head()` that allows you to view the first few rows of a DataFrame. Let me explain how it works:
- The `head()` function returns the first n rows of a DataFrame or Series.
- By default, it displays the first 5 rows, but you can specify a different number of rows using the n parameter.
### Syntax
```python
dataframe.head(n)
```
`n` is optional: it is the number of rows to return. The default value is `5`.
### Example
```python
import pandas as pd
df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion', 'tiger', 'rabbit', 'dog', 'fox', 'monkey', 'elephant']})
df.head(n=5)
```
#### Output
```
      animal
0  alligator
1        bee
2     falcon
3       lion
4      tiger
```
## `tail()` method
The `tail()` method in pandas displays the last five rows of the DataFrame by default. It takes a single parameter, the number of rows, which lets you display as many rows as you need.
- The `tail()` function returns the last n rows of a DataFrame or Series.
- By default, it displays the last 5 rows, but you can specify a different number of rows using the n parameter.
### Syntax
```python
dataframe.tail(n)
```
`n` is optional: it is the number of rows to return. The default value is `5`.
### Example
```python
import pandas as pd
df = pd.DataFrame({'fruits': ['mango', 'orange', 'apple', 'lemon', 'banana', 'water melon', 'papaya', 'grapes', 'cherry', 'coconut']})
df.tail(n=5)
```
#### Output
```
        fruits
5  water melon
6       papaya
7       grapes
8       cherry
9      coconut
```
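Both methods also accept other values of `n` and work on a Series as well. The following sketch uses a small made-up DataFrame to show a few variations, including a negative `n`, which returns everything except the last `|n|` rows when passed to `head()`.

```python
import pandas as pd

# Hypothetical frame with ten rows, used only for illustration
df = pd.DataFrame({'value': range(10)})

first_three = df.head(3)         # rows 0, 1, 2
last_three = df.tail(3)          # rows 7, 8, 9
all_but_last_two = df.head(-2)   # rows 0 to 7
series_tail = df['value'].tail(2)  # works on a Series too

print(first_three)
print(last_three)
print(all_but_last_two)
print(series_tail)
```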


@ -1,5 +1,9 @@
# List of sections # List of sections
- [Installing Matplotlib](matplotlib-installation.md) - [Installing Matplotlib](matplotlib-installation.md)
- [Introducing Matplotlib](matplotlib-introduction.md)
- [Bar Plots in Matplotlib](matplotlib-bar-plots.md) - [Bar Plots in Matplotlib](matplotlib-bar-plots.md)
- [Pie Charts in Matplotlib](matplotlib-pie-charts.md) - [Pie Charts in Matplotlib](matplotlib-pie-charts.md)
- [Line Charts in Matplotlib](matplotlib-line-plots.md)
- [Introduction to Seaborn and Installation](seaborn-intro.md)
- [Getting started with Seaborn](seaborn-basics.md)


@ -0,0 +1,80 @@
# Introducing Matplotlib
Data visualisation is the process of analysing and understanding data through graphical representations such as pie charts, histograms, scatter plots and line graphs.
The Matplotlib library makes this process easier and clearer.
## Features of the Matplotlib library
- Matplotlib is one of the most popular Python packages for 2D representation of data.
- Matplotlib is often combined with NumPy for easier computation and visualization of large arrays and data. Together, they can be considered an open-source equivalent of MATLAB.
- Matplotlib has a procedural interface named pylab, which is designed to resemble MATLAB. However, it is completely independent of MATLAB.
## Starting with Matplotlib
### 1. Install and import the necessary libraries - matplotlib.pyplot
```bash
pip install matplotlib
```
```python
import matplotlib.pyplot as plt
import numpy as np
```
### 2. Scatter plot
A scatter plot uses Cartesian coordinates to describe the relationship between two variables, x and y. Each observation in the data set is drawn as a dot.
```python
x = [5,4,5,8,9,8,6,7,3,2]
y = [9,1,7,3,5,7,6,1,2,8]
plt.scatter(x,y, color = "red")
plt.title("Scatter plot")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.tight_layout()
plt.show()
```
![scatterplot](images/scatterplot.png)
### 3. Bar plot
A bar plot shows the frequency distribution of a categorical variable. Each category is represented as a bar, and the size of the bar represents its numeric value.
```python
x = np.array(['A','B','C','D'])
y = np.array([42,50,15,35])
plt.bar(x,y,color = "red")
plt.title("Bar plot")
plt.xlabel("X values")
plt.ylabel("Y values")
plt.show()
```
![barplot](images/barplot.png)
### 4. Histogram
A histogram represents the frequency distribution of numerical data. The values are grouped into bins, and the height of each rectangle shows how many values fall into that bin.
```python
x = [9,1,7,3,5,7,6,1,2,8]
plt.hist(x, color = "red", edgecolor= "white", bins =5)
plt.title("Histogram")
plt.xlabel("X values")
plt.ylabel("Frequency Distribution")
plt.show()
```
![histogram](images/histogram.png)


@ -0,0 +1,278 @@
# Line Chart in Matplotlib
A line chart is a simple way to visualize data where we connect individual data points. It helps us to see trends and patterns over time or across categories.
This type of chart is particularly useful for:
- Comparing Data: Comparing multiple datasets on the same axes.
- Highlighting Changes: Illustrating changes and patterns in data.
- Visualizing Trends: Showing trends over time or other continuous variables.
## Prerequisites
Line plots can be created in Python with Matplotlib's `pyplot` library. To build a line plot, first import `matplotlib`. It is a standard convention to import Matplotlib's pyplot library as `plt`.
```python
import matplotlib.pyplot as plt
```
## Creating a simple Line Plot
First import Matplotlib and NumPy; both are useful for charting.
You can use the `plot(x, y)` method to create a line chart.
```python
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1, 1, 50)
print(x)
y = 2*x + 1
plt.plot(x, y)
plt.show()
```
When executed, this will show the following line plot:
![Basic line Chart](images/simple_line.png)
## Curved line
The `plot()` method also works for other types of line charts. The line doesn't need to be straight; `y` can hold any values.
```python
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1, 1, 50)
y = 2**x + 1
plt.plot(x, y)
plt.show()
```
When executed, this will show the following Curved line plot:
![Curved line](images/line-curve.png)
## Line with Labels
To know what you are looking at, you need metadata. Labels are a type of metadata. They show what the chart is about. The chart has an `x label`, a `y label` and a `title`.
```python
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1, 1, 50)
y1 = 2*x + 1
y2 = 2**x + 1
plt.figure()
plt.plot(x, y1)
plt.xlabel("I am x")
plt.ylabel("I am y")
plt.title("With Labels")
plt.show()
```
When executed, this will show the following line with labels plot:
![line with labels](images/line-labels.png)
## Multiple lines
More than one line can be in the plot. To add another line, just call the `plot(x, y)` function again. In the example below we have two different sets of `y` values (`y1` and `y2`) that are plotted onto the chart.
```python
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1, 1, 50)
y1 = 2*x + 1
y2 = 2**x + 1
plt.figure(num = 3, figsize=(8, 5))
plt.plot(x, y2)
plt.plot(x, y1,
         color='red',
         linewidth=1.0,
         linestyle='--'
         )
plt.show()
```
When executed, this will show the following Multiple lines plot:
![multiple lines](images/two-lines.png)
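With several lines on one chart, a legend helps to tell them apart. The following is a minimal sketch using the same data as above; the `label` texts and the legend position are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-1, 1, 50)
y1 = 2*x + 1
y2 = 2**x + 1

fig, ax = plt.subplots(figsize=(8, 5))
# The label passed to each plot call becomes that line's legend entry
ax.plot(x, y2, label="y = 2**x + 1")
ax.plot(x, y1, color='red', linewidth=1.0, linestyle='--', label="y = 2*x + 1")
legend = ax.legend(loc='upper left')
```

Call `plt.show()` afterwards to display the figure in an interactive session.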
## Dotted line
Lines can be in the form of dots like the image below. Instead of calling `plot(x,y)` call the `scatter(x,y)` method. The `scatter(x,y)` method can also be used to (randomly) plot points onto the chart.
```python
import matplotlib.pyplot as plt
import numpy as np
n = 1024
X = np.random.normal(0, 1, n)
Y = np.random.normal(0, 1, n)
T = np.arctan2(X, Y)
plt.scatter(np.arange(5), np.arange(5))
plt.xticks(())
plt.yticks(())
plt.show()
```
When executed, this will show the following Dotted line plot:
![dotted lines](images/dot-line.png)
## Line ticks
You can change the ticks on the plot. Set them on the `x-axis` or the `y-axis`, or even change their color. The line can also be made thicker and given an alpha (transparency) value.
```python
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1, 1, 50)
y = 2*x - 1
plt.figure(figsize=(12, 8))
plt.plot(x, y, color='r', linewidth=10.0, alpha=0.5)
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.spines['bottom'].set_position(('data', 0))
ax.spines['left'].set_position(('data', 0))
# Enlarge every tick label and give it a semi-transparent yellow background
for label in ax.get_xticklabels() + ax.get_yticklabels():
    label.set_fontsize(12)
    label.set_bbox(dict(facecolor='y', edgecolor='None', alpha=0.7))
plt.show()
```
When executed, this will show the following line ticks plot:
![line ticks](images/line-ticks.png)
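The example above restyles the tick labels and moves the axes, but it does not set the tick positions or tick colors explicitly. A minimal sketch of those two steps follows; the tick values and colors are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-1, 1, 50)
y = 2*x - 1

plt.figure(figsize=(8, 5))
plt.plot(x, y)
plt.xticks(np.linspace(-1, 1, 5))  # ticks at -1, -0.5, 0, 0.5, 1
plt.yticks([-3, -1, 1])
ax = plt.gca()
# `colors` sets the color of both the tick marks and the tick labels
ax.tick_params(axis='x', colors='red')
ax.tick_params(axis='y', colors='green', labelsize=12)
```

Call `plt.show()` afterwards to display the figure.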
## Line with asymptote
An asymptote can be added to the plot. To do that, use `plt.annotate()`, which attaches a text label and an arrow to a point. There's also a dotted line in the plot below. You can play around with the code to see how it works.
```python
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1, 1, 50)
y1 = 2*x + 1
y2 = 2**x + 1
plt.figure(figsize=(12, 8))
plt.plot(x, y2)
plt.plot(x, y1, color='red', linewidth=1.0, linestyle='--')
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.spines['bottom'].set_position(('data', 0))
ax.spines['left'].set_position(('data', 0))
x0 = 1
y0 = 2*x0 + 1
plt.scatter(x0, y0, s = 66, color = 'b')
plt.plot([x0, x0], [y0, 0], 'k-.', lw= 2.5)
plt.annotate(r'$2x+1=%s$' %
             y0,
             xy=(x0, y0),
             xycoords='data',
             xytext=(+30, -30),
             textcoords='offset points',
             fontsize=16,
             arrowprops=dict(arrowstyle='->',connectionstyle='arc3,rad=.2')
             )
plt.text(0, 3,
         r'$This\ is\ a\ good\ idea.\ \mu\ \sigma_i\ \alpha_t$',
         fontdict={'size':16,'color':'r'})
plt.show()
```
When executed, this will show the following Line with asymptote plot:
![Line with asymptote](images/line-asymptote.png)
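The example above annotates a point rather than drawing an asymptote as such. As a hedged sketch of one way to draw one: the curve `2**x + 1` approaches `y = 1` as `x` decreases, so a horizontal reference line at `y = 1` can be added with `plt.axhline()`. The wider `x` range and the styling are assumptions chosen so that the flattening is visible.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-6, 2, 200)  # a wider range than above, so the curve visibly flattens
y2 = 2**x + 1

plt.figure(figsize=(8, 5))
plt.plot(x, y2, label="2**x + 1")
# Horizontal dashed line at the value the curve approaches but never reaches
asymptote = plt.axhline(y=1, color='k', linestyle='--', linewidth=1.0,
                        label="asymptote y = 1")
plt.legend()
```

Call `plt.show()` afterwards to display the figure.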
## Line with text scale
The scale doesn't have to be numeric. It can also contain text, as in the example below. In `plt.yticks()` we pass a list of positions together with a list of text labels. The labels are then shown along the `y axis`.
```python
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(-1, 1, 50)
y1 = 2*x + 1
y2 = 2**x + 1
plt.figure(num = 3, figsize=(8, 5))
plt.plot(x, y2)
plt.plot(x, y1,
         color='red',
         linewidth=1.0,
         linestyle='--'
         )
plt.xlim((-1, 2))
plt.ylim((1, 3))
new_ticks = np.linspace(-1, 2, 5)
plt.xticks(new_ticks)
plt.yticks([-2, -1.8, -1, 1.22, 3],
           [r'$really\ bad$', r'$bad$', r'$normal$', r'$good$', r'$really\ good$'])
ax = plt.gca()
ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')
ax.xaxis.set_ticks_position('bottom')
ax.yaxis.set_ticks_position('left')
ax.spines['bottom'].set_position(('data', 0))
ax.spines['left'].set_position(('data', 0))
plt.show()
```
When executed, this will show the following Line with text scale plot:
![Line with text scale](images/line-with-text-scale.png)


@ -0,0 +1,39 @@
Seaborn helps you explore and understand your data. Its plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots. Its dataset-oriented, declarative API lets you focus on what the different elements of your plots mean, rather than on the details of how to draw them.
Here's an example of what seaborn can do:
```Python
# Import seaborn
import seaborn as sns
# Apply the default theme
sns.set_theme()
# Load an example dataset
tips = sns.load_dataset("tips")
# Create a visualization
sns.relplot(
    data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
)
```
Below is the output for the above code snippet:
![Seaborn intro image](images/seaborn-basics1.png)
```Python
# Load an example dataset
tips = sns.load_dataset("tips")
```
Most code in the docs will use the `load_dataset()` function to get quick access to an example dataset. There's nothing special about these datasets: they are just pandas data frames, and we could have loaded them with `pandas.read_csv()` or built them by hand. Many users specify data using pandas data frames, but seaborn is very flexible about the data structures that it accepts.
```Python
# Create a visualization
sns.relplot(
    data=tips,
    x="total_bill", y="tip", col="time",
    hue="smoker", style="smoker", size="size",
)
```
This plot shows the relationship between five variables in the tips dataset using a single call to the seaborn function `relplot()`. Notice how only the names of the variables and their roles in the plot are provided. Unlike when using matplotlib directly, it wasn't necessary to specify attributes of the plot elements in terms of color values or marker codes. Behind the scenes, seaborn handled the translation from values in the dataframe to arguments that matplotlib understands. This declarative approach lets you stay focused on the questions that you want to answer, rather than on the details of how to control matplotlib.
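To make the contrast concrete, here is a rough sketch of the kind of manual mapping that is needed when using matplotlib directly. It is not how seaborn is implemented. The six records are made-up values (not rows from the real tips dataset), and the colors and markers are arbitrary choices.

```Python
import matplotlib.pyplot as plt

# Hypothetical hand-made records standing in for a few rows of a dataset like `tips`
total_bill = [10.0, 15.5, 20.0, 25.0, 30.5, 40.0]
tip = [1.5, 2.0, 3.0, 3.5, 5.0, 6.0]
smoker = ["No", "No", "Yes", "No", "Yes", "Yes"]

# Manual mapping: every category needs an explicit color and marker code
style = {
    "Yes": {"color": "tab:orange", "marker": "x"},
    "No": {"color": "tab:blue", "marker": "o"},
}

plt.figure()
for category, props in style.items():
    # Select the records that belong to this category
    xs = [b for b, s in zip(total_bill, smoker) if s == category]
    ys = [t for t, s in zip(tip, smoker) if s == category]
    plt.scatter(xs, ys, label=category, **props)
plt.xlabel("total_bill")
plt.ylabel("tip")
plt.legend(title="smoker")
```

Even for a single `hue`-like variable, the grouping, the color and marker choices, and the legend all have to be written by hand; `relplot()` takes care of these steps from the variable names alone.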


@ -0,0 +1,41 @@
Seaborn is a Python data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
## Seaborn Installation
Before installing Seaborn, ensure you have Python installed on your system. You can download and install Python from the [official Python website](https://www.python.org/).
Below are the steps to install and set up Seaborn:
1. Open your terminal or command prompt and run the following command to install Seaborn using `pip`:
```bash
pip install seaborn
```
2. The basic invocation of `pip` will install seaborn and, if necessary, its mandatory dependencies. It is possible to include optional dependencies that give access to a few advanced features:
```bash
pip install seaborn[stats]
```
3. The library is also included as part of the Anaconda distribution, and it can be installed with `conda`:
```bash
conda install seaborn
```
4. As the main Anaconda repository can be slow to add new releases, you may prefer using the conda-forge channel:
```bash
conda install seaborn -c conda-forge
```
## Dependencies
### Supported Python versions
- Python 3.8+
### Mandatory Dependencies
- [numpy](https://numpy.org/)
- [pandas](https://pandas.pydata.org/)
- [matplotlib](https://matplotlib.org/)
### Optional Dependencies
- [statsmodels](https://www.statsmodels.org/stable/index.html) for advanced regression plots
- [scipy](https://scipy.org/) for clustering matrices and some advanced options
- [fastcluster](https://pypi.org/project/fastcluster/) for faster clustering of large matrices
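After installation, it can be useful to check which of the packages listed above are available in the current environment. The following is a minimal sketch using only the standard library; the helper name `installed` is hypothetical, and the check only tests whether a package can be located, not which version is present.

```python
from importlib import util

mandatory = ["numpy", "pandas", "matplotlib"]
optional = ["statsmodels", "scipy", "fastcluster"]

def installed(names):
    """Map each package name to True if it can be located in this environment."""
    return {name: util.find_spec(name) is not None for name in names}

groups = (("seaborn", ["seaborn"]), ("mandatory", mandatory), ("optional", optional))
for group, names in groups:
    for name, ok in installed(names).items():
        print(f"{group:9} {name:12} {'found' if ok else 'missing'}")
```

A missing optional package does not prevent seaborn from being imported; it only affects the features noted next to that package in the list above.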