Adding the BeautifulSoup web scrapping

2024-06-05 17:28:33 +05:30 · 2024-06-05 17:28:33 +05:30 · f742fddff3
commit f742fddff3
--- a/contrib/web-scrapping/beautifulsoup.md
+++ b/contrib/web-scrapping/beautifulsoup.md
@ -0,0 +1,206 @@
+# Beautiful Soup Library in Python
+
+## Table of Contents
+1. Introduction
+2. Prerequisites
+3. Setting Up Environment
+4. Beautiful Soup Objects
+5. Tags
+6. Children, Parent and Siblings
+7. Filter
+8. Web Scraping the Contents of a Web Page
+
+## 1. Introduction
+Beautiful Soup is a Python library for web scraping. It extracts data from HTML and XML files by building a parse tree of the page that you can navigate and search.
+
+## 2. Prerequisites
+- Understanding of HTTP requests and responses.
+- Understanding of HTTP methods (in particular, GET).
+- Basic knowledge of HTML.
+- Python installed on your machine (version 3.6 or higher).
+- pip (the Python package installer) installed.
+
+## 3. Setting Up Environment
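+1. **Install the latest version of Python:** Go to python.org and download the latest version of Python.
+
+2. **Install an IDE:** Install any IDE of your choice. VS Code is preferred.
+>Note: You can use Google Colab without installing Python or an IDE.
+
+3. **Install Beautiful Soup:**
+```python
+# The leading "!" runs a shell command from a notebook cell (Jupyter, Colab);
+# in a regular terminal, run the same command without it.
+# beautifulsoup4 is the package name on PyPI; it is imported as bs4.
+!pip install beautifulsoup4
+```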
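+
+To confirm that the installation worked, a quick check along these lines may help. It is only a sketch: the one-line HTML string is made up for illustration.
+
+```python
+import bs4
+from bs4 import BeautifulSoup
+
+# Print the installed Beautiful Soup version
+print(bs4.__version__)
+
+# Parse a tiny, made-up document to confirm the library works
+soup = BeautifulSoup("<p>ok</p>", "html.parser")
+print(soup.p.string)
+```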
+
+## 4. Beautiful Soup Objects
+Beautiful Soup pulls data out of HTML and XML files by representing the document as a set of Python objects.
+
+Take the HTML page below as input:
+```
+<!DOCTYPE html>
+<html>
+<head>
+<title>Page Title</title>
+</head>
+<body>
+<h3><b> Title </b></h3>
+<p>This is the main context of the page</p>
+</body>
+</html>
+```
+
+Let's store the HTML code in a variable:
+```
+# Triple quotes let the string span several lines.
+html = """<!DOCTYPE html>
+<html>
+<head>
+<title>Page Title</title>
+</head>
+<body>
+<h3><b> Title </b></h3>
+<p>This is the main context of the page</p>
+</body>
+</html>"""
+```
+
+To parse the HTML document, pass the variable to the `BeautifulSoup` constructor.
+```
+from bs4 import BeautifulSoup
+
+# 'html.parser' selects the parser from Python's standard library
+soup = BeautifulSoup(html, 'html.parser')
+```
+Beautiful Soup transforms a complex HTML document into a tree of Python objects.
+
+## 5. Tags
+
+The Tag object corresponds to an HTML tag in the original document.
+
+```
+tag_obj = soup.title
+print("tag object:", tag_obj)
+print("tag object type:", type(tag_obj))
+```
+```
+Result:
+tag object: <title>Page Title</title>
+tag object type: <class 'bs4.element.Tag'>
+```
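+
+A `Tag` also exposes its name and its HTML attributes. The sketch below illustrates this on a made-up `<a>` element. The markup and the `example.com` URL are assumptions for illustration, not part of the page used above.
+
+```python
+from bs4 import BeautifulSoup
+
+# Made-up snippet with two attributes, for illustration only
+soup = BeautifulSoup('<a href="https://example.com" id="home">Home</a>', "html.parser")
+tag = soup.a
+
+print(tag.name)          # the tag's name
+print(tag["href"])       # attributes are read like dictionary keys
+print(tag.attrs)         # all attributes as a dictionary
+print(tag.get("class"))  # .get() returns None when the attribute is missing
+```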
+
+## 6. Children, Parent and Siblings
+
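+We can navigate down the tree to a tag's child, up to its parent, or sideways to a sibling.
+
+```
+tag_obj = soup.h3
+child = tag_obj.b          # first <b> tag inside <h3>
+print(child)
+parent = child.parent      # the tag that directly contains <b>
+print(parent)
+# .next_sibling would return the newline between </h3> and <p>,
+# so ask for the next sibling that is a <p> tag instead.
+sib = tag_obj.find_next_sibling("p")
+print(sib)
+```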
+```
+Result:
+<b> Title </b>
+<h3><b> Title </b></h3>
+<p>This is the main context of the page</p>
+```
+
+> We need to name the child we want to navigate to, because a tag can have more than one child but has only one parent and at most one next sibling.
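+
+When a tag has several children, `.contents` returns all of its direct children as a list. The following is a minimal sketch on a made-up list. The `<ul>` markup is an assumption, written without whitespace between tags so that every child and sibling is a tag.
+
+```python
+from bs4 import BeautifulSoup
+
+# Hypothetical markup, chosen so that <ul> has several children
+html = "<ul><li>one</li><li>two</li><li>three</li></ul>"
+soup = BeautifulSoup(html, "html.parser")
+
+ul = soup.ul
+# .contents lists every direct child of the tag
+items = [str(li.string) for li in ul.contents]
+print(items)
+
+first = ul.li                  # attribute access returns only the first <li>
+second = first.next_sibling    # no whitespace between the tags, so this is a Tag
+print(second.string)
+print(second.parent.name)      # every <li> shares the same parent
+```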
+
+**Navigable String:** A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the NavigableString class to contain this text.
+
+```
+tagstr = child.string
+print(tagstr)
+```
+```
+Result:
+Title
+```
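+
+A `NavigableString` behaves like an ordinary Python string and can be converted to one. The sketch below re-parses only the `<h3>` line of the page above, so that it runs on its own.
+
+```python
+from bs4 import BeautifulSoup
+from bs4.element import NavigableString
+
+soup = BeautifulSoup("<h3><b> Title </b></h3>", "html.parser")
+text = soup.b.string
+
+print(type(text))
+# NavigableString is a subclass of str
+print(isinstance(text, NavigableString), isinstance(text, str))
+
+# Convert to a plain str and trim the surrounding spaces
+plain = str(text).strip()
+print(repr(plain))
+
+# get_text() collects the text of a tag and all of its descendants
+heading = soup.h3.get_text(strip=True)
+print(repr(heading))
+```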
+
+## 7. Filter
+
+Filters let you search the tree for elements that match a pattern. The simplest filter is a string, which matches tags with exactly that name.
+
+Consider the following HTML code:
+
+```
+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Sample Table</title>
+</head>
+<body>
+    <h1>Sample Table</h1>
+    <table class="myTable">
+        <thead>
+            <tr>
+                <th>Name</th>
+                <th>Age</th>
+                <th>City</th>
+            </tr>
+        </thead>
+        <tbody>
+            <tr>
+                <td>John Doe</td>
+                <td>28</td>
+                <td>New York</td>
+            </tr>
+            <tr>
+                <td>Jane Smith</td>
+                <td>34</td>
+                <td>Los Angeles</td>
+            </tr>
+            <tr>
+                <td>Emily Jones</td>
+                <td>23</td>
+                <td>Chicago</td>
+            </tr>
+        </tbody>
+    </table>
+</body>
+</html>
+
+```
+Store the above code in a variable and parse it.
+```
+# Triple quotes let the string span several lines.
+# The dots stand for the rest of the HTML shown above.
+table = """<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Sample Table</title>
+</head>..............
+"""
+
+table_bs = BeautifulSoup(table, 'html.parser')
+```
+**Find All:** The `find_all()` method looks through a tag’s descendants and retrieves all of them that match your filters. It returns the matching objects in a list, so individual results can be accessed by index.
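+
+As an illustration, `find_all()` can be applied to the sample table. The sketch below uses a shortened copy of the table (two data rows instead of three) so that it runs on its own.
+
+```python
+from bs4 import BeautifulSoup
+
+# A shortened copy of the sample table above
+table = """
+<table class="myTable">
+  <tr><th>Name</th><th>Age</th><th>City</th></tr>
+  <tr><td>John Doe</td><td>28</td><td>New York</td></tr>
+  <tr><td>Jane Smith</td><td>34</td><td>Los Angeles</td></tr>
+</table>
+"""
+table_bs = BeautifulSoup(table, "html.parser")
+
+# Every <tr> descendant, in document order
+rows = table_bs.find_all("tr")
+print(len(rows))
+
+# Index 0 is the header row, so index 1 is the first data row
+first_data_row = rows[1]
+cells = [td.get_text() for td in first_data_row.find_all("td")]
+print(cells)
+```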
+
+## 8. Web Scraping the Contents of a Web Page
+
+1. Get the URL of the webpage we need to scrape.
+
+2. Get the HTML content by sending an HTTP GET request with the `get` method.
+
+3. Pass the HTML content to the `BeautifulSoup` constructor to create a `BeautifulSoup` object.
+
+4. Access the HTML elements by tag name.
+
+Below is the code to extract the `img` tags from a Wikipedia page.
+
+```
+import requests
+from bs4 import BeautifulSoup
+
+url = "https://en.wikipedia.org/wiki/Encyclopedia"
+data = requests.get(url).text   # HTML of the page as a string
+soup = BeautifulSoup(data, "html.parser")
+
+# Print every <img> tag together with the value of its src attribute
+for img in soup.find_all('img'):
+  print(img)
+  print(img.get('src'))
+```
+We can retrieve any data on the webpage by accessing it through its tags.
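+
+One possible way to organize the same steps is to separate downloading from parsing, so that the parsing part can be tried without network access. This is only a sketch: the helper names `fetch_html` and `extract_image_sources`, the 10-second timeout, and the sample HTML are all assumptions made for illustration.
+
+```python
+import requests
+from bs4 import BeautifulSoup
+
+
+def extract_image_sources(html):
+    """Return the src attribute of every <img> tag that has one."""
+    soup = BeautifulSoup(html, "html.parser")
+    # src=True keeps only the <img> tags that define a src attribute
+    return [img["src"] for img in soup.find_all("img", src=True)]
+
+
+def fetch_html(url, timeout=10):
+    """Download a page, raising an exception on HTTP errors (4xx/5xx)."""
+    response = requests.get(url, timeout=timeout)
+    response.raise_for_status()
+    return response.text
+
+
+# Offline check on a small, made-up HTML snippet
+sample = '<p><img src="a.png"><img alt="no source"><img src="b.jpg"></p>'
+print(extract_image_sources(sample))
+```
+
+To scrape a live page, the two helpers can be combined as `extract_image_sources(fetch_html(url))`.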
--- a/contrib/web-scrapping/index.md
+++ b/contrib/web-scrapping/index.md
@ -2,3 +2,4 @@

 - [Section title](filename.md)
 - [Introduction to Flask](flask.md)
+- [Web Scraping Using Beautiful Soup](beautifulsoup.md)