Adding the BeautifulSoup web scraping

pull/1022/head
saiumasankar 2024-06-05 17:28:33 +05:30
parent 92571f6197
commit f742fddff3
2 changed files with 207 additions and 0 deletions


@ -0,0 +1,206 @@
# Beautiful Soup Library in Python
## Table of Contents
1. Introduction
2. Prerequisites
3. Setting Up Environment
4. Beautiful Soup Objects
5. Tags
6. Children, Parents and Siblings
7. Filter
8. Web Scraping the Contents of a Web Page
## 1. Introduction
Beautiful Soup is a Python library for web scraping: it extracts data from HTML and XML files. It builds a parse tree from the parsed page, which makes the data easy to navigate and extract.
## 2. Prerequisites
- Understanding of HTTP Requests and Responses.
- Understanding of HTTP methods (GET method).
- Basic knowledge of HTML.
- Python installed on your machine (version 3.6 or higher).
- pip (Python package installer) installed.
## 3. Setting Up Environment
1. **Install the latest version of Python:** Go to python.org and download the latest version of Python.
2. **Install an IDE:** Install any IDE of your choice. VS Code is preferred.

> Note: You can use Google Colab without installing Python or an IDE.

3. **Install Beautiful Soup:**
```python
# Notebook syntax (e.g. Google Colab); in a terminal, drop the leading "!"
!pip install beautifulsoup4
```
## 4. Beautiful Soup Objects
Beautiful Soup represents an HTML document as a set of Python objects that you can navigate and search.
Take the following HTML page as input:
```
<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b> Title </b></h3>
<p>This is the main context of the page</p>
</body>
</html>
```
Let's store the HTML code in a variable:
```
# Triple quotes allow the string to span several lines
html = """<!DOCTYPE html>
<html>
<head>
<title>Page Title</title>
</head>
<body>
<h3><b> Title </b></h3>
<p>This is the main context of the page</p>
</body>
</html>"""
```
To parse the HTML document, import `BeautifulSoup` and pass the variable to its constructor.
```
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
```
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
## 5. Tags
The Tag object corresponds to an HTML tag in the original document.
```
tag_obj = soup.title
print("tag object:", tag_obj)
print("tag object type:",type(tag_obj))
```
```
Result:
tag object: <title>Page Title</title>
tag object type: <class 'bs4.element.Tag'>
```
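Beyond printing a tag, it is often useful to read its name and attributes. The sketch below assumes a small hypothetical snippet (the `<a>` element and its attributes are made up for illustration, not taken from the page above) and shows the standard `Tag` accessors.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for this illustration.
snippet = '<a id="home" class="nav main" href="/index.html">Home</a>'
soup_a = BeautifulSoup(snippet, "html.parser")
link = soup_a.a

print(link.name)          # the tag name
print(link["href"])       # attribute access by key
print(link.get("title"))  # .get() returns None when the attribute is missing
print(link["class"])      # class can hold several values, so it is a list
print(link.attrs)         # all attributes as a dictionary
```

Using `.get()` instead of indexing avoids a `KeyError` when an attribute may be absent.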
## 6. Children, Parents and Siblings
From a tag we can navigate down to a child, up to its parent, or sideways to a sibling.
```
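tag_obj = soup.h3
# Navigate down: name the child tag to reach
child = tag_obj.b
print(child)
# Navigate up: every tag has exactly one parent
parent = child.parent
print(parent)
# Navigate sideways: find_next_sibling("p") skips the newline text between tags,
# which plain .next_sibling would return instead
sib = tag_obj.find_next_sibling("p")
print(sib)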
```
```
Result:
<b> Title </b>
<h3><b> Title </b></h3>
<p>This is the main context of the page</p>
```
> We need to name the child we want to navigate to, because a tag can have more than one child but only one parent and one next sibling.

**NavigableString:** A string corresponds to a bit of text within a tag. Beautiful Soup uses the `NavigableString` class to contain this text.
```
tagstr = child.string
print(tagstr)
```
```
Result:
Title
```
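A short sketch of how a `NavigableString` behaves, assuming the same `<h3><b> Title </b></h3>` fragment parsed on its own (the variable names `doc` and `text` are illustrative). It behaves like a regular Python string, so the usual string methods apply.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

doc = BeautifulSoup("<h3><b> Title </b></h3>", "html.parser")
text = doc.b.string

print(type(text))                    # the NavigableString class
print(repr(str(text)))               # surrounding spaces are preserved
print(repr(text.strip()))            # ordinary str methods work
print(doc.h3.get_text(strip=True))   # text of a tag with whitespace trimmed
```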
## 7. Filter
Filters allow you to find complex patterns. The simplest filter is a string.
Consider the following HTML code:
```
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Sample Table</title>
</head>
<body>
<h1>Sample Table</h1>
<table class="myTable">
<thead>
<tr>
<th>Name</th>
<th>Age</th>
<th>City</th>
</tr>
</thead>
<tbody>
<tr>
<td>John Doe</td>
<td>28</td>
<td>New York</td>
</tr>
<tr>
<td>Jane Smith</td>
<td>34</td>
<td>Los Angeles</td>
</tr>
<tr>
<td>Emily Jones</td>
<td>23</td>
<td>Chicago</td>
</tr>
</tbody>
</table>
</body>
</html>
```
Store the above code in a variable (abbreviated here) and parse it.
```
table = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Sample Table</title>
</head>..............
"""
table_bs = BeautifulSoup(table, 'html.parser')
```
**Find All:** The `find_all()` method looks through a tag's descendants and retrieves all descendants that match your filters. It returns the matches in a list, so individual results can be accessed by index.
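
A minimal sketch of `find_all()` on the sample table, assuming a shortened copy of the HTML (two data rows instead of three) and illustrative variable names such as `table_soup`:

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample table (header row plus two data rows).
table_html = """
<table class="myTable">
<thead><tr><th>Name</th><th>Age</th><th>City</th></tr></thead>
<tbody>
<tr><td>John Doe</td><td>28</td><td>New York</td></tr>
<tr><td>Jane Smith</td><td>34</td><td>Los Angeles</td></tr>
</tbody>
</table>
"""
table_soup = BeautifulSoup(table_html, "html.parser")

# The string "tr" is the filter: every <tr> is returned, header row included.
rows = table_soup.find_all("tr")
print(len(rows))

# Index into the result like a list, then search inside that row.
first_data_row = rows[1]
cells = [td.get_text() for td in first_data_row.find_all("td")]
print(cells)

# Filter by tag name and CSS class together; class_ avoids the Python keyword.
tables = table_soup.find_all("table", class_="myTable")
print(len(tables))
```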
## 8. Web Scraping the Contents of a Web Page
1. Get the URL of the webpage we need to scrape.
2. Get the HTML content by sending an HTTP GET request with `requests.get`.
3. Pass the HTML content to the `BeautifulSoup` constructor to create a parsed object.
4. Access the HTML elements by tag name.

Below is the code to extract the `img` tags from a Wikipedia page.
```
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Encyclopedia"
# Download the page and keep its HTML as text
data = requests.get(url).text
soup = BeautifulSoup(data, "html.parser")
# Print every <img> tag followed by the value of its src attribute
for img in soup.find_all('img'):
    print(img)
    print(img.get('src'))
```
We can retrieve any data from the webpage by accessing it through its tags.
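
One way to keep the parsing step reusable is to separate it from the download. The sketch below assumes a hypothetical helper named `extract_image_urls` and a made-up HTML string standing in for a downloaded page; it also resolves relative `src` values against a base URL with `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_image_urls(html, base_url):
    """Return absolute URLs for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for img in soup.find_all("img"):
        src = img.get("src")
        if src:  # skip <img> tags without a src attribute
            urls.append(urljoin(base_url, src))
    return urls


# Made-up HTML standing in for a downloaded page (an assumption for this sketch).
sample_page = (
    '<p><img src="/static/logo.png">'
    '<img alt="no source">'
    '<img src="//upload.example.org/a.jpg"></p>'
)
image_urls = extract_image_urls(sample_page, "https://example.org/wiki/Page")
print(image_urls)
```

For a live page, the same function can be called with `requests.get(url).text` as the first argument and `url` as the second.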


@ -2,3 +2,4 @@
- [Section title](filename.md)
- [Introduction to Flask](flask.md)
- [Web Scraping Using Beautiful Soup](beautifulsoup.md)