kopia lustrzana https://github.com/animator/learn-python
Adding the BeautifulSoup web scrapping
rodzic
92571f6197
commit
f742fddff3
|
@ -0,0 +1,206 @@
|
||||||
|
# Beautiful Soup Library in Python
|
||||||
|
|
||||||
|
## Table of Contents
|
||||||
|
1. Introduction
|
||||||
|
2. Prerequisites
|
||||||
|
3. Setting Up Environment
|
||||||
|
4. Beautiful Soup Objects
|
||||||
|
5. Tag object
|
||||||
|
6. Children, Parents and Siblings
|
||||||
|
7. Filter: Findall method
|
||||||
|
8. Web Scraping the Contents of a Web Page
|
||||||
|
|
||||||
|
## 1. Introduction
|
||||||
|
Beautiful Soup is a Python library used for web scraping purposes to extract data from HTML and XML files. It creates a parse tree for parsed pages that can be used to extract data easily.
|
||||||
|
|
||||||
|
## 2. Prerequisites
|
||||||
|
- Understanding of HTTP Requests and Responses.
|
||||||
|
- Understanding of HTTP methods (GET method).
|
||||||
|
- Basic Knowledge on HTML.
|
||||||
|
- Python installed on your machine (version 3.6 or higher).
|
||||||
|
- pip (Python package installer) installed.
|
||||||
|
|
||||||
|
## 3. Setting Up Environment
|
||||||
|
1.**Install latest version of Python:** Go to Python.org and Download the latest version of Python.
|
||||||
|
|
||||||
|
2.**Install a IDE:** Install any IDE of your choice to code. VS code is preferred.
|
||||||
|
>Note: You can use Google Colab without installing Python and IDE.
|
||||||
|
|
||||||
|
3.**Install Beautiful Soup:**
|
||||||
|
```python
|
||||||
|
!pip install bs4
|
||||||
|
```
|
||||||
|
|
||||||
|
## 4. Beautiful Soup Objects
|
||||||
|
Beautiful Soup is a Python library for pulling data out of HTML and XML files. This can be done by presenting the HTML as a set of objects.
|
||||||
|
|
||||||
|
Take the below HTML page as input
|
||||||
|
```
|
||||||
|
<!DOCTYPE html>
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
<title>Page Title</title>
|
||||||
|
</head>
|
||||||
|
<body>
|
||||||
|
<h3>Title</h3>
|
||||||
|
<p>This is the main context of the page</p>
|
||||||
|
</body>
|
||||||
|
</html>
|
||||||
|
```
|
||||||
|
|
||||||
|
lets store the HTML code in a variable
|
||||||
|
```
|
||||||
|
html= "<!DOCTYPE html>
|
||||||
|
<html>
|
||||||
|
<head>
|
||||||
|
<title>Page Title</title>
|
||||||
|
</head>
|
||||||
|
<body>
|
||||||
|
<h3><b> Title </b></h3>
|
||||||
|
<p>This is the main context of the page</p>
|
||||||
|
</body>
|
||||||
|
</html>"
|
||||||
|
```
|
||||||
|
|
||||||
|
To parse the HTML document, pass the variable to the `BeautifulSoup` Constructor.
|
||||||
|
```
|
||||||
|
soup = BeautifulSoup(html,'html.parser')
|
||||||
|
```
|
||||||
|
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
|
||||||
|
|
||||||
|
## 5. Tags
|
||||||
|
|
||||||
|
The Tag object corresponds to an HTML tag in the original document.
|
||||||
|
|
||||||
|
```
|
||||||
|
tag_obj = soup.title
|
||||||
|
print("tag object:", tag_obj)
|
||||||
|
print("tag object type:",type(tag_obj))
|
||||||
|
```
|
||||||
|
```
|
||||||
|
Result:
|
||||||
|
tag object: <title>Page Title</title>
|
||||||
|
|
||||||
|
tag object type: <class 'bs4.element.Tag'>
|
||||||
|
```
|
||||||
|
|
||||||
|
## 6. Children, Parent and Siblings
|
||||||
|
|
||||||
|
We can access the child of the tag or navigate down.
|
||||||
|
|
||||||
|
```
|
||||||
|
tag_obj = soup.h3
|
||||||
|
child = tag_obj.b
|
||||||
|
print(child)
|
||||||
|
parent = child.parent
|
||||||
|
print(parent)
|
||||||
|
sib = tag_obj.next_sibling
|
||||||
|
print(sib);
|
||||||
|
```
|
||||||
|
```
|
||||||
|
Result:
|
||||||
|
<b> Title </b>
|
||||||
|
<h3><b> Title </b></h3>
|
||||||
|
<p>This is the main context of the page</p>
|
||||||
|
```
|
||||||
|
|
||||||
|
> We need to mention the child to which we want to navigate. It is because there can be more than one child to a tag but only one parent and next sibling.
|
||||||
|
|
||||||
|
**Navigable String:** A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the NavigableString class to contain this text.
|
||||||
|
|
||||||
|
```
|
||||||
|
tagstr = child.string
|
||||||
|
print(tagstr)
|
||||||
|
```
|
||||||
|
```
|
||||||
|
Result:
|
||||||
|
Title
|
||||||
|
```
|
||||||
|
|
||||||
|
## 7. Filter
|
||||||
|
|
||||||
|
Filters allow you to find complex patterns, the simplest filter is a string.
|
||||||
|
|
||||||
|
Consider the following HTML code:
|
||||||
|
|
||||||
|
```
|
||||||
|
<!DOCTYPE html>
|
||||||
|
<html lang="en">
|
||||||
|
<head>
|
||||||
|
<meta charset="UTF-8">
|
||||||
|
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||||
|
<title>Sample Table</title>
|
||||||
|
</head>
|
||||||
|
<body>
|
||||||
|
<h1>Sample Table</h1>
|
||||||
|
<table class="myTable">
|
||||||
|
<thead>
|
||||||
|
<tr>
|
||||||
|
<th>Name</th>
|
||||||
|
<th>Age</th>
|
||||||
|
<th>City</th>
|
||||||
|
</tr>
|
||||||
|
</thead>
|
||||||
|
<tbody>
|
||||||
|
<tr>
|
||||||
|
<td>John Doe</td>
|
||||||
|
<td>28</td>
|
||||||
|
<td>New York</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>Jane Smith</td>
|
||||||
|
<td>34</td>
|
||||||
|
<td>Los Angeles</td>
|
||||||
|
</tr>
|
||||||
|
<tr>
|
||||||
|
<td>Emily Jones</td>
|
||||||
|
<td>23</td>
|
||||||
|
<td>Chicago</td>
|
||||||
|
</tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
</body>
|
||||||
|
</html>
|
||||||
|
|
||||||
|
```
|
||||||
|
Store the above code in a variable.
|
||||||
|
```
|
||||||
|
table = "<!DOCTYPE html>
|
||||||
|
<html lang="en">
|
||||||
|
<head>
|
||||||
|
<meta charset="UTF-8">
|
||||||
|
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
||||||
|
<title>Sample Table</title>
|
||||||
|
</head>..............
|
||||||
|
|
||||||
|
table_bs = BeautifulSoup(table, 'html.parser')
|
||||||
|
"
|
||||||
|
```
|
||||||
|
**Find All:** The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters. It places all the objects in a list. We can access by indexing.
|
||||||
|
|
||||||
|
## 8. Web Scraping the Contents of a Web Page
|
||||||
|
|
||||||
|
1. Get the URL of the webpage we need to scrape.
|
||||||
|
|
||||||
|
2. Get the HTML content by using `get` method in HTTP request.
|
||||||
|
|
||||||
|
3. Pass the HTML content to the BeautifulSoup Constructor to prepare an object.
|
||||||
|
|
||||||
|
4. Access the HTML elements using Tag name.
|
||||||
|
|
||||||
|
Below is the code to extract the img tags in a wikipedia page.
|
||||||
|
|
||||||
|
```
|
||||||
|
import requests
|
||||||
|
!pip install bs4
|
||||||
|
from bs4 import BeautifulSoup
|
||||||
|
|
||||||
|
url = "https://en.wikipedia.org/wiki/Encyclopedia"
|
||||||
|
data = requests.get(url).text
|
||||||
|
soup = BeautifulSoup(data,"html.parser")
|
||||||
|
|
||||||
|
for link in soup.find_all('a'):
|
||||||
|
print(link)
|
||||||
|
print(link.get('src'))
|
||||||
|
```
|
||||||
|
We can retrieve any data in the webpage by access through the tags.
|
|
@ -2,3 +2,4 @@
|
||||||
|
|
||||||
- [Section title](filename.md)
|
- [Section title](filename.md)
|
||||||
- [Introduction to Flask](flask.md)
|
- [Introduction to Flask](flask.md)
|
||||||
|
- [Web Scrapping Using Beautiful Soup](beautifulsoup.md)
|
||||||
|
|
Ładowanie…
Reference in New Issue