From f742fddff366a0eff24520c8ee292b81d2188d45 Mon Sep 17 00:00:00 2001
From: saiumasankar
Date: Wed, 5 Jun 2024 17:28:33 +0530
Subject: [PATCH] Adding the BeautifulSoup web scrapping

---
 contrib/web-scrapping/beautifulsoup.md | 206 +++++++++++++++++++++++++
 contrib/web-scrapping/index.md         |   1 +
 2 files changed, 207 insertions(+)
 create mode 100644 contrib/web-scrapping/beautifulsoup.md

diff --git a/contrib/web-scrapping/beautifulsoup.md b/contrib/web-scrapping/beautifulsoup.md
new file mode 100644
index 0000000..372b8f2
--- /dev/null
+++ b/contrib/web-scrapping/beautifulsoup.md
@@ -0,0 +1,206 @@
+# Beautiful Soup Library in Python
+
+## Table of Contents
+1. Introduction
+2. Prerequisites
+3. Setting Up Environment
+4. Beautiful Soup Objects
+5. Tags
+6. Children, Parent and Siblings
+7. Filter
+8. Web Scraping the Contents of a Web Page
+
+## 1. Introduction
+Beautiful Soup is a Python library for extracting data from HTML and XML files, commonly used for web scraping. It builds a parse tree from the page source, which you can navigate and search to pull out data.
+
+## 2. Prerequisites
+- Understanding of HTTP requests and responses.
+- Understanding of HTTP methods (in particular the GET method).
+- Basic knowledge of HTML.
+- Python installed on your machine (version 3.6 or higher).
+- pip (the Python package installer) installed.
+
+## 3. Setting Up Environment
+1. **Install the latest version of Python:** Go to python.org and download the latest version of Python.
+
+2. **Install an IDE:** Install any IDE of your choice to code. VS Code is preferred.
+> Note: You can use Google Colab instead, which requires neither a local Python installation nor an IDE.
+
+3. **Install Beautiful Soup:**
+```python
+# Notebook syntax (Jupyter/Colab); in a terminal, run the command without the leading "!".
+!pip install beautifulsoup4
+```
+
+## 4. Beautiful Soup Objects
+Beautiful Soup pulls data out of HTML and XML files by representing the document as a set of Python objects.
+
+Take the following HTML page as input:
+```
+<!DOCTYPE html>
+<html>
+<head>
+<title>Page Title</title>
+</head>
+<body>
+<h3><b>Title</b></h3>
+<p>This is the main context of the page</p>
+</body>
+</html>
+```
+
+Let's store the HTML code in a variable:
+```
+html = """<!DOCTYPE html>
+<html>
+<head>
+    <title>Page Title</title>
+</head>
+<body>
+<h3><b>Title</b></h3>
+<p>This is the main context of the page</p>
+</body>
+</html>"""
+```
+
+To parse the HTML document, pass the variable to the `BeautifulSoup` constructor.
+```
+from bs4 import BeautifulSoup
+soup = BeautifulSoup(html, 'html.parser')
+```
+Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
+
+## 5. Tags
+
+The Tag object corresponds to an HTML tag in the original document.
+
+```
+tag_obj = soup.title
+print("tag object:", tag_obj)
+print("tag object type:", type(tag_obj))
+```
+```
+Result:
+tag object: <title>Page Title</title>
+tag object type: <class 'bs4.element.Tag'>
+```
+
+## 6. Children, Parent and Siblings
+
+From a tag, we can navigate down to a child, up to its parent, or sideways to a sibling.
+
+```
+tag_obj = soup.h3
+child = tag_obj.b
+print(child)
+parent = child.parent
+print(parent)
+# find_next_sibling() skips the newline text node between <h3> and <p>,
+# which plain .next_sibling would return for this multi-line document
+sib = tag_obj.find_next_sibling()
+print(sib)
+```
+```
+Result:
+<b>Title</b>
+<h3><b>Title</b></h3>
+<p>This is the main context of the page</p>
+```
+
+> We need to name the child we want to navigate to, because a tag can have more than one child but only one parent and one next sibling.
+
+**Navigable String:** A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the `NavigableString` class to contain this text.
+
+```
+tagstr = child.string
+print(tagstr)
+```
+```
+Result:
+Title
+```
+
+## 7. Filter
+
+Filters let you search the tree for elements that match a pattern. The simplest filter is a string, such as a tag name.
+
+Consider the following HTML code:
+
+```
+<!DOCTYPE html>
+<html>
+<head>
+    <title>Sample Table</title>
+</head>
+<body>
+    <h1>Sample Table</h1>
+    <table>
+        <tr>
+            <th>Name</th>
+            <th>Age</th>
+            <th>City</th>
+        </tr>
+        <tr>
+            <td>John Doe</td>
+            <td>28</td>
+            <td>New York</td>
+        </tr>
+        <tr>
+            <td>Jane Smith</td>
+            <td>34</td>
+            <td>Los Angeles</td>
+        </tr>
+        <tr>
+            <td>Emily Jones</td>
+            <td>23</td>
+            <td>Chicago</td>
+        </tr>
+    </table>
+</body>
+</html>
+```
+Store the above code in a variable.
+```
+table = """<!DOCTYPE html>
+<html>
+<head>
+    <title>Sample Table</title>
+..............
+"""
+
+table_bs = BeautifulSoup(table, 'html.parser')
+```
+**Find All:** The `find_all()` method looks through a tag’s descendants and retrieves all descendants that match your filters. It returns the matching objects in a list-like `ResultSet`, whose items we can access by index. For example, `table_bs.find_all('tr')` returns every row of the table, and `table_bs.find_all('tr')[0]` is the first of them.
+
+## 8. Web Scraping the Contents of a Web Page
+
+1. Get the URL of the webpage we need to scrape.
+
+2. Fetch the HTML content with an HTTP GET request (here, `requests.get`).
+
+3. Pass the HTML content to the `BeautifulSoup` constructor to create a `BeautifulSoup` object.
+
+4. Access the HTML elements by tag name.
+
+The code below extracts the `img` tags from a Wikipedia page.
+
+```
+!pip install beautifulsoup4
+import requests
+from bs4 import BeautifulSoup
+
+url = "https://en.wikipedia.org/wiki/Encyclopedia"
+data = requests.get(url).text
+soup = BeautifulSoup(data, "html.parser")
+
+# Print every <img> tag on the page, followed by its image URL
+for img in soup.find_all('img'):
+    print(img)
+    print(img.get('src'))
+```
+We can retrieve any data in the webpage by navigating through its tags in this way.
diff --git a/contrib/web-scrapping/index.md b/contrib/web-scrapping/index.md
index 276014e..29f558a 100644
--- a/contrib/web-scrapping/index.md
+++ b/contrib/web-scrapping/index.md
@@ -2,3 +2,4 @@
 - [Section title](filename.md)
 - [Introduction to Flask](flask.md)
+- [Web Scraping Using Beautiful Soup](beautifulsoup.md)