From b2643ad26d31b8e70e3bb41842745bb9f3129716 Mon Sep 17 00:00:00 2001
From: saiumasankar
Date: Sat, 8 Jun 2024 15:03:43 +0530
Subject: [PATCH] deleting unnecessary

---
 contrib/web-scrapping/beautifulsoup.md | 206 -------------------------
 contrib/web-scrapping/index.md         |   1 -
 2 files changed, 207 deletions(-)
 delete mode 100644 contrib/web-scrapping/beautifulsoup.md

diff --git a/contrib/web-scrapping/beautifulsoup.md b/contrib/web-scrapping/beautifulsoup.md
deleted file mode 100644
index 372b8f2..0000000
--- a/contrib/web-scrapping/beautifulsoup.md
+++ /dev/null
@@ -1,206 +0,0 @@
-# Beautiful Soup Library in Python
-
-## Table of Contents
-1. Introduction
-2. Prerequisites
-3. Setting Up Environment
-4. Beautiful Soup Objects
-5. Tag object
-6. Children, Parents and Siblings
-7. Filter: Findall method
-8. Web Scraping the Contents of a Web Page
-
-## 1. Introduction
-Beautiful Soup is a Python library used for web scraping purposes to extract data from HTML and XML files. It creates a parse tree for parsed pages that can be used to extract data easily.
-
-## 2. Prerequisites
-- Understanding of HTTP Requests and Responses.
-- Understanding of HTTP methods (GET method).
-- Basic Knowledge on HTML.
-- Python installed on your machine (version 3.6 or higher).
-- pip (Python package installer) installed.
-
-## 3. Setting Up Environment
-1.**Install latest version of Python:** Go to Python.org and Download the latest version of Python.
-
-2.**Install a IDE:** Install any IDE of your choice to code. VS code is preferred.
->Note: You can use Google Colab without installing Python and IDE.
-
-3.**Install Beautiful Soup:**
-```python
-!pip install bs4
-```
-
-## 4. Beautiful Soup Objects
-Beautiful Soup is a Python library for pulling data out of HTML and XML files. This can be done by presenting the HTML as a set of objects.
-
-Take the below HTML page as input
-```
-
-
-
-Page Title
-
-
-

Title

-

This is the main context of the page

-
-
-```
-
-lets store the HTML code in a variable
-```
-html= "
-
-
- Page Title
-
-
-

Title

-

This is the main context of the page

-
- "
-```
-
-To parse the HTML document, pass the variable to the `BeautifulSoup` Constructor.
-```
-soup = BeautifulSoup(html,'html.parser')
-```
-Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
-
-## 5. Tags
-
-The Tag object corresponds to an HTML tag in the original document.
-
-```
-tag_obj = soup.title
-print("tag object:", tag_obj)
-print("tag object type:",type(tag_obj))
-```
-```
-Result:
-tag object: Page Title
-
-tag object type:
-```
-
-## 6. Children, Parent and Siblings
-
-We can access the child of the tag or navigate down.
-
-```
-tag_obj = soup.h3
-child = tag_obj.b
-print(child)
-parent = child.parent
-print(parent)
-sib = tag_obj.next_sibling
-print(sib);
-```
-```
-Result:
- Title
-

Title

-

This is the main context of the page

-```
-
-> We need to mention the child to which we want to navigate. It is because there can be more than one child to a tag but only one parent and next sibling.
-
-**Navigable String:** A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the NavigableString class to contain this text.
-
-```
-tagstr = child.string
-print(tagstr)
-```
-```
-Result:
-Title
-```
-
-## 7. Filter
-
-Filters allow you to find complex patterns, the simplest filter is a string.
-
-Consider the following HTML code:
-
-```
-
-
-
-
-
- Sample Table
-
-
-

Sample Table

- - - - - - - - - - - - - - - - - - - - - - - - - -
NameAgeCity
John Doe28New York
Jane Smith34Los Angeles
Emily Jones23Chicago
-
-
-
-```
-Store the above code in a variable.
-```
-table = "
-
-
-
-
- Sample Table
-..............
-
-table_bs = BeautifulSoup(table, 'html.parser')
-"
-```
-**Find All:** The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters. It places all the objects in a list. We can access by indexing.
-
-## 8. Web Scraping the Contents of a Web Page
-
-1. Get the URL of the webpage we need to scrape.
-
-2. Get the HTML content by using `get` method in HTTP request.
-
-3. Pass the HTML content to the BeautifulSoup Constructor to prepare an object.
-
-4. Access the HTML elements using Tag name.
-
-Below is the code to extract the img tags in a wikipedia page.
-
-```
-import requests
-!pip install bs4
-from bs4 import BeautifulSoup
-
-url = "https://en.wikipedia.org/wiki/Encyclopedia"
-data = requests.get(url).text
-soup = BeautifulSoup(data,"html.parser")
-
-for link in soup.find_all('a'):
-    print(link)
-    print(link.get('src'))
-```
-We can retrieve any data in the webpage by access through the tags.
diff --git a/contrib/web-scrapping/index.md b/contrib/web-scrapping/index.md
index 29f558a..276014e 100644
--- a/contrib/web-scrapping/index.md
+++ b/contrib/web-scrapping/index.md
@@ -2,4 +2,3 @@
 
 - [Section title](filename.md)
 - [Introduction to Flask](flask.md)
-- [Web Scrapping Using Beautiful Soup](beautifulsoup.md)