diff --git a/contrib/web-scrapping/beautifulsoup.md b/contrib/web-scrapping/beautifulsoup.md
new file mode 100644
index 0000000..372b8f2
--- /dev/null
+++ b/contrib/web-scrapping/beautifulsoup.md
@@ -0,0 +1,206 @@
+# Beautiful Soup Library in Python
+
+## Table of Contents
+1. Introduction
+2. Prerequisites
+3. Setting Up Environment
+4. Beautiful Soup Objects
+5. Tags
+6. Children, Parents and Siblings
+7. Filter
+8. Web Scraping the Contents of a Web Page
+
+## 1. Introduction
+Beautiful Soup is a Python library for web scraping: it extracts data from HTML and XML files. It builds a parse tree from the parsed page, which makes the data easy to navigate and extract.
+
+## 2. Prerequisites
+- Understanding of HTTP requests and responses.
+- Understanding of HTTP methods (the GET method).
+- Basic knowledge of HTML.
+- Python installed on your machine (version 3.6 or higher).
+- pip (Python package installer) installed.
+
+## 3. Setting Up Environment
+1. **Install the latest version of Python:** Go to python.org and download the latest version of Python.
+
+2. **Install an IDE:** Install any IDE of your choice. VS Code is preferred.
+> Note: You can use Google Colab without installing Python or an IDE.
+
+3. **Install Beautiful Soup:**
+```python
+!pip install bs4  # notebook syntax; in a terminal, run "pip install bs4"
+```
+
+## 4. Beautiful Soup Objects
+Beautiful Soup pulls data out of HTML and XML files by representing the document as a set of Python objects.
+
+Take the following HTML page as input:
+```
+
+
+
+This is the main context of the page
+
+
+```
+
+Let's store the HTML code in a variable:
+```
+html = """
+
+
+This is the main context of the page
+
+"""
+```
+
+To parse the HTML document, pass the variable to the `BeautifulSoup` constructor.
+```
+from bs4 import BeautifulSoup
+soup = BeautifulSoup(html, 'html.parser')
+```
+Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
+
+## 5. Tags
+
+A `Tag` object corresponds to an HTML tag in the original document.
+
+```
+tag_obj = soup.title
+print("tag object:", tag_obj)
+print("tag object type:", type(tag_obj))
+```
+```
+Result:
+tag object:This is the main context of the page
+```
+
+> You must specify which child to navigate to, because a tag can have several children but only one parent and one next sibling.
+
+**Navigable String:** A string corresponds to a bit of text or content within a tag. Beautiful Soup uses the `NavigableString` class to contain this text.
+
+```
+tagstr = child.string
+print(tagstr)
+```
+```
+Result:
+Title
+```
+
+## 7. Filter
+
+Filters are passed to search methods such as `find_all()` to match elements in the parse tree. The simplest filter is a string, which matches tags by name.
+
+Consider the following HTML code:
+
+```
+<table>
+  <tr>
+    <th>Name</th>
+    <th>Age</th>
+    <th>City</th>
+  </tr>
+  <tr>
+    <td>John Doe</td>
+    <td>28</td>
+    <td>New York</td>
+  </tr>
+  <tr>
+    <td>Jane Smith</td>
+    <td>34</td>
+    <td>Los Angeles</td>
+  </tr>
+  <tr>
+    <td>Emily Jones</td>
+    <td>23</td>
+    <td>Chicago</td>
+  </tr>
+</table>