Learn Python – Web Scraping Using Python: Basic and Advanced

What is Web Scraping?

Web scraping is a technique for extracting large amounts of data from websites. The term “scraping” refers to obtaining data from another source (web pages) and saving it to a local file. For example, suppose you are working on a project called “Phone comparing website,” where you need the prices, ratings, and model names of mobile phones to compare different models. Collecting these details by checking several sites by hand would take a lot of time. In that case, web scraping plays an important role: by writing a few lines of code, you can get the desired results.

Web scraping extracts data from websites in an unstructured format. It helps to collect this unstructured data and convert it into a structured form.

Startups prefer web scraping because it is a cheap and effective way to get a large amount of data without any partnership with a data-selling company.

Is Web Scraping legal?

Here the question arises whether web scraping is legal or not. The answer is that some websites allow it when it is used legally. Web scraping is just a tool; you can use it in the right way or the wrong way.

Web scraping is illegal if someone tries to scrape nonpublic data. Nonpublic data is not accessible to everyone; if you try to extract such data, it is a violation of the legal terms.

There are several tools available to scrape data from websites, such as:

Scraping-bot

Scraper API

Octoparse

Import.io

Webhose.io

Dexi.io

Outwit

Diffbot

Content Grabber

Mozenda

Web Scraper Chrome Extension

Why Web Scraping?

As discussed above, web scraping is used to extract data from websites. But we also need to understand how to use that raw data, which can be applied in various fields. Let’s have a look at the uses of web scraping:

Dynamic Price Monitoring

It is widely used to collect data from several online shopping sites, compare the prices of products, and make profitable pricing decisions. Price monitoring using web-scraped data gives companies the ability to know the market situation and facilitates dynamic pricing. It helps companies stay competitive.

Market Research

Web scraping is well suited to market trend analysis. It provides insights into a particular market. Large organizations require a great deal of data, and web scraping provides that data with a good level of reliability and accuracy.

Email Gathering

Many companies use personal email data for email marketing. They can target a specific audience for their marketing.

News and Content Monitoring

A single news cycle can create an outstanding effect or a genuine threat to your business. If your company depends on news analysis, or frequently appears in the news, web scraping offers a practical way to monitor and parse the most important stories. News articles and social media platforms can directly influence the stock market.

Social Media Scraping

Web scraping plays an essential role in extracting data from social media websites such as Twitter, Facebook, and Instagram, to find the trending topics.

Research and Development

Large data sets such as general information, statistics, and temperature are scraped from websites, then analyzed and used to carry out surveys or research and development.

Why use Python for Web Scraping?

There are other popular programming languages, so why do we choose Python over them for web scraping? Below is a list of Python’s features that make it the most useful programming language for web scraping.

Dynamically Typed

In Python, we do not need to declare data types for variables; we can directly use a variable wherever it is required. It saves time and makes the task faster. Python determines the type of a variable from the value assigned to it at run time.
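
A minimal sketch of what dynamic typing means in practice. The variable name value is purely illustrative; the same name is bound to values of three different types without any declaration.

```python
# The same name can be bound to values of different types;
# no type declaration is needed.
value = 42
print(type(value).__name__)   # int

value = "42 phones"
print(type(value).__name__)   # str

value = [42, "phones"]
print(type(value).__name__)   # list
```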

Vast collection of libraries

Python has an extensive range of libraries such as NumPy, Matplotlib, Pandas, SciPy, etc., that provide the flexibility to work for a variety of purposes. It is suitable for almost every emerging field, and also for web scraping, where data is extracted and manipulated.

Less Code

The purpose of web scraping is to save time. But what if you spend more time writing the code? That’s why we use Python, as it can perform a task in a few lines of code.

Open-Source Community

Python is open-source, which means it is freely available to everyone. It has one of the biggest communities in the world, where you can seek help if you get stuck somewhere in your Python code.

The basics of web scraping

Web scraping consists of two parts: a web crawler and a web scraper. In simple words, the web crawler is a horse, and the scraper is the chariot. The crawler leads the scraper, which extracts the requested data. Let’s understand these two components of web scraping:

The crawler

A web crawler is generally called a “spider.” It is an automated program that browses the internet to index and search for content by following the given links. It searches for the relevant information requested by the programmer.

The scraper

A web scraper is a dedicated tool designed to extract data from several websites quickly and effectively. Web scrapers vary widely in design and complexity, depending on the project.
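
To make the split between the two parts concrete, here is a minimal sketch that uses only the Python standard library. The PAGES dictionary is a hypothetical in-memory "website" standing in for real HTTP requests, and the names PageParser and crawl are illustrative, not part of any library. The crawler follows links from page to page; the scraper pulls the h1 text out of each page it is handed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "website": URL -> HTML (stands in for real HTTP requests)
PAGES = {
    "https://example.com/": '<h1>Home</h1><a href="/phones">Phones</a>',
    "https://example.com/phones": '<h1>Phones</h1><a href="/">Back</a>',
}


class PageParser(HTMLParser):
    """Collects link targets (for the crawler) and <h1> text (for the scraper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.headings = []
        self._in_h1 = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "h1":
            self._in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_h1 = False

    def handle_data(self, data):
        if self._in_h1:
            self.headings.append(data.strip())


def crawl(start_url):
    """Visit every reachable page once and record the headings found on it."""
    to_visit, seen, scraped = [start_url], set(), {}
    while to_visit:
        url = to_visit.pop(0)
        if url in seen or url not in PAGES:
            continue
        seen.add(url)
        parser = PageParser()
        parser.feed(PAGES[url])
        scraped[url] = parser.headings                                 # scraping step
        to_visit.extend(urljoin(url, link) for link in parser.links)  # crawling step
    return scraped


result = crawl("https://example.com/")
print(result)
```

The seen set is what keeps the crawler from looping forever when two pages link to each other, as they do here.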

How does Web Scraping work?

The following steps are used to perform web scraping. Let’s understand how web scraping works.

Step – 1: Find the URL that you want to scrape

First, you need to understand what data your project requires. A webpage or website contains a large amount of information, so scrape only the relevant information. In simple words, the developer should be familiar with the data requirement.

Step – 2: Inspecting the Page

The data is extracted in raw HTML format, which must be carefully parsed to reduce the noise in the raw data. In some cases, the data can be as simple as a name and address or as complex as high-dimensional weather and stock market data.

Step – 3: Write the code

Write code to extract the information, provide the relevant inputs, and run the code.

Step – 4: Store the data in a file

Store that information in the required file format, such as CSV, XML, or JSON.
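
As a rough sketch of the storage step, the snippet below writes the same records as CSV and as JSON using only the standard library. The records list is hypothetical sample data of the kind steps 1 to 3 might produce, and an in-memory buffer is used instead of a real file so the example has no side effects.

```python
import csv
import io
import json

# Hypothetical records, as they might look after steps 1-3
records = [
    {"name": "Phone A", "price": 19999, "rating": 4.5},
    {"name": "Phone B, 128 GB", "price": 24999, "rating": 4.2},
]

# CSV: csv.DictWriter quotes fields that contain commas
csv_buffer = io.StringIO()   # swap for open("phones.csv", "w", newline="") to write a file
writer = csv.DictWriter(csv_buffer, fieldnames=["name", "price", "rating"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()
print(csv_text)

# JSON: the same records as a list of objects
json_text = json.dumps(records, indent=2)
print(json_text)
```

Using the csv module rather than joining strings by hand means a comma inside a product name does not break the column layout.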

Getting Started with Web Scraping

Python has a vast collection of libraries and also provides very useful libraries for web scraping. Let’s understand the libraries required for Python.

Libraries used for web scraping

Selenium: Selenium is an open-source automated testing library. It is used to automate browser activities. To install this library, type the following command in your terminal.

pip install selenium  

Note – It is good to use the PyCharm IDE.

Pandas

The Pandas library is used for data manipulation and analysis. It is used to extract the data and store it in the desired format.

BeautifulSoup

Let’s understand the BeautifulSoup library in detail.

Installation of BeautifulSoup

You can install BeautifulSoup by typing the following command:

pip install beautifulsoup4

Installing a parser

BeautifulSoup supports the HTML parser included in Python’s standard library as well as several third-party Python parsers. You can install any of them according to your needs. The list of BeautifulSoup’s parsers is the following:

Parser                   Typical usage
Python’s html.parser     BeautifulSoup(markup, "html.parser")
lxml’s HTML parser       BeautifulSoup(markup, "lxml")
lxml’s XML parser        BeautifulSoup(markup, "lxml-xml")
html5lib                 BeautifulSoup(markup, "html5lib")

We recommend installing the html5lib parser because it works well with newer versions of Python; alternatively, you can install the lxml parser.

Type the following command in your terminal:

pip install html5lib  

BeautifulSoup is used to transform a complex HTML document into a tree of Python objects. A few essential types of object are used most often:

Tag

A Tag object corresponds to an XML or HTML tag in the original document.

import bs4

soup = bs4.BeautifulSoup('<b id="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)

Output:

<class 'bs4.element.Tag'>

A Tag has a lot of attributes and methods, but the most important features of a tag are its name and its attributes.

Name

Every tag has a name, accessible as .name:

tag.name  

Attributes

A tag may have any number of attributes. The tag <b id="boldest"> has an attribute “id” whose value is “boldest”. We can access a tag’s attributes by treating the tag as a dictionary.

tag['id']

We can add, remove, and modify a tag’s attributes. This is done by treating the tag as a dictionary.

# add or modify attributes
tag['id'] = 'verybold'
tag['another-attribute'] = 1
tag
# delete an attribute
del tag['id']

Multi-valued Attributes

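In HTML5, there are some attributes that can have multiple values. The class attribute (a tag can have more than one CSS class) is the most common multi-valued attribute. Other such attributes are rel, rev, accept-charset, headers, and accesskey. When a document is parsed as XML, attributes are not treated as multi-valued by default, but the multi_valued_attributes argument can turn this on: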

from bs4 import BeautifulSoup

# The 'xml' parser requires the lxml package
class_is_multi = {'*': 'class'}
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml', multi_valued_attributes=class_is_multi)
xml_soup.p['class']
# ['body', 'strikeout']
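
For comparison, here is a small sketch of the default behavior when the same kind of markup is parsed as HTML. It assumes only the built-in "html.parser", so no extra parser needs to be installed; the id value "my id" is an illustrative example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="body strikeout" id="my id"></p>', "html.parser")

# class is defined as multi-valued in HTML, so it comes back as a list
print(soup.p["class"])   # ['body', 'strikeout']

# id is not multi-valued, so the value stays a single string
print(soup.p["id"])      # my id

# get_attribute_list() always returns a list, whichever kind of attribute it is
print(soup.p.get_attribute_list("id"))   # ['my id']
```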

NavigableString

A string in BeautifulSoup refers to the text inside a tag. BeautifulSoup uses the NavigableString class to contain these bits of text.

tag.string
# 'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

A string is immutable, which means it cannot be edited in place. But it can be replaced with another string using replace_with().

tag.string.replace_with("No longer bold")  
tag  

In some cases, if you want to use a NavigableString outside BeautifulSoup, calling str() on it (unicode() in Python 2) turns it into an ordinary Python string.
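
A short sketch of that conversion, assuming Python 3 and the built-in "html.parser". The names navigable and plain are illustrative. The plain copy no longer carries a reference to the parse tree, which is the usual reason for converting.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Extremely bold</b>", "html.parser")

navigable = soup.b.string   # still linked to the parse tree
plain = str(navigable)      # an ordinary, independent Python string

print(type(navigable).__name__)   # NavigableString
print(type(plain).__name__)       # str
print(plain == navigable)         # True, the text itself is unchanged
```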

BeautifulSoup object

The BeautifulSoup object represents the parsed document as a whole. In many cases, we can use it as a Tag object. This means it supports most of the methods described in navigating the tree and searching the tree.

from bs4 import BeautifulSoup

# The "xml" parser requires the lxml package
doc = BeautifulSoup("<document><content/>INSERT FOOTER HERE</document>", "xml")
footer = BeautifulSoup("<footer>Here's the footer</footer>", "xml")
doc.find(string="INSERT FOOTER HERE").replace_with(footer)
print(doc)

Output:

<?xml version="1.0" encoding="utf-8"?>
<document><content/><footer>Here's the footer</footer></document>

Web Scraping Example:

Let’s take an example to understand scraping in practice by extracting data from a webpage and inspecting the whole page.

First, open your preferred page on Wikipedia and inspect the whole page. Before extracting data from the webpage, make sure you know what you need. Consider the following code:

# Importing the BeautifulSoup library

import bs4
import requests

# Creating the request

res = requests.get("https://en.wikipedia.org/wiki/Machine_learning")
print("The object type:", type(res))

# Convert the response text to a BeautifulSoup object
soup = bs4.BeautifulSoup(res.text, 'html5lib')
print("The object type:", type(soup))

Output:

The object type: <class 'requests.models.Response'>
The object type: <class 'bs4.BeautifulSoup'>

In the following lines of code, we extract all headings of a webpage by class name. Here front-end knowledge plays an essential role in inspecting the webpage.

# Select every element whose class is "mw-headline" and print its text
for i in soup.select('.mw-headline'):
    print(i.text, end=',')

Output:

Overview,Machine learning tasks,History and relationships to other fields,Relation to data mining,Relation to optimization,Relation to statistics, Theory,Approaches,Types of learning algorithms,Supervised learning,Unsupervised learning,Reinforcement learning,Self-learning,Feature learning,Sparse dictionary learning,Anomaly detection,Association rules,Models,Artificial neural networks,Decision trees,Support vector machines,Regression analysis,Bayesian networks,Genetic algorithms,Training models,Federated learning,Applications,Limitations,Bias,Model assessments,Ethics,Software,Free and open-source software,Proprietary software with free and open-source editions,Proprietary software,Journals,Conferences,See also,References,Further reading,External links,

In the above code, we imported bs4 and the requests library. We then created a res object by sending a request to the webpage. As you can see, we have extracted all headings from the webpage.
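
Class names on a live site can change over time, so it helps to try a selector on a small piece of HTML first. The sketch below does that offline; the html string is a hypothetical fragment that reuses the mw-headline class name, and the built-in "html.parser" is assumed so nothing extra needs installing.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment; the class name mirrors the selector used above
html = """
<h2><span class="mw-headline">Overview</span></h2>
<p>Some text.</p>
<h2><span class="mw-headline">History</span></h2>
<h3><span class="other">Not a headline</span></h3>
"""

soup = BeautifulSoup(html, "html.parser")

# select() takes a CSS selector; ".mw-headline" matches any tag with that class
headings = [tag.get_text() for tag in soup.select(".mw-headline")]
print(headings)   # ['Overview', 'History']

# find_all() with class_ is an equivalent way to filter by class
same = [tag.get_text() for tag in soup.find_all(class_="mw-headline")]
print(same == headings)   # True
```

If a selector returns an empty list on the real page, that is usually a sign the page markup differs from what was inspected, and the page should be inspected again.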

Figure: Wikipedia’s “Machine learning” webpage

Let’s look at another example; we will make a GET request to the URL and create a parse tree object (soup) using BeautifulSoup and the third-party “html5lib” parser.

Here we will scrape the webpage at the given link (https://www.javatpoint.com/). Consider the following code:

# importing the libraries  
from bs4 import BeautifulSoup  
import requests  
  
url="https://www.javatpoint.com/"  
  
# Make a GET request to fetch the raw HTML content  
html_content = requests.get(url).text  
  
# Parse the html content  
soup = BeautifulSoup(html_content, "html5lib")  
print(soup.prettify()) # print the parsed data of html  

The above code will display all the HTML code of the javatpoint homepage.

Using the BeautifulSoup object, i.e. soup, we can collect the required information. Let’s print some interesting information using the soup object:

Let’s print the title of the web page.

print(soup.title)  

Output: It will give the following output:

<title>Tutorials List - Javatpoint</title>

In the above output, the HTML tag is included with the title. If you want the text without the tag, you can use the following code:

print(soup.title.text)  

Output: It will give the following output:

Tutorials List - Javatpoint

We can get every link on the page along with its attributes, such as href and title, and its inner text. Consider the following code:

for link in soup.find_all("a"):
    print("Inner Text is: {}".format(link.text))
    print("Title is: {}".format(link.get("title")))
    print("href is: {}".format(link.get("href")))

Output: It will print all links along with their attributes. Here we display a few of them:

href is: https://www.facebook.com/javatpoint
Inner Text is: 
Title is: None
href is: https://twitter.com/pagejavatpoint
Inner Text is: 
Title is: None
href is: https://www.youtube.com/channel/UCUnYvQVCrJoFWZhKK3O2xLg
Inner Text is: 
Title is: None
href is: https://javatpoint.blogspot.com
Inner Text is: Learn Java
Title is: None
href is: https://www.javatpoint.com/java-tutorial
Inner Text is: Learn Data Structures
Title is: None
href is: https://www.javatpoint.com/data-structure-tutorial
Inner Text is: Learn C Programming
Title is: None
href is: https://www.javatpoint.com/c-programming-language-tutorial
Inner Text is: Learn C++ Tutorial
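
Links on a real page are often relative, and some anchor tags have no href at all. The following sketch, run on a hypothetical HTML fragment rather than the live site, collects the links into dictionaries, skips anchors without an href, and resolves relative links against an assumed base_url with the standard library’s urljoin.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.javatpoint.com/"   # page the HTML is assumed to come from
html = """
<a href="/java-tutorial" title="Java">Learn Java</a>
<a href="https://twitter.com/pagejavatpoint"></a>
<a name="top">No href here</a>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for link in soup.find_all("a", href=True):       # skip <a> tags without an href
    links.append({
        "text": link.get_text(strip=True),
        "title": link.get("title"),               # None when the attribute is missing
        "href": urljoin(base_url, link["href"]),  # make relative links absolute
    })

for item in links:
    print(item)
```

Keeping each link as a dictionary makes it straightforward to pass the result on to a CSV or JSON writer later.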

Demo: Scraping Data from Flipkart Website

In this example, we will scrape mobile phone prices, ratings, and model names from Flipkart, which is one of the popular e-commerce websites. The following are the prerequisites to accomplish this task:

Prerequisites:

Python 3.x (the code below uses urllib.request, which is not available in Python 2) with the Selenium, BeautifulSoup, and Pandas libraries installed.

Google Chrome browser

A parser such as html.parser, lxml, etc.

Step – 1: Find the desired URL to scrape

The initial step is to find the URL that you want to scrape. Here we are extracting mobile phone details from Flipkart. The URL of this page is https://www.flipkart.com/search?q=iphones&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off.
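
It can help to see how such a search URL is put together. The sketch below takes the URL apart with the standard library’s urllib.parse and then builds a similar URL for a different search term. The term "android phones" and the reduced parameter set are illustrative assumptions; whether the site accepts a URL with fewer parameters is not verified here.

```python
from urllib.parse import urlsplit, parse_qs, urlencode

url = ("https://www.flipkart.com/search?q=iphones&otracker=search"
       "&otracker1=search&marketplace=FLIPKART&as-show=on&as=off")

# Split the URL into its parts and decode the query string
parts = urlsplit(url)
params = parse_qs(parts.query)
print(parts.netloc)   # www.flipkart.com
print(parts.path)     # /search
print(params["q"])    # ['iphones']

# Build the same kind of URL for a different, hypothetical search term
query = urlencode({"q": "android phones", "marketplace": "FLIPKART"})
new_url = f"{parts.scheme}://{parts.netloc}{parts.path}?{query}"
print(new_url)
```

urlencode takes care of escaping, so the space in the search term becomes a plus sign in the final URL.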

Step – 2: Inspecting the page

It is necessary to inspect the page carefully because the data is usually contained within tags. So we need to inspect the page to select the desired tag. To inspect the page, right-click on the element and click “Inspect”.

Step – 3: Find the data for extracting

Extract the Price, Name, and Rating, each of which is contained in a “div” tag.

Step – 4: Write the Code

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

# Request from the webpage
myurl = "https://www.flipkart.com/search?q=iphones&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off"

uClient = uReq(myurl)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, features="html.parser")

# This variable holds every product container found on the page.
# The class names below are generated by the site and may differ on the live page.
containers = page_soup.find_all("div", {"class": "_3O0U0u"})
# container = containers[0]
# print(soup.prettify(container))
#
# price = container.find_all("div", {"class": "col col-5-12 _2o7WAb"})
# print(price[0].text)
#
# ratings = container.find_all("div", {"class": "niH0FQ"})
# print(ratings[0].text)
#
# print(len(containers))
# print(container.div.img["alt"])

# Creating the CSV file that will store all data
filename = "product1.csv"
f = open(filename, "w", encoding="utf-8")

headers = "Product_Name,Pricing,Ratings\n"
f.write(headers)

for container in containers:
    product_name = container.div.img["alt"]

    price_container = container.find_all("div", {"class": "col col-5-12 _2o7WAb"})
    price = price_container[0].text.strip()

    rating_container = container.find_all("div", {"class": "niH0FQ"})
    ratings = rating_container[0].text

    # print("product_name:" + product_name)
    # print("price:" + price)
    # print("ratings:" + str(ratings))

    # Remove thousands separators, replace the rupee sign with "Rs",
    # and drop any trailing text starting at "E" (presumably an EMI note)
    edit_price = ''.join(price.split(','))
    sym_rupee = edit_price.split("₹")
    add_rs_price = "Rs" + sym_rupee[1]
    split_price = add_rs_price.split("E")
    final_price = split_price[0]

    # Keep only the first space-separated token of the rating text
    split_rating = str(ratings).split(" ")
    final_rating = split_rating[0]

    print(product_name.replace(",", "|") + "," + final_price + "," + final_rating + "\n")
    f.write(product_name.replace(",", "|") + "," + final_price + "," + final_rating + "\n")

f.close()

Output:

We scraped the details of the iPhones and saved them in the CSV file, as you can see in the output. In the above code, we commented out a few lines that were used for testing. You can uncomment these lines and observe the output.
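
The price and rating clean-up in the loop depends on the exact text the site returns. As one possible alternative, the sketch below does the same clean-up with regular expressions and returns None instead of failing when the expected pattern is missing. The function names clean_price and clean_rating and the sample strings are hypothetical; they are not taken from the site.

```python
import re


def clean_price(raw):
    """Return the first rupee amount in raw as an int, or None if there is none."""
    match = re.search(r"₹\s*(\d[\d,]*)", raw)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


def clean_rating(raw):
    """Return the leading number in raw as a float, or None if there is none."""
    match = re.match(r"\s*(\d+(?:\.\d+)?)", raw)
    return float(match.group(1)) if match else None


# Hypothetical sample strings, shaped like the text a product card might contain
print(clean_price("₹49,999"))              # 49999
print(clean_price("₹64,900No Cost EMI"))   # 64900
print(clean_price("Out of stock"))         # None
print(clean_rating("4.6 7,890 Ratings"))   # 4.6
```

Returning numbers rather than strings also makes the saved data easier to sort and compare later, for example with Pandas.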

In this tutorial, we have discussed all the basic concepts of web scraping and walked through a sample scrape of the leading online e-commerce site Flipkart.