29th Jun 2022 8 minutes read

Web Scraping With Python Libraries

Luke Hande
  • python
  • web scraping

Here are some useful Python libraries to get you started in web scraping.

Looking for Python website scrapers? In this article, we will get you started with some helpful libraries for Python web scraping. You'll find the tools and the inspiration to kickstart your next web scraping project.

Web scraping is the process of extracting information from the source code of a web page. This may be text, numerical data, or even images. It is the first step for many interesting projects! However, there is no fixed technology or methodology for Python web scraping. The best approach is very use-case dependent.

This article is aimed at people with a little more experience in Python and data analysis. If you're new to Python and need some learning material, take a look at this track, which gives you a background in data analysis.

Let's get started!

Requests

The first step in the process is to get data from the web page we want to scrape. The requests library is used for making HTTP requests to a URL.

As an example, let's say we're interested in getting an article from the learnpython.com blog. To import the library and get the page just requires a few lines of code:

>>> import requests
>>> url = 'https://learnpython.com/blog/python-match-case-statement/'
>>> r = requests.get(url)

The object r is the response from the host server and contains the results of the get() request. To see whether the request was successful, check the status with r.status_code. Hopefully, we don't see the dreaded 404! Be aware of the equally vexing 403 error as well; luckily, it is something you have more control over, since it usually comes from anti-scraping measures rather than a missing page. It is possible to customize the get() request with some optional arguments to modify the response from the server. For more information on this library, including how to send a customized request, take a look at the documentation and user guide.
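One common customization is setting a User-Agent header, which often helps with 403 responses. As a small sketch, we can build such a request without actually sending it and inspect exactly what would go over the wire; the header value and query parameter below are purely illustrative:

```python
import requests

url = 'https://learnpython.com/blog/python-match-case-statement/'

# Build a customized GET request without sending it. A browser-like
# User-Agent is a common way to avoid 403 responses; the value here
# is only an illustration.
req = requests.Request(
    'GET', url,
    headers={'User-Agent': 'my-scraper/1.0'},
    params={'utm_source': 'example'},   # optional query parameters
)
prepared = req.prepare()

print(prepared.method)                  # GET
print(prepared.headers['User-Agent'])   # my-scraper/1.0
# prepared.url now ends with ?utm_source=example
```

To actually send the customized request, pass the same arguments to requests.get(url, headers=..., params=...).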

To get the contents of the web page, we simply need to do the following:

>>> page_text = r.text

This returns the contents of the whole page as a string. From here, we may try to manually extract the required information, but that is messy and error-prone. Thankfully, there is an easier way.

Beautiful Soup

Beautiful Soup is a user-friendly library that automatically parses HTML and XML documents into a tree structure. Note that it only parses data; we still need another library, such as requests, to fetch it, as we saw in the previous section.

The library also provides functions for navigating, searching, and modifying the parsed data. Trying different parsing strategies is very easy, and we do not need to worry about document encodings.

We can use this library to parse the HTML-formatted string from the data we have retrieved and extract the information we want. Let's import the library and start making some soup:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(page_text, 'html.parser')

We now have a BeautifulSoup object, which represents the string as a nested data structure. How to proceed from here depends on what information we want to scrape from the page. That may be the text, the code snippets, the headings, or anything else.

To get a sense of how the information is represented, open the URL in your favorite browser and take a look at the source code behind the web page. It looks something like this:

[Screenshot: the HTML source code of the page]

Let's say we want to scrape the Python code snippets from the HTML source code. Notice they always appear between <pre class="brush: python; title: ; notranslate"> and </pre>. We can use this to extract the Python code from the soup as follows:

>>> string = soup.find(class_="brush: python; title: ; notranslate").text

Here, we use the find() method, which extracts only the first match. If you want to find all matches, use find_all() to return a list-like data structure that can be indexed like normal.
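To see the difference between find() and find_all() without fetching the live page, here is a self-contained sketch with inline HTML standing in for the article's source code:

```python
from bs4 import BeautifulSoup

# Inline HTML standing in for the scraped page; the class string
# matches the one used on the blog.
html = """
<pre class="brush: python; title: ; notranslate">print("first")</pre>
<p>Some surrounding text.</p>
<pre class="brush: python; title: ; notranslate">print("second")</pre>
"""
soup = BeautifulSoup(html, 'html.parser')

# find() returns only the first matching tag...
first = soup.find(class_="brush: python; title: ; notranslate").text

# ...while find_all() returns every match in a list-like object.
snippets = [tag.text for tag in
            soup.find_all(class_="brush: python; title: ; notranslate")]

print(first)          # print("first")
print(len(snippets))  # 2
```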

Now, we have the code snippet as a string including newline characters and spaces to indent the code. To run this code, we have to clean it up a little to remove unwanted characters and save it to a .py file. For example, we can use string.replace('>', '') to remove the > characters.
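As a sketch of that clean-up step, suppose the scraped snippet contains interpreter prompts (an assumption about this particular page's snippets); we can strip them line by line and save the result as a runnable file:

```python
# A stand-in for the string extracted with find(); the '>>> ' prompts
# are an assumption about what needs cleaning before the code can run.
snippet = ">>> x = 1\n>>> print(x + 1)\n2"

cleaned_lines = []
for line in snippet.splitlines():
    if line.startswith('>>> '):
        cleaned_lines.append(line[4:])  # drop the interpreter prompt
    # lines without a prompt are interpreter output, so we skip them

cleaned = '\n'.join(cleaned_lines)
with open('snippet.py', 'w') as f:      # save as a runnable .py file
    f.write(cleaned)
```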

Check out this article, which has an example that may be useful at this stage. Writing a program to download and run other programs has a nice recursive feel to it. However, be wary of downloading any potentially malicious code.

Selenium

Selenium was developed primarily as a framework for browser automation and testing. However, the library has found another use as a toolbox for web scraping with Python, making it quite versatile. For example, it's useful if we need to interact with a website by filling out a form or clicking a button. Selenium can also scrape content that many sites load dynamically with JavaScript.

Let's use Selenium to open a browser, navigate to a web page, enter text into a field, and retrieve some information. However, before we do all that, we need to download an extra executable file to drive the browser. In this example, we'll work with the Chrome browser, but there are other options. You can find the drivers for your version of Chrome here. Download the correct driver and save it in a directory of your choice, noting the path for later.

To open the browser with Selenium in Python, do the following:

>>> from selenium import webdriver
>>> driver = webdriver.Chrome(directory+'chromedriver.exe')
>>> driver.get('https://learnpython.com/')
>>> driver.maximize_window()

This opens a browser window, navigates to https://learnpython.com and maximizes the window. The next step is to find and click on the "Courses" button:

>>> courses_button = driver.find_element_by_link_text('Courses')
>>> courses_button.click()
>>> driver.refresh()

The browser navigates to the Courses page. Let's find the search box and enter a search term:

>>> search_field = driver.find_element_by_class_name('TextFilterComponent__search-bar')
>>> search_field.clear()
>>> search_field.send_keys('excel')

The results automatically update. Next, we want to find the first result and print out the course name:

>>> result = driver.find_element_by_class_name('CourseBlock')
>>> innerhtml = result.get_attribute('innerHTML')
>>> more_soup = BeautifulSoup(innerhtml, 'html.parser')
>>> title = more_soup.find(class_='CourseBlock__name').text

We use BeautifulSoup to parse the HTML from the first search result and then return the name of the course as a string. If we want to run this code in one block, it may be necessary to let the program sleep for a few seconds to let the page load properly. Try this workflow with a different search term, for example, "strings" or "data science".

To do all this for your own project, you need to inspect the source code of the web page to find the relevant names or IDs of the elements with which you want to interact. This is always use-case dependent and involves a little bit of investigative work.

Scrapy

Unlike the previous libraries, scrapy is very fast and efficient, which makes it useful for scraping large amounts of data from the web. It also takes care of both fetching and parsing the data.

However, it is not the most user-friendly library ever written; it takes some effort to get your head around, and it is difficult to demonstrate fully in a few lines.

The workflow for using scrapy involves creating a dedicated project in a separate directory, where several files and directories are automatically created. You may want to check out the course on LearnPython.com that teaches you how to work with files and directories efficiently.

One of the directories created is the "spiders/" directory in which you put your spiders. Spiders are classes that inherit from the scrapy.Spider class. They define what requests to make, how to follow any links on the web page, and how to parse the content. Once you have defined your spider to crawl a web page and extract content, you can run your script from the terminal. Check out this article to learn more about using Python and the command-line interface.

Another powerful feature of scrapy is the automated login. For some sites, we can access the data only after a successful login, but we can automate this with scrapy.FormRequest.

Read through the scrapy documentation page for more information. There, you'll find the installation guide and an example of this library in action.

Where to From Here in Web Scraping?

We have seen the basics of web scraping with Python and discussed some popular libraries. Web scraping has a huge number of applications. You may want to extract text from Wikipedia to use for natural language processing. You may want to get the weather forecast for your hometown automatically. You may even write a program to compare the prices of flights or hotels before your next holiday.

There are many advantages of using Python for data science projects. It is generally a good idea to start with a small project and slowly build up your skills. If you develop more complex projects with multiple libraries, keep track of them with a requirements.txt file. Before you know it, you will have mastered another skill on your Python journey!
