×

Python Tutorial

Python Basics

Python I/O

Python Operators

Python Conditions & Controls

Python Functions

Python Strings

Python Modules

Python Lists

Python OOPs

Python Arrays

Python Dictionary

Python Sets

Python Tuples

Python Exception Handling

Python NumPy

Python Pandas

Python File Handling

Python WebSocket

Python GUI Programming

Python Image Processing

Python Miscellaneous

Python Practice

Python Programs

Scraping Links from a Webpage in Python

Here, we are going to learn how to scrape links from a webpage in Python, we are implementing a python program to extract all the links in a given WebPage.
Submitted by Aditi Ankush Patil, on May 17, 2020

Prerequisite:

  1. Urllib3: It is a powerful, sanity-friendly HTTP client for Python with having many features like thread safety, client-side SSL/TSL verification, connection pooling, file uploading with multipart encoding, etc.
    Installing urllib3:
        $ pip install urllib3
    
  2. BeautifulSoup: It is a Python library that is used to scrape/get information from the webpages, XML files i.e. for pulling data out of HTML and XML files.
    Installing BeautifulSoup:
        $ pip install beautifulsoup4
    

Commands Used:

html= urllib.request.urlopen(url).read(): Opens the URL and reads the whole blob with newlines at the end and it all comes into one big string.

soup= BeautifulSoup(html,'html.parser'): Using BeautifulSoup to parse the string BeautifulSoup converts the string and it just takes the whole file and uses the HTML parser, and we get back an object.

tags= soup('a'): To get the list of all the anchor tags.

tag.get('href',None): Extract and get the data from the href.

Python program to Links from a Webpage

# import statements
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

# Get links
# URL of a WebPage
url = input("Enter URL: ") 

# Open the URL and read the whole page
html = urllib.request.urlopen(url).read()
# Parse the string
soup = BeautifulSoup(html, 'html.parser')
# Retrieve all of the anchor tags
# Returns a list of all the links
tags = soup('a')

#Prints all the links in the list tags
for tag in tags: 
  # Get the data from href key
  print(tag.get('href', None), end = "\n")

Output:

Enter URL: https://www.google.com/
https://www.google.com/imghp?hl=en&tab=wi
https://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=US&tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wmhttps://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true
&continue=https://www.google.com/
/advanced_search?hl=en&authuser=0
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/policies/privacy/
/intl/en/policies/terms/
Advertisement
Advertisement


Comments and Discussions!

Load comments ↻


Advertisement
Advertisement
Advertisement

Copyright © 2025 www.includehelp.com. All rights reserved.