
Web Scraping using Regular Expressions
Regex Parsing

• from bs4 import BeautifulSoup

  Beautiful Soup is a library that makes it easy to scrape information from
  web pages.
  Link: https://www.webscrapingapi.com/parse-html-like-a-pro-scraping-with-python-and-regex
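As a minimal sketch (assuming the bs4 package is installed), Beautiful Soup can parse an HTML string and pull out elements through the parse tree, without writing any regex:

```python
from bs4 import BeautifulSoup

# A small inline HTML document used only for illustration
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# The parse tree exposes tags as attributes of the soup object
print(soup.title.string)  # Demo
print(soup.p.text)        # Hello
```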
import requests

• The requests module allows you to send HTTP requests
  using Python.
• Make a request to a web page, and print the response
  text:
• import requests

  x = requests.get('https://www.irctc.co.in/nget/train-search')
  print(x.text)
• url = "https://akshardham.com/"
• page = requests.get(url)
• page.content

• # parse the data
• soup = BeautifulSoup(page.content, "html.parser")
• print(soup.prettify())

• import re
• re.findall(r'<title>(.*?)</title>', page.text)
Extract text that is before or after specific keywords.

text                    regex               capture group   result
price: $14.99 inc.VAT   price:\s+([^\s]+)   1               $14.99
4.2 out of 5 stars      ([^\s]+) out of     1               4.2
date: 2014-08-20        \d+-\d+-\d+         0               2014-08-20
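The three rows above can be checked directly with re.search; the "capture group" column is the number passed to .group() (group 0 is always the whole match):

```python
import re

# price: capture group 1 holds the token after "price:"
m = re.search(r'price:\s+([^\s]+)', 'price: $14.99 inc.VAT')
print(m.group(1))  # $14.99

# rating: capture group 1 holds the number before "out of"
m = re.search(r'([^\s]+) out of', '4.2 out of 5 stars')
print(m.group(1))  # 4.2

# date: group 0 is the whole match, so no capture group is needed
m = re.search(r'\d+-\d+-\d+', 'date: 2014-08-20')
print(m.group(0))  # 2014-08-20
```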


• import re

• # Example HTML content


• html_content = '''
• <html>
• <head>
• <title>Web Scraping Example</title>
• </head>
• <body>
• <h1>Web Scraping with Regular Expressions</h1>
• <ul>
• <li><a href="https://example.com/page1">Page 1</a></li>
• <li><a href="https://example.com/page2">Page 2</a></li>
• <li><a href="https://example.com/page3">Page 3</a></li>
• </ul>
• </body>
• </html>
• '''

• # Regular expression pattern to find links
• link_pattern = re.compile(r'<a\s+href=["\'](https?://[^"\']+)["\']', re.IGNORECASE)

• # Find all links using the regular expression pattern


• links = link_pattern.findall(html_content)

• # Print the extracted links


• print("Extracted Links:")
• for link in links:
•     print(link)
Output:
• Extracted Links:
• https://example.com/page1
• https://example.com/page2
• https://example.com/page3
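For comparison (a sketch using the BeautifulSoup import shown earlier), the same links can be collected by walking the <a> tags instead of pattern-matching the raw HTML, which is more robust when attribute order or quoting varies:

```python
from bs4 import BeautifulSoup

html_content = '''
<html>
<body>
<ul>
<li><a href="https://example.com/page1">Page 1</a></li>
<li><a href="https://example.com/page2">Page 2</a></li>
<li><a href="https://example.com/page3">Page 3</a></li>
</ul>
</body>
</html>
'''

soup = BeautifulSoup(html_content, "html.parser")

# find_all collects every <a> tag; the href attribute holds the link
links = [a["href"] for a in soup.find_all("a", href=True)]
for link in links:
    print(link)
```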
