Web Scraping with Python: A Comprehensive Guide for 2026

Build your own web scraper with Python from scratch. A step-by-step guide to choosing libraries, extracting data, and automating analysis with ELECTE.

You’re likely dealing with a very specific situation. You need competitive pricing data, listings, reviews, catalogs, public data, or content from vertical portals. The alternatives are almost always the same: manual copy-and-paste, incomplete exports, limited APIs, or data scattered across pages that no one in the company can consistently gather.

This is where a Python web scraper stops being a technical exercise and becomes an operational asset. Python is the most practical choice when you want to turn web pages into clean datasets, because it allows you to start with simple scripts and then move on to more advanced crawlers, browser automation, and analysis pipelines.

In the Italian context, this issue is even more relevant. Python has become the standard for automation and data analysis, and web scraping is one of the most widely used applications in companies. The real difference, however, isn’t made by those who simply “download data.” It’s made by those who know how to choose the right library, avoid common mistakes, comply with GDPR and terms of use, and deliver data that the business can read and use.

    Introduction: Turning the Web into a Source of Strategic Data

    Many early web scraping projects start with a simple need: keeping an eye on a competitor’s prices, collecting headlines from an industry portal, building a product list, or monitoring calls for bids or job postings. The problem isn’t finding the data. The problem is collecting it in a way that’s repeatable, clean, and reliable enough to use in decision-making.

    A Python web scraper solves exactly this problem. It allows you to visit a page, download its content, identify useful elements, and save them in a structured format. If you set it up properly from the start, you can turn a manual and error-prone task into a reliable workflow.

    The part that tutorials often skip is the most important part of the actual work. It’s not enough to just “do some scraping.” You have to choose the right level of complexity. Requests and BeautifulSoup are sufficient for many sites. Others require Selenium or Playwright because the content is generated by JavaScript. For larger projects, Scrapy comes into play. And when the data involves people, profiles, or contact information, you also need to follow specific legal guidelines.

    A good scraper isn't the one that extracts the most data. It's the one that extracts the right data at the lowest maintenance cost.

    Why Python Is the Ideal Tool for Web Scraping

    Python dominates this field for a practical reason. It allows you to move very quickly from an idea to a working script, without having to make too many compromises as the project grows. In the Italian market, this is not merely a technical preference. According to 2023 data from the Digital Innovation Observatory at the Politecnico di Milano, Python is used by 75% of Italian companies for data analysis and automation, with web scraping among the primary applications. Along the same lines, in 2022, 40% of Lombardy-based SMEs implemented Python scrapers to monitor competitor prices, resulting in a 25% increase in competitiveness in retail, as reported on the University of Texas’s reference page on scraping with Python.

    Python works well because it reduces friction

    Python’s greatest strength is its readability. Whether you need to explain a script to a colleague, debug HTML selectors, or modify the extraction logic in two weeks’ time, the clarity of your code matters more than you might think.

    The second strength is the ecosystem. There are mature libraries available for almost every level of development:

    • Requests for downloading HTML or querying endpoints.
    • BeautifulSoup for navigating the DOM and extracting text, links, and attributes.
    • Selenium and Playwright for websites that rely on browser rendering.
    • Scrapy for managing spiders, pipelines, retries, and exports on a larger scale.
    • Pandas for cleaning and analyzing the data afterward.

    The right choice depends on the site

    This is where many beginners go wrong. They see Selenium and assume it’s always the best solution. It isn’t.

    For a static page, using a full-featured browser means consuming more resources, writing slower code, and increasing the number of potential failure points. Conversely, using only Requests on a site that loads data via JavaScript leads to a classic outcome: nearly empty HTML and no useful data.

    It makes sense to think of it this way:

    • A simple website with the data already in the HTML. Start with Requests and BeautifulSoup.
    • A website whose content loads after the initial page load. Switch to Playwright or Selenium.
    • Many pages with a recurring structure that need crawling. Consider using Scrapy.
    • Data available via a JSON endpoint. It’s better to use that endpoint than to parse the HTML.

    Rule of thumb: Always choose the simplest tool that can actually read the data you need.

    Another advantage of Python is that this transition is gradual. You don’t have to rewrite everything from scratch every time. Often, you can keep the parsing logic and just change how you retrieve the page.
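
One way to keep that transition cheap is to separate fetching from parsing. The following is a minimal sketch under stated assumptions: `parse_headings`, `scrape`, and `fake_fetch` are hypothetical names, and the parser uses the standard library's `html.parser` instead of BeautifulSoup only so that the example runs without extra installs.

```python
from html.parser import HTMLParser
from typing import Callable


class HeadingCollector(HTMLParser):
    """Collects the text of every <h2> element in a document."""

    def __init__(self) -> None:
        super().__init__()
        self._in_h2 = False
        self._buffer: list[str] = []
        self.headings: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_h2:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_h2:
            self.headings.append("".join(self._buffer).strip())
            self._in_h2 = False


def parse_headings(html: str) -> list[str]:
    """Parsing logic: knows nothing about how the HTML was obtained."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.headings


def scrape(url: str, fetch: Callable[[str], str]) -> list[str]:
    """Glue code: any callable that returns HTML for a URL can be plugged in."""
    return parse_headings(fetch(url))


def fake_fetch(url: str) -> str:
    """Stand-in fetcher returning fixed HTML (hypothetical content)."""
    return "<article><h2> First </h2></article><article><h2>Second</h2></article>"


print(scrape("https://example.com/news", fake_fetch))
```

In a real project, `fake_fetch` would be replaced by a function that calls Requests or drives a browser, while `parse_headings` stays untouched.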

    Choosing the Right Python Libraries for Every Task

    The most practical way to choose a library isn’t to ask which one is “the best.” The right question is a different one: what kind of site do I need to scrape, how long will this project last, and how much maintenance can I handle?

    An infographic showing the recommended Python libraries for scraping static and dynamic websites.

    A 2025 report by Unioncamere Lombardia indicates that many tech companies in Lombardy use Python for web scraping, contributing significantly to the region’s economic value. In the same context, Scrapy has a 45% adoption rate among Italian developers, and Selenium is used in 55% of projects requiring interaction with JavaScript sites, with a 90% reduction in CAPTCHA blocks when combined with proxies, according to ScraperAPI’s reference page dedicated to scraping with Python.

    A lightweight stack for static pages

    If the content is already in the original HTML, don't make things harder for yourself.

    Requests + BeautifulSoup is still the most sensible starting point for:

    • editorial websites with a standard structure
    • simple public directories
    • server-rendered product pages
    • listing pages with no specific interactions

    This stack is great when you want to:

    • quickly launch a scraper
    • debug with ease
    • save the data as CSV or JSON
    • keep the code readable even for non-specialist colleagues

    A simple example:

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com/news"
    response = requests.get(url, timeout=20)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    for article in soup.select("article"):
        title = article.select_one("h2")
        link = article.select_one("a")
        if title and link:
            print(title.get_text(strip=True), link.get("href"))

    This approach works well as long as the data is actually in the HTML source. Before using it, open “View Page Source,” not just “Inspect.” If the data isn’t in the source, Requests alone won’t be enough.
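
The "View Page Source" check can also be scripted. This is a rough sketch, not a definitive test: `snippets_in_source` is a hypothetical helper, and the two HTML strings are invented stand-ins for what a server-rendered page and a JavaScript shell might return.

```python
def snippets_in_source(html: str, expected: list[str]) -> dict[str, bool]:
    """Report, for each expected snippet, whether it occurs in the raw HTML."""
    lowered = html.lower()
    return {snippet: snippet.lower() in lowered for snippet in expected}


# Hypothetical responses: one server-rendered, one an empty JavaScript shell.
static_html = "<article><h2>Quarterly results</h2><a href='/q3'>Read</a></article>"
js_shell_html = "<div id='app'></div><script src='/bundle.js'></script>"

print(snippets_in_source(static_html, ["Quarterly results"]))
print(snippets_in_source(js_shell_html, ["Quarterly results"]))
```

If a value you can see in the rendered page is missing from the downloaded source, that is a strong hint that the data arrives later via JavaScript.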

    When you need a real browser

    If you see asynchronous loading, “load more” buttons, infinite scrolling, content generated by front-end frameworks, or mandatory user interactions, then the HTML parser alone won’t solve the problem.

    This is where Selenium and Playwright come into play.

    Selenium is a stable and widely used choice. It's a good option when you need to:

    • click buttons
    • fill in fields
    • wait for elements to load in the browser
    • manage complex websites with user flows

    Playwright tends to offer a more modern and streamlined API. For teams starting today, it is often the more straightforward choice for:

    • more reliable waits
    • multi-browser support
    • headless automation
    • interactions in SPAs and modern interfaces

    The reality is this: browser automation offers more power, but it also means higher memory usage, longer processing times, and more maintenance.

    If you can read a JSON endpoint from network traffic, do so. It's almost always more reliable than simulating clicks and scrolls.

    When a project stops being just a script

    There comes a point where you’re no longer just “scraping data.” You’re building a process.

    This is where Scrapy gets interesting. Not because it’s easier, but because it organizes things better:

    • request queues
    • pagination handling
    • retries
    • throttling
    • cleaning pipelines
    • structured exports

    I recommend it when you need to work with many categories, many pages, or multiple domains that follow recurring patterns. For a one-time data extraction, it’s often overkill. For a continuous crawler, however, it saves you from having to reinvent components that you would otherwise have to spread across separate scripts.

    You can also use a hybrid approach:

    1. Requests for rapid tests.
    2. Playwright for testing dynamic cases.
    3. Scrapy when the process goes live.

    Quick Comparison Chart

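    | Library       | Ideal Use Case                                       | JavaScript Management          | Learning Curve | Speed  |
    |---------------|------------------------------------------------------|--------------------------------|----------------|--------|
    | Requests      | Static pages, APIs, rapid prototyping                | No                             | Low            | High   |
    | BeautifulSoup | Simple, readable HTML parsing                        | No                             | Low            | Medium |
    | Selenium      | Browser interaction, forms, clicks, dynamic sites    | Yes                            | Medium         | Low    |
    | Playwright    | Modern dynamic sites, more robust handling of delays | Yes                            | Medium         | Medium |
    | Scrapy        | Large-scale crawling, structured processes           | Not native, requires extension | High           | High   |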

    A Practical Guide to Creating Your First Web Scraper

    The first version of a scraper should do a few things well: read a page, find the right elements, clean up the text, and save the output in a useful format. Nothing more.

    Prepare the environment and install the dependencies

    Keep the project isolated. A virtual environment prevents conflicts and makes the work reproducible.

    Install only what is necessary:

    pip install requests beautifulsoup4

    Basic initial structure:

    • scraper.py for the code
    • output.csv for export
    • an internal README file containing target URLs, selectors used, and operational notes

    It may seem obvious, but documenting the selectors you use right from the start will save you time when the site changes.

    Review the page before writing code

    Open the target page in your browser and use the developer tools. Look for the nodes that actually contain the data you're interested in.

    Suppose we want to extract:

    • news headline
    • link to the article

    Check three things:

    1. Is the content in the HTML source code?
    2. Are the elements' classes or tags fairly stable?
    3. Is the link absolute or relative?

    Don't choose fragile selectors, such as classes automatically generated by the frontend. If you can select an article, an h2, or a region with a consistent structure, your scraper will last longer.

    Writing a Basic Web Scraper with Requests and BeautifulSoup

    Here is a complete and easy-to-read example.

    import csv
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://example.com"
    TARGET_URL = "https://example.com/news"

    headers = {"User-Agent": "Mozilla/5.0"}

    # Download the page and stop early on HTTP errors.
    response = requests.get(TARGET_URL, headers=headers, timeout=20)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    rows = []
    for card in soup.select("article"):
        title_el = card.select_one("h2")
        link_el = card.select_one("a")
        if not title_el or not link_el:
            continue

        title = title_el.get_text(strip=True)
        href = (link_el.get("href") or "").strip()
        # Skip cards with an empty title or a missing href.
        if not title or not href:
            continue

        # Resolve relative links against the site root.
        rows.append({"title": title, "url": urljoin(BASE_URL, href)})

    with open("output.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "url"])
        writer.writeheader()
        writer.writerows(rows)

    print(f"Items extracted: {len(rows)}")

    For a first web scraper in Python, this structure is more than enough.

    The flow is linear:

    • Download the page
    • Build the parser
    • Select the repeated blocks
    • Extract the fields
    • Save the output

    Clean and save the results

    Data quality is determined here. The most common issues aren’t technical. They’re operational:

    • titles with extra spaces
    • relative links
    • duplicate rows
    • incorrect encoding
    • empty fields

    Before handing over the CSV file, be sure to open it. If the file will be imported into Excel, check that the columns and text are legible. If you need help with this step, this ELECTE guide on how to handle CSV files in Excel may be useful.

    A scraper that generates a messy CSV file just shifts the problem downstream. It doesn't solve it.

    Good habits to start practicing right away:

    • Use strip() to clean up the text.
    • Validate the required fields before saving.
    • Normalize URLs with urljoin.
    • Check for duplicates if the page contains repeated elements.
    • Handle HTTP errors with raise_for_status().
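
Several of these habits can be combined into one small cleaning step. The sketch below is illustrative only: `clean_rows` is a hypothetical helper, the field names `title` and `url` are assumptions, and the sample rows are invented to show each problem once.

```python
from urllib.parse import urljoin


def clean_rows(raw_rows, base_url):
    """Strip text, resolve relative links, drop incomplete and duplicate rows."""
    seen = set()
    cleaned = []
    for row in raw_rows:
        title = (row.get("title") or "").strip()
        href = (row.get("url") or "").strip()
        if not title or not href:
            continue  # a required field is missing
        url = urljoin(base_url, href)
        if url in seen:
            continue  # the same link was already collected
        seen.add(url)
        cleaned.append({"title": title, "url": url})
    return cleaned


# Invented sample: extra spaces, a duplicate, an empty title, a relative link.
raw = [
    {"title": "  Price update ", "url": "/news/1"},
    {"title": "Price update", "url": "https://example.com/news/1"},
    {"title": "", "url": "/news/2"},
    {"title": "New catalog", "url": "news/3"},
]

for row in clean_rows(raw, "https://example.com/"):
    print(row)
```

Deduplicating on the normalized URL rather than on the title is a design choice here; a page can legitimately repeat a title for different articles.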

    If the result seems fragile to you, it is. Before adding new features, make sure the foundation is solid.

    Overcoming Advanced Obstacles Such as JavaScript and Anti-Bot Measures

    When a scraper returns a nearly empty page, the problem is usually not Python. The problem lies in the site’s rendering model. Many modern interfaces load data after the initial HTML, using asynchronous requests or JavaScript components. Requests downloads the initial document. It does not act like a browser.

    Understanding why a page returns empty data

    Before switching to Selenium or Playwright, take a quick look at the developer tools:

    • Check the Network tab
    • Filter Fetch/XHR requests
    • Search for JSON responses
    • Check whether the relevant data is coming from separate endpoints

    If you can find a clean, readable endpoint, that’s often the best approach. You get more structured data, less HTML clutter, and less maintenance.
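
To illustrate why an endpoint is easier to work with, here is a sketch that flattens a JSON response into rows. The payload shape (`items`, `name`, `price.amount`, `price.currency`) is entirely hypothetical; a real endpoint will have its own structure, which you would read from the Network tab. With Requests, the same dictionary would come from `response.json()`.

```python
import json

# Hypothetical response body, as it might be copied from the Network tab.
sample_body = """
{"items": [
  {"name": "Widget A", "price": {"amount": "19.90", "currency": "EUR"}},
  {"name": "Widget B", "price": {"amount": "24.50", "currency": "EUR"}},
  {"name": "Widget C", "price": null}
]}
"""


def rows_from_payload(body: str) -> list[dict]:
    """Turn the (assumed) payload structure into flat rows."""
    payload = json.loads(body)
    rows = []
    for item in payload.get("items", []):
        price = item.get("price") or {}  # tolerate a missing or null price
        amount = price.get("amount")
        rows.append({
            "name": (item.get("name") or "").strip(),
            "price": float(amount) if amount is not None else None,
            "currency": price.get("currency"),
        })
    return rows


for row in rows_from_payload(sample_body):
    print(row)
```

No selectors are involved, so a frontend redesign that leaves the endpoint untouched would not break this code.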

    If, on the other hand, the site really does build the content in the browser, use browser automation. In that case, you need to handle waiting correctly. The right approach isn’t “wait 5 seconds and hope for the best.” It’s to wait for the element to appear or for an observable condition to be met.
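
The difference between a fixed sleep and a condition-based wait can be shown without a browser. The helper below is a generic sketch: `wait_until` and `element_is_present` are hypothetical names, and the condition is a stand-in for "the element exists in the page." Browser tools ship their own version of this idea, for example Playwright's `page.wait_for_selector` and Selenium's `WebDriverWait`, and those should be preferred in real browser code.

```python
import time
from typing import Callable


def wait_until(condition: Callable[[], bool], timeout: float = 10.0,
               interval: float = 0.25) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for "the element has appeared": becomes true on the third check.
checks = {"count": 0}


def element_is_present() -> bool:
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(element_is_present, timeout=2.0, interval=0.01))
```

The wait ends as soon as the condition holds, so a fast page is not slowed down, and a slow page fails with a clear `False` instead of silently returning empty data.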

    Anti-bot defenses cannot be overcome by brute force

    Many websites block aggressive scraping to protect their infrastructure, data, and user experience. If you send too many requests, use unnatural headers, or repeatedly open browser sessions, the website will take action.

    The most common mistakes are always the same:

    • Requests that are too fast and trigger rate limiting.
    • Poor or inconsistent headers that give away the use of a script.
    • Stateless sessions when the site expects cookies or tokens.
    • Fragile selectors and click sequences that break as soon as the frontend changes.

    The professional approach is more understated:

    • Slow down the pace of requests.
    • Use sessions when continuity is needed.
    • Use credible and consistent headers.
    • Limit the number of pages you visit to only the information you really need.
    • Whenever possible, use structured endpoints instead of full rendering.
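
Slowing down the request rate can be as simple as enforcing a minimum interval between calls. This is a minimal sketch, assuming a single-threaded scraper: the `Throttle` class is hypothetical, the interval value is arbitrary, and the actual fetch is left as a comment.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the logic can be tested
        self._sleep = sleep
        self._last = None        # time of the previous call, if any

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept


throttle = Throttle(min_interval=0.05)  # arbitrary value for the demo
for url in ["https://example.com/p/1", "https://example.com/p/2"]:
    throttle.wait()
    # the request for `url` would go here
    print("ready to fetch", url)
```

For continuity between requests, the same loop would typically reuse one `requests.Session` object so that cookies and headers stay consistent.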

    It’s not worth pursuing every anti-bot measure as a technical challenge. If the site is clearly hostile to scraping, consider whether the data can actually be obtained in a sustainable and compliant manner.

    Building resilient web scrapers means reducing friction with the site, not winning a race against its defenses.

    Ethical and Legal Web Scraping in Compliance with the GDPR in Italy

    The most overlooked aspect of web scraping projects isn’t the parser. It’s liability. In the Italian context, this becomes much more significant when the data involves individuals, professional profiles, résumés, contact information, or data from job portals.

    According to AGID 2025 data, several Italian SMEs have been fined for violations related to the scraping of EU data, with a significant number of penalties imposed in Lombardy and Veneto in 2024–2025. The same source notes that scraping personal data from job portals may entail criminal liability under Article 167 of Legislative Decree 196/03. This reference appears in Real Python’s practical guide to web scraping.

    Public does not mean free use

    This is the first misconception we need to clear up. Just because data is available online doesn’t mean you can collect, combine, store, and reuse it without restriction.

    In any serious work, at least four elements must be checked:

    • Robots.txt. It is not the only legal criterion, but it indicates the site’s policy.
    • Terms of Service. Some websites expressly prohibit automated scraping or reuse.
    • Presence of personal data. Names, email addresses, profiles, identifiable reviews, resumes.
    • Purpose of processing. You need to know why you collect the data, how long you retain it, and who has access to it.
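
The robots.txt check is easy to automate with the standard library. The sketch below parses an invented robots.txt from a string so that it runs offline; the rules, the user agent name, and the URLs are all assumptions. Against a real site you would call `set_url(...)` and `read()` on the parser instead of `parse(...)`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real check would download the site's file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

USER_AGENT = "example-scraper"  # hypothetical user agent name


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

print(parser.can_fetch(USER_AGENT, "https://example.com/news"))
print(parser.can_fetch(USER_AGENT, "https://example.com/private/report"))
print(parser.crawl_delay(USER_AGENT))
```

A passing check here only reflects the site's crawling policy. The terms of service and the presence of personal data still have to be assessed separately.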

    To help you navigate consent, data collection, and compliance, this in-depth ELECTE article on cookies and online privacy, EU vs. U.S. regulations, Google Consent Mode, and consent management is also helpful.

    A basic compliance checklist

    If you need to build a web scraper at your company, this foundation is non-negotiable:

    • Limit the scope. Collect only the data necessary for the stated purpose.
    • Avoid collecting personal data that isn't essential. If you don't need it, don't collect it.
    • Pseudonymize or anonymize data wherever possible right within the pipeline.
    • Document the source of the data and the collection process.
    • Set retention periods that align with actual usage.
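
As one possible way to pseudonymize inside the pipeline, an identifier can be replaced with a keyed hash before the record is stored. This is a sketch under assumptions, not legal guidance: `pseudonymize`, the field names, and the key are hypothetical, and the key would need to be kept outside the dataset. Pseudonymized data generally still counts as personal data under the GDPR, so this reduces exposure rather than removing the obligation.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Placeholder key: in practice it is loaded from a secret store, not the code.
SECRET_KEY = b"replace-with-a-key-stored-outside-the-dataset"

# Invented record with one personal field that the analysis does not need.
record = {"reviewer_email": " Mario.Rossi@example.com ", "rating": 4}

safe_record = {
    "reviewer_id": pseudonymize(record["reviewer_email"], SECRET_KEY),
    "rating": record["rating"],
}

print(safe_record)
```

The same input always maps to the same `reviewer_id`, so repeated reviews can still be grouped without storing the address itself.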

    The point here isn't to become lawyers. It's to work like professionals. A well-written scraper isn't just efficient. It's also defensible.

    From Data Extraction to Action with the ELECTE Platform

    Many projects come to a halt too soon. The team manages to scrape the data, saves a CSV file, and maybe updates the file once a week. Then the process stops there. Without data cleansing, historical analysis, reporting, or forecasting, the value remains limited.

    How to structure the process of turning data into insights

    The path looks like this:

    1. Extract consistent data from web sources.
    2. Standardize fields, formats, naming conventions, and keys.
    3. Keep a history of the observations over time.
    4. Compare variations, exceptions, and patterns.
    5. Analyze the data in a way that makes it understandable to the business as well.
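
The "history" and "compare" steps can be sketched with two snapshots of the same data. Everything here is illustrative: `price_changes` is a hypothetical helper, the product IDs and prices are invented, and a real pipeline would read the snapshots from storage rather than from literals.

```python
def price_changes(previous: dict[str, float], current: dict[str, float]) -> list[dict]:
    """Compare two snapshots keyed by product ID and report what changed."""
    changes = []
    for product_id in sorted(previous.keys() | current.keys()):
        old = previous.get(product_id)
        new = current.get(product_id)
        if old is None:
            changes.append({"id": product_id, "status": "new", "pct_change": None})
        elif new is None:
            changes.append({"id": product_id, "status": "removed", "pct_change": None})
        elif new != old:
            # Percentage change relative to the old price; undefined if it was 0.
            pct = round((new - old) / old * 100, 2) if old else None
            changes.append({"id": product_id, "status": "changed", "pct_change": pct})
    return changes


# Invented snapshots from two consecutive runs of the scraper.
monday = {"sku-1": 20.0, "sku-2": 50.0, "sku-3": 10.0}
tuesday = {"sku-1": 18.0, "sku-2": 50.0, "sku-4": 99.0}

for change in price_changes(monday, tuesday):
    print(change)
```

Unchanged products are left out on purpose, so the output is already close to what a business reader wants to see: only the exceptions.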

    If you work in retail, this might involve tracking competitors’ prices and promotions over time. In finance or compliance, it might involve supplementing controls and monitoring lists with data from public sources. In marketing, reviews and editorial content can inform qualitative rankings and trend analysis.

    When data collection becomes a recurring process, it’s best to connect the scraping tool to an analytics system rather than a folder of local files. For those who need to integrate data collected from external sources into a broader ecosystem, it may also be helpful to see how ELECTE API integration works using a verified Postman profile.

    The principle is simple. Web scraping gathers raw data. The value emerges when that raw data is incorporated into a decision-making process.

    Key Points to Remember

    • Python is the most practical choice when you want to build a web scraper that is readable, extensible, and can be integrated with data analysis.
    • The right library depends on the website. Requests and BeautifulSoup for static HTML. Playwright or Selenium for dynamic content. Scrapy for larger-scale projects.
    • The first real task is to understand the page, not to write code.
    • Raw data isn't enough. It needs to be cleaned, validated, and saved in a reusable format.
    • The GDPR, terms of use, and personal data are not minor details. They are part of the project.
    • A Python web scraper only makes sense if it leads to better decisions, not if it just generates files that get forgotten.

    Conclusion: Start Harnessing the Power of Web Data

    Building a good web scraper means making sensible choices. The right tool for the right website. Stable selectors. Clean output. Controlled request rate. Legal compliance from the start.

    This is why a Python-based web scraper remains one of the most useful tools for analysts, digital teams, and small and medium-sized businesses. It allows you to turn the web into a practical source of data, without having to rely solely on manual exports or limited integrations.

    The bottom line, however, isn’t the data extraction itself. It’s how the data is used. If you link the collected data to reports, trends, alerts, and historical data, data scraping ceases to be a technical task and becomes a concrete tool for decision-making.

    You’ve already collected the data. The next step is to turn it into clear, actionable insights. With ELECTE, an AI-powered data analytics platform for SMEs, you can connect different sources, prepare data more quickly, and get reports and analyses that truly help your business make decisions. If you want to go from raw data to faster decision-making, it’s worth seeing how it works.