Last updated on 2020-10-12
A while back I wrote a post on how to scrape web pages using C# and HtmlAgilityPack (It was in May? So long ago? Wow!). This works fine for static pages, but most web pages are dynamic apps where elements appear and disappear when you interact with the page. So we need a better solution.
Selenium is an open-source application/library that lets us automate browsing using code. And it is awesome. In this tutorial, I’m going to show how to scrape a simple dynamic web page that changes when an element is clicked. A prerequisite for this tutorial is having the Chrome browser installed on your computer (more on that later).
Let’s start by creating a new .NET Core project:
> dotnet new console -n DynamicWebScraping
To use Selenium we need two things: a Selenium WebDriver, which interacts with the browser, and the Selenium library, which connects our code with the Selenium WebDriver. You can read more in the docs. Luckily, both of them come as NuGet packages that we can add to the project. We’ll also add a library that provides some Selenium extensions:
> dotnet add package Selenium.WebDriver
> dotnet add package Selenium.WebDriver.ChromeDriver
> dotnet add package DotNetSeleniumExtras.WaitHelpers
One important thing to note when you install these packages is the version of the Selenium.WebDriver.ChromeDriver package that is installed, which looks something like this: PackageReference for package 'Selenium.WebDriver.ChromeDriver' version '85.0.4183.8700'. The major version of the driver (in this case 85) must match the major version of the Chrome browser that is installed on your computer (you can check the version you have by going to Help->About in your browser).
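As a rough sketch of that rule, the check comes down to comparing the first component of the two version strings. `MajorVersionsMatch` is a hypothetical helper of mine (it is not part of any Selenium package), and the browser version below is only an example value; substitute whatever your own browser reports:

```csharp
using System;

class VersionCheck
{
    // Hypothetical helper: true when both version strings share the same major version.
    static bool MajorVersionsMatch(string driverVersion, string browserVersion)
    {
        return new Version(driverVersion).Major == new Version(browserVersion).Major;
    }

    static void Main()
    {
        // Driver package version taken from the install message quoted above.
        string driverVersion = "85.0.4183.8700";
        // Example value only: replace with the version shown in Help->About.
        string browserVersion = "85.0.4183.121";

        Console.WriteLine(MajorVersionsMatch(driverVersion, browserVersion));
    }
}
```

If the two major versions differ, the simplest fix is usually to install the driver package whose major version matches your browser.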
To demonstrate dynamic scraping, I’ve created a web page that shows the word “Hello” which, when clicked, adds the word “World” below it. I’m not a web developer and don’t pretend to be, so the code here is probably ugly, but it does the job:
    <html>
    <head>
        <meta charset="utf-8" />
        <title>Sample</title>
        <script>
            function addMoreContent() {
                var myDiv = document.getElementById("MyDiv");
                var newElement = document.createElement("h2");
                newElement.id = "heading2";
                newElement.appendChild(document.createTextNode("World"));
                document.getElementById("body").appendChild(newElement);
            }
        </script>
    </head>
    <body id="body">
        <h1 id="heading1" onclick="addMoreContent()" style="cursor:pointer">Hello</h1>
    </body>
    </html>
I added this page to the project as page.html and configured it to be copied to the project's output directory, so the scraping code can find it easily. This is achieved by adding the following lines to the DynamicWebScraping.csproj project file, somewhere between the opening and closing Project nodes:
    <ItemGroup>
        <None Update="page.html">
            <CopyToOutputDirectory>Always</CopyToOutputDirectory>
        </None>
    </ItemGroup>
The scraping code will navigate to this page and wait for the heading1 element to appear. When it does, it will click on the element and wait for the heading2 element to appear, fetching the textContent that is located in that element:
    using OpenQA.Selenium;
    using OpenQA.Selenium.Chrome;
    using OpenQA.Selenium.Support.UI;
    using SeleniumExtras.WaitHelpers;
    using System;
    using System.IO;
    using System.Reflection;

    namespace DynamicWebScraping
    {
        class Program
        {
            static void Main(string[] args)
            {
                Scrape();
                Console.ReadLine(); // keep the console open until Enter is pressed
            }

            public static void Scrape()
            {
                ChromeOptions options = new ChromeOptions();
                using (IWebDriver driver = new ChromeDriver(options))
                {
                    WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(10));

                    // page.html is copied next to the compiled assembly; build a proper file:// URL from its path.
                    string pagePath = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "page.html");
                    driver.Navigate().GoToUrl(new Uri(pagePath).AbsoluteUri);

                    // Wait for heading1 to be clickable, then click it.
                    wait.Until(ExpectedConditions.ElementToBeClickable(By.Id("heading1"))).Click();

                    // Wait for the dynamically added heading2 and print its text.
                    IWebElement firstResult = wait.Until(ExpectedConditions.ElementExists(By.Id("heading2")));
                    Console.WriteLine(firstResult.GetAttribute("textContent"));
                }
            }
        }
    }
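As a side note, the same kind of wait can be written without ExpectedConditions by passing a lambda to wait.Until. This is only a minimal sketch: `WaitSketch` and `WaitForId` are names I made up for illustration, and it relies on WebDriverWait ignoring "not found" exceptions while it polls:

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Support.UI;
using System;

static class WaitSketch
{
    // Hypothetical helper: polls until an element with the given id is found,
    // or throws WebDriverTimeoutException once the timeout elapses.
    public static IWebElement WaitForId(IWebDriver driver, string id, int timeoutSeconds = 10)
    {
        WebDriverWait wait = new WebDriverWait(driver, TimeSpan.FromSeconds(timeoutSeconds));

        // FindElement throws NoSuchElementException while the element is missing;
        // WebDriverWait swallows that exception and retries until the timeout.
        return wait.Until(d => d.FindElement(By.Id(id)));
    }
}
```

With a helper like this, the last two lines of Scrape could become something like `Console.WriteLine(WaitSketch.WaitForId(driver, "heading2").GetAttribute("textContent"));`.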
Let’s build and run the project:
> dotnet build
> dotnet run
The program opens a browser window and starts to interact with it, returning the text inside the second heading. Pretty cool, right? I have to admit that the first time I ran this it felt really powerful, and it opens a whole new world of things to build… If only I had more time :-).
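If you would rather not have a browser window pop up, the ChromeOptions object is the place to change that. The sketch below is not part of the original project; it assumes Chrome's `--headless` switch and the same page.html copied to the output directory, and `HeadlessSketch` is just an illustrative name:

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using System;
using System.IO;
using System.Reflection;

class HeadlessSketch
{
    static void Main()
    {
        ChromeOptions options = new ChromeOptions();
        // Assumption: run Chrome without a visible window via its --headless switch.
        options.AddArgument("--headless");

        using (IWebDriver driver = new ChromeDriver(options))
        {
            // Same page as before, located next to the compiled assembly.
            string pagePath = Path.Combine(Path.GetDirectoryName(Assembly.GetExecutingAssembly().Location), "page.html");
            driver.Navigate().GoToUrl(new Uri(pagePath).AbsoluteUri);

            // Print the page title to confirm the page loaded without a window.
            Console.WriteLine(driver.Title);
        }
    }
}
```

The rest of the scraping logic should work the same way in headless mode, since it only depends on the driver and not on a visible window.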
As always, the full source code for this tutorial can be found in my GitHub repository.
Hoping that the next post comes sooner. Until next time, happy coding!