Learn Web Scraping
Learn the structure of HTML. We begin by explaining why web scraping can be a valuable addition to your data science toolbox and then delving into some basics of HTML. We end the chapter by giving a brief introduction on XPath notation, which is used to navigate the elements within HTML code. Some titles associated with Web Scraping include Data Scientist, Web Developer, Web Collection Specialist, Research Assistant, Application Developer, Web Mining Developer, Site Merchandiser, Market Intelligence Analyst, and of course, Web Scraper. In the U.S., Web Scraping can earn learners an average of $79,018 per year, according to ZipRecruiter.
Want to scrape the web with R? You’re at the right place!
We will teach you from ground up on how to scrape the web with R, and will take you through fundamentals of web scraping (with examples from R).
Throughout this article, we won’t just take you through prominent R libraries like rvest and Rcrawler, but will also walk you through how to scrape information with barebones code.
Overall, here’s what you are going to learn:
- R web scraping fundamentals
- Handling different web scraping scenarios with R
- Leveraging rvest and Rcrawler to carry out web scraping
Let’s start the journey!
Introduction
The first step towards scraping the web with R requires you to understand HTML and web scraping fundamentals. You’ll learn how to get browsers to display the source code, then you will develop the logic of markup languages which sets you on the path to scrape that information. And, above all - you’ll master the vocabulary you need to scrape data with R.
We would be looking at the following basics that’ll help you scrape R:
- HTML Basics
- Browser presentation
- And Parsing HTML data in R
So, let’s get into it.
HTML Basics
HTML is behind everything on the web. Our goal here is to briefly understand how Syntax rules, browser presentation, tags and attributes help us learn how to parse HTML and scrape the web for the information we need.
Browser Presentation
Before we scrape anything using R we need to know the underlying structure of a webpage. And the first thing you notice, is what you see when you open a webpage, isn’t the HTML document. It’s rather how an underlying HTML code is represented. You can basically open any HTML document using a text editor like notepad.
HTML tells a browser how to show a webpage, what goes into a headline, what goes into a text, etc. The underlying marked up structure is what we need to understand to actually scrape it.
For example, here’s what ScrapingBee.com looks like when you see it in a browser.
And, here’s what the underlying HTML looks like for it
Looking at this source code might seem like a lot of information to digest at once, let alone scrape it! But don’t worry. The next section exactly shows how to see this information better.
HTML elements and tags
If you carefully checked the raw HTML of ScrapingBee.com earlier, you would notice something like <title>...</title>, <body>...</body
etc. Those are tags that HTML uses, and each of those tags have their own unique property. For example <title>
tag helps a browser render the title of a web page, similarly <body>
tag defines the body of an HTML document.
Once you understand those tags, that raw HTML would start talking to you and you’d already start to get the feeling of how you would be scraping web using R. All you need to take away form this section is that a page is structured with the help of HTML tags, and while scraping knowing these tags can help you locate and extract the information easily.
Parsing a webpage using R
With what we know, let’s use R to scrape an HTML webpage and see what we get. Keep in mind, we only know about HTML page structures so far, we know what RAW HTML looks like. That’s why, with the code, we will simply scrape a webpage and get the raw HTML. It is the first step towards scraping the web as well.
Earlier in this post, I mentioned that we can even use a text editor to open an HTML document. And in the code below, we will parse HTML in the same way we would parse a text document and read it with R.
I want to scrape the HTML code of ScrapingBee.com and see how it looks. We will use readLines() to map every line of the HTML document and create a flat representation of it.
Now, when you see what flat_html looks like, you should see something like this in your R Console:
The whole output would be a hundred pages so I’ve trimmed it for you. But, here’s something you can do to have some fun before I take you further towards scraping web with R:
- Scrape www.google.com and try to make sense of the information you received
- Scrape a very simple web page like https://www.york.ac.uk/teaching/cws/wws/webpage1.html and see what you get
Remember, scraping is only fun if you experiment with it. So, as we move forward with the blog post, I’d love it if you try out each and every example as you go through them and bring your own twist. Share in comments if you found something interesting or feel stuck somewhere.
While our output above looks great, it still is something that doesn’t closely reflect an HTML document. In HTML we have a document hierarchy of tags which looks something like
But clearly, our output from readLines() discarded the markup structure/hierarchies of HTML. Given that, I just wanted to give you a barebones look at scraping, this code looks like a good illustration.
However, in reality, our code is a lot more complicated. But fortunately, we have a lot of libraries that simplify web scraping in R for us. We will go through four of these libraries in later sections.
First, we need to go through different scraping situations that you’ll frequently encounter when you scrape data through R.
Common web scraping scenarios with R
Access web data using R over FTP
FTP is one of the ways to access data over the web. And with the help of CRAN FTP servers, I’ll show you how you can request data over FTP with just a few lines of code. Overall, the whole process is:
- Save ftp URL
- Save names of files from the URL into an R object
- Save files onto your local directory
Let’s get started now. The URL that we are trying to get data from is ftp://cran.r-project.org/pub/R/web/packages/BayesMixSurv/.
Let’s check the name of the files we received with get_files
Looking at the string above can you see what the file names are?
The screenshot from the URL shows real file names
It turns out that when you download those file names you get carriage return representations too. And it is pretty easy to solve this issue. In the code below, I used str_split() and str_extract_all() to get the HTML file names of interest.
Let’s print the file names to see what we have now:
extracted_html_filenames
Great! So, we now have a list of HTML files that we want to access. In our case, it was only one HTML file.
Now, all we have to do is to write a function that stores them in a folder and a function that downloads HTML docs in that folder from the web.
We are almost there now! All we now have to do is to download these files to a specified folder in your local drive. Save those files in a folder called scrapignbee_html. To do so, use GetCurlHandle().
After that, we’ll use plyr package’s l_ply() function.
And, we are done!
I can see that on my local drive I have a folder named scrapingbee_html, where I have inde.html file stored. But, if you don’t want to manually go and check the scraped content, use this command to retrieve a list of HTMLs downloaded:
That was via FTP, but what about HTML retrieving specific data from a webpage? That’s what our next section covers.
How Long To Learn Web Scraping
Scraping information from Wikipedia using R
In this section, I’ll show you how to retrieve information from Leonardo Da Vinci’s Wikipedia page https://en.wikipedia.org/wiki/Leonardo_da_Vinci.
Let’s take the basic steps to parse information:
Leonardo Da Vinci’s Wikipedia HTML has now been parsed and stored in parsed_wiki.
But, let’s say you wanted to see what text we were able to parse. A very simple way to do that would be:
By doing that, we have essentially parsed everything that exists within the <p>
node. And since it is an XML node set, we can easily use subsetting rules to access different paragraphs. For example, let’s say we pick the 4th element on a random name. Here’s what you’ll see:
Reading text is fun, but let’s do something else - let’s get all links that exist on this page. We can easily do that by using getHTMLLinks() function:
Notice what you see above is a mix of actual links and links to files.
You can also see the total number of links on this page by using length() function:
I’ll throw in one more use case here which is to scrape tables off such HTML pages. And it is something that you’ll encounter quite frequently too for web scraping purposes. XML package in R offers a function named readHTMLTable() which makes our life so easy when it comes to scraping tables from HTML pages.
Leonardo’s Wikipedia page has no HTML though, so I will use a different page to show how we can scrape HTML from a webpage using R. Here’s the new URL:
As usual, we will read this URL:
If you look at the page you’ll disagree with the number “108”. For a closer inspection I’ll use name() function to get names of all 108 tables:
Our suspicion was right, there are too many “NULL” and only a few tables. I’ll now read data from one of those tables in R:
Here’s how this table looks in HTML
Awesome isn’t it? Imagine being able to access census, pricing, etc data over R and scraping it. Wouldn’t it be fun? That’s why I took a boring one, and kept the fun part for you. Try something much cooler than what I did. Here’s an example of table data that you can scrape https://en.wikipedia.org/wiki/United_States_Census
Let me know how it goes for you. But it usually isn’t that straightforward. We have forms and authentication that can block your R code from scraping. And that’s exactly what we are going to learn to get through here.
Handling HTML forms while scraping with R
Often we come across pages that aren’t that easy to scrape. Take a look at the Meteorological Service Singapore’s page (that lack of SSL though :O). Notice the dropdowns here
Imagine if you want to scrape information that you can only get upon clicking on the dropdowns. What would you do in that case?
Well, I’ll be jumping a few steps forward and will show you a preview of rvest package while scraping this page. Our goal here is to scrape data from 2016 to 2020.
Let’s check what type of data have been able to scrape. Here’s what our data frame looks like:
From the dataframe above, we can now easily generate URLs that provide direct access to data of our interest.
Now, we can download those files at scale using lappy().
Note: This is going to download a ton of data once you execute it.
Web scraping using Rvest
Inspired by libraries like BeautifulSoup, rvest is probably one of most popular packages in R that we use to scrape the web. While it is simple enough that it makes scraping with R look effortless, it is complex enough to enable any scraping operation.
Let’s see rvest in action now. I will scrape information from IMDB and we will scrape Sharknado (because it is the best movie in the world!) https://www.imdb.com/title/tt8031422/
Awesome movie, awesome cast! Let's find out what was the cast of this movie.
Awesome cast! Probably that’s why it was such a huge hit. Who knows.
Still, there are skeptics of Sharknado. I guess the rating would prove them wrong? Here’s how you extract ratings of Sharknado from IMDB
I still stand by my words. But I hope you get the point, right? See how easy it is for us to scrape information using rvest, while we were writing 10+ lines of code in much simpler scraping scenarios.
Next on our list is Rcrawler.
Web Scraping using Rcrawler
Rcrawler is another R package that helps us harvest information from the web. But unlike rvest, we use Rcrawler for network graph related scraping tasks a lot more. For example, if you wish to scrape a very large website, you might want to try Rcrawler in a bit more depth.
Note: Rcrawler is more about crawling than scraping.
We will go back to Wikipedia and we will try to find the date of birth, date of death and other details of scientists.
Output looks like this:
And that’s it!
You pretty much know everything you need to get started with Web Scraping in R.
Try challenging yourself with interesting use cases and uncover challenges. Scraping the web with R can be really fun!
While this whole article tackles the main aspect of web scraping with R, it does not talk about web scraping without getting blocked.
If you want to learn how to do it, we have wrote this complete guide, and if you don't want to take care of this, you can always use our web scraping API.
Happy scraping.
A growing number of business activities and our lives are being spent online, this has led to an increase in the amount of publicly available data. Web scraping allows you to tap into this public information with the help of web scrapers.
In the first part of this guide to basics of web scraping you will learn –
- What is web scraping?
- Web scraping use cases
- Types of web scrapers
- How does a web scraper work?
- Difference between a web scraper and web crawler
- Is web scraping legal?
What is web scraping?
Web scraping automates the process of extracting data from a website or multiple websites. Web scraping or data extraction helps convert unstructured data from the internet into a structured format allowing companies to gain valuable insights. This scraped data can be downloaded as a CSV, JSON, or XML file.
Web scraping (or Data Scraping or Data Extraction or Web Data Extraction used synonymously), helps transform this content on the Internet into structured data that can be consumed by other computers and applications. The scraped data can help users or businesses to gather insights that would otherwise be expensive and time-consuming.
Since the basic idea of web scraping is automating a task, it can be used to create web scraping APIs and Robotic Process Automation (RPA) solutions. Web scraping APIs allow you to stream scraped website data easily into your applications. This is especially useful in cases where a website does not have an API or has a rate/volume-limited API.
Uses of Web Scraping
People use web scrapers to automate all sorts of scenarios. Web scrapers have a variety of uses in the enterprise. We have listed a few below:
- Price Monitoring –Product data is impacting eCommerce monitoring, product development, and investing. Extracting product data such as pricing, inventory levels, reviews and more from eCommere websites can help you create a better product strategy.
- Marketing and Lead Generation –As a business, to reach out to customers and generate sales, you need qualified leads. That is getting details of companies, addresses, contacts, and other necessary information. Publicly information like this is valuable. Web scraping can enhance the productivity of your research methods and save you time.
- Location Intelligence – The transformation of geospatial data into strategic insights can solve a variety of business challenges. By interpreting rich data sets visually you can conceptualize the factors that affect businesses in various locations and optimize your business process, promotion, and valuation of assets.
- News and Social Media – Social media and news tells your viewers how they engage with, share, and perceive your content. When you collect this information through web scraping you can optimize your social content, update your SEO, monitor other competitor brands, and identify influential customers.
- Real Estate – The real estate industry has myriad opportunities. Including web scraped data into your business can help you identify real estate opportunities, find emerging markets analyze your assets.
How to get started with web scraping
There are many ways to get started with web scraper, writing code from scratch is fine for smaller data scraping needs. But beyond that, if you need to scrape a few different types of web pages and thousands of data fields, you will need a web scraping service that is able to scrape multiple websites easily on a large scale.
Custom Web Scraping Services
Many companies build their own web scraping departments but other companies use Web Scraping services. While it may make sense to start an in house web scraping solution, the time and cost involved far outweigh the benefits. Hiring a custom web scraping service ensures that you can concentrate on your projects.
Web scraping companies such as ScrapeHero, have the technology and scalability to handle web scraping tasks that are complex and massive in scale – think millions of pages. You need not worry about setting up and running scrapers, avoiding and bypassing CAPTCHAs, rotating proxies, and other tactics websites use to block web scraping.
Web Scraping Tools and Software
Point and click web scraping tools have a visual interface, where you can annotate the data you need, and it automatically builds a web scraper with those instructions. Web Scraping tools (free or paid) and self-service applications can be a good choice if the data requirement is small, and the source websites aren’t complicated.
ScrapeHero Cloud has pre-built scrapers that in addition to scraping search engine data, can Scrape Job data, Scrape Real Estate Data, Scrape Social Media and more. These scrapers are easy to use and cloud-based, where you need not worry about selecting the fields to be scraped nor download any software. The scraper and the data can be accessed from any browser at any time and can deliver the data directly to Dropbox.
Scraping Data Yourself
You can build web scrapers in almost any programming language. It is easier with Scripting languages such as Javascript (Node.js), PHP, Perl, Ruby, or Python. If you are a developer, open-source web scraping tools can also help you with your projects. If you are just new to web scraping these tutorials and guides can help you get started with web scraping.
If you don't like or want to code, ScrapeHero Cloud is just right for you!
Skip the hassle of installing software, programming and maintaining the code. Download this data using ScrapeHero cloud within seconds.
How does a web scraper work
A web scraper is a software program or script that is used to download the contents (usually text-based and formatted as HTML) of multiple web pages and then extract data from it.
Web scrapers are more complicated than this simplistic representation. They have multiple modules that perform different functions.
What are the components of a web scraper
Web scraping is like any other Extract-Transform-Load (ETL) Process. Web Scrapers crawl websites, extracts data from it, transforms it into a usable structured format, and loads it into a file or database for subsequent use.
A typical web scraper has the following components:
1. Crawl
First, we start at the data source and decide which data fields we need to extract. For that, we have web crawlers, that crawl the website and visit the links that we want to extract data from. (e.g the crawler will start at https://scrapehero.com and crawl the site by following links on the home page.)
The goal of a web crawler is to learn what is on the web page, so that the information when it is needed, can be retrieved. The web crawling can be based on what it finds or it can search the whole web (just like the Google search engine does).
2. Parse and Extract
Extracting data is the process of taking the raw scraped data that is in HTML format and extracting and parsing the meaningful data elements. In some cases extracting data may be simple such as getting the product details from a web page or it can get more difficult such as retrieving the right information from complex documents.
You can use data extractors and parsers to extract the information you need. There are different kinds of parsing techniques: Regular Expression, HTML Parsing, DOM Parsing (using a headless browser), or Automatic Extraction using AI.
3. Format
Now the data extracted needs to be formatted into a human-readable form. These can be in simple data formats such as CSV, JSON, XML, etc. You can store the data depending on the specification of your data project.
The data extracted using a parser won’t always be in the format that is suitable for immediate use. Most of the extracted datasets need some form of “cleaning” or “transformation.” Regular expressions, string manipulation, and search methods are used to perform this cleaning and transformation.
4. Store and Serialize Data
After the data has been scraped, extracted, and formatted you can finally store and export the data. Once you get the cleaned data, it needs to be serialized according to the data models that you require. Choosing an export method largely depends on how large your data files are and what data exports are preferred within your company.
Web Scraping Vba
This is the final module that will output data in a standard format that can be stored in Databases using ETL tools (Check out our guide on ETL Tools), JSON/CSV files, or data delivery methods such as Amazon S3, Azure Storage, and Dropbox.
Web Crawling vs. Web Scraping
People often use Web Scraping and Web Crawling interchangeably. Although the underlying concept is to extract data from the web, they are different.
Web Crawling mostly refers to downloading and storing the contents of a large number of websites, by following links in web pages. A web crawler is a standalone bot, that scans the internet, searching, and indexing for content. In general, a ‘crawler’ means the ability to navigate pages on its own. Crawlers are the backbones of search engines like Google, Bing, Yahoo, etc.
A Web scraper is built specifically to handle the structure of a particular website. The scraper then uses this site-specific structure to extract individual data elements from the website. Unlike a web crawler, a web scraper extracts specific information such as pricing data, stock market data, business leads, etc.
Is web scraping legal?
Although web scraping is a powerful technique in collecting large data sets, it is controversial and may raise legal questions related to copyright and terms of service. Most times a web scraper is free to copy a piece of data from a web page without any copyright infringement. This is because it is difficult to prove copyright over such data since only a specific arrangement or a particular selection of the data is legally protected.
Legality is totally dependent on the legal jurisdiction (i.e. Laws are country and locality specific). Publicly available information gathering or scraping is not illegal, if it were illegal, Google would not exist as a company because they scrape data from every website in the world.
Terms of Service
Although most web applications and companies include some form of TOS agreement, it lies within a gray area. For instance, the owner of a web scraper that violates the TOS may argue that he or she never saw or officially agreed to the TOS
Some forms of web scraping can be illegal such as scraping non-public data or disclosed data. Non-public data is something that isn’t reachable or open to the public. An example of this would be, the stealing of intellectual property.
Ethical Web Scraping
If a web scraper sends data acquiring requests too frequently, the website will block you. The scraper may be refused entry and may be liable for damages because the owner of the web application has a property interest. An ethical scraping tool or professional web scraping services will avoid this issue by maintaining a reasonable requesting frequency. We talk in other guides about how you can make your scraper more “polite” so that it doesn’t get you into trouble.
Learn Web Scraping
What’s next?
Let’s do something hands-on before we get into web page structures and XPaths. We will make a very simple scraper to scrape Reddit’s top pages and extract the title and URLs of the links shared.
Check out part 2 and 3 of this post in the link here – A beginners guide to Web Scraping: Part 2 – Build a web scraper for Reddit using Python and BeautifulSoup
Web Scraping Tutorial for Beginners – Part 3 – Navigating and Extracting Data – Navigating and Scraping Data from Reddit
We can help with your data or automation needs
Turn the Internet into meaningful, structured and usable data