Web scraping in Python (Part 1): Getting started

By | December 12, 2019

Hello, and welcome to part one of web scraping with Python. This is a four-part introductory tutorial in which you’ll use web scraping to build a dataset from a New York Times article about President Trump. If you’d like to follow along at home, you can download this Jupyter notebook from GitHub, and there’s a link to it in the description below. In this video, you’ll learn what web scraping is and why it’s useful. I’ll also explain the three basic facts about HTML that you need to know in order to get started with web scraping.

So let’s start with what web scraping is. On July 21st, 2017, the New York Times updated an opinion article called Trump’s Lies, detailing every public lie the president has told since taking office. Because this is a newspaper, the information was of course published as a block of text. This is a great format for human consumption, but it can’t easily be understood by a computer. In this tutorial, we’ll extract the president’s lies from the New York Times article and store them in a structured dataset.

Now this is a common scenario. You find a web page that contains data you want to analyze, but it’s not presented in a format that you can easily download and read into your favorite data analysis tool. You might imagine manually copying and pasting the data into a spreadsheet, but in most cases that is way too time-consuming. A technique called web scraping is a useful way to automate this process.

So what is web scraping? It’s the process of extracting information from a web page by taking advantage of patterns in the web page’s underlying code. Let’s start looking for these patterns. Take a second and notice how the article presents its information. When converting this article into a dataset, you can think of each lie as a record with four fields. First is the date of the lie. Second is the lie itself as a quotation.
Third is the writer’s brief explanation of why it was a lie, and fourth is the URL of the article that substantiates the claim that it was a lie. Now importantly, those four fields have different formatting, which is consistent throughout the article. The date is bold red text, the lie is regular text, the explanation is gray italic text, and the URL is linked from that gray italic text.

So why does the formatting matter? Because it’s very likely that the code underlying the web page tags those fields differently, and we can take advantage of that pattern when scraping the page.

So let’s take a look at the source code for this page, known as HTML. To view the HTML code that generates a web page, you right-click on it and select View Page Source in Chrome or Firefox, View Source in Internet Explorer, or Show Page Source in Safari. Now if that option doesn’t appear in Safari, just open Safari preferences, select the Advanced tab, and check “Show Develop menu in menu bar.” Again, that’s for Safari users only.

So, notice the first few lines you’ll see when you’re viewing the source of the New York Times article. Then, let’s locate the first lie by searching the HTML for the text Iraq. So Ctrl+F and then “iraq”. Thankfully you only have to understand three basic facts about HTML in order to get started with web scraping.

Fact one is that HTML consists of tags. You’ll see that the HTML contains the article text along with tags, specified using angle brackets, that mark up the text. HTML actually stands for Hypertext Markup Language. For example, one tag is <strong>, which means use bold formatting. There’s a <strong> tag before January 21st and a </strong> tag after January 21st. The first is an opening tag and the second is a closing tag, denoted by this forward slash. Together they tell the web browser where to start and where to stop applying the formatting. In other words, this tag tells the web browser to make the text January 21st bold. Now don’t worry about this &nbsp;.
We’ll deal with that later.

Okay, fact two: HTML tags can have attributes, which are specified in the opening tag. For example, <span class="short-desc"> indicates that this particular span tag has a class attribute with a value of short-desc. Now for the purpose of web scraping, you don’t actually need to understand the meaning of span, class, or short-desc. Instead, you just need to recognize that tags can have attributes and that they are specified in this particular way. So that’s fact two.

Fact three is that tags can be nested. Let’s pretend my HTML code said: Hello <strong><em>Data School</em> students</strong>. Okay? The text Data School students would all be bold, because all of that text is between the opening strong tag and the closing strong tag. The text Data School would also be in italics, because the em tag means use italics. The text Hello would not be bold or italic, because it’s not within either the strong or em tags. Thus it would appear as follows: Hello (plain), then Data School (bold and italics), and then students (just in bold). The central point to take away from this example is that tags mark up text from wherever they open to wherever they close, regardless of whether they are nested within other tags.

Okay, if you’ve got that, you now know enough about HTML to start web scraping. In the next video, we’ll read the New York Times article into Python, parse its HTML using the Beautiful Soup library, and then start building our dataset by taking advantage of the patterns we noticed in the article formatting.

If you have a question or a tip, let me know in the comments section below. Please click subscribe if you like this video. Thank you so much for joining me and I hope to see you again soon!
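
If you’d like to see the three HTML facts in action before moving on, here is a minimal sketch. It uses Python’s built-in html.parser module rather than Beautiful Soup, which the next part covers. The names TagTracker and describe are made up for this illustration, and the span’s text is a placeholder rather than a quote from the article. The handling of closing tags also assumes the HTML is well formed.

```python
from html.parser import HTMLParser


class TagTracker(HTMLParser):
    """Hypothetical helper: record text along with the tags that enclose it."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tags that are currently open
        self.attributes = []  # (tag, attribute name, attribute value) triples
        self.pieces = []      # (text, enclosing tags) pairs

    def handle_starttag(self, tag, attrs):
        # Fact 1: an opening tag starts marking up text.
        self.open_tags.append(tag)
        # Fact 2: attributes are specified in the opening tag.
        for name, value in attrs:
            self.attributes.append((tag, name, value))

    def handle_endtag(self, tag):
        # Assumes well-formed HTML: a closing tag matches the last opened tag.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Fact 3: text is affected by every tag still open around it.
        text = data.strip()
        if text:
            self.pieces.append((text, tuple(self.open_tags)))


def describe(html):
    """Return the (text, tags) pairs and attribute triples found in html."""
    tracker = TagTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.pieces, tracker.attributes


# The nested example from the video.
pieces, _ = describe("Hello <strong><em>Data School</em> students</strong>")
for text, tags in pieces:
    print(repr(text), tags)
# 'Hello' ()
# 'Data School' ('strong', 'em')
# 'students' ('strong',)

# A tag with an attribute; the text inside the span is a placeholder.
_, attributes = describe('<span class="short-desc">placeholder text</span>')
print(attributes)
# [('span', 'class', 'short-desc')]
```

Each piece of text is reported with whichever tags were open around it, which matches the point about nesting: Data School sits inside both strong and em, students only inside strong, and Hello inside neither.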
