The Legality and Ethics of Web Scraping in 2020 ⚖️ (Data Nerd Newsletter #27)

🍺 First, a short(ish) story about a brewery API

I maintain an open-source side project called Open Brewery DB. It’s a public and transparent dataset and API of breweries, cideries, and bottle shops in the United States. Most users are hobbyists or students (or hackers), and it fields roughly 8 million requests a month.

The brief history of the API is that I wanted to build a map of breweries that were dog-friendly. When I researched, I found most brewery datasets were either very old or behind a paywall. So, I started building my own.

Open Brewery DB was born when I scraped the dataset from Brewers Association. It was a one-time scrape, the dataset and API are free and open-source, and I’ve always given BA attribution. At least that’s what I told myself at the time to make myself feel better.

The problem I’m facing is that the dataset is over two years old and needs an update. There have been new additions, updates, and sadly some breweries have closed.

Since web scraping is the easiest way to get the information, I figured it would be best to review the legalities and consider the ethics with web scraping.

🤷‍♂️ Why care?

While there are quite a few well-maintained datasets out there, if you need to gather some data and some programming chops, web scraping is the fastest and easiest way to compile custom information from websites.

👩‍⚖️ Is web scraping legal?

The short answer is yes, web scraping is legal as of this newsletter (September 2020).

In September 2019, the Ninth Circuit court in California ruled in favor of the company hiQ, who sued LinkedIn for blocking their web scraping efforts. LinkedIn is appealing to the Supreme Court.

In general, the legalities of web scraping fall under the Computer Fraud and Abuse Act (CFAA), which prohibits accessing a computer without authorization. hiQ argues that they only scrape public user profile data. LinkedIn contests:

“Whether a company that deploys anonymous computer ‘bots’ to circumvent technical barriers and harvest millions of individuals’ data from computer servers that host public-facing websites—even after the computer servers’ owner has expressly denied permission to access the data—‘intentionally accesses a computer without authorization’ in violation of the Computer Fraud and Abuse Act.”

So, since LinkedIn specifically denied access to hiQ’s bots, that signifies “without authorization.”

Additionally, LinkedIn paints a bleak login-only future if the world belongs to web scrapers:

“The decision below wrongly requires websites to choose between allowing free riders to abuse their users’ data and slamming the door on the benefits to their users of the Internet’s open forum.”

Given the consistently changing technology space, it seems like the CFAA may need some updates to handle some cases. I think it’s unlikely that web scraping will become completely illegal, but it’s also worth considering some ethics surrounding scraping.

Deep Dive:
LinkedIn Files Petition to the Supreme Court in hiQ Web Scraping Case | New Media and Technology Law Blog
hiQ Labs, Inc. v. LinkedIn Corporation (No. 17-16783)

⚖️ Is web scraping ethical?

James Densmore has an excellent write-up on what he considers ethical web scraping. He presents his ideas in a personal declaration form:

I, the web scraper will live by the following principles:

If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping all together.

I will always provide a User Agent string that makes my intentions clear and provides a way for you to contact me with questions or concerns.

I will request data at a reasonable rate. I will strive to never be confused for a DDoS attack.

I will only save the data I absolutely need from your page. If all I need it OpenGraph meta-data, that’s all I’ll keep.

I will respect any content I do keep. I’ll never pass it off as my own.

I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.

I will respond in a timely fashion to your outreach and work with you towards a resolution.

I will scrape for the purpose of creating new value from the data, not to duplicate it.

I particularly like the personal responsibility of the User-Agent string (ex, User-Agent: “Public Brewery Data Extractor (info@openbrewerydb.org)”) combined with being communicative.

My only addition is to respect the “Terms of Use,” if there’s one posted. If still concerned or an unusually special use case, it never hurts to contact the site owner.

Deep Dive
Ethics in Web Scraping | James Densmore

🚀 Summary

While web scraping is entirely legal, be a responsible web scraper and follow some simple guidelines to achieve complete data happiness.