5 Things You Need to Know Before Scraping Data From Facebook
Wednesday, January 30, 2019
1. Actually, Facebook disallows any scraper, according to its robots.txt file.
When planning to scrape a website, you should always check its robots.txt first. Robots.txt is a file websites use to tell "bots" whether and how the site may be scraped, crawled and indexed. You can access the file by adding "/robots.txt" to the end of the root URL of your target website.
Enter https://www.facebook.com/robots.txt in your browser, and let's check the robots file of Facebook. These two lines can be found at the bottom of the file:
User-agent: *
Disallow: /
The lines state that Facebook prohibits all automated scrapers. That is, no part of the website should be visited by an automated crawler.
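The same check can be done in code. The sketch below uses Python's standard `urllib.robotparser` and assumes the file ends with a catch-all rule of the form `User-agent: *` / `Disallow: /`. The rules are written out by hand rather than fetched, and the helper `is_allowed` and the user agent name `my-crawler` are illustrative choices, not anything Facebook defines.

```python
from urllib.robotparser import RobotFileParser

# Assumed catch-all rule, written out here instead of fetched over the network.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /",
]


def is_allowed(robots_lines, user_agent, url):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "my-crawler" is a hypothetical user agent name.
    print(is_allowed(ROBOTS_LINES, "my-crawler", "https://www.facebook.com/"))
```

To check a live site instead, `RobotFileParser` also offers `set_url()` and `read()`, which download the file before `can_fetch()` is called.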
Why do we need to respect robots.txt?
Websites use the robots file to specify a set of rules on how you or a bot should interact with them. When a website blocks all access to crawlers, the best thing to do is to leave that site alone. Following the robots file helps you avoid unethical data gathering as well as legal ramifications.
2. Technically, the only legal way to collect data from Facebook with a crawler is to obtain prior written permission
Facebook warns at the very beginning of their robots file: "Crawling Facebook is prohibited unless you have express written permission."
Follow the link on the second line of the file and you will find Facebook's Automated Data Collection Terms, last revised on April 15th, 2010.
Like any other terms and conditions in the world, Facebook's Automated Data Collection Terms are long (in an abnormally small font size) and full of legal terms that few people can fully understand.
These terms look familiar, since we see similar ones each time we install a new app on our mobile phone or sign up for a website.
- "By obtaining permission to…you agree to abide by…"
- "You agree that you will not…"
- "You agree that any violation of these terms may result in…"
However, these terms may not be as harmless as the ones we usually skip.
As a social media giant, Facebook has money, time and a dedicated legal team. You can proceed with scraping Facebook while ignoring their Automated Data Collection Terms, but be warned that they have reminded you to at least obtain "written permission". They can be quite aggressive towards unauthorized scraping.
3. But you are still technically able to scrape the data you need from Facebook
Crawling a site without respecting its robots.txt does not necessarily mean you will run into legal complications for breaking the rules.
Data scraped from social media is undoubtedly the largest and most dynamic dataset about human behavior and real-world events. For more than a decade, researchers and business experts around the world have harvested information from Facebook using scrapers, producing representative samples to understand individuals, groups and society, as well as exploring brand new opportunities hidden in the data.
Users, too, would agree that the use of social data is not always a bad thing. For example, using social data to personalize marketing helps keep the internet free and makes the ads and content we see more relevant.
Tools you could use for obtaining Facebook data
In response to the public outcry following the Cambridge Analytica scandal, Facebook implemented dramatic access restrictions on its APIs in April 2018.
Application Programming Interfaces (APIs) are software interfaces designed for consumption by computer programs, which allow people to retrieve large-scale data with automated processes. Nowadays many companies provide a public API as a means for users, researchers and third-party app developers to access their infrastructure.
Whether Facebook's API lockdown and radical data access restrictions really protect its users' information is open to debate. Either way, people are now left with only one choice.
Without APIs, we can only obtain Facebook data through the interfaces built for users, that is, the web pages. This is exactly where web scrapers come into play. We have written a blog post about some of the best social media scraping tools. 👉 Check our article Top 5 Social Media Scraping Tools for 2020
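To make the difference between an API and a web page concrete, here is a minimal sketch in Python. Both the JSON payload and the HTML snippet are invented sample data, and the class names `page-name` and `followers` are assumptions made for illustration; neither reflects Facebook's real API or markup. The point is only that an API hands back named fields, while a scraper has to dig the same values out of markup meant for human readers.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: structured, so fields are read by name.
API_RESPONSE = '{"page": "Example Bakery", "followers": 1200}'

# Hypothetical web page carrying the same facts inside markup.
PAGE_HTML = """
<html><body>
  <h1 class="page-name">Example Bakery</h1>
  <span class="followers">1200</span>
</body></html>
"""


class ClassTextExtractor(HTMLParser):
    """Collect the text of each element, keyed by its class attribute."""

    def __init__(self):
        super().__init__()
        self.current_class = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # Remember the class of the element we just entered (None if absent).
        self.current_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        self.current_class = None

    def handle_data(self, data):
        text = data.strip()
        if self.current_class and text:
            self.fields[self.current_class] = text


def from_api(payload):
    """Read the record from a JSON payload."""
    record = json.loads(payload)
    return {"page": record["page"], "followers": record["followers"]}


def from_html(markup):
    """Recover the same record by parsing the page markup."""
    parser = ClassTextExtractor()
    parser.feed(markup)
    return {
        "page": parser.fields["page-name"],
        "followers": int(parser.fields["followers"]),
    }


if __name__ == "__main__":
    print(from_api(API_RESPONSE))
    print(from_html(PAGE_HTML))
```

The HTML path depends on class names that a site can change at any time, which is one reason scraped pipelines tend to be more fragile than API-based ones.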
4. With GDPR in force, however, there is a greater chance of being sued if you scrape personal data
The EU General Data Protection Regulation, or GDPR as it is more commonly known, came into force on 25th May 2018. It is said to be the most important change in data privacy regulation in 20 years, set to force sweeping changes in everything from technology to advertising, and medicine to banking.
Companies and organizations that hold and process large amounts of consumer data, such as technology firms like Facebook, are affected the most under GDPR. Previously, it was largely up to these companies to enforce their own rules for protecting user data. Now, under GDPR, they need to make sure they are in full compliance with the law.
The good news is…
GDPR only applies to personal data.
Here "personal data" refers to data that could be used to directly or indirectly identify a specific individual. This kind of information is known as Personally Identifiable Information (PII), which includes a person's name, physical address, email address, phone number, IP address, date of birth, employment info and even video or audio recordings.
If you aren't scraping personal data, then GDPR does not apply.
In short, unless you have the person's explicit consent, it is now illegal under GDPR to scrape an EU resident's personal data.
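One practical consequence is to avoid keeping personal fields in the first place. The Python sketch below drops such fields from a record before it is stored. The field names in `PERSONAL_FIELDS` are hypothetical keys modeled on the PII examples listed above, and the helper `strip_personal_fields` is an illustration of the idea rather than a compliance tool.

```python
# Hypothetical record keys, modeled on the PII examples in the text:
# name, physical address, email, phone, IP address, date of birth,
# employment info, and video/audio recordings.
PERSONAL_FIELDS = {
    "name",
    "physical_address",
    "email",
    "phone",
    "ip_address",
    "date_of_birth",
    "employment",
    "recording_url",
}


def strip_personal_fields(record):
    """Return a copy of record without any key listed in PERSONAL_FIELDS."""
    return {key: value for key, value in record.items() if key not in PERSONAL_FIELDS}


if __name__ == "__main__":
    # Invented sample record.
    scraped = {
        "name": "Jane Doe",
        "email": "jane@example.com",
        "post_text": "Great coffee here",
        "like_count": 3,
    }
    print(strip_personal_fields(scraped))
```

Removing named fields is only a first step: since personal data also covers information that identifies someone indirectly, free text such as `post_text` may still need review.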
5. And you could try alternative sources to Facebook for your scraping project
As mentioned above, though Facebook prohibits all automated crawlers, it is still technically feasible to scrape data from the site. The problem is —
It is risky.
Apart from the legal ramifications, you may find it harder to retrieve the desired data on a regular basis, as Facebook blocks suspicious IPs and could implement even stricter blocking mechanisms in the future, which may make scraping data from the site impossible.
Hence, it is recommended to look for more reliable sources for social media data to gain business intelligence and insights on your target market.
Four alternative data sources to Facebook
- Twitter
With about 500 million tweets generated per day, Twitter is a sea of information that can be used as a great source for brand monitoring and customer sentiment measurement. Unlike Facebook, Twitter allows people to retrieve data on a large scale via Twitter's APIs.
- Reddit
With as many users as Twitter, Reddit is one of the greatest sources of UGC (User Generated Content) in the world. Reddit also provides public APIs that can be used for a variety of purposes such as data collection, automatic commenting bots, or even to assist in subreddit moderation.
- VKontakte (VK)
VK is a Russian social media platform geared toward Russian and other Eastern European users. It boasts over 90 million unique visitors per month and 9 billion page views every day. As a Russian company, VK adheres to Russian laws, and if you check its robots file you'll find it is quite friendly to crawlers.
- Instagram
Owned by Facebook, Instagram focuses more on visual content sharing, especially videos and pictures. Many brands use the platform to humanize their content, connect better with customers and grow brand awareness. Alongside Facebook's data lockdown in 2018, however, Instagram also implemented radical restrictions on data access, which made the site a much less reliable data source than before.
Edit: Ashley Weldon