Data scraping from a South African perspective

By Dérick Swart on 23 May 2022

LinkedIn has for several years been trying to stop a data analytics company from scraping its publicly visible content, due to concerns about users' privacy and unlawful competition. The U.S. Court of Appeals for the Ninth Circuit recently confirmed its previous view that web scraping is legal in certain circumstances and prevented LinkedIn from blocking it.

LinkedIn's action was motivated at least in part by some bad press that it had unfairly suffered, namely that it was hacked and the data of countless users compromised, when in fact the information had been scraped without its consent.

This post explores some initial thoughts on the position under South African law.

What is scraping and where is it used?

Scraping involves programmatically reading content from any source document, with or without further persistence (i.e. storage) and processing. Scraping is most often found on the web, but the technique can also be used to read any document or even a display.
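Using C# and the .NET Framework, it is easy to get the text from any publicly accessible web page (the snippet assumes the System.IO and System.Net namespaces are imported):
  • HttpWebRequest request = (HttpWebRequest)WebRequest.Create("[insert URL here]");
  • HttpWebResponse response = (HttpWebResponse)request.GetResponse();
  • string text = new StreamReader(response.GetResponseStream()).ReadToEnd();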
Once you have the text, you can manipulate it however you want. If you are extracting specific data elements, you need to know where to find them in the text, and that location can change whenever the page layout changes. For this reason the process is typically fairly inefficient and error-prone compared to, for instance, consuming an application programming interface ("API") exposed by the source.
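The brittleness described above can be illustrated with a minimal sketch in C#, assuming the page text has already been fetched into a string. The HTML fragments, the "usd-zar" element id and the RateExtractor class are all hypothetical and exist only for this illustration; a real page will have its own markup.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class RateExtractor
{
    // Hypothetical page layout: the rate sits in <span id="usd-zar">15.87</span>.
    private static readonly Regex RatePattern =
        new Regex("<span id=\"usd-zar\">\\s*([0-9]+\\.[0-9]+)\\s*</span>");

    // Returns the rate, or null when the expected markup is not found.
    public static decimal? ExtractRate(string html)
    {
        Match match = RatePattern.Match(html);
        if (!match.Success)
        {
            return null;
        }
        return decimal.Parse(match.Groups[1].Value, CultureInfo.InvariantCulture);
    }

    public static void Main()
    {
        // The pattern matches this (assumed) layout.
        string page = "<p>USD/ZAR: <span id=\"usd-zar\">15.87</span></p>";
        Console.WriteLine(ExtractRate(page).HasValue);

        // The same figure after a cosmetic redesign: the extraction finds nothing.
        string redesigned = "<p>USD/ZAR: <strong class=\"rate\">15.87</strong></p>";
        Console.WriteLine(ExtractRate(redesigned).HasValue);
    }
}
```

The second call comes back empty because the pattern is tied to one specific layout, which is why a published API is usually the more reliable route.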

Scraping is most often used on publicly accessible web sites, but it is also possible to apply the technique to pages that require prior authentication.  In such cases, the code (i.e. bot) would programmatically log in and then perform the scraping actions.  More sophisticated bots can mimic user behaviour without the application detecting that it is interacting with a non-human.

Thus, on the shallow end a bot can programmatically read and display weather or exchange rate data from a third-party source or, on the deep end, a bot can log into an internet banking application and perform transactions on the user's behalf.

I should also mention that in recent years a lot of effort has been put into securing applications from automated access and so the vulnerability of such systems has been greatly reduced, for instance by multi-factor authentication ("MFA") or CAPTCHA (acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart").  

What happened in the LinkedIn case?

HiQ used data scraped from public sections of LinkedIn to create reports for corporate customers, identifying which of their employees are most likely to quit and which are most likely to be targeted by recruiters.  

This is sometimes called "open-source intelligence" or "OSINT" and it is a multibillion-dollar industry, especially where multiple data sources are blended to derive unique and valuable insights using big data tools.  

LinkedIn sought to rely on the American Computer Fraud and Abuse Act ("CFAA"), which forbids individuals from intentionally accessing a protected computer without authorization or exceeding authorized access.  

The court upheld the line of reasoning that, where a party has lawful access to public information, criminalising terms of service violations would “attach criminal penalties to a breath-taking amount of commonplace computer activity”. In short, if data is publicly accessible then the CFAA could not be used to criminalise access to it. The same would not apply to data that requires authenticated access.

Importantly, the case only dealt with the application of the CFAA, and that legislation has very specific wording that differs from that found in our Cybercrimes Act.

Legal considerations in South Africa

The Electronic Communications and Transactions Act

Search engine operators rely heavily on scraping bots to cache (i.e. copy) web pages, which are then indexed to provide a lightning-fast response. When used in this manner, scraping bots are often called "web crawlers".

Section 76 of the act provides relief to a service provider that uses "information location tools" under specified circumstances.  The section provides that in such a case, the service provider "…is not liable for damages incurred by a person if the service provider refers or links users to a web page containing an infringing data message or infringing activity, by using information location tools, including a directory, index, reference, pointer, or hyperlink…".

The Copyright Act

The Copyright Act protects compilations of data and it could therefore potentially constitute copyright infringement to scrape a significant portion of the data, such as a table of data.

Where images are scraped, an infringement of copyright would likely be committed if a copy is persisted in any manner.  

The Protection of Personal Information Act

Where personal information is involved, any "processing" that takes place in South Africa would typically constitute a regulated activity in terms of our data privacy legislation. For instance, if a third-party data scraper falls within the definition of a "responsible party" and does not enjoy the benefit of an exemption or other relief, any processing of personal information gathered via scraping will likely be unlawful, and there will moreover be a duty in terms of section 18 to notify data subjects of the processing of their personal information.

Where data about a person is combined from multiple sources using unique identifiers, prior authorisation from the regulator will be required.

There are technical measures that can be implemented to defend against unauthorised data scraping activities. Our data privacy legislation, like most others, requires responsible parties to take reasonable steps to protect against unlawful actions in relation to personal data in their care. It is an interesting question whether deploying such measures to frustrate data scraping would be considered a reasonable and necessary response to achieving compliance with the law in this regard…

The Cybercrimes Act

Section 2(2)(a) provides that "[a]ny person who unlawfully and intentionally accesses a computer system or a computer data storage medium, is guilty of an offence."

To "access" includes to "use", which in turn includes "…obtaining the output…" of a computer program or of data per section 2(2)(c).  

In my view it is arguable that if an application communicates terms of service in a legally binding manner and specifies that no automated access to data is permitted, intentionally continuing such access should constitute a criminal offence as it is unauthorised.

In the LinkedIn case, a cease-and-desist letter was sent to clearly inform HiQ that they were not authorised to scrape data and were contravening the terms of service.  To my mind, this should also suffice under South African law.  

We will have to see what our courts do here – do they give effect to the broad language of the act or apply restraint in keeping with the American precedent?

Then there is the intriguing section 12, which states that the common law crime of theft "…must be interpreted so as not to exclude the theft of incorporeal property".  While theft can be committed with any act where property is unlawfully appropriated, it will likely require a permanent deprivation of the particular intellectual property and so scraping would not constitute theft in this sense. 

The Regulation of Interception of Communications and Provision of Communication-related Information Act

Constitutional issues aside, I would not typically classify scraping as interception or monitoring of direct or indirect communications, as there is typically communication directly between the scraping bot and the resource being interrogated.  It is however technically possible to monitor a conversation using the technique.

There are also some provisions in the act regarding overcoming security measures, which may come into play in certain cases.

The common law

If terms of service have been communicated in a legally binding manner, then contractual remedies and injunctive relief will of course come into play.

Additionally, unlawful competition is a broad remedy in our law that can serve to protect against sharp practice, such as where someone unlawfully competes using the fruits of another's labour.

Our courts have (rightly) been slow to use this remedy so as not to unduly limit competition in the free market.  Even so, I am sure there will be scenarios where the competing use of scraped data will be such that a court will find it crosses the line and is unlawful.  

The burden of proof may be somewhat easier to discharge if the complainant can show that the scraping activity places an undue load on its computing infrastructure. This is entirely possible. If the scraper does not purposefully limit the number of requests, the code will execute as fast as it can with the computing resources at its disposal.
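By way of illustration, a scraper that does limit its request rate might look something like the sketch below. The RequestThrottle class, the two-second interval and the example.com URLs are assumptions made for this example only, and the actual fetch is left out.

```csharp
using System;
using System.Threading;

// Enforces a minimum interval between consecutive requests.
public sealed class RequestThrottle
{
    private readonly TimeSpan minInterval;
    private DateTime? lastRequestUtc;

    public RequestThrottle(TimeSpan minInterval)
    {
        this.minInterval = minInterval;
    }

    // How long the caller should still wait before a request made at 'nowUtc'.
    public TimeSpan DelayNeeded(DateTime nowUtc)
    {
        if (lastRequestUtc == null)
        {
            return TimeSpan.Zero;
        }
        TimeSpan elapsed = nowUtc - lastRequestUtc.Value;
        return elapsed >= minInterval ? TimeSpan.Zero : minInterval - elapsed;
    }

    // Blocks until the interval has passed, then records the request time.
    public void WaitForTurn()
    {
        TimeSpan delay = DelayNeeded(DateTime.UtcNow);
        if (delay > TimeSpan.Zero)
        {
            Thread.Sleep(delay);
        }
        lastRequestUtc = DateTime.UtcNow;
    }
}

public static class ThrottleDemo
{
    public static void Main()
    {
        var throttle = new RequestThrottle(TimeSpan.FromSeconds(2)); // assumed limit
        string[] urls = { "https://example.com/a", "https://example.com/b" }; // placeholders
        foreach (string url in urls)
        {
            throttle.WaitForTurn();
            Console.WriteLine("Would request " + url); // the actual fetch is omitted
        }
    }
}
```

Without the call to WaitForTurn, the loop would issue requests back to back, which is the behaviour a complainant would point to when alleging undue load.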

How to protect against unauthorised data scraping and mining

Technical measures

Deploy technical measures that detect and block offending requests to protect against attempted automated access, for instance by scraping bots.

In addition to the aforementioned measures such as MFA and CAPTCHAs, masking information can also protect it against scraping. This could be achieved by, for instance, having a JavaScript function that only reveals an email address when someone clicks on a link.

Terms of service

Publish terms of service in a legally binding manner that expressly deny automated access, other than for public pages that are scraped by bona fide search engines.

Since automated access may bypass a webpage in human-readable form and open up the argument that no human had sight of and accepted the terms of service, the fact that access is regulated by terms of service may also be included in metadata (for instance in the header provided in response to a web request).  

The Robots Exclusion Protocol allows for a file (robots.txt) to be published that instructs search engines on the pages they may scrape and cache. I am not aware of law that expressly obliges a search engine to obey your robots.txt file, but it can potentially be used to indicate which pages are authorised for indexing and which are not.
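As a rough sketch of how a well-behaved bot could consult such a file before requesting a page, consider the C# below. The RobotsCheck class and the sample robots.txt content are hypothetical, and the parsing is deliberately simplified: it only honours Disallow rules in groups addressed to all crawlers ("User-agent: *") and ignores Allow rules, wildcards and crawler-specific groups.

```csharp
using System;
using System.Collections.Generic;

public static class RobotsCheck
{
    // Collects the Disallow path prefixes that apply to all crawlers.
    public static List<string> DisallowedForAll(string robotsTxt)
    {
        var prefixes = new List<string>();
        bool inWildcardGroup = false;
        bool previousWasAgent = false;
        foreach (string rawLine in robotsTxt.Split('\n'))
        {
            string line = rawLine.Split('#')[0].Trim(); // drop comments and whitespace
            int colon = line.IndexOf(':');
            if (colon < 0)
            {
                continue; // blank or malformed line
            }
            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();
            if (field == "user-agent")
            {
                // Consecutive User-agent lines belong to the same group.
                inWildcardGroup = (previousWasAgent && inWildcardGroup) || value == "*";
                previousWasAgent = true;
                continue;
            }
            previousWasAgent = false;
            if (field == "disallow" && inWildcardGroup && value.Length > 0)
            {
                prefixes.Add(value);
            }
        }
        return prefixes;
    }

    // True when no collected Disallow prefix matches the start of the path.
    public static bool IsAllowed(string robotsTxt, string path)
    {
        foreach (string prefix in DisallowedForAll(robotsTxt))
        {
            if (path.StartsWith(prefix, StringComparison.Ordinal))
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        // Hypothetical file: public pages may be crawled, member profiles may not.
        string robotsTxt = "User-agent: *\nDisallow: /members/\nDisallow: /search\n";
        Console.WriteLine(IsAllowed(robotsTxt, "/about"));
        Console.WriteLine(IsAllowed(robotsTxt, "/members/12345"));
    }
}
```

Nothing in the file technically prevents a bot from skipping this check, which is why its main value in a dispute is as evidence of which access the site owner did and did not authorise.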

Dummy data

This suggestion is a little tongue in cheek, but nonetheless has merit for certain scenarios.  

The publishers of telephone directories were the first victims of wholesale copying of their datasets and initially battled to be able to prove copying of their work because a lot of the information they compiled was in the public domain.  What they did to help show that copying took place was to insert dummy data into their datasets, thereby making it very easy to show that their data was in fact copied.  
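The same idea can be applied to a digital dataset. The sketch below assumes the publisher keeps a list of the fictitious entries it seeded and simply checks whether any of them turn up in a suspect dataset; the CanaryCheck class and all the entries are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CanaryCheck
{
    // Returns the seeded dummy entries that also appear in the suspect dataset.
    public static List<string> FindCopiedCanaries(
        IEnumerable<string> canaries, IEnumerable<string> suspectDataset)
    {
        // Case-insensitive comparison, so a change in capitalisation does not hide a copy.
        var suspect = new HashSet<string>(suspectDataset, StringComparer.OrdinalIgnoreCase);
        return canaries.Where(c => suspect.Contains(c)).ToList();
    }

    public static void Main()
    {
        // Fictitious entries the publisher deliberately seeded into its own directory.
        var canaries = new[]
        {
            "Zebedee Quillfeather, 021 555 0143",
            "Orla Vandermist, 011 555 0188"
        };

        // Entries found in a third party's directory (hypothetical data).
        var suspect = new[]
        {
            "A. Example, 031 555 0101",
            "zebedee quillfeather, 021 555 0143",
            "B. Sample, 012 555 0122"
        };

        foreach (string hit in FindCopiedCanaries(canaries, suspect))
        {
            Console.WriteLine("Seeded entry found in suspect data: " + hit);
        }
    }
}
```

Because a seeded entry does not correspond to any real subscriber, its presence in someone else's dataset is hard to explain other than by copying.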

In conclusion

I think the term "open-source intelligence" is a misnomer. You find the same confusion when it comes to open-source software. The fact that something is in the public domain does not mean that it is free or can be used without conditions or compliance with applicable law.

On my first take, LinkedIn should probably be able to prevent its website from being mined by third parties insofar as South African law applies.

At the risk of stating the obvious, data is valuable and proprietors should take steps to protect it from unauthorised automated access and exploitation.

Notes

See hiQ Labs Inc. v. LinkedIn Corp. Deems Data Scraping not Unlawful (https://www.natlawreview.com/article/hiq-labs-v-linkedin) for a high-level review of the American caselaw on this point. 



Please note that our blog posts are informal commentaries on developments in the law as at the time of publication and not legal advice. You should place no reliance on our blog posts; we look forward to discussing your particular matter with you.