Will the real public data please stand up?

Understanding 'publicly available data' and the conundrum of using it.

Jan 23, 2024

Welcome back to Legitimately Interested, my fortnightly newsletter on data protection and privacy!

[Before we get into it, the content of this newsletter will always be free. But if you would like to support my work, consider taking a paid subscription. If you don’t want to commit to one, I also welcome one-time contributions here].

In the past fortnight, we’ve seen the blow-up over cap table management platform Carta (I posted about this here, and here is a good summary) and its use of confidential customer data for a secondary sale. We’ve also seen continued controversy over the lawsuit between the New York Times and OpenAI, on the use of what the NYT alleges are copyrighted articles to train GenAI models.

Both incidents raise multiple issues beyond the scope of data protection and privacy, including some very interesting discussions on copyright and fair use in the NYT dispute. Still, a data protection issue at the heart of both is the usage of publicly available data. OpenAI has consistently maintained that it only uses public data for training its models, and Carta claimed that, apart from this one instance caused by a rogue employee, it has only used public information for any secondary sales.

So at this juncture, I thought it worthwhile to investigate how using public data works, what the question marks are, and how regulators are working with it.

Issue in Focus

The primary question: Most data protection laws, including the GDPR and India’s DPDPA, have exceptions for public data. The GDPR’s exception is fairly narrow: ‘sensitive’ personal data can be processed when it has been ‘manifestly’ made public by the concerned individual - for example, if I posted on social media about my religious views or health conditions. There is some argument as to when data would ‘manifestly’ be made public: if I posted this to a specific group where an admin needs to grant people access, that would not show an intention to make the information truly public, as opposed to a post with unrestricted viewership.

However, public data isn’t excluded from the scope of the GDPR entirely - it is still personal data, but can be used subject to whether it was manifestly made public by the concerned person or by a public authority. As an interesting contrast, India’s DPDPA excludes public data from the material scope of the law entirely, where the personal data has been made publicly available either by the individual to whom it relates, or by any other person due to a requirement under law.

To take the same example as above: in India, my data could be considered data I’ve made publicly available and no longer personal, even without an express intention - say, if my profile details on LinkedIn are on a public setting rather than restricted to my connections, yet are not available to users who don’t have a LinkedIn account. A company looking to conduct web crawling on LinkedIn could argue that it doesn’t matter that others cannot view my profile without a LinkedIn account - the fact that I did not activate any restrictions is proof enough. Their case would be even stronger if, for example, I put up a review on Amazon, which is meant to be publicly viewed.

The secondary question: Just with these limited provocations, I think it’s safe to say that what public data IS remains a matter of interpretation in the first place, before we even come to the secondary question of whether and how it can be used. In fact, a simple Google search for ‘public data definition’ gave me multiple takes: 1) that it is data made available by government bodies or local collectives, 2) that it is any data under an open source license, 3) in contrast, articles distinguishing the concepts of public data and open data, and 4) that it is any data in the public domain, personal or not. There are also much larger questions about data ownership itself - here’s an interesting piece of scholarship for anyone wanting further reading.

Legal basis: The secondary question of how public data can be used is fundamentally linked to the legal basis on which that data was collected - i.e., whether it was based on consent, a legitimate use/interest, or contractual necessity. Of course, under a law like the DPDPA, where public data is excluded from personal data altogether, the question of a legal basis does not arise.

Even so, it is still useful to take a look at these from the perspective of risk mitigation, since there is ambiguity on what exactly public data is.

Here is a good summary of these bases in the context of the GDPR, which also offers useful guidance in general: usage should be aligned with the original purpose, data subjects should be informed when data about them is obtained from other sources (see this useful guidance from the UK ICO), and the expectations of the data subject, their vulnerability and the sensitivity of the data should all be considered.

Even though these factors wouldn’t be mandated by law in the Indian context, it might still make sense (even if only from a utilitarian perspective) to consider them, given the reputational harm to an organisation when data is used in ways individuals find undesirable - just take Carta, which had to shut down its entire secondary trading vertical. Also, if India does get some distinguishing standards for what counts as data made public by a data subject and what doesn’t, a Data Fiduciary which is at least mindful of these issues will be much better prepared to tackle them.

Public data and AI: Of course, all roads these days lead to AI - which brings us to the most pressing use case for public data, training GenAI models - where OpenAI is now locked in a dispute with the New York Times. Setting aside the copyright angle (although here is a great piece on these competing legal issues), this case will also have implications for understanding the line between AI training and personal information, who would need to give consent (whether the authors or the Times), the implications of storage practices for the legality of training models, and the possibility that personal data can be retrieved from AI models.

A lot of people I’ve spoken to wager that the Indian position of giving a blanket exemption for public data is meant to incentivise AI innovation in the country and establish India as an AI-friendly jurisdiction. This may very well be, but I do anticipate confusion as the implementation of the law evolves, especially in the context of applications like web scraping and secondary markets for the resulting insights. The UK ICO has released a stakeholder consultation series on the practice of web scraping for training LLMs and other AI models, given that the possibility of personal data extraction gives rise to privacy concerns.

As we saw in the first issue, an increasing number of people care very much about how their data is used online, even if they can’t fully understand the scope of its usage. As I mentioned above, using public data is as much a reputational and business issue as it is one under privacy law - the average daily user may not be concerned about what is or isn’t exempted under law if they think there’s been an unethical use of their ‘public’ information.

Twitter/X also recently amended its privacy policy to say that publicly available information on the platform will be used to train its AI models, which caused some buzz about privacy concerns. To this, Elon Musk tweeted, ‘Just public data, not DM’s or anything private’. After all this investigation, this brings us (or at least me) back to the question - what IS public data, and is all of it not private?

Privacy Roundup

EU AI Act Text leaked online
Sri Lanka notifies enforcement dates for its data protection law
US FTC enters into multilateral agreement for privacy and data security enforcement
France’s CNIL imposes fine on Yahoo for cookie consent violations
Saudi Arabia’s data protection law comes into effect
Read Shivangi Nadkarni’s notes on the APAC region for IAPP (Jan 2024)
New Jersey’s privacy law gets Governor’s assent
European Commission reviews adequacy decisions, no countries lose adequacy status
New Hampshire legislature passes privacy law

That’s a wrap on issue 4! Feel free to reach out to me on LinkedIn for suggestions on topics which I could cover, or contact me at the coordinates on my website.
