To anonymise, but how to anonymise? That is the question
Is data anonymisation really the get-out-of-jail-free card we think it is?
Welcome back to Legitimately Interested! I am so thrilled with the positive response to this newsletter, and thank you so much to everyone who has subscribed and read my first issue. I'm linking it here again, in case you are new here or would like to revisit it.
[Before we get into it, the content of this newsletter will always be free. But if you would like to support my work, consider taking out a paid subscription. If you don’t want to commit to one, I also welcome one-time contributions here].
This issue, we’re talking about data anonymisation - what it is, why it matters, and where the confusion lies. I thought it important to address this before India’s data protection rules come out, the latest news being that they are likely to be notified in the coming week. Read Aditi Agrawal’s great piece on this, which recaps the significant issues with India’s DPDPA and the subject matter the rules are expected to cover.
Issue in Focus
Anonymisation often seems like the yellow brick road to a risk-free future, since anonymised data does not fall within the ambit of privacy regulations such as the GDPR, US state laws, India’s DPDPA and HIPAA. Sounds great, right? Especially when it comes to training LLMs and other AI models, anonymised data could be a way to ensure continued innovation in AI and the derivation of new insights from data, while at the same time protecting user privacy. I like this parallel drawn to potential energy as data’s unique property: its value increases with use, generating new insights and revenue streams. This is unlike oil, to which data keeps getting compared.
But it has never been quite so simple, and the source of the difficulty lies in the legal standards (when I say legal standards, I’m drawing upon guidance from jurisdictions outside India, since we haven’t developed these standards yet). Let’s go through some of the problems, and what they could mean for India as we evolve standards under our Digital Personal Data Protection Act, 2023 (DPDPA).
Problem 1: Lack of definition
All data protection laws define personal information, as we covered in my last issue. But no law, including India’s DPDPA, defines anonymised information. Anonymised information is then viewed as the antithesis of personal information, i.e., information that (i) does not identify an individual, and (ii) cannot identify an individual even in combination with other information. India did take a crack at a Non-Personal Data Governance Framework via the Kris Gopalakrishnan Committee Report, which defines ‘non-personal data’ as any data which is not personal data, having either been anonymised to prevent re-identification or never been capable of identifying someone to begin with. In an earlier draft of India’s data protection law, anonymisation was defined as an ‘irreversible process’, in compliance with a standard to be prescribed. But this did not make it into the final version.
Similarly, under the GDPR, anonymised data is only defined in the negative: you need an understanding of the factors that make data personal, and then negate those elements. But as we discussed in the previous issue, it’s not always easy to accurately categorise what counts as personal data.
This leads to anonymisation being essentially driven by principle-based regulation, in tandem with purpose limitation, data minimisation and fair information processing. While not prescriptive, these principles are helpful - for example, if the purpose for which personal data was collected has been fulfilled, and there is no lawful reason to retain it even in anonymised form, the best thing to do is delete it entirely.
Problem 2: Anonymisation isn’t perfect
To tackle problem 1, regulators like the EDPS and the UK’s ICO have put out guidance on anonymisation. While informative, none of it amounts to a bright-line test. On one hand, this makes the threshold for effective anonymisation vague; on the other, it does provide some leeway if the entity processing the information can show that it has met what the regulator considers an adequate level of anonymisation, i.e., that there is no ‘reasonable likelihood’ of re-identification.
Regulators have themselves acknowledged that some information cannot be anonymised completely, even though 100% anonymisation is the most desirable outcome - the EDPS has said that the re-identification risk is highly unlikely to be zero unless the dataset is highly generalised. This was also acknowledged by the Kris Gopalakrishnan Committee. A middle ground is pseudonymisation, where an individual can’t be identified from the dataset alone but could potentially be re-identified in combination with other information; pseudonymised data does not fall outside the purview of personal data entirely, but it carries a lower level of risk.
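To make the ‘generalisation’ point a bit more concrete, here is a minimal, purely illustrative Python sketch (the field names and records are hypothetical, not drawn from any regulator’s guidance): exact quasi-identifiers such as a date of birth and a full PIN code are coarsened into a birth year and a broader region, trading away precision for a lower re-identification risk.

```python
# Illustrative sketch only: generalising quasi-identifiers in a toy dataset.
# Field names and values are hypothetical.

records = [
    {"dob": "1991-03-14", "pincode": "560034", "diagnosis": "asthma"},
    {"dob": "1991-07-02", "pincode": "560095", "diagnosis": "diabetes"},
]

def generalise(record):
    """Coarsen exact attributes into broader buckets: year of birth, partial PIN code."""
    return {
        "birth_year": record["dob"][:4],          # "1991-03-14" -> "1991"
        "region": record["pincode"][:3] + "XXX",  # "560034" -> "560XXX"
        "diagnosis": record["diagnosis"],         # the attribute we still want to analyse
    }

print([generalise(r) for r in records])
```

The trade-off is exactly the one regulators describe: the more heavily you generalise, the lower the re-identification risk, but the less useful the dataset becomes.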
Scholars in the EU have also written about how current approaches are geared towards anonymisation of ‘structured data’, i.e., data which follows a predefined model, making it easier to pin down personal attributes. Unstructured data, on the other hand, does not follow a predefined model - examples include text documents, images and audio-visual recordings, which don’t fit a tabular or graph-like structure.
Problem 3: Arriving at a standard
All this brings us to the question of how to reconcile these issues - how do we create a standard for anonymisation that is at once achievable and predictable, but also acknowledges that anonymisation is a continuous process of risk trade-offs? Different jurisdictions have taken varying approaches - the famous Breyer case decided by the CJEU adopted a strict approach, holding that dynamic IP addresses had to be considered personal data since third parties like the government could ultimately use them in combination with other information to re-identify an individual. In contrast, the Article 29 Working Party, the UK’s ICO and California have all adopted some version of a ‘reasonable likelihood’ test, to avoid making it unduly difficult for companies to comply and to spare them from having to anticipate every possible avenue for re-identification, while still adopting safeguards. There is also the approach of considering whether the personal and non-personal elements of a given dataset are ‘inextricably linked.’
A specific method of anonymisation may also enjoy varying levels of acceptance between jurisdictions - for instance, ‘salting’ or ‘hashing’ can be considered methods which do not ‘reasonably’ lead to re-identification, making them acceptable under the California Consumer Privacy Act subject to some safeguards. The same methods under the GDPR may only satisfy the threshold of pseudonymisation rather than anonymisation.
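To see why hashing tends to land on the pseudonymisation side of the line under the GDPR, here is a minimal Python sketch (purely illustrative; the email address is made up): a salted hash replaces the direct identifier, but whoever holds the salt can recompute the same token for a known address and re-link the records.

```python
import hashlib
import secrets

# Illustrative sketch only: replacing a direct identifier with a salted SHA-256 token.
salt = secrets.token_bytes(16)  # kept by the organisation, never published

def pseudonymise(email: str) -> str:
    """Return a salted SHA-256 token in place of the email address."""
    return hashlib.sha256(salt + email.encode("utf-8")).hexdigest()

token = pseudonymise("asha@example.com")  # hypothetical address
print(token)

# Anyone holding the salt can recompute the token for a known email and re-link it,
# which is why this sits closer to pseudonymisation than anonymisation.
assert token == pseudonymise("asha@example.com")
```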
Interestingly, on December 20th, Quebec adopted a draft regulation on anonymisation. While it too sets somewhat nebulous standards without being prescriptive, it provides criteria on which to base re-identification risk assessments, guidelines for conducting the analysis, and record-keeping obligations. This might be a good approach that provides flexibility while also bridging the gap between legal and technical standards, paving the way for joint audits which account for both legalities and technical realities.
The way forward
It is clear that regulators want companies to anonymise their data. The ‘how’ and ‘to what extent’ are less clear, but there is utility in doing it as far as possible. This article has some great practical guidance on anonymisation strategies for organisations.
After doing a lot of reading on this, I don’t think it’s possible, or in fact desirable, to establish a bright-line rule here; rather, it is better to continue adopting a risk-based approach. The benefits of this are wonderfully summarised in this paper: (i) it allows for adaptation to new technology and ensures a reasonable interpretation of legislation, (ii) it recognises that the purpose of data protection law is not to eliminate all risk but to balance the free flow of information with individual privacy, and (iii) it allows for nuanced assessments of the likelihood of re-identification, especially in the context of unstructured data or mixed datasets, where personal and non-personal information are interlinked.
This way, there is a method to prove that a procedure adopted for anonymisation satisfies the regulator’s guidelines, while also ensuring that the burden of proof rests on the companies processing, sharing and training on this data. It will be interesting to see which way India goes.
Privacy Roundup
EU Council adopts final version of EU Data Act, on fair access and use of data
EU Council and Parliament arrive at agreement for the world’s first AI legislation
ISO publishes standard for AI management systems
Read Robert Bateman’s roundup of US state privacy law deadlines
Read Charmian Aw’s great piece on how privacy law in Asia is like rice
India’s RBI issues directions on IT governance, risk, controls and assurance practices
That’s a wrap on issue 2! Feel free to reach out to me on LinkedIn for suggestions on topics which I could cover, or contact me at the coordinates on my website.