Bailey Kacsmar is a PhD candidate in the School of Computer Science at the University of Waterloo and an incoming faculty member at the University of Alberta. Her research interests are in the development of user-conscious privacy-enhancing technologies, through the parallel study of technical approaches for private computation alongside the corresponding user perceptions, concerns, and comprehension of these technologies. Her work aims at identifying the potential and the limitations for privacy in machine learning applications.
Your research interests are in the development of user-conscious privacy-enhancing technologies. Why is privacy in AI so important?
Privacy in AI is so important in large part because AI in our world does not exist without data. Data, while a useful abstraction, is ultimately something that describes people and their behaviours. We are rarely working with data about tree populations and water levels, so anytime we are working with something that can affect real people we need to be cognizant of that and understand how our system can do good or harm. This is particularly true for AI, where many systems benefit from massive quantities of data or hope to use highly sensitive data (such as health data) to try to develop new understandings of our world.
What are some ways that you’ve seen that machine learning has betrayed the privacy of users?
Betrayed is a strong word. However, anytime a system uses information about people without their consent, without informing them, and without considering potential harms, it runs the risk of betraying individuals’ or society’s privacy norms. Essentially, this results in betrayal by a thousand tiny cuts. Such practices can include training a model on users’ email inboxes, on users’ text messages, or on health data, all without informing the subjects of the data.
Could you define what differential privacy is, and what your views on it are?
Privacy is not limited to one definition or concept, and it is important to be aware of notions beyond differential privacy. For instance, contextual integrity is a conceptual notion of privacy that accounts for things like how different applications or different organizations change the privacy perceptions of an individual with respect to a situation. There are also legal notions of privacy, such as those encompassed by Canada’s PIPEDA, Europe’s GDPR, and the California Consumer Privacy Act (CCPA). All of this is to say that we cannot treat technical systems as though they exist in a vacuum free from other privacy factors, even if differential privacy is being employed.
Another privacy-enhancing type of machine learning is federated learning. How would you define it, and what are your views on it?
Federated learning is a way of performing machine learning when the model is to be trained on a collection of datasets that are distributed across several owners or locations. It is not intrinsically a privacy-enhancing type of machine learning. A privacy-enhancing type of machine learning needs to formally define what is being protected, who it is being protected from, and the conditions that must be met for these protections to hold. For example, when we think of a simple differentially private computation, it guarantees that someone viewing the output will not be able to determine whether a certain data point was contributed or not.
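To make the idea of a simple differentially private computation more concrete, here is a minimal sketch of a noisy count using the Laplace mechanism. It is not taken from the interview: the function names, the `ages` records, and the choice of `epsilon` are all hypothetical, and the sampler is for illustration only rather than a hardened implementation.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponentials with rate 1/scale
    # follows a Laplace distribution with location 0 and the given scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_count(records, predicate, epsilon, rng):
    """Noisy count of matching records (Laplace mechanism sketch)."""
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one record changes a count by at most 1,
    # so the sensitivity of this query is 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
ages = [34, 51, 29, 62, 45, 38]  # hypothetical records, not real data

def is_over_40(age):
    return age >= 40

# The same query on two datasets that differ in one person's record.
with_person = dp_count(ages, is_over_40, epsilon=0.5, rng=rng)
without_person = dp_count(ages[:-1], is_over_40, epsilon=0.5, rng=rng)
print(with_person, without_person)
```

Because the noise is on the same scale as one person's contribution, a single released count gives little evidence about whether any particular record was included. A real deployment would rely on a vetted library, since naive floating-point samplers like this one are known to be risky.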
Further, differential privacy does not make this guarantee if, for instance, there is correlation among the data points. Federated learning does not have this feature; it simply trains a model on a collection of data without requiring the holders of that data to directly provide their datasets to each other or a third party. While that sounds like a privacy feature, what is needed is a formal guarantee that one cannot learn the protected information given the intermediaries and outputs that the untrusted parties will observe. This formality is especially important in the federated setting where the untrusted parties include everyone providing data to train the collective model.
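The mechanics of federated learning described above can be sketched with a toy version of federated averaging. This is an assumption-laden illustration, not a method described in the interview: the one-parameter model, the `clients` data, the learning rate, and the function names are all hypothetical.

```python
def local_update(w, data, lr=0.1):
    """One gradient step for the model y = w * x on one client's own data."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

def federated_round(global_w, client_datasets):
    """Each client trains locally; the server only sees the updated weights."""
    updates = [local_update(global_w, data) for data in client_datasets]
    sizes = [len(data) for data in client_datasets]
    # Average the client weights, weighted by how much data each client holds.
    return sum(u * n for u, n in zip(updates, sizes)) / sum(sizes)

# Hypothetical private datasets, all generated from y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]

w = 0.0
for _ in range(50):
    w = federated_round(w, clients)
print(round(w, 3))  # the learned weight should approach 2
```

Note that the raw datasets never leave the clients, yet every per-client update is visible to whoever aggregates them. Nothing in this procedure bounds what those updates reveal, which is why a formal guarantee has to be added on top rather than assumed.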
What are some of the current limitations of these approaches?
Current limitations could best be described as the nature of the privacy-utility trade-off. Even if you do everything else (communicate the privacy implications to those affected, evaluate the system for what you are trying to do, and so on), it still comes down to this: achieving perfect privacy means we don’t make the system, and achieving perfect utility will generally mean having no privacy protections. So the question is how we determine the “ideal” trade-off. How do we find the right tipping point and build towards it such that we still achieve the desired functionality while providing the needed privacy protections?
You currently aim to develop user-conscious privacy technology through the parallel study of technical solutions for private computation. Could you go into some details on what some of these solutions are?
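One concrete instance of this trade-off appears in the differential privacy setting mentioned earlier in the interview, where a privacy parameter (conventionally called epsilon) controls how much noise is added to a result. The sketch below is an illustration under assumptions: the function names, the epsilon values, and the use of a sensitivity-1 query are hypothetical choices, not figures from the interview.

```python
import random

def laplace_noise(scale, rng):
    # Difference of two exponentials with rate 1/scale is Laplace(0, scale).
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def mean_abs_error(epsilon, trials=20000, sensitivity=1.0, seed=0):
    """Average absolute noise added to a query with the given sensitivity."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return sum(abs(laplace_noise(scale, rng)) for _ in range(trials)) / trials

# Smaller epsilon means stronger privacy but a noisier, less useful answer.
for eps in (0.1, 0.5, 1.0, 5.0):
    print(f"epsilon={eps}: mean absolute error is about {mean_abs_error(eps):.2f}")
```

The expected error scales as sensitivity divided by epsilon, so the two extremes match the ones described above: driving epsilon towards zero drowns the answer in noise, while letting it grow without bound removes the protection. Choosing a point in between is a judgment about the application, not something the mathematics decides on its own.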
What I mean by these solutions is that we can, loosely speaking, develop any number of technical privacy systems. However, when doing so it is important to determine whether the privacy guarantees are reaching those affected. This can mean developing a system after finding out what kinds of protections the population values. This can mean updating a system after finding out how people actually use a system given their real-life threat and risk considerations. A technical solution could be a correct system that satisfies the definition I mentioned earlier. A user-conscious solution would design its system based on inputs from users and others affected in the intended application domain.
You’re currently seeking interested graduate students to start in September 2024. Why do you think students should be interested in AI privacy?
I think students should be interested because it is something that will only grow in its pervasiveness within our society. To have some idea of how quickly these systems can spread, look no further than the recent amplification of ChatGPT through news articles, social media, and debates about its implications. We exist in a society where the collection and use of data is so embedded in our day-to-day life that we are almost constantly providing information about ourselves to various companies and organizations. These companies want to use the data, in some cases to improve their services, in others for profit. At this point, it seems unrealistic to think these corporate data usage practices will change. However, the existence of privacy-preserving systems that protect users while still allowing certain analyses desired by companies can help balance the risk-reward trade-off that has become such an implicit part of our society.
Thank you for the great interview. Readers who are interested in learning more should visit Bailey Kacsmar’s GitHub page.