Artificial intelligence, or AI, has quickly become an integral part of our everyday lives, whether it's the chatbot that takes our support call, the telehealth tool that prompts us to provide our symptoms, or the online editor that instantly turns our rough ramblings into polished prose.
As more and more organizations develop and use AI applications, it's important to focus on how to protect the security and privacy of the large volumes of data required to develop, train, and maintain them.
In this guide, we'll:
- Provide an overview of AI privacy and its risks
- Outline steps to mitigate those risks
What do we mean by AI privacy?
AI privacy is a close relative of data privacy.
The principle of data privacy has existed since long before the age of AI. Data privacy contends that every individual should have control over their personal information, including how that data is collected, stored, and used.
The emergence of AI has introduced the more specific concept of AI privacy, which is concerned with the protection of sensitive information that AI collects, uses, shares, or stores.
What are the privacy risks of AI?
There are potential privacy risks throughout the AI workflow—collection, usage, and storage. We'll quickly go over the following types of risk:
- Collecting sensitive data
- Collecting data without consent
- Using data without permission
- Data exfiltration and leakage
Collecting sensitive data
The data that organizations collect for AI applications is likely to include some extremely sensitive values, especially in industries such as finance and healthcare.
These can include personally identifiable information (PII) and protected health information (PHI). Both types of information could allow an individual to be identified. In the case of PHI, the data could also reveal details about medical conditions and treatment.
This sensitivity compounds the other risks: the more sensitive the data, the greater the harm when something goes wrong.
Collecting data without consent
When you collect data, especially sensitive and personal information, without first obtaining a person's consent, you not only violate their right to privacy, but also potentially violate data privacy regulations that require that consent.
Using data without permission
Collecting and storing data are not the same thing as using the data to train or develop an AI application.
A client might be fine with you storing their records—your doctor needs to know about your previous treatments and your bank needs to know your account balance.
However, they might be less willing to have you feed that data into a machine learning algorithm.
If you do not obtain that additional permission, you once again risk running into trouble with both your clients' personal privacy and with existing regulations on data usage.
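One practical way to honor this distinction is to track consent per purpose rather than as a single yes/no flag. The sketch below is purely illustrative (the `ConsentRecord` structure and the purpose names are our own, not taken from any particular regulation or library), showing how a training pipeline might exclude records whose owners agreed to storage but not to model training:

```python
from dataclasses import dataclass, field

@dataclass
class ConsentRecord:
    """Hypothetical per-user consent, tracked per purpose."""
    user_id: str
    purposes: set[str] = field(default_factory=set)  # e.g. {"storage", "model_training"}

def records_usable_for_training(records, consents):
    """Keep only records whose owner consented to the 'model_training' purpose."""
    consent_by_user = {c.user_id: c for c in consents}
    return [
        r for r in records
        if "model_training"
        in consent_by_user.get(r["user_id"], ConsentRecord(r["user_id"])).purposes
    ]

# A user who agreed to storage only is excluded from the training set.
consents = [
    ConsentRecord("u1", {"storage", "model_training"}),
    ConsentRecord("u2", {"storage"}),
]
records = [{"user_id": "u1", "balance": 100}, {"user_id": "u2", "balance": 250}]
print(records_usable_for_training(records, consents))  # only u1's record
```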
Data exfiltration and leakage
With any data that you store, there is always a risk that a bad actor will try to steal it to obtain personal information they can use or sell for fraudulent purchases or identity theft.
In addition to deliberate acts of data theft and hacking, there is also the risk of accidental data leakage.
Whether the data is deliberately stolen or inadvertently leaked, the outcome is the same. Private data falls into the wrong hands.
AI and data privacy regulations
When you use customer data for AI training, you need to make sure that you follow the applicable regulations and guidelines.
Here are some regulations and guidelines that were in place as of this writing. You also need to stay on top of new rules and regulations as they emerge.
US AI privacy regulations
Individual states have started to pass their own regulations, including:
- California Consumer Privacy Act
- Texas Data Privacy and Security Act
- Utah Artificial Intelligence Policy Act
And while there are no official federal regulations, there have been some initial guidelines:
- In 2022, the White House Office of Science and Technology Policy (OSTP) released the Blueprint for an AI Bill of Rights.
- In October 2023, the White House issued an Executive Order on Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence.
- The National Institute of Standards and Technology (NIST) has published a voluntary AI Risk Management Framework.
EU AI privacy regulations
In the European Union, the General Data Protection Regulation (GDPR) provides strict regulations for overall data privacy.
More recently, the EU Artificial Intelligence (AI) Act provides additional regulations that more specifically address the use of personal data for AI.
How do you protect your organization's private data?
Protecting private data needs to start before you even collect it and continue through collection and storage. Some key steps include:
- Conduct risk assessments
- Only collect the data that you need
- Ask for and confirm consent for data collection
- Follow security best practices
- Provide additional protection for data from sensitive domains
- Report on data collection and storage
Conduct risk assessments
Before you even start to collect and use private data, make sure that you fully understand the associated risks so that you can address them in advance.
The risks include both security risks (can bad actors steal the data?) and privacy risks (can data be collected without a user's knowledge?).
Some questions to ask (and answer), with a sketch after the list of one way to record the answers:
- What data will you collect, and what sensitive values does it contain?
- How will you ensure user consent?
- How will you store the data?
- Who will have access to the data?
- How will they use the data?
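One way to make sure these questions actually get answered is to capture them in a structured record that must be filled out before collection begins. This is a minimal, hypothetical sketch; the field names are our own:

```python
from dataclasses import dataclass

@dataclass
class DataRiskAssessment:
    """Hypothetical record of answers to the pre-collection questions above."""
    fields_collected: list[str]   # what data, including any sensitive values
    consent_mechanism: str        # how user consent is obtained
    storage_location: str         # where and how the data is stored
    authorized_roles: list[str]   # who can access the data
    intended_uses: list[str]      # how the data will be used

assessment = DataRiskAssessment(
    fields_collected=["symptoms", "treatments", "age"],
    consent_mechanism="signed intake form, logged with timestamp",
    storage_location="encrypted database, restricted network segment",
    authorized_roles=["data-engineering", "ml-research"],
    intended_uses=["model_training"],
)
```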
Only collect the data that you need
The less data you collect, the less risk you incur. Collect only the data that you need, in terms of both the volume of data and the specific fields.
The amount you need can depend on the purpose. For training a machine learning model, more data is generally better. For basic software development and testing, however, a smaller subset might suffice.
Collecting only what you need can also mean specific fields in the data. For example, to train a healthcare model based on patient records, you might need information such as symptoms, treatments, and patient risk factors (age, weight, other conditions). On the other hand, you might not need identifying information such as patient names, Social Security numbers, or email addresses.
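In code, field-level minimization can be as simple as an explicit allow-list of columns, so that identifying fields are never copied downstream. A minimal sketch using pandas, with hypothetical column names:

```python
import pandas as pd

# Hypothetical patient records; only some fields are needed for training.
records = pd.DataFrame([
    {"name": "Jane Doe", "ssn": "123-45-6789", "email": "jane@example.com",
     "age": 54, "symptoms": "fatigue", "treatment": "iron supplement"},
])

# Explicit allow-list: identifying fields never leave the collection step.
NEEDED_COLUMNS = ["age", "symptoms", "treatment"]
training_data = records[NEEDED_COLUMNS]
print(training_data)
```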
Ask for and confirm consent to data collection
Before you collect or use any private data, make sure that you obtain the required consent.
When you request consent, make it absolutely clear to each individual how you plan to store, use, and, most importantly, protect their personal data.
Also make sure to keep records of the consent, so that you can verify that you requested and received it.
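A minimal sketch of keeping verifiable consent records, assuming a simple append-only log (the field names, including the version of the consent wording the user saw, are our own choices):

```python
import json
from datetime import datetime, timezone

def log_consent(path, user_id, purposes, consent_text_version):
    """Append a timestamped consent record so consent can be verified later."""
    entry = {
        "user_id": user_id,
        "purposes": purposes,                          # e.g. ["storage", "model_training"]
        "consent_text_version": consent_text_version,  # which wording the user agreed to
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

log_consent("consent_log.jsonl", "u1", ["storage", "model_training"], "v2.3")
```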
Follow security best practices
Don't forget the basics.
Security practices such as encryption, firewalls, strict access control, and stringent authentication requirements apply to any data that you don't want to share, and are especially important for personal data.
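As one concrete example of encryption at rest, the widely used Python `cryptography` package provides Fernet symmetric encryption. This is a minimal sketch; in practice, the key would come from a secrets manager rather than being generated next to the data:

```python
# pip install cryptography
from cryptography.fernet import Fernet

# In production, load this key from a secrets manager, never from source code.
key = Fernet.generate_key()
fernet = Fernet(key)

# Encrypt before writing to disk; decrypt only when needed.
ciphertext = fernet.encrypt(b"patient_id=4821;diagnosis=anemia")
plaintext = fernet.decrypt(ciphertext)
assert plaintext == b"patient_id=4821;diagnosis=anemia"
```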
Provide additional protection for data from sensitive domains
In addition to those basic security best practices, provide additional security for the extra-sensitive private data.
For example, store private data on a separate server with additional security measures, and enforce additional limitations on access.
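Access limits can be enforced in application code as well as at the network and database layers. A minimal deny-by-default sketch, with role names that are purely illustrative:

```python
# Hypothetical allow-list: only these roles may read the sensitive store.
SENSITIVE_DATA_ROLES = {"clinical-data-admin", "privacy-officer"}

def read_sensitive_record(user_roles, record_id, store):
    """Deny by default; grant access only to explicitly allowed roles."""
    if not SENSITIVE_DATA_ROLES & set(user_roles):
        raise PermissionError(f"access to record {record_id} denied")
    return store[record_id]

store = {"r1": {"diagnosis": "anemia"}}
print(read_sensitive_record(["privacy-officer"], "r1", store))  # allowed
```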
Report on data collection and storage
And finally, always keep records of what you collect and store, and use regular reports to check the current status.
This can help to surface potential security or privacy issues early. It can also help identify information that you might no longer need.
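A recurring report can be as simple as a script that summarizes what each stored dataset contains and when it was last used. The inventory structure below is hypothetical:

```python
from datetime import date

# Hypothetical inventory of stored datasets.
datasets = [
    {"name": "patient_records", "sensitive_fields": ["ssn", "name"],
     "last_used": date(2024, 1, 10)},
    {"name": "support_tickets", "sensitive_fields": [],
     "last_used": date(2023, 3, 2)},
]

def stale_datasets(datasets, today, max_idle_days=180):
    """Flag datasets that haven't been used recently and may no longer be needed."""
    return [d["name"] for d in datasets
            if (today - d["last_used"]).days > max_idle_days]

print(stale_datasets(datasets, date(2024, 6, 1)))  # ['support_tickets']
```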
Using synthetic data tools to support AI privacy
Synthetic data tools allow you to use realistic data that is free from personal information.
Tonic Structural can de-identify data in databases and text files. Structural scans the data to identify sensitive values. You can then quickly replace all of those values with realistic alternatives (Michael becomes John, identifiers are scrambled, and so on).
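The underlying technique, shown here with the open-source `faker` package rather than Structural's own interface, is to map each real value to a consistent, realistic replacement so that the same person receives the same fake name everywhere it appears:

```python
# pip install faker
from faker import Faker

fake = Faker()
Faker.seed(42)  # reproducible replacements across runs

replacements = {}

def deidentify_name(real_name):
    """Replace a real name with a consistent, realistic fake name."""
    if real_name not in replacements:
        replacements[real_name] = fake.name()
    return replacements[real_name]

print(deidentify_name("Michael Smith"))  # same fake name every time
print(deidentify_name("Michael Smith"))
```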
You can also use Structural subsetting to produce smaller datasets that maintain referential integrity.
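Subsetting means sampling a fraction of a parent table and then keeping only the related rows in its child tables, so that foreign keys still resolve. A minimal pandas sketch with hypothetical tables:

```python
import pandas as pd

patients = pd.DataFrame({"patient_id": [1, 2, 3, 4], "age": [54, 37, 61, 45]})
visits = pd.DataFrame({
    "visit_id": [10, 11, 12, 13],
    "patient_id": [1, 1, 3, 4],
    "reason": ["fatigue", "follow-up", "checkup", "injury"],
})

# Sample half of the parent table...
patient_subset = patients.sample(frac=0.5, random_state=0)

# ...then keep only child rows whose foreign key points into the subset.
visit_subset = visits[visits["patient_id"].isin(patient_subset["patient_id"])]
```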
The resulting datasets are then safe to use for development. They are also easy to reproduce and reuse for multiple rounds of development and testing.
If you instead have large volumes of files such as PDFs, images, and even .docx files—like patient notes—then Tonic Textual can help. Textual also detects and replaces sensitive values. You can then download the redacted files to use for purposes such as AI development and model training.
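At its simplest, redaction of free text is pattern-based substitution; tools like Textual use trained models to catch far more than regular expressions can, but this hypothetical sketch shows the shape of the operation:

```python
import re

# Hypothetical patterns for two common identifiers; real detection is ML-based.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def redact(text):
    """Replace each detected sensitive value with a typed placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = "Patient Jane (SSN 123-45-6789, jane@example.com) reports fatigue."
print(redact(note))
# Patient Jane (SSN [SSN], [EMAIL]) reports fatigue.
```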
Conclusion
AI is a very powerful tool that is quickly making its way into almost every aspect of our daily lives. To develop effective AI systems, you need large volumes of data, data that potentially contains extremely private and personal information such as PII and PHI.
When you use data for AI development and training, you need to be aware of the applicable privacy regulations and standards, and ensure that your data collection, storage, and usage meets or even exceeds those requirements.
You can use Tonic.ai's synthetic data tools, Tonic Structural and Tonic Textual, to identify, remove, and replace sensitive values in your collected data. The de-identified data is then safe to use in your AI tool and model development.
To learn more about Structural de-identification and Textual file redaction, connect with our team or start a free trial of Structural or Textual.