Using an LLM or Pattern-based Rules for PII/PHI Redaction

In our data-driven world, being able to protect Personally Identifiable Information (PII) and Protected Health Information (PHI) is imperative. Whether you’re securing customer data, complying with regulations like GDPR or HIPAA, or simply aiming for responsible data handling, effective redaction of sensitive information is essential.

Today, there are two primary approaches: leveraging the power of Large Language Models (LLMs) and employing traditional pattern-based rules. While LLMs have understandably received significant attention for their impressive natural language understanding, it’s essential to compare their capabilities against the tried-and-true methods of pattern matching.

In this blog post, we will take a look at the benefits of pattern-based rules and some drawbacks of an LLM-based approach.

Pattern-based Rules

Pattern-based rules operate on a straightforward principle: defining specific regular expressions or keyword lists to identify and redact PII. Think of it as providing a precise set of instructions to locate and mask patterns that match social security numbers, phone numbers, email addresses, and other sensitive data.

A pattern-based approach has several advantages:

  • Speed and Efficiency: Pattern matching is generally very fast, processing large volumes of text quickly without significant computational overhead.
  • Transparency and Control: You have direct control over the rules, making it easy to understand why a particular piece of text was flagged as PII. Debugging and refining these rules is straightforward: if something expected is not matched, it is relatively easy to figure out why.
  • Computational Efficiency: Pattern-based methods require minimal computational resources, running effectively on standard CPUs without the need for specialized hardware like a GPU.
  • Precise Matching: When rules are well-defined, they offer high precision in identifying specific PII formats. For instance, a regex for a credit card number can incorporate the Luhn algorithm for basic validation (see the sketch after this list).
  • Flexibility for Specific Formats: Handling variations like phone numbers with or without parentheses or different date formats can be explicitly addressed through rule creation.
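
As a minimal sketch of the precise-matching point above, the following Python example pairs a candidate-matching regex with a Luhn checksum; the pattern, placeholder format, and function names are our own illustration rather than any particular library’s API:

import re

def luhn_valid(number: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, digit in enumerate(int(d) for d in reversed(number)):
        if i % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

# Candidate card numbers: 13-16 digits, optionally separated by spaces or dashes.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def redact_card_numbers(text: str) -> str:
    def mask(match: re.Match) -> str:
        digits = re.sub(r"[ -]", "", match.group(0))
        # Only redact candidates that actually pass Luhn validation.
        return "{{{REDACTED-credit-card}}}" if luhn_valid(digits) else match.group(0)
    return CARD_CANDIDATE.sub(mask, text)

print(redact_card_numbers("Card 4111 1111 1111 1111, order 1234567890123"))
# Card {{{REDACTED-credit-card}}}, order 1234567890123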

Drawbacks of LLM-Based PII Redaction

Using an LLM for PII redaction might seem like a good option – just download a model, give it your text (and maybe a prompt), and get back the redacted text. A search for “pii” on the Hugging Face model hub returns several models trained to identify PII.

But, before you go that route, be aware of some of the drawbacks:

  • Tokenization and Sentence Splitting: Before an LLM can analyze text, the input must be broken down into smaller units called tokens (often subwords rather than whole words). This pre-processing step can fragment PII that spans token boundaries or sentence splits, hindering accurate identification. Additionally, accurately splitting sentences can be quite a challenge on its own; content such as medical notes or technical information may not even be composed of complete sentences.
  • Ambiguity of Tokens: A fundamental limitation of current LLM tokenization is the lack of inherent support for tokens that could belong to multiple PII categories. For example, a sequence of digits might represent a zip code, a portion of a social security number, or part of a phone number. The LLM might struggle to definitively classify such ambiguous tokens without sufficient contextual clues, leading to PII not being redacted.
  • Resource Intensive: Running LLMs demands substantial computational resources. They typically require powerful GPUs and significant disk space to store the model weights. This can translate to increased infrastructure costs and more complex deployments.
  • Redaction Speed: Compared to the rapid execution of pattern-based rules, LLM inference is considerably slower. While a GPU can speed up inference, an LLM is unlikely to ever match the speed of a pattern-based system.
  • A Black Box: LLMs operate as complex neural networks, making it a challenge to understand exactly why a particular piece of text was (or wasn’t) identified as PII. This lack of transparency can make debugging and ensuring the reliability of the redaction process difficult. Going further, if you are using a third-party model, such as one from the Hugging Face Hub, you may not know much about the data used to train the model.
  • Robustness (e.g., Dashes and Spaces): The robustness of LLMs to minor variations in PII formatting can be surprisingly fragile. For instance, an LLM trained to recognize social security numbers in the “XXX-XX-XXXX” format might fail to identify “XXXXXXXXX” as an SSN, potentially misclassifying part of it as a zip code. Similarly, the presence or absence of spaces in phone numbers or other identifiers can significantly impact the LLM’s ability to recognize them. A pattern-based rule can handle these variations explicitly (see the sketch after this list).
  • Validation of PII: Some forms of PII, like credit card numbers, adhere to specific validation algorithms (e.g., the Luhn algorithm). LLMs, while capable of learning patterns, do not inherently incorporate such validation logic. This means they might flag sequences that look like credit card numbers but are actually invalid.
  • Vocabulary Limitations: The effectiveness of an LLM is heavily influenced by its training vocabulary. If a specific PII format or a term used in conjunction with PII is not well-represented in the training data, the model’s ability to identify it accurately will be compromised.
  • The Need for Constant Model Retraining: When new types of PII need to be identified, or when existing PII formats evolve, an LLM typically requires retraining on a new dataset. This process can be time-consuming, resource-intensive, and requires specialized expertise. It is likely not a process you want to do often.
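
To illustrate the robustness point above, here is a minimal sketch of a pattern-based SSN rule that tolerates dashes, spaces, or no separators at all; the regex and placeholder are our own illustration:

import re

# One rule covers "123-45-6789", "123 45 6789", and "123456789".
SSN_PATTERN = re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b")

def redact_ssns(text: str) -> str:
    # Note: a bare nine-digit ZIP+4 such as "12345-6789" also matches this
    # pattern; validation or filter priorities can disambiguate such overlaps.
    return SSN_PATTERN.sub("{{{REDACTED-ssn}}}", text)

print(redact_ssns("SSNs on file: 123-45-6789 and 123456789"))
# SSNs on file: {{{REDACTED-ssn}}} and {{{REDACTED-ssn}}}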

The Right Balance – Combining the Two Approaches

While LLMs offer exciting possibilities for understanding the context of PII, their inherent drawbacks make them a less-than-ideal sole solution for PII redaction in many scenarios. A more robust and reliable approach often involves a hybrid strategy. Combining the speed and precision of pattern-based rules for well-defined PII formats with the contextual understanding of LLMs for more nuanced cases can offer a more comprehensive and efficient solution. Learn how Philter employs a hybrid approach to data redaction at https://www.philterd.ai/.

Philter 3.1.0

Philter 3.1.0 is now available.

Philter 3.1.0 is built upon Phileas 2.12.0, which brings:

  • Filter priorities – Each filter can have its own priority that is used as a tie-breaker in cases where text is identified by two filters. For example, if you are using the phone number filter and an ID filter for 10-digit numbers, both filters may detect PII in the same text. In this case, the filter priority will be used to determine the ultimate labeling of the text as either a phone number or an ID number.
  • Zip code validation – The zip code filter can now optionally attempt to validate zip codes. When enabled, if a zip code does not exist in the internal database, the zip code will not be redacted.
  • Each filter can have a custom window size – The window size is roughly the number of words surrounding PII that is used to provide contextual information about the PII. Previously, each filter had to use the same window size. Now, each filter can have the window size set independently.

Phileas 2.12.0

Phileas 2.12.0 has been released. This version of the popular open source redaction library brings:

  • Filter priorities – Each filter can have its own priority that is used as a tie-breaker in cases where text is identified by two filters. For example, if you are using the phone number filter and an ID filter for 10-digit numbers, both filters may detect PII in the same text. In this case, the filter priority will be used to determine the ultimate labeling of the text as either a phone number or an ID number.
  • Zip code validation – The zip code filter can now optionally attempt to validate zip codes. When enabled, if a zip code does not exist in the internal database, the zip code will not be redacted.
  • Each filter can have a custom window size – The window size is roughly the number of words surrounding PII that is used to provide contextual information about the PII. Previously, each filter had to use the same window size. Now, each filter can have the window size set independently.

Look for a new version of Philter, built on Phileas 2.12.0, coming soon to the AWS, Google Cloud, and Azure marketplaces!

Why Using an LLM to Redact PII and PHI is a Bad Idea

We have seen a lot of posts – and you probably have too – on various social media and blogging platforms showing how you can redact text using a large language model (LLM). They present a fairly simple solution to the complex problem of redaction. Can we really just let an LLM handle our text redaction and be done with it? The answer is simply no.

Here is one such example: https://ravichinni.medium.com/using-generative-ai-for-content-redaction-46ee61a3a4e6 (Don’t do this.)

Posts like this can make it tempting to consider leveraging an LLM to help identify and redact sensitive information, such as personally identifiable information (PII) and protected health information (PHI). While LLMs have demonstrated impressive capabilities in natural language understanding, they are not well-suited for the critical task of detecting and redacting sensitive data.

This post describes why relying solely on an LLM for redaction and de-identification is a bad idea, and why a hybrid solution, such as the open source Philter software, that combines rule-based, dictionary-based, and natural language processing techniques is better suited for redaction and de-identification.

So, before you prompt an LLM to “Redact the PII and PHI in the following text”, be aware of the risks described below.

Redaction Requirements are Often Complex

Very often it is not enough to simply mask PII and PHI with asterisks. Various regulations have different requirements for how PII and PHI should be redacted, and your business needs can lead to complicated redaction policies. For instance, it may be enough to mask only the first 5 digits of an SSN, or to mask only zip codes whose population is below some threshold. Or perhaps you need to anonymize all occurrences of a person’s name consistently across multiple documents. Prompting an LLM to reliably meet these redaction requirements is challenging, if not impossible.
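
As a rough sketch of two of these requirements (partial SSN masking and consistent name anonymization), here is a Python example; the in-memory mapping and the assumption that names arrive from an upstream NER step are our own simplifications:

import re

SSN = re.compile(r"\b(\d{3})[- ]?(\d{2})[- ]?(\d{4})\b")

def mask_ssn_partially(text: str) -> str:
    # Mask the first five digits, keep the last four visible.
    return SSN.sub(r"***-**-\3", text)

# Consistent anonymization: the same name always maps to the same pseudonym.
pseudonyms: dict[str, str] = {}

def anonymize_name(name: str) -> str:
    if name not in pseudonyms:
        pseudonyms[name] = f"PERSON-{len(pseudonyms) + 1}"
    return pseudonyms[name]

text = "John Smith's SSN is 123-45-6789. John Smith called on Tuesday."
text = mask_ssn_partially(text)
for name in ["John Smith"]:  # names would come from an NER step
    text = text.replace(name, anonymize_name(name))
print(text)  # PERSON-1's SSN is ***-**-6789. PERSON-1 called on Tuesday.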

(Philter provides these redaction capabilities through its redaction policies. You can tailor a redaction policy specific to your needs and redact the data just how you need to.)

Decreased Accuracy

LLMs operate probabilistically, meaning they generate outputs based on patterns in their training data rather than deterministic rules. This makes LLMs unreliable for consistently identifying and redacting PII and PHI. An oversight—such as missing a social security number—could lead to data exposure and non-compliance with privacy regulations like HIPAA. Likewise, an LLM might redact information that isn’t sensitive or fail to recognize context-specific PII/PHI, leading to a false sense of security.

(Philter uses a combination of rule-based, dictionary-based, and natural language processing for redaction.)

Inconsistent Performance Across Contexts

The performance of LLMs varies significantly based on context, phrasing, and language structure. Sensitive information may appear in different formats, abbreviations, or contextual clues that an LLM may struggle to recognize.

For instance, an LLM might successfully identify and redact a clearly labeled patient name in one document but fail to recognize the same name in a physician’s notes when it appears alongside medical conditions. In contrast, purpose-built systems can be trained and tested with structured validation methods to ensure comprehensive and reliable redaction.

Risks of Hallucination and Data Leakage

LLMs sometimes “hallucinate”, meaning they generate information that was not in the original text. This poses a serious risk when dealing with sensitive data. If an LLM inadvertently generates or reconstructs PII/PHI that was previously redacted, it could lead to data breaches or compliance violations. Additionally, some LLMs may inadvertently store and reuse information from previous interactions, increasing the risk of accidental exposure if proper safeguards are not in place.

Lack of Explainability

One of the fundamental challenges with LLMs is their black-box nature. Unlike rule-based systems that provide clear logic for why a particular piece of information was redacted, LLMs do not offer transparency into their decision-making process. This lack of explainability makes it difficult to audit redaction decisions, troubleshoot errors, or prove compliance with regulatory requirements.

Organizations need to demonstrate accountability in handling sensitive data, and relying on a model that cannot provide a clear rationale for its decisions makes compliance reporting challenging.

(Philter’s API provides an /explain endpoint that returns a detailed explanation of why each token was identified as PII or PHI, to help you understand Philter’s actions.)

Scalability and Cost Considerations

Running LLMs at scale for real-time PII and PHI detection can be very expensive. Many LLMs require significant processing power, which increases operational costs. In contrast, traditional rule-based redaction tools are far more efficient, allowing faster and more cost-effective redaction.

Additionally, the cost of remediating errors caused by LLM misclassifications—whether through manual review or regulatory penalties—can be far higher than investing in a more robust, deterministic redaction approach from the start. (Note that Philter is open source software.)

If you don’t invest in the hardware needed to run an LLM locally, you will have to resort to third-party hosted LLMs. Sending sensitive text to a third party for redaction comes with its own set of risks. Is your data being shared? Is it encrypted? Do you know how you’re allowing the third party to use that data? This can quickly lead to the compliance risks described below.

Regulatory and Compliance Risks

Data privacy regulations like GDPR, HIPAA, and CCPA require stringent controls over how PII and PHI are processed and protected. If an LLM fails to properly redact sensitive data, an organization could face severe legal and financial consequences.

Using third-party or cloud-based LLMs introduces additional concerns regarding data residency, storage, and transmission. Many compliance frameworks require that sensitive data not be processed or stored in untrusted environments, and relying on an external AI model may violate these mandates.

A Better Choice: A Hybrid Approach to Redaction

Instead of relying entirely on LLMs for PII and PHI redaction, organizations should leverage solutions that apply deterministic, rule-based systems for the PII and PHI data that follows well-defined patterns. These methods provide greater reliability and transparency.

For cases where more advanced context understanding is needed, a hybrid approach—combining rule-based methods with traditional machine learning models specifically trained for redaction—offers a more accurate and compliant solution. These models can be fine-tuned, tested, and validated against real-world data without the unpredictability of general-purpose LLMs.

(Philter uses rules to identify many kinds of PII and PHI. Items such as email addresses and social security numbers follow well-defined patterns. A rule-based system will be much more efficient and accurate than an LLM at identifying these types of PII and PHI.)

Conclusion

While LLMs are powerful tools for many natural language tasks, they are not the best choice for the critical task of redacting PII and PHI. Their probabilistic nature, risk of errors, lack of transparency, and compliance concerns make them a less-than-ideal choice for organizations handling sensitive data. A structured and deterministic approach—either through rule-based systems or specialized AI models—is a safer and more efficient choice.

Learn more about Philter at https://www.philterd.ai/philter/ and its approach to redaction and de-identification.

Shielding Your Search: Redacting PII and PHI in OpenSearch with Phinder

In today’s data-driven world, safeguarding Personally Identifiable Information (PII) and Protected Health Information (PHI) is paramount. When leveraging search platforms like OpenSearch, ensuring sensitive data remains confidential is crucial. Enter Phinder, an open-source OpenSearch plugin that leverages the power of the Phileas project to effectively redact and de-identify PII and PHI within your search results.

This post explores how Phinder can bolster your data privacy and security when using OpenSearch. Phinder is available on GitHub at https://github.com/philterd/phinder-pii-opensearch-plugin.

What is Phinder?

Phinder is a specialized OpenSearch plugin designed to seamlessly integrate redaction and de-identification capabilities directly into your search workflow. Built upon the foundation of the open-source Phileas project, Phinder provides a robust and flexible mechanism for identifying and masking sensitive information within your indexed documents. This ensures you can search your data without the risk of exposing PII or PHI, which is essential for compliance with regulations like GDPR, CCPA, and HIPAA.

Phileas: The Engine Behind Phinder

Phinder leverages the Phileas project, a powerful engine for identifying and transforming sensitive data. Phileas offers a wide range of capabilities, including:

  • Named Entity Recognition (NER): Identifying and classifying named entities like people, organizations, locations, and dates.
  • Regular Expressions: Matching patterns for specific data formats like phone numbers, email addresses, and social security numbers.
  • Dictionaries: Using lists of known sensitive terms for redaction.
  • Customizable Rules: Defining your own specific redaction rules based on your unique data and requirements.

By integrating Phileas, Phinder benefits from its sophisticated analysis and transformation capabilities, providing a comprehensive solution for data protection.

Why use Phinder?

  • Enhanced Data Privacy: Phinder gives you granular control over what information is displayed in search results, preventing the accidental exposure of sensitive data.
  • Regulatory Compliance: By redacting PII and PHI, Phinder helps your organization meet the stringent requirements of data privacy and security regulations.
  • Improved Security Posture: Phinder reduces the risk of data breaches associated with sensitive information.
  • Flexible and Customizable: Phinder’s integration with Phileas allows for highly flexible configuration of redaction rules, tailored to your specific needs.
  • Open Source and Community Driven: Being open-source, Phinder is free to use and benefits from community contributions and ongoing improvements.

How to Use Phinder

  1. Installation: The first step is to install the Phinder plugin within your OpenSearch cluster. Refer to the Phinder documentation on GitHub for detailed installation instructions specific to your OpenSearch version.
  2. Defining Redaction Rules in a Policy (Leveraging Phileas): This is the core of Phinder’s functionality. You’ll leverage Phileas’s capabilities to identify the types of PII and PHI you want to protect (e.g., names, addresses, social security numbers, medical record numbers) and create corresponding rules. You can use regular expressions, dictionaries, or leverage pre-trained NER models provided by Phileas.
  3. Testing and Validation: Once you’ve configured Phinder, thorough testing is essential. Run searches against your data and verify that the sensitive information is being correctly redacted and de-identified.
  4. Integration with OpenSearch Queries: After testing, you can integrate Phinder directly into your OpenSearch queries. This ensures that redaction happens automatically whenever a search is performed.

The following is an example query that redacts email addresses from the description field.

curl -s http://localhost:9200/sample_index/_search -H "Content-Type: application/json" -d'
{
  "ext": {
    "phinder": {
      "field": "description",
      "policy": "{\"identifiers\": {\"emailAddress\":{\"emailAddressFilterStrategies\":[{\"strategy\":\"REDACT\",\"redactionFormat\":\"{{{REDACTED-%t}}}\"}]}}}"
    }
  },
  "query": {
    "match_all": {}
  }
}'
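
For readability, the escaped policy in the query above is the following JSON document, pretty-printed:

{
  "identifiers": {
    "emailAddress": {
      "emailAddressFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "{{{REDACTED-%t}}}"
        }
      ]
    }
  }
}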

Conclusion

Phinder, powered by Phileas, offers a robust and effective solution for protecting sensitive data within your OpenSearch environment. By implementing Phinder and defining appropriate redaction and de-identification rules, you can significantly reduce the risk of exposing PII and PHI, ensuring compliance and enhancing data privacy. Remember to consult the official Phinder documentation on GitHub for the most up-to-date information and detailed instructions. Protecting sensitive data is a continuous process, and Phinder can be a valuable tool in your data privacy strategy.

Phileas 2.10.0

We are excited to announce the release of Phileas 2.10.0!

What’s changed in this version:

* Making FilterResponse not be a final record class by @jzonthemtn in #166
* Removing commons-csv dependency by @jzonthemtn in #174
* Removing guava dependency and adding bloom filter by @jzonthemtn in #172
* Update pdfbox to 3.0.* by @JessieAMorris in #177
* Fixes a bug with the policy service being hard coded to “local” by @JessieAMorris in #178
* Enable outputting the replacement value on PDFs by @JessieAMorris in #179
* Add truncation filter strategy for all filters by @JessieAMorris in #180
* Adding line about snapshots being published nightly. by @jzonthemtn in #182
* #183 Replacing redis test dependency. by @jzonthemtn in #184
* Replace the Lucene-based filter with a fuzzy dictionary filter by @jzonthemtn in #185

GitHub release: https://github.com/philterd/phileas/releases/tag/2.10.0

Artifacts are available as described in the README.

Phileas in Graylog – Removing PII from Logs

We are very excited to share with you that Graylog has integrated Phileas, the open source PII/PHI redaction engine, into their centralized log management solution. With this new integration, Graylog now has the ability to identify and redact different types of PII (personally identifiable information) present in logs.

The presence of PII in logs is a serious concern. Even careful application developers can find it difficult to prevent all PII from being included in logs. Error messages and stack traces can inadvertently include PII, exposing the business to risk and liability.

Phileas is the heart of Philter, an API-based redaction engine. Philter, also open source, provides users with a centralized tool for finding and manipulating PII and PHI in text. With Philter, sensitive information can be redacted, anonymized, or replaced. Philter is available on the AWS, Google Cloud, and Microsoft Azure marketplaces for deployment into your private cloud. Philter requires no outside internet access so your sensitive data never needs to leave your network to be redacted.

Because Phileas is licensed under the business-friendly open source Apache license, organizations are able to bring Phileas’ ability to find and redact PII into their own applications. To learn more about Phileas or to get started integrating Phileas into your applications, visit the Phileas repository on GitHub.

Phileas 2.9.1

We are excited to announce the release of Phileas 2.9.1.

What’s changed in this version:

* LineWidthSplitService is using a new line separator instead of a space
* An empty list of spans from ph-eye does not indicate failure
* Have a default PhEyeConfiguration value in AbstractPhEyeConfiguration so a filter does not have to provide one

GitHub release: https://github.com/philterd/phileas/releases/tag/2.9.1

Artifacts are available in the Philterd repository as described in the README.

Automatically Redacting PII and PHI from Files in Amazon S3 using Amazon Macie and Philter

Amazon Macie is “a data security service that discovers sensitive data using machine learning and pattern matching.” With Amazon Macie you can find potentially sensitive information in files in your Amazon S3 buckets, but what do you do when Amazon Macie finds a file that contains an SSN, phone number, or other piece of sensitive information?

Philter is software that redacts PII, PHI, and other sensitive information from text. Philter runs entirely within your private cloud and does not require any external connectivity. Your data never leaves your private cloud and is not sent to any third-party. In fact, you can run Philter without any external network connectivity and we recommend doing so!

In this blog post we will show how you can use Philter alongside Amazon Macie, Amazon EventBridge, and AWS Lambda to find and redact PII, PHI, or other sensitive information in your files in Amazon S3. If you are setting this up for your organization and need help, feel free to reach out!

How it Works

Here’s how it will work:

  1. Amazon Macie will look for files in Amazon S3 buckets that contain potentially sensitive information.
  2. When Amazon Macie identifies a file, it will be sent as an event to Amazon EventBridge.
  3. An Amazon EventBridge rule that detects events from Amazon Macie will invoke an AWS Lambda function.
  4. The AWS Lambda function will use Philter to redact the file.

Setting it Up

Configuring Amazon Macie

The first thing we will do is enable Amazon Macie. It’s easiest to follow the provided steps to enable Amazon Macie in your account – it’s just a few clicks. Once you have Amazon Macie configured, come back here to continue!

Creating the AWS Lambda Function

Next, we want to create an AWS Lambda function. This function will be invoked whenever a file in an Amazon S3 bucket is found to contain sensitive information. Our function will be provided the name of the bucket and the object’s key. With that information, our function can retrieve the file, use Philter to redact the sensitive information, and either overwrite the existing file or write the redacted file to a new object.

The Lambda function will receive a JSON object that contains the details of the files identified by Amazon Macie. It will look like this:

{
  "version": "0",
  "id": "event ID",
  "detail-type": "Macie Finding",
  "source": "aws.macie",
  "account": "AWS account ID (string)",
  "time": "event timestamp (string)",
  "region": "AWS Region (string)",
  "resources": [
    <-- ARNs of the resources involved in the event -->
  ],
  "detail": {
    <-- Details of a policy or sensitive data finding -->
  },
  "policyDetails": null,
  "sample": Boolean,
  "archived": Boolean
}

You can find more about the schema of the event in the Amazon Macie documentation. What’s most important to us is the name of the bucket and the key of the object identified by Amazon Macie. In the detail section of the above JSON object, there will be an s3Object that contains that information:

"s3Object":{
  "bucketArn":"arn:aws:s3:::my-bucket",
  "key":"sensitive.txt",
  "path":"my-bucket/sensitive.txt",
  "extension":"txt",
  "lastModified":"2023-10-05T01:32:21.000Z",
  "versionId":"",
    "serverSideEncryption":{
    "encryptionType":"AES256",
    "kmsMasterKeyId":"None"
  },
  "size":807,
  "storageClass":"STANDARD",
  "tags":[
  ],
  "publicAccess":false,
  "etag":"accdb2c550e3aa13610cbd87b91e3ec7"
}

This information gives the location of the identified file! It is s3://my-bucket/sensitive.txt. Now we can use Philter to redact this file!

You have a few choices here. You can have your AWS Lambda function grab that file from S3, redact it using Philter, and then overwrite the existing file. Or, you can choose to write it to a new file in S3 and preserve the original file. Which you do is up to you and your business requirements!

Redacting the File with Philter

To use Philter you must have an instance of it running! You can quickly launch Philter as an Amazon EC2 instance via the AWS Marketplace. In under 5 minutes you will have a running Philter instance ready to redact text via its API.

With Philter’s API, you can use any programming language you like. There are client SDKs available for Java, .NET, and Go, but the Philter API is simple and easily callable from other languages like Python. You just need to be able to access Philter’s API from your Lambda function at an endpoint like https://<philter-ip>:8080.

You just need to decide how you want to redact the file. Redaction in Philter is done via a policy, and you can set your policy based on your business needs. Perhaps you want to mask social security numbers, shift dates, redact email addresses, and generate random person’s names. You can create a Philter policy to do just that and apply it when calling Philter’s API. See the Philter documentation to learn more about policies and to see some sample policies.
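
Here is a rough sketch of what the Lambda function might look like. The /api/filter endpoint path, the policy name, the environment variables, and the exact location of s3Object within the event detail are our assumptions for illustration, so check your Philter version’s API documentation and the Macie event schema:

import os
import urllib.request

import boto3

s3 = boto3.client("s3")
PHILTER_URL = os.environ["PHILTER_URL"]  # e.g. https://10.0.0.5:8080 (assumed)
POLICY = os.environ.get("PHILTER_POLICY", "default")  # assumed policy name

def lambda_handler(event, context):
    # Pull the bucket and key from the finding's s3Object details; the exact
    # path within the event detail may vary by event version.
    obj = event["detail"]["resourcesAffected"]["s3Object"]
    bucket = obj["bucketArn"].split(":::")[1]
    key = obj["key"]

    # Fetch the original file from S3.
    text = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

    # Send the text to Philter's filter API and read back the redacted text.
    request = urllib.request.Request(
        f"{PHILTER_URL}/api/filter?p={POLICY}",
        data=text,
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        redacted = response.read()

    # Overwrite the original object (or write to a new key to keep the original).
    s3.put_object(Bucket=bucket, Key=key, Body=redacted)
    return {"bucket": bucket, "key": key, "status": "redacted"}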

Once you have your AWS Lambda function and Philter policy the way you want it, you can deploy the Lambda function:

aws lambda create-function --function-name redact-with-philter \
  --runtime python3.11 --handler lambda_function.lambda_handler \
  --role arn:aws:iam::accountId:role/service-role/my-lambda-role \
  --zip-file fileb://code.zip

Just update the values in that command as needed. Don’t forget to set your AWS account ID in the role’s ARN!

Configuring Amazon EventBridge

To create the Amazon EventBridge rule:

aws events put-rule --name MacieFindings --event-pattern "{\"source\":[\"aws.macie\"]}"

MacieFindings is the name that you want to give the rule. The response will be an ARN – note it because you will need it.

Now we want to specify the AWS Lambda function that will be invoked by our EventBridge rule:

aws events put-targets \
  --rule MacieFindings \
  --targets Id=1,Arn=arn:aws:lambda:regionalEndpoint:accountID:function:my-findings-function

Just replace the values in the function’s ARN with the details of your AWS Lambda function. Lastly, we just need to give EventBridge permissions to invoke the Lambda function:

aws lambda add-permission \
  --function-name redact-with-philter \
  --statement-id Sid \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:regionalEndpoint:accountId:rule:MacieFindings

Again, update the ARN as appropriate.

Now, when Amazon Macie runs and finds potentially sensitive information in an object in one of your Amazon S3 buckets, an event will be sent to EventBridge, where the rule we created will invoke our Lambda function. The file will be sent to Philter, where it will be redacted, and the redacted text will then be returned to the Lambda function.

Summary

In this blog post we have provided the framework for using Philter alongside Amazon Macie, Amazon EventBridge, and AWS Lambda to redact PII, PHI, and other sensitive information from files in Amazon S3 buckets.

If you need help setting this up please reach out! We can help you through the steps.

Philter is available from the AWS Marketplace. Not using AWS? Philter is also available from the Google Cloud Marketplace and the Microsoft Azure Marketplace.

Philter as an AI Policy Layer

An AI policy layer is an important part of every source of AI-generated text: it inspects the generated text to prevent sensitive information from being exposed. A policy layer can help remove information such as names, addresses, and telephone numbers from responses.

In this blog post we will describe the function of an AI policy layer and how Philter is well-suited for the role. Philter is available on the AWS Marketplace, Google Cloud Marketplace, and the Microsoft Azure Marketplace.

What is an AI policy layer and why is it needed?

As Cassie Kozyrkov wrote in her post “AI Bias: Good intentions can lead to nasty results”: “If you care about AI safety, you’ll insist that every AI-based system should have policy layers built on top of it. Think of policy layers as the AI version of human etiquette.”

An AI policy layer is a part of your AI architecture that sits between your chat bot (or other source of AI-generated text) and your end-user. The role of an AI policy layer is to inspect the AI-generated text for sensitive information and remove it before sending the text to the user.

An AI policy layer is needed because it can be extremely difficult to know what data an AI model was trained on. Even when due diligence is done and care is taken, sensitive information can find its way into training data and it can be hard to detect simply due to the vast size of the training data.

How can Philter be used as an AI policy layer?

Philter was designed to integrate into virtually all types of applications. Philter’s API is very simple and can be called from any application. With its text-in, redacted-text-out operation, Philter can receive your AI-generated text, inspect it for sensitive information based on your configuration, and redact any that is found, as in the sketch below.
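
As a minimal sketch of the idea, the AI-generated response is passed through Philter before it reaches the user. The /api/filter endpoint path, host, and policy name are our assumptions for illustration; check your Philter documentation:

import urllib.request

PHILTER_URL = "https://philter.example.com:8080"  # assumed Philter endpoint

def apply_policy_layer(ai_text: str, policy: str = "default") -> str:
    """Send AI-generated text through Philter and return the redacted text."""
    request = urllib.request.Request(
        f"{PHILTER_URL}/api/filter?p={policy}",
        data=ai_text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")

# The end user sees only the redacted response.
raw_response = "Sure! Call Jane Doe at 202-555-0143 for details."
print(apply_policy_layer(raw_response))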

Can the AI policy layer be customized to my industry?

Yes! How Philter finds and redacts sensitive information is defined in a file called a filter profile. A filter profile can be thought of as a policy because it lets you specify what types of sensitive information should be redacted. You can create as many filter profiles as you need.