Redacting Text in Amazon Kinesis Data Firehose

Amazon Kinesis Data Firehose is a managed streaming service designed to take large amounts of data from one place to another. For example, you can take data from sources such as Amazon CloudWatch, AWS IoT, and custom applications using the AWS SDK to destinations such as Amazon S3, Amazon Redshift, Amazon Elasticsearch, and other services. In this post we will use Amazon S3 as the firehose’s destination.

In some cases you may need to manipulate the data as it goes through the firehose to remove sensitive information. In this blog post we will show how Amazon Kinesis Firehose and AWS Lambda can be used in conjunction with Philter to remove sensitive information (PII and PHI) from the text as it travels through the firehose.

Philter is software that redacts PII, PHI, and other sensitive information from text. Philter runs entirely within your private cloud and does not require any external connectivity. Your data never leaves your private cloud and is not sent to any third-party. In fact, you can run Philter without any external network connectivity and we recommend doing so!

Prerequisites

You must have a running instance of Philter. If you don’t already have a running instance of Philter, you can launch one through the AWS Marketplace. There are CloudFormation and Terraform scripts for launching a single instance of Philter or a load-balanced, auto-scaled set of Philter instances.

It’s not required that the instance of Philter be running in AWS, but it is required that the instance of Philter be accessible from your AWS Lambda function. Running Philter and your AWS Lambda function in your own VPC allows the function to communicate with Philter locally. This keeps your sensitive information off the public internet and keeps the network traffic inside your VPC.

Setting up the Amazon Kinesis Firehose Transformation

There is no need to duplicate an excellent blog post on creating an Amazon Kinesis Firehose Data Transformation with AWS Lambda. Instead, refer to the linked page and substitute the Python 3 code below for the code in that blog post.

Configuring the Firehose and the Lambda Function

To start, create an Amazon Kinesis Data Firehose delivery stream and configure an AWS Lambda transformation. When creating the AWS Lambda function, select Python 3.7 and use the following code:

import base64

# botocore's vendored requests is available in the Python 3.7 Lambda runtime;
# in newer runtimes, package the requests library with your function instead.
from botocore.vendored import requests


def handler(event, context):

    output = []

    for record in event['records']:

        # Decode the incoming record's data.
        payload = base64.b64decode(record['data'])
        headers = {'Content-type': 'text/plain'}

        # Send the text to Philter for redaction. We are ignoring Philter's self-signed certificate here.
        r = requests.post("https://PHILTER_IP:8080/api/filter", verify=False, data=payload, headers=headers, timeout=20)
        filtered = r.text

        # Re-encode the redacted text and mark the record as successfully processed.
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(filtered.encode('utf-8') + b'\n').decode('utf-8')
        }

        output.append(output_record)

    return {'records': output}

The following Kinesis Firehose test event can be used to test the function:

{
  "invocationId":"invocationIdExample",
  "deliveryStreamArn":"arn:aws:kinesis:EXAMPLE",
  "region":"us-east-1",
  "records":[
    {
      "recordId":"49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp":1495072949453,
      "data":"R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    },
    {
      "recordId":"49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp":1495072949453,
      "data":"R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    }
  ]
}

This test event contains two records, each with base64 encoded sample text containing sensitive values such as an SSN and a ZIP code. When the test is executed, the redacted text produced for the records will look like:

[
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.",
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}."
]

When running the test, the AWS Lambda function will extract the data from each record in the firehose, submit it to Philter for filtering, and place the base64 encoded redacted text back into the record. Note that in our Python function we are ignoring Philter’s self-signed certificate (verify=False). It is recommended that you use a valid signed certificate for Philter.
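
If you want to test with your own sample text, base64 encode it and place it in a record's data field. A quick way to produce the encoded value in Python:

import base64

# Encode your own sample text for use in the "data" field of a test record.
text = "He lived in 90210 and his SSN was 123-45-6789."
print(base64.b64encode(text.encode('utf-8')).decode('utf-8'))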

When data is now published to the Amazon Kinesis Data Firehose stream, the data will be processed by the AWS Lambda function and Philter prior to exiting the firehose at its configured destination.

Processing Data

We can use the AWS CLI to publish data to our Amazon Kinesis Firehose stream called sensitive-text:

aws firehose put-record --delivery-stream-name sensitive-text --record 'Data=He lived in 90210 and his SSN was 123-45-6789.'

Check the destination S3 bucket and you will have a single object with the following line:

He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.

Conclusion

In this blog post we have created an Amazon Kinesis Data Firehose pipeline that uses an AWS Lambda function to remove PII and PHI from the text in the streaming pipeline.

Philter is available from the AWS Marketplace. Not using AWS? Philter is also available from the Google Cloud Marketplace and the Microsoft Azure Marketplace.

Airlock Provides Protection Against Disclosure of Sensitive Information in AI-Generated Text

 

Airlock is available in the AWS, Google Cloud, and Microsoft Azure cloud marketplaces for turnkey deployment.

In the age of artificial intelligence, the use of AI-generated text has become increasingly prevalent in many industries. However, with the rise of this technology comes the risk of sensitive information being disclosed unintentionally. To address this issue, the team at Philterd, LLC, has developed Airlock, software designed to prevent the disclosure of sensitive information in AI-generated text.

Airlock utilizes advanced algorithms and machine learning techniques to scan and analyze AI-generated text for any potential sensitive information. This includes personal data and other sensitive details that could pose a risk if disclosed. The software can automatically redact or modify the identified information, providing guardrails for AI applications.

“We are excited to make Airlock available. The inadvertent disclosure of sensitive information, such as PII and PHI, is an important consideration that should not be overlooked when creating AI-enabled applications,” said Jeff Zemerick, founder of Philterd, LLC. “AI-generated text brings a new dimension to safeguarding PII and PHI, and we look forward to helping users with this challenge.”

The need for such software has become more apparent in recent years, with numerous incidents of sensitive information being leaked through AI-generated text. This has not only caused harm to individuals and businesses but has also raised concerns about the ethical use of AI. With Airlock, these concerns can be addressed, and the risk of sensitive information disclosure can be significantly reduced. Airlock builds on Philterd’s open source de-identification and redaction software.

Airlock is available on the Amazon Web Services, Google Cloud, and Microsoft Azure marketplaces for deployment into users’ cloud environments. To learn more about Airlock and its features, visit https://www.philterd.ai or contact support@philterd.ai.

About Philterd, LLC

Philterd specializes in helping keep your sensitive information safe. Learn more at www.philterd.ai.

Philter as an AI Policy Layer

A policy layer is an important part of every source of AI-generated text.

An AI policy layer is an important part of every source of AI-generated text because it inspects the AI-generated text to prevent sensitive information from being exposed. A policy layer can help remove information such as names, addresses, and telephone numbers from responses.

In this blog post we will describe the function of an AI policy layer and how Philter is well-suited for the role. Philter is available on the AWS Marketplace, Google Cloud Marketplace, and the Microsoft Azure Marketplace.

What is an AI policy layer and why is it needed?

As Cassie Kozyrkov wrote in her blog post linked below, "If you care about AI safety, you’ll insist that every AI-based system should have policy layers built on top of it. Think of policy layers as the AI version of human etiquette." - AI Bias: Good intentions can lead to nasty results

An AI policy layer is a part of your AI architecture that sits between your chat bot (or other source of AI-generated text) and your end-user. The role of an AI policy layer is to inspect the AI-generated text for sensitive information and remove it before sending the text to the user.

An AI policy layer is needed because it can be extremely difficult to know what data an AI model was trained on. Even when due diligence is done and care is taken, sensitive information can find its way into training data and it can be hard to detect simply due to the vast size of the training data.

How can Philter be used as an AI policy layer?

Philter was designed to integrate into virtually all types of applications. Philter's API is very simple and can be called from any application. With its text-in and redacted text-out operation, Philter can receive your AI generated text, inspect it for sensitive information based on your configuration, and redact any that is found.
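
As an example of how simple the integration can be, the following Python sketch sends AI-generated text to Philter's filter API and prints the redacted result. The endpoint address is a placeholder; replace PHILTER_IP with the address of your own Philter instance.

import requests

# Hypothetical AI-generated response that may contain sensitive information.
generated_text = "Sure! You can reach John at 123-45-6789."

# Send the text to Philter and receive the redacted text back.
response = requests.post("https://PHILTER_IP:8080/api/filter",
                         data=generated_text.encode('utf-8'),
                         headers={'Content-Type': 'text/plain'},
                         verify=False,  # use a valid signed certificate in production
                         timeout=20)

print(response.text)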

Can the AI policy layer be customized to my industry?

Yes! How Philter finds and redacts sensitive information is defined in a file called a filter profile. A filter profile can be thought of as a policy because it lets you specify what types of sensitive information should be redacted. You can create as many filter profiles as you need.

Automatically Redacting PII and PHI from Files in Amazon S3 using Amazon Macie and Philter

Amazon Macie is "a data security service that discovers sensitive data using machine learning and pattern matching." With Amazon Macie you can find potentially sensitive information in files in your Amazon S3 buckets, but what do you do when Amazon Macie finds a file that contains an SSN, phone number, or other piece of sensitive information?

Philter is software that redacts PII, PHI, and other sensitive information from text. Philter runs entirely within your private cloud and does not require any external connectivity. Your data never leaves your private cloud and is not sent to any third-party. In fact, you can run Philter without any external network connectivity and we recommend doing so!

In this blog post we will show how you can use Philter alongside Amazon Macie, Amazon EventBridge, and AWS Lambda to find and redact PII, PHI, or other sensitive information in your files in Amazon S3. If you are setting this up for your organization and need help, feel free to reach out!

How it Works

Here's how it will work:

  1. Amazon Macie will look for files in Amazon S3 buckets that contain potentially sensitive information.

  2. When Amazon Macie identifies a file, it will be sent as an event to Amazon EventBridge.

  3. An Amazon EventBridge rule that detects events from Amazon Macie will invoke an AWS Lambda function.

  4. The AWS Lambda function will use Philter to redact the file.

Setting it Up

Configuring Amazon Macie

The first thing we will do is enable Amazon Macie. It's easiest to follow the steps in the Amazon Macie documentation to enable it in your account - it's just a few clicks. Once you have Amazon Macie configured, come back here to continue!

Creating the AWS Lambda Function

Next, we want to create an AWS Lambda function. This function will be invoked whenever a file in an Amazon S3 bucket is found to contain sensitive information. Our function will be provided the name of the bucket and the object's key. With that information, our function can retrieve the file, use Philter to redact the sensitive information, and either overwrite the existing file or write the redacted file to a new object.

The Lambda function will receive a JSON object that contains the details of the files identified by Amazon Macie. It will look like this:

{
  "version": "0",
  "id": "event ID",
  "detail-type": "Macie Finding",
  "source": "aws.macie",
  "account": "AWS account ID (string)",
  "time": "event timestamp (string)",
  "region": "AWS Region (string)",
  "resources": [
    <-- ARNs of the resources involved in the event -->
  ],
  "detail": {
    <-- Details of a policy or sensitive data finding -->
  },
  "policyDetails": null,
  "sample": Boolean,
  "archived": Boolean
}

You can find more about the schema of the event here. What's most important to us is the name of the bucket and the key of the object identified by Amazon Macie. In the detail section of the above JSON object, there will be an s3Object that contains that information:

"s3Object":{
  "bucketArn":"arn:aws:s3:::my-bucket",
  "key":"sensitive.txt",
  "path":"my-bucket/sensitive.txt",
  "extension":"txt",
  "lastModified":"2023-10-05T01:32:21.000Z",
  "versionId":"",
    "serverSideEncryption":{
    "encryptionType":"AES256",
    "kmsMasterKeyId":"None"
  },
  "size":807,
  "storageClass":"STANDARD",
  "tags":[
  ],
  "publicAccess":false,
  "etag":"accdb2c550e3aa13610cbd87b91e3ec7"
}

This information gives the location of the identified file! It is s3://my-bucket/sensitive.txt. Now we can use Philter to redact this file!

You have a few choices here. You can have your AWS Lambda function grab that file from S3, redact it using Philter, and then overwrite the existing file. Or, you can choose to write it to a new file in S3 and preserve the original file. Which you do is up to you and your business requirements!

Redacting the File with Philter

To use Philter you must have an instance of it running! You can quickly launch Philter as an Amazon EC2 instance via the AWS Marketplace. In under 5 minutes you will have a running Philter instance ready to redact text via its API.

With Philter's API, you can use any programming language you like. There are client SDKs available for Java, .NET, and Go, but the Philter API is simple and easily callable from other languages like Python. You just need to be able to access Philter's API from your Lambda function at an endpoint like https://<philter-ip>:8080.

You just need to decide how you want to redact the file. Redaction in Philter is done via a policy, and you can tailor the policy to your business needs. Perhaps you want to mask social security numbers, shift dates, redact email addresses, and replace people's names with randomly generated names. You can create a Philter policy to do just that and apply it when calling Philter's API. See the Philter documentation to learn more about policies and to see some sample policies.
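
As a rough sketch of what the function might look like, the following lambda_function.py pulls the bucket and key out of the Macie finding, retrieves the object from S3, sends the text to Philter, and writes the redacted text back. The Philter endpoint, the exact nesting of the s3Object within the event detail, and the choice to overwrite the original object are assumptions to adapt to your environment.

import urllib3

import boto3

s3 = boto3.client('s3')

# Ignoring a self-signed certificate here; use a valid signed certificate in production.
http = urllib3.PoolManager(cert_reqs='CERT_NONE')

# Assumed address of your Philter instance; replace with your Philter IP or load balancer.
PHILTER_URL = 'https://PHILTER_IP:8080/api/filter'


def lambda_handler(event, context):

    # The Macie finding arrives via EventBridge; the s3Object is assumed to be
    # nested under resourcesAffected in the event's detail section.
    s3_object = event['detail']['resourcesAffected']['s3Object']
    bucket = s3_object['bucketArn'].split(':::')[1]
    key = s3_object['key']

    # Retrieve the identified file from S3.
    text = s3.get_object(Bucket=bucket, Key=key)['Body'].read()

    # Send the text to Philter for redaction.
    response = http.request('POST', PHILTER_URL, body=text, headers={'Content-Type': 'text/plain'})
    redacted = response.data.decode('utf-8')

    # Overwrite the original object with the redacted text, or change the key here
    # to write a new object and preserve the original file.
    s3.put_object(Bucket=bucket, Key=key, Body=redacted.encode('utf-8'))

    return {'bucket': bucket, 'key': key}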

Once you have your AWS Lambda function and Philter policy the way you want it, you can deploy the Lambda function:

aws lambda create-function --function-name redact-with-philter \
  --runtime python3.11 --handler lambda_function.lambda_handler \
  --role arn:aws:iam::accountId:role/service-role/my-lambda-role \
  --zip-file fileb://code.zip

Just update the values in that command as needed. Don't forget to set your AWS account ID in the role's ARN!

Configuring Amazon EventBridge

To create the Amazon EventBridge rule:

aws events put-rule --name MacieFindings --event-pattern "{\"source\":[\"aws.macie\"]}"

MacieFindings is the name that you want to give the rule. The response will be an ARN - note it because you will need it.
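
The response should look something like this (the region and account ID shown are placeholders):

{
    "RuleArn": "arn:aws:events:us-east-1:123456789012:rule/MacieFindings"
}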

Now we want to specify the AWS Lambda function that will be invoked by our EventBridge rule:

aws events put-targets \
  --rule MacieFindings \
  --targets Id=1,Arn=arn:aws:lambda:region:accountId:function:redact-with-philter

Just replace the values in the function's ARN with the details of your AWS Lambda function. Lastly, we just need to give EventBridge permissions to invoke the Lambda function:

aws lambda add-permission \
  --function-name redact-with-philter \
  --statement-id Sid \
  --action lambda:InvokeFunction \
  --principal events.amazonaws.com \
  --source-arn arn:aws:events:region:accountId:rule/MacieFindings

Again, update the ARN as appropriate.

Now, when Amazon Macie runs and finds potentially sensitive information in an object in one of your Amazon S3 buckets, an event will be sent to EventBridge, where the rule we created will invoke our Lambda function. The file will be sent to Philter where it will be redacted. The redacted text will then be returned to the Lambda function.

Summary

In this blog post we have provided the framework for using Philter alongside Amazon Macie, Amazon EventBridge, and AWS Lambda to redact PII, PHI, and other sensitive information from files in Amazon S3 buckets.

If you need help setting this up please reach out! We can help you through the steps.

Philter is available from the AWS Marketplace. Not using AWS? Philter is also available from the Google Cloud Marketplace and the Microsoft Azure Marketplace.

Phileas in Graylog - Removing PII from Logs

We are very excited to share with you that Graylog has integrated Phileas, the open source PII/PHI redaction engine, into their centralized log management solution. With this new integration, Graylog now has the ability to identify and redact different types of PII (personally identifiable information) present in logs.

The presence of PII in logs is a serious concern. Even careful application developers can find it difficult to prevent all PII from being included in logs. Error messages and stack traces can inadvertently include PII, exposing the business to risk and liability.

Phileas is the heart of Philter, an API-based redaction engine. Philter, also open source, provides users with a centralized tool for finding and manipulating PII and PHI in text. With Philter, sensitive information can be redacted, anonymized, or replaced. Philter is available on the AWS, Google Cloud, and Microsoft Azure marketplaces for deployment into your private cloud. Philter requires no outside internet access so your sensitive data never needs to leave your network to be redacted.

Because Phileas is licensed under the business-friendly open source Apache license, organizations are able to bring Phileas' ability to find and redact PII into their own applications. To learn more about Phileas or to get started integrating Phileas into your applications, visit the Phileas repository on GitHub.