Redacting Text in Amazon Kinesis Data Firehose

Amazon Kinesis Firehose is a managed streaming service designed to move large amounts of data from one place to another. For example, you can take data from sources such as Amazon CloudWatch, AWS IoT, and custom applications using the AWS SDK, and deliver it to destinations such as Amazon S3, Amazon Redshift, Amazon Elasticsearch, and other services. In this post we will use Amazon S3 as the firehose’s destination.

In some cases you may need to manipulate the data as it goes through the firehose to remove sensitive information. In this blog post we will show how Amazon Kinesis Firehose and AWS Lambda can be used in conjunction with Philter to remove sensitive information (PII and PHI) from the text as it travels through the firehose.

Philter is software that redacts PII, PHI, and other sensitive information from text. Philter runs entirely within your private cloud, so your data never leaves it and is not sent to any third party. In fact, you can run Philter without any external network connectivity and we recommend doing so!

Prerequisites

You must have a running instance of Philter. If you don’t already have one, you can launch one through the AWS Marketplace. There are CloudFormation and Terraform scripts for launching a single instance of Philter or a load-balanced auto-scaled set of Philter instances.

The instance of Philter does not have to run in AWS, but it must be reachable from your AWS Lambda function. Running Philter and your AWS Lambda function in the same VPC lets the function call Philter over the VPC's private network. This keeps the network traffic inside your VPC, so your sensitive information is never sent over the public internet.

Setting up the Amazon Kinesis Firehose Transformation

There is no need to duplicate an excellent blog post on creating an Amazon Kinesis Firehose Data Transformation with AWS Lambda. Instead, refer to the linked page and substitute the Python 3 code below for the code in that blog post.

Configuring the Firehose and the Lambda Function

To start, create an Amazon Kinesis Firehose delivery stream and configure an AWS Lambda transformation. When creating the AWS Lambda function, select Python 3.7 and use the following code:

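import base64

# The requests library must be packaged with the function (or supplied by a
# Lambda layer); botocore.vendored.requests is deprecated.
import requests

# Replace PHILTER_IP with the address of your Philter instance.
PHILTER_URL = "https://PHILTER_IP:8080/api/filter"


def handler(event, context):
    output = []

    for record in event['records']:
        # Firehose delivers each record's data base64-encoded.
        payload = base64.b64decode(record['data'])
        headers = {'Content-type': 'text/plain'}

        # verify=False skips verification of Philter's self-signed certificate.
        r = requests.post(PHILTER_URL, verify=False, data=payload, headers=headers, timeout=20)
        filtered = r.text

        # Re-encode the redacted text, newline-terminated, for delivery.
        output_record = {
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(filtered.encode('utf-8') + b'\n').decode('utf-8')
        }
        output.append(output_record)

    # Firehose expects the transformed records under the 'records' key.
    return {'records': output}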

The following Kinesis Firehose test event can be used to test the function:

{
  "invocationId":"invocationIdExample",
  "deliveryStreamArn":"arn:aws:kinesis:EXAMPLE",
  "region":"us-east-1",
  "records":[
    {
      "recordId":"49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp":1495072949453,
      "data":"R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    },
    {
      "recordId":"49546986683135544286507457936321625675700192471156785154",
      "approximateArrivalTimestamp":1495072949453,
      "data":"R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
    }
  ]
}
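
To see what a record in the test event actually carries, the data value can be decoded locally. The following is a small sketch using only the standard library; decode_record_data is a hypothetical helper name, not part of Firehose or Philter. The decoded text is a sentence about George Washington that includes an SSN and a ZIP code among other values.

```python
import base64

# The "data" value shared by both records in the test event.
TEST_EVENT_DATA = (
    "R2VvcmdlIFdhc2hpbmd0b24gd2FzIHByZXNpZGVudCBhbmQgaGlzIHNzbiB3YXMgMTIzLTQ1"
    "LTY3ODkgYW5kIGhlIGxpdmVkIGF0IDkwMjEwLiBQYXRpZW50IGlkIDAwMDc2YSBhbmQgOTM4"
    "MjFhLiBIZSBpcyBvbiBiaW90aW4uIERpYWdub3NlZCB3aXRoIEEwMTAwLg=="
)


def decode_record_data(data: str) -> str:
    """Decode a Firehose record's base64 data field into text."""
    return base64.b64decode(data).decode("utf-8")


if __name__ == "__main__":
    print(decode_record_data(TEST_EVENT_DATA))
```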

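This test event contains two records, and the data for each is base64-encoded text that includes an SSN and a ZIP code among other sensitive values. When the test is executed, each returned record carries the redacted text, base64-encoded, in its data field. For input text such as “He lived in 90210 and his SSN was 123-45-6789.” the redacted text looks like: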

[
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.",
  "He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}."
]

When running the test, the AWS Lambda function extracts the data from each record in the firehose event and submits it to Philter for filtering. The function returns one output record per input record, with the redacted text base64-encoded in its data field. Note that our Python function skips verification of Philter’s self-signed certificate (verify=False). It is recommended that you use a valid signed certificate for Philter and enable verification.
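
The function above reports every record as Ok. Firehose also accepts Dropped and ProcessingFailed as result values, so one option is to flag an individual record as failed when the call to Philter raises an error, rather than failing the whole batch. The following is a minimal sketch of that idea, not part of the original function: transform_records and fake_redact are hypothetical names, and fake_redact merely stands in for the HTTP request to Philter so the example runs locally.

```python
import base64
from typing import Callable, Dict, List


def transform_records(records: List[Dict], redact: Callable[[bytes], str]) -> Dict:
    """Apply redact to each record; a failure marks only that record as failed."""
    output = []
    for record in records:
        try:
            text = redact(base64.b64decode(record["data"]))
            data = base64.b64encode(text.encode("utf-8") + b"\n").decode("utf-8")
            result = "Ok"
        except Exception:
            # Hand the original data back and flag the record as failed.
            data = record["data"]
            result = "ProcessingFailed"
        output.append({"recordId": record["recordId"], "result": result, "data": data})
    return {"records": output}


def fake_redact(payload: bytes) -> str:
    """Hypothetical stand-in for Philter: masks one known SSN, rejects empty input."""
    if not payload:
        raise ValueError("empty payload")
    return payload.decode("utf-8").replace("123-45-6789", "{{{REDACTED-ssn}}}")


if __name__ == "__main__":
    sample = [
        {"recordId": "1", "data": base64.b64encode(b"His SSN was 123-45-6789.").decode("utf-8")},
        {"recordId": "2", "data": ""},
    ]
    print(transform_records(sample, fake_redact))
```

Records flagged as ProcessingFailed still contain the unredacted text, so it is worth checking where your delivery stream writes unsuccessfully processed records.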

Once the transformation is in place, data published to the Amazon Kinesis Data Firehose stream is processed by the AWS Lambda function and Philter before it leaves the firehose at its configured destination.

Processing Data

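We can use the AWS CLI to publish data to our Amazon Kinesis Firehose stream called sensitive-text. The record is passed as a structure with a Data member. With AWS CLI version 2, the --cli-binary-format raw-in-base64-out option sends the text as-is instead of expecting base64 input (omit that option with version 1):

aws firehose put-record --cli-binary-format raw-in-base64-out --delivery-stream-name sensitive-text --record '{"Data":"He lived in 90210 and his SSN was 123-45-6789."}'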

After the stream's buffering interval has passed, check the destination S3 bucket and you will have a single object with the following line:

He lived in {{{REDACTED-zip-code}}} and his SSN was {{{REDACTED-ssn}}}.
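
The same record can also be published from Python. The following is a minimal sketch, assuming boto3 is installed and AWS credentials are configured; build_put_record_args and publish are hypothetical helper names, while put_record is the boto3 Firehose client call.

```python
from typing import Dict


def build_put_record_args(stream_name: str, text: str) -> Dict:
    """Build the keyword arguments for the Firehose PutRecord call."""
    # boto3 handles the base64 encoding of the Data blob, so raw bytes go here.
    return {
        "DeliveryStreamName": stream_name,
        "Record": {"Data": text.encode("utf-8")},
    }


def publish(text: str, stream_name: str = "sensitive-text") -> str:
    """Send one record to the delivery stream and return its RecordId."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("firehose")
    response = client.put_record(**build_put_record_args(stream_name, text))
    return response["RecordId"]


# Example call (requires AWS credentials and an existing delivery stream):
# publish("He lived in 90210 and his SSN was 123-45-6789.")
```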

Conclusion

In this blog post we have created an Amazon Kinesis Data Firehose pipeline that uses an AWS Lambda function to remove PII and PHI from the text in the streaming pipeline.

Philter is available from the AWS Marketplace. Not using AWS? Philter is also available from the Google Cloud Marketplace and the Microsoft Azure Marketplace.

Phileas — The Open Source PII and PHI redaction engine

I am delighted to announce the project that provides the core PII and PHI redaction capabilities is now open source! Introducing Phileas, the PII and PHI redaction engine! Phileas is now available under the Apache license on GitHub.

Both Philter and Phirestream use Phileas to identify and redact sensitive information like PII and PHI. Phileas does all of the heavy lifting, while Philter and Phirestream make its functionality user-friendly and provide the NLP models.

Everyone is welcome to look at the code that powers Philter and Phirestream, use it, and contribute! In the next few weeks we will be adding better developer documentation to help you utilize Phileas in your applications. For the past 5 years, Phileas was only an internal project used by Philter and Phirestream, so please hang with us while we smooth out the edges and add user-facing documentation!

Philter and Phirestream will remain on the AWS, Azure, and Google Cloud marketplaces. We will continue to provide commercial support for those products. New versions of Philter and Phirestream will use the open source Phileas project.

We decided to open source Phileas because, firstly, we believe in open source. We also want to give our users the ability to look into how Philter and Phirestream work. Identifying and redacting sensitive information is a challenge with important implications! We want our users to have a better understanding of how these products work and to have a more open line of communication as to what features are implemented next. In that regard, we will be migrating our tasks over from our private Jira to GitHub issues in the next few days as well.

What is format-preserving encryption?

In cryptography, you have plain text and cipher text. An encryption algorithm transforms the plain text into the cipher text. The cipher text generally won’t look anything like the plain text, in terms of characters and length. There are many different kinds of encryption algorithms, serving many different purposes, and each produces a different cipher text for the same plain text.

Let’s take the case of a credit card number, a common piece of sensitive information that is often encrypted. A credit card number is typically 16 digits long. Encrypting it with the industry-standard AES-128-CBC algorithm produces a cipher text longer than the credit card number: at least 32 bytes of binary data with standard padding, and more once it is encoded as text. If we are storing the credit card number in a database column configured for length 16, the cipher text will be too long to be stored in that column.
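
The size growth can be worked out with simple arithmetic. The sketch below assumes PKCS#7 padding, a 16-byte IV stored alongside the cipher text, and base64 encoding for storage as text; those choices are assumptions for illustration, and the helper names are hypothetical. No encryption is performed, only the lengths are computed.

```python
import base64

AES_BLOCK_BYTES = 16


def cbc_ciphertext_length(plaintext_length: int) -> int:
    """Cipher text length in bytes for AES-CBC with PKCS#7 padding."""
    # PKCS#7 always adds 1 to 16 bytes, so an exact multiple of the block
    # size still grows by one whole block.
    return (plaintext_length // AES_BLOCK_BYTES + 1) * AES_BLOCK_BYTES


def stored_length(plaintext_length: int) -> int:
    """Characters needed to store IV plus cipher text as base64 text."""
    raw_bytes = AES_BLOCK_BYTES + cbc_ciphertext_length(plaintext_length)
    return len(base64.b64encode(bytes(raw_bytes)))


if __name__ == "__main__":
    print(cbc_ciphertext_length(16))  # 32 bytes for a 16-character card number
    print(stored_length(16))          # 64 characters once IV is added and base64-encoded
```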

Format-preserving encryption is a method of encryption that causes the cipher text to retain the same format as the plain text. For example, encrypting a credit card number with a format-preserving encryption algorithm will result in a cipher text that is also 16 digits long but otherwise looks nothing like the original credit card number. Typically, only numeric, alphabetic, or alphanumeric characters can be used with format-preserving encryption.

The cipher text can be decrypted into the original plain text if the original credit card numbers are needed.
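
To make the idea concrete, here is a toy sketch of a format-preserving cipher for even-length digit strings, built as a Feistel network with HMAC-SHA256 as the round function. This is an illustration only: it is not a standardized algorithm such as NIST FF1 or FF3-1, it is not how Philter implements format-preserving encryption, and it should not be used to protect real data. The function names are hypothetical.

```python
import hashlib
import hmac


def _round_value(key: bytes, round_index: int, half: str, modulus: int) -> int:
    """Round function: HMAC-SHA256 over the round number and one half, mod modulus."""
    message = f"{round_index}:{half}".encode("ascii")
    digest = hmac.new(key, message, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _check(digits: str) -> int:
    """Validate the input and return the width of each half."""
    if not (digits.isascii() and digits.isdigit()) or len(digits) % 2:
        raise ValueError("expected an even-length string of ASCII digits")
    return len(digits) // 2


def toy_fpe_encrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Encrypt a digit string into another digit string of the same length."""
    width = _check(digits)
    modulus = 10 ** width
    left, right = digits[:width], digits[width:]
    for i in range(rounds):
        mixed = (int(left) + _round_value(key, i, right, modulus)) % modulus
        left, right = right, str(mixed).zfill(width)
    return left + right


def toy_fpe_decrypt(key: bytes, digits: str, rounds: int = 10) -> str:
    """Reverse toy_fpe_encrypt by undoing the rounds in reverse order."""
    width = _check(digits)
    modulus = 10 ** width
    left, right = digits[:width], digits[width:]
    for i in reversed(range(rounds)):
        unmixed = (int(right) - _round_value(key, i, left, modulus)) % modulus
        left, right = str(unmixed).zfill(width), left
    return left + right


if __name__ == "__main__":
    key = b"demo key, not a real secret"
    encrypted = toy_fpe_encrypt(key, "4111111111111111")
    print(encrypted, toy_fpe_decrypt(key, encrypted))
```

The cipher text is still a string of 16 digits, so it fits in the same database column, and decrypting with the same key returns the original number.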

Learn more about format-preserving encryption.

Format-Preserving Encryption in Philter

Philter 2.1.0 adds format-preserving encryption as a filter strategy for bank numbers, bitcoin addresses, credit cards, driver's license numbers, IBAN codes, passport numbers, SSNs/TINs, package tracking numbers, and VINs. When you specify FPE_ENCRYPT_REPLACE as the filter strategy for one of those types of PII, Philter encrypts the PII using format-preserving encryption.

Philter will replace the original PII with its encrypted version, and since format-preserving encryption was used, the replacement (encrypted) value will appear in the same format. This is useful when it is important that PII be encrypted but its length not be modified.

If you are not concerned about encrypting the original value, you can use the RANDOM_REPLACE filter strategy to replace PII with random values also in the same format as the original PII. Just remember that random replacement is not encryption and is not reversible. Use random replacement when using documents for machine learning or other processes where the original values are not important.
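
The difference from encryption is easy to see in code. The following sketch is only an illustration of same-format random replacement, not Philter's RANDOM_REPLACE implementation; random_replace is a hypothetical function. Each digit or letter is swapped for a random character of the same kind, and nothing about the original value is kept, so there is no way to reverse it.

```python
import random
import string


def random_replace(value: str, rng: random.Random) -> str:
    """Swap each digit or letter for a random one of the same kind."""
    out = []
    for ch in value:
        if ch in string.digits:
            out.append(rng.choice(string.digits))
        elif ch in string.ascii_uppercase:
            out.append(rng.choice(string.ascii_uppercase))
        elif ch in string.ascii_lowercase:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators such as '-' or ' ' in place
    return "".join(out)


if __name__ == "__main__":
    # SystemRandom draws from the operating system's entropy source.
    print(random_replace("123-45-6789", random.SystemRandom()))
```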

To enable format-preserving encryption for a type of sensitive information, simply add it to the filter profile. The following is an example filter profile that uses format-preserving encryption for credit card numbers. Just replace the key and tweak values with your own values.

{
  "name": "credit-cards",
  "identifiers": {
    "creditCardNumbers": {
      "creditCardNumberFilterStrategies": [
        {
          "strategy": "FPE_ENCRYPT_REPLACE",
          "key": "...",
          "tweak": "..."
        }
      ]
    }
  }
}

Learn more about format-preserving encryption in Philter’s User Guide. Also, Philter has several other filter strategies to give full control over how your data is redacted.