Phileas 2.12.0

Phileas 2.12.0 has been released. This version of the popular open source redaction library brings:

  • Filter priorities – Each filter can have its own priority that is used as a tie-breaker in cases where text is identified by two filters. For example, if you are using the phone number filter and an ID filter of 10 digit numbers, both filters may detect PII on the same text. In this case, the filter priority will be used to determine the ultimate labeling of the text as either a phone number or an ID number.
  • Zip code validation – The zip code filter can now optionally attempt to validate zip codes. When enabled, if a zip code does not exist in the internal database, the zip code will not be redacted.
  • Each filter can have a custom window size – The window size is roughly the number of words surrounding PII that is used to provide contextual information about the PII. Previously, each filter had to use the same window size. Now, each filter can have the window size set independently.

Look for a new version of Philter soon in the AWS, Google Cloud, and Azure marketplaces soon that is built on Phileas 2.12.0!

Shielding Your Search: Redacting PII and PHI in OpenSearch with Phinder

In today’s data-driven world, safeguarding Personally Identifiable Information (PII) and Protected Health Information (PHI) is paramount. When leveraging search platforms like OpenSearch, ensuring sensitive data remains confidential is crucial. Enter Phinder, an open-source OpenSearch plugin that leverages the power of the Phileas project to effectively redact and de-identify PII and PHI within your search results.

This post explores how Phinder can bolster your data privacy and security when using OpenSearch.Phinder is available on GitHub at https://github.com/philterd/phinder-pii-opensearch-plugin.

What is Phinder?

Phinder is a specialized OpenSearch plugin designed to seamlessly integrate redaction and de-identification capabilities directly into your search workflow. Built upon the foundation of the open-source Phileas project, Phinder provides a robust and flexible mechanism for identifying and masking sensitive information within your indexed documents. This ensures you can search your data without the risk of exposing PII or PHI, which is essential for compliance with regulations like GDPR, CCPA, and HIPAA.

Phileas: The Engine Behind Phinder

Phinder leverages the Phileas project, a powerful engine for identifying and transforming sensitive data. Phileas offers a wide range of capabilities, including:

  • Named Entity Recognition (NER): Identifying and classifying named entities like people, organizations, locations, and dates.
  • Regular Expressions: Matching patterns for specific data formats like phone numbers, email addresses, and social security numbers.
  • Dictionaries: Using lists of known sensitive terms for redaction.
  • Customizable Rules: Defining your own specific redaction rules based on your unique data and requirements.

By integrating Phileas, Phinder benefits from its sophisticated analysis and transformation capabilities, providing a comprehensive solution for data protection.

Why use Phinder?

  • Enhanced Data Privacy: Phinder gives you granular control over what information is displayed in search results, preventing the accidental exposure of sensitive data.
  • Regulatory Compliance: By redacting PII and PHI, Phinder helps your organization meet the stringent requirements of data privacy and security regulations.
  • Improved Security Posture: Reducing the risk of data breaches associated with sensitive information.
  • Flexible and Customizable: Phinder’s integration with Phileas allows for highly flexible configuration of redaction rules, tailored to your specific needs.
  • Open Source and Community Driven: Being open-source, Phinder is free to use and benefits from community contributions and ongoing improvements.

How to Use Phinder

  1. Installation: The first step is to install the Phinder plugin within your OpenSearch cluster.  Refer to the Phinder documentation on GitHub for detailed installation instructions specific to your OpenSearch version.
  2. Defining Redaction Rules in a Policy (Leveraging Phileas): This is the core of Phinder’s functionality. You’ll leverage Phileas’s capabilities to identify the types of PII and PHI you want to protect (e.g., names, addresses, social security numbers, medical record numbers) and create corresponding rules. You can use regular expressions, dictionaries, or leverage pre-trained NER models provided by Phileas.
  3. Testing and Validation: Once you’ve configured Phinder, thorough testing is essential. Run searches against your data and verify that the sensitive information is being correctly redacted and de-identified.
  4. Integration with OpenSearch Queries: After testing, you can integrate Phinder directly into your OpenSearch queries. This ensures that redaction happens automatically whenever a search is performed.

The following is an example query that redacts email addresses from the description field.

curl -s http://localhost:9200/sample_index/_search -H "Content-Type: application/json" -d'
   {
    "ext": {
       "phinder": {
          "field": "description",
          "policy": "{\"identifiers\": {\"emailAddress\":{\"emailAddressFilterStrategies\":[{\"strategy\":\"REDACT\",\"redactionFormat\":\"{{{REDACTED-%t}}}\"}]}}}"
        }
     },
     "query": {
       "match_all": {}
     }
   }'

Conclusion

Phinder, powered by Phileas, offers a robust and effective solution for protecting sensitive data within your OpenSearch environment. By implementing Phinder and defining appropriate redaction and de-identification rules, you can significantly reduce the risk of exposing PII and PHI, ensuring compliance and enhancing data privacy. Remember to consult the official Phinder documentation on GitHub for the most up-to-date information and detailed instructions. Protecting sensitive data is a continuous process, and Phinder can be a valuable tool in your data privacy strategy.

Phileas 2.10.0

We are excited to announce the release of Phileas 2.10.0!

What’s changed in this version:

* Making FilterResponse not be a final record class by @jzonthemtn in #166
* Removing commons-csv dependency by @jzonthemtn in #174
* Removing guava dependency and adding bloom filter by @jzonthemtn in #172
* Update pdfbox to 3.0.* by @JessieAMorris in #177
* Fixes a bug with the policy service being hard coded to “local” by @JessieAMorris in #178
* Enable outputting the replacement value on PDFs by @JessieAMorris in #179
* Add truncation filter strategy for all filters by @JessieAMorris in #180
* Adding line about snapshots being published nightly. by @jzonthemtn in #182
* #183 Replacing redis test dependency. by @jzonthemtn in #184
* Replace the Lucene-based filter with a fuzzy dictionary filter by @jzonthemtn in #185
* GitHub release: https://github.com/philterd/phileas/releases/tag/2.10.0

Artifacts are available as described in the README.

Phileas in Graylog – Removing PII from Logs

We are very excited to share with you that Graylog has integrated Phileas, the open source PII/PHI redaction engine, into their centralized log management solution. With this new integration, Graylog now has the ability to identify and redact different types of PII (personally identifiable information) present in logs.

The presence of PII in logs is a serious concern. Even careful application developers can find it difficult to prevent all PII from being included in logs. Error messages and stack traces can inadvertently include PII exposing the business to risk and liability.

Phileas is the heart of Philter, an API-based redaction engine. Philter, also open source, provides users with a centralized tool for finding and manipulating PII and PHI in text. With Philter, sensitive information can be redacted, anonymized, or replaced. Philter is available on the AWS, Google Cloud, and Microsoft Azure marketplaces for deployment into your private cloud. Philter requires no outside internet access so your sensitive data never needs to leave your network to be redacted.

Because Phileas is licensed under the business-friendly open source Apache license, organizations are able to bring Phileas’ ability to find and redact PII into their own applications. To learn more about Phileas or to get started integrating Phileas into your applications, visit the Phileas repository on GitHub.

Phileas 2.9.1

We are excited to announce the release of Phileas 2.9.1.

What’s changed in this version:

* LineWidthSplitService is using a new line separator instead of a space
* An empty list of spans from ph-eye does not indicate failure
* Have a default PhEyeConfiguration value in AbstractPhEyeConfiguration so a filter does not have to provide one

GitHub release: https://github.com/philterd/phileas/releases/tag/2.9.1

Artifacts are available in the Philterd repository as described in the README.

Phileas — The Open Source PII and PHI redaction engine

I am delighted to announce the project that provides the core PII and PHI redaction capabilities is now open source! Introducing Phileas, the PII and PHI redaction engine! Phileas is now available under the Apache license on GitHub.

Both Philter and Phirestream use Phileas to identify and redact sensitive information like PII and PHI. Phileas does all of the heavy lifting, while Philter and Phirestream make its functionality user-friendly and provide the NLP models.

Everyone is welcome to look at the code that powers Philter and Phirestream, use it, and contribute! In the next few weeks we will be adding better developer documentation to help you utilize Phileas in your applications. For the past 5 years, Phileas was only an internal project used by Philter and Phirestream, so please hang with us while we smooth out the edges and add user-facing documentation!

Philter and Phirestream will remain on the AWS, Azure, and Google Cloud marketplaces. We will continue to provide commercial support for those products. New versions of Philter and Phirestream will use the open source Phileas project.

We decided to open source Phileas because, firstly, we believe in open source. We also want to give our users the ability to look into how Philter and Phirestream work. Identifying and redacting sensitive information is a challenge with important implications! We want our users to have a better understanding of how these products work and to have a more open line of communication as to what features are implemented next. In that regard, we will be migrating our tasks over from our private Jira to GitHub issues in the next few days as well.

What is format-preserving encryption?

In cryptography, you have plain text and cipher text. An encryption algorithm transforms the plain text into the cipher text. The cipher text won’t look anything like the plain text, in terms of characters and length. There are many different kinds of encryption algorithms, serving many different purposes. The cipher text for each of these algorithms will all be different.

Let’s take the case of a credit card number, a common piece of sensitive information that is often encrypted. A credit card number is 16 digits long. Encrypting the credit card number with the industry standard AES-128-CBC algorithm will produce a cipher text much longer than the credit card number. If we are storing the credit card number in a database column configured for length 16, the cipher text will be too long to be stored in the database column.

Format-preserving encryption is a method of encryption that causes the cipher text to retain the same format as the plain text. For example, encrypting a credit card number with a format-preserving encryption algorithm will result in a cipher text of 16 characters in length, but will look nothing else like the original credit card number. Typically, only numeric, alphabetic, or alphanumeric characters can be used with format-preserving encryption.

The cipher text can be decrypted into the original plain text if the original credit card numbers are needed.

Learn more about format-preserving encryption.

Format-Preserving Encryption in Philter

Philter 2.1.0 adds format-preserving encryption as a filter strategy for bank numbers, bitcoin addresses, credit cards, drivers license numbers, IBAN codes, passport numbers, SSNs/TINs, package tracking numbers, and VINs. By specifying FPE_ENCRYPT_REPLACE as the filter strategy for one of those items of PII, Philter will encrypt the PII using format-preserving encryption.

Philter will replace the original PII with its encrypted version, and since format-preserving encryption was used, the replacement (encrypted) value will appear in the same format. This is useful when it is important that PII be encrypted but its length not be modified.

If you are not concerned about encrypting the original value, you can use the RANDOM_REPLACE filter strategy to replace PII with random values also in the same format as the original PII. Just remember that random replacement is not encryption and is not reversible. Use random replacement when using documents for machine learning or other processes where the original values are not important.

To enable format-preserving encryption for a type of sensitive information, simply add it to the filter profile. The following is an example filter profile that uses format-preserving encryption for credit card numbers. Just replace the key and tweak values with your own values.

{ "name": "credit-cards", "identifiers": { "creditCardNumbers": { "creditCardNumberFilterStrategies": [ { "strategy": "FPE_ENCRYPT_REPLACE", "key": "...", "tweak: "..." } ] } } }

Learn more about format-preserving encryption in Philter’s User Guide. Also, Philter has several other filter strategies to give full control over how your data is redacted.