Using an LLM or Pattern-based Rules for PII/PHI Redaction

In our data-driven world, protecting Personally Identifiable Information (PII) and Protected Health Information (PHI) is imperative. Whether you’re securing customer data, complying with regulations like GDPR or HIPAA, or simply aiming for responsible data handling, the ability to effectively redact sensitive information is crucial.

Today, there are two primary approaches: leveraging the power of Large Language Models (LLMs) and employing traditional pattern-based rules. While LLMs have understandably received significant attention for their impressive natural language understanding, it’s essential to compare their capabilities against the tried-and-true methods of pattern matching.

In this blog post, we will take a look at the benefits of pattern-based rules and some drawbacks of an LLM-based approach.

Pattern-based Rules

Pattern-based rules operate on a straightforward principle: defining specific regular expressions or keyword lists to identify and redact PII. Think of it as providing a precise set of instructions to locate and mask patterns that match social security numbers, phone numbers, email addresses, and other sensitive data.
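
For illustration, here is a minimal sketch of this idea in Python using the standard re module. The patterns and placeholder labels are illustrative, not production-ready; real rule sets cover many more formats and variations.

```python
import re

# Illustrative patterns only; a production rule set would cover many more formats.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def redact(text: str) -> str:
    """Replace every match with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Call (555) 123-4567 or email jane.doe@example.com."))
# Call [PHONE] or email [EMAIL].
```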

A pattern-based approach has several advantages:

  • Speed and Efficiency: Pattern matching is generally very fast, processing large volumes of text quickly without significant computational overhead.
  • Transparency and Control: You have direct control over the rules, making it easy to understand why a particular piece of text was flagged as PII. Debugging and refining these rules is straightforward: if something expected is not matched, it is relatively easy to figure out why.
  • Computational Efficiency: Pattern-based methods require minimal computational resources, running effectively on standard CPUs without the need for specialized hardware like a GPU.
  • Precise Matching: When rules are well-defined, they offer high precision in identifying specific PII formats. For instance, a regex for a credit card number can incorporate the Luhn algorithm for basic validation, as the sketch after this list shows.
  • Flexibility for Specific Formats: Handling variations like phone numbers with or without parentheses or different date formats can be explicitly addressed through rule creation.
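
To illustrate the validation point above, here is a minimal sketch of a Luhn check gating a credit card rule. The regex and placeholder label are simplified for the example:

```python
import re

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, subtract 9
    from any result over 9; the number is valid if the total ends in 0."""
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

# Candidate card numbers: 13-16 digits, optionally separated by spaces or dashes.
CARD = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")

def redact_cards(text: str) -> str:
    # Only redact sequences that actually pass the checksum.
    return CARD.sub(
        lambda m: "[CREDIT_CARD]" if luhn_valid(m.group()) else m.group(),
        text,
    )

print(redact_cards("Card 4111-1111-1111-1111 expires 12/26."))
# Card [CREDIT_CARD] expires 12/26.
```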

Drawbacks of LLM-Based PII Redaction

Using an LLM for PII redaction might seem like a good option – just download a model, give it your text (and maybe a prompt), and get back the redacted text. A search for “pii” on the Hugging Face model hub returns several models trained to identify PII.
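
For context, using such a model typically looks something like the following sketch with the transformers library. The model id below is a placeholder, not a recommendation; substitute an actual PII token-classification model from the Hub.

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="your-org/pii-ner-model",  # hypothetical model id; substitute a real one
    aggregation_strategy="simple",   # merge subword tokens into entity spans
)

def redact_with_model(text: str, threshold: float = 0.5) -> str:
    # Replace detected spans right-to-left so character offsets stay valid.
    entities = sorted(ner(text), key=lambda e: e["start"], reverse=True)
    for ent in entities:
        if ent["score"] >= threshold:
            text = text[:ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text
```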

But, before you go that route, be aware of some of the drawbacks:

  • Tokenization and Sentence Splitting: Before an LLM can analyze text, the input needs to be broken down into smaller units called tokens (often words or subword fragments). This pre-processing step can fragment PII that spans token boundaries or sentence splits, hindering accurate identification. Additionally, accurately splitting sentences can be quite a challenge in its own right; content such as medical notes or technical information may not even consist of complete sentences.
  • Ambiguity of Tokens: A fundamental limitation of tokenized input is that a token on its own can belong to multiple PII categories. For example, a sequence of digits might represent a zip code, a portion of a social security number, or part of a phone number. Without sufficient contextual clues, the model may struggle to classify such ambiguous tokens definitively, leading to PII going unredacted.
  • Resource Intensive: Running LLMs demands substantial computational resources. They typically require powerful GPUs and significant disk space to store the model weights. This can translate to increased infrastructure costs and more complex deployments.
  • Redaction Speed: Compared to the rapid execution of pattern-based rules, LLM inference is considerably slower. While a GPU can speed up inference, it is unlikely to ever match the throughput of a pattern-based system.
  • A Black Box: LLMs operate as complex neural networks, making it a challenge to understand exactly why a particular piece of text was (or wasn’t) identified as PII. This lack of transparency can make debugging and ensuring the reliability of the redaction process difficult. Furthermore, if you are using a third-party model, such as one from the Hugging Face Hub, you may know little about the data used to train it.
  • Robustness (e.g., Dashes and Spaces): LLMs can be surprisingly fragile when faced with minor variations in PII formatting. For instance, an LLM trained to recognize social security numbers in the “XXX-XX-XXXX” format might fail to identify “XXXXXXXXX” as an SSN, potentially misclassifying the nine digits as a ZIP+4 code. Similarly, the presence or absence of spaces in phone numbers or other identifiers can significantly impact the LLM’s ability to recognize them. The tokenizer sketch after this list shows how formatting changes what the model actually sees.
  • Validation of PII: Some forms of PII, like credit card numbers, adhere to specific validation algorithms (e.g., the Luhn algorithm). LLMs, while capable of learning patterns, do not inherently incorporate such validation logic. This means they might flag sequences that look like credit card numbers but are actually invalid.
  • Vocabulary Limitations: The effectiveness of an LLM is heavily influenced by its training vocabulary. If a specific PII format or a term used in conjunction with PII is not well-represented in the training data, the model’s ability to identify it accurately will be compromised.
  • The Need for Constant Model Retraining: When new types of PII need to be identified, or when existing PII formats evolve, an LLM typically requires retraining on a new dataset. This process can be time-consuming, resource-intensive, and requires specialized expertise. It is likely not a process you want to do often.
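
To see the tokenization and formatting issues concretely, the following sketch compares how a common tokenizer splits the same SSN with and without dashes. bert-base-cased is chosen only as an example; the exact token pieces depend on the model’s vocabulary.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")

# The same nine digits produce different token sequences depending on formatting,
# so a model trained on one form may not generalize to the other.
print(tok.tokenize("123-45-6789"))  # dashes yield one token sequence...
print(tok.tokenize("123456789"))    # ...unbroken digits yield another
```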

The Right Balance – Combining the Two Approaches

While LLMs offer exciting possibilities for understanding the context of PII, their inherent drawbacks make them a less-than-ideal sole solution for PII redaction in many scenarios. A more robust and reliable approach often involves a hybrid strategy: combining the speed and precision of pattern-based rules for well-defined PII formats with the contextual understanding of LLMs for more nuanced cases can offer a more comprehensive and efficient solution, as the sketch below illustrates. Learn how Philter employs a hybrid approach to data redaction at https://www.philterd.ai/.
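
Here is a minimal sketch of such a hybrid pipeline, assuming the redact() and redact_with_model() helpers from the earlier examples:

```python
def hybrid_redact(text: str) -> str:
    # Pass 1: fast, high-precision pattern rules for well-defined formats.
    text = redact(text)
    # Pass 2: model-based pass for contextual PII (names, addresses)
    # that rules alone cannot reliably capture.
    return redact_with_model(text)
```

Running the pattern pass first lets cheap, precise rules handle the bulk of well-formatted PII, leaving the slower model pass to catch only the contextual remainder.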