ASAPP has developed a proprietary redaction service for text and transcribed voice data. The service redacts all types of sensitive data, including PII, PCI data, and other types of Customer Data such as end-customer identifiers.

Our service uses a blend of leading redaction techniques, yielding best-in-class accuracy that exceeds many “redaction-focused” solutions on the market. Our redaction cocktail consists of:

  • P-Filtering: Redacts words based on a pre-trained unigram language model that estimates how frequently certain words are used
  • Named Entity Recognition (NER): Provides high accuracy for names, cities, countries, and other geographical indicators
  • Regex-based redaction: Our redaction engine identifies and redacts any text that matches a preset regular expression. Our default set of regex rules includes the patterns commonly used for credit cards, SSNs, emails, and similar PII and PCI information. We can further customize our redaction to meet the most stringent data obfuscation needs our customers might have
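The P-Filtering idea can be illustrated with a minimal sketch. This is not ASAPP's implementation: the toy corpus, the `min_prob` threshold, the mask string, and the function name `p_filter` are all assumptions made for illustration. It masks any token whose unigram probability falls below a threshold, on the premise that rare words are more likely to be sensitive.

```python
from collections import Counter

# Hypothetical toy corpus standing in for a pre-trained unigram model.
CORPUS = "i want to pay my bill please help me pay my bill today".split()
UNIGRAM_COUNTS = Counter(CORPUS)
TOTAL = sum(UNIGRAM_COUNTS.values())


def p_filter(text, min_prob=0.05, mask="[REDACTED]"):
    """Mask every token whose unigram probability is below min_prob."""
    out = []
    for token in text.split():
        # Counter returns 0 for unseen words, so their probability is 0.
        prob = UNIGRAM_COUNTS[token.lower()] / TOTAL
        out.append(token if prob >= min_prob else mask)
    return " ".join(out)
```

With this toy model, `p_filter("please help zorblatt pay my bill")` keeps the common words and masks the unseen token `zorblatt`.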

The Redaction Service can be used explicitly (i.e., call its endpoints) or be set up as a proxy for another service.

See the tables below for regex-based and PII-specific redaction rules.

Redaction Rules

The following table outlines available redaction rules implemented by ASAPP today and denotes which rules are implemented by default.

If needed, rules can be applied specifically to either participant in a conversation (agent or customer). By default, they are applied to utterances from all participants.

Regex-based Redaction

These are the redaction rules implemented by py-redaction. They are written in Python and implement each rule as described in the last column:

| Rule | Target | Mechanics | Description |
| --- | --- | --- | --- |
| digits 1-5 | digital | regex | Redacts sequences of N digits |
| digits >=6 | digital | regex | Redacts sequences of N digits, accounting for common separators |
| digits7plus | digital/voice | regex | Redacts sequences of 7+ digits, accounting for common separators |
| digitsN-dates | digital | regex | Redacts sequences of exactly N digits, accounting for common separators, except if they match a valid date |
| digitsTranscribed | digital | regex | Redacts any sequence of numbers, accounting for regular and spoken-form numbers |
| digits9ssn | voice | regex | Redacts 9-digit SSNs, accounting for separators used in SSNs |
| creditCard | digital | Luhn check + regex | Runs a Luhn check (all valid CCNs pass it) on number sequences and checks that the number pattern belongs to a known credit card issuer |
| truncateDigitsN | voice | regex | Redacts sequences of N numbers, accounting for spoken-form numbers |
| date | digital | regex | Redacts various date formats |
| profanity | digital/voice | regex + known terms | Redacts words and phrases present in a list of known bad words |
| voiceCCTruncate | voice | regex | Redacts any sequence of numbers whose length falls within the range of a valid credit card number (11-17 digits), accounting for spoken-form numbers |
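The Luhn check referenced in the creditCard rule is a standard public checksum. The sketch below shows only that checksum; the function name `luhn_valid` is an assumption, and the issuer-pattern check that the rule also performs is not shown.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters (spaces, hyphens) are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0
```

For example, the widely used test number `4111 1111 1111 1111` passes, while changing its last digit to `2` makes it fail.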

Read redaction policies to see which default rules are always required and what data elements are redacted for each ASAPP product.

When searching candidates for redaction, ASAPP will “skip over” spaces, hyphens, and periods whenever such “skipping” would yield a numeric string equal to or longer than six numeric characters in length.

This means that redaction rules for four or five digit sequences require that those digits be contiguous, whereas rules for longer sequences work even if there are spaces, hyphens, or periods in-between the digits.

For example, our algorithm considers the following strings to be equivalent:

1234 56 789
123-45-6789
1.234-56 789
123456789

Each of these strings matches the pattern of a 9-Digit Numeric, so each would be redacted per the standard policy outlined above. 
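As a rough illustration of separator skipping, the following sketch uses a regular expression that allows at most one space, hyphen, or period between consecutive digits. The pattern, the mask, and the function name `redact_nine_digits` are assumptions for illustration, not the patterns ASAPP ships.

```python
import re

# Nine digits, each optionally preceded by one space, period, or hyphen.
# The lookarounds keep the match from starting or ending mid-number.
NINE_DIGITS = re.compile(r"(?<!\d)\d(?:[ .-]?\d){8}(?!\d)")


def redact_nine_digits(text, mask="[REDACTED]"):
    """Replace any 9-digit sequence, ignoring common separators."""
    return NINE_DIGITS.sub(mask, text)
```

All four example strings above match this pattern in full, whereas an eight-digit string such as `1234 5678` is left untouched.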

N-Digit Numeric rules detect numbers that are spelled out (e.g., “five” rather than “5”) in addition to numerical digits. This capability is designed to detect transcribed numbers in voice calls that the ASR has not automatically converted to numerical digit format.
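One simple way to handle spoken-form numbers is to normalize number words to digits before applying the digit rules. The sketch below assumes a small English word list and a hypothetical function name `spoken_to_digits`; a production system would need to cover more forms (for example “double five” or “oh”).

```python
import re

# Assumed mapping of English number words to digits (illustrative only).
NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
# Word boundaries prevent matches inside longer words such as "someone".
WORD_RE = re.compile(r"\b(" + "|".join(NUMBER_WORDS) + r")\b", re.IGNORECASE)


def spoken_to_digits(text):
    """Replace spelled-out single digits with their numeric form."""
    return WORD_RE.sub(lambda m: NUMBER_WORDS[m.group(1).lower()], text)
```

After normalization, `"five five 5 one two"` becomes `"5 5 5 1 2"`, so a digit-sequence rule can treat spoken and numeric forms the same way.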

PII Specific Rules

| Rule | Target | Mechanics | Description |
| --- | --- | --- | --- |
| nerTags | digital/voice | NER | Uses machine learning algorithms to detect named entities such as NAME and ADDRESS |
| email | digital | regex | Redacts any well-formed email address (e.g., abc@gmail.com) |
| pfilterEmailVoice | voice | regex + pfilter | Applies pfilter to sentences that match an email-like format |
| phoneNumber | digital | regex | Redacts sequences of digits that could be phone numbers, accounting for common phone number formats (parentheses, dashes, etc.) |
| phoneNumberVoice | voice | regex | Redacts all-digit words in text that contains phone number keywords (e.g., “phone number”, “callback number”) |
| passcode | digital/voice | regex | Redacts all-digit words following a passcode keyword (e.g., “code”, “pin”) |
| pfilterPassword | digital/voice | regex + pfilter | Redacts every word surrounding password keywords that is any of: (1) letters combined with digits or special characters (a.b, hello1, hello$), OR (2) all digits (123), OR (3) a word whose frequency is low according to pfilter |
| spelledOut | voice | regex | Redacts sequences of more than one spelled-out character, which can be some form of PII |
| spelledOutSingle | voice | regex | Redacts a single character if it is the only character present in a sentence, most likely a customer dictating their PII |
| ssnVoice | voice | regex | Redacts digits surrounding SSN keywords |
| crediCardVoice | voice | regex | Redacts all digits surrounding card keywords |
| dob | digital/voice | regex | Redacts valid dates surrounding keywords like “birth”, “date of birth”, “born” |

Immediate and Delayed Redaction

For Messaging and Voice Applications only, ASAPP offers a delayed redaction capability in addition to immediate redaction.

Immediate redaction can happen either on the front-end, if the channel is leveraging an ASAPP SDK, or on the back-end through the ASAPP redaction service. If a channel does not leverage the ASAPP SDK, the front-end will still show the raw unredacted values to the customer; however, all data processed by the ASAPP back-end will still be redacted.

For delayed redaction, ASAPP removes sensitive data after a fixed time period (1 hour by default) so that data can be temporarily displayed to the agent. ASAPP stores this temporarily viewable data in a single cache. Each cache entry has its own Time-To-Live (TTL) starting from when it is first stored. At expiration time, ASAPP automatically removes the data from the cache and it is no longer available.
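The per-entry TTL behavior can be sketched as a small in-memory cache. This is an illustrative model only: the class name `DelayedRedactionCache`, its methods, and the lazy purge on read are assumptions, and ASAPP's actual cache is not described beyond the paragraph above.

```python
import time


class DelayedRedactionCache:
    """Holds raw values until a per-entry TTL expires (default 1 hour)."""

    def __init__(self, ttl_seconds=3600, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._entries = {}   # key -> (expires_at, raw_value)

    def put(self, key, raw_value):
        # The TTL starts when the entry is first stored; storing the
        # same key again does not extend it.
        self._entries.setdefault(key, (self._clock() + self._ttl, raw_value))

    def get(self, key, redacted="[REDACTED]"):
        entry = self._entries.get(key)
        if entry is None:
            return redacted
        expires_at, raw_value = entry
        if self._clock() >= expires_at:
            # Assumption: expired entries are purged lazily on read.
            del self._entries[key]
            return redacted
        return raw_value
```

An agent-facing view would call `get` and see the raw value during the first hour, then the redacted placeholder afterwards.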

Delayed redaction is applicable for Messaging and Voice Applications only. It is not available for AutoCompose, AutoSummary, AutoTranscribe, or JourneyInsight.

Context-Aware Redaction

For AutoTranscribe and Voice Application products, ASAPP employs redaction triggered by specific words in transcribed utterances. After a trigger word is detected, redaction is employed for a pre-configured time window.

This redaction capability allows organizations to target certain contexts rather than applying a rule for the entirety of a conversation, which would otherwise lead to over-redaction of useful information.

By default, ASAPP applies two context-aware rules to all conversation transcripts created with AutoTranscribe and the Voice Application:

  1. Credit Card Numbers: After words related to credit cards and payments are detected, all numerical digits are redacted for a 180-second window.
  2. Social Security Numbers: After words related to social security numbers are detected, all numerical digits are redacted for a 180-second window.
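The trigger-and-window behavior can be sketched as follows. The trigger phrases, the `#` mask, and the function name `redact_transcript` are hypothetical; only the 180-second window comes from the defaults described above.

```python
import re

# Hypothetical trigger phrases; the real keyword lists are not published here.
TRIGGERS = re.compile(
    r"\b(credit card|card number|social security)\b", re.IGNORECASE
)
DIGIT = re.compile(r"\d")
WINDOW_SECONDS = 180


def redact_transcript(utterances):
    """Redact digits for WINDOW_SECONDS after each trigger utterance.

    `utterances` is a list of (start_time_seconds, text) in time order.
    """
    window_end = None
    out = []
    for start, text in utterances:
        if TRIGGERS.search(text):
            # Each new trigger (re)starts the window from its own timestamp.
            window_end = start + WINDOW_SECONDS
        if window_end is not None and start <= window_end:
            text = DIGIT.sub("#", text)
        out.append((start, text))
    return out
```

Digits spoken before the trigger or after the window closes are left intact, which is what limits over-redaction relative to a conversation-wide rule.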

Context-aware redaction is only offered for AutoTranscribe and Voice Application implementations at this time.

How ASAPP Redacts the Non-Numeric Strings

ASAPP handles each type of non-numeric string as follows:

Dates

Date formats introduce additional difficulties when it comes to redaction, but they can still fundamentally be handled with regular expressions covering several possible formats. ASAPP typically handles the following formats:

Numeric only date formats:

12/15/2018
12-15-2018
12/1/2018
12-1-2018
12/15/18
12-15-18
12.15.18
12.15.2018
12 15 2018
12 15 18 \*
2018-12-15
  • European variants are also handled, e.g., 15/12/2018

Mixed Numeric / Textual date formats

Dec 15 2018
December 15 2018
Dec 15 18
December 15 18
Dec 15th 2018
Dec 15th 18
15th Dec 2018
15th December 2018
15th Dec 18
15th of December 2018
15th of Dec 2018
15th of Dec in 2018
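The formats listed above can be approximated with a short sketch. It is an assumption-laden illustration rather than ASAPP's rule set: numeric candidates are validated with `datetime.strptime` against an assumed format list, and textual dates are matched with an assumed regular expression. All names here are hypothetical.

```python
import re
from datetime import datetime

# Assumed strptime formats covering the numeric examples above.
NUMERIC_FORMATS = [
    "%m/%d/%Y", "%m-%d-%Y", "%m.%d.%Y", "%m %d %Y",
    "%m/%d/%y", "%m-%d-%y", "%m.%d.%y", "%m %d %y",
    "%Y-%m-%d", "%d/%m/%Y",
]


def is_numeric_date(candidate: str) -> bool:
    """True if the candidate parses as a valid date in any known format."""
    for fmt in NUMERIC_FORMATS:
        try:
            datetime.strptime(candidate, fmt)
            return True
        except ValueError:
            continue
    return False


MONTH = (r"(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|"
         r"july?|aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|"
         r"nov(?:ember)?|dec(?:ember)?)")
DAY = r"\d{1,2}(?:st|nd|rd|th)?"
YEAR = r"(?:\d{4}|\d{2})"
# Matches "Dec 15th 2018" as well as "15th of Dec in 2018".
TEXTUAL_DATE = re.compile(
    rf"\b(?:{MONTH}\s+{DAY}|{DAY}\s+(?:of\s+)?{MONTH})\s+(?:in\s+)?{YEAR}\b",
    re.IGNORECASE,
)
```

Validating with `strptime` rejects impossible values such as `13/32/2018`, which a purely structural regular expression would accept.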

Emails

Email address formats are specified across a number of RFCs (specifically RFC 5321, RFC 6531, and RFC 6532). These formats have surprisingly loose standards on what constitutes a correct and valid email address, and RFC 6531 and RFC 6532 loosen them further to allow for internationalized addresses. The one aspect of the specification that is clear is that email addresses must take the form:

local-part@domain

The “local-part” is the section of the string that allows the greatest variation, including characters like “+”. Hence, identifying email addresses based on the “local-part” alone is infeasible.

The “@” symbol itself is also common in modern digital communication outside of email addresses. To avoid over-redacting important information that the customer conveys, which may be needed in subsequent support chats, ASAPP focuses on removing obvious email addresses such as:

local-part@somedomain.com
local-part@somedomain.edu
local-part@somedomain.net
local-part@somedomain.org
local-part@somedomain.gov

ASAPP also identifies and redacts improperly formed email addresses such as:

local-part@somedomaincom
local-part@somedomainedu
local-part@somedomainnet
local-part@somedomainorg
local-part@somedomaingov

In addition, ASAPP identifies and redacts email addresses that, while incorrectly formed, refer to well-known email providers:

local-part@hotmail
local-part@gmail
local-part@yahoo
local-part@verizon
local-part@outlook
local-part@inbox
local-part@icloud
local-part@live
local-part@comcast
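The three categories above (well-formed addresses, addresses missing the dot before the top-level domain, and bare well-known providers) can be covered by one illustrative pattern. The exact character classes, the `[EMAIL]` mask, and the function name `redact_emails` are assumptions; the TLD and provider lists mirror the examples in this section.

```python
import re

LOCAL = r"[A-Za-z0-9._%+-]+"   # assumed subset of valid local-part characters
LABEL = r"[A-Za-z0-9-]+"
TLDS = r"(?:com|edu|net|org|gov)"
PROVIDERS = r"(?:hotmail|gmail|yahoo|verizon|outlook|inbox|icloud|live|comcast)"

EMAIL = re.compile(
    # Either a domain ending in a known TLD (dot optional), or a bare provider.
    rf"{LOCAL}@(?:(?:{LABEL}\.)*{LABEL}\.?{TLDS}|{PROVIDERS})\b",
    re.IGNORECASE,
)


def redact_emails(text, mask="[EMAIL]"):
    """Replace obvious email addresses, including common malformed ones."""
    return EMAIL.sub(mask, text)
```

Because the pattern requires local-part characters directly before the “@”, a standalone use such as `see you @ 5pm` is left alone.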

NOTE: For voice calls, consistently identifying spoken email addresses and transcribing them to match formats outlined above is an ongoing research area.

As a result, the email address rule is not a recommended default in implementations for the voice channel.

Passwords, Passcodes, and PIN phrases

Identifying passwords is a difficult problem because secure passwords are highly random and variable. Common password guidelines do provide some assistance, as they recommend, and at times require, a mix of letters, numbers, and special characters. ASAPP algorithms look for such combinations of characters, along with key phrases, to intelligently identify and redact passwords.

PINs are commonly 4-digit numbers and can be recognized using a purely numeric based system wherever they appear.

ASAPP also looks for words and phrases like:

password
pssword
Passcode
Psscode
Pin
User id
Userid
usrid
Username
usrname

Many such words and phrases occur in utterances such as “I forgot my password, can you help me reset it?” Completely removing all of the words following “password” in such an utterance may impede communication with the agent and slow the process of solving the customer’s issue.

Instead, ASAPP’s algorithm tokenizes the utterance and removes all non-alphabetic characters when any of the above phrases appear.
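One possible reading of this behavior is sketched below, assuming that a token containing any non-alphabetic character is masked as a whole once a keyword is present. The keyword pattern is taken from the list above; the punctuation handling, the mask, and the function name `redact_credentials` are assumptions, and the actual algorithm may differ.

```python
import re

KEYWORDS = re.compile(
    r"\b(password|pssword|passcode|psscode|pin|user ?id|usrid|"
    r"username|usrname)\b",
    re.IGNORECASE,
)
PUNCTUATION = ".,!?;:'\""


def redact_credentials(utterance, mask="[REDACTED]"):
    """Mask non-alphabetic tokens, but only when a keyword is present."""
    if not KEYWORDS.search(utterance):
        return utterance
    out = []
    for token in utterance.split():
        # Ignore sentence punctuation so "password," still counts as a word.
        core = token.strip(PUNCTUATION)
        out.append(token if core.isalpha() else mask)
    return " ".join(out)
```

Under this reading, “I forgot my password, can you help me reset it?” passes through unchanged, while a token such as `hunter2$` following the keyword is masked.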