By Ben Workman and Kurt Williams
An ounce of gold today is worth about $1,750. To get that ounce, companies process an astonishing 2 to 91 tons of rock. People panning for gold in rivers might be put off by that statistic. Switching metaphors, the “signal-to-noise ratio” of gold to rock means you have to be really good at filtering sediment.
Mining for Gold in Text
Artificial Intelligence (AI) for sifting through text is called Natural Language Processing (NLP), and thanks to the massive amount of text produced on the Internet each day, NLP has become a celebrity among machine learning fields.
Unfortunately many projects using NLP have failed to meet expectations. The facts might be boring. The graphs look dubious. Inexplicable stuff clusters its way to the top of the charts like an autotuned boy band with a questionable link to the Kardashians. What gives?
More Noise than Signal
Noisy text is the likely culprit. Cleaning noise before processing is often skipped in NLP projects, but it’s essential.
Text sources often contain duplicates, emojis, headers, multiple authors, and mixed languages. We may expect the fancy AI algorithms to magically ignore garbage. Unfortunately this isn’t the case. The AI blissfully processes the gold with the rock and spits out poor results without any indication something is wrong. Signal and noise become camouflaged in the output.
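As a rough illustration of the first two items on that list, here is a minimal sketch that strips emojis and drops duplicates before anything reaches the NLP step. The function names and the emoji ranges are our own assumptions for the example. The ranges cover the main pictograph blocks, not every emoji.

```python
import re
import unicodedata

# Rough emoji matcher (assumption): main pictograph blocks plus the
# variation selector and zero-width joiner. Not an exhaustive list.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")


def normalize(text: str) -> str:
    """Return a comparison key: no emojis, folded case, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text)  # fold compatibility forms such as full-width letters
    text = EMOJI_RE.sub("", text)
    return " ".join(text.split()).lower()


def drop_duplicates(docs):
    """Keep the first copy of each document, comparing normalized keys."""
    seen = set()
    kept = []
    for doc in docs:
        key = normalize(doc)
        if key and key not in seen:  # an empty key means the text was pure noise
            seen.add(key)
            kept.append(doc)
    return kept


docs = ["Great service! 👍", "great   service!", "Refund please", "🎉🎉"]
print(drop_duplicates(docs))
```

Only exact matches after normalization count as duplicates here. Near-duplicates, such as the same reply with one extra sentence, would need a fuzzier comparison.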
Unless text is written by people with degrees in Literature, it likely contains noise. At Halosight, we’ve discovered that more than 80% of data feeds can be noise. This leads to significant signal loss that hides insights.
For example, an email may contain a header block, a salutation, the body, and a legal disclaimer nobody really wants to include or read. It may even have threaded duplicate replies. The junk obscures the meaning. Wouldn’t it be nice if you could teach a computer to recognize junk?
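To make the email example concrete, here is a small sketch that uses Python’s standard `email` module to split off the header block, then applies two simple rules to drop the salutation and the disclaimer. The sample message, the salutation list, and the disclaimer markers are assumptions made up for illustration. Real mail varies far more than this.

```python
from email import message_from_string

# Hypothetical sample message, invented for this example.
RAW = """From: pat@example.com
To: support@example.com
Subject: Order 1234

Hi team,

My order arrived damaged and I would like a replacement.

CONFIDENTIALITY NOTICE: This message is intended only for the addressee.
"""

SALUTATIONS = ("hi", "hello", "dear", "hey")  # assumed list
DISCLAIMER_MARKERS = ("confidentiality notice", "this message is intended")  # assumed list


def core_text(raw: str) -> str:
    """Return the body paragraphs that are neither greeting nor disclaimer."""
    msg = message_from_string(raw)  # separates the header block from the body
    body = msg.get_payload()        # plain string for a non-multipart message
    kept = []
    for para in body.split("\n\n"):
        para = para.strip()
        lowered = para.lower()
        if not para:
            continue
        if lowered.startswith(SALUTATIONS) and len(para.split()) <= 4:
            continue  # short greeting line
        if any(marker in lowered for marker in DISCLAIMER_MARKERS):
            continue  # legal footer
        kept.append(para)
    return "\n\n".join(kept)


print(core_text(RAW))
```

Rules like these are brittle on their own, which is why the strategies that follow matter.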
Teach Computers to Filter Junk
Luckily, computers can be taught to sift the rock away from the gold. Modern approaches based on AI can attack the junk using several strategies.
Why It's Important
Content Type Identification
Spotting the content type lets you match each document to the right analytics and to specialized noise filtering.
Sifting emails from non-email sources allows de-threading, header/footer removal, and signature block handling.
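A first pass at content type identification can be sketched with a couple of patterns. The type labels, the patterns, and the thresholds below are assumptions chosen for the example. A production system would more likely use a trained classifier.

```python
import re

# Header-like lines such as "From:" or "Subject:" at the start of a line.
HEADER_RE = re.compile(r"^(?:from|to|cc|subject|date|sent):\s", re.IGNORECASE | re.MULTILINE)
# Chat-like lines such as "[10:01] agent:" or "10:01:33 agent:".
CHAT_RE = re.compile(r"^\[?\d{1,2}:\d{2}(?::\d{2})?\]?\s+\w+:", re.MULTILINE)


def guess_content_type(text: str) -> str:
    """Guess 'email', 'chat', 'other', or 'empty' from surface patterns."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    if len(HEADER_RE.findall(text)) >= 2:  # at least two header fields
        return "email"
    if len(CHAT_RE.findall(text)) >= len(lines) / 2:  # most lines look like chat turns
        return "chat"
    return "other"


print(guess_content_type("From: a@example.com\nSubject: Hi\n\nBody text"))
print(guess_content_type("[10:01] agent: Hello\n[10:02] customer: My order is late"))
print(guess_content_type("Quarterly report shows growth."))
```

Once the type is known, the pipeline can route emails to de-threading and signature handling and send everything else down a different path.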
Languages
NLP is usually language dependent. Applied to text in a language it was not built for, it will give poor output or may even fail.
Look out for embedded language. It’s common for more than one language to appear in one document.
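One cheap way to catch some embedded languages is to look at the writing script of each line. The sketch below is an assumption-laden stand-in, not real language identification. It only notices a change of script, such as Cyrillic inside Latin text, and cannot tell English from French. The function names are ours.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most common script among the letters in text."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names lead with the script, e.g. "CYRILLIC SMALL LETTER PE".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


def flag_embedded(lines):
    """Return (line, script) pairs whose script differs from the document's main one."""
    scripts = [dominant_script(ln) for ln in lines]
    known = [s for s in scripts if s != "UNKNOWN"]
    if not known:
        return []
    main = Counter(known).most_common(1)[0][0]
    return [(ln, s) for ln, s in zip(lines, scripts) if s not in (main, "UNKNOWN")]


lines = ["The package arrived late.", "Спасибо за помощь", "Please send a refund."]
print(flag_embedded(lines))
```

Flagged lines can then be routed to a language-specific pipeline or set aside, rather than silently skewing the results.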
Email Boilerplate and Threads
Frequency analysis gets thrown off by boilerplate signature blocks and ads.
No standard exists for threaded emails or signature block structure.
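Because there is no standard to parse against, one workable approach is to lean on repetition: a line that shows up in a large share of messages is probably a signature or an ad, not content. The sketch below combines that idea with a simple de-threading rule. The quote markers (`>` prefixes and an “On ... wrote:” line) are common conventions rather than guarantees, and the 50% threshold is an arbitrary assumption.

```python
from collections import Counter


def strip_quoted(body: str) -> str:
    """Drop '>'-quoted lines and everything after an 'On ... wrote:' marker."""
    kept = []
    for line in body.splitlines():
        s = line.strip()
        if s.startswith("On ") and s.endswith("wrote:"):
            break  # the rest is the earlier message in the thread
        if s.startswith(">"):
            continue
        kept.append(line)
    return "\n".join(kept).strip()


def find_boilerplate(bodies, min_share=0.5):
    """Return lines that appear in at least min_share of the bodies."""
    counts = Counter()
    for body in bodies:
        # A set, so a line counts once per message.
        counts.update({ln.strip() for ln in body.splitlines() if ln.strip()})
    cutoff = min_share * len(bodies)
    return {ln for ln, n in counts.items() if n >= cutoff}


def clean(bodies, min_share=0.5):
    """De-thread each body, then remove lines judged to be boilerplate."""
    unthreaded = [strip_quoted(b) for b in bodies]
    junk = find_boilerplate(unthreaded, min_share)
    return [
        "\n".join(ln for ln in b.splitlines() if ln.strip() not in junk).strip()
        for b in unthreaded
    ]


bodies = [
    "Thanks, that fixed it.\nJane Doe\nAcme Corp | Sent from my phone",
    "Still broken after the update.\nJane Doe\nAcme Corp | Sent from my phone\n"
    "On Mon, Bob wrote:\n> try rebooting",
    "Where is my invoice?\nSam",
]
print(clean(bodies))
```

With only a handful of messages the threshold is fragile. On a single message every line would be flagged, so this technique needs a reasonably large batch to be useful.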
Speakers/Authors
Multiple speakers are common in chats, emails, and case histories. Meaning depends on telling employees apart from customers.
Metadata around the document can help figure out which author is which.
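Here is a minimal sketch of how metadata can drive speaker separation. It assumes a transcript in a “Name: text” layout and an employee roster supplied as metadata. Both are assumptions for the example, and a line of content that happens to contain a colon would be misread as a new speaker.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    role: str
    text: str


def split_turns(transcript: str, employees: set) -> list:
    """Split 'Name: text' lines into turns and label each speaker's role."""
    turns = []
    for line in transcript.splitlines():
        speaker, sep, text = line.partition(":")
        if not sep:
            # No colon: treat as a continuation of the previous turn.
            if turns and line.strip():
                turns[-1].text += " " + line.strip()
            continue
        name = speaker.strip()
        role = "employee" if name.lower() in employees else "customer"
        turns.append(Turn(name, role, text.strip()))
    return turns


transcript = (
    "Dana: How can I help?\n"
    "Chris: My invoice is wrong.\n"
    "It lists two items I never ordered.\n"
    "Dana: I'll fix that."
)
employees = {"dana"}  # hypothetical roster taken from metadata
turns = split_turns(transcript, employees)
customer_text = [t.text for t in turns if t.role == "customer"]
print(customer_text)
```

Analyzing `customer_text` separately keeps scripted agent phrases from drowning out what customers actually said.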
Noise Canceling AI
Most communication formats are simple but lack standards. Variation is high. Textual clutter is similar to the background sound filtered by noise canceling headphones. Smart filtering removes the clutter and leaves music intact.
Simplistic approaches like keywords and search expressions yield mediocre results. Layering artificial intelligence adds flexibility and power to filtration.
What To Look For in a Solution
The right solution will free you to focus on your business problem instead of data hygiene. Ask the following questions when evaluating NLP data cleansing:
Content Types - Can it spot types of text and change strategies?
Threads - Can it handle threaded conversations?
Speakers/Authors - Do your documents contain multiple authors?
Scale - Can the pipeline scale to the volume and variety of data?
Languages - How does it deal with language and embedded languages?
Side Effects - Will the cleaned output be compatible with your UI?
Purpose Built - Is it flexible and targeted specifically for cleansing scenarios?
Configurable - Can the workflow be tailored to your use case?
Textual data cleaning doesn’t get much attention, but it’s critically important for maximizing the results of any text-based data mining project. Prioritize cleansing or tap a vendor who can help. It’s the single best thing you can do to find the surprising insights you seek.