• Kurt Williams

Striking Gold

By Ben Workman and Kurt Williams


An ounce of gold today is worth about $1,750. To get that gold, companies process an astonishing 2 to 91 tons of rock. People panning for gold in rivers might be put off by that statistic. Switching metaphors, the “signal-to-noise ratio” of gold to rock means you have to be really good at filtering sediment.



Mining for Gold Text


Artificial Intelligence (AI) for sifting through text is called Natural Language Processing (NLP). Thanks to the massive amount of text produced on the Internet each day, NLP has become a celebrity among machine learning techniques.


Unfortunately, many projects using NLP have failed to meet expectations. The findings might be boring. The graphs look dubious. Inexplicable stuff clusters its way to the top of the charts like an autotuned boy band with a questionable link to the Kardashians. What gives?


 

More Noise than Signal


Noisy text is the likely culprit. Cleaning out noise before processing is often skipped in NLP projects, but it’s essential.


Text sources often contain duplicates, emojis, headers, multiple authors, and mixed languages. We may expect the fancy AI algorithms to magically ignore garbage. Unfortunately, this isn’t the case. The AI blissfully processes the rock along with the gold and spits out poor results without any indication that something is wrong. Signal and noise become indistinguishable in the output.
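To make two of those noise sources concrete, here is a minimal Python sketch that strips emojis and drops duplicate documents. The function names (`normalize`, `drop_duplicates`) and the emoji ranges are assumptions for illustration, not a complete or authoritative filter.

```python
import re
import unicodedata

# Rough emoji coverage: a few common Unicode blocks, not an exhaustive list.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\uFE0F"                 # variation selector that often trails an emoji
    "]+"
)

def normalize(text: str) -> str:
    """Strip emojis, fold case, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = EMOJI_RE.sub("", text)
    return " ".join(text.lower().split())

def drop_duplicates(docs):
    """Keep the first copy of each document, comparing normalized text."""
    seen = set()
    unique = []
    for doc in docs:
        key = normalize(doc)
        if key and key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = [
    "Great service, thanks! \U0001F600",
    "great service,   thanks!",
    "My order arrived late.",
]
print(drop_duplicates(docs))  # the second entry is dropped as a duplicate
```

Exact matching on normalized text only catches near-identical copies. Lightly edited duplicates would need a fuzzier comparison.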





Unless text is written by people with degrees in Literature, it likely contains noise. At Halosight, we’ve discovered that more than 80% of a data feed can be noise. That much clutter drowns out the signal and hides insights.


For example, an email may contain a header block, a salutation, a body, and a legal disclaimer nobody really wants to include or read. It may even have threaded duplicate replies. The junk obscures the meaning. Wouldn’t it be nice if you could teach a computer to recognize junk?
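As a rough illustration of what recognizing junk can look like for a single plain-text email, the sketch below uses Python's standard `email` parser to drop the header block, then a few hand-written patterns to skip the greeting and cut off at the quoted thread, signature, or disclaimer. The patterns and the `clean_email` name are hypothetical. Real mail varies far more than this.

```python
import re
from email import message_from_string

# Hypothetical markers; treat these as starting points, not a standard.
SALUTATION_RE = re.compile(r"^(hi|hello|dear)\b.*$", re.IGNORECASE)
REPLY_INTRO_RE = re.compile(r"^On .+ wrote:$")
CUTOFF_RE = re.compile(
    r"^(--\s*$|this email and any attachments|confidentiality notice)",
    re.IGNORECASE,
)

def clean_email(raw: str) -> str:
    """Return the author's own words from a plain-text, single-part email."""
    body = message_from_string(raw).get_payload()  # parsing drops the header block
    kept = []
    for line in body.splitlines():
        stripped = line.strip()
        if CUTOFF_RE.match(stripped) or REPLY_INTRO_RE.match(stripped):
            break  # signature, disclaimer, or quoted thread starts here
        if stripped.startswith(">"):
            continue  # quoted reply line
        if not kept and (not stripped or SALUTATION_RE.match(stripped)):
            continue  # skip leading blank lines and the greeting
        kept.append(stripped)
    return "\n".join(kept).strip()

raw = """From: pat@example.com
To: support@example.com
Subject: Late order

Hi team,

My order arrived two weeks late.

On Mon, Jan 4, 2021, support@example.com wrote:
> Thanks for contacting us.

--
This email and any attachments are confidential.
"""
print(clean_email(raw))  # My order arrived two weeks late.
```

Stopping at the first reply marker is a deliberate simplification. It would lose any text the author wrote below a quoted block.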



 

Teach Computers to Filter Junk


Luckily, computers can be taught to sift the rock away from the gold. Modern approaches based on AI can attack the junk using several strategies.


| Strategy | Why It's Important | Pro Tips |
|---|---|---|
| Content Type Identification | Spot content type to match analytic outcomes and specialized noise filtering. | Sifting emails from non-email sources allows de-threading, header/footer removal, and signature block handling. |
| Language Identification | NLP is usually language dependent. Sometimes it will give poor output or may even fail. | Look out for embedded language. It’s common for more than one language to appear in one document. |
| Email Boilerplate and Threads | Frequency analysis gets thrown off by boilerplate signature blocks and ads. | No standard exists for threaded emails or signature block structure. |
| Author Detection | Multiple speakers are common in chats, emails, and case histories. Meaning is dependent on differentiating employees from customers. | Metadata around the document can help figure out which author is which. |
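Two of these strategies, language identification and author detection, can be sketched in a few lines of Python. Both functions below are toy stand-ins. The stopword lists, the `Name: message` transcript format, and the `employees` set (standing in for document metadata) are assumptions made for illustration. A production system would likely use a trained language identifier instead.

```python
from collections import defaultdict

# Tiny, hand-picked stopword lists: an assumption for illustration only.
STOPWORDS = {
    "en": {"the", "and", "is", "to", "of", "was", "my"},
    "es": {"el", "la", "y", "es", "de", "que", "mi"},
}

def guess_language(text: str) -> str:
    """Return the language whose stopwords appear most often, or 'unknown'."""
    words = [w.strip(".,!?¿¡") for w in text.lower().split()]
    scores = {lang: sum(w in stops for w in words) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def split_by_role(transcript: str, employees: set) -> dict:
    """Group 'Name: message' lines into employee and customer text."""
    by_role = defaultdict(list)
    for line in transcript.splitlines():
        name, sep, message = line.partition(":")
        if not sep:
            continue  # not a 'Name: message' line
        role = "employee" if name.strip() in employees else "customer"
        by_role[role].append(message.strip())
    return dict(by_role)

chat = "Dana: How can I help?\nSam: My order was late.\nSam: El paquete es de mi madre."
roles = split_by_role(chat, employees={"Dana"})
print(roles["customer"])
print([guess_language(m) for m in roles["customer"]])  # ['en', 'es']
```

Checking language per message rather than per document is what surfaces the embedded Spanish line in an otherwise English chat.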


 

Noise Canceling AI


Most communication formats are simple but lack standards, so variation is high. Textual clutter is like the background sound filtered out by noise-canceling headphones. Smart filtering removes the clutter and leaves the music intact.


Simplistic approaches like keywords and search expressions yield mediocre results. Layering artificial intelligence adds flexibility and power to filtration.




 

What To Look For in a Solution


The right solution will free you to focus on your business problem instead of data hygiene. Ask the following questions when evaluating NLP data cleansing:

  • Content Types - Can it spot types of text and change strategies?

  • Threads - Can it handle threaded conversations?

  • Speakers/Authors - Can it tell multiple speakers or authors apart?

  • Scale - Can the pipeline scale to the volume and variety of data?

  • Languages - How does it deal with language and embedded languages?

  • Side Effects - Will the cleaned output be compatible with your UI?

  • Purpose Built - Is it flexible and targeted specifically for cleansing scenarios?

  • Configurable - Can the workflow be tailored to your use case?
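To picture what the “Content Types” and “Configurable” questions are asking for, here is a minimal Python sketch of a pipeline that picks its cleaning steps based on the detected content type. Every name here (`PIPELINES`, `detect_content_type`, the individual steps) is hypothetical, and the detection rule is deliberately crude. A real solution would likely swap in a learned classifier.

```python
import re
from typing import Callable, Dict, List

Step = Callable[[str], str]

def strip_quoted_lines(text: str) -> str:
    """Drop lines that start with '>', a common marker for quoted replies."""
    return "\n".join(l for l in text.splitlines() if not l.lstrip().startswith(">"))

def collapse_whitespace(text: str) -> str:
    """Squeeze runs of spaces and tabs, then trim the ends."""
    return re.sub(r"[ \t]+", " ", text).strip()

# Hypothetical configuration: content type -> ordered cleaning steps.
PIPELINES: Dict[str, List[Step]] = {
    "email": [strip_quoted_lines, collapse_whitespace],
    "chat": [collapse_whitespace],
}

def detect_content_type(text: str) -> str:
    """Crude rule: header-like lines near the top suggest an email."""
    head = text[:500]
    return "email" if re.search(r"^(From|Subject):", head, re.MULTILINE) else "chat"

def clean(text: str) -> str:
    """Run the cleaning steps configured for the detected content type."""
    for step in PIPELINES[detect_content_type(text)]:
        text = step(text)
    return text

print(clean("Subject: Hi\n\nThanks  a lot\n> old reply"))
```

Tailoring the workflow then means editing the `PIPELINES` mapping rather than the code that runs it.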


Textual data cleaning doesn’t get much attention, but it’s critically important for getting the most out of any text-based data mining project. Prioritize cleansing or tap a vendor who can help. It’s the single best thing you can do to find the surprising insights you seek.



