Enterprise search has dominated the way people locate information, but the technology powering search is changing at a rapid pace. Organizations must now embrace the new technology that is redefining the search market, one that Gartner terms the “Insight Engine.”

In part 2 of this 2-part blog series on Insight Engines (you can read part 1 here), we’ll cover the basic components inherent in building an Insight Engine.

Building an Insight Engine


1. Collecting Raw Data: The Foundation of Your Insight Engine

There are two kinds of raw data – Structured and Unstructured. Structured data requires a pre-defined data model. Unstructured data is everything else such as HR documents, pictures of the company picnic, memos on proper use of company resources, etc.

Unstructured Data

Analysts at Gartner estimate that upward of 80% of enterprise data today is unstructured. Unstructured data is exactly what it sounds like: unstructured. It won’t fit neatly into a database field or spreadsheet. High-value unstructured data is ubiquitous on the Internet. The challenge is getting it into the Insight Engine, but it can be done.

In our example, this would include an HR policy document on vacation time and possibly a SharePoint repository with project management documents or spreadsheets.

Structured Data

Structured data requires a pre-defined data model. It is the type of data that would normally be put into an enterprise database. The data is easy to find, analyze, and share with other systems (like your Insight Engine), but it isn’t complete. To answer Jane’s question from part 1 of this blog series, this could include giving the Insight Engine access to a structured project management system and an HR system that tracks how many hours Jane has already taken off for the year.

2. Data Storage: The Storage Locker For Your Insight Engine

Unstructured data holds a lot of value. The value is not always immediately recognized and sometimes valuable data is not extractable today but could be next year. As a result, Unstructured data should be stored as close to its raw form as possible.

Accessing all that raw data can be slow, but the open-source tool Elasticsearch can provide fast access. Elasticsearch stores the original document and adds a searchable reference.
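
The idea of keeping the original document and adding a searchable reference can be sketched with a toy inverted index. This is a minimal illustration in plain Python, not the tool’s actual API: the class name TinyIndex, the document ids, and the sample texts are all hypothetical.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: keeps each raw document plus a term -> doc-id lookup."""

    def __init__(self):
        self.documents = {}               # doc_id -> original, unmodified text
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add(self, doc_id, text):
        # Store the raw text untouched, then index its lowercase terms.
        self.documents[doc_id] = text
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every term in the query."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return []
        matches = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(matches)


# Hypothetical sample documents.
index = TinyIndex()
index.add("hr-001", "Vacation policy: employees accrue vacation time monthly.")
index.add("pm-042", "Project plan for the Chicago office move.")
```

Searching for "vacation policy" looks up two small sets of ids instead of scanning every document, and the raw text is still available in index.documents for later analysis.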

Structured data, on the other hand, is designed for compact storage or fast access. You can generally leave it where it is and access it as needed from the Insight Engine.

3. Data Pipeline: Your Insight Engine’s Housekeeper

A Data Pipeline is used to ingest all data sources, validate the data, transform the data to a common format, and then provide that data to downstream processes such as Big Data analytics tools, Machine Learning algorithms, or an Insight Engine.
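
The ingest, validate, and transform stages described above can be sketched as three small functions. This is only an outline under stated assumptions: the source names, the field names (title, body), and the rule that a record needs non-empty text are hypothetical choices made for illustration.

```python
def ingest(sources):
    """Yield (source_name, raw_record) pairs from every configured source."""
    for name, records in sources.items():
        for record in records:
            yield name, record


def validate(record):
    """Accept a record only if it carries non-empty text content."""
    body = record.get("body")
    return isinstance(body, str) and body.strip() != ""


def transform(name, record):
    """Map a source-specific record onto one common format."""
    return {
        "source": name,
        "title": record.get("title", "").strip(),
        "body": record["body"].strip(),
    }


def run_pipeline(sources):
    """Ingest everything, drop invalid records, and emit the common format."""
    return [transform(n, r) for n, r in ingest(sources) if validate(r)]


# Hypothetical input: two sources with slightly different record shapes.
sources = {
    "hr": [
        {"title": " Vacation Policy ", "body": "Employees accrue time monthly."},
        {"title": "Empty memo", "body": "   "},
    ],
    "projects": [{"body": "Kickoff notes for the office move."}],
}
common_records = run_pipeline(sources)
```

Downstream consumers only ever see the common format, so they do not need to know which source a record came from beyond the source field.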

Traditional: Sanitizing & Formatting the Raw Data to a Common Format

Under a traditional architecture, Unstructured Data would be vacuumed up and enter the pipeline, be pre-processed and compressed, and then stored as compressed Structured Data for consumption by Big Data techniques or Machine Learning.

This works well until it doesn’t. Whether the input data is Structured or Unstructured, if the structure or components of the vacuumed data change, then the pre-processing step breaks and everything downstream stops. Every good data source will change over time, so continuous maintenance of the pipeline code is necessary.
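
A small sketch of that failure mode, assuming a hypothetical HR feed whose upstream owner renames one field. The field names employee_id, hours_off, and pto_hours are invented for the example.

```python
def to_row(record):
    """Strict pre-processing step: assumes every record has exactly these fields."""
    return (record["employee_id"], record["hours_off"])


def process(records):
    """Convert records to rows, collecting the name of any missing field."""
    rows, failures = [], []
    for record in records:
        try:
            rows.append(to_row(record))
        except KeyError as missing:
            failures.append(missing.args[0])
    return rows, failures


old_format = {"employee_id": 17, "hours_off": 24}
new_format = {"employee_id": 17, "pto_hours": 24}  # upstream renamed a field

rows, failures = process([old_format, new_format])
```

The second record is lost until someone updates to_row, which is the maintenance burden the traditional approach carries. Without the try/except, the whole run would stop at the first renamed field.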

Insight Engines: Natural Language Processing on Raw Data

Directly using Unstructured Data is a key strength for Insight Engines. The pipeline still has a job to do, but it is minimal. Use of the raw data provides robustness to future data source changes and flexibility to incorporate additions from data sources immediately. The use of raw data also provides the ability to perform new analyses without the hindrance of waiting on new code for the Data Pipeline.

Natural Language Processing (NLP) of raw data allows for fast idea iteration and rapid prototyping. This aspect of Insight Engines will drive product innovation at an even faster pace than today.

4. Data Enrichment: The Neighbor’s House of Your Insight Engine

Data enrichment is the process of adding additional, relevant data to the raw data. Data enrichment can come from internal analyses like topic mapping and sentiment analysis, or it can come from external sources such as Wikipedia, Quora, or Google.

Jane, hoping to take a day off before leaving on a work trip, asks “What’s my schedule tomorrow?” She gets back her meeting schedule, a delay for the flight she has tomorrow evening to Chicago, and the weather forecast for Chicago for the upcoming week. Neither the flight nor the weather information came from a company data source.

An example is useful in understanding what data enrichment entails. The most common example is a Lead Capture form on your website.

A visitor enters the first name, last name, and an email address. In one case, the first name is Don’t and the last name is BotherMe. Obviously this person doesn’t want to be contacted. An automated lead scoring data enrichment task would give this person a low score. But maybe it isn’t garbage because they provided a valid email address which can be used to find their real name, address, and company they work for from a variety of external data sources.

The next visitor provides 100% legitimate data but misspells their email address. If the misspelling is in the email provider, e.g. ggmail instead of gmail, then data enrichment can fix this. The lead then receives the welcome email they were expecting instead of being left unhappy when it never arrives.
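
One way to sketch the provider fix is fuzzy matching against a list of known domains with Python’s standard difflib module. The provider list and the 0.8 similarity cutoff are assumptions chosen for the example; a production system would more likely ask the visitor to confirm the suggestion than silently rewrite the address.

```python
import difflib

# Hypothetical list of providers worth correcting toward.
KNOWN_PROVIDERS = ["gmail.com", "yahoo.com", "outlook.com", "hotmail.com"]


def fix_email_domain(address, cutoff=0.8):
    """Return the address with a near-miss provider domain corrected."""
    local, sep, domain = address.strip().rpartition("@")
    if not sep or not local:
        return address  # not shaped like an email address; leave it alone
    domain = domain.lower()
    if domain in KNOWN_PROVIDERS:
        return f"{local}@{domain}"
    # Pick the single closest known provider, if any is similar enough.
    guess = difflib.get_close_matches(domain, KNOWN_PROVIDERS, n=1, cutoff=cutoff)
    return f"{local}@{guess[0]}" if guess else f"{local}@{domain}"


corrected = fix_email_domain("jane@ggmail.com")
```

Domains that are not close to any known provider, such as a company domain, pass through unchanged.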

Or consider an Amazon review. It includes the username, date, a rating on a 5 star scale, text, and potentially video and images. The review is scraped from Amazon’s site (this is an example and not compliant with Amazon’s Terms of Service). Once scraped the text is separated from the HTML. From the HTML the username, date, rating, and links to any images or videos are extracted. The rating can be converted from 5 stars to 100 points in order to facilitate one-to-one comparisons with reviews on other sites.
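
The rating conversion is simple arithmetic. The sketch below assumes a proportional mapping, so 5 stars becomes 100 points and 3 stars becomes 60; other mappings (for example, one that sends 1 star to 0) are equally possible. The function name is hypothetical.

```python
def normalize_rating(value, scale_max, target_max=100):
    """Rescale a rating proportionally from [0, scale_max] to [0, target_max]."""
    if scale_max <= 0:
        raise ValueError("scale_max must be positive")
    if not 0 <= value <= scale_max:
        raise ValueError("rating is outside its own scale")
    return value * target_max / scale_max


three_stars = normalize_rating(3, 5)      # a 3-star review on a 5-star site
eight_of_ten = normalize_rating(8, 10)    # an 8/10 review from another site
```

Once both sites are on the same 100-point scale, the two reviews can be compared directly.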

This process of cleansing and extracting the salient pieces is the beginning of the data enrichment task. Once the review is boiled down to its essence, a semantic analysis can be performed and a score attached. Named-entity Recognition (NER) analysis can be applied to determine the product and product features specifically mentioned. Information about the reviewer can be scraped and then linked to this review as well. Reviewer data such as how many reviews this user has, their average review rating, and, if resources allow, a semantic score for every review the user has ever written can be calculated. Their 3-star review takes on a different context if 3 stars is the best they ever give and their average is 2.
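
The reviewer-context point can be made concrete with a few summary statistics. This is a sketch only: the function name, the returned field names, and the sample rating history are hypothetical, and a real system might prefer a standardized score over a simple difference.

```python
from statistics import mean


def reviewer_context(rating, history):
    """Describe one rating relative to the reviewer's own rating history."""
    if not history:
        raise ValueError("history must contain at least one rating")
    average = mean(history)
    return {
        "reviewer_average": average,
        "reviewer_max": max(history),
        "delta_from_average": rating - average,
        "is_personal_best": rating >= max(history),
    }


# Hypothetical reviewer who never gives more than 3 stars and averages 2.
context = reviewer_context(3, [1, 2, 2, 3, 2])
```

For this reviewer, a 3-star rating is a full star above their average and ties their highest rating ever, so it reads as strong praise rather than a middling score.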

Feeling ambitious? Bring in data from Twitter and Reddit then combine it all with customer support transcripts for a more complete picture of consumer sentiment about a specific product feature.

5. Machine Learning: The Smart Home For Your Insight Engine

Within the field of Machine Learning there are three formal types of learning: 1) Unsupervised, 2) Supervised, and 3) Reinforcement. Machine Learning is used to make the Insight Engine smarter, faster, and more personalized with every query.

In Unsupervised learning we don’t know the answer before we ask the computer for it. It can be thought of as dumping a lot of data into the machine and letting it create a few representative prototypes, each of which best represents a large group of those inputs. We then ask the computer which prototype is the best match for our current input data. Personalization is typically achieved through Unsupervised learning. NER is done most commonly using Unsupervised learning with some promising new techniques utilizing Supervised Learning.
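
The prototype idea matches what k-means clustering does, so a small k-means sketch is used here as one illustration; the text itself does not name a specific algorithm. The sample points are made up, and seeding the prototypes with the first k points is a naive choice that keeps the example deterministic.

```python
import math


def nearest(point, prototypes):
    """Return the index of the prototype closest to the point."""
    return min(range(len(prototypes)), key=lambda i: math.dist(point, prototypes[i]))


def k_means(points, k, iterations=10):
    """Find k prototypes, each the mean of the points assigned to it."""
    prototypes = [points[i] for i in range(k)]  # naive init: first k points
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, prototypes)].append(p)
        # Move each prototype to the mean of its group; keep it if the group is empty.
        prototypes = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else prototypes[i]
            for i, g in enumerate(groups)
        ]
    return prototypes


# Hypothetical unlabeled inputs forming two loose groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]
prototypes = k_means(points, k=2)
best_match = nearest((2, 2), prototypes)
```

No labels are supplied at any point: the two prototypes emerge from the data, and a new input such as (2, 2) is simply matched to the closer one.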

Most of the NLP tasks were originally “solved” via Unsupervised learning techniques. However, as training Supervised learning models has gotten faster, new techniques are being applied to NLP tasks and producing better results than their Unsupervised counterparts.

Supervised Learning is utilized when we have a data set with inputs and correct responses. After teaching the computer to match inputs to correct responses, we then use the trained computer model to extrapolate on new input data. Like NER, automatic document summarization is another NLP area where both Unsupervised and Supervised techniques exist.
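
As an illustration of learning from inputs paired with correct responses, here is a tiny Naive Bayes text classifier with add-one smoothing. The technique is a swapped-in example rather than anything the text prescribes, and the training questions and the labels hr and projects are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def train(examples):
    """Count words per label from (text, correct_label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a label.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training data: inputs with their correct responses.
examples = [
    ("how many vacation days do i have", "hr"),
    ("what is the health plan deductible", "hr"),
    ("when is the project deadline", "projects"),
    ("who owns the project budget", "projects"),
]
model = train(examples)
routed_to = predict(model, "vacation days left")
```

The model has never seen the exact question "vacation days left", yet it can route it based on the word patterns it learned from the labeled examples.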

Reinforcement Learning is intended to mimic the way animal brains learn based on positive and negative reinforcement. It is a relatively new field that has not yet found an application in Insight Engines or Natural Language Processing.

The choice of whether to use Unsupervised or Supervised learning is often decided by the available computing resources. Supervised learning requires more resources than Unsupervised.

Could an Insight Engine Pass the Turing Test?

Personalized, Interactive, Natural Language Search

While Natural Language Search can return an incredible array of relevant information, a person answering the question would use context to tailor the response. That employee looking into health benefits happens to be a thirty-five-year-old female with two kids and a husband. She doesn’t need to know what the deductible is for a single male under the age of thirty.

In Machine Learning this is called Personalization. Think of Personalization as a filter applied to the broad conceptual results of an Insight Engine. Personalization also allows for the elimination of the previously mentioned information silos.
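
The filter analogy can be sketched as follows. In practice the matching would be learned rather than hand-written; this rule-based version only shows the shape of the idea, and the profile fields, audience tags, and result titles are hypothetical.

```python
def personalize(results, profile):
    """Keep only results whose audience constraints match the user's profile."""
    def applies(result):
        # A result with no audience constraints applies to everyone.
        return all(
            profile.get(field) in allowed
            for field, allowed in result["audience"].items()
        )
    return [r for r in results if applies(r)]


# Hypothetical broad results for a health benefits question.
results = [
    {"title": "Family plan deductible", "audience": {"coverage": {"family"}}},
    {
        "title": "Single plan deductible, under 30",
        "audience": {"coverage": {"single"}, "age_band": {"under-30"}},
    },
    {"title": "Open enrollment dates", "audience": {}},
]
profile = {"coverage": "family", "age_band": "30-39"}
relevant = [r["title"] for r in personalize(results, profile)]
```

The employee with family coverage sees the family deductible and the general enrollment dates, while the result aimed at single employees under thirty is filtered out.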

Machine Learning Verging on Artificial General Intelligence

Artificial General Intelligence (AGI) is an intelligence on par with a human not just in a single task but in any intellectual task. Machine Learning is a necessary step in achieving AGI and some Machine Learning models approach AGI.

The Turing Test suggests that we’ve achieved AGI when we can have a conversation with a computer which is indistinguishable from a conversation with a human. Natural Language Search begins to test the boundary of AGI because it can answer a question posed in natural language through the assimilation of disparate data sources, not from a scripted response.

Currently, these results remain distinguishable from a human conversation. Insight Engine-driven ChatBots will progress towards being indistinguishable. If the conversations remain distinguishable, it will be because the ChatBot responses are more concise and thorough than any human’s would be.


So, What Is an Insight Engine?

An Insight Engine is more than just next-gen Google or a private search engine. It is a completely different way of doing search that combines machine learning, big data, natural language processing, and even sentiment analysis to understand context and the intent of the searcher – even when they use imprecise search queries such as “it”, “that”, and “tomorrow”.

An Insight Engine is used by organizations to better support their customers and/or employees, and is most effective when it can access pretty much all the information available to the organization – including supplemental public or third-party data. Most importantly, and uniquely, an Insight Engine can incorporate both structured and unstructured data to quickly and accurately provide the answers users are looking for.

With endless possibilities, how would you use an Insight Engine today?
