Search for what you mean, not what you say
Web search engines are remarkably good at what they do. Full-text search engines actually pre-date the web, but they really came into their own with the need to find information across an ever-expanding array of content on the internet. One of the first internet text crawlers that let users find any word in any web page was WebCrawler. It appeared as a harbinger of uncurated, content-based search without the need for categories or topic trees to guide you to the information you’re looking for.
Nowadays, nearly all internet search engines work this way, and improvements on this basic approach have given us the current ability to find just about anything we need on the web. Note, however, that this state-of-the-art search technology finds words in documents that match the words in search queries. It’s called ‘keyword matching’ or lexical search. Advanced search engines also look for variations and synonyms of search words so that when you search, the results represent a relevant set of content that often fits exactly what you had in mind.
Enterprise Search Challenges
At the scale of the internet, you can’t do better than Google at search. But in building an enterprise service, there is a different set of challenges. The amount of content in organizations is much smaller, of course, so it’s easier to index and operate on, but it’s also narrower in range, which makes finer distinctions more important. Keyword search does a reasonable job, but it requires that your query words exist in the documents. Keyword search has no understanding of what you are searching for; it can only match words.
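The limitation described above can be shown with a minimal sketch. The keyword_match function and the two sample documents are hypothetical, made up for illustration; a real search engine also applies stemming, synonyms and ranking, which this toy version leaves out.

```python
def keyword_match(query, document):
    """Return True only if every query word appears literally in the document."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return query_words <= document_words


# Hypothetical sample documents, invented for this illustration.
documents = [
    "Employees accrue paid vacation each month",
    "Submit travel expense reports within 30 days",
]

# "vacation" appears literally in the first document, so it is found.
hits = [doc for doc in documents if keyword_match("vacation", doc)]

# "holiday" means nearly the same thing, but the word itself never appears.
misses = [doc for doc in documents if keyword_match("holiday", doc)]

print(hits)
print(misses)
```

The second query comes back empty even though the first document is probably what the searcher wanted. That gap is what semantic search tries to close.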
In the enterprise setting, your search results can be greatly improved if the search engine understands your intent and can use the context of terms within a document. This approach is called ‘semantic search’ and can give you more relevant results since semantic search is based on the meanings of words rather than just the words themselves.
How Does Semantic Search Work?
The latest technology in semantic search is a technique called word embedding. A word embedding model is a mathematical structure that represents the words in a collection of documents. This structure captures the context of a word or phrase plus its semantic and syntactic relation to other words. Each word is represented as a vector, that is, a list of numbers, so that a computer can do calculations with it.
If you look up the definition of a vector, you’ll find that it’s a mathematical object that has both magnitude and direction. That’s not very helpful for understanding how a word can be a vector unless you also understand that all the words across a big collection of documents are also represented as vectors, and they all exist within a vector space. The length (magnitude) and direction of the vector represent a particular word. Words that have similar meanings have similar vector representations and therefore end up close to each other in the vector space.
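To make “close to each other in the vector space” concrete, here is a small sketch that compares word vectors with cosine similarity, one common way to measure closeness (it compares direction only and ignores length). The three-dimensional vectors are invented for illustration; they do not come from any real embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up toy vectors; real models learn these values from text.
toy_vectors = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.8, 0.2, 0.1],
    "banana": [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors["car"], toy_vectors["automobile"])
sim_far = cosine_similarity(toy_vectors["car"], toy_vectors["banana"])

print(f"car vs automobile: {sim_close:.2f}")
print(f"car vs banana:     {sim_far:.2f}")
```

With these toy values, “car” and “automobile” point in nearly the same direction, while “car” and “banana” are almost at right angles.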
The vector space is key to how semantic search works. It’s relatively easy to imagine a vector in two-dimensional space. You’ve no doubt seen many examples of arrows in graphs that depict some function or value with respect to the two axes (or dimensions) of the graph. Word embeddings are similar, but instead of two dimensions, they can have dozens or hundreds of dimensions. That’s not something we can visualize or even imagine easily, but machines can use multi-dimensional spaces to calculate things like word similarity. The figure that follows shows words projected into a three-dimensional space. Three dimensions is still well short of what word embedding models use, but it gives you a sense of how operations can occur in multiple dimensions.
Words of a Feather
The vector representation has to be learned by a computer, which considers all of the words across all of the documents to work out the contexts each word appears in. There are different approaches for projecting words into a vector space, but they all rely on the fact that in natural language, words with similar meanings tend to appear in similar contexts. This general idea is called distributional semantics, and it dates back to the middle of the last century. One of the leaders in the field, J.R. Firth, summed up the idea in a famous quote from the 1950s: “You shall know a word by the company it keeps.”
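The idea of knowing a word by its company can be sketched by counting which words appear near each other. The four-sentence corpus and the context_counts helper below are hypothetical, and raw neighbor counts are only the simplest possible stand-in; real embedding models learn dense vectors from far larger collections of text.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus, for illustration only.
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
]


def context_counts(sentences, window=2):
    """For each word, count the words that appear within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            neighbors = words[start:i] + words[i + 1:i + 1 + window]
            counts[word].update(neighbors)
    return counts


counts = context_counts(corpus)

# Words that keep company with both "cat" and "dog".
shared = set(counts["cat"]) & set(counts["dog"])

print(sorted(shared))
```

In this toy corpus, “cat” and “dog” share the neighbors “the”, “drinks” and “chases”, which is the kind of overlap that pulls two words toward each other in a vector space.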
Why Is It Important to Understand How Semantic Search Works?
If you’re looking to implement solutions that help your organization use its existing knowledge more effectively, it’s important to understand the difference in value between keyword-based solutions and those that deliver true semantic search capabilities. We’ve designed DraftSpark to use a combination of both keyword matching and semantic search, blending the two to get the best of both worlds.
In addition, in DraftSpark, we use pre-trained word embedding models that have been built by analyzing massive amounts of text from a variety of online sources. Our current research effort is focused on extending large pre-trained models with the content of the individual organizations that use DraftSpark. Combining pre-trained and organization-specific models will allow us to pair the general knowledge in pre-trained models with content from a particular organization. Our goal is to produce even more precise matches that are well calibrated to each organization’s information and content needs.
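One simple way to blend keyword and semantic results is a weighted average of the two relevance scores. The sketch below is a generic illustration only, not a description of how DraftSpark does it; the blend_scores function, the alpha weight, the document names and all of the scores are assumptions made up for the example.

```python
def blend_scores(keyword_score, semantic_score, alpha=0.5):
    """Weighted average of two relevance scores, both assumed to lie in [0, 1].

    alpha is a hypothetical tuning weight: 1.0 means keyword only,
    0.0 means semantic only.
    """
    return alpha * keyword_score + (1 - alpha) * semantic_score


# Made-up scores for the query "holiday".
candidates = {
    "vacation-policy": {"keyword": 0.0, "semantic": 0.9},
    "holiday-calendar": {"keyword": 1.0, "semantic": 0.6},
    "expense-reports": {"keyword": 0.0, "semantic": 0.1},
}

ranked = sorted(
    candidates,
    key=lambda name: blend_scores(
        candidates[name]["keyword"], candidates[name]["semantic"]
    ),
    reverse=True,
)

print(ranked)
```

With these numbers, the document that matches the query word exactly ranks first, and the vacation policy, which keyword matching alone would have missed, still ranks well above the unrelated document.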