1. Query Understanding

Query understanding can be broadly broken down into three classes of problems. The first is semantic extraction from the text query, which includes query intent understanding, query classification, and query tagging. The second is query enhancement, which optimizes the relevance of the returned documents to the query; this mainly includes query rewriting and spell correction. The third deals with user assistance through auto-completion and query suggestions.

Queries are the entry point for users to search systems, so before we delve deeper into query understanding, it is essential to have a high-level understanding of the search system life cycle. Figure 1 illustrates the search process with its core steps: crawling → preprocessing → indexing → searching.

Figure 1

1.1 Query Understanding in Search Process

Once a text query is submitted, the first responsibility of the query understanding pipeline is to analyze it semantically: classify it against a pre-defined taxonomy and tag it by extracting any entities or concepts it contains. For example, the query "manchester united shop" can be classified into categories such as [football] with a probability of 0.8 and [sports] with a probability of 0.6. Using semantic tagging, we can also identify "manchester united" as a [football_team]. The other class of components discussed earlier is query enhancement, where techniques such as query expansion, rewriting, and spell correction are used to reduce problems like vocabulary mismatch between the user query and the documents in the search collection. Finally, some search systems also return query suggestions alongside the relevant documents.
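The semantic tagging step for "manchester united shop" can be sketched with a simple dictionary lookup. This is only an illustration, not the method of any particular search system: the `GAZETTEER` contents, the `tag_query` function, and the "O" label for untagged tokens are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer mapping a phrase (as a token tuple) to an entity type.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("manchester", "united"): "football_team",
    ("manchester",): "city",
}


def tag_query(query: str,
              gazetteer: Dict[Tuple[str, ...], str] = GAZETTEER
              ) -> List[Tuple[str, str]]:
    """Greedy longest-match tagging; untagged tokens get the label 'O'."""
    tokens = query.lower().split()
    max_len = max(len(phrase) for phrase in gazetteer)
    tagged: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first so that "manchester united"
        # wins over the shorter entry "manchester".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = tuple(tokens[i:i + n])
            if span in gazetteer:
                tagged.append((" ".join(span), gazetteer[span]))
                i += n
                break
        else:
            # No gazetteer entry starts at this token.
            tagged.append((tokens[i], "O"))
            i += 1
    return tagged


print(tag_query("manchester united shop"))
```

Longest-match lookup is the reason the two-word entry is preferred over the single-word one; a production tagger would more likely be a learned sequence model, with a gazetteer like this as one of its features.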

Figure 2

1.3 Query Classification

Query classification is the process of assigning a search query to one or more categories of a given target taxonomy in order to optimize the relevance of the returned documents. It is mainly framed as a multilabel classification problem. One notable dataset for this task comes from the KDD Cup 2005 competition, which required participants to classify 800,000 user search queries into 67 categories of a predefined taxonomy.
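To make the multilabel framing concrete, here is a minimal sketch in which every category gets its own independent score, so a query can belong to several categories at once and the scores need not sum to 1. The categories, keyword weights, biases, and the 0.5 threshold are invented for illustration; a real system would learn them from resources such as click logs.

```python
import math
from typing import Dict

# Hypothetical per-category keyword weights and biases (not learned from data).
WEIGHTS: Dict[str, Dict[str, float]] = {
    "football": {"manchester": 1.5, "united": 1.0, "league": 2.0},
    "sports": {"manchester": 0.8, "united": 0.6, "league": 1.2, "tennis": 2.0},
    "shopping": {"shop": 2.0, "buy": 2.0},
}
BIAS: Dict[str, float] = {"football": -1.5, "sports": -1.0, "shopping": -1.5}


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def classify(query: str, threshold: float = 0.5) -> Dict[str, float]:
    """Return every category whose independent score reaches the threshold."""
    tokens = query.lower().split()
    scores: Dict[str, float] = {}
    for category, weights in WEIGHTS.items():
        # One binary decision per category: this is what makes the problem
        # multilabel rather than multiclass.
        logit = BIAS[category] + sum(weights.get(t, 0.0) for t in tokens)
        probability = sigmoid(logit)
        if probability >= threshold:
            scores[category] = probability
    return scores


print(classify("manchester united shop"))
```

Using one sigmoid per category instead of a single softmax over all categories is the design choice that lets a query such as "manchester united shop" be labelled with both a football and a sports category.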

Query classification can be further sub-divided into several types. Query performance classification requires the system to classify queries by their difficulty; for instance, some categories of queries may bring up results for which we have no user feedback or relevance judgments. This type can be further divided into two sub-categories: system query difficulty (the difficulty of a query for a given search system run over a given collection) and collection query difficulty (the difficulty of a query with respect to a given search collection).

Query classification isn't limited to the dimensions discussed above; it also includes additional dimensions such as geography and time. Useful resources for performing query classification include click logs, the search corpus, and historical queries. Historically, both supervised and unsupervised techniques have been used for this task.

1.4 Query Segmentation and Tagging

The basic bag-of-words (BOW) model can be considered the simplest form of query segmentation, splitting a query into individual tokens. However, the BOW approach fails to take into account the semantic and syntactic information in the query. For instance, in the query "South Korea", the words "South" and "Korea" each convey their own meaning unless they are treated together as a single unit that refers to a country. More advanced forms of query segmentation also involve tagging the named entities present in the query.

Query segmentation approaches fall into three types: heuristic, supervised, and unsupervised. Heuristic approaches rely on statistics computed from an external resource. Supervised approaches frame query segmentation as a token classification task in which the model predicts, for each token position, whether there is a segment break or not.
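As a rough illustration of the heuristic family, the sketch below scores each pair of adjacent tokens with pointwise mutual information (PMI) and places a segment break wherever the score is low. PMI is just one possible statistic, chosen here as an example. The counts in `UNIGRAMS` and `BIGRAMS`, the shared `TOTAL`, and the threshold of 3.0 are made-up stand-ins for an external resource. The list returned by `break_labels` is also the kind of break/no-break labelling that a supervised model would be trained to predict.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical counts standing in for an external resource such as a query log.
UNIGRAMS: Dict[str, int] = {"south": 1000, "korea": 800, "hotels": 2000, "cheap": 1500}
BIGRAMS: Dict[Tuple[str, str], int] = {
    ("south", "korea"): 600,
    ("korea", "hotels"): 5,
    ("cheap", "south"): 2,
}
TOTAL = 1_000_000  # assumed corpus size, used for unigram and bigram estimates alike


def pmi(w1: str, w2: str) -> float:
    """PMI of an adjacent word pair; -inf if the pair was never observed."""
    joint = BIGRAMS.get((w1, w2), 0)
    if joint == 0 or w1 not in UNIGRAMS or w2 not in UNIGRAMS:
        return float("-inf")
    return math.log((joint * TOTAL) / (UNIGRAMS[w1] * UNIGRAMS[w2]))


def break_labels(tokens: List[str], threshold: float = 3.0) -> List[int]:
    """labels[i] == 1 means a segment break between tokens[i] and tokens[i + 1]."""
    return [0 if pmi(a, b) >= threshold else 1 for a, b in zip(tokens, tokens[1:])]


def segment(query: str, threshold: float = 3.0) -> List[str]:
    """Split a query into segments by breaking at weakly associated word pairs."""
    tokens = query.lower().split()
    if not tokens:
        return []
    segments = [[tokens[0]]]
    for token, is_break in zip(tokens[1:], break_labels(tokens, threshold)):
        if is_break:
            segments.append([token])
        else:
            segments[-1].append(token)
    return [" ".join(s) for s in segments]


print(segment("cheap south korea hotels"))
```

With these toy counts, "south" and "korea" are strongly associated and stay together, while the other adjacent pairs fall below the threshold and are split apart.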

Finally, identifying named entities in search queries can help improve the relevance of the documents returned by the search system. The types of entities vary with the type of search system; for instance, an e-commerce search engine may have entities such as color, brand, size, and model. Queries can also be tagged syntactically, for example with part-of-speech (POS) tags; however, this is challenging because most queries are keywords rather than complete sentences.

Figure 3

1.5 Query Intent Understanding

Intent understanding is still in the rudimentary stages of research and refers to identifying the information need that leads to a search. Some standard intent taxonomies have been proposed, such as Broder's taxonomy, which distinguishes navigational intent from other kinds of intent. For example, the query "eBay" suggests that the user is interested in going to the website "www.ebay.com", whereas with the query "weather saarbrücken" the user wants to know the temperature in the city of Saarbrücken rather than navigate to a particular website.