08 December

Question Answering Solution using NLP

Imagine you had a folder containing thousands of sales brochures that have been saved over a period of time. Suppose that one day you wanted some information from a file in that folder, say the warranty provided for a product. There may be many files from the same company about similar products, so it may take a while to get to the right file and find the relevant passage. This is where modern question-answering (QA) systems come to help. These are automated systems that extract answers from source documents in response to queries posed by a user in natural language. QA systems hold a prominent place in the history of artificial intelligence, with early systems capable of answering queries about the American baseball league (BASEBALL, Green et al., 1961) and lunar rock and soil samples (LUNAR, Woods, 1973).

Question answering broadly lies within the fields of information retrieval (IR) and natural language processing (NLP) in computer science. Like many other subfields of NLP, it has been revolutionized by advances in deep neural network-based modeling in the last decade. The ready availability of models pre-trained on humongous amounts of data (BERT was trained on Wikipedia and BooksCorpus, together containing about 3.3 billion words) and of open-source libraries allows data scientists to quickly build powerful QA systems with a few lines of code. But how do these systems work? Let us take a closer look.


Components of a QA system:

The following are the basic components found in many modern QA pipelines.


As in any ML pipeline, the first step is to pre-process the data so that it can be fed into a suitable ML model. In the example we started with, the text in the sales brochures has to be extracted, cleaned, and stored in a database. Text extraction packages are available for a variety of file types such as PDF, DOCX, and even images. After extraction and cleaning, the text is broken into small passages, since most models limit the size of the input they can accept. The embeddings needed for embedding-based retrieval, discussed below, are also generated at this step. Finally, the passages of text are stored in a database together with their embeddings; a popular choice is Elasticsearch, which allows fast and highly scalable search. All these steps have to be completed before the system can start responding to user queries.
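To make the passage-splitting step concrete, here is a minimal sketch in plain Python. The function name `split_into_passages` and the parameters `max_words` and `overlap` are our own illustrative choices, not part of any library, and real pipelines often split on sentences or model tokens rather than whitespace-separated words.

```python
from typing import List


def split_into_passages(text: str, max_words: int = 100, overlap: int = 20) -> List[str]:
    """Split text into word windows of at most max_words words.

    Consecutive windows share `overlap` words so that an answer sitting on a
    boundary still appears whole in at least one passage.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    passages: List[str] = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + max_words]))
        # Stop once the current window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return passages


if __name__ == "__main__":
    # Hypothetical brochure text, used only for illustration.
    brochure = " ".join(f"word{i}" for i in range(250))
    for passage in split_into_passages(brochure):
        print(len(passage.split()), passage.split()[0], "...", passage.split()[-1])
```

The overlap is a design choice: it costs some storage but reduces the chance that a sentence containing the answer is cut in half.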


The workflow after a user submits a query can be broken down into two steps. In the first step, a retriever tries to pick the passages from the database that are most likely to contain the answer to the user’s query. A simple idea is to look for passages containing terms that also appear in the query. With refinements, this forms the basis of term-weighting retrievers such as TF-IDF and BM25. But what if the relevant passage does not contain the exact words of the query but others that are semantically similar? Research in the past decade has shown that one can generate rich vector representations of words that capture both their semantics and context. This has been achieved through a combination of novel neural network architectures, ingenious training objectives, and the ability to train large models on huge amounts of data. One can download such pre-trained models and use them to generate dense vector representations, or “embeddings”, of document passages and queries. At runtime, a user query is converted to a vector, and the retriever finds passages whose embedding vectors are close to that of the query with respect to a suitable metric such as cosine similarity. A few passages with high similarity scores are then passed to the reader, which performs the second step.
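The retrieval step can be sketched in a few lines of plain Python. To keep the example free of model downloads, we use sparse TF-IDF vectors as a stand-in for the dense embeddings a pre-trained model would produce; the ranking by cosine similarity works the same way in both cases. All function names here are illustrative assumptions, and this is plain TF-IDF, not BM25.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(passages: List[str]) -> Dict[str, float]:
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(passages)
    doc_freq: Counter = Counter()
    for passage in passages:
        doc_freq.update(set(tokenize(passage)))
    return {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse vector: term count times IDF (terms unseen in the corpus are dropped)."""
    counts = Counter(tokenize(text))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, passages: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    """Return the top_k (score, passage) pairs, highest similarity first."""
    idf = build_idf(passages)
    query_vec = tfidf_vector(query, idf)
    scored = [(cosine(query_vec, tfidf_vector(p, idf)), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical passages, used only for illustration.
    corpus = [
        "The warranty covers parts and labour for two years.",
        "Berlin is the capital of Germany.",
        "London is the capital of the United Kingdom.",
    ]
    for score, passage in retrieve("What warranty covers parts?", corpus, top_k=2):
        print(f"{score:.3f}  {passage}")
```

With a dense retriever, `tfidf_vector` would be replaced by a call to an embedding model, and the passage vectors would be computed once at indexing time rather than per query.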


Reader models are trained to predict the span of an answer in a passage when a query and a context passage containing the answer are fed to the model. A widely used class of models is based on Google's BERT (Bidirectional Encoder Representations from Transformers), which has dominated the field since its release in 2018. A BERT model pre-trained on masked language modeling and next-sentence prediction tasks can be further fine-tuned for downstream tasks such as question answering using a QA dataset like SQuAD. Such fine-tuned versions of BERT are in turn available through the amazing Transformers library from Hugging Face.
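What does "predicting a span" look like in practice? Extractive readers of this kind typically produce, for every token in the passage, one score for being the start of the answer and one for being the end. The sketch below shows only the final selection step on made-up scores; the name `best_span`, the `max_answer_len` parameter, and the numbers are our own assumptions, and a real reader would obtain the scores from a fine-tuned model.

```python
from typing import List, Tuple


def best_span(
    start_scores: List[float],
    end_scores: List[float],
    max_answer_len: int = 15,
) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_scores[i] + end_scores[j].

    Only spans with i <= j and at most max_answer_len tokens are considered.
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    best = (0, 0, float("-inf"))
    for i, start in enumerate(start_scores):
        last = min(i + max_answer_len, len(end_scores))
        for j in range(i, last):
            score = start + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best


if __name__ == "__main__":
    # Toy tokens and scores, invented for illustration only.
    tokens = ["the", "warranty", "lasts", "two", "years", "."]
    start = [0.1, 0.2, 0.1, 2.5, 0.3, 0.0]
    end = [0.0, 0.1, 0.2, 0.4, 3.0, 0.1]
    i, j, score = best_span(start, end)
    print(" ".join(tokens[i:j + 1]), score)  # two years 5.5
```

The constraint that the end cannot precede the start is what turns two independent per-token scores into a valid answer span.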

Once we have candidate passages from the retriever corresponding to a query, we can then use them as context to get answer predictions from the reader. 

If all this seems a little abstract, no worries!  We will see all the components in action below.

QA made easy 

Haystack by deepset is a fantastic open-source library that has modules for most of the common components employed in a QA system. In the following, we will demonstrate a simple QA system that can answer queries about texts scraped from the Wikipedia pages on London, Berlin, and Covid-19. The texts have already been saved in separate files in the “doc_dir” directory. We will start by pre-processing the texts and converting them into the dictionary format required by Haystack. The converted dictionaries are then written to an Elasticsearch database.

Next, we define the retriever and reader models to be used and update the embeddings of the documents in Elasticsearch. The retriever and reader are then combined into a single extractive QA pipeline.

We are now ready to submit queries through the QA pipeline.

And voila! 

Note that in the above example the retriever sends the eight passages with the highest similarity scores to the reader. The reader in turn outputs the top two answer spans from those passages.
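The way these two top-k settings interact can be sketched without any library at all. The code below is not Haystack code: `run_pipeline`, `overlap_retriever`, and `sentence_reader` are hypothetical stand-ins in which word overlap replaces embedding similarity and whole sentences replace predicted answer spans. It is only meant to show how candidates are narrowed in two stages.

```python
import re
from typing import Callable, List, Set, Tuple

Answer = Tuple[str, float]  # (answer text, score)


def words(text: str) -> Set[str]:
    """Set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_retriever(query: str, passages: List[str]) -> List[str]:
    """Toy retriever: rank passages by the number of words shared with the query."""
    q = words(query)
    return sorted(passages, key=lambda p: len(q & words(p)), reverse=True)


def sentence_reader(query: str, passage: str) -> List[Answer]:
    """Toy reader: score each sentence of the passage by word overlap with the query."""
    q = words(query)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    return [(s, float(len(q & words(s)))) for s in sentences]


def run_pipeline(
    query: str,
    passages: List[str],
    retriever: Callable[[str, List[str]], List[str]],
    reader: Callable[[str, str], List[Answer]],
    retriever_top_k: int = 8,
    reader_top_k: int = 2,
) -> List[Answer]:
    """Stage 1 keeps retriever_top_k passages; stage 2 keeps reader_top_k answers."""
    candidates = retriever(query, passages)[:retriever_top_k]
    answers: List[Answer] = []
    for passage in candidates:
        answers.extend(reader(query, passage))
    answers.sort(key=lambda a: a[1], reverse=True)
    return answers[:reader_top_k]


if __name__ == "__main__":
    # Hypothetical passages, used only for illustration.
    docs = [
        "Berlin is the capital of Germany. It is also the largest German city.",
        "London is the capital of England. The Thames flows through London.",
        "Covid-19 is caused by the SARS-CoV-2 virus.",
    ]
    for text, score in run_pipeline(
        "Which river flows through London?", docs, overlap_retriever, sentence_reader
    ):
        print(score, text)
```

A larger retriever top-k gives the reader more chances to see the right passage but makes each query slower, since the reader is usually the more expensive of the two components.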



In this article, we have briefly described modern question-answering (QA) systems. We have also shown how a simple question-answering pipeline can be built using Haystack. Our focus in this blog was on extractive QA. There are also QA systems that can generate answers given a context, but that is a story for another day!   



  1. https://huggingface.co/tasks/question-answering
  2. https://haystack.deepset.ai/