Building a RAG Q&A System with Dataworkz: A Step-by-Step Guide

In today’s digital era, customer self-service is becoming increasingly important, and having a robust Q&A system is a key component of delivering a seamless experience. At Dataworkz, we’re taking it a step further by creating a Retrieval Augmented Generation (RAG) Q&A system for our customers. 

This guide walks you through building a RAG application for PubMed data using Dataworkz, powered by MongoDB Atlas.

Uploading Files to GCS (or blob storage of your choice):

  • Connect to your GCS console:
    • Choose the bucket you want to upload to. Create a new folder to store the files, click “Upload files,” and select the files you want to upload in the file picker.
    • Click “Open,” and the files are added to the folder.
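
If you prefer to script the upload instead of using the console, the same step can be sketched in Python. This is a minimal sketch, not part of the Dataworkz workflow: the helper name `upload_folder`, the folder prefix, and the choice to upload only `*.pdf` files are assumptions made for illustration. The `bucket` argument is expected to behave like a bucket object from the google-cloud-storage client, which exposes `blob(name)` and `upload_from_filename(path)`.

```python
from pathlib import Path


def upload_folder(bucket, local_dir, prefix):
    """Upload every PDF in local_dir to the bucket under prefix/ and return the object names."""
    uploaded = []
    for path in sorted(Path(local_dir).glob("*.pdf")):
        # Object names use "/" to emulate folders inside a bucket.
        object_name = f"{prefix.rstrip('/')}/{path.name}"
        bucket.blob(object_name).upload_from_filename(str(path))
        uploaded.append(object_name)
    return uploaded
```

With the google-cloud-storage package installed and credentials configured, `storage.Client().bucket("your-bucket-name")` would supply the `bucket` object; the bucket name here is a placeholder.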

Configure the MongoDB Connector in Dataworkz:

To connect to your MongoDB database using Dataworkz:

  • Go to the Dataworkz dashboard and click on “Configuration” in the top-right corner.
  • Click on “Databases” and then click on “MongoDB.”
  • Click the “+” icon to add a new configuration.
  • Enter a name for the storage configuration and the MongoDB hostname or IP address.
  • Enter the MongoDB username and password.
  • Click “Test Connection” to test the connection.
  • If successful, click “Save” to save the configuration.
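
If “Test Connection” fails, it can help to check the same credentials outside Dataworkz. The sketch below shows how the hostname, username, and password from the form would combine into a standard MongoDB connection string that any MongoDB client accepts. The function name `build_mongodb_uri` and the example hostname and credentials are hypothetical; Dataworkz builds its own connection internally, and this is only an illustration of the URI format.

```python
from urllib.parse import quote_plus


def build_mongodb_uri(host, username, password, srv=True):
    """Return a MongoDB connection string with percent-encoded credentials."""
    # Atlas clusters normally use the mongodb+srv scheme; self-hosted ones use mongodb.
    scheme = "mongodb+srv" if srv else "mongodb"
    # Special characters such as "@" or "/" in credentials must be percent-encoded.
    return f"{scheme}://{quote_plus(username)}:{quote_plus(password)}@{host}/"


# Hypothetical cluster hostname and credentials, for illustration only.
uri = build_mongodb_uri("cluster0.example.mongodb.net", "rag_user", "p@ss/word")
print(uri)
```

Encoding the username and password matters because unescaped reserved characters would otherwise be parsed as part of the URI structure.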

Configure the GCS connection in Dataworkz:

Now, you need to configure the GCS connector to connect to your instance. Here’s how:

  • Go to the Dataworkz dashboard and click on “Configuration” in the top-right corner.
  • Click on “Cloud data platforms” and select GCS.
  • Click the “+” icon to open the configuration window.
  • Enter the storage name, Project ID, Client ID, Client Email, Account Private Key ID, and Account Private Key.
  • Test the connection to make sure the configuration works, then save it.
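
The values requested in the form have the same names as fields in a Google Cloud service-account JSON key file. Assuming that is where your credentials come from (the mapping below is an assumption, not something the Dataworkz form states), a small helper can pull the values out of the key file and flag anything that is missing before you paste them in. The names `FIELD_MAP` and `form_values_from_key` are hypothetical.

```python
import json

# Form label (from the step above) -> field name in a service-account JSON key file.
FIELD_MAP = {
    "Project ID": "project_id",
    "Client ID": "client_id",
    "Client Email": "client_email",
    "Account Private Key ID": "private_key_id",
    "Account Private Key": "private_key",
}


def form_values_from_key(key_json):
    """Return the form values found in a key file's JSON text, failing on missing fields."""
    key = json.loads(key_json)
    missing = [name for name in FIELD_MAP.values() if not key.get(name)]
    if missing:
        raise ValueError(f"key file is missing: {', '.join(missing)}")
    return {label: key[name] for label, name in FIELD_MAP.items()}


# Placeholder key contents for illustration; a real key file holds real secrets.
sample_key = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key_id": "abc123",
    "private_key": "PLACEHOLDER",
    "client_email": "rag-loader@my-project.iam.gserviceaccount.com",
    "client_id": "1234567890",
})
values = form_values_from_key(sample_key)
print(values["Client Email"])
```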

Step 1: Pre-processing Data

  • Access Pre-processing Settings in Dataworkz:
    • Click on the gear icon in the top right, navigate to the AI Applications section, and select Pre-processing.
  • Create a Pre-processing Job:
    • Press the ‘+’ button and name your pre-processing job.

Configuring Sources

  • PDF Files:
    • For PDFs, select the file type – PDF.
    • Choose the storage you uploaded the files to.
    • Find the PDF files you uploaded to your cloud storage, and make sure to select all of them.
  • Website URL (HTML):
    • For URL links, select the file type – html.
    • Add the URLs of interest, and check the sub-crawling and JavaScript-enabled boxes.

Submit Pre-processing:

  • Next, choose the pre-processing target location and click Submit to start pre-processing.
  • Verify the saved data in the dataset tab.

Step 2: Building the Q&A System

  • Access the Dataworkz RAG builder:
    • Click on the gear icon, navigate to AI Applications, and select Q&A.

  • Create a Q&A System:
    • Click on the ‘+’ button and give the Q&A system a name.

  • Configure Source:
    • First, configure the source using the datasets created during pre-processing.
    • Add question (text) and external link (PDF or URL) columns based on the source files.
  • Choose Embedding Model:
    • Select the embedding model for your Q&A system. You can choose one of the following options:
      • Choose a model that is hosted privately – this includes all the popular ones from the MTEB leaderboard.
      • Choose a model hosted on Hugging Face.
      • Bring your own model.
  • Configure Vector Storage (Using MongoDB):
    • Name your vector storage and choose the cosine similarity metric.
    • Set the threshold, delimiter, chunk size, and overlap.
    • Save the configuration.
  • Choose Language Models:
    • Select the language models you want to test (e.g., Llama 2, ChatGPT, or Mistral).
    • Provide a custom prompt for the system.

  • Save and Submit:
    • Save and submit to create the Q&A system. The Dataworkz RAG builder will create the relevant index and chunks in MongoDB. The cloud infrastructure that creates the embeddings and chunks scales on demand.
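
To build intuition for the delimiter, chunk size, and overlap settings in the vector storage step, here is a minimal sketch of overlapping chunking. It is not the Dataworkz implementation: the function name `chunk_text` is hypothetical, and measuring chunk size in characters is an assumption made to keep the example short (a real pipeline may count tokens instead).

```python
def chunk_text(text, chunk_size=200, overlap=50, delimiter="\n\n"):
    """Split text on delimiter, then cut each piece into overlapping character windows."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for piece in text.split(delimiter):
        piece = piece.strip()
        if not piece:
            continue
        for start in range(0, len(piece), step):
            chunks.append(piece[start:start + chunk_size])
            # Stop once the current window already reaches the end of the piece.
            if start + chunk_size >= len(piece):
                break
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

A larger overlap repeats more text between neighboring chunks, which helps when an answer straddles a chunk boundary but increases the number of chunks to embed and store.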

Step 3: Using the Q&A System

  • Testing the System:
    • Ask a question in the chat box, and the system will return an answer along with the location of the data.
  • Reviewing Results:
    • Check the source links to see where the generated information originated.
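
Conceptually, the answer and its source links come from a similarity search over the stored chunks. The sketch below illustrates that idea with cosine similarity and a score threshold, using toy two-dimensional vectors. It is an illustration only: in the deployed system the search would run against the vector index in MongoDB rather than in application code, and the names `retrieve`, `threshold`, and `top_k`, along with the chunk dictionary layout, are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, threshold=0.7, top_k=3):
    """Return up to top_k (score, text, source) tuples scoring at or above threshold."""
    scored = [
        (cosine_similarity(query_vec, c["embedding"]), c["text"], c["source"])
        for c in chunks
    ]
    hits = [s for s in scored if s[0] >= threshold]
    return sorted(hits, key=lambda s: s[0], reverse=True)[:top_k]


# Toy chunks with made-up embeddings and placeholder source names.
chunks = [
    {"text": "chunk A", "source": "paper1.pdf", "embedding": [1.0, 0.0]},
    {"text": "chunk B", "source": "paper2.pdf", "embedding": [0.0, 1.0]},
    {"text": "chunk C", "source": "paper3.pdf", "embedding": [1.0, 1.0]},
]
for score, text, source in retrieve([1.0, 0.0], chunks):
    print(f"{score:.2f} {text} ({source})")
```

Raising the threshold returns fewer but more closely matching chunks; lowering it gives the language model more context at the risk of including loosely related text.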

Step 4: Testing and Optimization

  • Test Results:
    • Test the system by changing prompts and formatting metadata.
  • Formatting Metadata:
    • Experiment with different ways to make RAG sources more relevant, such as versioning the MongoDB documents.
  • Adding Data from Other Sources:
    • Enhance the system by adding data from Google Sheets, Docs, or other internal sources.
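
When experimenting with prompts, it helps to see how a custom prompt, the retrieved chunks, and the user's question might fit together. The template below is a hypothetical sketch, not the prompt format Dataworkz uses; the function name `build_prompt` and the chunk dictionary layout are assumptions for illustration.

```python
def build_prompt(instructions, question, chunks):
    """Combine a custom instruction, numbered context chunks, and the question."""
    context = "\n".join(
        f"[{i}] {c['text']} (source: {c['source']})"
        for i, c in enumerate(chunks, start=1)
    )
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"


# Placeholder chunk text and source name, for illustration only.
prompt = build_prompt(
    "Answer using only the context below and cite sources by number.",
    "What does the study conclude?",
    [{"text": "Example passage from a paper.", "source": "paper1.pdf"}],
)
print(prompt)
```

Keeping the instruction text separate from the context makes it easy to compare prompt variants against the same retrieved chunks.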

By following these steps, you can create a powerful RAG Q&A system that not only answers queries effectively but also ensures accuracy and relevance across various data sources. Dataworkz empowers you to provide a top-notch self-service experience for any user.

If this is something you would like to see live and demoed on your data, you can send an email to or contact us on

Keep an eye out for our upcoming blogs on advanced RAG topics, RAG monitoring, and AI guardrails.
