Building a RAG Q&A System with Dataworkz: A Step-by-Step Guide

Customer self-service is becoming increasingly important, and a robust Q&A system is a key component of a seamless experience. At Dataworkz, we’re taking it a step further by creating a Retrieval-Augmented Generation (RAG) Q&A system for our customers.

This guide walks you through building a RAG application for PubMed data using Dataworkz, powered by MongoDB Atlas.

Uploading Files to GCS (or blob storage of your choice):

  • Connect to your GCS console:
    • Choose the bucket you want to upload the files to. Create a new folder to store them, click “Upload files,” and select the files you would like to upload in the file dialog.
    • Click “Open,” and the files will be added to the folder.
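
If you would rather script the upload than use the console, the same steps can be sketched in Python. This is a minimal sketch and not part of Dataworkz: it assumes the google-cloud-storage package is installed and that application default credentials are configured. The bucket name, folder prefix, and local directory in the example call are hypothetical placeholders.

```python
from pathlib import Path


def upload_pdfs(bucket, local_dir, prefix):
    """Upload every *.pdf in local_dir to `bucket` under `prefix/`.

    `bucket` is any object exposing the google-cloud-storage Bucket
    interface: bucket.blob(name).upload_from_filename(path).
    Returns the list of object names that were uploaded.
    """
    uploaded = []
    for path in sorted(Path(local_dir).glob("*.pdf")):
        object_name = f"{prefix.strip('/')}/{path.name}"
        bucket.blob(object_name).upload_from_filename(str(path))
        uploaded.append(object_name)
    return uploaded


def upload_to_gcs(bucket_name, local_dir, prefix):
    """Real upload; requires `pip install google-cloud-storage`."""
    # Imported lazily so upload_pdfs stays usable without the dependency.
    from google.cloud import storage

    client = storage.Client()  # uses application default credentials
    return upload_pdfs(client.bucket(bucket_name), local_dir, prefix)


# Example call (hypothetical names, not run here):
# upload_to_gcs("my-rag-bucket", "./pubmed_pdfs", "pubmed")
```

Passing the bucket object into `upload_pdfs` keeps the folder logic separate from authentication, so it can be tried against a stand-in object before touching a real bucket.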

Configure the MongoDB Connector in Dataworkz:

To connect to your MongoDB database using Dataworkz:

  • Go to the Dataworkz dashboard and click on “Configuration” in the top-right corner.
  • Click on “Databases” and then click on “MongoDB.”
  • Click the “+” icon to add a new configuration.
  • Enter a name for the storage configuration and the MongoDB hostname or IP address.
  • Enter the MongoDB username and password.
  • Click “Test Connection” to test the connection.
  • If successful, click “Save” to save the configuration.
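
Before entering these values in Dataworkz, it can help to confirm that the same hostname and credentials work from a script. The following is a minimal sketch under stated assumptions: it assumes the pymongo driver (with SRV support for Atlas-style hostnames), and the host, username, and password shown in the usage note are hypothetical. It is not how Dataworkz itself tests the connection.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host, username, password, srv=True):
    """Build a MongoDB connection URI, percent-escaping the credentials.

    Special characters such as '@', ':' and '/' in a username or password
    must be escaped, otherwise the URI is parsed incorrectly.
    """
    scheme = "mongodb+srv" if srv else "mongodb"
    return f"{scheme}://{quote_plus(username)}:{quote_plus(password)}@{host}/"


def ping(uri, timeout_ms=5000):
    """Return True if the server answers a ping; requires `pip install pymongo`."""
    from pymongo import MongoClient  # lazy import: only needed for a live check

    client = MongoClient(uri, serverSelectionTimeoutMS=timeout_ms)
    try:
        client.admin.command("ping")
        return True
    except Exception:
        return False
    finally:
        client.close()
```

For example, `ping(build_mongo_uri("cluster0.example.mongodb.net", "rag_user", "p@ss:word/1"))` would check a hypothetical Atlas cluster; if it returns False, the same values are unlikely to pass “Test Connection.”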

Configure the GCS Connector in Dataworkz:

Next, configure the GCS connector so that Dataworkz can access your bucket:

  • Go to the Dataworkz dashboard and click on “Configuration” in the top-right corner.
  • Click on “Cloud data platforms” and select GCS.
  • Click the “+” icon to open the configuration window.
  • Enter the storage name, Project ID, Client ID, Client Email, Account Private Key ID, and Account Private Key.
  • Test the connection to make sure the configuration is valid, then save it.
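
The values requested here correspond to fields in a Google Cloud service-account JSON key file. As a hedged sketch, the helper below pulls them out of a key file so they can be pasted into the form. The mapping from form labels to JSON keys is an assumption based on the field names, not something this guide documents.

```python
import json

# Assumed mapping from the Dataworkz form labels to service-account JSON keys.
FORM_FIELDS = {
    "Project ID": "project_id",
    "Client ID": "client_id",
    "Client Email": "client_email",
    "Account Private Key ID": "private_key_id",
    "Account Private Key": "private_key",
}


def form_values_from_key(key_json):
    """Extract the connector form values from service-account key JSON text."""
    key = json.loads(key_json)
    missing = [field for field in FORM_FIELDS.values() if field not in key]
    if missing:
        raise ValueError(f"key file is missing fields: {missing}")
    return {label: key[field] for label, field in FORM_FIELDS.items()}


# Example (hypothetical file name, not run here):
# from pathlib import Path
# values = form_values_from_key(Path("service-account.json").read_text())
```

Treat the private key as a secret: avoid printing or logging the returned dictionary.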

Step 1: Pre-processing Data

  • Access Pre-processing Settings in Dataworkz:
    • Click on the gear icon in the top right, navigate to the AI Applications section, and select Pre-processing.
  • Create a Pre-processing Job:
    • Press the ‘+’ button and name your pre-processing job.

Configuring Sources

  • PDF Files:
    • For PDFs, select the file type PDF.
    • Choose the storage you uploaded the files to.
    • Find the PDF files you uploaded to your cloud storage, and make sure to select all of them.
  • Website URL (HTML):
    • For URL links, select the file type HTML.
    • Add the URLs of interest, then check the sub-crawling and JavaScript-enabled boxes.
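
To make the sub-crawling option concrete: a crawler that follows sub-pages typically collects the links on each page and keeps the ones that stay on the same site. The sketch below illustrates that general idea using only the Python standard library. It is an illustration of the technique, not Dataworkz's crawler, and it does not execute JavaScript.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_site_links(page_url, html):
    """Return absolute, de-duplicated links in `html` on the same host as page_url."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(page_url).netloc
    seen, links = set(), []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" so anchors are not
        # treated as separate pages.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == host and absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links
```

A crawler would call `same_site_links` on each fetched page and queue the results, which is why a single seed URL with sub-crawling enabled can produce many documents.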

Submit Pre-processing:

  • Next, choose the pre-processing target location and click Submit to run the pre-processing job.
  • Verify the saved data in the dataset tab.

Step 2: Building the Q&A System

  • Access the Dataworkz RAG builder:
    • Click on the gear icon, navigate to AI Applications, and select Q&A.

  • Create a Q&A System:
    • Click on the ‘+’ button and give the Q&A system a name.

  • Configure Source:
    • First, configure the source using the datasets created during pre-processing.
    • Add question (text) and external link (PDF or URL) columns based on the source files.
  • Choose Embedding Model:
    • Select the embedding model for your Q&A system. You can choose one of the following options:
      • Choose a privately hosted model; this includes the popular models from the MTEB leaderboard.
      • Choose a model hosted on Hugging Face.
      • Bring your own model.
  • Configure Vector Storage (Using MongoDB):
    • Name your vector storage and choose the cosine similarity metric.
    • Set the threshold, delimiter, chunk size, and overlap.
    • Save the configuration.
  • Choose Language Models:
    • Select the language models you want to test (e.g., Llama 2, ChatGPT, or Mistral).
    • Provide a custom prompt for the system.

  • Save and Submit:
    • Save and submit to create the Q&A system. The Dataworkz RAG builder will create the relevant index and chunks in MongoDB. The cloud infrastructure that creates embeddings and chunks is scaled on demand.
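
The vector storage settings in this step (chunk size, overlap, cosine similarity, threshold) can be illustrated with a small, dependency-free sketch. It assumes that chunk size and overlap are measured in characters and that the threshold is a minimum similarity score. How Dataworkz interprets these settings internally is not specified in this guide, so treat the function names and units as hypothetical.

```python
import math


def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut
    at a chunk boundary still appears whole in one of the two chunks.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, indexed, threshold, top_k=3):
    """Return up to top_k (score, chunk) pairs scoring >= threshold, best first.

    `indexed` is a list of (chunk, vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), chunk) for chunk, vec in indexed]
    kept = [pair for pair in scored if pair[0] >= threshold]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)[:top_k]
```

In this sketch, a larger overlap preserves more context across chunk boundaries at the cost of more chunks to embed, and a higher threshold returns fewer but more similar chunks.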

Step 3: Using the Q&A System

  • Testing the System:
    • Ask a question in the chat box, and the system will return an answer along with the location of the data.
  • Reviewing Results:
    • Check the source links to see where the generated information originated.
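
Behind a chat box like this, a RAG system typically retrieves the closest chunks from the vector store and passes them to the language model together with their source links. The sketch below shows what that could look like against MongoDB Atlas Vector Search using the `$vectorSearch` aggregation stage. The index name, the field names (`embedding`, `text`, `source_url`), and the prompt wording are hypothetical; the schema that the Dataworkz RAG builder actually creates may differ.

```python
def vector_search_pipeline(query_vector, index="vector_index", path="embedding", limit=3):
    """Build an Atlas Vector Search aggregation pipeline for a query embedding."""
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,
                "queryVector": list(query_vector),
                # Nearest-neighbor candidates considered before keeping `limit`.
                "numCandidates": limit * 20,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "text": 1,
                "source_url": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


def build_prompt(question, hits):
    """Assemble a prompt that numbers each retrieved chunk and names its source."""
    context = "\n".join(
        f"[{i}] {hit['text']} (source: {hit['source_url']})"
        for i, hit in enumerate(hits, start=1)
    )
    return (
        "Answer the question using only the context below and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# With pymongo (not run here), where `collection` holds the chunks:
# hits = list(collection.aggregate(vector_search_pipeline(query_embedding)))
# prompt = build_prompt("What does the study conclude?", hits)
```

Numbering the chunks in the prompt is one way to let the model's answer point back to specific source links.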

Step 4: Testing and Optimization

  • Test Results:
    • Test the system by changing prompts and formatting metadata.
  • Formatting Metadata:
    • Experiment with different ways to make RAG sources more relevant, such as versioning the MongoDB documents.
  • Adding Data from Other Sources:
    • Enhance the system by adding data from Google Sheets, Docs, or other internal sources.
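
As one possible reading of the versioning idea, each chunk's metadata could carry a version number so that only the newest version of each source document is offered to the model. The sketch below assumes hypothetical `source_url` and `version` metadata fields. It is an illustration, not a description of a Dataworkz feature.

```python
def latest_versions(chunks):
    """Keep only the chunks that belong to the highest version of each source.

    `chunks` is a list of dicts with (hypothetical) "source_url" and
    "version" keys; the original order of the kept chunks is preserved.
    """
    newest = {}
    for chunk in chunks:
        src = chunk["source_url"]
        newest[src] = max(newest.get(src, chunk["version"]), chunk["version"])
    return [c for c in chunks if c["version"] == newest[c["source_url"]]]
```

Filtering this way before retrieval keeps an outdated copy of a document from competing with its replacement for the same question.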

By following these steps, you can create a powerful RAG Q&A system that not only answers queries effectively but also ensures accuracy and relevance across various data sources. Dataworkz empowers you to provide a top-notch self-service experience for any user.

If you would like to see a live demo on your own data, send an email to info@dataworkz.io or contact us at www.dataworkz.com/contact-us/.

Keep an eye out for our upcoming blog posts on advanced RAG topics, RAG monitoring, and AI guardrails.
