Research to Production: Relative Answer Quality (RAQ) and NVIDIA NIM

A Step-by-Step Approach to LLM Evaluation and Deployment with Relative Answer Quality (RAQ) and NVIDIA NIM

This article was co-authored by Luís Roque and Rafael Guedes

Introduction

The successful release of ChatGPT in 2022 made people realize that generative AI can have numerous advantages not only for individuals who want to automate manual and time-consuming tasks on their own but also for companies that seek to enhance customer experience and optimize operations.

The increased demand for generative AI solutions has led several companies to invest in the research and development of open-source solutions, such as Mixtral from Mistral AI or LLaMA 3 from Meta. The heavy investment in GenAI resulted in the wide availability of these models to the public, which shifted the focus of most companies from developing in-house Large Language Models (LLMs) to deploying these open-source versions.

Regarding the productization of these models, two questions come to mind: which model should be used, and how can it be deployed at scale and safely? This article aims to answer both. We propose a method called RAQ (Relative Answer Quality) that assesses different LLMs by comparing and ranking their answers using an independent LLM. The key differentiator of our approach is that it can be easily integrated into any workflow and can assess any set of LLMs on any domain, subject, or use case. RAQ can be used when an organization already has a dataset of questions and correct answers on a specific subject, but it can also be applied when no such dataset exists. Another advantage of our proposed method is that the adequacy of the dataset size can be quantified: we use statistical tests to determine whether the differences in ranks are significant, allowing the ML practitioner to confidently determine if one LLM is better than another. The ranking is based on each answer's proximity to the ground truth, regardless of the domain, subject, or use case of the data being evaluated.

To answer the second question, we demonstrate how easy it is to transition from research to a production-level solution using NVIDIA's new NIM microservices. NIM allows the deployment of LLMs at scale: rapid prototyping against an NVIDIA-hosted service first and, for production, self-hosted deployment on any private cloud or physical hardware. The inference engines for each combination of accelerated hardware and leading open models are already optimized out of the box, which lets us set up the infrastructure with minimal effort. Additionally, NIM provides the flexibility to fine-tune the models with LoRA.

As always, the code is available on our GitHub.

RAQ: Relative Answer Quality

As mentioned before, there was a huge investment in the research and development of open-source LLMs after the release of ChatGPT. In fact, nearly every week, a new model or a variant of a model is being released. With so many options, how can one choose the best model for their use case?

Quantitative metrics such as inference time are important and should be considered when selecting your LLM. Nevertheless, an LLM with high inference speed and low quality of response might not be a good option.

Assessing an LLM qualitatively on a specific use case is a manual and time-consuming task because one needs to evaluate the model based on hundreds of questions. This is highly subjective and requires reading hundreds of answers and comparing them with the ground truth answer. Therefore, to overcome this problem and reduce the manual work associated with assessing the quality of response, we propose RAQ.

The RAQ framework relies on an independent LLM that receives the questions, the ground truth answers, and the answers of a set of LLMs. The independent LLM then ranks those answers based on how close they are to the ground truth. An extension of RAQ can be used when no dataset with questions and answers exists: in that case, another independent LLM can create one for us. There is also a hybrid setup, where a small dataset exists but is not sufficient to run RAQ. In this case, we can leverage the small dataset as a seed to create similar examples with an independent LLM, following a semi-synthetic approach.

The independent LLM can introduce biases and other potential problems. To mitigate this, we propose two options: i) selecting the best available open-source or closed-source model, or ii) selecting a pool of best-in-class models and running RAQ once per independent LLM. The latter requires an additional step before applying the general RAQ framework: the results need to be aggregated by computing the median and standard deviation of the ranks. One critical point for RAQ is to ensure that the independent judge LLM (or pool of judges) is not part of the set of LLMs being evaluated. Otherwise, we would introduce a strong bias towards that particular model or set of models.
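To make the aggregation step concrete, the snippet below is a minimal sketch of how the ranks from a pool of judges could be combined for a single candidate model and question; the judge names and rank values are purely illustrative.

import numpy as np

# Hypothetical ranks assigned to one candidate LLM for the same question by
# three independent judge LLMs (names and values are illustrative)
ranks_per_judge = {"judge_a": 2, "judge_b": 1, "judge_c": 3}

ranks = np.array(list(ranks_per_judge.values()))
aggregated_rank = np.median(ranks)  # robust central tendency across judges
judge_disagreement = np.std(ranks)  # dispersion captures how much the judges disagree

print(f"median rank: {aggregated_rank}, std: {judge_disagreement:.2f}")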

Finally, RAQ can also include additional metrics to provide a more complete comparison of the set of LLMs. For example, we can compute words per second and average answer length, which provide additional signals on model performance and verbosity.

RAQ in action

We start by creating the following prompt: we ask the independent LLM to compare and rank the LLM IDs based on the quality of their answers.

Based on the correct answer: {Ground Truth Answer}, rank the IDs of the following answers from the most to the least correct one:
ID: 1 Answer: {LLM 1 Answer}
ID: 2 Answer: {LLM 2 Answer}
ID: 3 Answer: {LLM 3 Answer}
...
 

We run this process multiple times and record all the rankings provided. After collecting the rankings for all questions, we perform Dunn's Multiple Comparison Test [2]. It helps us understand whether there are significant differences between the ranks of the LLMs or whether they perform similarly. The test performs pairwise comparisons between each independent group and tells us which groups are statistically significantly different at a pre-defined significance level, usually 5%.
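As a minimal sketch of this step, assuming we have already collected one rank per question for each model (1 = best), the Dunn test can be run with scikit_posthocs; the rank values below are illustrative.

import scikit_posthocs as sp

# Illustrative ranks (1 = best) collected for three LLMs over five questions
ranks_llm_a = [1, 2, 1, 1, 2]
ranks_llm_b = [2, 1, 3, 2, 1]
ranks_llm_c = [3, 3, 2, 3, 3]

# Pairwise Dunn's test with Holm correction; each cell of the resulting
# DataFrame is the p-value for the comparison between two models
p_values = sp.posthoc_dunn([ranks_llm_a, ranks_llm_b, ranks_llm_c], p_adjust="holm")
print(p_values <= 0.05)  # True where the difference in ranks is significant at the 5% level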


Figure 2: RAQ process mechanics (image by author)

NVIDIA NIM: What is it?

NVIDIA NIM [1] is a set of containers that provide off-the-shelf inference APIs for AI models. It is a cloud-native microservices solution designed to make the deployment process easier and less time-consuming, removing the complexity of connecting AI models to existing enterprise infrastructure.

Companies usually deploy LLMs in three ways: on their own physical hardware, on a cloud, or through third-party hosted APIs. The first two options offer advantages such as data privacy, security, and model flexibility. Nevertheless, they require very specialized resources to avoid inefficiencies, such as underutilized hardware or application performance problems. On the other hand, the latter solves the performance issue but does not ensure data privacy, security, and model flexibility. It would also be a core dependency, bringing additional risk to the setup.

NIM bridges these gaps in two ways. First, NVIDIA provides its own hosted APIs that you can use. The main difference is that they serve community-supported models and ensure data privacy and security: NVIDIA only provides the microservices to run these models, so there is no incentive to use your data for training purposes, contrary to what you find with providers such as OpenAI, Anthropic, or Google. The second option is to use NIM to deploy models on your dedicated infrastructure. In this case, NIM makes available optimized inference engines for each model and hardware setup, which means the team at NVIDIA has already done the critical work on the infrastructure side to ensure low latency and high throughput. Additionally, it provides the flexibility to select any model available in the catalog and deploy it with minimal changes. Finally, these models can also be fine-tuned with LoRA.

Below, we introduce the process to get an LLM from the NIM catalog up and running on your own infrastructure:

  • Install Docker (https://docs.docker.com/engine/install/)
  • Install NVIDIA Container Toolkit (https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html)
  • Log in to the NVIDIA container registry: docker login nvcr.io
  • Set your NVIDIA NGC API key: export NGC_API_KEY=
  • Define the container name, image name, and local path to download the model. If you want to deploy another model, for example, LLaMA 3 70B, you just need to change the IMG_NAME to nvcr.io/nim/meta/meta-llama3-70b-instruct:<version>
# Choose a container name for bookkeeping
export CONTAINER_NAME=meta-llama3-8b-instruct

# Choose a LLM NIM Image from NGC
export IMG_NAME="nvcr.io/nim/meta/${CONTAINER_NAME}:1.0.0"

# Choose a path on your system to cache the downloaded models
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
 
  • Run the following docker command.
# Start the LLM NIM
docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME
 
Table 1: Explanation of the docker command to start the NIM container.

We are all set. We can now make requests to the LLaMA 3 8B model we just deployed.

import requests
from pprint import pp

endpoint = ''  # URL of the chat completions endpoint exposed by the deployed NIM
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json'
}
messages = [
    {"role": "user",
    "content": "Write a short message explaining why AI is important."}
]
data = {
    'model': 'meta/llama3-8b-instruct',
    'messages': messages,
    'max_tokens': 100,
    'temperature': 1,
    'n': 1,
    'stream': False,
    'stop': None,
    'frequency_penalty': 0.0
}
response = requests.post(endpoint, headers=headers, json=data)
pp(response.json())
 

Artificial Intelligence (AI) is revolutionizing the world and has the potential to significantly impact our daily lives. By processing vast amounts of data, AI can automate repetitive tasks, improve decision-making, and provide personalized experiences. It can also help us tackle complex challenges such as climate change, healthcare, and education by analyzing patterns and identifying solutions. Moreover, AI can enhance productivity, efficiency, and accuracy in various industries, from customer service to transportation.
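Before wiring the endpoint into an application, it can also be useful to confirm what the container is serving. The following is a minimal sanity check, assuming the OpenAI-compatible API is exposed on port 8000 as mapped in the docker command above.

import requests

# List the models served by the local NIM container
# (assumes the default 8000:8000 port mapping from the docker command above)
models = requests.get("http://localhost:8000/v1/models")
print(models.json())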

Industry-standard APIs

Another interesting aspect of NIM is its integration with popular LLM packages, such as LangChain and LlamaIndex. Furthermore, any package compatible with OpenAI’s API can be easily integrated with NIM by changing the base URL.

The following example uses LangChain and OpenAI to ask the same question as before. Note that we are providing a localhost URL since we are querying the LLaMA 3 8B model that we just deployed. In this case, we create an LLMChain with a simple template that receives a question and passes it to LLaMA.

from langchain import PromptTemplate
from langchain.chains import LLMChain
from langchain_openai import ChatOpenAI

template = """
    Question: {question}
    Answer:
"""
prompt = PromptTemplate(
    template=template, input_variables=["question"]
)
llm = ChatOpenAI(base_url="",
        model="meta/llama3-8b-instruct",
        api_key="not-used",
        temperature=0.1,
        max_tokens=100,
        top_p=1.0)
query_llm = LLMChain(
            llm=llm,
            prompt=prompt,
        )
answer = query_llm.invoke(
    {"question": "Write a short message explaining why AI is important."}
)
 

Artificial Intelligence (AI) is crucial in today’s world because it has the potential to revolutionize the way we live, work, and interact with each other. AI can automate repetitive and mundane tasks, freeing up human resources to focus on more creative and strategic work. It can also improve healthcare outcomes, enhance customer service, and optimize business operations. Moreover, AI can help us tackle complex global challenges such as climate change, poverty, and inequality by providing data-driven insights and solutions.

Domain-specific models

Like the Hugging Face Hub, the NVIDIA API catalog is extensive, offering models that tackle problems in different domains, such as language, speech, vision, and gaming.

In this article, we focus on four available language models:

  • mistralai/mistral-7b-instruct-v0.3
  • mistralai/mixtral-8x22b-instruct-v0.1
  • meta/llama3-70b-instruct
  • meta/llama3-8b-instruct

Mistral vs Meta: a comparison between Mistral 7B, Mixtral 8x22B, Llama 3 8B, and 70B

This section tests the four models on SQuAD, a question-answering dataset released under the CC BY-SA 4.0 license. This reading comprehension dataset consists of questions about a set of Wikipedia articles. Given a context, the model should retrieve the correct answer to a question. The three most important fields for our use case are:

  • question - the question the model should answer.
  • context - background information from which the model needs to extract the answer.
  • answers - the ground truth answer to the question (an example record is shown below).
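To make these fields concrete, the snippet below inspects one record. Note that answers is a dictionary whose text entry holds a list of acceptable answer strings, which is why the evaluation loop later accesses ['answers']['text'][0].

from datasets import load_dataset

squad = load_dataset("squad", split="train")

# Inspect one record to see the structure of the fields described above
sample = squad[0]
print(sample["question"])
print(sample["context"][:200])
print(sample["answers"]["text"])  # list of acceptable answer strings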

To perform the evaluation, we use RAQ, as described above. In this instance, we use GPT-3.5 as the independent LLM that ranks the set of LLMs that we are interested in. It ranks their answers from best (rank=1) to worst (rank=4) based on the ground truth answer.

RAQ applies a statistical test called Dunn’s Multiple Comparison Test to assess if there is a statistically significant difference between the ranks of the set of LLMs.

Finally, RAQ also compares words per second and average answer length, which provide additional data points on model performance and verbosity.

We start by setting up an env file under env/ with the OpenAI and NGC API keys:

  • var.env file
OPENAI_API_KEY=
NGC_API_KEY=YOUR_NGC_API_KEY
 

Then, we import all the libraries, load the API Keys, and define the models we want to use from the NVIDIA catalog.

import os

import matplotlib.pyplot as plt
import pandas as pd
import scikit_posthocs as sp
import seaborn as sns
import utils
from datasets import load_dataset
from dotenv import load_dotenv
from generator import Generator
load_dotenv('env/var.env')
# models
llama8b = Generator(model='meta/llama3-8b-instruct', ngc_key=os.getenv("NGC_API_KEY"))
mistral7b = Generator(model="mistralai/mistral-7b-instruct-v0.3", ngc_key=os.getenv("NGC_API_KEY"))
llama70b = Generator(model="meta/llama3-70b-instruct", ngc_key=os.getenv("NGC_API_KEY"))
mixtral = Generator(model="mistralai/mixtral-8x22b-instruct-v0.1", ngc_key=os.getenv("NGC_API_KEY"))
 

The Generator class is responsible for loading the model and creating the Prompt Template powered by LangChain. It formats the query and the context before passing it to the LLM to get a response.

from langchain import PromptTemplate
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain

class Generator:
    """Generator, aka LLM, to provide an answer based on some question and context"""
    def __init__(self, model: str, ngc_key: str) -> None:
        # template
        self.template = """
            Use the following pieces of context to give a succinct and clear answer to the question at the end:
            {context}
            Question: {question}
            Answer:
        """
        # llm
        self.llm = ChatOpenAI(
            base_url="",
            api_key=ngc_key,
            model=model,
            temperature=0.1)
        # create prompt template
        self.prompt = PromptTemplate(
            template=self.template, input_variables=["context", "question"]
        )
    def generate_answer(self, context: str, question: str) -> str:
        """
        Get the answer from llm based on context and user's question
        Args:
            context (str): most similar document retrieved
            question (str): user's question
        Returns:
            str: llm answer
        """
        query_llm = LLMChain(
            llm=self.llm,
            prompt=self.prompt,
            llm_kwargs={"max_tokens": 2000},
        )
        answer = query_llm.invoke(
            {"context": context, "question": question}
        )
        
        return answer['text']
 

With the LLMs loaded, we fetch the SQuAD dataset from HuggingFace and shuffle it to ensure enough variety in the question themes.

squad = load_dataset("squad", split="train")
squad = squad.shuffle()
 

Now, we initialize a metrics dictionary for each model and apply the RAQ loop over 100 questions and contexts, recording the metrics described above.

# initialize the metric containers for each model
llama8b_metrics = {"words_per_second": [], "words": [], "rank": []}
mistral7b_metrics = {"words_per_second": [], "words": [], "rank": []}
llama70b_metrics = {"words_per_second": [], "words": [], "rank": []}
mixtral_metrics = {"words_per_second": [], "words": [], "rank": []}

for i in range(100):
    context = squad[i]['context']
    query = squad[i]['question']
    answer = squad[i]['answers']['text'][0]

    # llama 8b
    answer_llama, words_per_second, words = utils.get_llm_response(llama8b, context, query)
    llama8b_metrics["words_per_second"].append(words_per_second)
    llama8b_metrics["words"].append(words)
    # mistral 7b 
    answer_mistral, words_per_second, words = utils.get_llm_response(mistral7b, context, query)
    mistral7b_metrics["words_per_second"].append(words_per_second)
    mistral7b_metrics["words"].append(words)
    # llama 70b
    answer_llama70b, words_per_second, words = utils.get_llm_response(llama70b, context, query)
    llama70b_metrics["words_per_second"].append(words_per_second)
    llama70b_metrics["words"].append(words)
    # mixtral
    answer_mixtral, words_per_second, words = utils.get_llm_response(mixtral, context, query)
    mixtral_metrics["words_per_second"].append(words_per_second)
    mixtral_metrics["words"].append(words)
    # GPT-3.5 rank
    llm_answers_dict = {'llama8b': answer_llama, 'mistral7b': answer_mistral, 'llama70b': answer_llama70b, 'mixtral': answer_mixtral}
    rank = utils.get_gpt_rank(answer, llm_answers_dict, os.getenv("OPENAI_API_KEY"))
    llama8b_metrics["rank"].append(rank.index('1')+1)
    mistral7b_metrics["rank"].append(rank.index('2')+1)
    llama70b_metrics["rank"].append(rank.index('3')+1)
    mixtral_metrics["rank"].append(rank.index('4')+1)

 

The function get_llm_response receives the loaded LLM, the context, and the question and returns the LLM answer as well as the quantitative metrics.

import re
import time
from typing import Tuple

from generator import Generator


def get_llm_response(model: Generator, context: str, query: str) -> Tuple[str, float, int]:
    """
    Generates an answer from a given LLM based on context and query
    returns the answer and the number of words per second and the total number of words
    Args:
        model (Generator): LLM
        context (str): context data
        query (str): question
    Returns:
        Tuple[str, float, int]: answer, words_per_second, words
    """

    init_time = time.time()
    answer_llm = model.generate_answer(context, query)
    total_time = time.time()-init_time
    words_per_second = len(re.sub("[^a-zA-Z']+", ' ', answer_llm).split())/total_time
    words = len(re.sub("[^a-zA-Z']+", ' ', answer_llm).split())
    return answer_llm, words_per_second, words
 

On the other hand, the function get_gpt_rank implements the RAQ core logic. It is responsible for receiving the ground truth answer and each of the LLM answers and sending a request to GPT-3.5 to rank them based on correctness.

import ast
import re

from openai import OpenAI


def get_gpt_rank(true_answer: str, llm_answers: dict, openai_key: str) -> list:
    """
    Implements RAQ core: based on the true answer, it uses GPT-3.5 to rank the answers of the LLMs
    Args:
        true_answer (str): correct answer
        llm_answers (dict): LLM answers
        openai_key (str): open ai key
    Returns:
        list: rank of LLM IDs
    """
    
    # get a formatted output from OpenAI
    functions = define_open_ai_function()
    gpt_query = f"""Based on the correct answer: {true_answer}, rank the IDs of the following four answers from the most to the least correct one:
        ID: 1 Answer: {re.sub("[^a-zA-Z0-9']+", ' ', llm_answers['llama8b'])}
        ID: 2 Answer: {re.sub("[^a-zA-Z0-9']+", ' ', llm_answers['mistral7b'])}
        ID: 3 Answer: {re.sub("[^a-zA-Z0-9']+", ' ', llm_answers['llama70b'])}
        ID: 4 Answer: {re.sub("[^a-zA-Z0-9']+", ' ', llm_answers['mixtral'])}"""
    completion = OpenAI(api_key=openai_key).chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": gpt_query}],
        functions=functions,
        function_call={"name": "return_rank"},
    )
    response_message = completion.choices[0].message.function_call.arguments
    rank = ast.literal_eval(response_message)["rank"].split(",")
    if len(rank) == 1:
        rank = list(rank[0])
    return rank
 

From Figure 3, it is evident that LLaMA 3 8B is the fastest LLM, producing an average of approximately 43 words per second. In terms of answer length, Mistral 7B produces longer answers, with an average answer length of 24 words, significantly more than LLaMA 70B, which produces only 8 words. Finally, based on the independent LLM rankings, Mistral 7B had the best average rank of approximately 2.25, while LLaMA 3 8B was the worst-performing LLM with an average rank of approximately 2.8.


Figure 3: Metrics comparison between all LLMs (image by author)
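A comparison like the one in Figure 3 can be reproduced with the plotting libraries imported earlier; the snippet below is a minimal sketch, assuming the *_metrics dictionaries populated in the RAQ loop above.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Average each metric per model (assumes the *_metrics dicts from the RAQ loop)
metrics = {
    "llama8b": llama8b_metrics,
    "mistral7b": mistral7b_metrics,
    "llama70b": llama70b_metrics,
    "mixtral": mixtral_metrics,
}
summary = pd.DataFrame(
    {model: {k: pd.Series(v).mean() for k, v in m.items()} for model, m in metrics.items()}
).T  # one row per model, one column per metric

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, metric in zip(axes, ["words_per_second", "words", "rank"]):
    sns.barplot(x=summary.index, y=summary[metric], ax=ax)
    ax.set_title(metric)
plt.tight_layout()
plt.show()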

Table 2 displays the results of the Dunn post-hoc test, comparing the performance of different language models. Each cell indicates whether the difference in performance between the respective models is statistically significant at a 5% significance level. “Significant” denotes a statistically significant difference (p-value ≤ 0.05), while “Not Significant” indicates no statistically significant difference (p-value > 0.05).

For the selected significance level, the Dunn test result shows that Mistral 7B's performance is significantly different from LLaMA 3 8B's but not significantly different from the other LLMs. One strategy to increase the likelihood of detecting significant differences in the remaining comparisons is to increase the test's sample size (the number of examples used to rank the models). With a larger sample size, we may obtain smaller p-values if there are indeed differences between the ranks.

p_values = sp.posthoc_dunn(
    [mistral7b_metrics['rank'], llama8b_metrics['rank'],
     llama70b_metrics['rank'], mixtral_metrics['rank']],
    p_adjust='holm',
)
# True where the difference is NOT statistically significant at the 5% level
p_values > 0.05
 
Table 2: Significance of differences in ranks among the set of LLMs
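The labels in Table 2 can be derived directly from the p-value matrix. The snippet below is a small sketch, assuming the p_values DataFrame returned above, with groups in the same order as they were passed to posthoc_dunn.

# Label each pairwise comparison as in Table 2 (diagonal entries compare a model with itself)
models = ["mistral7b", "llama8b", "llama70b", "mixtral"]
p_values.index = p_values.columns = models
significance = p_values.applymap(lambda p: "Significant" if p <= 0.05 else "Not Significant")
print(significance)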

As stated earlier, the benefit of RAQ is that it can be used to assess any set of LLMs on any domain, subject, or use case rather than relying on traditional benchmarks. This means that, depending on the dataset used, different models will emerge as leaders in the evaluation.

Conclusion

The fast progress in developing more powerful LLMs and making them easily accessible to everyone brings new challenges on the adoption side. Companies are looking for new ways of integrating them into internal and external tools. Nevertheless, with the current approaches to selecting and deploying these models, there is a significant trade-off between control, privacy, flexibility, and security.

In this article, we introduced RAQ, a novel framework to assess and compare the quality of the answers of a set of LLMs. It makes the selection of an LLM for a new use case objective, scalable, and flexible. The flexibility comes from the fact that it can be applied when an organization has its own private examples with which to test the set of LLMs, but it also works when no such dataset is available.

Having selected the LLM, we explored NIM as a solution for deploying the model at scale. It ensures privacy, security, and scalability out of the box, regardless of whether the model is deployed on physical hardware, in a private cloud, or through a hosted service.

We applied RAQ and used NIM to compare the performance of the Mistral and Meta models (small and large versions). We used a reading comprehension dataset to illustrate how to use RAQ. Mistral 7B was the best model in our setup, generating significantly better answers than LLaMA 3 8B. The latter was actually the fastest model, producing more words per second. This shows that selecting a model usually involves a trade-off between quality and speed.

About me

Serial entrepreneur and leader in the AI space. I develop AI products for businesses and invest in AI-focused startups. Founder @ ZAAI | LinkedIn | X/Twitter

References

[1] https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/

[2] Dunn, O. J. (1964) Multiple comparisons using rank sums. Technometrics. 6, 241–252. doi:10.1080/00401706.1964.10490181.

All images are by the authors unless noted otherwise.