The Natural Conversation Benchmarks (NC-Bench) aim to answer the question: How well can generative AI converse like humans do? In other words, the benchmarks begin to measure the general conversational competence of large language models (LLMs). They do this by testing a model's ability to generate an appropriate type of conversational action, or dialogue act, in response to a particular sequence of actions. The sequences of conversational actions, or patterns, are adapted from conversation science, specifically the model of sequence organization in the field of conversation analysis (Schegloff, 2007) and the pattern library of the IBM Natural Conversation Framework (book). Models are tested by having them generate the next line in a transcript. NC-Bench is a lightweight method that is easily extensible to more conversation patterns.
First, clone the repository to your local machine:
git clone https://github.ibm.com/conversational-ux/conversation-benchmark.git
cd conversation-benchmark

Create a new conda or Python virtual environment, then install all dependencies using pip:

pip install -r requirements.txt

The dataset required for benchmarking can be found in the data/ directory. It ships with the repository, so there is nothing to do here. It is a JSONL file in which each entry is a JSON object with fields for the ID, task type, and input text. For example:
{
  "id": 0,
  "task": "definition request",
  "chat_prompt": [
    {"role": "system", "content": "Let’s do some roleplaying today! We’re just gonna have an ordinary conversation. But keep your turn short, about one sentence."},
    {"role": "user", "content": "What's the difference between an index fund and a mutual fund?"},
    {"role": "assistant", "content": "Index funds are passively managed, while most other mutual funds are actively managed."},
    {"role": "user", "content": "What does passively managed mean?"}
  ]
}

The chat_prompt field holds the input text formatted for the Hugging Face chat template via the tokenizer.apply_chat_template(...) API. The freeform_prompt field holds the same input as raw text, without any chat template, for models that lack one.
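For orientation, here is a minimal sketch of how a benchmark entry can be loaded and rendered with a chat template. The model name is only a placeholder, and this snippet is not part of the provided scripts:

```python
# Minimal sketch: read one benchmark entry and render it with a chat
# template. The model name is only an example; any chat-tuned HF model
# with a chat template will work.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

with open("data/converation_competence.jsonl") as f:
    entry = json.loads(f.readline())

# Render the conversation into the model's prompt format, leaving the
# assistant turn open so the model generates the next line.
prompt = tokenizer.apply_chat_template(
    entry["chat_prompt"],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```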
To run the model predictions using a local HuggingFace model, use the provided script:
# Generates outputs using a Hugging Face model for benchmark tasks.
python pred_hf.py --model_path /path/to/model \
--prompts_file ./data/converation_competence.jsonl \
--generations_file /path/to/model/generations.jsonl

- --model_path: Path to your model.
- --prompts_file: Path to the input prompt data (in JSONL format).
- --generations_file: Path to where the generated outputs will be saved (in JSONL format).
- --batch_size: Number of samples to process per batch (default is 8).
- --max_new_tokens: Maximum number of new tokens to generate (default is 128).
- --no_chat_template: Flag to disable the chat template (off by default).
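For reference, the core generation step looks roughly like the sketch below. Batching, argument parsing, the --no_chat_template path, and output writing are omitted, so this is an approximation rather than the exact pred_hf.py code:

```python
# Rough sketch of the per-entry generation step; details may differ
# from the actual pred_hf.py implementation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/model"  # placeholder, as in the command above
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)

def generate_reply(chat_prompt, max_new_tokens=128):
    """Generate the model's next turn for one benchmark entry."""
    input_ids = tokenizer.apply_chat_template(
        chat_prompt, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens.
    return tokenizer.decode(
        output[0, input_ids.shape[-1]:], skip_special_tokens=True
    )
```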
To run the model predictions using an OpenAI model, use the provided script:
# Set OpenAI API key
export OPENAI_API_KEY="..."
# Generates outputs using GPT-4o for benchmark tasks.
python pred_gpt4o.py --model gpt-4o-2024-08-06 \
--prompts_file ./data/converation_competence.jsonl \
--generations_file /path/to/model/generations.jsonl

- --model: Specify the GPT-4o model version to use (default is gpt-4o-2024-08-06).
- --prompts_file: Path to the input prompt data (in JSONL format).
- --generations_file: Path to where the generated outputs will be saved (in JSONL format).
- --max_new_tokens: Maximum number of new tokens to generate (default is 128).
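Internally, each entry's chat_prompt list can be passed to the OpenAI chat completions API largely as-is. A minimal sketch of a single call (not the exact pred_gpt4o.py implementation):

```python
# Minimal sketch of one GPT-4o call per benchmark entry; the real
# pred_gpt4o.py may differ in batching, error handling, and output format.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("data/converation_competence.jsonl") as f:
    entry = json.loads(f.readline())

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=entry["chat_prompt"],
    max_tokens=128,
)
print(response.choices[0].message.content)
```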
To run the model predictions using Research Inference and Training Services (RITS), use the provided script:
# Set RITS API key
export RITS_API_KEY="..."
# Generates outputs using RITS for benchmark tasks.
python pred_rits.py --model ibm-granite/granite-13b-chat-v2 \
--endpoint https://inference-3scale-apicast-production.apps.rits.fmaas.res.ibm.com/granite-13b-chat-v2 \
--prompts_file ./data/converation_competence.jsonl \
--generations_file /path/to/model/generations.jsonl

- --model: Specify the model version to use.
- --endpoint: Model inference endpoint.
- --prompts_file: Path to where the input data is located (in JSONL format).
- --generations_file: Path to where the generated outputs will be saved (in JSONL format).
- --max_new_tokens: Maximum number of new tokens to generate (default is 128).
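The sketch below illustrates one way such a call could look if the RITS endpoint is OpenAI-compatible. The request route and the header name are assumptions, so consult the RITS documentation and pred_rits.py for the actual interface:

```python
# Hypothetical sketch only: assumes the RITS endpoint exposes an
# OpenAI-compatible /v1/chat/completions route and accepts the key in a
# "RITS_API_KEY" header. Not the pred_rits.py implementation.
import json
import os
import requests

endpoint = "https://inference-3scale-apicast-production.apps.rits.fmaas.res.ibm.com/granite-13b-chat-v2"

with open("data/converation_competence.jsonl") as f:
    entry = json.loads(f.readline())

response = requests.post(
    f"{endpoint}/v1/chat/completions",  # assumed OpenAI-compatible route
    headers={"RITS_API_KEY": os.environ["RITS_API_KEY"]},  # assumed header
    json={
        "model": "ibm-granite/granite-13b-chat-v2",
        "messages": entry["chat_prompt"],
        "max_tokens": 128,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```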
To run the judgments on the model generations with a local HuggingFace model, use the provided script:
# Evaluates the generated outputs using a Hugging Face model as the judge.
python judge_hf.py --judge_path /path/to/judge \
--prompts_file ./data/converation_competence.jsonl \
--generations_file /path/to/model/generations.jsonl \
--judgments_file /path/to/judgments.jsonl

- --judge_path: Path to the judge model (e.g., Llama-3.3-70B-Instruct).
- --prompts_file: Path to the input prompt data (in JSONL format).
- --generations_file: Path to the generated outputs (in JSONL format).
- --judgments_file: Path where the judgment outputs will be saved (in JSONL format).
- --max_new_tokens: Maximum number of new tokens to evaluate (default is 128).
- --batch_size: Number of samples to process per batch (default is 4).
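Conceptually, the judging step asks the judge model whether the generated turn is an appropriate continuation of the conversation. The sketch below is only an illustration: the judge prompt wording and output parsing are hypothetical and do not reproduce judge_hf.py:

```python
# Illustrative LLM-as-judge sketch; the prompt and parsing are
# hypothetical, not the actual judge_hf.py logic.
from transformers import pipeline

judge = pipeline("text-generation", model="/path/to/judge", device_map="auto")

def judge_generation(chat_prompt, generation, max_new_tokens=128):
    """Ask the judge model whether the generated turn is appropriate."""
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in chat_prompt)
    messages = [{
        "role": "user",
        "content": (
            "Given the conversation below, decide whether the candidate "
            "response is an appropriate next turn. Answer Yes or No.\n\n"
            f"Conversation:\n{transcript}\n\nCandidate response: {generation}"
        ),
    }]
    # With chat-style input, the pipeline returns the full message list,
    # including the judge's new assistant message at the end.
    out = judge(messages, max_new_tokens=max_new_tokens)
    return out[0]["generated_text"][-1]["content"]
```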
To run the judgments on the model generations with an OpenAI model, use the provided script:
# Set OpenAI API key
export OPENAI_API_KEY="..."
# Evaluates the generated outputs using GPT-4o as the judge.
python judge_gpt4o.py --judge gpt-4o-2024-08-06 \
--prompts_file ./data/converation_competence.jsonl \
--generations_file /path/to/model/generations.jsonl \
--judgments_file /path/to/judgments.jsonl

- --judge: Specify the GPT-4o model version to use (default is gpt-4o-2024-08-06).
- --prompts_file: Path to the input prompt data (in JSONL format).
- --generations_file: Path to the generated outputs (in JSONL format).
- --judgments_file: Path where the judgment outputs will be saved (in JSONL format).
- --max_new_tokens: Maximum number of new tokens to evaluate (default is 128).
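The same idea works with GPT-4o as the judge; again, the judge prompt below is hypothetical rather than the one defined in judge_gpt4o.py:

```python
# Illustrative GPT-4o judge call; prompt wording is an assumption.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_with_gpt4o(chat_prompt, generation, max_tokens=128):
    """Ask GPT-4o whether the generated turn is an appropriate next turn."""
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in chat_prompt)
    prompt = (
        "Given the conversation below, decide whether the candidate response "
        "is an appropriate next turn. Answer Yes or No.\n\n"
        f"Conversation:\n{transcript}\n\nCandidate response: {generation}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content
```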
After the evaluation process is complete, aggregate the results by running the provided script. This will produce an output file containing the chosen metric(s).
- Input: The judgments file in JSONL format, such as judgments/gpt-4o/llama-3.2-1b-instruct.jsonl.
- Output: The aggregated results saved in a JSON file, such as evaluations/gpt-4o/llama-3.2-1b-instruct.json.
Note: The aggregated output should be saved in JSON format rather than JSONL since it contains the final aggregated metric(s).
To run the aggregation process, use the following command:
python aggregator.py --judgments_dir ./results/judgments/gpt-4o --evaluations_dir ./results/evaluations/gpt-4o

- --judgments_dir: The path to the directory containing the judgments.
- --evaluations_dir: The path to the directory where the aggregated results (in JSON format) will be saved.
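As an illustration of what the aggregation amounts to, the sketch below computes a per-task pass rate over the judgment files. The judgment field names ("task", "verdict") and the metric itself are assumptions, not the actual aggregator.py schema:

```python
# Illustrative aggregation sketch; field names and the reported metric
# are assumptions, not the actual aggregator.py implementation.
import json
from collections import defaultdict
from pathlib import Path

judgments_dir = Path("./results/judgments/gpt-4o")
evaluations_dir = Path("./results/evaluations/gpt-4o")
evaluations_dir.mkdir(parents=True, exist_ok=True)

for judgments_file in judgments_dir.glob("*.jsonl"):
    per_task = defaultdict(list)
    with open(judgments_file) as f:
        for line in f:
            record = json.loads(line)
            # Assumed fields: "task" and a yes/no "verdict" per judgment.
            per_task[record["task"]].append(1 if record["verdict"] == "yes" else 0)
    metrics = {task: sum(v) / len(v) for task, v in per_task.items()}
    with open(evaluations_dir / f"{judgments_file.stem}.json", "w") as f:
        json.dump(metrics, f, indent=2)
```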
If you use this dataset in your research, please cite it as follows:
@misc{moore2026ncbench,
      title={NC-Bench: An LLM Benchmark for Evaluating Conversational Competence},
      author={Robert J. Moore and Sungeun An and Farhan Ahmed and Jay Pankaj Gala},
      year={2026},
      eprint={2601.06426},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2601.06426},
}