
Model Readiness Spec Tests; for Prolonged and High-Demand Use #193


Open
wants to merge 103 commits into base: dev

Conversation


@stisiTT stisiTT commented Apr 7, 2025

PR: Add Spec Tests Workflow - Comprehensive Model Readiness Testing

Overview

This PR introduces the Spec Tests workflow, a comprehensive testing framework designed to validate model readiness and performance across different parameter configurations. The spec_tests workflow provides both targeted single-parameter testing and exhaustive multi-parameter matrix testing to ensure models meet performance specifications before deployment.

What is the Spec Tests Workflow?

The Spec Tests workflow is a model validation system that:

  • Tests model performance across various input/output token configurations
  • Validates concurrency handling at different load levels
  • Measures performance metrics like TTFT (Time to First Token) and throughput
  • Compares against performance targets defined in model configurations
  • Generates comprehensive test reports for model readiness assessment

Key Features

🎯 Two Testing Modes:

  • Single Mode: Test specific parameter combinations with explicit control
  • Multiple Mode: Comprehensive cross-product matrix testing across all valid parameter combinations

🔧 Flexible Parameter Control:

  • Input/output token lengths
  • Concurrency levels (batch sizes)
  • Number of prompts (user simulation)
  • Context length limits
  • Custom parameter combinations via workflow-args

📊 Performance Validation:

  • Automatic extraction of performance targets from model configs
  • Comparison against theoretical and reference benchmarks
  • Support for different target levels (theoretical, reference, etc.)

⏱️ Endurance Testing:

  • 24-hour continuous testing mode for stability validation
  • Stress testing with high concurrency and prompt volumes
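For illustration, a minimal sketch of what a time-bounded endurance loop amounts to; the constant and the run_single_spec_test callable below are hypothetical, not the actual implementation behind --endurance-mode:

import time

ENDURANCE_DURATION_S = 24 * 60 * 60  # illustrative 24-hour window

def run_endurance(run_single_spec_test, params) -> int:
    """Repeat a spec test until the endurance window elapses; return completed iterations."""
    deadline = time.monotonic() + ENDURANCE_DURATION_S
    iterations = 0
    while time.monotonic() < deadline:
        run_single_spec_test(params)  # hypothetical callable that runs one spec test pass
        iterations += 1
    return iterations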

How It Works

Architecture

spec_tests/
├── run_spec_tests.py           # Main entry point
├── spec_tests_config.py        # Parameter space generation
├── spec_tests_benchmarking_script.py  # Benchmarking engine
└── run/
    ├── spec_tests.py           # Core workflow orchestrator
    ├── spec_test_tasks.py      # Parameter generation logic
    ├── spec_test_prompt.py     # Prompt generation
    ├── spec_test_run.py        # Test execution
    └── spec_tests_env_vars.py  # Environment configuration

Parameter Generation Logic

Single Mode:

  • Uses explicit parameters from command line arguments
  • Calculates input/output sizes based on max_context_length
  • Defaults: max_context_length=8192, max_concurrent=1, num_prompts=1

Multiple Mode:

  • Generates cross-product of all valid parameter combinations
  • Uses model-specific constraints (max_context, max_concurrency)
  • Includes validated combinations from performance reference data
  • Context sizes: max, ~50%, ~10% of model's max_context
  • Output sizes: 128, 64 tokens (fixed)
  • Concurrency: 1, 2, ~50% of max, validated values (max concurrency temporarily disabled)
  • Prompt patterns: single (1), load-matched (=concurrency) (5x concurrency temporarily disabled)
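As a rough sketch of the multiple-mode generation just described (a cross-product over context sizes, output sizes, concurrency, and prompt patterns), assuming the model constraints are available as plain integers; this is illustrative, not the exact implementation:

from itertools import product

def generate_combinations(max_context: int, max_concurrency: int) -> list[dict]:
    """Illustrative cross-product of spec test parameters."""
    context_sizes = [max_context, max_context // 2, max_context // 10]  # max, ~50%, ~10%
    output_sizes = [128, 64]                                            # fixed output lengths
    concurrencies = sorted({1, 2, max(1, max_concurrency // 2)})        # max concurrency disabled for now
    combos = []
    for ctx, osl, conc in product(context_sizes, output_sizes, concurrencies):
        for prompts in sorted({1, conc}):                               # single prompt and load-matched
            combos.append({
                "input_size": ctx - osl,  # input fills the rest of the context window
                "output_size": osl,
                "max_concurrent": conc,
                "num_prompts": prompts,
            })
    return combos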

Usage Examples

Basic Usage

Single Parameter Test

Test a specific configuration with explicit parameters:

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --run-mode single --docker-server

Custom Context Length

Test with a specific context length:

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --run-mode single --max-context-length 4096 --docker-server

Load Testing with Concurrency

Test with specific concurrency and prompt count:

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --run-mode single --max-context-length 2048 --workflow-args "max_concurrent=16 num_prompts=64" --docker-server

Comprehensive Matrix Testing

Run all valid parameter combinations:

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --run-mode multiple --docker-server

Advanced Usage

High-Load Stress Testing

python run.py --model "Qwen2.5-72B-Instruct" --workflow spec_tests --device t3k --run-mode single --max-context-length 8192 --workflow-args "max_concurrent=32 num_prompts=160" --docker-server

Custom Input/Output Sizes

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --run-mode single --workflow-args "input_size=1024 output_size=512 max_concurrent=8 num_prompts=24" --docker-server

Endurance Testing (24 hours)

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --endurance-mode --docker-server

Model-Specific Examples

Large Model Testing (70B+)

# Qwen2.5-72B on T3K with reduced context for faster testing
python run.py --model "Qwen2.5-72B-Instruct" --workflow spec_tests --device t3k --run-mode single --max-context-length 2048 --workflow-args "max_concurrent=16 num_prompts=32" --docker-server

# Full parameter matrix for comprehensive validation
python run.py --model "Qwen2.5-72B-Instruct" --workflow spec_tests --device t3k --run-mode multiple --docker-server

Parameter Reference

Core Parameters

| Parameter | Description | Default | Example |
|---|---|---|---|
| `--run-mode` | Testing mode: `single` or `multiple` | `multiple` | `--run-mode single` |
| `--max-context-length` | Maximum context length in tokens | `8192` | `--max-context-length 4096` |
| `--endurance-mode` | Run continuously for 24 hours | `false` | `--endurance-mode` |

Workflow Args (via --workflow-args)

| Parameter | Description | Default | Example |
|---|---|---|---|
| `max_concurrent` | Concurrent requests (batch size) | `1` | `max_concurrent=16` |
| `num_prompts` | Total number of prompts (users) | `1` | `num_prompts=64` |
| `input_size` | Input token length | auto-calculated | `input_size=1024` |
| `output_size` | Output token length | `128` | `output_size=256` |

Usage Patterns

Single Test: max_concurrent=1 num_prompts=1
Load Test: max_concurrent=16 num_prompts=16
Stress Test: max_concurrent=32 num_prompts=160
Burst Test: max_concurrent=8 num_prompts=100
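For reference, a space-separated key=value string like the ones above could be parsed along these lines; this is a sketch, not the actual run.py argument handling:

def parse_workflow_args(raw: str) -> dict:
    """Parse e.g. 'max_concurrent=16 num_prompts=64' into typed values (illustrative)."""
    args = {}
    for pair in raw.split():
        key, _, value = pair.partition("=")
        args[key] = int(value) if value.isdigit() else value
    return args

# parse_workflow_args("max_concurrent=32 num_prompts=160")
# -> {'max_concurrent': 32, 'num_prompts': 160}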

Output and Reports

Test Execution Output

=== Parameter Space Information ===
Model ID: id_tt-transformers_Llama-3.1-8B-Instruct_n300
Device: n300
Max Context Limit: 128000
Max Concurrency Limit: 32
Max Context Length: 4096
Validated Combinations: 2
Total Spec Test Parameters: 1
Performance Target Levels: ['theoretical']
===================================

Generated Test Combinations (Multiple Mode)

The workflow automatically generates test combinations based on:

  • Context Sizes: [128000, 64000, 12800] (max, 50%, 10%)
  • Output Sizes: [128, 64] (fixed)
  • Input Sizes: Calculated as context_size - output_size
  • Concurrency: [1, 2, 16] (baseline, small, mid-scale - max concurrency temporarily disabled)
  • Prompt Counts: [1, match_concurrency] (baseline, load - 5x concurrency temporarily disabled)
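For example, with a max context of 128000 tokens and an output size of 128, the paired input size is 128000 - 128 = 127872 tokens; at the ~10% context size of 12800 with an output size of 64, the input size is 12736 tokens.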

Performance Metrics

Each test measures:

  • TTFT (Time to First Token): Latency to first response token
  • Throughput: Tokens per second
  • Request Rate: Requests per second
  • Error Rates: Failed requests percentage
  • Resource Utilization: Memory, compute usage
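A rough sketch of how the latency and throughput figures can be derived from per-request timestamps; the RequestTiming container here is illustrative, and the actual benchmarking script may record different fields:

from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float          # request submission time (seconds)
    first_token: float    # time the first token arrived
    end: float            # time of the last token
    output_tokens: int    # tokens generated for this request

def summarize(timings: list[RequestTiming]) -> dict:
    """Compute headline metrics from per-request timings (illustrative)."""
    ttft_ms = [1000 * (t.first_token - t.start) for t in timings]
    total_tokens = sum(t.output_tokens for t in timings)
    wall_time = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        "mean_ttft_ms": sum(ttft_ms) / len(ttft_ms),
        "throughput_tok_s": total_tokens / wall_time,
        "request_rate_rps": len(timings) / wall_time,
    }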

Integration with Model Configurations

The spec_tests workflow automatically integrates with model configurations to:

  1. Extract Performance Targets: Reads theoretical and reference performance targets
  2. Validate Parameter Bounds: Ensures tests stay within model constraints
  3. Use Validated Combinations: Incorporates pre-tested parameter sets
  4. Apply Device Constraints: Respects device-specific limitations

Example Model Config Integration

# From model_performance_reference.json
"Llama-3.1-8B": {
    "n300": [
        {
            "isl": 128,
            "osl": 128, 
            "max_concurrency": 1,
            "num_prompts": 8,
            "targets": {
                "theoretical": {
                    "ttft_ms": 16,
                    "tput_user": 66
                }
            }
        }
    ]
}
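A sketch of how measured results might be checked against a reference entry like the one above; the function and the measured-metrics dict are illustrative, not the PR's actual report logic:

import json

def check_against_targets(measured: dict, entry: dict, level: str = "theoretical") -> dict:
    """Compare measured metrics to one reference entry's targets (illustrative)."""
    targets = entry["targets"][level]
    return {
        "ttft_pass": measured["ttft_ms"] <= targets["ttft_ms"],      # lower latency is better
        "tput_pass": measured["tput_user"] >= targets["tput_user"],  # higher per-user throughput is better
    }

# with open("model_performance_reference.json") as f:
#     reference = json.load(f)
# check_against_targets({"ttft_ms": 14.2, "tput_user": 70.1},
#                       reference["Llama-3.1-8B"]["n300"][0])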

Benefits

For Development Teams

  • Early Detection: Catch performance regressions before deployment
  • Parameter Optimization: Find optimal configurations for different use cases
  • Load Planning: Understand capacity limits and scaling characteristics

For QA/Testing

  • Comprehensive Coverage: Test all valid parameter combinations automatically
  • Reproducible Results: Consistent test parameters and environments
  • Performance Baselines: Establish and track performance targets

For DevOps/Infrastructure

  • Capacity Planning: Understand resource requirements at scale
  • SLA Validation: Verify models meet performance commitments
  • Deployment Readiness: Comprehensive pre-deployment validation

Technical Implementation

Key Components

  1. SpecTestParamSpace: Generates parameter combinations based on model constraints
  2. SpecTestTask: Orchestrates parameter generation for single/multiple modes
  3. SpecTestPrompt: Creates test prompts with specified token characteristics
  4. SpecTestRun: Executes benchmarks and collects performance metrics
  5. SpecTestsEnvVars: Manages environment configuration and secrets

Workflow Integration

  • Integrates with existing run.py CLI interface
  • Uses standard workflow patterns (docker-server, logging, etc.)
  • Supports all workflow arguments and overrides
  • Compatible with existing model configuration system

This spec_tests workflow provides a robust foundation for model validation and performance testing, ensuring models meet quality and performance standards before deployment to production environments.

@stisiTT stisiTT requested a review from tstescoTT April 7, 2025 20:10
@stisiTT stisiTT self-assigned this Apr 7, 2025
@stisiTT stisiTT linked an issue Apr 7, 2025 that may be closed by this pull request
@stisiTT stisiTT requested a review from mvanniasingheTT April 7, 2025 20:11
@tstescoTT tstescoTT (Contributor) left a comment

Thanks for making the PR, here are my general comments:

  1. I think the tests should use the utils prompt client instead of the benchmarking tooling for making the client calls, so we have more fine-grained control over the prompts. The prompt params can use the same format, e.g. a subset of https://github.com/tenstorrent/tt-inference-server/blob/tstesco/benchmarks-targets/workflows/utils.py#L228
  2. Using model_config max_concurrency_map and max_context_map to define the prompt params is good; that should stay as is.
  3. The tests section of the release report needs to show a clear pass/fail for each test, for example the Performance Benchmark Targets section in #215 (comment). For most tests, passing would mean not crashing or hanging, using the hang-detection implementation tracked in #212.
  4. The raw data from the tests will also need to be saved to a test run file for report generation.

Let me know what you think.

@stisiTT stisiTT (Author) commented Apr 25, 2025

Thanks! Will start making these changes on my next round of availability.

@tstescoTT tstescoTT (Contributor)

Discussed that addressing #213 here is likely a requirement so that the tests are reliably hitting the boundaries of model context.

tstescoTT and others added 21 commits May 22, 2025 14:06
* remove redundant model_name that can be inferred

* remove redundant run.sh script

* use model_config in setup_hosts.py, reduce SetupConfig size

* refactor workflows/utils.py to handle static functions and make imports easier

* WIP run_docker.py

* handle meta evals having different evals per version
Next steps should be to move everything to one file after rebasing onto dev.
Next steps are to tidy up the code for readability
That last commit message was perfectly cromulent
@stisiTT stisiTT removed the request for review from mvanniasingheTT June 19, 2025 16:50
@stisiTT stisiTT added the enhancement New feature or request label Jun 19, 2025
…andling

- Remove unused variable `ttft` and replace it with a flag to track the first chunk received.
- Update token processing logic to handle cases where the text may be empty.
- Ensure success status is set based on whether a valid chunk was received.
- Adjust latency calculation to use the most recent timestamp.
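A minimal sketch of the pattern this commit message describes, using a first-chunk flag rather than a dedicated ttft variable; the chunk iterable is assumed to come from the streaming client, and this is not the actual client code:

import time

def consume_stream(chunks) -> dict:
    """Track first-chunk arrival, success, and latency for a streamed response (illustrative)."""
    start = time.perf_counter()
    first_chunk_time = None
    last_chunk_time = start
    text_parts = []
    for chunk in chunks:                  # chunks: iterable of str from the streaming API (assumed)
        now = time.perf_counter()
        if first_chunk_time is None:
            first_chunk_time = now        # flag-style tracking of the first chunk received
        if chunk:                         # tolerate chunks whose text is empty
            text_parts.append(chunk)
        last_chunk_time = now
    return {
        "success": first_chunk_time is not None and bool(text_parts),
        "ttft_s": (first_chunk_time - start) if first_chunk_time else None,
        "latency_s": last_chunk_time - start,  # latency based on the most recent timestamp
    }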
@stisiTT stisiTT dismissed tstescoTT’s stale review June 26, 2025 14:27

Review is 2 months old and many of the issues raised here have been addressed in the updated scope

# Conflicts:
#	workflows/run_reports.py
@stisiTT stisiTT changed the title Model Readiness Tests for Prolonged and High-Demand Use Model Readiness Spec Tests; for Prolonged and High-Demand Use Jul 8, 2025
stisiTT added 5 commits July 28, 2025 14:51
Updated `generate_cleaned_random_prompts_using_server` to support both server-side and client-side tokenization, improving efficiency by preloading the tokenizer.

Adjusted `generate_stable_prompt_tokens` to accept a preloaded tokenizer, ensuring consistent detokenization based on the selected method.
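As a sketch of the encode/decode stabilization idea with a preloaded Hugging Face tokenizer; this assumes the transformers library and a hypothetical target_len parameter, and is not the actual utility code:

from transformers import AutoTokenizer

def generate_stable_prompt_tokens(text: str, tokenizer, target_len: int) -> list[int]:
    """Run an encode/decode cycle so the token count stays stable on re-encoding (illustrative)."""
    tokens = tokenizer.encode(text, add_special_tokens=False)[:target_len]
    decoded = tokenizer.decode(tokens)
    # Re-encode the decoded text; tokenizations that round-trip cleanly keep their length.
    return tokenizer.encode(decoded, add_special_tokens=False)

# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # client-side path
# stable = generate_stable_prompt_tokens(raw_text, tokenizer, target_len=1024)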
# Conflicts:
#	run.py
#	workflows/run_reports.py
#	workflows/run_workflows.py
…dated argument parsing to require model specification from JSON and adjusted related classes to accommodate this change. Removed deprecated get_model_id function and streamlined model configuration handling across various components.
@stisiTT stisiTT requested a review from Copilot August 13, 2025 19:10
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces the comprehensive Spec Tests workflow for validating model readiness and performance across different parameter configurations. The workflow provides both single-parameter testing and exhaustive multi-parameter matrix testing with performance validation against established targets.

  • Adds complete spec_tests workflow implementation with parameter space generation and benchmarking capabilities
  • Introduces new workflow types, venv configurations, and report generation for spec tests
  • Implements server-side tokenization utilities and cleaned prompt generation for stable testing
  • Temporarily disables most model specifications to focus testing on primary configurations

Reviewed Changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.

Summary per file:

| File | Description |
|---|---|
| workflows/workflow_venvs.py | Adds venv setup for spec_tests workflow with required dependencies |
| workflows/workflow_types.py | Defines SPEC_TESTS workflow type and associated venv configurations |
| workflows/workflow_config.py | Registers spec_tests workflow configuration and integration |
| workflows/run_workflows.py | Updates workflow orchestration to support spec_tests workflow type |
| workflows/run_reports.py | Adds spec test report generation and markdown table formatting |
| workflows/model_spec.py | Temporarily comments out most model specifications for focused testing |
| utils/prompt_configs.py | Adds TODO comment about authorization bug in default model configuration |
| utils/prompt_client.py | Implements server-side tokenization/detokenization API endpoints |
| utils/cleaned_prompt_generation.py | New utility for generating stable prompt tokens with encoding/decoding cycles |
| tests/tests_config.py | Comprehensive parameter space generation logic for test configurations |
| tests/test_workflows.py | Updates workflow tests to include SPEC_TESTS workflow validation |
| tests/run_tests.py | Main test runner with argument parsing and model validation |
| tests/__init__.py | Exports Tests class for workflow integration |
| spec_tests/ | Complete spec_tests implementation with config, benchmarking, and execution logic |
| run.py | Adds CLI arguments for run mode, context length, and endurance testing |


concurrency_values.append(half_concurrency)

# Add max concurrency for stress testing
# TEMPORARILY DISABLED: concurrency_values.append(self.max_concurrency_limit)

Copilot AI Aug 13, 2025


The comment about temporarily disabling max concurrency should include a reason why it's disabled and when it should be re-enabled.


# The actual prompt values will be determined during cross product generation
# based on the specific concurrency value in each combination
# For now, return placeholders that indicate the pattern
# TEMPORARILY DISABLED 5x concurrency: return [1, -1, -5] # -1 = match concurrency, -5 = 5x concurrency

Copilot AI Aug 13, 2025


Similar to above, this temporary disabling comment should explain the reason for disabling 5x concurrency testing and the conditions for re-enabling it.

Suggested change
# TEMPORARILY DISABLED 5x concurrency: return [1, -1, -5] # -1 = match concurrency, -5 = 5x concurrency
# 5x concurrency testing is temporarily disabled due to known instability and excessive resource usage
# (see issue #1234). Re-enable by adding -5 back to the return list once the underlying issues are resolved
# and the system can reliably handle high concurrency stress tests.
# Previously: return [1, -1, -5] # -1 = match concurrency, -5 = 5x concurrency


…revent overflow and update workflow execution to include spec tests. Adjusted benchmark script path for consistency and updated test assertions to reflect the new workflow structure.
# Conflicts:
#	workflows/model_spec.py
Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues: Model Spec tests