
Model Readiness Spec Tests; for Prolonged and High-Demand Use #193


Open
wants to merge 103 commits into base: dev

Conversation


@stisiTT stisiTT commented Apr 7, 2025

PR: Add Spec Tests Workflow - Comprehensive Model Readiness Testing

Overview

This PR introduces the Spec Tests workflow, a comprehensive testing framework designed to validate model readiness and performance across different parameter configurations. The spec_tests workflow provides both targeted single-parameter testing and exhaustive multi-parameter matrix testing to ensure models meet performance specifications before deployment.

What is the Spec Tests Workflow?

The Spec Tests workflow is a model validation system that:

  • Tests model performance across various input/output token configurations
  • Validates concurrency handling at different load levels
  • Measures performance metrics like TTFT (Time to First Token) and throughput
  • Compares against performance targets defined in model configurations
  • Generates comprehensive test reports for model readiness assessment

Key Features

🎯 Two Testing Modes:

  • Single Mode: Test specific parameter combinations with explicit control
  • Multiple Mode: Comprehensive cross-product matrix testing across all valid parameter combinations

🔧 Flexible Parameter Control:

  • Input/output token lengths
  • Concurrency levels (batch sizes)
  • Number of prompts (user simulation)
  • Context length limits
  • Custom parameter combinations via workflow-args

📊 Performance Validation:

  • Automatic extraction of performance targets from model configs
  • Comparison against theoretical and reference benchmarks
  • Support for different target levels (theoretical, reference, etc.)

⏱️ Endurance Testing:

  • 24-hour continuous testing mode for stability validation
  • Stress testing with high concurrency and prompt volumes
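For illustration, a minimal sketch of what a time-bounded endurance loop amounts to; the constant and the run_single_spec_test callable below are hypothetical, not the actual implementation behind --endurance-mode:

import time

ENDURANCE_DURATION_S = 24 * 60 * 60  # illustrative 24-hour window

def run_endurance(run_single_spec_test, params) -> int:
    """Repeat a spec test until the endurance window elapses; return completed iterations."""
    deadline = time.monotonic() + ENDURANCE_DURATION_S
    iterations = 0
    while time.monotonic() < deadline:
        run_single_spec_test(params)  # hypothetical callable that runs one spec test pass
        iterations += 1
    return iterations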

How It Works

Architecture

spec_tests/
├── run_spec_tests.py           # Main entry point
├── spec_tests_config.py        # Parameter space generation
├── spec_tests_benchmarking_script.py  # Benchmarking engine
└── run/
    ├── spec_tests.py           # Core workflow orchestrator
    ├── spec_test_tasks.py      # Parameter generation logic
    ├── spec_test_prompt.py     # Prompt generation
    ├── spec_test_run.py        # Test execution
    └── spec_tests_env_vars.py  # Environment configuration

Parameter Generation Logic

Single Mode:

  • Uses explicit parameters from command line arguments
  • Calculates input/output sizes based on max_context_length
  • Defaults: max_context_length=8192, max_concurrent=1, num_prompts=1

Multiple Mode:

  • Generates cross-product of all valid parameter combinations
  • Uses model-specific constraints (max_context, max_concurrency)
  • Includes validated combinations from performance reference data
  • Context sizes: max, ~50%, ~10% of model's max_context
  • Output sizes: 128, 64 tokens (fixed)
  • Concurrency: 1, 2, ~50% of max, validated values (max concurrency temporarily disabled)
  • Prompt patterns: single (1), load-matched (=concurrency) (5x concurrency temporarily disabled)
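As a rough sketch of the multiple-mode generation just described (a cross-product over context sizes, output sizes, concurrency, and prompt patterns), assuming the model constraints are available as plain integers; this is illustrative, not the exact implementation:

from itertools import product

def generate_combinations(max_context: int, max_concurrency: int) -> list[dict]:
    """Illustrative cross-product of spec test parameters."""
    context_sizes = [max_context, max_context // 2, max_context // 10]  # max, ~50%, ~10%
    output_sizes = [128, 64]                                            # fixed output lengths
    concurrencies = sorted({1, 2, max(1, max_concurrency // 2)})        # max concurrency disabled for now
    combos = []
    for ctx, osl, conc in product(context_sizes, output_sizes, concurrencies):
        for prompts in sorted({1, conc}):                               # single prompt and load-matched
            combos.append({
                "input_size": ctx - osl,  # input fills the rest of the context window
                "output_size": osl,
                "max_concurrent": conc,
                "num_prompts": prompts,
            })
    return combos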

Usage Examples

Basic Usage

Single Parameter Test

Test a specific configuration with explicit parameters:

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --run-mode single --docker-server

Custom Context Length

Test with a specific context length:

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --run-mode single --max-context-length 4096 --docker-server

Load Testing with Concurrency

Test with specific concurrency and prompt count:

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --run-mode single --max-context-length 2048 --workflow-args "max_concurrent=16 num_prompts=64" --docker-server

Comprehensive Matrix Testing

Run all valid parameter combinations:

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --run-mode multiple --docker-server

Advanced Usage

High-Load Stress Testing

python run.py --model "Qwen2.5-72B-Instruct" --workflow spec_tests --device t3k --run-mode single --max-context-length 8192 --workflow-args "max_concurrent=32 num_prompts=160" --docker-server

Custom Input/Output Sizes

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --run-mode single --workflow-args "input_size=1024 output_size=512 max_concurrent=8 num_prompts=24" --docker-server

Endurance Testing (24 hours)

python run.py --model "Llama-3.1-8B-Instruct" --workflow spec_tests --device n300 --endurance-mode --docker-server

Model-Specific Examples

Large Model Testing (70B+)

# Qwen2.5-72B on T3K with reduced context for faster testing
python run.py --model "Qwen2.5-72B-Instruct" --workflow spec_tests --device t3k --run-mode single --max-context-length 2048 --workflow-args "max_concurrent=16 num_prompts=32" --docker-server

# Full parameter matrix for comprehensive validation
python run.py --model "Qwen2.5-72B-Instruct" --workflow spec_tests --device t3k --run-mode multiple --docker-server

Parameter Reference

Core Parameters

| Parameter | Description | Default | Example |
|---|---|---|---|
| `--run-mode` | Testing mode: `single` or `multiple` | `multiple` | `--run-mode single` |
| `--max-context-length` | Maximum context length in tokens | `8192` | `--max-context-length 4096` |
| `--endurance-mode` | Run continuously for 24 hours | `false` | `--endurance-mode` |

Workflow Args (via --workflow-args)

| Parameter | Description | Default | Example |
|---|---|---|---|
| `max_concurrent` | Concurrent requests (batch size) | `1` | `max_concurrent=16` |
| `num_prompts` | Total number of prompts (users) | `1` | `num_prompts=64` |
| `input_size` | Input token length | auto-calculated | `input_size=1024` |
| `output_size` | Output token length | `128` | `output_size=256` |

Usage Patterns

Single Test: max_concurrent=1 num_prompts=1
Load Test: max_concurrent=16 num_prompts=16
Stress Test: max_concurrent=32 num_prompts=160
Burst Test: max_concurrent=8 num_prompts=100
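For reference, a space-separated key=value string like the ones above could be parsed along these lines; this is a sketch, not the actual run.py argument handling:

def parse_workflow_args(raw: str) -> dict:
    """Parse e.g. 'max_concurrent=16 num_prompts=64' into typed values (illustrative)."""
    args = {}
    for pair in raw.split():
        key, _, value = pair.partition("=")
        args[key] = int(value) if value.isdigit() else value
    return args

# parse_workflow_args("max_concurrent=32 num_prompts=160")
# -> {'max_concurrent': 32, 'num_prompts': 160}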

Output and Reports

Test Execution Output

=== Parameter Space Information ===
Model ID: id_tt-transformers_Llama-3.1-8B-Instruct_n300
Device: n300
Max Context Limit: 128000
Max Concurrency Limit: 32
Max Context Length: 4096
Validated Combinations: 2
Total Spec Test Parameters: 1
Performance Target Levels: ['theoretical']
===================================

Generated Test Combinations (Multiple Mode)

The workflow automatically generates test combinations based on:

  • Context Sizes: [128000, 64000, 12800] (max, 50%, 10%)
  • Output Sizes: [128, 64] (fixed)
  • Input Sizes: Calculated as context_size - output_size
  • Concurrency: [1, 2, 16] (baseline, small, mid-scale - max concurrency temporarily disabled)
  • Prompt Counts: [1, match_concurrency] (baseline, load - 5x concurrency temporarily disabled)
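For example, with a max context of 128000 tokens and an output size of 128, the paired input size is 128000 - 128 = 127872 tokens; at the ~10% context size of 12800 with an output size of 64, the input size is 12736 tokens.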

Performance Metrics

Each test measures:

  • TTFT (Time to First Token): Latency to first response token
  • Throughput: Tokens per second
  • Request Rate: Requests per second
  • Error Rates: Failed requests percentage
  • Resource Utilization: Memory, compute usage
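A rough sketch of how the latency and throughput figures can be derived from per-request timestamps; the RequestTiming container here is illustrative, and the actual benchmarking script may record different fields:

from dataclasses import dataclass

@dataclass
class RequestTiming:
    start: float          # request submission time (seconds)
    first_token: float    # time the first token arrived
    end: float            # time of the last token
    output_tokens: int    # tokens generated for this request

def summarize(timings: list[RequestTiming]) -> dict:
    """Compute headline metrics from per-request timings (illustrative)."""
    ttft_ms = [1000 * (t.first_token - t.start) for t in timings]
    total_tokens = sum(t.output_tokens for t in timings)
    wall_time = max(t.end for t in timings) - min(t.start for t in timings)
    return {
        "mean_ttft_ms": sum(ttft_ms) / len(ttft_ms),
        "throughput_tok_s": total_tokens / wall_time,
        "request_rate_rps": len(timings) / wall_time,
    }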

Integration with Model Configurations

The spec_tests workflow automatically integrates with model configurations to:

  1. Extract Performance Targets: Reads theoretical and reference performance targets
  2. Validate Parameter Bounds: Ensures tests stay within model constraints
  3. Use Validated Combinations: Incorporates pre-tested parameter sets
  4. Apply Device Constraints: Respects device-specific limitations

Example Model Config Integration

# From model_performance_reference.json
"Llama-3.1-8B": {
    "n300": [
        {
            "isl": 128,
            "osl": 128, 
            "max_concurrency": 1,
            "num_prompts": 8,
            "targets": {
                "theoretical": {
                    "ttft_ms": 16,
                    "tput_user": 66
                }
            }
        }
    ]
}
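A sketch of how measured results might be checked against a reference entry like the one above; the function and the measured-metrics dict are illustrative, not the PR's actual report logic:

import json

def check_against_targets(measured: dict, entry: dict, level: str = "theoretical") -> dict:
    """Compare measured metrics to one reference entry's targets (illustrative)."""
    targets = entry["targets"][level]
    return {
        "ttft_pass": measured["ttft_ms"] <= targets["ttft_ms"],      # lower latency is better
        "tput_pass": measured["tput_user"] >= targets["tput_user"],  # higher per-user throughput is better
    }

# with open("model_performance_reference.json") as f:
#     reference = json.load(f)
# check_against_targets({"ttft_ms": 14.2, "tput_user": 70.1},
#                       reference["Llama-3.1-8B"]["n300"][0])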

Benefits

For Development Teams

  • Early Detection: Catch performance regressions before deployment
  • Parameter Optimization: Find optimal configurations for different use cases
  • Load Planning: Understand capacity limits and scaling characteristics

For QA/Testing

  • Comprehensive Coverage: Test all valid parameter combinations automatically
  • Reproducible Results: Consistent test parameters and environments
  • Performance Baselines: Establish and track performance targets

For DevOps/Infrastructure

  • Capacity Planning: Understand resource requirements at scale
  • SLA Validation: Verify models meet performance commitments
  • Deployment Readiness: Comprehensive pre-deployment validation

Technical Implementation

Key Components

  1. SpecTestParamSpace: Generates parameter combinations based on model constraints
  2. SpecTestTask: Orchestrates parameter generation for single/multiple modes
  3. SpecTestPrompt: Creates test prompts with specified token characteristics
  4. SpecTestRun: Executes benchmarks and collects performance metrics
  5. SpecTestsEnvVars: Manages environment configuration and secrets

Workflow Integration

  • Integrates with existing run.py CLI interface
  • Uses standard workflow patterns (docker-server, logging, etc.)
  • Supports all workflow arguments and overrides
  • Compatible with existing model configuration system

This spec_tests workflow provides a robust foundation for model validation and performance testing, ensuring models meet quality and performance standards before deployment to production environments.

@stisiTT stisiTT requested a review from tstescoTT April 7, 2025 20:10
@stisiTT stisiTT self-assigned this Apr 7, 2025
@stisiTT stisiTT linked an issue Apr 7, 2025 that may be closed by this pull request
@stisiTT stisiTT requested a review from mvanniasingheTT April 7, 2025 20:11
@tstescoTT tstescoTT (Contributor) left a comment

Thanks for making the PR, here are my general comments:

  1. I think the tests should use the utils prompt client instead of the benchmarking tooling for making the client calls, so we have more fine-grained control over the prompts. The prompt params can use the same format, e.g. a subset of https://github.com/tenstorrent/tt-inference-server/blob/tstesco/benchmarks-targets/workflows/utils.py#L228
  2. Using model_config max_concurrency_map and max_context_map to define the prompt params is good; that should stay as is.
  3. The tests section of the release report needs to show a clear pass/fail for each test, for example the Performance Benchmark Targets section in #215 (comment). For most tests, passing would mean not crashing or hanging, using the hang-detection implementation tracked in #212.
  4. The raw data from the tests will also need to be saved to a test run file for report generation.

Let me know what you think.

@stisiTT stisiTT (Author) commented Apr 25, 2025

Thanks! Will start making these changes on my next round of availability.

@tstescoTT tstescoTT (Contributor)

Discussed that addressing #213 here is likely a requirement so that the tests are reliably hitting the boundaries of model context.

tstescoTT and others added 21 commits May 22, 2025 14:06
* remove redundant model_name that can be inferred

* remove redundant run.sh script

* use model_config in setup_hosts.py, reduce SetupConfig size

* refactor workflows/utils.py to handle static functions and make imports easier

* WIP run_docker.py

* handle meta evals having different evals per version
Next steps should be to move everything to one file after rebasing onto dev.
Next steps are to tidy up the code for readability
That last commit message was perfectly cromulent
@stisiTT stisiTT removed the request for review from mvanniasingheTT June 19, 2025 16:50
@stisiTT stisiTT added the enhancement New feature or request label Jun 19, 2025
…andling

- Remove unused variable `ttft` and replace it with a flag to track the first chunk received.
- Update token processing logic to handle cases where the text may be empty.
- Ensure success status is set based on whether a valid chunk was received.
- Adjust latency calculation to use the most recent timestamp.
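A minimal sketch of the pattern this commit message describes, using a first-chunk flag rather than a dedicated ttft variable; the chunk iterable is assumed to come from the streaming client, and this is not the actual client code:

import time

def consume_stream(chunks) -> dict:
    """Track first-chunk arrival, success, and latency for a streamed response (illustrative)."""
    start = time.perf_counter()
    first_chunk_time = None
    last_chunk_time = start
    text_parts = []
    for chunk in chunks:                  # chunks: iterable of str from the streaming API (assumed)
        now = time.perf_counter()
        if first_chunk_time is None:
            first_chunk_time = now        # flag-style tracking of the first chunk received
        if chunk:                         # tolerate chunks whose text is empty
            text_parts.append(chunk)
        last_chunk_time = now
    return {
        "success": first_chunk_time is not None and bool(text_parts),
        "ttft_s": (first_chunk_time - start) if first_chunk_time else None,
        "latency_s": last_chunk_time - start,  # latency based on the most recent timestamp
    }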
@stisiTT stisiTT dismissed tstescoTT’s stale review June 26, 2025 14:27

Review is 2 months old and many of the issues raised here have been addressed in the updated scope

# Conflicts:
#	workflows/run_reports.py
@stisiTT stisiTT changed the title Model Readiness Tests for Prolonged and High-Demand Use Model Readiness Spec Tests; for Prolonged and High-Demand Use Jul 8, 2025
stisiTT added 5 commits July 28, 2025 14:51
Updated `generate_cleaned_random_prompts_using_server` to support both server-side and client-side tokenization, improving efficiency by preloading the tokenizer.

Adjusted `generate_stable_prompt_tokens` to accept a preloaded tokenizer, ensuring consistent detokenization based on the selected method.
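As a sketch of the encode/decode stabilization idea with a preloaded Hugging Face tokenizer; this assumes the transformers library and a hypothetical target_len parameter, and is not the actual utility code:

from transformers import AutoTokenizer

def generate_stable_prompt_tokens(text: str, tokenizer, target_len: int) -> list[int]:
    """Run an encode/decode cycle so the token count stays stable on re-encoding (illustrative)."""
    tokens = tokenizer.encode(text, add_special_tokens=False)[:target_len]
    decoded = tokenizer.decode(tokens)
    # Re-encode the decoded text; tokenizations that round-trip cleanly keep their length.
    return tokenizer.encode(decoded, add_special_tokens=False)

# tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # client-side path
# stable = generate_stable_prompt_tokens(raw_text, tokenizer, target_len=1024)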
# Conflicts:
#	run.py
#	workflows/run_reports.py
#	workflows/run_workflows.py
…dated argument parsing to require model specification from JSON and adjusted related classes to accommodate this change. Removed deprecated get_model_id function and streamlined model configuration handling across various components.
@stisiTT stisiTT requested a review from Copilot August 13, 2025 19:10
@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces the comprehensive Spec Tests workflow for validating model readiness and performance across different parameter configurations. The workflow provides both single-parameter testing and exhaustive multi-parameter matrix testing with performance validation against established targets.

  • Adds complete spec_tests workflow implementation with parameter space generation and benchmarking capabilities
  • Introduces new workflow types, venv configurations, and report generation for spec tests
  • Implements server-side tokenization utilities and cleaned prompt generation for stable testing
  • Temporarily disables most model specifications to focus testing on primary configurations

Reviewed Changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 6 comments.

Summary per file:

| File | Description |
|---|---|
| workflows/workflow_venvs.py | Adds venv setup for spec_tests workflow with required dependencies |
| workflows/workflow_types.py | Defines SPEC_TESTS workflow type and associated venv configurations |
| workflows/workflow_config.py | Registers spec_tests workflow configuration and integration |
| workflows/run_workflows.py | Updates workflow orchestration to support spec_tests workflow type |
| workflows/run_reports.py | Adds spec test report generation and markdown table formatting |
| workflows/model_spec.py | Temporarily comments out most model specifications for focused testing |
| utils/prompt_configs.py | Adds TODO comment about authorization bug in default model configuration |
| utils/prompt_client.py | Implements server-side tokenization/detokenization API endpoints |
| utils/cleaned_prompt_generation.py | New utility for generating stable prompt tokens with encoding/decoding cycles |
| tests/tests_config.py | Comprehensive parameter space generation logic for test configurations |
| tests/test_workflows.py | Updates workflow tests to include SPEC_TESTS workflow validation |
| tests/run_tests.py | Main test runner with argument parsing and model validation |
| tests/__init__.py | Exports Tests class for workflow integration |
| spec_tests/ | Complete spec_tests implementation with config, benchmarking, and execution logic |
| run.py | Adds CLI arguments for run mode, context length, and endurance testing |


concurrency_values.append(half_concurrency)

# Add max concurrency for stress testing
# TEMPORARILY DISABLED: concurrency_values.append(self.max_concurrency_limit)

Copilot AI Aug 13, 2025


The comment about temporarily disabling max concurrency should include a reason why it's disabled and when it should be re-enabled.


# The actual prompt values will be determined during cross product generation
# based on the specific concurrency value in each combination
# For now, return placeholders that indicate the pattern
# TEMPORARILY DISABLED 5x concurrency: return [1, -1, -5] # -1 = match concurrency, -5 = 5x concurrency

Copilot AI Aug 13, 2025


Similar to above, this temporary disabling comment should explain the reason for disabling 5x concurrency testing and the conditions for re-enabling it.

Suggested change
# TEMPORARILY DISABLED 5x concurrency: return [1, -1, -5] # -1 = match concurrency, -5 = 5x concurrency
# 5x concurrency testing is temporarily disabled due to known instability and excessive resource usage
# (see issue #1234). Re-enable by adding -5 back to the return list once the underlying issues are resolved
# and the system can reliably handle high concurrency stress tests.
# Previously: return [1, -1, -5] # -1 = match concurrency, -5 = 5x concurrency


…revent overflow and update workflow execution to include spec tests. Adjusted benchmark script path for consistency and updated test assertions to reflect the new workflow structure.
# Conflicts:
#	workflows/model_spec.py
Labels: enhancement (New feature or request)

Successfully merging this pull request may close these issues: Model Spec tests