πŸš€ Starting Agent Loop Tool Efficiency Test

#4
by bukit - opened

TLDR: UD-Q5_K_XL (reasoning on) for the win.

πŸ“Š Configuration:
Base URL: http://localhost:9099/v1
Model: gemma-4-E4B-it-UD-Q8_K_XL_1_65536_off
Test Cases: 17
Output: results/agent_test_results_gemma-4-E4B-it-UD-Q8_K_XL_1_65536_off_20260403_200236.json
Log File: logs/agent_test_logs_gemma-4-E4B-it-UD-Q8_K_XL_1_65536_off_20260403_200236.log

πŸ”„ Running agent tests...
Starting agent test suite with 17 test cases
Running agent test: zero_capabilities
Running agent test: zero_thank_you
Running agent test: zero_greeting
Running agent test: simple_view_cart
Running agent test: medium_search_category_and_add
Running agent test: simple_remove_product
Running agent test: simple_checkout
Running agent test: medium_search_and_add
Running agent test: zero_general_question
Running agent test: simple_add_iphone
Running agent test: medium_view_and_add
Running agent test: medium_remove_and_add
Running agent test: complex_cart_management
Running agent test: complex_shopping_workflow
Running agent test: complex_gift_shopping
Running agent test: zero_weather_question
Running agent test: simple_search_electronics
βœ… Tests completed in 23.5946444s

πŸ“ˆ Agent Test Results

Total Tests: 17
βœ… Passed: 15
❌ Failed: 2
⏱️ Total LLM Time: 3m58.8556569s
⏱️ Average Time per Request: 5.825747729s

πŸ“‹ Test Case Results:

Test Case: zero_greeting
Status: βœ… PASSED
Matched Path: no_tools
Response Time: 3.3519852s
Tool Calls: 0

Test Case: zero_weather_question
Status: βœ… PASSED
Matched Path: no_tools
Response Time: 4.507745s
Tool Calls: 0

Test Case: zero_general_question
Status: βœ… PASSED
Matched Path: no_tools
Response Time: 5.3664817s
Tool Calls: 0

Test Case: zero_thank_you
Status: βœ… PASSED
Matched Path: no_tools
Response Time: 6.1175078s
Tool Calls: 0

Test Case: simple_remove_product
Status: βœ… PASSED
Matched Path: direct_remove
Response Time: 9.0714225s
Tool Calls: 1
Tools Used: remove_from_cart

Test Case: simple_add_iphone
Status: βœ… PASSED
Matched Path: direct_add
Response Time: 10.262271s
Tool Calls: 1
Tools Used: add_to_cart

Test Case: zero_capabilities
Status: βœ… PASSED
Matched Path: no_tools
Response Time: 10.3196569s
Tool Calls: 0

Test Case: simple_view_cart
Status: βœ… PASSED
Matched Path: view_cart
Response Time: 13.1525835s
Tool Calls: 1
Tools Used: view_cart

Test Case: simple_checkout
Status: βœ… PASSED
Matched Path: direct_checkout
Response Time: 13.786616s
Tool Calls: 1
Tools Used: checkout

Test Case: medium_search_and_add
Status: βœ… PASSED
Matched Path: search_by_query
Response Time: 16.9061233s
Tool Calls: 2
Tools Used: search_products, add_to_cart

Test Case: simple_search_electronics
Status: βœ… PASSED
Matched Path: search_by_category
Response Time: 16.9057788s
Tool Calls: 1
Tools Used: search_products

Test Case: medium_view_and_add
Status: βœ… PASSED
Matched Path: view_then_add
Response Time: 17.8242001s
Tool Calls: 2
Tools Used: view_cart, add_to_cart

Test Case: medium_search_category_and_add
Status: βœ… PASSED
Matched Path: search_then_add
Response Time: 20.9945804s
Tool Calls: 2
Tools Used: search_products, add_to_cart

Test Case: complex_shopping_workflow
Status: ❌ FAILED
Response Time: 22.1721982s
Tool Calls: 5
Tools Used: search_products, add_to_cart, add_to_cart, view_cart, checkout

Test Case: complex_gift_shopping
Status: ❌ FAILED
Response Time: 22.2444344s
Tool Calls: 5
Tools Used: search_products, add_to_cart, search_products, add_to_cart, view_cart

Test Case: medium_remove_and_add
Status: βœ… PASSED
Matched Path: remove_then_add
Response Time: 22.5384086s
Tool Calls: 2
Tools Used: remove_from_cart, add_to_cart

Test Case: complex_cart_management
Status: βœ… PASSED
Matched Path: cart_organization
Response Time: 23.5923777s
Tool Calls: 3
Tools Used: view_cart, remove_from_cart, add_to_cart

❌ Failed Tests Details:

Test Case: complex_shopping_workflow
Expected Tool Variants: 4
Variant 1 (full_workflow_with_iphone): 4 tools
Variant 2 (full_workflow_with_headphones): 4 tools
Variant 3 (full_workflow_with_headphones_and_iphone): 5 tools
Variant 4 (full_workflow_with_iphone_and_headphones): 5 tools
Actual Tool Calls: 5
1. search_products
2. add_to_cart
3. add_to_cart
4. view_cart
5. checkout
Response Time: 22.1721982s

Test Case: complex_gift_shopping
Expected Tool Variants: 2
Variant 1 (gift_shopping_workflow): 5 tools
Variant 2 (gift_shopping_workflow): 5 tools
Actual Tool Calls: 5
1. search_products
2. add_to_cart
3. search_products
4. add_to_cart
5. view_cart
Response Time: 22.2444344s

πŸ“Š Overall Success Rate: 88.24% // https://github.com/docker/model-test/

UD-Q8_K_XL // 45 tokens/s on RTX 3060 12GB
Response Time: 22.2444344s (reasoning off)

UD-Q5_K_XL // 60 tokens/s on RTX 3060 12GB
Response Time: 15.9990973s (reasoning off)
Response Time: 58.9311279s (reasoning on)

llama-server --port 9099 -ngl 99 -fa on -c 65536 --temp 1 --top-k 64 --top-p 0.95 --jinja -m X:\path\to\gemma-4-E4B-it-UD-Q8_K_XL.gguf --mmproj X:\path\to\mmproj-gemma-4-E4B-it-UD-F16.gguf --reasoning off

UD-Q8_K_XL Kinda slow small model, did great with function calling, bad in "guessing" when given image input, it biased to safe answer, not "brave"/"confidence" enough. (reasoning off).

UD-Q5_K_XL Balanced, fast t/s, support context window up to 128k (32-64 RAM) and great FC, analyzing image still sucks (reasoning off). Reasoning ON recommended, consistent t/s, didn't overthink, better image analyzing.

Sign up or log in to comment