1. Executive Decision
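The user's hard requirement is met: every `line::interface` dimension was tested at least 100 times. This is no longer an inference; the aggregate data verifies it directly.
- CN's main problem is that write and creative calls fail almost across the board; many calls "appear to execute but produce no real result".
- COM's main problem is that the public-read path is slow and timeout-prone, and detail/read interfaces repeatedly return unreliable results.
- Naturalness is still insufficient: a new agent handed these skills does not naturally get the first step right, and cannot reliably avoid depending on hidden knowledge.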
2. Coverage
| Coverage Metric | Value |
|---|---|
| Interface dimensions | 50 |
| Minimum attempts per dimension | 100 |
| Maximum attempts per dimension | 294 |
| Dimensions below 100 | 0 |
| Dimensions exactly 100 | 43 |
| Evidence Metric | Value |
|---|---|
| Interface run shards | 20 |
| Naturalness runs | 2 |
| Cases | 237 |
| Attempts | 5,217 |
| Attempts per case avg | 22.013 |
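The 100x interface track is complete, but naturalness still comes from a separate set of 16 cold-start cases. It is not part of the 100x suite, so the report lists it separately to avoid mixing measurement bases.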
The invalid standalone shard `full-com-video-r1` is intentionally excluded from the 100x aggregate because it lacked the setup context required for a meaningful `make_video` attempt.
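The coverage figures above can be re-derived from per-dimension attempt counts. The following is a minimal sketch, assuming the counts are available as a mapping keyed by `line::interface`; `coverage_summary` is a hypothetical helper (not part of the evaluation harness), and the sample holds only four rows copied from the full interface table in section 8.

```python
from typing import Dict


def coverage_summary(attempts: Dict[str, int], floor: int = 100) -> Dict[str, int]:
    """Summarize attempt coverage across line::interface dimensions."""
    counts = list(attempts.values())
    return {
        "dimensions": len(counts),
        "min_attempts": min(counts),
        "max_attempts": max(counts),
        "below_floor": sum(1 for c in counts if c < floor),
        "exactly_floor": sum(1 for c in counts if c == floor),
    }


# Illustrative subset only; the real aggregate has 50 dimensions.
sample = {
    "cn::list_spaces": 100,
    "com::create_adventure_campaign": 101,
    "com::make_video": 294,
    "com::read_collection": 103,
}
summary = coverage_summary(sample)
print(summary)
```

Run against all 50 dimensions, the same summary should reproduce the table values (minimum 100, maximum 294, none below 100).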
3. Scoring Contract
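This evaluation does not check whether the shell exited successfully; it checks whether the target capability was actually completed. An attempt is scored as a non-success if the response is incomplete, a required artifact did not land, or the CLI disguises a failure as a success.
- Effective success: a usable result was actually obtained and the downstream chain can continue.
- False success: the call completed on the surface, but the real artifact or real effect does not exist.
- Timeout: the timeout threshold was reached; the attempt is not a success even if partial output was produced earlier.
- Naturalness: whether a new agent naturally picks the right line, takes the right first step, and understands the interface semantics.
`false_success_count` is a reviewer-level quality flag; it is not the number of attempts whose terminal `result_class` is `false_success`.
A call may end up recorded as `auth_error` or `dependency_missing`, but if its intermediate output misleadingly "looked successful", it is still counted toward false-success.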
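The scoring rules above can be sketched in code. This is only an illustration under stated assumptions: the `Attempt` fields, the `"success"` class name, and the choice to keep timeouts out of the false-success flag are all hypothetical, since the report does not publish the harness schema. Only `auth_error`, `dependency_missing`, and `false_success` appear in the report itself.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    result_class: str        # terminal class, e.g. "auth_error" (schema is assumed)
    timed_out: bool          # reached the timeout threshold
    artifact_present: bool   # the required artifact actually landed
    looked_successful: bool  # some output misleadingly read as a success


def is_effective_success(a: Attempt) -> bool:
    # A clean exit is not enough: the artifact must exist and no timeout occurred.
    return a.result_class == "success" and a.artifact_present and not a.timed_out


def counts_as_false_success(a: Attempt) -> bool:
    # Reviewer-level flag, independent of the terminal result_class.
    # Assumption: timeouts are tracked separately and never flagged here.
    return a.looked_successful and not a.timed_out and not is_effective_success(a)


masked_auth_failure = Attempt("auth_error", False, False, True)
print(is_effective_success(masked_auth_failure))     # False
print(counts_as_false_success(masked_auth_failure))  # True
```

The example shows why `false_success_count` can exceed the number of attempts whose terminal class is `false_success`: the attempt ends as `auth_error` yet is still flagged.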
4. Aggregate Results
4.1 Overall
| Metric | Value |
|---|---|
| Cases | 237 |
| Attempts | 5,217 |
| Effective success count | 2627 |
| Effective success rate | 0.504 |
| Effective success case rate | 0.646 |
| False success count | 936 |
| Timeout count | 106 |
| Retry count total | 2 |
| Latency p50 ms | 1,384 |
| Latency p95 ms | 41,264 |
4.2 By Line
| Line | Attempts | Success Rate | False Success | Timeout | Retry | p95 ms |
|---|---|---|---|---|---|---|
| cn | 2500 | 0.349 | 604 | 1 | 0 | 1,576 |
| com | 2717 | 0.646 | 332 | 105 | 2 | 58,155 |
4.3 High-Level Read
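- `cn` looks fast only because a large share of its failures happen at a very early stage.
- `com` has higher top-level success, but its latency tail is clearly out of control, especially on public-read.
- After 5,217 attempts, the problems are no longer sporadic fluctuations; they are stable, systemic defects.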
4.4 By Category
| Scope | Class | Cases | Attempts | Success Rate | Case Success Rate | False Success | Timeout | Retry | p95 ms |
|---|---|---|---|---|---|---|---|---|---|
| cn::public_read | conditional | 80 | 1400 | 0.624 | 0.825 | 0 | 1 | 0 | 1,657 |
| cn::authenticated_write | fail | 31 | 700 | 0.000 | 0.000 | 404 | 0 | 0 | 1,358 |
| cn::heavy_creative | fail | 16 | 400 | 0.000 | 0.000 | 200 | 0 | 0 | 1,445 |
| com::public_read | fail | 69 | 1407 | 0.547 | 0.739 | 104 | 103 | 2 | 40,017 |
| com::authenticated_write | conditional | 25 | 701 | 0.859 | 0.800 | 3 | 0 | 0 | 3,668 |
| com::heavy_creative | fail | 16 | 609 | 0.627 | 1.000 | 225 | 2 | 0 | 79,543 |
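- `com` authenticated_write is the only slice close to deliverable, but it still does not pass the strict gate.
- `cn` authenticated_write, `cn` heavy_creative, `com` public_read, and `com` heavy_creative cannot be promised externally as safe-to-launch surfaces.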
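As a consistency check, the category rows should roll up to the per-line totals in section 4.2. The sketch below sums the integer columns (attempts, false success, timeout) by line; `roll_up` is a hypothetical helper, and the rows are copied from the category table. Rates are left out on purpose, because the table shows them rounded to three decimals.

```python
from typing import Dict, List, Tuple

# (attempts, false_success, timeout) per category scope.
CATEGORY_ROWS: Dict[str, Tuple[int, int, int]] = {
    "cn::public_read": (1400, 0, 1),
    "cn::authenticated_write": (700, 404, 0),
    "cn::heavy_creative": (400, 200, 0),
    "com::public_read": (1407, 104, 103),
    "com::authenticated_write": (701, 3, 0),
    "com::heavy_creative": (609, 225, 2),
}


def roll_up(rows: Dict[str, Tuple[int, int, int]]) -> Dict[str, Tuple[int, ...]]:
    """Sum category counts into one (attempts, false_success, timeout) tuple per line."""
    totals: Dict[str, List[int]] = {}
    for scope, counts in rows.items():
        line = scope.split("::", 1)[0]
        acc = totals.setdefault(line, [0, 0, 0])
        for i, value in enumerate(counts):
            acc[i] += value
    return {line: tuple(acc) for line, acc in totals.items()}


line_totals = roll_up(CATEGORY_ROWS)
print(line_totals)
```

The sums match the By Line table: `cn` has 2500 attempts, 604 false successes, and 1 timeout; `com` has 2717, 332, and 105.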
5. Interface Hotspots
5.1 Worst Success Rate
| Scope | Category | Attempts | Success Rate | False Success | Timeout | p95 ms |
|---|---|---|---|---|---|---|
| cn::create_adventure_campaign | authenticated_write | 100 | 0.000 | 100 | 0 | 1,305 |
| cn::list_my_adventure_campaigns | authenticated_write | 100 | 0.000 | 100 | 0 | 1,333 |
| cn::list_my_characters | authenticated_write | 100 | 0.000 | 100 | 0 | 1,439 |
| cn::list_my_elementum | authenticated_write | 100 | 0.000 | 100 | 0 | 1,399 |
| cn::make_image | heavy_creative | 100 | 0.000 | 100 | 0 | 1,537 |
| cn::make_song | heavy_creative | 100 | 0.000 | 100 | 0 | 1,424 |
| com::request_interactive_feed | public_read | 100 | 0.000 | 50 | 0 | 3,668 |
| cn::like_collection | authenticated_write | 100 | 0.000 | 4 | 0 | 0 |
| com::read_collection | public_read | 103 | 0.000 | 0 | 103 | 40,025 |
| cn::make_video | heavy_creative | 100 | 0.000 | 0 | 0 | 0 |
5.2 Highest False Success
| Scope | Category | Attempts | Success Rate | False Success | Timeout | p95 ms |
|---|---|---|---|---|---|---|
com::make_video | heavy_creative | 294 | 0.289 | 209 | 0 | 76,941 |
| cn::create_adventure_campaign | authenticated_write | 100 | 0.000 | 100 | 0 | 1,305 |
| cn::list_my_adventure_campaigns | authenticated_write | 100 | 0.000 | 100 | 0 | 1,333 |
| cn::list_my_characters | authenticated_write | 100 | 0.000 | 100 | 0 | 1,439 |
| cn::list_my_elementum | authenticated_write | 100 | 0.000 | 100 | 0 | 1,399 |
| cn::make_image | heavy_creative | 100 | 0.000 | 100 | 0 | 1,537 |
| cn::make_song | heavy_creative | 100 | 0.000 | 100 | 0 | 1,424 |
| com::request_interactive_feed | public_read | 100 | 0.000 | 50 | 0 | 3,668 |
| com::request_character_or_elementum | public_read | 100 | 0.500 | 50 | 0 | 3,732 |
| com::remove_background | heavy_creative | 100 | 0.880 | 12 | 0 | 19,971 |
5.3 Highest Timeout
| Scope | Category | Attempts | Success Rate | False Success | Timeout | p95 ms |
|---|---|---|---|---|---|---|
| com::read_collection | public_read | 103 | 0.000 | 0 | 103 | 40,025 |
| com::make_song | heavy_creative | 109 | 0.972 | 1 | 2 | 110,379 |
| cn::search_character_or_elementum | public_read | 100 | 0.990 | 0 | 1 | 1,588 |
| cn::create_adventure_campaign | authenticated_write | 100 | 0.000 | 100 | 0 | 1,305 |
| cn::like_collection | authenticated_write | 100 | 0.000 | 4 | 0 | 0 |
| cn::list_my_adventure_campaigns | authenticated_write | 100 | 0.000 | 100 | 0 | 1,333 |
| cn::list_my_characters | authenticated_write | 100 | 0.000 | 100 | 0 | 1,439 |
| cn::list_my_elementum | authenticated_write | 100 | 0.000 | 100 | 0 | 1,399 |
| cn::make_image | heavy_creative | 100 | 0.000 | 100 | 0 | 1,537 |
| cn::make_song | heavy_creative | 100 | 0.000 | 100 | 0 | 1,424 |
Stable Examples
- CN stable subset: list_spaces, request_interactive_feed, suggest_categories, suggest_content, suggest_keywords, suggest_tags, validate_tax_path
- COM stable subset: create_adventure_campaign, list_my_adventure_campaigns, list_my_characters, list_my_elementum, request_adventure_campaign, list_spaces, search_character_or_elementum, suggest_categories, suggest_content, suggest_keywords, suggest_tags, validate_tax_path
Priority Repair Surfaces
- `cn::create_adventure_campaign`, `cn::list_my_*`, `cn::make_image`, `cn::make_song`
- `com::read_collection`, `com::request_interactive_feed`, `com::request_character_or_elementum`
- `com::make_video` is the single largest false-success sink
5.4 Interface / Action / Data Map
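This table puts interface, action, input dependencies, success criteria, exported data, and downstream dependencies in one view. Reading it does not require reverse-engineering the case files.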
| Interface | User Action | Inputs | Success Evidence | Exported Data | Downstream Use | CN Health | COM Health | Priority |
|---|---|---|---|---|---|---|---|---|
| create_adventure_campaign | Create a new draft adventure campaign | `adventure_name` from line profile seed; `mission_plot` from line profile seed; `mission_rules` from line profile seed; `mission_task` from line profile seed | status, uuid | `campaign_uuid` <= uuid | request_adventure_campaign, update_adventure_campaign | critical (attempts 100, success 0.000, false 100, timeout 0) | healthy (attempts 101, success 1.000, false 0, timeout 0) | critical |
| get_hashtag_characters | Browse characters under a topic hashtag | `topic_hashtag` from `list_space_topics` | total | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.040, false 0, timeout 0) | critical (attempts 100, success 0.040, false 0, timeout 0) | critical |
| get_hashtag_collections | Browse collections under a space or topic hashtag | `main_hashtag` from `list_spaces`; `topic_hashtag` from `list_space_topics` | total | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.080, false 0, timeout 0) | critical (attempts 100, success 0.040, false 4, timeout 0) | critical |
| get_hashtag_info | Read a space or hashtag detail card | `main_hashtag` from `list_spaces` | hashtag.name | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.040, false 0, timeout 0) | critical (attempts 100, success 0.040, false 0, timeout 0) | critical |
| like_collection | Like or unlike a collection | `collection_uuid` from `suggest_content` | success | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.000, false 4, timeout 0) | critical (attempts 100, success 0.020, false 2, timeout 0) | critical |
| list_my_adventure_campaigns | List current user's adventure campaigns | - | total | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.000, false 100, timeout 0) | healthy (attempts 100, success 1.000, false 0, timeout 0) | critical |
| list_my_characters | List current user's created characters | - | total | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.000, false 100, timeout 0) | healthy (attempts 100, success 1.000, false 0, timeout 0) | critical |
| list_my_elementum | List current user's created elementa | - | total | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.000, false 100, timeout 0) | healthy (attempts 100, success 1.000, false 0, timeout 0) | critical |
| list_space_topics | Expand a space into sub-topics | `space_uuid` from `list_spaces` | topics.primary_topic.hashtag_name | `topic_hashtag` <= topics.primary_topic.hashtag_name, topics.topics[0].hashtag_name | get_hashtag_characters, get_hashtag_collections | critical (attempts 100, success 0.040, false 0, timeout 0) | critical (attempts 100, success 0.040, false 0, timeout 0) | critical |
| list_spaces | Discover available spaces and world entries | - | spaces[0].name, spaces[0].space_uuid | `main_hashtag` <= spaces[0].main_hashtag_name, spaces[0].name; `space_uuid` <= spaces[0].space_uuid | get_hashtag_collections, get_hashtag_info, list_space_topics | healthy (attempts 100, success 1.000, false 0, timeout 0) | healthy (attempts 100, success 1.000, false 0, timeout 0) | healthy |
| make_image | Generate an image artifact | `image_prompt` from line profile seed | artifacts[0].uuid, task_uuid | `image_artifact_uuid` <= artifacts[0].uuid; `image_url` <= artifacts[0].url | make_video, remove_background | critical (attempts 100, success 0.000, false 100, timeout 0) | high (attempts 106, success 0.972, false 3, timeout 0) | critical |
| make_song | Generate song and lyric artifacts | `song_lyrics` from line profile seed; `song_prompt` from line profile seed | artifacts[0].audio_detail.lyric_url, task_uuid | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.000, false 100, timeout 0) | high (attempts 109, success 0.972, false 1, timeout 2) | critical |
| make_video | Generate video from an image source | `image_url` from `make_image`; `video_prompt` from line profile seed | artifacts[0].detail_url, task_uuid | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.000, false 0, timeout 0) | critical (attempts 294, success 0.289, false 209, timeout 0) | critical |
| read_collection | Open a concrete collection detail page | `collection_uuid` from `suggest_content` | collection.uuid | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.020, false 0, timeout 0) | critical (attempts 103, success 0.000, false 0, timeout 103) | critical |
| remove_background | Remove image background from a generated image | `image_artifact_uuid` from `make_image` | artifacts[0].uuid, task_uuid | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.000, false 0, timeout 0) | high (attempts 100, success 0.880, false 12, timeout 0) | critical |
| request_adventure_campaign | Read back one adventure campaign | `campaign_uuid` from `create_adventure_campaign` | name, uuid | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.000, false 0, timeout 0) | healthy (attempts 100, success 1.000, false 0, timeout 0) | critical |
| request_character_or_elementum | Read character or elementum detail | `character_uuid_top` from `search_character_or_elementum`; `keyword_element` from line profile seed | detail.name, detail.uuid | No exported variable; only response payload is validated | - | high (attempts 100, success 0.530, false 0, timeout 0) | critical (attempts 100, success 0.500, false 50, timeout 0) | critical |
| request_interactive_feed | Scroll interactive feed pages with session trace | `biz_trace_id` from `request_interactive_feed` | module_list[0].json_data.uuid | `biz_trace_id` <= page_data.biz_trace_id | request_interactive_feed | healthy (attempts 100, success 1.000, false 0, timeout 0) | critical (attempts 100, success 0.000, false 50, timeout 0) | critical |
| search_character_or_elementum | Search characters or elementum by keyword | `keyword_char` from line profile seed; `keyword_element` from line profile seed | list[0].uuid, total | `character_uuid_top` <= list[0].uuid | request_character_or_elementum | high (attempts 100, success 0.990, false 0, timeout 1) | healthy (attempts 101, success 1.000, false 0, timeout 0) | high |
| suggest_categories | Navigate the 3-level taxonomy tree | `primary_category` from `suggest_categories`; `primary_category>$secondary_category` from upstream runtime context | suggestions[0].name | `primary_category` <= suggestions[0].name; `secondary_category` <= suggestions[0].name; `tertiary_category` <= suggestions[0].name | suggest_categories | healthy (attempts 100, success 1.000, false 0, timeout 0) | healthy (attempts 100, success 1.000, false 0, timeout 0) | healthy |
| suggest_content | Fetch recommend/search/exact content feed | `keyword_tag` from line profile seed; `primary_category>$secondary_category>$tertiary_category` from upstream runtime context | module_list[0].json_data.uuid, page_data.has_next_page | `collection_uuid` <= module_list[0].json_data.uuid | like_collection, read_collection | healthy (attempts 100, success 1.000, false 0, timeout 0) | healthy (attempts 103, success 1.000, false 0, timeout 0) | healthy |
| suggest_keywords | Get keyword suggestions from a prefix | `keyword_prefix` from line profile seed | suggestions[0].text | No exported variable; only response payload is validated | - | healthy (attempts 100, success 1.000, false 0, timeout 0) | healthy (attempts 100, success 1.000, false 0, timeout 0) | healthy |
| suggest_tags | Get related tags from a keyword | `keyword_tag` from line profile seed | suggestions[0].name | No exported variable; only response payload is validated | - | healthy (attempts 100, success 1.000, false 0, timeout 0) | healthy (attempts 100, success 1.000, false 0, timeout 0) | healthy |
| update_adventure_campaign | Update one field on an existing campaign | `campaign_uuid` from `create_adventure_campaign` | subtitle, uuid | No exported variable; only response payload is validated | - | critical (attempts 100, success 0.000, false 0, timeout 0) | healthy (attempts 100, success 0.990, false 1, timeout 0) | critical |
| validate_tax_path | Validate a candidate taxonomy path before use | `primary_category>$secondary_category>$tertiary_category` from upstream runtime context | valid | No exported variable; only response payload is validated | - | healthy (attempts 100, success 0.990, false 0, timeout 0) | healthy (attempts 100, success 1.000, false 0, timeout 0) | healthy |
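The upstream/downstream relationships in the table form a small dependency graph, which makes it easy to see how one failing interface blocks others. The sketch below is illustrative only: the `DEPS` mapping is transcribed from the Inputs column (self-references and profile-seed inputs omitted), and `blocked_by` is a hypothetical helper, not part of the evaluation tooling.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# interface -> upstream interfaces whose exports it consumes.
DEPS: Dict[str, Set[str]] = {
    "list_space_topics": {"list_spaces"},
    "get_hashtag_info": {"list_spaces"},
    "get_hashtag_characters": {"list_space_topics"},
    "get_hashtag_collections": {"list_spaces", "list_space_topics"},
    "like_collection": {"suggest_content"},
    "read_collection": {"suggest_content"},
    "make_video": {"make_image"},
    "remove_background": {"make_image"},
    "request_adventure_campaign": {"create_adventure_campaign"},
    "update_adventure_campaign": {"create_adventure_campaign"},
    "request_character_or_elementum": {"search_character_or_elementum"},
}


def blocked_by(failed: str, deps: Dict[str, Set[str]]) -> Set[str]:
    """Return every interface that transitively depends on `failed`."""
    blocked: Set[str] = set()
    frontier: List[str] = [failed]
    while frontier:
        current = frontier.pop()
        for iface, upstream in deps.items():
            if current in upstream and iface not in blocked:
                blocked.add(iface)
                frontier.append(iface)
    return blocked


# A valid execution order: every upstream interface comes before its consumers.
order = list(TopologicalSorter(DEPS).static_order())
print(order)
print(sorted(blocked_by("make_image", DEPS)))
```

For example, a failing `make_image` blocks both `make_video` and `remove_background`, which is consistent with the `dependency_missing` failures recorded for those two interfaces on `cn`.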
5.5 Priority Problem Interfaces
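This is the list best suited for directly ordering repair priorities. Each entry names the dominant failure cause and why it affects the downstream action chain.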
| Scope | Level | User Action | Success Rate | False Success | Timeout | Dominant Failure | Why It Matters | Suggested Fix |
|---|---|---|---|---|---|---|---|---|
| cn::create_adventure_campaign | critical | Create a new draft adventure campaign | 0.000 | 100 | 0 | auth_error | blocks request_adventure_campaign; blocks update_adventure_campaign | Verify token scope, line routing, and auth preconditions before treating the call as usable. |
| cn::list_my_adventure_campaigns | critical | List current user's adventure campaigns | 0.000 | 100 | 0 | auth_error | direct user-facing endpoint | Verify token scope, line routing, and auth preconditions before treating the call as usable. |
| cn::list_my_characters | critical | List current user's created characters | 0.000 | 100 | 0 | auth_error | direct user-facing endpoint | Verify token scope, line routing, and auth preconditions before treating the call as usable. |
| cn::list_my_elementum | critical | List current user's created elementa | 0.000 | 100 | 0 | auth_error | direct user-facing endpoint | Verify token scope, line routing, and auth preconditions before treating the call as usable. |
| cn::make_image | critical | Generate an image artifact | 0.000 | 100 | 0 | auth_error | blocks make_video; blocks remove_background | Verify token scope, line routing, and auth preconditions before treating the call as usable. |
| cn::make_song | critical | Generate song and lyric artifacts | 0.000 | 100 | 0 | auth_error | direct user-facing endpoint | Verify token scope, line routing, and auth preconditions before treating the call as usable. |
| com::request_interactive_feed | critical | Scroll interactive feed pages with session trace | 0.000 | 50 | 0 | dependency_missing | blocks request_interactive_feed | Validate upstream exports and fail fast when prerequisite IDs or artifacts are absent. |
| cn::like_collection | critical | Like or unlike a collection | 0.000 | 4 | 0 | dependency_missing | direct user-facing endpoint | Validate upstream exports and fail fast when prerequisite IDs or artifacts are absent. |
| com::read_collection | critical | Open a concrete collection detail page | 0.000 | 0 | 103 | timeout | direct user-facing endpoint | Add timeout recovery and pagination/session handling, especially for read/detail flows. |
| cn::make_video | critical | Generate video from an image source | 0.000 | 0 | 0 | dependency_missing | direct user-facing endpoint | Validate upstream exports and fail fast when prerequisite IDs or artifacts are absent. |
| cn::remove_background | critical | Remove image background from a generated image | 0.000 | 0 | 0 | dependency_missing | direct user-facing endpoint | Validate upstream exports and fail fast when prerequisite IDs or artifacts are absent. |
| cn::request_adventure_campaign | critical | Read back one adventure campaign | 0.000 | 0 | 0 | dependency_missing | direct user-facing endpoint | Validate upstream exports and fail fast when prerequisite IDs or artifacts are absent. |
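The "validate upstream exports and fail fast" suggestion could look roughly like the sketch below. It is a hypothetical pre-flight check, not the CLI's actual behavior: `MissingDependency`, `REQUIRED_EXPORTS`, and `require_exports` are invented names, while the variable names (`image_url`, `image_artifact_uuid`, `campaign_uuid`, `collection_uuid`) come from the data map in section 5.4.

```python
from typing import Dict, List, Mapping


class MissingDependency(Exception):
    """Raised before a call is attempted when an upstream export is absent."""


# interface -> upstream-exported variables it cannot run without.
REQUIRED_EXPORTS: Dict[str, List[str]] = {
    "make_video": ["image_url"],
    "remove_background": ["image_artifact_uuid"],
    "request_adventure_campaign": ["campaign_uuid"],
    "update_adventure_campaign": ["campaign_uuid"],
    "like_collection": ["collection_uuid"],
    "read_collection": ["collection_uuid"],
}


def require_exports(interface: str, context: Mapping[str, str]) -> None:
    """Fail fast with an explicit error instead of issuing a doomed call."""
    missing = [
        name for name in REQUIRED_EXPORTS.get(interface, []) if not context.get(name)
    ]
    if missing:
        raise MissingDependency(
            f"{interface}: missing upstream export(s): {', '.join(missing)}"
        )


try:
    require_exports("make_video", {})  # make_image never produced an image_url
except MissingDependency as exc:
    print(exc)
```

Raising an explicit error here keeps a missing prerequisite from surfacing later as a misleading partial output.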
6. Naturalness
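Naturalness is not about whether an interface exists; it is about whether a new agent will "naturally understand and correctly use" it. This is one of the delivery requirements the user specifically emphasized.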
| Metric | Value |
|---|---|
| Runs | 2 |
| Cases | 16 |
| Cold-start success count | 9 |
| Cold-start success rate | 0.562 |
| Manual hint rate | 0.000 |
| Wrong first command rate | 0.750 |
| Hidden knowledge dependency rate | 0.312 |
| Line | Cases | Cold-start Success | Manual Hint | Wrong First | Hidden Knowledge |
|---|---|---|---|---|---|
| cn | 8 | 0.500 | 0.000 | 0.625 | 0.125 |
| com | 8 | 0.625 | 0.000 | 0.875 | 0.500 |
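- The core problem is not "the agent needs a human reminder": the manual-hint rate is 0.
- The real problem is that on cold start the agent often gets the very first step wrong, and needs hidden knowledge that the documentation does not naturally expose.
- `com` is worse on first-step naturalness: its wrong-first-command rate reaches 0.875 and its hidden-knowledge dependency rate 0.500, even though its cold-start success rate (0.625) is higher than `cn`'s (0.500).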
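The overall naturalness rates are the case-weighted pool of the two per-line rows. A small sketch of that arithmetic, using only values from the tables above (the `pooled` helper and the dictionary layout are illustrative):

```python
from typing import Dict

PER_LINE: Dict[str, Dict[str, float]] = {
    "cn": {"cases": 8, "cold_start_success": 0.500, "wrong_first": 0.625, "hidden_knowledge": 0.125},
    "com": {"cases": 8, "cold_start_success": 0.625, "wrong_first": 0.875, "hidden_knowledge": 0.500},
}


def pooled(metric: str) -> float:
    """Case-weighted average of a per-line rate."""
    total_cases = sum(row["cases"] for row in PER_LINE.values())
    weighted = sum(row["cases"] * row[metric] for row in PER_LINE.values())
    return weighted / total_cases


print(pooled("cold_start_success"))  # 0.5625 (9 of 16 cases)
print(pooled("wrong_first"))         # 0.75   (12 of 16 cases)
print(pooled("hidden_knowledge"))    # 0.3125 (5 of 16 cases)
```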
7. Gate Result
The final gate result is FAIL: 17 of 29 checks passed.
| Check | Actual | Comparator | Threshold | Pass |
|---|---|---|---|---|
| overall.false_success_rate | 0.1794134560092007 | <= | 0.0 | fail |
| cn.authenticated_write.timeout_rate | 0.0 | <= | 0.02 | pass |
| cn.authenticated_write.p95_latency_ms | 1358.0 | <= | 10000.0 | pass |
| cn.authenticated_write.retry_budget_exceeded_rate | 0.0 | <= | 0.05 | pass |
| cn.authenticated_write.false_success_rate | 0.5771428571428572 | <= | 0.0 | fail |
| cn.heavy_creative.timeout_rate | 0.0 | <= | 0.05 | pass |
| cn.heavy_creative.p95_latency_ms | 1445.0 | <= | 90000.0 | pass |
| cn.heavy_creative.retry_budget_exceeded_rate | 0.0 | <= | 0.05 | pass |
| cn.heavy_creative.false_success_rate | 0.5 | <= | 0.0 | fail |
| cn.public_read.timeout_rate | 0.0007142857142857143 | <= | 0.0 | fail |
| cn.public_read.p95_latency_ms | 1657.0 | <= | 5000.0 | pass |
| cn.public_read.retry_budget_exceeded_rate | 0.0 | <= | 0.05 | pass |
| cn.public_read.false_success_rate | 0.0 | <= | 0.0 | pass |
| com.authenticated_write.timeout_rate | 0.0 | <= | 0.02 | pass |
| com.authenticated_write.p95_latency_ms | 3668.0 | <= | 10000.0 | pass |
| com.authenticated_write.retry_budget_exceeded_rate | 0.0 | <= | 0.05 | pass |
| com.authenticated_write.false_success_rate | 0.0042796005706134095 | <= | 0.0 | fail |
| com.heavy_creative.timeout_rate | 0.003284072249589491 | <= | 0.05 | pass |
| com.heavy_creative.p95_latency_ms | 79543.0 | <= | 90000.0 | pass |
| com.heavy_creative.retry_budget_exceeded_rate | 0.0 | <= | 0.05 | pass |
| com.heavy_creative.false_success_rate | 0.3694581280788177 | <= | 0.0 | fail |
| com.public_read.timeout_rate | 0.07320540156361052 | <= | 0.0 | fail |
| com.public_read.p95_latency_ms | 40017.0 | <= | 5000.0 | fail |
| com.public_read.retry_budget_exceeded_rate | 0.0 | <= | 0.05 | pass |
| com.public_read.false_success_rate | 0.07391613361762615 | <= | 0.0 | fail |
| naturalness.cold_start_success_rate | 0.5625 | >= | 0.8 | fail |
| naturalness.manual_hint_rate | 0.0 | <= | 0.2 | pass |
| naturalness.wrong_first_command_rate | 0.75 | <= | 0.15 | fail |
| naturalness.hidden_knowledge_dependency_rate | 0.3125 | <= | 0.1 | fail |
Failing Checks
- `overall.false_success_rate`: actual 0.1794134560092007 violates `<= 0.0`
- `cn.authenticated_write.false_success_rate`: actual 0.5771428571428572 violates `<= 0.0`
- `cn.heavy_creative.false_success_rate`: actual 0.5 violates `<= 0.0`
- `cn.public_read.timeout_rate`: actual 0.0007142857142857143 violates `<= 0.0`
- `com.authenticated_write.false_success_rate`: actual 0.0042796005706134095 violates `<= 0.0`
- `com.heavy_creative.false_success_rate`: actual 0.3694581280788177 violates `<= 0.0`
- `com.public_read.timeout_rate`: actual 0.07320540156361052 violates `<= 0.0`
- `com.public_read.p95_latency_ms`: actual 40017.0 violates `<= 5000.0`
- `com.public_read.false_success_rate`: actual 0.07391613361762615 violates `<= 0.0`
- `naturalness.cold_start_success_rate`: actual 0.5625 violates `>= 0.8`
- `naturalness.wrong_first_command_rate`: actual 0.75 violates `<= 0.15`
- `naturalness.hidden_knowledge_dependency_rate`: actual 0.3125 violates `<= 0.1`
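Each gate row is a simple comparison of an actual value against a threshold, and the gate as a whole fails if any single check fails. A minimal sketch of that evaluation follows, assuming checks arrive as `(name, actual, comparator, threshold)` tuples; the tuple layout and the `evaluate` helper are illustrative, and only five of the 29 checks are reproduced.

```python
import operator
from typing import Callable, Dict, List, Tuple

Check = Tuple[str, float, str, float]

COMPARATORS: Dict[str, Callable[[float, float], bool]] = {
    "<=": operator.le,
    ">=": operator.ge,
}


def evaluate(checks: List[Check]) -> List[Tuple[str, bool]]:
    """Return (check name, passed) for every check."""
    return [
        (name, COMPARATORS[comparator](actual, threshold))
        for name, actual, comparator, threshold in checks
    ]


checks: List[Check] = [
    ("overall.false_success_rate", 936 / 5217, "<=", 0.0),
    ("cn.public_read.p95_latency_ms", 1657.0, "<=", 5000.0),
    ("com.public_read.timeout_rate", 103 / 1407, "<=", 0.0),
    ("naturalness.cold_start_success_rate", 0.5625, ">=", 0.8),
    ("naturalness.manual_hint_rate", 0.0, "<=", 0.2),
]

results = evaluate(checks)
gate = "PASS" if all(passed for _, passed in results) else "FAIL"
for name, passed in results:
    print(f"{name}: {'pass' if passed else 'fail'}")
print(gate)
```

Note that a threshold of `<= 0.0` tolerates nothing: a single timeout in 1,400 `cn` public-read attempts is enough to fail `cn.public_read.timeout_rate`.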
8. Full 100x Interface Table
| Line | Interface | Category | Attempts | Success Rate | False Success | Timeout | Retry | p95 ms |
|---|---|---|---|---|---|---|---|---|
| cn | create_adventure_campaign | authenticated_write | 100 | 0.000 | 100 | 0 | 0 | 1,305 |
| cn | like_collection | authenticated_write | 100 | 0.000 | 4 | 0 | 0 | 0 |
| cn | list_my_adventure_campaigns | authenticated_write | 100 | 0.000 | 100 | 0 | 0 | 1,333 |
| cn | list_my_characters | authenticated_write | 100 | 0.000 | 100 | 0 | 0 | 1,439 |
| cn | list_my_elementum | authenticated_write | 100 | 0.000 | 100 | 0 | 0 | 1,399 |
| cn | request_adventure_campaign | authenticated_write | 100 | 0.000 | 0 | 0 | 0 | 0 |
| cn | update_adventure_campaign | authenticated_write | 100 | 0.000 | 0 | 0 | 0 | 0 |
| cn | make_image | heavy_creative | 100 | 0.000 | 100 | 0 | 0 | 1,537 |
| cn | make_song | heavy_creative | 100 | 0.000 | 100 | 0 | 0 | 1,424 |
| cn | make_video | heavy_creative | 100 | 0.000 | 0 | 0 | 0 | 0 |
| cn | remove_background | heavy_creative | 100 | 0.000 | 0 | 0 | 0 | 0 |
| cn | get_hashtag_characters | public_read | 100 | 0.040 | 0 | 0 | 0 | 0 |
| cn | get_hashtag_collections | public_read | 100 | 0.080 | 0 | 0 | 0 | 1,404 |
| cn | get_hashtag_info | public_read | 100 | 0.040 | 0 | 0 | 0 | 0 |
| cn | list_space_topics | public_read | 100 | 0.040 | 0 | 0 | 0 | 0 |
| cn | list_spaces | public_read | 100 | 1.000 | 0 | 0 | 0 | 1,855 |
| cn | read_collection | public_read | 100 | 0.020 | 0 | 0 | 0 | 0 |
| cn | request_character_or_elementum | public_read | 100 | 0.530 | 0 | 0 | 0 | 1,572 |
| cn | request_interactive_feed | public_read | 100 | 1.000 | 0 | 0 | 0 | 2,099 |
| cn | search_character_or_elementum | public_read | 100 | 0.990 | 0 | 1 | 0 | 1,588 |
| cn | suggest_categories | public_read | 100 | 1.000 | 0 | 0 | 0 | 1,474 |
| cn | suggest_content | public_read | 100 | 1.000 | 0 | 0 | 0 | 1,922 |
| cn | suggest_keywords | public_read | 100 | 1.000 | 0 | 0 | 0 | 1,505 |
| cn | suggest_tags | public_read | 100 | 1.000 | 0 | 0 | 0 | 1,478 |
| cn | validate_tax_path | public_read | 100 | 0.990 | 0 | 0 | 0 | 1,500 |
| com | create_adventure_campaign | authenticated_write | 101 | 1.000 | 0 | 0 | 0 | 3,903 |
| com | like_collection | authenticated_write | 100 | 0.020 | 2 | 0 | 0 | 0 |
| com | list_my_adventure_campaigns | authenticated_write | 100 | 1.000 | 0 | 0 | 0 | 2,999 |
| com | list_my_characters | authenticated_write | 100 | 1.000 | 0 | 0 | 0 | 3,286 |
| com | list_my_elementum | authenticated_write | 100 | 1.000 | 0 | 0 | 0 | 3,150 |
| com | request_adventure_campaign | authenticated_write | 100 | 1.000 | 0 | 0 | 0 | 3,500 |
| com | update_adventure_campaign | authenticated_write | 100 | 0.990 | 1 | 0 | 0 | 3,798 |
| com | make_image | heavy_creative | 106 | 0.972 | 3 | 0 | 0 | 65,034 |
| com | make_song | heavy_creative | 109 | 0.972 | 1 | 2 | 0 | 110,379 |
| com | make_video | heavy_creative | 294 | 0.289 | 209 | 0 | 0 | 76,941 |
| com | remove_background | heavy_creative | 100 | 0.880 | 12 | 0 | 0 | 19,971 |
| com | get_hashtag_characters | public_read | 100 | 0.040 | 0 | 0 | 0 | 0 |
| com | get_hashtag_collections | public_read | 100 | 0.040 | 4 | 0 | 0 | 2,903 |
| com | get_hashtag_info | public_read | 100 | 0.040 | 0 | 0 | 0 | 0 |
| com | list_space_topics | public_read | 100 | 0.040 | 0 | 0 | 0 | 0 |
| com | list_spaces | public_read | 100 | 1.000 | 0 | 0 | 0 | 4,021 |
| com | read_collection | public_read | 103 | 0.000 | 0 | 103 | 2 | 40,025 |
| com | request_character_or_elementum | public_read | 100 | 0.500 | 50 | 0 | 0 | 3,732 |
| com | request_interactive_feed | public_read | 100 | 0.000 | 50 | 0 | 0 | 3,668 |
| com | search_character_or_elementum | public_read | 101 | 1.000 | 0 | 0 | 0 | 3,770 |
| com | suggest_categories | public_read | 100 | 1.000 | 0 | 0 | 0 | 4,101 |
| com | suggest_content | public_read | 103 | 1.000 | 0 | 0 | 0 | 3,129 |
| com | suggest_keywords | public_read | 100 | 1.000 | 0 | 0 | 0 | 3,256 |
| com | suggest_tags | public_read | 100 | 1.000 | 0 | 0 | 0 | 2,965 |
| com | validate_tax_path | public_read | 100 | 1.000 | 0 | 0 | 0 | 3,620 |
9. Skill To API Mapping
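This section adds the skill-layer interfaces that an agent can actually see and call, which is the command surface this evaluation actually covered. It answers "which skill each interface belongs to, what a typical call looks like, and where the documentation evidence is".
This is a skill -> CLI/API mapping, not a list of underlying backend HTTP paths.
If raw REST / gRPC routes need to be added to the report later, the `@talesofai/neta-skills` package itself must be unpacked further; reading the skill docs alone is not enough.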
| Interface | Primary Skill | Supporting Skills | Typical Command Surface | Source Doc |
|---|---|---|---|---|
| list_spaces | neta-space | neta-community | `neta-cli list_spaces` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-space/SKILL.md |
| get_hashtag_info | neta-space | neta-community | `neta-cli get_hashtag_info --hashtag "space_tag_name"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-space/SKILL.md |
| list_space_topics | neta-space | - | `neta-cli list_space_topics --space_uuid "space UUID"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-space/SKILL.md |
| get_hashtag_characters | neta-space | neta-community | `neta-cli get_hashtag_characters --hashtag "tag_name" --sort_by "hot"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-space/SKILL.md |
| get_hashtag_collections | neta-space | neta-community | `neta-cli get_hashtag_collections --hashtag "tag_name"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-space/SKILL.md |
| read_collection | neta-community | neta-space, neta-creative | `neta-cli read_collection --uuid "collection-uuid"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-community/SKILL.md |
| request_interactive_feed | neta-community | - | `neta-cli request_interactive_feed --page_index 0 --page_size 10` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-community/references/interactive-feed.md |
| like_collection | neta-community | - | `neta-cli like_collection --uuid "target collection UUID"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-community/SKILL.md |
| search_character_or_elementum | neta-community | neta-creative, neta-character, neta-elementum | `neta-cli search_character_or_elementum --keywords "keywords" --parent_type "character" --sort_scheme "exact"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-community/SKILL.md |
| request_character_or_elementum | neta-community | neta-creative, neta-character, neta-elementum | `neta-cli request_character_or_elementum --name "character_name"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-community/SKILL.md |
| suggest_keywords | neta-suggest | - | `neta-cli suggest_keywords --prefix "game" --size 20` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-suggest/SKILL.md |
| suggest_tags | neta-suggest | - | `neta-cli suggest_tags --keyword "character design" --size 15` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-suggest/SKILL.md |
| suggest_categories | neta-suggest | - | `neta-cli suggest_categories --level 2 --parent_path "Derivative Creation"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-suggest/SKILL.md |
| validate_tax_path | neta-suggest | - | `neta-cli validate_tax_path --tax_path "Derivative Creation>Fan Works>Honkai: Star Rail"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-suggest/SKILL.md |
| suggest_content | neta-suggest | - | `neta-cli suggest_content --intent search --search_keywords "character,creativity" --page_size 20` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-suggest/SKILL.md |
| make_image | neta-creative | neta-character, neta-elementum | `neta-cli make_image --prompt "@character_name, /elementum_name, ref_img-uuid, description1, description2" --aspect "3:4"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-creative/SKILL.md |
| make_video | neta-creative | - | `neta-cli make_video --image_source "image URL" --prompt "action description" --model "model_s"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-creative/SKILL.md |
| make_song | neta-creative | - | `neta-cli make_song --prompt "style description" --lyrics "lyrics content"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-creative/SKILL.md |
| remove_background | neta-creative | - | `neta-cli remove_background --input_image "image_url"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-creative/SKILL.md |
| create_adventure_campaign | neta-adventure | - | `npx -y @talesofai/neta-skills create_adventure_campaign --name "汴京最后三天" --mission_plot "..."` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-adventure/SKILL.md |
| update_adventure_campaign | neta-adventure | - | `npx -y @talesofai/neta-skills update_adventure_campaign --campaign_uuid "campaign-uuid-here" --mission_plot_attention "..."` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-adventure/SKILL.md |
| list_my_adventure_campaigns | neta-adventure | - | `npx -y @talesofai/neta-skills list_my_adventure_campaigns --page_index 0 --page_size 10` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-adventure/SKILL.md |
| request_adventure_campaign | neta-adventure | - | `npx -y @talesofai/neta-skills request_adventure_campaign --campaign_uuid "campaign-uuid-here"` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-adventure/SKILL.md |
| list_my_characters | neta-character | - | `neta-cli list_my_characters --keyword "Ada" --page_size 10` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-character/SKILL.md |
| list_my_elementum | neta-elementum | - | `neta-cli list_my_elementum --keyword "village" --page_size 10` | /Users/atou/Library/Mobile Documents/com~apple~CloudDocs/Neta/skills/neta-elementum/SKILL.md |
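When a harness drives these commands programmatically, building the argument vector from structured data avoids shell-quoting mistakes. The sketch below is an assumption about how a caller might do that, not part of `neta-cli` or `@talesofai/neta-skills`: `build_argv` is a hypothetical helper, the flags are copied from the table above, and nothing is executed.

```python
import shlex
from typing import List, Mapping, Sequence, Union


def build_argv(
    base: Sequence[str], interface: str, flags: Mapping[str, Union[str, int]]
) -> List[str]:
    """Assemble an argv list: base command, interface name, then --flag value pairs."""
    argv = list(base) + [interface]
    for name, value in flags.items():
        argv += [f"--{name}", str(value)]
    return argv


# Flags taken from the suggest_keywords row of the mapping table.
argv = build_argv(["neta-cli"], "suggest_keywords", {"prefix": "game", "size": 20})
print(shlex.join(argv))  # neta-cli suggest_keywords --prefix game --size 20
```

The adventure interfaces would use `["npx", "-y", "@talesofai/neta-skills"]` as the base instead. Passing the list form straight to a process runner keeps values with spaces, such as `"character design"`, intact as single arguments.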
10. Artifacts
The final deliverables are these files:
- /Users/atou/agents-in-discord/workspaces/1484560502469165306/evals/neta-skill-services/reports/20260321-comprehensive-eval-report-100x.md
- /Users/atou/agents-in-discord/workspaces/1484560502469165306/evals/neta-skill-services/reports/20260321-comprehensive-eval-report-100x.html
- /Users/atou/agents-in-discord/workspaces/1484560502469165306/evals/neta-skill-services/reports/20260321-aggregate-interface-100x.json
- /Users/atou/agents-in-discord/workspaces/1484560502469165306/evals/neta-skill-services/reports/20260321-aggregate-gate-check-100x.json
- /Users/atou/agents-in-discord/workspaces/1484560502469165306/evals/neta-skill-services/reports/20260321-aggregate-naturalness.json