Welcome to Skill Forge
A forge for AI agent skills. Build, lint, improve, evaluate, and ship production-ready skills that follow the open agentskills.io spec.
What is a skill?
A folder containing a SKILL.md file with YAML frontmatter — a structured set of instructions an AI agent reads on demand. Think of it as a compressed expert brain you can hand to any agent. The agent only loads the skill when its description matches what the user is asking for.
Required frontmatter (per agentskills.io):
| name | 1–64 chars, lowercase a–z + digits + single hyphens. Must match the parent directory name. Pattern: ^[a-z][a-z0-9]*(-[a-z0-9]+)*$ |
| description | 1–1024 chars. Must describe what the skill does and when to invoke it. Include trigger keywords and example user phrases, because the agent uses this field to route requests. |
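The two required fields can be checked mechanically. The following is a minimal sketch, not the app's actual linter: the function name `check_required_frontmatter` and the wording of the messages are hypothetical, while the pattern and length limits are the ones listed above.

```python
import re

# Pattern copied from the name rule above.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def check_required_frontmatter(name: str, description: str, parent_dir: str) -> list[str]:
    """Return a list of problems; an empty list means both required fields pass.

    Hypothetical helper for illustration, not part of any published tool.
    """
    problems = []
    if not 1 <= len(name) <= 64:
        problems.append("name must be 1-64 characters")
    # fullmatch rejects a trailing newline that a plain match with $ would accept.
    if not NAME_PATTERN.fullmatch(name):
        problems.append("name must be lowercase a-z, digits, and single hyphens")
    if name != parent_dir:
        problems.append("name must match the parent directory name")
    if not 1 <= len(description) <= 1024:
        problems.append("description must be 1-1024 characters")
    return problems
```

Returning a list rather than raising on the first failure lets a linter report every problem in one pass.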
Optional frontmatter:
| license | Short SPDX identifier or the name of a bundled license file. Still standard. Examples: "license: MIT" or "license: Proprietary. LICENSE.txt has complete terms". |
| compatibility | Up to 500 chars describing intended product / required packages / network requirements. |
| metadata | Object of arbitrary string key/value pairs. Recommended home for version, author, tags, etc. |
| allowed-tools | Space-separated list of pre-approved tools (experimental). Example: "allowed-tools: Bash(npm:*) Read" |
Directory structure (everything beyond SKILL.md is optional):
skill-name/
├── SKILL.md          required: entry point loaded at activation
├── scripts/          optional: executables the skill can run
│   └── *.py *.sh *.ts
├── references/       optional: on-demand docs the body links to
│   └── *.md *.json
├── assets/           optional: templates, sample data, images
│   └── *.docx *.pptx *.json *.png ...
└── evals/            optional: A/B test cases (see agentskills.io/evaluating-skills)
    └── evals.json
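A quick way to see how a bundle on disk maps onto this layout is to list what is present. This is a rough sketch under stated assumptions: `inspect_bundle` and its report keys are hypothetical names, and the "other_entries" list is informational only, since nothing above forbids extra files.

```python
from pathlib import Path

# Optional directory names taken from the layout above.
OPTIONAL_DIRS = ("scripts", "references", "assets", "evals")


def inspect_bundle(skill_dir: Path) -> dict:
    """Summarize a skill folder: required file, known optional dirs, anything else.

    Hypothetical helper for illustration only.
    """
    entries = sorted(p.name for p in skill_dir.iterdir())
    return {
        "has_skill_md": (skill_dir / "SKILL.md").is_file(),
        "optional_dirs": [
            n for n in entries if n in OPTIONAL_DIRS and (skill_dir / n).is_dir()
        ],
        "other_entries": [
            n for n in entries if n != "SKILL.md" and n not in OPTIONAL_DIRS
        ],
    }
```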
Example minimal SKILL.md frontmatter:
---
name: csv-cleanup
description: Clean a CSV file deterministically. Use when the user says "clean this CSV", "fix the data", or attaches a .csv that needs validation. Detects encoding, normalizes headers to snake_case, type-coerces, dedupes, emits a profile report.
license: MIT
metadata:
  version: "1.2.0"
  author: community
  tags: [data, etl, csv]
---
Progressive disclosure & context budget
An agent's context window is finite. Every token in SKILL.md is paid every time the skill activates, before the agent does any work. Heavy material — long examples, schemas, edge cases — belongs in references/, which loads only when the workflow explicitly cites it.
Think of it like code splitting for prompts: keep the entry point tight, defer the rest. The spec recommends ≤ 5000 tokens / ≤ 500 lines in the body. The Metrics tab visualizes the activation vs on-demand cost as a waterfall.
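The budget can be approximated without a tokenizer. The sketch below assumes roughly four characters per token, which is only a rule of thumb and will drift for code-heavy or non-English text; `body_budget` and its parameters are hypothetical names, and the limits are the recommendations quoted above.

```python
import math


def body_budget(body: str, max_tokens: int = 5000, max_lines: int = 500,
                chars_per_token: float = 4.0) -> dict:
    """Estimate whether a SKILL.md body fits the recommended budget.

    chars_per_token is an assumed heuristic, not a real tokenizer.
    """
    lines = len(body.splitlines())
    estimated_tokens = math.ceil(len(body) / chars_per_token)
    return {
        "lines": lines,
        "estimated_tokens": estimated_tokens,
        "within_budget": lines <= max_lines and estimated_tokens <= max_tokens,
    }
```

For an exact count, swap the estimate for the tokenizer of whichever model will load the skill.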
How to read the rubric
Each skill is scored on 8 evidence-based axes (0–100), weighted by importance. ≥90 = production-ready. Sourced from the agentskills.io spec + Anthropic's Building Skills guide + patterns observed in high-performing skills.
| Trigger Clarity · 20% | Description names when to invoke the skill, with example user phrases. |
| Knowledge Delta · 18% | Adds expert knowledge the model doesn't already have — not generic advice. |
| Workflow Concreteness · 15% | Actionable verb-object steps, not abstract "think about X" prose. |
| Output Specification · 12% | Names the artifact format the skill produces (table, JSON, doc layout). |
| Progressive Disclosure · 12% | SKILL.md body is short; heavy detail lives in references/. |
| Reasoned Constraints · 10% | "Must" comes with "because". No blanket ALL-CAPS imperatives. |
| Self-Containment · 8% | Referenced scripts and files actually exist in the bundle. |
| Token Efficiency · 5% | Body within the spec's 5000-token / 500-line cap. |
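To make the weighting concrete, here is a small sketch that combines per-axis scores into one number. It assumes the overall score is a plain weighted average of the eight axes, which the weights suggest but the text does not state outright; the key names and `overall_score` are hypothetical.

```python
# Weights in percent, copied from the rubric above; they sum to 100.
WEIGHTS = {
    "trigger_clarity": 20,
    "knowledge_delta": 18,
    "workflow_concreteness": 15,
    "output_specification": 12,
    "progressive_disclosure": 12,
    "reasoned_constraints": 10,
    "self_containment": 8,
    "token_efficiency": 5,
}


def overall_score(axis_scores: dict[str, float]) -> float:
    """Weighted average of 0-100 axis scores (assumed aggregation rule)."""
    missing = set(WEIGHTS) - set(axis_scores)
    if missing:
        raise ValueError(f"missing axis scores: {sorted(missing)}")
    return sum(WEIGHTS[axis] * axis_scores[axis] for axis in WEIGHTS) / 100


def is_production_ready(axis_scores: dict[str, float]) -> bool:
    """Apply the >= 90 threshold stated above."""
    return overall_score(axis_scores) >= 90
```

Under this rule, a perfect score everywhere except a zero on Trigger Clarity yields 80, which shows why the heaviest axis alone can keep a skill below the threshold.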
What this app does
- Validation tab — instant spec-compliance lint with teaching errors
- Improve with AI — three frontier models work together: DeepSeek V4 Pro critiques, Qwen 3.6 Plus rewrites, GLM-5.1 judges. Watch it live.
- Run Evals — A/B test the skill. Same model runs each case with and without the skill, measuring pass-rate and token deltas.
- History + Versions — every run is tracked, and overwrites create restorable snapshots.
How evals work — and why this is the right way to measure a skill
The eval spec format comes from agentskills.io/skill-creation/evaluating-skills — Anthropic's published methodology, also used by their skill-creator meta-skill. Each case is a realistic user prompt plus assertions the output must satisfy (substring, regex, length, or LLM-judged).
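The deterministic assertion types can be sketched in a few lines. The dictionary shape used here (`type`, `value`, `pattern`, `min`, `max`) is a hypothetical illustration and not the agentskills.io schema; LLM-judged assertions are left out because they need a model call.

```python
import re


def check_assertion(output: str, assertion: dict) -> bool:
    """Evaluate one deterministic assertion against a model output.

    The assertion field names are assumed for illustration only.
    """
    kind = assertion["type"]
    if kind == "substring":
        return assertion["value"] in output
    if kind == "regex":
        return re.search(assertion["pattern"], output) is not None
    if kind == "length":
        low = assertion.get("min", 0)
        high = assertion.get("max", float("inf"))
        return low <= len(output) <= high
    raise ValueError(f"unsupported assertion type: {kind}")


def case_passes(output: str, assertions: list[dict]) -> bool:
    """A case passes only if every assertion holds."""
    return all(check_assertion(output, a) for a in assertions)
```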
The methodology is paired A/B runs:
- with_skill — the same agent runs the case with your skill loaded.
- without_skill — the agent runs the identical case without the skill.
Why this is the right way: it controls for the prompt (same input both sides), controls for the model (we use Qwen 3.6 Plus on both runs), and produces a measurable delta. Vague claims of "my skill makes the agent better" become quantified lift — or get falsified. We measure three things:
| Pass rate | Did the output satisfy the assertions? Δ > 0 = the skill makes the agent more correct. |
| Tokens | Compares total output tokens. A skill that doubles token cost for the same pass rate is not a win — measure both. |
| Wall time | How long each side took end-to-end. Skills that bloat reasoning time pay a UX tax. |
A good skill improves at least one without degrading the others. The benchmark scorecard in the Evals tab surfaces all three deltas after each run.
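The three deltas reduce to simple arithmetic once each side's totals are known. This is a minimal sketch; `SideResult`, its fields, and `compare` are hypothetical names rather than the app's data model.

```python
from dataclasses import dataclass


@dataclass
class SideResult:
    """Aggregated totals for one side of the A/B run (hypothetical shape)."""
    passed: int
    total: int
    output_tokens: int
    wall_seconds: float


def compare(with_skill: SideResult, without_skill: SideResult) -> dict:
    """Return with_skill minus without_skill for each measured quantity."""
    if with_skill.total == 0 or without_skill.total == 0:
        raise ValueError("each side needs at least one case")
    return {
        # Positive means the skill made the agent more correct.
        "pass_rate_delta": with_skill.passed / with_skill.total
        - without_skill.passed / without_skill.total,
        # Positive means the skill cost more output tokens.
        "token_delta": with_skill.output_tokens - without_skill.output_tokens,
        # Positive means the skill made the run slower.
        "wall_time_delta": with_skill.wall_seconds - without_skill.wall_seconds,
    }
```

With 8 of 10 cases passing with the skill and 5 of 10 without, the pass-rate delta is about 0.3; whether that is worth any extra tokens or seconds is the judgment the scorecard is meant to support.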
Quick start
- Drop a .skill or .zip on the left, or click any skill below.
- Check the Validation and Rubric tabs to see where it stands.
- Click Improve with AI for a content rewrite.
- Click Generate Evals then Run Evals to measure actual lift.
Tips
- The Files tab shows what gets shipped when someone downloads the zip.
- The Metrics tab includes a token-budget waterfall showing what loads at activation vs on demand.
- Review Diff shows the selected AI proposal or forge evaluation patch with source context.
- Look for trigger collisions below — skills whose descriptions overlap can confuse the routing agent.
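How the app detects trigger collisions is not described here. One simple heuristic, shown purely as an assumption, is word-set overlap (Jaccard similarity) between descriptions; `find_collisions` and the 0.5 threshold are hypothetical choices.

```python
import re
from itertools import combinations


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric words in a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared words divided by all distinct words across both texts."""
    wa, wb = _words(a), _words(b)
    if not wa and not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def find_collisions(descriptions: dict[str, str],
                    threshold: float = 0.5) -> list[tuple[str, str, float]]:
    """Return skill-name pairs whose descriptions overlap at or above threshold."""
    hits = []
    for (n1, d1), (n2, d2) in combinations(sorted(descriptions.items()), 2):
        score = jaccard(d1, d2)
        if score >= threshold:
            hits.append((n1, n2, score))
    return hits
```

Word overlap misses paraphrases and over-counts common words, so a real router check would likely need stop-word filtering or embeddings.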
Catalog Overview
Aggregate health of your skill library. Click any skill in the sidebar to drill into its detail.
Skill
No run loaded.
Pipeline Events
A/B Evaluation Benchmark
Run cases with the skill vs without it. Same worker model on both sides. Pass rate / token / time deltas measure actual lift.
evals/evals.json yet.
AI Improvement Studio
Watch three frontier models critique, rewrite, and judge this skill in real time.
Review Diff
Metrics
Selected Forge Report
Optimization Notes
Spec Compliance
Linted against the agentskills.io specification. Each finding ships with a teaching note.
Evaluation Rubric
Run History
Lineage
Environment & Share Readiness
Skill Versions
Snapshots of the live skill bundle captured before and after overwriting operations.
Installed Skill Files