Skill Forge

Welcome to Skill Forge

A forge for AI agent skills. Build, lint, improve, evaluate, and ship production-ready skills that follow the open agentskills.io spec.

What is a skill?

A folder containing a SKILL.md file with YAML frontmatter — a structured set of instructions an AI agent reads on demand. Think of it as a compressed expert brain you can hand to any agent. The agent only loads the skill when its description matches what the user is asking for.

Required frontmatter (per agentskills.io):

`name`	1–64 chars, lowercase `a–z` + digits + single hyphens. Must match the parent directory name. Pattern: `^[a-z][a-z0-9]*(-[a-z0-9]+)*$`
`description`	1–1024 chars. Must describe what the skill does and when to invoke it. Include trigger keywords / example user phrases — the agent uses this to route requests.
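The two required-field rules above can be sketched as a small check. This is a minimal illustration, not the app's actual linter: the function name `check_frontmatter` and its messages are hypothetical, and the regex assumes the pattern is meant to allow any number of characters and hyphen-separated groups, as the "1–64 chars" rule implies.

```python
import re

# Assumed intent of the name pattern: a letter first, then letters/digits,
# with single hyphens only between alphanumeric groups.
NAME_RE = re.compile(r"^[a-z][a-z0-9]*(-[a-z0-9]+)*$")


def check_frontmatter(name: str, description: str, dir_name: str) -> list[str]:
    """Return a list of human-readable problems; an empty list means both fields pass."""
    problems = []
    if not 1 <= len(name) <= 64:
        problems.append("name must be 1-64 characters")
    elif not NAME_RE.fullmatch(name):
        problems.append("name must be lowercase a-z, digits, and single hyphens")
    if name != dir_name:
        problems.append("name must match the parent directory name")
    if not 1 <= len(description) <= 1024:
        problems.append("description must be 1-1024 characters")
    return problems
```

For example, `check_frontmatter("csv-cleanup", "Clean a CSV file.", "csv-cleanup")` returns an empty list, while a name with uppercase letters or a doubled hyphen is reported.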

Optional frontmatter:

`license`	Short SPDX identifier or the name of a bundled license file. Still part of the standard. Example: `license: MIT` or `license: Proprietary. LICENSE.txt has complete terms`.
`compatibility`	Up to 500 chars describing intended product / required packages / network requirements.
`metadata`	Object of arbitrary string key/value pairs. Recommended home for `version`, `author`, `tags`, etc.
`allowed-tools`	Space-separated list of pre-approved tools (experimental). Example: `allowed-tools: Bash(npm:*) Read`
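The optional-field limits can be checked in the same spirit. The sketch below assumes the frontmatter has already been parsed into a Python dict; `check_optional` and `parse_allowed_tools` are hypothetical helper names, and the whitespace split is a simplification that would mis-handle any tool pattern containing a space.

```python
def check_optional(frontmatter: dict) -> list[str]:
    """Check the documented limits on optional fields; returns a list of problems."""
    problems = []
    compat = frontmatter.get("compatibility")
    if compat is not None and len(compat) > 500:
        problems.append("compatibility exceeds 500 characters")
    # metadata is described as string key/value pairs, so flag anything else.
    for key, value in frontmatter.get("metadata", {}).items():
        if not isinstance(key, str) or not isinstance(value, str):
            problems.append(f"metadata[{key!r}] must be a string key with a string value")
    return problems


def parse_allowed_tools(value: str) -> list[str]:
    """Split the space-separated allowed-tools value (naive: assumes no spaces inside a tool)."""
    return value.split()
```

With the example from the table, `parse_allowed_tools("Bash(npm:*) Read")` yields `["Bash(npm:*)", "Read"]`.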

Directory structure (everything beyond SKILL.md is optional):

skill-name/
├── SKILL.md          required — entry point loaded at activation
├── scripts/          optional — executables the skill can run
│   └── *.py *.sh *.ts
├── references/       optional — on-demand docs the body links to
│   └── *.md *.json
├── assets/           optional — templates, sample data, images
│   └── *.docx *.pptx *.json *.png ...
└── evals/            optional — A/B test cases (see agentskills.io/evaluating-skills)
    └── evals.json

Example minimal SKILL.md frontmatter:

---
name: csv-cleanup
description: Clean a CSV file deterministically. Use when the user says "clean this CSV", "fix the data", or attaches a .csv that needs validation. Detects encoding, normalizes headers to snake_case, type-coerces, dedupes, emits a profile report.
license: MIT
metadata:
  version: "1.2.0"
  author: community
  tags: "data, etl, csv"
---
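To show how a tool might separate this frontmatter from the body, here is a minimal standard-library sketch. It is an assumption-laden illustration: `split_frontmatter` is a hypothetical name, it reads only top-level `key: value` lines, and a real linter would hand the block between the `---` lines to a proper YAML parser.

```python
def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into (top-level frontmatter fields, body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' line")
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        raise ValueError("frontmatter is missing its closing '---'") from None
    fields = {}
    for line in lines[1:end]:
        # Keep only top-level keys; indented lines (nested metadata) are skipped.
        if line and not line[0].isspace() and ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:])
    return fields, body
```

Splitting on the first colon only means a description that itself contains colons stays intact.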

Progressive disclosure & context budget

An agent's context window is finite. Every token in SKILL.md is paid every time the skill activates, before the agent does any work. Heavy material — long examples, schemas, edge cases — belongs in references/, which loads only when the workflow explicitly cites it.

Think of it like code splitting for prompts: keep the entry point tight, defer the rest. The spec recommends ≤ 5000 tokens / ≤ 500 lines in the body. The Metrics tab visualizes the activation vs on-demand cost as a waterfall.
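A rough budget check might look like the sketch below. The limits come from the recommendation above; the four-characters-per-token ratio is only a common rule of thumb and an assumption here, not the tokenizer the app or any particular model uses, and `body_budget` is a hypothetical name.

```python
MAX_TOKENS = 5000    # recommended body limit quoted above
MAX_LINES = 500      # recommended body limit quoted above
CHARS_PER_TOKEN = 4  # assumption: crude heuristic, not a real tokenizer


def body_budget(body: str) -> dict:
    """Estimate the activation cost of a SKILL.md body."""
    lines = len(body.splitlines())
    est_tokens = -(-len(body) // CHARS_PER_TOKEN)  # ceiling division
    return {
        "lines": lines,
        "est_tokens": est_tokens,
        "within_budget": lines <= MAX_LINES and est_tokens <= MAX_TOKENS,
    }
```

A body that fails this check is a candidate for moving examples or schemas into references/.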

How to read the rubric

Each skill is scored on 8 evidence-based axes (0–100), weighted by importance. A score of 90 or above counts as production-ready. The axes are drawn from the agentskills.io spec, Anthropic's Building Skills guide, and patterns observed in high-performing skills.

Trigger Clarity · 20%	Description names when to invoke the skill, with example user phrases.
Knowledge Delta · 18%	Adds expert knowledge the model doesn't already have — not generic advice.
Workflow Concreteness · 15%	Actionable verb-object steps, not abstract "think about X" prose.
Output Specification · 12%	Names the artifact format the skill produces (table, JSON, doc layout).
Progressive Disclosure · 12%	SKILL.md body is short; heavy detail lives in `references/`.
Reasoned Constraints · 10%	"Must" comes with "because". No blanket ALL-CAPS imperatives.
Self-Containment · 8%	Referenced scripts and files actually exist in the bundle.
Token Efficiency · 5%	Body within the spec's 5000-token / 500-line cap.
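One plausible way to combine the axes is a weighted average, sketched below. The weights are the percentages from the table; the averaging formula itself and the snake_case axis keys are assumptions, since the guide does not spell out how the overall score is computed.

```python
# Weights in percent, taken from the rubric table above.
WEIGHTS = {
    "trigger_clarity": 20,
    "knowledge_delta": 18,
    "workflow_concreteness": 15,
    "output_specification": 12,
    "progressive_disclosure": 12,
    "reasoned_constraints": 10,
    "self_containment": 8,
    "token_efficiency": 5,
}


def overall_score(axis_scores: dict[str, float]) -> float:
    """Weighted mean of per-axis scores (each 0-100)."""
    missing = WEIGHTS.keys() - axis_scores.keys()
    if missing:
        raise ValueError(f"missing axes: {sorted(missing)}")
    return sum(WEIGHTS[a] * axis_scores[a] for a in WEIGHTS) / sum(WEIGHTS.values())
```

Under this assumption, a skill that scores 100 everywhere except 50 on Trigger Clarity lands at exactly 90, which shows how heavily the top-weighted axis counts.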

What this app does

Validation tab — instant spec-compliance lint with teaching errors
Improve with AI — three frontier models work together: DeepSeek V4 Pro critiques, Qwen 3.6 Plus rewrites, GLM-5.1 judges. Watch it live.
Run Evals — A/B test the skill. Same model runs each case with and without the skill, measuring pass-rate and token deltas.
History + Versions — every run is tracked, and overwrites create restorable snapshots.

How evals work — and why this is the right way to measure a skill

The eval spec format comes from agentskills.io/skill-creation/evaluating-skills — Anthropic's published methodology, also used by their skill-creator meta-skill. Each case is a realistic user prompt plus assertions the output must satisfy (substring, regex, length, or LLM-judged).
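The deterministic assertion kinds can be illustrated with a short sketch. The dict keys used here (`type`, `value`, `pattern`, `min`, `max`) are hypothetical and not taken from the published eval format, and LLM-judged assertions are left out because they need a model call.

```python
import re


def check_assertion(output: str, assertion: dict) -> bool:
    """Evaluate one deterministic assertion; the field names are hypothetical."""
    kind = assertion["type"]
    if kind == "substring":
        return assertion["value"] in output
    if kind == "regex":
        return re.search(assertion["pattern"], output) is not None
    if kind == "length":
        return assertion.get("min", 0) <= len(output) <= assertion.get("max", float("inf"))
    raise ValueError(f"unsupported assertion type: {kind}")


def case_passes(output: str, assertions: list[dict]) -> bool:
    """A case passes only if every one of its assertions holds."""
    return all(check_assertion(output, a) for a in assertions)
```

Requiring every assertion to hold keeps a case's pass/fail result binary, which is what makes a pass rate meaningful.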

The methodology is paired A/B runs:

with_skill — the same agent runs the case with your skill loaded.
without_skill — the agent runs the identical case without the skill.

Why this is the right way: it controls for the prompt (same input both sides), controls for the model (we use Qwen 3.6 Plus on both runs), and produces a measurable delta. Vague claims of "my skill makes the agent better" become quantified lift — or get falsified. We measure three things:

Pass rate	Did the output satisfy the assertions? Δ > 0 = the skill makes the agent more correct.
Tokens	Compares total output tokens. A skill that doubles token cost for the same pass rate is not a win — measure both.
Wall time	How long each side took end-to-end. Skills that bloat reasoning time pay a UX tax.

A good skill improves at least one without degrading the others. The benchmark scorecard in the Evals tab surfaces all three deltas after each run.
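The three deltas reduce to simple subtraction once each side's totals are known. The sketch below uses a hypothetical `SideResult` record; the field names are assumptions rather than the app's actual data model.

```python
from dataclasses import dataclass


@dataclass
class SideResult:
    """Aggregate outcome for one side of the A/B run (hypothetical shape)."""
    passed: int
    total: int
    output_tokens: int
    wall_seconds: float


def deltas(with_skill: SideResult, without_skill: SideResult) -> dict[str, float]:
    """Return with_skill minus without_skill for each of the three measures."""
    return {
        "pass_rate": with_skill.passed / with_skill.total
        - without_skill.passed / without_skill.total,
        "output_tokens": with_skill.output_tokens - without_skill.output_tokens,
        "wall_seconds": with_skill.wall_seconds - without_skill.wall_seconds,
    }
```

A positive pass-rate delta is an improvement, while positive token and wall-time deltas are costs, so the three numbers have to be read together.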

Quick start

Drop a .skill or .zip on the left, or click any skill below.
Check the Validation and Rubric tabs to see where it stands.
Click Improve with AI for a content rewrite.
Click Generate Evals then Run Evals to measure actual lift.

Tips

The Files tab shows what gets shipped when someone downloads the zip.
The Metrics tab includes a token-budget waterfall showing what loads at activation vs on demand.
Review Diff shows the selected AI proposal or forge evaluation patch with source context.
Look for trigger collisions below — skills whose descriptions overlap can confuse the routing agent.
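As an illustration of what a trigger collision could mean in practice, the sketch below flags pairs of skills whose descriptions share many keywords. Jaccard overlap on words of four or more letters and the 0.5 threshold are assumptions chosen for the example; the guide does not say how the app detects collisions.

```python
import re
from itertools import combinations


def keywords(description: str) -> set[str]:
    # Lowercased words of 4+ letters; a crude stand-in for real keyword extraction.
    return set(re.findall(r"[a-z]{4,}", description.lower()))


def trigger_collisions(
    descriptions: dict[str, str], threshold: float = 0.5
) -> list[tuple[str, str, float]]:
    """Return (skill_a, skill_b, overlap) for pairs at or above the threshold."""
    hits = []
    for (a, da), (b, db) in combinations(descriptions.items(), 2):
        ka, kb = keywords(da), keywords(db)
        union = ka | kb
        score = len(ka & kb) / len(union) if union else 0.0
        if score >= threshold:
            hits.append((a, b, round(score, 2)))
    return hits
```

Two skills flagged this way are worth rewording so that each description names triggers the other does not.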

Catalog Overview

Aggregate health of your skill library. Click any skill in the sidebar to drill into its detail.

Skill

Pipeline Events

A/B Evaluation Benchmark

Run cases with the skill vs without it. Same worker model on both sides. Pass rate / token / time deltas measure actual lift.

Generate Evals

Starts fresh from SKILL.md. Useful when a skill has no evals yet. The previous active spec is archived first.

Improve Evals

Uses old specs, prior runs, quality warnings, and optional human notes to create a harder next test revision.

Run Evals

Runs each case twice: with the skill and without it. This is the evidence layer for lift.

Improve Skill from Eval Evidence

Leaves the eval spec alone and uses failed cases plus benchmark results to propose a better SKILL.md.

Archive Spec

Removes the current active spec from the live workflow while preserving the archived revision.

Evaluate Skill is separate. It scores the selected bundle state with the forge rubric and spec checks. Run Evals measures behavioral lift. Improve Evals hardens the test curriculum.

Click Run Evals to A/B test this skill, or Generate Evals if there is no evals/evals.json yet.

AI Improvement Studio

Watch three frontier models critique, rewrite, and judge this skill in real time.

Click "Improve with AI" to start.

DeepSeek V4 Pro critiques. Qwen 3.6 Plus rewrites. GLM-5.1 judges. Stages stream live.

Review Diff

This tab is a selected change viewer. By default it shows the latest AI rewrite proposal for this skill. If you click a hardening run in History, it switches to that run's SKILL.md patch. It is not a continuously running diff.

Select a run to inspect its diff.

Metrics

Two things live here. Selected Forge Report describes the evaluation currently selected from History or latest load. Current Bundle Token Budget describes the live skill files right now.

Selected Forge Report

Optimization Notes

Spec Compliance

Linted against the agentskills.io specification. Each finding ships with a teaching note.

One focused output. This tab only shows spec-blocking errors, warnings, passed checks, and teaching notes. The lint list is the canonical validation surface.

Evaluation Rubric

Run History

Lineage

Environment & Share Readiness

Skill Versions

Snapshots of the live skill bundle captured before and after overwriting operations.

Installed Skill Files
