flm — Training Dashboard

Reference Guide (click to expand)

How It Works (V16 — Lean Hydra + Programmatic Geometry)

The model compresses English text into 32 concept slots (each a 16-dimensional vector, 512 dims total). Three non-autoregressive decoders (each 6 layers deep) run in parallel off the same concept bottleneck: (1) EN reconstruction, (2) paraphrase, (3) semantic parse. The FR/ES translation heads were dropped: translation is not a project goal, and three heads already provide sufficient pressure for language-independent encoding.
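The slot bottleneck above can be sketched as a toy cross-attention pooling step. This is a minimal illustration of the shapes only (32 slots × 16 dims = 512), not the actual model; the `encode` function, its random projection, and the 384-dim token states are assumptions for the example.

```python
import numpy as np

# Hypothetical shapes matching the description: 32 concept slots x 16 dims each.
NUM_SLOTS, SLOT_DIM = 32, 16
BOTTLENECK_DIM = NUM_SLOTS * SLOT_DIM  # 512

def encode(token_states: np.ndarray, slot_queries: np.ndarray) -> np.ndarray:
    """Toy slot pooling: each slot query attends over the token states.
    token_states: (seq_len, d_model); slot_queries: (NUM_SLOTS, d_model)."""
    scores = slot_queries @ token_states.T                   # (32, seq_len)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                  # softmax over tokens
    slots = attn @ token_states                              # (32, d_model)
    # Project each slot down to 16 dims (a shared projection, assumed here).
    proj = np.random.default_rng(0).normal(size=(token_states.shape[1], SLOT_DIM))
    return (slots @ proj).reshape(-1)                        # flat (512,) bottleneck

z = encode(np.random.default_rng(1).normal(size=(10, 384)),
           np.random.default_rng(2).normal(size=(NUM_SLOTS, 384)))
print(z.shape)  # (512,)
```

All three decoders would read this same flat 512-dim vector, which is what makes it a shared bottleneck.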

V16 key idea: programmatic geometry data generation with strict train/test vocabulary separation. Slot-filled templates produce 18K+ word-order combinations, diverse analogies, direction pairs, and cluster sentences. The geometry eval uses unseen test vocabulary, so it measures genuine generalization rather than memorization. Geometry losses are active from step 0 (no gate), with a warmup over the first 5K steps.
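The template-with-vocab-split scheme can be sketched as follows. The template string and the tiny vocab lists are invented for illustration; the point is that test words never appear in training pairs, so the word-order probe runs on genuinely unseen fillers.

```python
import itertools

# Hypothetical vocab split: test nouns never appear in training data.
TRAIN_NOUNS = ["cat", "dog", "bird"]
TEST_NOUNS = ["fox", "crab"]
assert not set(TRAIN_NOUNS) & set(TEST_NOUNS)  # strict separation

TEMPLATE = "the {a} chased the {b}"  # example template, assumed

def word_order_pairs(vocab):
    """Yield (sentence, word-swapped sentence) pairs for the word-order data."""
    for a, b in itertools.permutations(vocab, 2):
        yield TEMPLATE.format(a=a, b=b), TEMPLATE.format(a=b, b=a)

train_pairs = list(word_order_pairs(TRAIN_NOUNS))
test_pairs = list(word_order_pairs(TEST_NOUNS))
print(len(train_pairs), len(test_pairs))  # 6 2
```

With many templates and larger slot vocabularies, the same scheme scales combinatorially to the 18K+ combinations mentioned above.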

Trains on ~978K pairs: paraphrase (MRPC/QQP/NLI) + semantic parse. Each step computes EN reconstruction plus one sampled secondary head (50/50 paraphrase/parse), with the 7 geometry losses added every 5 steps.
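The per-step loss composition can be sketched as a small scheduling function. The function name and return format are made up for illustration; only the 50/50 head sampling and the every-5-steps geometry cadence come from the description above.

```python
import random

def step_losses(step: int, rng: random.Random) -> list:
    """Which loss terms fire at a given step (sketch of the sampling scheme)."""
    losses = ["en_recon"]                                # always on
    losses.append(rng.choice(["paraphrase", "parse"]))   # 50/50 secondary head
    if step % 5 == 0:                                    # geometry every 5 steps
        losses.append("geometry")
    return losses

rng = random.Random(0)
print(step_losses(0, rng))  # contains "geometry" (0 % 5 == 0)
print(step_losses(3, rng))  # no geometry term on off-cadence steps
```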

Training Setup

Architecture: ~110M params — EN encoder (6L×384d) + bottleneck (32×16=512d) + 3 decoders (6L×384d each)
Schedule: 600K steps, cosine LR 2e-4 → 1e-5, warmup 2K steps, batch size 32
Data: ~978K pairs — paraphrase (MRPC/QQP/NLI entailment) + semantic parse (custom grammar)
Sampling: 50% paraphrase, 50% parse
Geometry: no gate — scale warms up 0→1 over 5K steps from step 0; applied every 5 steps
Losses: EN recon + head CE + geo×(WO=2.0 + HRepul=1.0 + BRepul=0.3 + Analogy=2.0 + DirCon=1.5 + Cluster=1.5)
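The schedule rows above (cosine LR with linear warmup, plus the geometry scale ramp) can be written out directly. This is a sketch of the standard formulas under the stated constants; whether the actual trainer implements warmup and decay exactly this way is an assumption.

```python
import math

MAX_LR, MIN_LR = 2e-4, 1e-5   # cosine endpoints from the schedule row
WARMUP, TOTAL = 2_000, 600_000
GEO_WARMUP = 5_000            # geometry scale ramp length

def lr_at(step: int) -> float:
    """Cosine LR 2e-4 -> 1e-5 with 2K-step linear warmup (sketch)."""
    if step < WARMUP:
        return MAX_LR * step / WARMUP
    t = (step - WARMUP) / (TOTAL - WARMUP)  # progress through decay phase
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1 + math.cos(math.pi * t))

def geo_scale(step: int) -> float:
    """Geometry loss scale: linear 0 -> 1 over the first 5K steps, then 1."""
    return min(1.0, step / GEO_WARMUP)

print(lr_at(2_000))      # peak: 2e-4
print(lr_at(600_000))    # floor: 1e-5
print(geo_scale(2_500))  # halfway through the geometry ramp: 0.5
```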

Metrics

EN Recon Loss — EN reconstruction cross-entropy. Lower = better.
FR/ES Translation — Cross-entropy for translating EN→target language through the bottleneck.
DE/PT/ZH/JA Translation — Same as above for additional languages (V19). More languages = more geometric pressure on the bottleneck.
Para Loss — Paraphrase decoder CE. Tests meaning-preserving rewording.
Parse Loss — Semantic parse decoder CE. Tests structural understanding.
Contrastive (InfoNCE) — Pushes translation pairs close together in bottleneck space while pushing unrelated batch items apart. Uses temperature=0.07. This is NOT a reconstruction loss—it directly shapes the geometry of the bottleneck vectors. Falls fast early as the model learns to group same-meaning sentences, then plateaus.
NLI Graded — 3-tier contrastive: entailment pairs → sim 0.85, neutral → 0.50, contradiction → 0.15. Smooth L1 loss. Builds graded similarity structure in the bottleneck.
WN Noun Hierarchy — Sentence pairs differing by one noun; target cosine sim tracks WordNet path_similarity (0.3 + 0.6*dist). Builds noun taxonomy in bottleneck geometry.
WN Axis Consistency — Adjective antonym pairs (big/small, hot/cold) — diff vectors within each axis should be consistent (1 - mean pairwise cosine). Builds directional axes.
WN Troponym Chains — Verb specificity chains (move→run→sprint) — ordering margin + direction consistency. Builds verb hierarchy.
EN Token Acc — Fraction of EN tokens the decoder gets right.
Exact Match — Full EN sentence reconstructed perfectly.
EM EMA — Exponential moving average of exact match (decay=0.99).
Geo Scale — Geometry loss scale factor (0–1). Ramps from 0 to 1 over first 5K steps.
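The InfoNCE metric described above (temperature 0.07, in-batch negatives, diagonal targets) can be sketched in NumPy. This is a generic textbook InfoNCE, not the project's actual implementation; the function name and the toy batch are assumptions.

```python
import numpy as np

def info_nce(anchors: np.ndarray, positives: np.ndarray, temp: float = 0.07) -> float:
    """InfoNCE over a batch: row i of `positives` is the positive for anchor i;
    every other row serves as an in-batch negative (sketch)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temp                        # (B, B) cosine sims / temperature
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.diagonal(log_probs).mean())   # cross-entropy, diagonal targets

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 512))
loss_matched = info_nce(z, z + 0.01 * rng.normal(size=z.shape))  # same-meaning pairs
loss_random = info_nce(z, rng.normal(size=(8, 512)))             # unrelated pairs
print(loss_matched < loss_random)  # True: matched pairs score much lower loss
```

This also shows why the metric "falls fast early": as soon as same-meaning pairs cluster, the diagonal logits dominate and the loss collapses toward zero.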

Geometry Probes (every 500 steps, TEST vocab)

Analogy — a−b+c≅d cosine score (test vocab). Want >0.8.
Clustering Gap — Within-group − between-group sim (test vocab). Want >0.05.
Dir Consistency — Same attribute = same direction? (test vocab). Want >0.3.
Word Order Sim — Swapped-order pair similarity (test vocab). Want <0.85.
Effective Rank — SVD dims for 90%/95% variance. Higher = richer representations.
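The analogy probe above can be sketched directly: score how close a−b+c lands to d in cosine terms. The toy vectors below are constructed with an exact shared attribute direction, so they score 1.0 by construction; real bottleneck vectors would score lower, and the >0.8 target applies to those.

```python
import numpy as np

def cos(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy_score(a, b, c, d) -> float:
    """Probe sketch: does a - b + c point at d in bottleneck space?"""
    return cos(a - b + c, d)

# Toy 512-dim vectors with perfect parallelogram structure (assumed for the demo).
rng = np.random.default_rng(0)
base, attr = rng.normal(size=512), rng.normal(size=512)
a, b = base + attr, base       # e.g. "big cat" vs "cat"
c0 = rng.normal(size=512)
c, d = c0, c0 + attr           # same attribute direction on another concept
print(round(analogy_score(a, b, c, d), 3))  # 1.0 by construction
```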

7 Geometry Losses (from step 0)

Word Order — Push swapped-word pairs below sim 0.5. Weight 2.0.
Hard Repulsion — Push top-8 most similar unrelated pairs below sim 0.1. Weight 1.0.
Batch Repulsion — Push random batch pairs below sim 0.3. Weight 0.3.
Analogy — Reward a−b+c ≅ d structure. Target sim >0.9. Weight 2.0.
Dir Consistency — Same-attribute directions should align. Target sim >0.8. Weight 1.5.
Cluster Sep — Same-group close (>0.5), different-group far (<0.2). Weight 1.5.
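Two of the losses above can be sketched as simple hinge penalties on cosine similarity. The exact loss shapes used in training are not given, so plain `max(0, ·)` margins at the stated thresholds (0.5 for word order; 0.5/0.2 for cluster separation) are an assumption.

```python
import numpy as np

def cos_sim(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def word_order_loss(z: np.ndarray, z_swapped: np.ndarray, margin: float = 0.5) -> float:
    """Hinge sketch: penalize swapped-word pairs whose similarity exceeds 0.5."""
    return max(0.0, cos_sim(z, z_swapped) - margin)

def cluster_sep_loss(same_pair, other_pair, lo: float = 0.5, hi: float = 0.2) -> float:
    """Same-group pairs pulled above 0.5, different-group pushed below 0.2."""
    s = cos_sim(*same_pair)
    d = cos_sim(*other_pair)
    return max(0.0, lo - s) + max(0.0, d - hi)

z = np.array([1.0, 0.0])
print(word_order_loss(z, np.array([0.0, 1.0])))  # orthogonal pair -> 0.0, no penalty
print(word_order_loss(z, np.array([1.0, 0.1])))  # too similar -> positive penalty
```

The weights in the list above would then scale each such term before it enters the geo-weighted sum in the Losses row.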

Diagnostic Output

[OK] = perfect reconstruction. [DIFF] (X%) = X% token overlap.
EN diagnostics: prose, code, math, logic. Parse: structured output.

Reconstruction Loss

Contrastive Loss (InfoNCE)

EM EMA / Geometry Losses

Token Accuracy / Batch Similarities

Exact Match / Eval Similarities

Learning Rate Schedule

Dynamic Sampling Weights

Margin Losses + Repulsion

Clustering Gap & Direction

Analogy & Word Order

Effective Rank
