Word2vec V33 Mixed SG+CBOW — Embedding Geometry Probe

300d, 100K whole-word vocabulary, trained 1,000,000 steps on DFSG-compliant mix (~2B tokens).

Custom Analogies: 24/26
Custom Accuracy: 92%
Google Benchmark: 59.2%
Vocab Visualized: 500
Embedding Dim: 300d

Google Analogy Benchmark (Standard Test)

The standard word2vec evaluation: 19,544 analogy questions across 14 categories. Each question has the form A:B :: C:?, and the task is to find the word D such that the C-to-D relationship mirrors the A-to-B relationship.
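
The usual way to answer an A:B :: C:? question is vector arithmetic: take vec(B) - vec(A) + vec(C) and return the nearest word by cosine similarity, excluding the three query words. The following is a minimal sketch of that procedure, not V33's evaluation code; the function name `solve_analogy` and the toy 3-d vectors are hypothetical.

```python
import numpy as np

def solve_analogy(emb, a, b, c, topk=1):
    """Return the topk words closest (by cosine) to vec(b) - vec(a) + vec(c)."""
    words = list(emb)
    idx = {w: i for i, w in enumerate(words)}
    # Unit-normalize every vector so a dot product equals cosine similarity.
    mat = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in words])
    target = mat[idx[b]] - mat[idx[a]] + mat[idx[c]]
    target = target / np.linalg.norm(target)
    sims = mat @ target
    # The query words themselves are excluded from the candidates.
    for w in (a, b, c):
        sims[idx[w]] = -np.inf
    best = np.argsort(-sims)[:topk]
    return [words[i] for i in best]

# Toy vectors for illustration only (not taken from the trained model).
toy = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([0.0, -1.0, 1.0]),
}
print(solve_analogy(toy, "man", "woman", "king"))  # man:woman :: king:?
```

A question is normally skipped (counted as not covered) when any of its four words is missing from the vocabulary, which is why coverage is reported alongside accuracy.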

Overall Accuracy: 59.2%
Semantic: 47.5%
Syntactic: 65.4%
Coverage: 80%
Correct/Covered: 9,294/15,705

Comparison with Published Models

All models evaluated on the same Google analogy test set (questions-words.txt, ~19.5K questions).

Model | Corpus | Vocab | Dim | Overall | Semantic | Syntactic
V33 (ours) | DFSG-compliant mix | 100K | 300 | 59.2% | 47.5% | 65.4%
word2vec (Mikolov 2013) | Google News 100B | 3M | 300 | 61.0% | ~65% | ~57%
GloVe (Pennington 2014) | Common Crawl 42B | 1.9M | 300 | 75.0% | ~81% | ~70%
GloVe (Pennington 2014) | Wikipedia 6B | 400K | 300 | 71.7% | ~77% | ~67%
FastText (Bojanowski 2017) | Wikipedia 16B | 2.5M | 300 | 77.8% | ~77% | ~78%

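Note: Published models use roughly 3-50x more training data (6B to 100B tokens, versus ~2B for V33). V33 trains on DFSG-compliant sources (Wikipedia, Gutenberg, Stack Exchange, arXiv, etc.). Vocab coverage also matters: our 100K vocab covers 80% of test questions, versus near-100% for larger vocabs. V33's accuracy is computed over the 15,705 covered questions only; counting every uncovered question as wrong would give 9,294/19,544 ≈ 47.6% overall.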

Per-Category Breakdown

Category | Score | Accuracy | Coverage | Type
capital-common-countries | 374/506 | 73.9% | 100% | semantic
capital-world | 796/1,927 | 41.3% | 43% | semantic
currency | 5/338 | 1.5% | 39% | semantic
city-in-state | 1,111/2,261 | 49.1% | 92% | semantic
family | 304/420 | 72.4% | 83% | semantic
gram1-adjective-to-adverb | 309/992 | 31.1% | 100% | syntactic
gram2-opposite | 237/756 | 31.3% | 93% | syntactic
gram3-comparative | 1,208/1,332 | 90.7% | 100% | syntactic
gram4-superlative | 734/1,056 | 69.5% | 94% | syntactic
gram5-present-participle | 649/1,056 | 61.5% | 100% | syntactic
gram6-nationality-adjective | 847/1,299 | 65.2% | 81% | syntactic
gram7-past-tense | 1,089/1,560 | 69.8% | 100% | syntactic
gram8-plural | 1,122/1,332 | 84.2% | 100% | syntactic
gram9-plural-verbs | 509/870 | 58.5% | 100% | syntactic
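
The headline numbers can be re-derived from the per-category scores. The sketch below sums the correct/covered counts copied from the table above and recomputes the semantic, syntactic, and overall accuracy, plus coverage against the 19,544 questions in the full test set. The names `CATEGORIES` and `aggregate` are illustrative only.

```python
# category: (correct, covered, type), copied from the per-category table.
CATEGORIES = {
    "capital-common-countries": (374, 506, "semantic"),
    "capital-world": (796, 1927, "semantic"),
    "currency": (5, 338, "semantic"),
    "city-in-state": (1111, 2261, "semantic"),
    "family": (304, 420, "semantic"),
    "gram1-adjective-to-adverb": (309, 992, "syntactic"),
    "gram2-opposite": (237, 756, "syntactic"),
    "gram3-comparative": (1208, 1332, "syntactic"),
    "gram4-superlative": (734, 1056, "syntactic"),
    "gram5-present-participle": (649, 1056, "syntactic"),
    "gram6-nationality-adjective": (847, 1299, "syntactic"),
    "gram7-past-tense": (1089, 1560, "syntactic"),
    "gram8-plural": (1122, 1332, "syntactic"),
    "gram9-plural-verbs": (509, 870, "syntactic"),
}
TOTAL_QUESTIONS = 19544  # size of the full test set, as stated earlier

def aggregate(categories, kind=None):
    """Sum correct/covered counts, optionally for one question type only."""
    rows = [(c, n) for c, n, t in categories.values() if kind in (None, t)]
    correct = sum(c for c, _ in rows)
    covered = sum(n for _, n in rows)
    return correct, covered, 100.0 * correct / covered

for label, kind in [("semantic", "semantic"), ("syntactic", "syntactic"), ("overall", None)]:
    correct, covered, acc = aggregate(CATEGORIES, kind)
    print(f"{label}: {correct:,}/{covered:,} = {acc:.1f}%")

coverage = 100.0 * aggregate(CATEGORIES)[1] / TOTAL_QUESTIONS
print(f"coverage: {coverage:.1f}%")
```

This prints 2,590/5,452 = 47.5% (semantic), 6,704/10,253 = 65.4% (syntactic), 9,294/15,705 = 59.2% (overall), and 80.4% coverage, matching the summary figures.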

t-SNE Visualization (Top 500 Words)

[Figure: t-SNE projection of the 500 visualized words. Legend categories: royalty, gender_m, gender_f, country, capital, animal, emotion, color, nature, food.]

Vector Arithmetic

Expression | Result | Top Matches
king - man + woman | queen | queen(0.75), princess(0.69), prince(0.58), infanta(0.55), empress(0.54), monarch(0.54)
paris - france + germany | berlin | berlin(0.78), munich(0.77), vienna(0.73), dresden(0.64), prague(0.63), stuttgart(0.63)
tokyo - japan + italy | bologna | bologna(0.58), turin(0.58), pisa(0.54), milan(0.54), perugia(0.54), rome(0.53)
bigger - big + small | smaller | smaller(0.69), larger(0.68), large(0.47), shorter(0.41), inconsiderable(0.41), insignificant(0.41)
went - go + come | came | came(0.89), brought(0.64), walked(0.63), coming(0.61), hurried(0.60), hastened(0.60)
queen - woman + man | king | king(0.68), majesty(0.57), regent(0.56), queen's(0.56), king's(0.53), monarch(0.53)
swimming - swim + run | running | running(0.75), runs(0.52), ran(0.48), racing(0.44), jumping(0.41), raced(0.40)
dogs - dog + cat | cats | cats(0.75), kittens(0.52), mice(0.52), rats(0.48), rabbits(0.48), pussy(0.46)
french - france + spain | spanish | spanish(0.85), portuguese(0.72), english(0.65), dutch(0.63), italian(0.62), castilian(0.59)
brother - man + woman | sister | sister(0.87), daughter(0.77), mother(0.75), cousin(0.72), husband(0.71), niece(0.71)
worst - bad + good | best | best(0.57), better(0.41), truest(0.36), hardest(0.35), well(0.33), greatest(0.33)
happy - good + bad | unhappy | unhappy(0.59), miserable(0.52), sad(0.49), wretched(0.47), happiest(0.45), glad(0.44)

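Analogy Tests (24/26 = 92%)

The 24/26 score is consistent with counting an analogy as correct when the expected word appears among the top 5 matches. By the top-1 answer alone (the Got column), 20 of the 26 rows match (77%). In the two rows that miss entirely (king:man :: woman:? and prince:man :: woman:?), the pairs appear to be ordered differently from the other questions, so the expected word is not the natural completion of A:B :: C:?.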

Analogy | Expected | Got | Top 5
king:man :: woman:? | queen | girl | girl(0.62), man's(0.60), creature(0.58), woman's(0.53), gentleman(0.51)
king:queen :: man:? | woman | woman | woman(0.73), girl(0.56), creature(0.55), lady(0.54), man's(0.52)
prince:man :: woman:? | princess | man's | man's(0.60), girl(0.58), creature(0.54), woman's(0.52), gentleman(0.49)
man:woman :: boy:? | girl | girl | girl(0.84), baby(0.70), mother(0.68), child(0.67), girls(0.65)
father:mother :: son:? | daughter | daughter | daughter(0.80), sister(0.72), brother(0.68), wife(0.67), grandson(0.65)
husband:wife :: brother:? | sister | son | son(0.76), sister(0.73), nephew(0.73), father(0.71), daughter(0.68)
he:she :: his:? | her | her | her(0.90), girl's(0.70), my(0.68), husband's(0.66), sister's(0.66)
big:bigger :: small:? | smaller | smaller | smaller(0.69), larger(0.68), large(0.47), shorter(0.41), inconsiderable(0.41)
good:better :: bad:? | worse | worse | worse(0.67), worst(0.45), easier(0.43), safer(0.42), wiser(0.41)
slow:slower :: fast:? | faster | faster | faster(0.70), swifter(0.42), quicker(0.40), fastest(0.40), thicker(0.40)
tall:taller :: short:? | shorter | shorter | shorter(0.63), long(0.40), shortened(0.39), broader(0.38), longest(0.37)
good:best :: bad:? | worst | worst | worst(0.59), easiest(0.44), cheapest(0.40), finest(0.39), safest(0.39)
big:biggest :: small:? | smallest | largest | largest(0.55), large(0.45), smaller(0.44), larger(0.42), smallest(0.41)
go:went :: come:? | came | came | came(0.89), brought(0.64), walked(0.63), coming(0.61), hurried(0.60)
see:saw :: hear:? | heard | heard | heard(0.80), came(0.67), knew(0.63), listened(0.62), spoke(0.62)
run:ran :: swim:? | swam | swam | swam(0.74), waded(0.58), dived(0.57), rowed(0.56), swimming(0.55)
eat:ate :: drink:? | drank | drank | drank(0.84), sipped(0.62), quaffed(0.60), drinking(0.58), wine(0.55)
take:took :: give:? | gave | gave | gave(0.89), giving(0.61), drew(0.51), came(0.51), gives(0.50)
france:paris :: germany:? | berlin | berlin | berlin(0.78), munich(0.77), vienna(0.73), dresden(0.64), prague(0.63)
france:paris :: italy:? | rome | bologna | bologna(0.69), rome(0.64), turin(0.63), milan(0.62), lucca(0.62)
japan:tokyo :: china:? | beijing | shanghai | shanghai(0.63), beijing(0.61), nanjing(0.55), peking(0.54), chinese(0.53)
france:paris :: england:? | london | london | london(0.79), edinburgh(0.56), philadelphia(0.55), york(0.55), vienna(0.54)
france:french :: spain:? | spanish | spanish | spanish(0.85), portuguese(0.72), english(0.65), dutch(0.63), italian(0.62)
france:french :: germany:? | german | german | german(0.84), russian(0.65), austrian(0.64), italian(0.63), dutch(0.62)
car:cars :: dog:? | dogs | dogs | dogs(0.71), cats(0.56), puppy(0.52), puppies(0.52), terrier(0.51)
child:children :: man:? | men | men | men(0.78), women(0.59), people(0.55), fellows(0.49), folks(0.49)
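
The difference between strict (top-1) and lenient (top-5) scoring can be checked directly against the rows above. The sketch below lists each row's expected word followed by its five ranked matches, copied from the table, and counts hits at rank 1 and within rank 5. The `hits_at_k` helper is hypothetical, and the scoring rule actually used by the page is an assumption inferred from the numbers.

```python
# "expected: top-5 matches in rank order", one line per table row.
ROWS = """\
queen: girl man's creature woman's gentleman
woman: woman girl creature lady man's
princess: man's girl creature woman's gentleman
girl: girl baby mother child girls
daughter: daughter sister brother wife grandson
sister: son sister nephew father daughter
her: her girl's my husband's sister's
smaller: smaller larger large shorter inconsiderable
worse: worse worst easier safer wiser
faster: faster swifter quicker fastest thicker
shorter: shorter long shortened broader longest
worst: worst easiest cheapest finest safest
smallest: largest large smaller larger smallest
came: came brought walked coming hurried
heard: heard came knew listened spoke
swam: swam waded dived rowed swimming
drank: drank sipped quaffed drinking wine
gave: gave giving drew came gives
berlin: berlin munich vienna dresden prague
rome: bologna rome turin milan lucca
beijing: shanghai beijing nanjing peking chinese
london: london edinburgh philadelphia york vienna
spanish: spanish portuguese english dutch italian
german: german russian austrian italian dutch
dogs: dogs cats puppy puppies terrier
men: men women people fellows folks
""".splitlines()

def hits_at_k(rows, k):
    """Count rows whose expected word appears among the first k matches."""
    hits = 0
    for row in rows:
        expected, ranked = row.split(": ")
        hits += expected in ranked.split()[:k]
    return hits

for k in (1, 5):
    h = hits_at_k(ROWS, k)
    print(f"top-{k}: {h}/{len(ROWS)} = {100 * h / len(ROWS):.0f}%")
```

On these rows the top-1 count is 20/26 (77%) and the top-5 count is 24/26 (92%).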

Directional Consistency

How consistently word pairs share the same direction vector (1.0 = perfect, 0.0 = random).

Direction | Consistency | Pairs | Examples
Gender (M→F) | 0.380 | 11 | king→queen, man→woman, boy→girl, father→mother, brother→sister, he→she
Tense (present→past) | 0.480 | 11 | go→went, run→ran, see→saw, come→came, eat→ate, take→took
Singular→Plural | 0.209 | 9 | car→cars, dog→dogs, cat→cats, house→houses, tree→trees, city→cities
Positive→Negative | 0.106 | 9 | happy→sad, good→bad, love→hate, beautiful→ugly, rich→poor, strong→weak
Country→Capital | 0.410 | 8 | france→paris, germany→berlin, italy→rome, japan→tokyo, spain→madrid, england→london
Country→Language | 0.661 | 8 | france→french, germany→german, spain→spanish, italy→italian, japan→japanese, china→chinese
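
One plausible way to compute such a score is the mean pairwise cosine similarity between the offset vectors vec(b) - vec(a) of all pairs in a group. That definition is an assumption (the page does not state its formula), but it matches the stated scale: identical directions give 1.0 and unrelated directions average near 0.0. The function name and toy vectors below are hypothetical.

```python
from itertools import combinations

import numpy as np

def direction_consistency(emb, pairs):
    """Mean pairwise cosine between the offsets emb[b] - emb[a] (assumed metric)."""
    offsets = []
    for a, b in pairs:
        d = emb[b] - emb[a]
        offsets.append(d / np.linalg.norm(d))  # unit-length direction
    cosines = [float(u @ v) for u, v in combinations(offsets, 2)]
    return sum(cosines) / len(cosines)

# Toy 2-d vectors for illustration only (not taken from the trained model).
toy = {
    "man": np.array([1.0, 0.0]), "woman": np.array([1.0, 1.0]),
    "king": np.array([3.0, 0.0]), "queen": np.array([3.0, 1.0]),
    "boy": np.array([0.0, 0.5]), "girl": np.array([0.0, 1.5]),
    "go": np.array([0.0, 0.0]), "went": np.array([2.0, 0.0]),
}

# All three gender offsets point the same way, so consistency is 1.0.
parallel = direction_consistency(
    toy, [("man", "woman"), ("king", "queen"), ("boy", "girl")]
)
# A gender offset and a tense offset are orthogonal here, so consistency is 0.0.
mixed = direction_consistency(toy, [("man", "woman"), ("go", "went")])
print(parallel, mixed)
```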

Semantic Clusters

Category | Within-Sim | Words | Members (up to 10 shown)
Colors | 0.564 | 11 | red, blue, green, yellow, black, white, purple, orange, brown, pink
Countries | 0.465 | 12 | france, germany, england, spain, italy, japan, china, russia, india, brazil
Music | 0.441 | 10 | song, music, piano, guitar, drum, violin, orchestra, melody, rhythm, concert
Food | 0.436 | 12 | bread, cheese, meat, fish, rice, fruit, cake, soup, milk, butter
Emotions | 0.385 | 12 | happy, sad, angry, afraid, surprised, love, hate, joy, fear, hope
Weather | 0.380 | 10 | rain, snow, wind, storm, sun, cloud, thunder, fog, frost, ice
Animals | 0.353 | 13 | dog, cat, horse, fish, bird, wolf, bear, lion, tiger, elephant
Professions | 0.316 | 10 | doctor, teacher, lawyer, engineer, scientist, artist, soldier, farmer, priest, judge
Body parts | 0.291 | 12 | head, hand, foot, arm, leg, eye, ear, nose, mouth, heart
Math | 0.245 | 10 | number, equation, formula, theorem, proof, algebra, geometry, calculus, function, variable
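
A within-group similarity of this kind is commonly computed as the average cosine similarity over all distinct word pairs in the group. The sketch below assumes that definition (the page does not state its formula) and uses hypothetical toy vectors.

```python
import numpy as np

def within_group_similarity(emb, words):
    """Average cosine similarity over all distinct word pairs in the group."""
    mat = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in words])
    sims = mat @ mat.T  # full cosine matrix, with 1.0 on the diagonal
    n = len(words)
    # Drop the diagonal (each word with itself) before averaging.
    return float((sims.sum() - np.trace(sims)) / (n * (n - 1)))

# Toy 2-d vectors for illustration only (not taken from the trained model).
toy_colors = {
    "red": np.array([1.0, 0.0]),
    "blue": np.array([0.0, 1.0]),
    "green": np.array([1.0, 1.0]),
}
score = within_group_similarity(toy_colors, ["red", "blue", "green"])
print(round(score, 3))
```

Excluding the diagonal matters for small groups: including the self-similarities of 1.0 would inflate the score, more so the fewer words a group has.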

Inter-Group Similarity Matrix

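Diagonal = within-group similarity. Off-diagonal = between-group similarity. A between-group score computed as a mean pairwise cosine should be symmetric, but several lower-triangle cells read 0.00 where the mirrored upper-triangle cell is nonzero (for example Colors/Body parts 0.00 versus Body parts/Colors 0.12), so the upper-triangle value is probably the reliable one in those cases.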

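Group | Animals | Body parts | Colors | Countries | Emotions | Food | Math | Music | Professions | Weather
Animals | 0.35 | 0.16 | 0.16 | 0.02 | 0.09 | 0.16 | -0.03 | 0.08 | 0.10 | 0.14
Body parts | 0.16 | 0.29 | 0.12 | -0.05 | 0.10 | 0.09 | -0.03 | 0.10 | 0.08 | 0.10
Colors | 0.16 | 0.00 | 0.56 | 0.02 | 0.03 | 0.15 | -0.02 | 0.05 | -0.02 | 0.18
Countries | 0.02 | 0.00 | 0.00 | 0.47 | 0.01 | 0.04 | -0.03 | 0.01 | 0.03 | 0.02
Emotions | 0.09 | 0.00 | 0.03 | 0.01 | 0.39 | 0.03 | -0.08 | 0.11 | 0.14 | 0.09
Food | 0.16 | 0.09 | 0.15 | 0.04 | 0.03 | 0.44 | 0.01 | 0.06 | 0.04 | 0.12
Math | -0.03 | -0.03 | -0.02 | -0.03 | -0.08 | 0.01 | 0.24 | 0.01 | -0.03 | -0.02
Music | 0.08 | 0.10 | 0.05 | 0.01 | 0.11 | 0.06 | 0.00 | 0.44 | 0.11 | 0.09
Professions | 0.10 | 0.08 | -0.02 | 0.03 | 0.14 | 0.00 | 0.00 | 0.00 | 0.32 | 0.02
Weather | 0.14 | 0.10 | 0.18 | 0.02 | 0.09 | 0.12 | 0.00 | 0.00 | 0.02 | 0.38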
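
For the off-diagonal cells, a natural definition is the mean cosine similarity over every cross-group word pair. The sketch below assumes that definition (the page does not state its formula) and uses hypothetical toy vectors; it also shows that the score does not depend on the order of the two groups.

```python
import numpy as np

def between_group_similarity(emb, group_a, group_b):
    """Mean cosine similarity over all (word in A, word in B) pairs."""
    a = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in group_a])
    b = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in group_b])
    return float((a @ b.T).mean())

# Toy 3-d vectors for illustration only (not taken from the trained model).
toy = {
    "red": np.array([1.0, 0.0, 0.0]),
    "blue": np.array([1.0, 1.0, 0.0]),
    "dog": np.array([0.0, 0.0, 1.0]),
    "cat": np.array([0.0, 1.0, 1.0]),
}
colors, animals = ["red", "blue"], ["dog", "cat"]
ab = between_group_similarity(toy, colors, animals)
ba = between_group_similarity(toy, animals, colors)
print(round(ab, 3), round(ba, 3))
```

Under this definition the matrix is symmetric, so only one triangle needs to be computed and the other can be mirrored.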