Words and Cliqueish Behavior
by Lorselq (@lorselq)
Words like hanging around their word friends when they're about the same thing. Some words wear multiple hats and hang out with different crowds—maybe they're spies? Word-artifacts are fractals mapping to the narratives of human life. Anyway, this is about clustering.
How this happened
Not words—I'm not liable for their origin (although that claim may be a tad perpetch-a-sketch1)—but clusters in the Enochian Language Modeling project.
In the early days of the project, circa April 2025, I had a simple goal: find words that shared a given ngram (i.e., root) and tease out possible meanings for said root. I eventually got it working and tested it out on 20 or so ngrams.
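In code form, that early goal was basically a substring filter over the lexicon. Here's a toy sketch in Python; the lexicon below is an invented stand-in, not the project's actual data structure:

```python
# Toy sketch of the early goal: gather every word containing an ngram.
# The lexicon here is a made-up stand-in for the project's real data.
lexicon = {
    "GETA": "there",
    "CHISDA": "are there",
    "CHRISTEOS": "let there be",
    "DOOAIN": "name",
}

def candidates_for(ngram: str, lexicon: dict) -> dict:
    """Return every word (and its gloss) that contains the ngram."""
    return {word: gloss for word, gloss in lexicon.items() if ngram in word}

print(candidates_for("DA", lexicon))  # {'CHISDA': 'are there'}
```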
A mostly good example of it working
Here's a summary the Archivist agent gave about the debate:
The debate centered on whether 'DA,' a defined word meaning "there," should be considered a foundational root morpheme in the Enochian language. Key arguments supporting this proposal highlighted its 100% semantic coverage and cohesive thematic presence across related terms, suggesting it plays a central role in expressing spatial or existential concepts. The strongest counterarguments questioned the dataset's comprehensiveness and potential biases, emphasizing the need for broader linguistic analysis.
Ultimately, the adjudicator accepted 'DA' as a meaningful root candidate, acknowledging its significant semantic alignment and morphological patterns within Enochian. This acceptance was contingent on further research to validate these findings independently. The decision reflects confidence in 'DA's foundational role while recognizing areas for deeper exploration.
For reference, this is the data the LLM was working off of (not the full prompt):
DA ⭐️ (root form) — there <fasttext:0.0, semantic similarity:0.0, tier:Untiered> | GETA — there <fasttext:0.489, semantic similarity:1.0, tier:Untiered> | CHISDA — are there <fasttext:0.661, semantic similarity:0.641, tier:Untiered> | CHRISTEOS — let there be [...let there be division in her... / ...and let there be no one creature equal...] <fasttext:0.363, semantic similarity:0.566, tier:Untiered> | EMNA — here [...neither know any (long) time here...] <fasttext:0.455, semantic similarity:0.561, tier:Untiered> | ASYMP — another [...no one creature equal with another.] <fasttext:0.336, semantic similarity:0.505, tier:Untiered> | DROLN — any [...neither know any (long) time here.] <fasttext:0.331, semantic similarity:0.505, tier:Untiered> | CORSI — such [...our comfort and of such as are prepared.] <fasttext:0.373, semantic similarity:0.486, tier:Untiered> | CORS — such [...a building such as is not but in the mind...] <fasttext:0.367, semantic similarity:0.486, tier:Untiered> | AR — that [Bring down your train 3363 that the Lord may be magnified...] <fasttext:0.394, semantic similarity:0.464, tier:Untiered> | DLUGAM — given <fasttext:0.477, semantic similarity:0.455, tier:Untiered> | DRILPA — great [Great art thou in the God of... / ...is, was, and shall be great...] <fasttext:0.269, semantic similarity:0.423, tier:Untiered> | ANGELARD — thoughts [...even as the heart of man doth his thoughts...] <fasttext:0.209, semantic similarity:0.412, tier:Untiered> | IX — let <fasttext:0.487, semantic similarity:0.408, tier:Untiered> | DOOAIN — name [...whose name amongst you is wrath. / ... (his) Name is become mighty...] <fasttext:0.532, semantic similarity:0.407, tier:Untiered> | DE — of [...the corners of your governments...] <fasttext:0.62, semantic similarity:0.405, tier:Untiered>...
This is what the result was:
DA - A root morpheme denoting spatial or existential presence, often associated with concepts of "there" or existence within Enochian lexicon.
To be honest, this isn't terrible—lots of words centering around "let there be" or "are there" and so on. I blissfully ignored words that dealt more with things like "name" or "of" or "his thoughts"—those could have just been outliers, right?
However.
An example of the problem
One result in particular challenged me:
Here's the summary from the Archivist:
In the debate about the root word 'NA' in the Enochian language, proponents argued that it consistently aligns with themes of divinity and theological significance across related words, supported by strong semantic coherence (100% coverage) and morphological evidence (high FastText scores). The strongest argument was its central role in conveying divine concepts, fitting naturally within Enochian's focus on celestial themes. Critics contended the analysis relied too heavily on thematic assumptions without sufficient linguistic proof of 'NA' influencing word formation or meaning directly. They pointed out that while semantic and morphological connections exist, they might be coincidental rather than indicative of a genuine root relationship. Despite these objections, the adjudicator accepted 'NA' as a meaningful candidate due to its consistent thematic presence and structural evidence within the language's lexicon, deeming it foundational in Enochian cosmology.
For whatever reason, this is the list of words provided to the LLM:
NAZAVABH — hyacinth <fasttext:0.425, semantic similarity:0.411, tier:Untiered> | IA — (rootword indicating something celestial or sacred) <fasttext:0.386, semantic similarity:0.394, tier:Untiered> | FAAIP — voices,voicings,psalms [...understand your voices of wonder...] <fasttext:0.265, semantic similarity:0.348, tier:Untiered> | NA — Lord, the name of the trinity (when temporarily separated into three) <fasttext:0.575, semantic similarity:0.347, tier:Untiered> | LEPHE — (exact meaning unknown; part of LIFE LEPHE LURFANDO, which is a charge to tell the wicked to tell the truth) <fasttext:0.363, semantic similarity:0.343, tier:Untiered>...
The result:
NA - A divine or sacred concept central to Enochian cosmology, representing authority and celestial communication.
Sounds okay, I suppose—that is, if you don't have a strangely familiar acquaintance with the Enochian language! Let's consider the following words of which NA is a part.
- NA, a gloss for ENAY, meaning "The Lord"
- IADNAH, divine knowledge, where IAD means something akin to "divine"
- IADNAMAD, undefiled knowledge, where MAD means something akin to "your God"
- ANANAEL, sacred wisdom, where EL is "the one" and... who knows about the rest
- NAZPSAD, sword, where NAZ deals with rectangular prisms
- NAZARTH, pillars of gladness, where ARTH is the gladness idea
- VOVINA, dragon(s)
- BUTMONA, mouth
- MONASCI, the great name
As you can see, "authority and celestial communication" is really only covered by MONASCI (sort of), BUTMONA, and NA, but that's about it—everything else is all over the place, with some of the words clustering around "knowledge".
So I had a few problems:
- I provided a woefully truncated list of words that contain the ngram "NA", which limited the scope of its meaning.
- Because of this woefully truncated list, if I had moved forward with this exact model, any reconstruction effort would have the AI rebuilding Enochian words using only this one definition.
- Which means all other meanings of NA get left out entirely, which is decidedly not helpful.
That's when I started to consider: what if I started making clusters of meanings? Rather than having a list like the one above that feels very kitchen-sink, why not break it into sections based on meaning?
Clustering
This is the technical section that explains what my thought process was and what I ended up programming.
What is?
As I foreshadowed earlier, clusters are groups of a thing concentrated together based on some common overlap—in my case, a meaning shared across multiple words, clustered together.
Consider the following piece of a prompt fed to the solo analysis process:
Candidates (contain NA): NA ⭐️, NANBA, NAPTA, NAPEAI, OOAONA, NAZPSAD
Related definitions & citations: NANBA — thorns. usage: ...whose wings are *thorns* to stir up vexation... <fasttext:0.963, semantic similarity:100.0, tier:Very strong connection> | NAPTA — (two-edged) swords. usage: ...with *two-edged swords* flaming... <fasttext:0.997, semantic similarity:100.0, tier:Very strong connection> | NAPEAI — swords. usage: ...o you *swords of* the south... <fasttext:0.687, semantic similarity:100.0, tier:Very strong connection> | OOAONA — eyes. usage: ...which have 42 *eyes* to stir up wrath... <fasttext:-0.194, semantic similarity:100.0, tier:Very strong connection> | NAZPSAD — sword. usage: ...the sun is as *a sword*... <fasttext:0.753, semantic similarity:100.0, tier:Very strong connection>
In short, we have:
- NANBA, thorns
- NAPTA, swords
- NAPEAI, swords
- OOAONA, eyes (that are looking pointedly I guess)
- NAZPSAD, sword
OOAONA aside, this is a much more consistent grouping. For the ngram "NA", there are actually 59 clusters—some approved, some rejected, all processed with unique records.
But how did I get here? Not exactly through genius; I went through stages with a lot of trial and error.
How did?
As mentioned elsewhere on this site, I'm so-so at programming. I'd like to think that I'm good at big picture things,2 but who knows; all I can really say is that my smooth brain had to try a lot of things before I could get this process—which is far from perfect—to work.
The first thing I tried was an agglomerative clustering process. I dare not explain the math because I don't fully get it—I have a friend whom I plan to sit down with and see if he can teach me how these things work so I understand them better, but for now, let's just say that it's a kind of clustering that groups things together based on the distance each word is from another.3
That's nice and all, but tuning it took a while; I tried a few recommended settings (thanks to AI) and eventually had something passable.
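For the curious, here's a minimal sketch of that kind of clustering using scikit-learn; this is not the project's actual code, and the vectors are made-up stand-ins for real embeddings:

```python
# Minimal sketch of agglomerative clustering over word embeddings.
# Not the project's real code; the vectors below are invented.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

words = ["NANBA", "NAPTA", "NAPEAI", "NAZPSAD", "OOAONA"]
vectors = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.85, 0.15],  # the thorns/swords idea
    [0.7, 0.3],
    [0.1, 0.9],                             # "eyes" lands elsewhere
])

# distance_threshold controls how far apart two groups can be and
# still get merged; this is the knob that took a while to tune.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=0.5)
labels = model.fit_predict(vectors)
for word, label in zip(words, labels):
    print(label, word)
```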
One thing I noticed fairly quickly: some of the output clusters were identical except for word order. For example:
- A, B, C
- B, C, A
- C, D, A, F
- D, C
Of those, (1) and (2) are identical sets; the only difference is order. Processing the same set of definitions over and over and expecting different results is not really productive, so I decided to eliminate duplicate sets (a quick sketch of the idea follows the next list). Ordered properly, that leaves us with these sets:
- A, B, C
- A, C, D, F
- C, D
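In code, the dedupe just treats each cluster as a set. A minimal sketch using the toy clusters above:

```python
# Minimal sketch: drop clusters that differ from an earlier one
# only in word order, by comparing them as frozensets.
clusters = [
    ["A", "B", "C"],
    ["B", "C", "A"],        # same members as the first
    ["C", "D", "A", "F"],
    ["D", "C"],
]

seen = set()
unique = []
for cluster in clusters:
    key = frozenset(cluster)   # order-insensitive identity
    if key not in seen:
        seen.add(key)
        unique.append(sorted(cluster))

print(unique)  # [['A', 'B', 'C'], ['A', 'C', 'D', 'F'], ['C', 'D']]
```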
Nice. But something felt... lacking. It felt like there might be other, better ways.
And I thought, "Aren't there other methods of clustering?"
Why yes, yes there are.
The clustering methods
In short, I'm using the following methods. I'll provide a #plain-speak style definition for each to help keep them somewhat separate, along with a small code sketch after the list.
- Agglomerative: clusters based on distance between words/concepts.
- DBSCAN (aka, Density-Based Spatial Clustering of Applications with Noise): clusters closely packed data points; outliers are treated as noise.
- Threshold Graph Clustering: single-linkage with a hard cutoff distance.
- Ego-clustering: uses seed nodes and finds clusters around them.
- Fuzzy: membership in a cluster is not absolute; each word belongs to clusters by degree, percentage-style.
- k-Nearest-Neighbors: links each member to its k closest neighbors to figure out how members fit in with one another.
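To make a couple of those concrete (agglomerative already got its sketch above), here's a minimal, hypothetical example of DBSCAN and threshold graph clustering; the vectors are invented, and scikit-learn/SciPy are assumed to be available:

```python
# Hypothetical sketch of two of the methods; not the project's code.
import numpy as np
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist
from sklearn.cluster import DBSCAN

vectors = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.05, 0.05],  # one tight group
    [1.0, 1.0], [1.1, 0.9],                 # a second group
    [5.0, 5.0],                              # a lone outlier
])

# DBSCAN: dense groups become clusters; the outlier is labeled -1.
print(DBSCAN(eps=0.3, min_samples=2).fit_predict(vectors))

# Threshold graph clustering: connect any two points closer than a
# hard cutoff, then read clusters off the connected components.
cutoff = 0.5
adjacency = cdist(vectors, vectors) < cutoff
_, labels = connected_components(adjacency, directed=False)
print(labels)
```

The contrast is the interesting part: DBSCAN is willing to call a point noise, while the threshold graph approach keeps every point, just possibly in a cluster of one. That difference matters for outliers like OOAONA.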
Anyway, yeah. It would probably do me some good to find a math nerd or computer science person to help me figure this stuff out, but these are the methods I used.
However, there's an important question: what makes for a good cluster versus a bad one? What I opted to do was describe numerically what I considered a good cluster and score clusters accordingly. And then I took things one step further and had each method compete, tuning the parameters for each ngram, to get clusters that mirrored that numerical description.
This results in me knowing exactly which clustering method "works best" for each ngram. This data may eventually be useful for something, but I'm not entirely sure what yet.
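Here's a hypothetical sketch of that compete-and-score loop. The numerical description I actually used isn't reproduced here; silhouette score stands in as one plausible way to turn "what a good cluster looks like" into a number:

```python
# Hypothetical sketch of having methods compete per ngram. The real
# scoring rubric isn't shown here; silhouette score is a stand-in.
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

def score(vectors, labels):
    """Higher is better; bail out if there's nothing sensible to rate."""
    n_labels = len(set(labels))
    if n_labels < 2 or n_labels >= len(labels):
        return -1.0
    return silhouette_score(vectors, labels)

def best_method(vectors):
    """Try each method over a small parameter sweep; keep the winner."""
    trials = []
    for eps in (0.2, 0.4, 0.8):
        labels = DBSCAN(eps=eps, min_samples=2).fit_predict(vectors)
        trials.append((f"dbscan eps={eps}", labels))
    for t in (0.3, 0.6, 1.0):
        model = AgglomerativeClustering(n_clusters=None, distance_threshold=t)
        trials.append((f"agglomerative t={t}", model.fit_predict(vectors)))
    return max(trials, key=lambda trial: score(vectors, trial[1]))
```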
For the ngram "AFE", the following tuning seemed to work best: ``. This yielded these three clusters:
AFE Cluster One
Candidates (contain AFE): EFAFAFE, OFAFAFE
Related definitions & citations: EFAFAFE — vials. usage: ...which have *vials* 8 of wrath... <fasttext:0.996, semantic similarity:100.0, tier:Very strong connection> | OFAFAFE — vials. usage: come away, and not *your vials*. <fasttext:0.893, semantic similarity:100.0, tier:Very strong connection>
AFE Cluster Two
Candidates (contain AFE): OFAFAFE, EFAFAFE, FAFEN
Related definitions & citations: OFAFAFE — vials. usage: come away, and not *your vials*. <fasttext:0.893, semantic similarity:100.0, tier:Very strong connection> | EFAFAFE — vials. usage: ...which have *vials* 8 of wrath... <fasttext:0.996, semantic similarity:100.0, tier:Very strong connection> | FAFEN — intent,succession of events,consequences,something following another thing. usage: ...true ages of time, *to the intent that* from your highest..., bring down *your train*... <fasttext:0.541, semantic similarity:100.0, tier:Somewhat possible connection>
AFE Cluster Three
Candidates (contain AFE): FAFEN, OFAFAFE
Related definitions & citations: FAFEN — intent,succession of events,consequences,something following another thing. usage: ...true ages of time, *to the intent that* from your highest..., bring down *your train*... <fasttext:0.541, semantic similarity:100.0, tier:Somewhat possible connection> | OFAFAFE — vials. usage: come away, and not *your vials*. <fasttext:0.893, semantic similarity:100.0, tier:Very strong connection>
Conclusion
There's probably some revision that I need to do with this, such as letting smaller ngrams allow for larger and more numerous clusters while expecting larger ngrams to produce smaller and fewer clusters. I haven't tuned this yet, but I likely will at some point before long.
Anyway, that's what I have to say about the clustering process right now. Thanks for reading, take care!
Footnotes
1. A portmanteau word combining perpetually and etch-a-sketch into perpetch-a-sketch. Cute, right?? ↩
2. I say "good at big picture things," but compared to whom or to what? I'm better at big picture things than I was maybe a couple decades ago, but I'd like to think we're all working towards being better and better versions of ourselves. Really, I have no metric to judge this by that I'm comfortable using. I've noted: when I'm in a room with people who are immature and vapid, I feel like a big-brain genius; when I'm in a room with anyone who is remotely competent at what they're doing and at thinking things through, I feel like I've either met my match or am an idiot. All this to say, rather than how (lacking) genius my efforts are, I want to be recognized for my persistent curiosity, wonder, and open-mindedness whilst maintaining at least some level of critical thinking. ↩
3. It's probably worth mentioning that words and their meanings need to be converted into some kind of numeric value or values in order to really get anywhere with clustering. After all, without any kind of metric to work with, the difference is experienced and ineffable. ↩