Let's brainstorm a list of miscellaneous things:

In [9]:
items_text = """
- pasta
- thomas dolby
- alpha
- apples
- cats
- pears
- meters
- brick
- dogs
- beta
- howard jones
- concrete
- asphalt
- milk
- rebar
- gillian gilbert
- hamsters
- bread
- butter
- wendy carlos
- gamma
- birds
- bananas
- rick wakeman
- inches
- glass
- feet
- gary numan
- miles
- lumber
- kilometers
- geoff downes
"""

# Split the text into non-empty lines...
items = [x for x in items_text.split("\n") if x]
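
One thing to notice: each item keeps its leading "- " marker, so the marker is sent to the embedding model along with the words. That's handy for printing Markdown lists later, but if you'd rather work with bare text, here's a minimal sketch. The `strip_bullet` helper and the `sample_text` variable are hypothetical names, not part of the notebook above:

```python
def strip_bullet(line):
    """Remove a leading "- " list marker, if present, plus surrounding whitespace."""
    line = line.strip()
    return line[2:].strip() if line.startswith("- ") else line

# A short stand-in for items_text, just to keep the example self-contained
sample_text = """
- pasta
- thomas dolby
- alpha
"""

# Skip blank lines, then drop the list marker from each remaining line
bare_items = [strip_bullet(x) for x in sample_text.split("\n") if x.strip()]
print(bare_items)
```

The rest of this notebook keeps the markers in place, so the outputs below still show them.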

Let's install the modules we'll need:

In [10]:
%pip install requests scikit-learn

Note: you may need to restart the kernel to use updated packages.


Next, let's pick an embedding model and generate semantic vector representations for all our list items:

In [3]:
import requests

# TinyLlama-1.1B-Chat-v1.0.Q5_K_M.llamafile.exe -ngl 9999 --embedding --nobrowser --port 8887
llamafile_base_url = 'http://127.0.0.1:8887'

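def generate_embeddings(items):
    """Request one embedding vector per item from the local llamafile server."""
    response = requests.post(
        f"{llamafile_base_url}/embedding",
        json={ "content": items }
    )
    # Fail loudly if the server returned an HTTP error instead of embeddings
    response.raise_for_status()
    data = response.json()
    embeddings = [x['embedding'] for x in data['results']]
    return embeddings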

embeddings = generate_embeddings(items)
embeddings

[[0.009491786360740662,
  0.035500749945640564,
  -0.010661028325557709,
  -0.019359176978468895,
  0.020187685266137123,
  -0.017124688252806664,
  0.020965272560715675,
  -0.0033423982094973326,
  0.00609687902033329,
  -0.05279877781867981,
  0.008993982337415218,
  -0.00932239554822445,
  0.009492006152868271,
  -0.01349897962063551,
  -0.005563233979046345,
  0.010440753772854805,
  0.01187113393098116,
  -0.0002146565675502643,
  -0.015647541731595993,
  -0.00019681811681948602,
  0.0067336359061300755,
  0.006047920789569616,
  -0.0005779559724032879,
  -0.002174144145101309,
  0.032633136957883835,
  -0.005216240417212248,
  -0.010189303196966648,
  0.00917273759841919,
  -0.01215176098048687,
  0.009840095415711403,
  0.011553104035556316,
  -0.011159605346620083,
  -0.0018355734646320343,
  -0.011880375444889069,
  0.04802987724542618,
  0.05148657038807869,
  0.00737421028316021,
  -0.011330530978739262,
  -0.005537863355129957,
  -0.006482837256044149,
  -0.0146590145304799
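
Those numbers don't mean much on their own, but vectors for related items should point in similar directions. Here's a rough sketch of one common way to check that, cosine similarity, using only the standard library. The `cosine_similarity` helper and the toy three-dimensional vectors are made up for illustration; real embeddings from the model have many more dimensions:

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings (made-up values)
toy_embeddings = {
    "apples": [0.9, 0.1, 0.0],
    "pears": [0.8, 0.2, 0.1],
    "rebar": [0.0, 0.1, 0.9],
}

# Related items should score closer to 1.0 than unrelated ones
print(cosine_similarity(toy_embeddings["apples"], toy_embeddings["pears"]))
print(cosine_similarity(toy_embeddings["apples"], toy_embeddings["rebar"]))
```

With the real data, you could call `cosine_similarity(embeddings[i], embeddings[j])` on any two entries. Note that k-means, used in the next step, groups points by Euclidean distance rather than by cosine similarity.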

Now that we have vectors, let's try clustering them within the semantic space of the model. This should be roughly analogous to grouping them by meaning:

In [18]:
from sklearn.cluster import KMeans
from itertools import groupby

# Let's say we want to organize the list into this many clusters
n_clusters = 12

# Use the k-means algorithm to come up with a cluster ID for each embedding
cluster_ids = KMeans(n_clusters=n_clusters, n_init='auto').fit_predict(embeddings)

# Associate each cluster ID with the corresponding item
cluster_ids_with_items = zip(cluster_ids, items)

# Group the pairs of (cluster_id, item) into lists based on cluster ID
grouped_cluster_ids_with_items = groupby(
    sorted(cluster_ids_with_items, key=lambda x: x[0]),
    key=lambda x: x[0]
)

# Simplify that whole mess so we just have a list of clustered items
clustered_items = [
    [item for cluster_id, item in item_group]
    for cluster_id, item_group
    in grouped_cluster_ids_with_items
]

clustered_items

[['- apples', '- pears', '- bananas'],
 ['- thomas dolby',
  '- howard jones',
  '- gillian gilbert',
  '- rick wakeman',
  '- geoff downes'],
 ['- alpha', '- beta', '- gamma'],
 ['- meters', '- inches', '- feet', '- miles', '- kilometers'],
 ['- lumber'],
 ['- cats', '- dogs', '- hamsters', '- birds'],
 ['- pasta', '- bread'],
 ['- brick', '- rebar'],
 ['- glass'],
 ['- concrete', '- asphalt'],
 ['- milk', '- butter'],
 ['- wendy carlos', '- gary numan']]
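
As an aside, the sort-then-`groupby` dance above could also be written with a dictionary of lists. This is just an alternative sketch, not what the cell above does; `group_by_cluster` is a hypothetical helper, and the cluster IDs here are made up rather than taken from k-means:

```python
from collections import defaultdict

def group_by_cluster(cluster_ids, items):
    """Collect items into lists keyed by cluster ID, in order of first appearance."""
    groups = defaultdict(list)
    for cluster_id, item in zip(cluster_ids, items):
        groups[cluster_id].append(item)
    return list(groups.values())

# Made-up cluster IDs for a handful of items, just to show the shape of the result
example_ids = [2, 0, 2, 1, 0]
example_items = ["- pasta", "- cats", "- bread", "- alpha", "- dogs"]

print(group_by_cluster(example_ids, example_items))
# → [['- pasta', '- bread'], ['- cats', '- dogs'], ['- alpha']]
```

One difference to keep in mind: this version lists clusters in the order they first show up, while the `sorted` + `groupby` version lists them by ascending cluster ID.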

It's not perfect, but we've got our list roughly organized. This ends up being way faster than using PyTorch!
Let's try coming up with a title for each cluster:

In [19]:
import requests

system_prompt = """You are a helpful but terse assistant."""

user_prompt = """
Given the following list of items, I need a succinct label that effectively encapsulates the overall theme or purpose.

This is the list of items:

%s

Can you generate a concise, descriptive label for this list? Thanks in advance!
"""

prompt_template = """<|system|>
%s</s>
<|user|>
%s</s>
<|assistant|>"""

def generate_topic(items):
    text = "\n".join(items)
    prompt = prompt_template % (
        system_prompt,
        user_prompt % text,
    )
    # https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/server/README.md#api-endpoints
    response = requests.post(
        f"{llamafile_base_url}/completion",
        json={
            "prompt": prompt,
            # maximum number of tokens to predict
            "n_predict": 32,
            # this tells the LLM how much of a rando to be while selecting tokens during generation
            "temperature": 0.1,
            # this tells the LLM how many different tokens to decide between at each step of generation
            "top_k": 3,
            # this tells the LLM how picky to be about the most likely tokens to select while generating
            "top_p": 0.8,
        }
    )
    data = response.json()
    return data["content"]

for cluster in clustered_items:
    topic = generate_topic(cluster)

    print(f"# {topic}")
    print()
    for item in cluster:
        print(f"{item}")
    print()

# 
"Fruits"

- apples
- pears
- bananas

# 
"Essential Artists: Thomas Dolby, Howard Jones, Rick Wakeman, Geoff Downes"

- thomas dolby
- howard jones
- gillian gilbert
- rick wakeman
- geoff downes

# 
"Essential items for a successful project"

- alpha
- beta
- gamma

# 
"Essential measurements for everyday life"

- meters
- inches
- feet
- miles
- kilometers

# 
"Essential materials for construction and home improvement"

- lumber

# 
"Animals"

- cats
- dogs
- hamsters
- birds

# 
"Essential Ingredients for a Comforting Meal"

- pasta
- bread

# 
"Brick and rebar: essential building materials"

- brick
- rebar

# 
"Essential items for a well-organized and functional home"

- glass

# 
"Materials for Construction and Maintenance"

- concrete
- asphalt

# 
"Nourishing and delicious foods"

- milk
- butter

# 
"A list of helpful and supportive individuals"

- wendy carlos
- gary numan
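
Judging by the output above, each response seems to start with a newline and come wrapped in double quotes, which is why every label lands on the line after its `#`. Here's a small sketch of one way to tidy that up before printing. The `clean_label` helper is a hypothetical addition, and the sample string is only a guess at the raw response based on what got printed:

```python
def clean_label(raw):
    """Trim whitespace and one pair of surrounding double quotes from a model response."""
    label = raw.strip()
    if len(label) >= 2 and label.startswith('"') and label.endswith('"'):
        label = label[1:-1].strip()
    return label

# Assumed shape of a raw response, inferred from the printed output
raw_response = '\n"Fruits"'

print(f"# {clean_label(raw_response)}")
# → # Fruits
```

To use it in the loop above, you could wrap the call like `topic = clean_label(generate_topic(cluster))`.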

