
Originally posted on Mastodon.

I finally got out of a bit of a depressive lull and funk for coding things for fun, and I was able to make something cute that I think is possibly even useful.

I've been playing with #LLM text #embedding #models.

I've been interested in using them as a tool to filter long lists of text into a shorter list that I can process myself.

Specific example: I wanted to find all recently published papers in a certain topic area from a journal in my field. There's ~300 papers. Only ~10 are relevant.


For my first attempt, I hand-picked ~5 papers out of ~100. I then used each paper's title, author list, abstract, and keywords to get a text representation of the paper.

I knew that I wanted to try applying Content Defined Chunking to the problem, to split a long text into more manageable pieces.

However, I didn't know the optimal chunk size, so I tried multiple chunking strategies at once and weighted each by its chunk length.

For instance, chunks of 32 chars at 0.25 weight vs. chunks of 256 chars at 0.75 weight.
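Here's a rough sketch of what that weighting could look like. The model name, chunk sizes, and weights here are illustrative assumptions, and the naive fixed-size chunking stands in for the CDC chunking the notebook actually uses:

```python
# Sketch of the multi-size weighting idea; not the notebook's exact code.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en")  # 384-dimensional embeddings

def chunk(text: str, size: int) -> list[str]:
    # Naive fixed-size chunking stands in for content-defined chunking here.
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed_document(text, strategies=((32, 0.25), (256, 0.75))):
    """Average the chunk embeddings for each chunk size, then blend the
    per-size averages with the given weights."""
    parts = []
    for size, weight in strategies:
        vecs = model.encode(chunk(text, size), normalize_embeddings=True)
        parts.append(weight * vecs.mean(axis=0))
    doc = np.sum(parts, axis=0)
    return doc / np.linalg.norm(doc)  # renormalize for cosine similarity
```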


The result of this experiment is in this Google Colab notebook: https://gist.github.com/player1537/c5970698349ec635c361e92321f2ca1c

I was ultimately able to produce a list of around 40 papers that I should look more closely at. I'm still looking through these, but this is much better than the ~300 I started with.


While working on this, I realized it'd be nice to be able to share with someone a URL pointing to a specific location in a semantic embedding space.

The challenge: even the smallest embedding models yield 384-dimensional vectors. Naively, a vector can be encoded as a very, very large URL, but not one small enough that a human could feasibly type it themselves.

So, I wondered: could you encode a quantized version of the vector instead? What quantization would work: 8 bits? Smaller? Even 1 bit?


There's an interesting property that actually makes 1-bit encoding unique.

First, consider that when doing cosine similarity of embedding vectors, you first normalize the incoming vectors, then do a dot product.

Then, consider that encoding negative components as 0 bits and positive components as 1 bits means you can do the dot product with bitwise operators.
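Here's a minimal sketch of why that works (my own illustration, not the notebook's exact code): with sign-binarized vectors, every matching bit contributes +1 and every mismatching bit contributes -1, so XOR plus a popcount recovers the dot product.

```python
import numpy as np

def binarize(vec: np.ndarray) -> np.ndarray:
    """1-bit quantization: 1 where the component is positive, else 0."""
    return (vec > 0).astype(np.uint8)

def bitwise_dot(bits_a: np.ndarray, bits_b: np.ndarray) -> int:
    """Dot product of the implied +1/-1 vectors, via bitwise operators.

    Matching bits contribute +1, mismatching bits contribute -1, so the
    dot product is D - 2 * (number of differing bits)."""
    d = bits_a.size
    differing = int(np.count_nonzero(bits_a ^ bits_b))  # XOR, then popcount
    return d - 2 * differing

def quantized_cosine(bits_a: np.ndarray, bits_b: np.ndarray) -> float:
    # Every +1/-1 vector of length D has norm sqrt(D), so cosine similarity
    # of the quantized vectors is just the dot product divided by D.
    return bitwise_dot(bits_a, bits_b) / bits_a.size
```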

I coded that idea: https://gist.github.com/player1537/cf5dc8853ccfe4767660e703d06d6a1e


Then, how to encode as a string? Well, we're already comfortable with UUIDs at 128 bits encoded as ~36 hex characters, so encoding 384 bits as ~64 base64 characters isn't much worse.

The encoding process is also in that notebook.
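Roughly, the string encoding just packs the 384 sign bits into 48 bytes and base64-encodes them, which comes out to exactly 64 characters (a sketch of the idea; the notebook has the real version):

```python
import base64
import numpy as np

def encode_bits(bits: np.ndarray) -> str:
    """Pack 384 sign bits into 48 bytes, then base64 them (64 characters)."""
    packed = np.packbits(bits.astype(np.uint8))
    return base64.b64encode(packed.tobytes()).decode("ascii")

def decode_bits(text: str, dim: int = 384) -> np.ndarray:
    """Invert encode_bits: base64 back to bytes, then back to sign bits."""
    packed = np.frombuffer(base64.b64decode(text), dtype=np.uint8)
    return np.unpackbits(packed)[:dim]
```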

The punchline: this method allows you to represent the text “AdaVis: Adaptive and Explainable Visualization Recommendation for Tabular Data” as the embedding:

8eG2V2UQTNVqfa+mG/2zpGRaokJ00yr5b0ww6zybFrzLK2F2XepPCBpseCQnDopE

which is small enough to be manageable imo!

I'd like to share a little tool that I created and have been playing with. I think now is the time to share it because I believe it has a unique answer to the “how do you chunk up a document” question for #LLM #embeddings, which is something @simon@simonwillison.net mentioned being an open challenge in his recent blog post.

I've been calling it Semeditor, short for Semantic Editor, available here: https://gist.github.com/player1537/1c23b91b274d2e885be80d5892bac5b7

It can be run with --demo to get the text from the screenshot.


This tool is born out of a need of mine: I'm currently editing an academic paper, and I'm in charge of revising the story from one concept (web services) to another concept (Jupyter extensions). But that's a tricky thing to quantify: how can one know what the text is actually about?

So, I wanted to create a syntax highlighter that highlights the semantic difference between the meanings of the texts.


Functionally, the tool takes two samples of text (top-left, bottom-left) and finds a way to differentiate those two samples. Then, it applies that same differentiation to a third sample of text (right) and highlights it accordingly.

For the semantic meaning of the text, I used the smallest model I could find: BAAI/bge-small-en. For differentiating the text, I used a Support Vector Machine (SVM) classifier.
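In sketch form (my reconstruction of the idea, not Semeditor's exact code), the embed-then-differentiate step looks something like this:

```python
# Sketch: embed chunks from the two exemplars, fit an SVM on them, then
# score chunks of the text being edited to drive the highlighting.
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC

model = SentenceTransformer("BAAI/bge-small-en")

def fit_differentiator(chunks_a: list[str], chunks_b: list[str]) -> SVC:
    X = model.encode(chunks_a + chunks_b, normalize_embeddings=True)
    y = [0] * len(chunks_a) + [1] * len(chunks_b)
    return SVC(kernel="linear", probability=True).fit(X, y)

def score_chunks(clf: SVC, chunks: list[str]) -> list[float]:
    """Probability that each chunk reads like exemplar B; map it to a color."""
    X = model.encode(chunks, normalize_embeddings=True)
    return clf.predict_proba(X)[:, 1].tolist()
```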

Now, for what I feel is the most interesting part of this tool: the chunking.

Since this runs locally on one's computer, I really don't want to constantly recompute the embeddings. And since I intend this to be an editor, edits could happen early in the text, which would ruin naive chunking strategies: every chunk after the edit would shift and need re-embedding.

Instead, I used an implementation of Content Defined Chunking (CDC) for Python called FastCDC.


The core idea of CDC is to consistently determine chunk boundaries based on the contents of the text. So, in theory, if you edit one part of the text early on, eventually the chunking will re-align itself with previous chunk boundaries.

I believe this works off of a windowed hash function: compute hash(string[i:i+N]) at each position, check whether the first few bits of the hash are zeros, and if so, output a chunk boundary there.

You can control the chunking to get arbitrarily small chunks: I use between 32 and 128 chars.
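Here's a toy illustration of that mechanism (not FastCDC's actual algorithm, which uses a gear-based rolling hash, and not Semeditor's exact parameters):

```python
import hashlib

def toy_cdc(text: str, window: int = 8, min_size: int = 32,
            max_size: int = 128, mask: int = 0x0F) -> list[str]:
    """Toy content-defined chunking: hash a sliding window of characters and
    cut a chunk whenever the low bits of the hash are all zero.

    Because boundaries depend only on nearby content, an edit early in the
    text only perturbs nearby boundaries; later boundaries re-align."""
    chunks, start = [], 0
    i = min_size
    while i < len(text):
        digest = hashlib.sha1(text[i - window:i].encode()).digest()
        boundary = (digest[-1] & mask) == 0
        if (boundary and i - start >= min_size) or (i - start >= max_size):
            chunks.append(text[start:i])
            start = i
            i += min_size  # enforce the minimum chunk size
        else:
            i += 1
    chunks.append(text[start:])
    return chunks
```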


To make each chunk more usable, especially in the face of chunk boundaries cutting words in half, I combine three adjacent chunks together and compute the actual embedding from that combined text.
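Concretely (again just a sketch), that amounts to embedding a sliding window of three adjacent chunks rather than each chunk on its own:

```python
def windowed_chunks(chunks: list[str]) -> list[str]:
    """For each chunk, include its neighbors so that a word split across a
    chunk boundary still appears whole in some window."""
    return ["".join(chunks[max(0, i - 1):i + 2]) for i in range(len(chunks))]
```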

In terms of usability, I've already found this useful, at least marginally. I wanted to flex my tkinter knowledge a bit, so I created it as a GUI app, and I've found the need to add a couple of convenience features. First, repeating the exemplars (left) gives better results. Second, asking an LLM to rephrase helps.


I'm scared of sharing projects like these because I worry they'll either be completely ignored, or worse, looked down upon. Regardless, I'm hoping to overcome that fear by just sharing it anyways.

Especially now, because I feel the techniques within it will become obsolete soon, and I think they're interesting enough that I want to share them before that happens. Maybe it will inspire someone, who knows.

Originally posted on Mastodon.