
Fuck LLMs, Embedding Models Are the Best Part of the AI Revolution

Dylan Moore
· 5 min read

LLMs are impressive. But the hype has turned everyone's brains to mush. I've watched companies burn through thousands in API costs trying to make ChatGPT do basic search.

There's a quieter, cheaper, more reliable part of the AI stack that actually works: embeddings. They're what I ship. Nobody talks about them because they're not as sexy as a chatbot.

The Hype Industrial Complex

The problem isn't that LLMs are bad. It's using them for tasks that don't need generation.

LLMs hallucinate confidently. They're expensive — every interaction is a metered API call. They're slow. They're unpredictable — same prompt, different day, different output. Good luck writing tests. And they don't scale: you can run billions of documents through an embedding model; try that with an LLM.

Use LLMs intentionally, where generation adds value. Don't use them as a search engine with extra steps.

Embeddings: The Quiet Workhorse

An embedding model takes text (or images, or audio) and converts it into a vector representing the meaning of that input. Similar meanings end up close together in vector space. That's it.

They don't generate anything. No hallucinations. Same input, same output, every time. You can actually test this. Fast (milliseconds) and cheap (fractions of a cent).

"But Embeddings Are Just Dumb Similarity"

This is what people who haven't looked at embedding models since 2022 think. Modern embedding models are instruction-tuned. You can tell them how to embed.

Take Qwen3-embedding, which I use. You can prefix your input with task instructions:

search_query: how do I implement vector search
search_document: This article explains pgvector installation...

The model embeds the query differently than the document because it understands the asymmetric nature of search: queries are short and intent-focused, documents are long and informational.

Other models like E5-Mistral and GTE support similar patterns. You can specify classification: vs clustering: vs retrieval: prefixes. This isn't "prompt engineering". It's a documented, stable API. One prefix, same behavior every time.
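In practice the prefixing is just string concatenation before you call the embedding endpoint. A minimal sketch (the helper name and the exact prefix strings are illustrative; check your model's card for the strings it was trained on):

```typescript
// Prepend the documented task instruction so the model embeds the text
// for that task. The task names below follow the pattern described above.
type Task = 'search_query' | 'search_document' | 'classification' | 'clustering'

function withTaskPrefix(task: Task, text: string): string {
  return `${task}: ${text}`
}

const query = withTaskPrefix('search_query', 'how do I implement vector search')
const doc = withTaskPrefix('search_document', 'This article explains pgvector installation...')
```

The prefixed strings are what you send as `input` to the embeddings API; nothing else changes.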

I Built This With Embeddings

This blog has an AI chat feature. Let me show you how it actually works, because it's almost entirely embeddings.

Build-Time Embedding Generation

At build time, I run a script that processes all blog posts:

// scripts/generate-embeddings.ts
// `texts` holds the chunk strings; `apiKey` is read from the environment.
const response = await fetch('https://openrouter.ai/api/v1/embeddings', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${apiKey}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    model: 'qwen/qwen3-embedding-8b',
    input: texts,
  }),
})
// OpenAI-compatible response shape: one embedding per input, in order
const { data } = await response.json()

The script chunks each post by H2 sections (100-1500 characters per chunk), embeds them all in a single batch API call, and writes the embeddings to a generated TypeScript file. One API call at build time. Costs pennies.

Tiered Semantic Retrieval

When a user asks a question, I embed their query and find relevant content using pure vector math:

// src/lib/chat/retrieval.ts
const scoredChunks = chunks
  .map((chunk) => ({
    ...chunk,
    similarity: cosineSimilarity(queryEmbedding, chunk.embedding),
  }))
  .filter((chunk) => chunk.similarity >= minSimilarity)
  .sort((a, b) => b.similarity - a.similarity)

No LLM involved. Just cosineSimilarity, a function that computes the cosine of the angle between two vectors. Pure math. Deterministic. Free at runtime.
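For reference, cosineSimilarity is a few lines of arithmetic (a typical implementation; the one in the repo may differ cosmetically):

```typescript
// Cosine similarity: dot product divided by the product of the magnitudes.
// Returns 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}
```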

I use a three-tier system with different thresholds:

tier1Threshold = 0.4  // Highly relevant chunks - full text
tier2Threshold = 0.3  // Related posts - summary only
tier3Threshold = 0.25 // Topic matches - just titles

The retrieval is diversity-aware (max 2 chunks per post) so you get breadth, not just the same post over and over.
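The diversity cap can be sketched as a greedy pass over the score-sorted chunks (field names are illustrative; the threshold tiers above apply before this step):

```typescript
// Greedy diversity selection: walk chunks from highest to lowest similarity,
// but admit at most `maxPerPost` chunks from any single post.
type Scored = { postId: string; similarity: number }

function selectDiverse(chunks: Scored[], maxPerPost = 2): Scored[] {
  const counts = new Map<string, number>()
  const picked: Scored[] = []
  for (const chunk of [...chunks].sort((a, b) => b.similarity - a.similarity)) {
    const n = counts.get(chunk.postId) ?? 0
    if (n < maxPerPost) {
      counts.set(chunk.postId, n + 1)
      picked.push(chunk)
    }
  }
  return picked
}
```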

The LLM only comes in at the end, to format a response using the retrieved context. That's the only part that needs generation.

pgvector

You don't need a fancy vector database. PostgreSQL with pgvector does everything you need:

SELECT * FROM posts
ORDER BY embedding <=> query_embedding
LIMIT 10;

The <=> operator is cosine distance. pgvector supports HNSW (fast approximate search) and IVFFlat (partition-based for large datasets). The SQL you already know just works.
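Index setup is a single statement (assuming a posts table with a vector-typed embedding column; syntax per the pgvector docs):

-- HNSW index using cosine distance; use ivfflat instead for very large tables
CREATE INDEX ON posts USING hnsw (embedding vector_cosine_ops);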

It Gets Stupid Fast

Full float32 embeddings are large. A 1536-dimension embedding (OpenAI's ada-002) is 6KB per vector. At a million vectors, that's 6GB just for embeddings.

But you don't need full precision:

halfvec: Float16 instead of float32. Halves storage. Negligible accuracy loss.

Binary quantization: Each dimension becomes a single bit. 1536 dimensions → 192 bytes. That's 32× smaller than float32.
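Binary quantization is simple enough to sketch directly: keep the sign of each dimension as one bit, then compare vectors by Hamming distance instead of cosine (a minimal sketch; pgvector does this natively with its bit type):

```typescript
// Pack each dimension's sign into one bit: 1536 dims -> 192 bytes.
function quantize(embedding: number[]): Uint8Array {
  const bits = new Uint8Array(Math.ceil(embedding.length / 8))
  for (let i = 0; i < embedding.length; i++) {
    if (embedding[i] > 0) bits[i >> 3] |= 1 << (i & 7)
  }
  return bits
}

// Hamming distance over the packed bytes: count of differing bits.
function hammingDistance(a: Uint8Array, b: Uint8Array): number {
  let dist = 0
  for (let i = 0; i < a.length; i++) {
    let x = a[i] ^ b[i]
    while (x) {
      dist += x & 1
      x >>= 1
    }
  }
  return dist
}
```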

The trick is to use binary for coarse filtering, then rerank the top candidates with full precision:

-- Coarse search with binary (fast)
SELECT id FROM posts
WHERE binary_embedding <~> query_binary_embedding < threshold
LIMIT 100;

-- Rerank with full precision (accurate)
SELECT * FROM posts
WHERE id IN (...)
ORDER BY embedding <=> query_embedding
LIMIT 10;

The index you actually scan shrinks from 6GB of float32 vectors to 192MB of binary ones. Query times drop from seconds to milliseconds.

You Don't Even Need a GPU

Cosine similarity is embarrassingly parallel — just multiplying and summing arrays. pgvector uses SIMD automatically.

Benchmark: 1M vectors at 1536 dimensions, ~5ms on a 2020 MacBook. No GPU, no CUDA. Compare that to LLM inference needing H100s at $4/hour.

The Deployment Economics

Qwen3-embedding-0.6B runs in 1GB of RAM. Self-host on a $5/month VPS or a Raspberry Pi.

The math: self-hosted embeddings cost ~$30/month for unlimited queries. Hosted embeddings are $0.02 per 1M tokens. Hosted LLMs are $2-30 per 1M tokens — 100x more expensive, and you're generating far more tokens per interaction.

Embeddings are a rounding error. LLMs are a line item that makes your CFO nervous.

When to Use What

Embeddings: understanding — search, recommendations, clustering, duplicate detection, similarity. "How related are these things?"

LLMs: generation — creating text that didn't exist before. Chat, summarization, creative writing.

Most "AI features" need understanding, not generation. The magic is using both: embeddings for retrieval (fast, cheap, deterministic), LLMs for synthesis (only when needed). That's RAG done right.

Why Nobody Talks About This

Business leaders can't directly use embeddings — no chat interface. LLMs you can demo. Embeddings are infrastructure.

Also: OpenAI, Anthropic, and cloud providers benefit from you using the expensive tool. Embeddings are too cheap to build a growth story around. So LLMs get the hype; embeddings get quietly shipped by engineers who know better.

Multimodal

CLIP and SigLIP put text and images in the same vector space. Search images with text queries, find visually similar images, zero-shot classification — no fine-tuning, no labeled datasets. Just embed and compare.

The Vector Database Grift

Pinecone, Weaviate, Qdrant, Milvus. There's a whole ecosystem of vector databases charging serious money for what pgvector does for free.

When do you actually need a dedicated vector DB?

  • Billions of vectors (not millions, billions)

  • Real-time updates at massive scale

  • Multi-region distributed search

For everyone else: pgvector. It's free, it's in your existing stack, and it works.

The Actual Revolution

Next time someone says "let's add AI," ask what problem they're solving. Nine times out of ten, they want embeddings but don't know it yet.

The best AI is the AI you barely notice. It just works.

Written by Dylan Moore

Self-taught developer since age 13. Sold first software company at 16 for $60K, second for mid-six figures. Founded multiple ventures. Currently founding developer at PodFirst.