The Beauty of 20 Tokens per Second: Why I Use Local AI to Think Slower

We are often obsessed with the fast track.

We want the answer now. The summary now. The draft now. The idea now.

Cloud AI models are designed to satisfy that anxiety. They are engineering marvels capable of producing neat paragraphs, structured plans, and complete documents in a matter of seconds. Before you even finish your coffee, the machine can already give you something that looks finished.

And that is exactly where the problem lies.

Lately, I have started to feel a sort of cognitive vertigo when working with models that are too fast. The answer arrives so quickly, so confidently, and so completely that I often feel I am entering the conversation too late. The model has already made decisions. It has already framed the problem. It has already chosen what was important and what wasn’t.

I am left in the role of reviewing a result, not participating in the thinking process.

That is where the Zone of Non-Development begins.

In my previous article, I defined the Zone of Non-Development as that space where AI makes the task easier for us, but also removes the friction that would have helped us grow. It is the place where efficiency becomes so comfortable that, without us realizing it, it replaces learning.

So, to avoid that zone, I didn’t look for a faster model.

I looked for another rhythm.

I went local.

My Homelab as a Cognitive Regulator

My current thinking environment is not just a subscription or a browser tab. It is also a physical machine that sits in my office, right in front of me: a Beelink SER9 Pro with an AMD Ryzen AI 9 HX 370, configured to dedicate a good portion of its memory to local AI workloads.

I use tools like Ollama, OpenWebUI, n8n, and OpenClaw to run local models, connect them to my own knowledge base, and experiment with different workflows.

When I load heavy models, I don’t get the instantaneous speed of a large data center. I typically see a steady generation rate of around 20 tokens per second.
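To put that number in perspective, here is a rough back-of-the-envelope sketch of how long a response takes to appear at a given token rate. The ratio of 1.3 tokens per English word is an assumption (it varies by tokenizer and by text), and the 200 tokens per second used for comparison is a purely hypothetical figure, not a measurement of any particular service.

```python
def generation_seconds(words: int, tokens_per_second: float,
                       tokens_per_word: float = 1.3) -> float:
    """Rough time needed to generate `words` words at a given token rate.

    tokens_per_word is an assumed average, not a measured value.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return words * tokens_per_word / tokens_per_second


# An 800-word response at my local rate versus a hypothetical faster rate.
local = generation_seconds(800, tokens_per_second=20)
faster = generation_seconds(800, tokens_per_second=200)  # hypothetical
print(f"local: about {local:.0f} s, faster: about {faster:.1f} s")
```

Under these assumptions, the same response takes most of a minute to unfold locally instead of a handful of seconds.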

In a world obsessed with performance, that might look like a limitation.

For me, it became a cognitive regulator.

At that speed, the AI doesn’t feel like a vending machine delivering a finished product. It feels more like a visible thinking partner. The text appears at a rhythm that allows me to read, question, compare, interrupt, and think along with the model.

The machine doesn’t simply generate.

It unfolds.

And because it unfolds slowly enough, I can keep participating.

Watching the Machine Think

One of the most important changes in my way of working was being able to observe the model’s reasoning patterns as they develop.

With local models, especially when I enable verbose mode or use reasoning-oriented prompts, I can see where the answer is going while it is still forming. I can notice what categories the model is building, what assumptions it is making, what evidence it is prioritizing, and at what point it begins to deviate.

This matters because AI errors don’t just appear at the end.

Sometimes the problem starts in the very first lines.

The model might have misunderstood the task. It might overgeneralize. It might choose the wrong framework. It might start writing something that sounds correct but is clearly drifting away from the actual purpose.

When the model is too fast, I often only see the error after it has produced a polished 800-word response. At that point, I am no longer guiding the thought process. I am cleaning up the result.

But when generation happens at human speed, I can detect the detour earlier.

I can stop it.

I can redirect it.

I can say: “No, this isn’t the point.”

That small gesture matters. The Stop button ceases to be a technical feature and becomes a symbol of intellectual agency.

I am not passively waiting for the machine to finish.

I am actively deciding if the line of thought is still useful.
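The idea of stopping a response mid-stream can be sketched in a few lines. This is only an illustration, not how any particular tool implements its Stop button: the function name `stream_with_stop` is hypothetical, the list of strings stands in for a real token stream, and the `should_stop` predicate stands in for a human reading along and deciding to interrupt.

```python
from typing import Callable, Iterable, Tuple


def stream_with_stop(tokens: Iterable[str],
                     should_stop: Callable[[str], bool]) -> Tuple[str, bool]:
    """Consume tokens one at a time and stop as soon as should_stop says so.

    Returns the text seen so far and whether the stream was interrupted.
    """
    text = ""
    for token in tokens:
        text += token
        if should_stop(text):
            return text, True  # interrupted before the stream finished
    return text, False  # the stream ran to completion


# Stand-in for a model response that drifts away from the actual purpose.
fake_stream = ["The ", "key ", "metric ", "is ", "raw ", "speed ",
               "and ", "nothing ", "else."]

partial, stopped = stream_with_stop(fake_stream, lambda t: "raw speed" in t)
print(repr(partial), stopped)
```

The point of the sketch is that the decision happens while the text is still arriving, so the remaining tokens are never generated or read.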

Human-in-the-Loop is Not a Box to Check

In many conversations about AI, human-in-the-loop sounds like a governance term. It usually means that a person reviews the result before approving it.

That is important, but it is not enough.

For me, true human-in-the-loop means taking part during the process, not just at the end. It means being close enough to the generation to be able to modify the direction while the thought is still alive.

That is where local AI changed my relationship with the tool.

The slower pace forces me back to work. I cannot outsource the entire cognitive process and come back later just to approve it. I have to read. I have to compare. I have to decide. I have to notice when the model is helping me think and when it is trying to think for me.

That difference is key.

Because the goal is not just to produce better content.

The goal is to become a better thinker during the process.

The Speed Cap We Don’t Have Yet

Just as I can cap the FPS (frames per second) in a video game to get a more stable experience, I sometimes wish I could cap the TPS (tokens per second) of online models.

Not because the model can’t go faster.

But because I don’t always want it to go faster.

In gaming, an FPS limit can reduce noise, temperature, power consumption, and unnecessary stutters. It isn’t always about squeezing out every possible frame. Sometimes it is about achieving a more fluid, stable, and controlled experience.

Something similar happens to me with AI.

I don’t always need the model to hand me a complete answer in three seconds. Sometimes I want to see how it is built. I want to be able to read as it appears. I want to detect if the reasoning starts to deviate before it becomes a perfectly written but conceptually flawed wall of text.

I would like to be able to tell an online model: “Don’t rush. Go at 20 tokens per second.”

That limit wouldn’t be a performance failure. It would be a cognitive design decision.

It would be a way to protect my attention, my judgment, and my participation in the process.

Because the problem is not that AI is fast. The problem is when speed leaves us out of the thinking process.
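Until such a setting exists, a cap like this could be approximated on the client side by pacing how fast streamed tokens are displayed. The following is a minimal sketch, assuming the tokens arrive as a plain Python iterable; `cap_tps` is a hypothetical helper, not an option offered by any provider, and it only slows down what I see, not what the server generates.

```python
import time
from typing import Callable, Iterable, Iterator


def cap_tps(tokens: Iterable[str], max_tps: float,
            clock: Callable[[], float] = time.monotonic,
            sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield tokens no faster than max_tps tokens per second."""
    if max_tps <= 0:
        raise ValueError("max_tps must be positive")
    interval = 1.0 / max_tps
    next_release = clock()  # the first token is released immediately
    for token in tokens:
        now = clock()
        if now < next_release:
            sleep(next_release - now)  # wait until this token is due
        yield token
        # Schedule the next token; if the source was slow, do not burst later.
        next_release = max(next_release, clock()) + interval


# Usage: display a fast stream at a reading-friendly pace.
for tok in cap_tps(["Slow ", "down."], max_tps=20):
    print(tok, end="", flush=True)
print()
```

The `clock` and `sleep` parameters exist only so the pacing can be tested without real waiting; in normal use the defaults are enough.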

A Second Brain and Many Hats

This slower workflow works because the model doesn’t start from scratch.

My local setup is connected to what I think of as my Second Brain: an ecosystem of notes, documents, Obsidian vaults, curated files, LLM Wikis, and structured knowledge that gives the AI better context.

I usually call this AI Nutrition.

The quality of the response depends heavily on the quality of what feeds the system. If I give the model messy, outdated, or generic context, I shouldn’t be surprised if the result is superficial. But when I give it curated knowledge, clear boundaries, and well-designed instructions, the model becomes much more useful.

This is also where Mini Brains come in.

A Mini Brain is a small, structured file that defines how an AI should behave for a specific task. It gives the model a role, a purpose, rules, boundaries, and a knowledge framework. In practical terms, it allows me to change lenses without having to rebuild the entire context from scratch.
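As an illustration only, a Mini Brain could be modeled as a small data structure that renders into a system prompt. The field names mirror the elements listed above (role, purpose, rules, boundaries, knowledge framework), but the class, its method, and the example content are hypothetical and do not describe the actual file format of my setup.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MiniBrain:
    """Hypothetical in-memory form of a Mini Brain file."""
    role: str
    purpose: str
    rules: List[str] = field(default_factory=list)
    boundaries: List[str] = field(default_factory=list)
    knowledge: List[str] = field(default_factory=list)

    def to_system_prompt(self) -> str:
        """Render the Mini Brain as plain text usable as a system prompt."""
        lines = [f"Role: {self.role}", f"Purpose: {self.purpose}"]
        sections = (("Rules", self.rules),
                    ("Boundaries", self.boundaries),
                    ("Knowledge", self.knowledge))
        for title, items in sections:
            if items:  # skip sections that were left empty
                lines.append(f"{title}:")
                lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Example content, invented for illustration.
auditor = MiniBrain(
    role="Critical auditor",
    purpose="Challenge a proposal before it is shared.",
    rules=["Challenge every unsupported claim."],
    boundaries=["Do not rewrite the proposal."],
)
print(auditor.to_system_prompt())
```

Keeping each hat in its own small structure is what makes switching lenses cheap: only the system prompt changes, while the conversation and the underlying knowledge stay in place.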

I can load a critical auditor Mini Brain when I need a proposal to be challenged.

I can switch to a learning experience designer Mini Brain when I want to identify points of pedagogical friction.

I can use a systems architect Mini Brain when I need to evaluate whether an idea is technically viable.

Each Mini Brain is a different hat.

But since the conversation moves at human speed, changing hats doesn’t feel like sending isolated orders to a machine. It feels like shifting perspectives within the same thinking process.

That is the real value.

The AI doesn’t just give me answers. It helps me move between different modes of thinking.

Productive Friction in a Frictionless World

We are often told that friction is the enemy of productivity.

In some cases, it’s true. Bad friction wastes time. Confusing tools, broken processes, unclear instructions, and unnecessary bureaucracy make work harder without making it better.

But not all friction is bad.

In learning, a certain amount of friction is necessary. We need resistance. We need tension. We need moments where the answer doesn’t appear immediately, because those moments force us to organize ideas, test assumptions, and build understanding.

I call that productive friction.

My local AI setup gives me back that friction.

It slows down the process enough to keep me present. It prevents me from becoming a passive consumer of machine-generated text. It makes using AI feel less like receiving an answer and more like participating in a dialogue.

And that changes the outcome.

When the process ends, I don’t just have a draft, a summary, or a plan.

I have a clearer understanding of the problem.

I have a firmer idea of what I think.

I have learned something.

Why Slower Can Be Better

I’m not saying everyone needs a homelab.

Nor am I saying that cloud AI is bad.

Cloud models are incredibly useful. They are powerful, accessible, and often the best choice when we need speed, scale, or convenience. I use them too. They are part of my workflow.

But I no longer believe that faster is always better.

Sometimes, faster means thinking less.

Sometimes, an instantaneous response shrinks the space where reflection should occur.

Sometimes, the most valuable AI system isn’t the one that gives you an immediate answer, but the one that keeps you engaged long enough to understand how that answer is being built.

For me, 20 tokens per second became a reminder.

A reminder to slow down.

A reminder to keep participating.

A reminder that the purpose of AI is not to remove me from the process, but to help me become more capable within it.

Because true success is not finishing the task faster.

True success is being more capable than when you started.