AI and Testing: Ollama and Models

In this post I want to take the initial steps to get some basic tooling available and operating. This is step one if you’re going to work in a technologist context with AI applications.

If you want to play along as these posts continue, you should know how to get a Python version installed on your system and be able to execute scripts from the command line. (Or, at least, from your IDE of choice.) With Python, it’s always recommended that you use a virtual environment. I recommend using the tool uv for all your Python needs. That said, it’s not necessary for this series as I want to focus on AI tooling and not Python tooling.

I should note that uv can actually install Python for you if you don’t have it and also handle all dependencies that you might need. If you’re going to spend any time at all in Python, I highly recommend spending a little time with uv. This is no different than getting used to npm or yarn in a JavaScript/TypeScript context.

Ollama

To follow along, you’re going to want to install Ollama on your machine. But why? What is this good for?

Ollama is best understood as a local runtime for large language models, designed to make running them feel as ordinary as running any other development dependency. Instead of treating LLMs as remote services behind APIs and accounts, Ollama treats them as artifacts you can download, inspect, and run on your own machine.

You pull a model, start it, and interact with it: no cloud account required, no infrastructure to manage. In that sense, Ollama does for language models what package managers did for programming libraries: it makes experimentation routine rather than exceptional.

What makes Ollama especially useful for learning is that it exposes the shape of a model without forcing you to understand all the math up front, or even at all. You can see parameters, context length, quantization, and generation settings. You can also change how the model behaves without rebuilding it from scratch.

The main point here is that Ollama doesn’t try to hide complexity so much as stage it, letting you approach local AI incrementally: first as a conversational tool, then as a programmable component, and eventually as part of a small application. It turns “running an LLM” from a research activity into a developer or tester habit.

Getting Ollama Set Up

In terms of getting set up, go to the Ollama downloads and get an installable for your machine. This will install an app and, crucially, when that app is running, you will have a server running at http://localhost:11434/. This server will be up as long as the app is running.

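For Windows, it’s that simple. On a Mac, the app gives you the same server while it’s open. If you would rather run Ollama as a background service on a Mac, you can also install it using Homebrew:

  brew install ollama

With the Homebrew install in place, you can start Ollama as a service like this: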

  brew services start ollama

To then stop the service, besides closing the app, you can do this:

  brew services stop ollama

Once you have Ollama up and running on your operating system, you can go to that URL in your browser or simply do a curl:

  curl http://localhost:11434

You should get “Ollama is running” returned. You could also test this by issuing a GET request through a tool like Postman, of course.

That last point is important to understand because all of these language models are predicated upon an API of some sort. This means if you’ve done API testing in the past, you can leverage all that knowledge, and that tooling, in this context.
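Since this series will lean on Python, here is a minimal sketch of that same health check as a script. It uses only the standard library. The function names are my own, chosen for illustration, and the only things it assumes about Ollama are the root URL and the “Ollama is running” text described above.

```python
import urllib.error
import urllib.request

# The default local address mentioned above.
OLLAMA_URL = "http://localhost:11434"


def looks_like_ollama(body: str) -> bool:
    """Return True if the response body contains the expected health message."""
    return "Ollama is running" in body


def is_ollama_running(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """GET the root URL and report whether an Ollama server answered."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return looks_like_ollama(response.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        # Nothing listening, connection refused, or the request timed out.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", is_ollama_running())
```

If the app or service is stopped, the script should simply report False rather than raise, which makes it a handy precondition check before any later test run.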

Ollama Models

You can check out the models that Ollama makes available. You’re going to want to download certain models in order to experiment. When doing this, you need to take into account what your machine can handle.

An Apple M1 Mac or a Windows machine with a good NVIDIA GPU will generally serve you best the further up the model chain you go.

Ah, But Which Model?

A lot of people, when starting out, are often told to look at models called Qwen or DeepSeek. But why? And what’s the difference?

When comparing Qwen and DeepSeek (or, really, any models), it helps to think of them as emphasizing different learning experiences, even though they share modern architectural ideas like Mixture-of-Experts as well as reasoning behavior, the latter of which is tied to what’s called chain-of-thought.

Qwen models tend to be more general-purpose and approachable. They behave like well-rounded assistants that are good at everyday tasks: chatting, summarizing, basic coding, and experimentation. For someone learning how to run local LLMs, wire them into small apps, and explore prompts, Qwen often feels predictable and forgiving. I would characterize it as a solid “daily driver” model that works well across many scenarios.

A “daily driver” refers to a specific, often smaller, AI model that a user relies on for routine, everyday tasks like drafting emails, summarizing notes, brainstorming, or coding snippets, treating it as their primary, go-to AI assistant on their own hardware for privacy and convenience.

DeepSeek, by contrast, leans more heavily into explicit reasoning. Many DeepSeek models are trained to break problems down step by step, especially for math, logic, or structured problem-solving. This can be powerful, but it can also feel heavier and more opinionated. If your goal is to study how reasoning-oriented models behave, or to build apps that benefit from deliberate, multi-step thinking (like analyzers, planners, or tutors), DeepSeek can be a great choice. For pure learning and experimentation, though, it may feel like starting with a race car rather than a reliable sedan.

How I frame this for people starting out is that Qwen is often the better starting point for learning local LLM workflows, while DeepSeek shines when you specifically want to explore or leverage structured reasoning.

Demystifying Some Terms

Earlier I brought up Mixture-of-Experts and chain-of-thought because those terms get tossed around a lot, particularly when people are choosing models. However, as in most technical disciplines, I find the terms get thrown around without a lot of context. Particularly of the “why should I care about this?” variety.

What I can say for our purposes here is that both models may internally use Mixture-of-Experts, but that’s an engine detail, not a driving experience. What the user actually feels is a mix of the objective and the subjective:

  • How predictable responses are
  • How well the model handles vague vs structured prompts
  • How easy it is to integrate into small tools without prompt gymnastics

For someone using Ollama to learn and build, I would offer the following guidance. Start with Qwen if you want a smooth onboarding into local LLMs, a mix of general chat, coding, and app integration, and fewer surprises in response style. Try DeepSeek if you want to study reasoning behavior explicitly, generate step-by-step outputs for logic, math, or analysis, or experiment with “thinking-heavy” applications.

There’s a subtle but important clarification, worth stating plainly. Chain-of-thought is not a feature you get. It’s a behavior you observe. Mixture-of-Experts is not a behavior you see. It’s an efficiency choice under the hood.

The question you’re really asking about models is this: “Do I want a model that feels general and flexible, or one that feels deliberate and analytical?” For learning local LLMs and building small apps, I would argue that starting general and moving toward specialized is usually the saner path. This is why Qwen tends to be the better first stop, with DeepSeek as an excellent second exploration.

Model Details

For purposes of getting this series started, I’m going to look at the qwen3 model and, more specifically, what is currently qwen3:latest. If you look at that model, you’re going to see some specifics called out.


parameters:   8.19B
quantization: Q4_K_M

What is all that? Well, you’re essentially looking at how big the model’s brain is and how tightly that brain has been compressed so it fits on your machine. Nothing mystical. Just scale and compression. But let’s dig in a bit.

Regarding the parameters, these are the adjustable values inside the model: basically its learned memory. Here, 8.19B means 8.19 billion parameters. Broadly speaking, more parameters means more capacity to represent patterns while fewer parameters means faster and lighter, but less nuanced.

You can think of parameters like knobs on a giant soundboard. A small board has fewer knobs, which means quicker to adjust, simpler sound. A big board has many knobs, which means richer sound, but harder to manage.

An 8B-class model is widely considered a sweet spot for local LLMs. This is the case because it’s big enough to feel “smart” but small enough to run on consumer hardware. Thus, it’s ideal for learning, experimenting, and small apps. That’s why you’ll see a lot of people recommend the 7 to 9B models as a starting point.

Now, let’s talk about that quantization of Q4_K_M. This one looks scary, but it’s really about how aggressively the model has been compressed. By default, model parameters are stored as high-precision numbers (think: very detailed decimals). Quantization reduces that precision to save memory and run faster. It’s sort of like the original model is a full-resolution, uncompressed image while the quantized model is a high-quality JPEG of that image. Still good. Just smaller. That’s great and all, but let’s decode it piece by piece.

  • The “Q4” means 4-bit quantization, which means each parameter uses 4 bits instead of 16 or 32. The effect of this is a much smaller memory footprint, faster inference, but a slight loss in precision.
  • The “K” refers to a k-quant scheme, which is really just a more advanced quantization method. It preserves important weights more carefully than older Q4 formats. The effect of this is better quality than so-called “naive” 4-bit compression and also less noticeable degradation in responses.
  • The “M” means “medium” balance, where the balance is between speed, memory, and quality.

So, with all this said, you can imagine the model as a library. The parameters (8.19B) are the number of books. The quantization (Q4_K_M) is how tightly the books are packed on the shelves. Q4_K_M doesn’t remove books from the library. It just uses thinner paper, a smaller font, and maybe some smarter shelving. You still have the same ideas, just stored more efficiently.
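To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes every parameter takes exactly the stated number of bits, which is a simplification: a real Q4_K_M file averages a little more than 4 bits per weight, and running a model needs memory beyond the weights themselves. The function name is mine.

```python
def weight_size_gb(parameters: float, bits_per_parameter: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    total_bytes = parameters * bits_per_parameter / 8
    return total_bytes / 1e9


PARAMETERS = 8.19e9  # the 8.19B figure shown for the model

# Compare full precision, half precision, and the simplified 4-bit case.
for bits in (32, 16, 4):
    print(f"{bits:>2}-bit: about {weight_size_gb(PARAMETERS, bits):.1f} GB")
```

Under those assumptions, the same 8.19 billion parameters go from roughly 33 GB at 32 bits, to roughly 16 GB at 16 bits, to roughly 4 GB at 4 bits. Same number of books, much less shelf space.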

With that context in place, why is this combo popular in Ollama? An 8.19B model with Q4_K_M typically means a few things:

  • Runs comfortably on most modern computers
  • Fits into 8 to 12 GB of RAM (sometimes less)
  • Fast enough for interactive use
  • Good quality for chat, coding, and small tools

In other words, this configuration is chosen so you can actually use the model, not just admire it.

Download the Model

So, let’s go ahead and get that model. If you have Ollama on your machine and running, you can check what models you already have (which should be none if this is a first-time install):

  ollama list

To get the model, you can just do this:

  ollama run qwen3

This will download the model, and please do note the sizes of the models in terms of your bandwidth! Once downloaded, you will be placed “in” the model. You are effectively using a REPL, as it were. You could type into this model just as you would using ChatGPT or Gemini or Claude. You can leave the REPL with: /bye.

If you want to run the model again, just do ollama run qwen3. It won’t redownload.

Once back at your command prompt, you can run the ollama list command again and you’ll see that you have the model locally stored.

You can delete the model you just downloaded by running ollama rm qwen3. This is handy when you experiment with a lot of models but want to clean up and save some space.
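The same inventory is available over the API, which matters if you want a script to verify that a model is present before using it. The sketch below assumes the server’s /api/tags endpoint, which returns a JSON object with a “models” list. The sample payload is made up so the parsing can be tried without a server, and the function names are my own.

```python
import json
import urllib.request


def summarize_models(payload: dict) -> list[str]:
    """Turn an /api/tags style payload into 'name (size GB)' strings."""
    lines = []
    for model in payload.get("models", []):
        size_gb = model.get("size", 0) / 1e9
        lines.append(f"{model.get('name', '?')} ({size_gb:.1f} GB)")
    return lines


def fetch_local_models(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags from a running Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


# Made-up sample data; the size here is illustrative, not the real figure.
sample = {"models": [{"name": "qwen3:latest", "size": 5_200_000_000}]}
print(summarize_models(sample))
```

With Ollama running, summarize_models(fetch_local_models()) would give you the live list instead of the sample.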

Testing Your Model

Beyond running prompts in the REPL, you can generate API calls to the model.

curl http://localhost:11434/api/generate -d '{
  "model": "qwen3",
  "prompt": "What is the speed of light in a vacuum?",
  "stream": false
}'

This is no different from any API request you would make if you do any sort of API testing. One detail worth noting here is that certain endpoints stream their responses as a series of JSON objects, which means your output can be quite extensive. If stream is set to false, as I’m doing above, the response will be a single JSON object.
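For those who would rather script this than use curl, here is a minimal Python sketch of the same request, plus a helper showing how streamed output could be stitched back together. It assumes each streamed line is a JSON object carrying a “response” text fragment. The function names and the sample lines are mine, written by hand for illustration.

```python
import json
import urllib.request

GENERATE_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str, stream: bool = False) -> urllib.request.Request:
    """Build the same POST request the curl command sends."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": stream})
    return urllib.request.Request(
        GENERATE_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def join_streamed_text(lines: list[str]) -> str:
    """Stitch the 'response' fragments of a streamed reply back together."""
    chunks = [json.loads(line) for line in lines if line.strip()]
    return "".join(chunk.get("response", "") for chunk in chunks)


def generate(model: str, prompt: str) -> str:
    """Send a non-streaming request and return only the generated text."""
    request = build_generate_request(model, prompt, stream=False)
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.loads(response.read().decode("utf-8"))["response"]


# Hand-written lines in the shape of a streamed reply, for illustration only.
streamed = [
    '{"response": "About 299,792", "done": false}',
    '{"response": " km/s.", "done": true}',
]
print(join_streamed_text(streamed))
```

Calling generate("qwen3", "What is the speed of light in a vacuum?") requires the server to be up. The streaming helper runs on its own, so you can test your parsing logic without a model loaded.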

Exploring Your Model

You can get details about the model you downloaded.

  ollama show qwen3

Let’s dig into a bit of what you’re seeing there.


context length      40960
embedding length    4096

With this, you’re seeing how much the model can “keep in mind at once” and how it internally represents meaning. One is about memory over time. The other is about depth of understanding.

Starting with context length, this is the maximum amount of text the model can consider in a single conversation or request. 40,960 tokens is a pretty long memory.

I should note that “tokens” are pieces of text, roughly 3/4 of a word in English.
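That ratio makes for easy mental math, and a tiny sketch shows it. The 0.75 figure is only the rough English rule of thumb above; real token counts depend on the model’s tokenizer and on the text itself.

```python
WORDS_PER_TOKEN = 0.75  # the rough English ratio mentioned above


def approx_words(tokens: int) -> int:
    """Estimate how many English words fit in a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def approx_tokens(words: int) -> int:
    """Estimate how many tokens a given number of English words will use."""
    return round(words / WORDS_PER_TOKEN)


print(approx_words(40960))  # the context length shown for this model
print(approx_tokens(1500))  # a 1,500-word article
```

By this estimate, a 40,960-token context holds about 30,720 words, and a 1,500-word article costs about 2,000 tokens.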

Context length is like the size of the model’s whiteboard. A small context means short notes, frequent erasing. A large context means you can spread out diagrams, notes, and references all at once. This is the infamous “context window” that you’ll often hear about in discussions regarding using LLMs.

A 40k context window means the model can remember long conversations, read or reason over large documents, and work with codebases, logs, or transcripts without constantly “forgetting” earlier parts. For local models, this is very generous and especially useful for learning and experimentation.

Embedding length is subtler. An embedding is how the model converts text into numbers that capture meaning. The embedding length is the size of that numeric representation or, more precisely, the number of dimensions in the vector. Here, 4096 dimensions is a fairly rich semantic representation.

I should note that in my text classification series, I dig into these concepts quite a bit.

Think of embeddings like coordinates in meaning-space. If context length is how much the model can see, embedding length is how precisely it can place ideas relative to one another. Short embeddings mean rough locations (“kind of like this”) while longer embeddings mean finer distinctions (“like this, but not like that”). A 4096-dimensional embedding allows the model to better distinguish similar concepts, track nuance in language, and support search, retrieval, and semantic comparison more effectively.
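To make “coordinates in meaning-space” a little more concrete, here is a toy sketch using cosine similarity, a common way of comparing embedding vectors. The three-dimensional vectors are invented for illustration. Real embeddings from this model would have 4096 dimensions and would come from the model itself, not from me typing numbers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 means 'pointing the same way'."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings"; the values are assumptions for illustration only.
cat = [0.9, 0.8, 0.1]
kitten = [0.85, 0.9, 0.15]
invoice = [0.1, 0.2, 0.95]

print(round(cosine_similarity(cat, kitten), 3))
print(round(cosine_similarity(cat, invoice), 3))
```

The first score comes out much higher than the second, which is the “like this, but not like that” distinction in numeric form. More dimensions give the model more room to draw such distinctions.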

A useful analogy is perhaps map versus territory. Context length is how much of the map is unrolled on the table. Embedding length is how detailed the map’s coordinates are. You can have a huge map with blurry symbols, or a sharp map that only shows a small area. This model gives you both a reasonably wide table and a reasonably high-resolution map.

A gentle rule of thumb is that if you care about long conversations or documents, context length matters. If you care about search, similarity, or semantic nuance, embedding length matters. This Qwen configuration gives you a reasonable amount of both, which is one reason it’s a solid choice for learning and small local applications.

Does This All Matter?

Why do these numbers matter for local LLM users? For someone learning local LLMs and building small apps, this combo means that you can experiment with long prompts, document-based workflows, and multi-step reasoning without truncation. This further means that you can build simple RAG-style tools, note analyzers, and chatbots that don’t forget earlier instructions. This is all without needing enterprise hardware!

Beyond that point, what this does is give you some idea of how to interpret these values when you look at any model you might care to download and use. In fact, let’s look at another set of values worth understanding.

Generation Values

As part of the Ollama “show” command, you will also see this:


  Parameters
    stop              "<|im_start|>"
    stop              "<|im_end|>"
    temperature       0.6
    top_k             20
    top_p             0.95
    repeat_penalty    1

At the risk of simplifying, these settings control how the model chooses its next word. This is not what it knows but how cautiously, creatively, or repetitively it speaks. If you want to anthropomorphize a bit, think of this as the personality and impulse control panel. Let’s decode the list, taking the settings in a natural order rather than the order shown.

Temperature controls randomness. Lower means more predictable; higher means more creative. At 0.6, the model is calm, focused, and slightly creative, but not wild. Temperature is like turning the volume knob on spontaneity. A value of 0.2 is careful, literal, almost boring; 0.6 is thoughtful, balanced; 1.0+ is improvisational jazz. For learning, coding, and small apps, 0.6 is a very sane default.
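A small sketch can show what that knob does mathematically. The usual approach is to divide the model’s raw scores (logits) by the temperature before converting them to probabilities with a softmax. The three scores below are made up, and I’m not claiming this is line-for-line what Ollama runs internally.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens

for temperature in (0.2, 0.6, 1.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lowest temperature nearly all of the probability piles onto the top-scoring token. As the temperature rises, the other candidates get a more realistic chance of being picked.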

Top-K limits how many word choices the model considers at each step. A Top-K of 20 means that only the twenty most likely next tokens are eligible. This is like saying: “Don’t consider every word in the dictionary, just the most plausible ones.” A lower Top-K means more conservative and more repetitive. A higher Top-K means more expressive and more risk of drifting. A value of 20 is restrained but not rigid.

Top-P is a probability cutoff. The model considers the smallest set of tokens whose combined probability is greater than or equal to 95%. So, if Top-K is “pick from the top 20 options,” Top-P is “pick from the options that together account for almost all reasonable choices.” Top-P at a value of 0.95 encourages variety, avoids very unlikely words, and smooths out responses.

It’s perhaps worth noting that Top-K and Top-P work together, not in competition.
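Here is a minimal sketch of the two filters cooperating. The candidate words and probabilities are invented, and I use a smaller k and p than the model’s defaults so that both filters visibly do something on a five-word list. The ordering (Top-K, renormalize, then Top-P) is one reasonable arrangement and an assumption on my part, not a statement about Ollama’s internals.

```python
def renormalize(probs: dict[str, float]) -> dict[str, float]:
    """Rescale the surviving probabilities so they sum to 1 again."""
    total = sum(probs.values())
    return {token: value / total for token, value in probs.items()}


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep only the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose probabilities sum to at least p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, running = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        running += prob
        if running >= p:
            break
    return kept


# Invented next-word probabilities for "The sky is ..."
probs = {"blue": 0.50, "gray": 0.25, "clear": 0.15, "green": 0.07, "loud": 0.03}

after_k = renormalize(top_k_filter(probs, k=3))
candidates = renormalize(top_p_filter(after_k, p=0.8))
print(candidates)
```

Top-K trims the list to a fixed size regardless of how confident the model is. Top-P trims it based on how the probability is actually distributed. Whatever survives both is what the model samples from.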

The repeat penalty controls how strongly the model avoids repeating itself. A value of 1.0 is neutral (meaning, no penalty). Anything greater than 1.0 means the model is discouraged from repetition, while anything lower than 1.0 means the model will allow repetition more easily. Repeat penalty is like telling the model: “If you’ve already said this, think twice before saying it again.” A value of 1 means: “Don’t interfere” and thus let the model’s training handle repetition naturally. That’s usually fine for modern models like Qwen.
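As a sketch of the idea, one common formulation adjusts the raw score of any token that has already appeared: positive scores are divided by the penalty and negative scores are multiplied by it, so either way the token becomes less likely when the penalty is above 1.0. The scores below are made up, and I’m presenting this as an illustration of the concept rather than as Ollama’s exact code.

```python
def apply_repeat_penalty(logits: dict[str, float], seen: set[str], penalty: float) -> dict[str, float]:
    """Return a copy of the scores with already-seen tokens made less likely."""
    adjusted = {}
    for token, score in logits.items():
        if token in seen:
            # Push the score toward "less likely" in both sign cases.
            score = score / penalty if score > 0 else score * penalty
        adjusted[token] = score
    return adjusted


logits = {"the": 2.0, "cat": 1.0, "sat": -0.5}  # made-up scores
seen = {"the", "sat"}  # tokens already generated

print(apply_repeat_penalty(logits, seen, 1.0))  # neutral: scores unchanged
print(apply_repeat_penalty(logits, seen, 1.2))  # seen tokens pushed down
```

With a penalty of 1.0 the scores pass through untouched, which is exactly the “don’t interfere” setting this model ships with.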

The stop tokens are hard stop signals. They tell the system: “If the model emits this token, stop generating immediately.” Why do these even exist? Language models don’t naturally know when to stop speaking. You’ve probably seen this if you’ve used any LLM: the model keeps talking long after it has answered the question or starts to repeat itself. These markers act as hard boundaries to mitigate that.

It’s also worth noting that some models use special internal markers to separate instructions (to the LLM) and responses (to the user). Without stop tokens those markers can occasionally appear in the output. By defining them as stop signals, the system ensures that responses end cleanly and don’t leak internal “thinking.”

The stop tokens are about correctness and cleanliness of operation, not creativity of output.
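The effect is easy to mimic on plain text. The sketch below cuts a string at the earliest stop marker it finds. A real runtime halts during generation rather than trimming afterwards, so treat this as an illustration of the outcome, not the mechanism. The raw string is invented.

```python
STOP_SEQUENCES = ("<|im_start|>", "<|im_end|>")  # the two stop values listed above


def truncate_at_stop(text: str, stops: tuple[str, ...] = STOP_SEQUENCES) -> str:
    """Cut the text at the earliest stop sequence, if any appears."""
    cut = len(text)
    for stop in stops:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Invented example of output that ran past its natural end.
raw = "The speed of light is about 299,792 km/s.<|im_end|><|im_start|>user"
print(truncate_at_stop(raw))
```

If you are testing an integration, asserting that these markers never show up in user-facing output is a cheap and useful check.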

Everything we looked at here falls under what I’m calling “generation values,” more commonly referred to as inference parameters or sampling parameters (and sometimes, loosely, as hyperparameters). Taken together, these are settings that control the behavior of the model while it’s producing text. These values don’t change the model’s underlying knowledge (its “weights”), but rather dictate how it selects the next word (token) from its probability distribution.

Again, as I indicated earlier, these settings don’t change what the model knows; they change how carefully it speaks.

These defaults are good for beginners because the particular configuration for our Qwen model basically says: “Be clear. Be steady. Don’t hallucinate. Don’t ramble. Don’t get weird.” Which is exactly what you want when learning local LLM behavior, writing prompts, building small tools and scripts, and debugging output. You can always loosen things later.
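When you do want to loosen or tighten things, you don’t have to touch the model. The generate endpoint accepts an “options” object in the request body, so settings can be overridden per request. Here is a sketch of building such a payload. The default values are copied from the listing above, while the helper name and the prompt are mine.

```python
import json

# Defaults copied from the model's listed parameters.
DEFAULTS = {"temperature": 0.6, "top_k": 20, "top_p": 0.95, "repeat_penalty": 1.0}


def build_payload(prompt: str, **overrides: float) -> dict:
    """Build a request body, replacing any default setting passed as a keyword."""
    options = {**DEFAULTS, **overrides}
    return {"model": "qwen3", "prompt": prompt, "stream": False, "options": options}


# A more literal, less adventurous request than the default.
payload = build_payload("Summarize this log file.", temperature=0.2)
print(json.dumps(payload, indent=2))
```

That JSON is what you would POST to the server. From a testing standpoint, this is useful because it lets you pin down settings per test case instead of relying on whatever defaults a model happens to ship with.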

Next Steps!

Okay, so, this was a dive into the concepts behind running your own local model along with the tooling that makes this possible. I went through all this because without local LLMs, you have to use public ones available in the cloud. And that tends to translate to cost, such as paying for credits to use the LLM past a certain point. I want everything to be free for this series.

Now that we have this set up, the next post is going to exercise our local model with some code, which is where we’ll get into Python. So, if you want to play along, make sure you have Python ready to go.

This article was written by Jeff Nyman
