AI and Testing: Knowledge Graphs and Ontologies

If you’ve been following this series, you’ve seen how local LLMs can be used for everything from basic inference to evaluation frameworks. This post takes a different angle. Rather than asking what a model knows, we’re going to ask how we can take what a model reads and turn it into structured, queryable knowledge.

That is the promise of knowledge graphs: not just storing facts, but storing the relationships between facts in a way that supports traversal, inference, and grounded reasoning. Combined with a local LLM acting as an extraction engine, you get a pipeline where raw text becomes a machine-queryable knowledge structure, all without sending anything to an external API.

If that sounds abstract, consider the practical stakes. As AI tooling becomes embedded in software pipelines, testers will be increasingly expected to understand not just whether a system produces correct output, but why it reasons the way it does. Knowledge graphs make the reasoning substrate visible and testable. Being able to build, query, and validate a knowledge structure puts you ahead of the curve, not just as an AI researcher, but as someone who can audit what a model actually “knows” and trace failures back to gaps in structured data.

I should note that this post, and the two that follow it, will probably be the heaviest lift in this series, at least from a conceptual and code point of view, depending on your experience and interest level.

Fluent Is Not the Same as True

Large language models are remarkable at generating fluent, contextually sensitive text. They are considerably less remarkable at being accountable for it. Hallucinations, outdated knowledge, and the fundamental opacity of weight-based reasoning are not edge cases. They are structural properties of how these systems work.

A model that can’t show its work can’t be verified, and a system that can’t be verified can’t be trusted in any context where accuracy matters.

This is not an argument against language models. It is an argument for being deliberate about what we ask them to do. The question worth asking is not “can the model answer this?” but “can we trace where the answer came from?” That distinction between generating an answer and grounding one is where knowledge graphs become relevant.

It’s important to understand that a knowledge graph is not a replacement for a language model. It’s a different kind of knowledge representation: explicit, typed, traversable, and inspectable. Where a language model encodes what it knows as patterns distributed across billions of parameters, a knowledge graph encodes what it knows as named entities connected by labeled relationships. You can read it, query it, update it, and audit it.

From a quality standpoint, testers should advocate for the idea that transparency, reliability, verifiability, and explainability are not aspirational qualities to be bolted on afterward. They should be structural consequences of how the knowledge is stored.

The combination of knowledge graphs and LLMs is where things get interesting. A local language model can read unstructured text and extract the entities and relationships a knowledge graph needs to be populated. The knowledge graph can then answer structured queries that the language model alone could not reliably support. The model’s final response, grounded in what the graph actually contains rather than what the model happens to remember, becomes traceable in a way that raw generation never is.

That is the pipeline that this post and the two following it build. It’s not a production system by any means. It’s a demonstration of a pattern worth understanding, because the problems it addresses (hallucination, opacity, and lack of contextual grounding) are not going away. They are the terrain this approach is built for, and handling them well will increasingly distinguish technical testers in an AI-driven world.

The Domain Under Test

You need a domain in which to work for this kind of example. The source material for this demonstration is one of my papers on the Books of Chronicles and its role in establishing what my paper calls a “permission structure” for biblical interpretation across the Second Temple period. It’s a not-too-dense, argumentative, scholarly (but only if you squint) text. Exactly the kind of material that stress-tests extraction quality in interesting ways.

The full paper is available here: Revision as Faithfulness. You might want to download it into your project folder so that you have the full source we’ll extract from.

Without speaking to the quality of the paper, I can at least say that it’s structurally rich in exactly the ways a knowledge graph thrives on. First, there are clearly typed entities:

texts (Samuel-Kings, Chronicles, 1 Enoch, Jubilees)
people (the Chronicler, Freedman, Pajunen, Jonker)
communities (post-exilic Yehud, Qumran, rabbinic schools)
concepts (the permission structure, darash, the traditioning process)
events (the exile, the return, canonization)

Second, and more importantly, the relationships between these are varied and meaningful: “Chronicles revises Samuel-Kings,” “Jubilees extends the method established by Chronicles,” “Pajunen refines Ben Zvi’s argument,” “the canon ratifies the Chronicler’s precedent.” Each of those quoted phrases is a predicate, by which I mean a labeled, directional relationship connecting two entities. That predicate variety is what makes graph queries interesting rather than flat.

In fact, historical and biblical material is a natural fit for learning this work, because formal knowledge organization has deep roots in the humanities and library sciences, well before knowledge graphs became a big data engineering concern. It’s also the case that serious ontological infrastructure exists for exactly this domain, the CIDOC-CRM being one example.

There’s also a layer of recursive irony worth noting. My paper is about interpretive recontextualization: the idea that authoritative texts get re-read through new lenses in new circumstances. In this post, I’m about to do exactly that to the paper itself: extracting its semantic content into a formal structure and querying it. The Chronicler re-read Samuel-Kings to address his community’s present need; the pipeline I’ll show you how to work with re-reads the paper to address a user’s present query. The method is the same.

One thing worth being deliberate about is that biblical scholarly work has layers of relationship that often make it more interesting than flat historical prose. There’s the narrative layer (what happens), the theological layer (what it means), and the intertextual layer (how a given passage connects to others). The pipeline constructed in this post, and the two following, will reach most directly into the narrative and argumentative layers; the intertextual layer is where a fuller system would go, and the extraction schema I’ll present to you is designed to leave that door open for your own experimentation.

Obviously we have the full paper available but, for this post, we’ll work with a single section. Section II.C (the Satan revision case study) is self-contained, entity-rich, and has a clean argumentative structure: text A says X, text B says Y, the Chronicler resolves this by importing concept Z from text C. That makes it an ideal extraction target, and the resulting graph is immediately queryable in interesting ways:

“What texts does the Chronicler draw on?”
“What theological problem is the śāṭān revision solving?”
“How does 4QSam^a relate to the Chronicler’s source?”

To save us a little time on the mechanics of this process, the passage is available here: passage.txt. You’ll also get this file with the code repo, which I’ll come to in a bit and which we’ll explore in the next post.

Two Concepts Worth Distinguishing

Ontologies

An ontology is a formal specification of concepts and their relationships within a domain. Think of it as the schema or grammar of a knowledge space. It answers questions like: what kinds of things exist here, what properties do they have, and how can they relate to one another?

In our domain for this post, the ontology says things like: a Text can revise another Text, a Concept can be established by a Text, a Community can receive a Text. These are the rules of the graph before any specific facts are added.

The analogy I’ve found that helps clarify this is a city’s zoning map. The zoning map specifies what kinds of things go where and how they relate: residential next to commercial, industrial separated from residential. The ontology plays the same role: it describes the structure of the knowledge space without yet populating it.

Knowledge Graph

A knowledge graph is the populated instance of an ontology: actual entities and facts stored according to that schema. Where the ontology says “Texts can revise other Texts,” the knowledge graph says “Chronicles revises Samuel-Kings.” Where the ontology says “a Concept can be established by a Text,” the graph says “the permission structure was established by Chronicles.” Where the ontology says “a Community can receive a Text,” the graph says “post-exilic Yehud received Chronicles.”

To extend the city analogy, if the ontology is the zoning map, the knowledge graph is the actual city built according to that map. The ontology provides the structure; the graph provides the content. Together, they give you something that will matter when we get to the query stage: explicit, typed relationships between named entities that can be traversed, filtered, and reasoned over, rather than approximated.

Perhaps an easy way to frame this: the ontology says “Authors write Books”; the knowledge graph says “Tolkien wrote The Fellowship of the Ring.”
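To make the distinction concrete, here is a minimal sketch in plain Python. It is not the representation the pipeline uses (that comes later with RDFLib). The rule names, the predicate spellings such as establishedBy, and the entity typing table are all assumptions made for illustration.

```python
# Ontology: which (subject type, predicate, object type) patterns are allowed.
# These three rules mirror the examples in the text; the names are illustrative.
ONTOLOGY = {
    ("Text", "revises", "Text"),
    ("Concept", "establishedBy", "Text"),
    ("Community", "receives", "Text"),
}

# Entity typing: this belongs to the populated graph, not to the ontology.
ENTITY_TYPES = {
    "Chronicles": "Text",
    "Samuel-Kings": "Text",
    "permission structure": "Concept",
    "post-exilic Yehud": "Community",
}

# Knowledge graph: concrete facts that should conform to the ontology.
KNOWLEDGE_GRAPH = [
    ("Chronicles", "revises", "Samuel-Kings"),
    ("permission structure", "establishedBy", "Chronicles"),
    ("post-exilic Yehud", "receives", "Chronicles"),
]


def conforms(triple):
    """Return True if the triple's entity types match a rule in the ontology."""
    subject, predicate, obj = triple
    pattern = (ENTITY_TYPES.get(subject), predicate, ENTITY_TYPES.get(obj))
    return pattern in ONTOLOGY


for fact in KNOWLEDGE_GRAPH:
    print(fact, "->", "valid" if conforms(fact) else "violates ontology")

# A fact this ontology does not allow: a Community cannot revise a Text here.
print(conforms(("post-exilic Yehud", "revises", "Chronicles")))
```

The ontology never mentions Chronicles, and the graph never mentions what a Text is allowed to do. Keeping those two layers apart is what lets you check the content against the structure.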

Where Local LLMs Fit

There’s an honest tension worth naming at the outset. LLMs are not naturally good knowledge graphs. Their “knowledge” is distributed across weights in a way that is not inspectable, not updatable without retraining, and not precisely queryable. In fact, that’s exactly the problem RAG was designed to address: you bolt a real knowledge store onto the model.

That being said, there is something more interesting possible here. We could use a vector store as the knowledge layer. Here “vector store” just means a database that retrieves information by semantic similarity rather than explicit structure. Alternatively, and preferably, we can use a structured graph as the knowledge layer. And rather than retrieving semantically similar passages, we can run typed queries that follow relationship chains.

The contrast I’m implicitly drawing here is between semantic similarity (fuzzy, approximate) and typed graph queries (explicit, traversable).

The local LLM plays two roles in this pipeline:

Constructor: it reads the source text and extracts structured triples (subject–predicate–object statements) that populate the graph.
Interface: it receives structured query results and synthesizes them into a grounded natural language answer.

This separation matters. The model is not being asked to remember facts or reason from its weights. It’s being asked to read carefully and format what it finds, then to summarize what the graph returns. Those are tasks local models handle well, and keeping them separate makes the pipeline both more reliable and more testable.

The constructor and interface roles can be evaluated independently, which is exactly the property a testing-focused pipeline needs.

Pipeline Overview

The full pipeline I’m going to show you in this post moves through four stages:


Raw Text (passage.txt)
   ↓
[Stage 1] Ollama: extraction prompt → JSON triples
   ↓
[Stage 2] RDFLib: builds in-memory RDF graph
   ↓
[Stage 3] SPARQL: four structured queries
   ↓
[Stage 4] Ollama: grounded answer from query results
   ↓
Natural Language Answer

Each stage is a clean seam. You can run and test each piece independently before connecting them into the full pipeline. The code is organized to reflect this: five files, each with its own entry point for isolated testing.

kgllm_v1/
├── config.py        # model name, namespace, endpoint
├── extraction.py    # Stage 1: prompt construction, Ollama call, JSON parsing
├── graph.py         # Stage 2: RDFLib graph construction and serialization
├── queries.py       # Stage 3: four SPARQL queries and result formatting
├── pipeline.py      # orchestrates all four stages end to end
└── passage.txt      # source text (Section II.C of the paper)

You can find all of this code in the kgllm_v1 portion of the repo. You can clone and/or download the repo should you wish and just copy that directory over to your working project to play with it locally.

A Testing Note

Here I want to briefly touch on the code used in these upcoming posts. The pipeline we’re looking at here sits at the boundary between development and testing. If you’re a technical tester in the sense of being comfortable with any sort of programmatic code (or just familiar with the code in this blog series), you’ll recognize the pattern immediately: structured inputs, explicit outputs, independently testable stages. If you’re more comfortable on the execution and evaluation side, the code is worth reading even though you didn’t write it yourself, because understanding what each stage does is what makes the output meaningful when you run it.

If this project were occurring in a real-world work context, the pipeline would not be (or at least should not be) a black box you feed text into and receive answers from. It’s a grey box whose internals are visible, inspectable, and deliberately designed to be tested at each seam.

The pipeline we’ll look at sits at an interesting boundary: it’s simultaneously production logic and a test harness. As production logic, it takes real input, calls a real model, and returns a grounded answer. As a test harness, every stage is independently runnable, every output is explicitly surfaced rather than discarded, and the validation layer reports what it drops rather than silently passing bad data downstream.
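As a rough illustration of a validation layer that reports what it drops, here is a minimal sketch. It is not the code from the repo. The function name validate_triples, the three-key schema, and the sample model output are all assumptions made for this example.

```python
import json

# Assumed schema: each extracted triple is a JSON object with these three keys.
REQUIRED_KEYS = ("subject", "predicate", "object")


def validate_triples(raw_json):
    """Split parsed model output into kept triples and reported drops."""
    kept, dropped = [], []
    for item in json.loads(raw_json):
        is_valid = isinstance(item, dict) and all(
            isinstance(item.get(key), str) and item[key].strip()
            for key in REQUIRED_KEYS
        )
        if is_valid:
            kept.append(item)
        else:
            dropped.append(item)
    return kept, dropped


# Hypothetical model output: one well-formed triple, one missing its object.
sample_output = """[
  {"subject": "Chronicles", "predicate": "revises", "object": "Samuel-Kings"},
  {"subject": "Chronicler", "predicate": "drawsOn"}
]"""

kept, dropped = validate_triples(sample_output)
print(f"kept {len(kept)}, dropped {len(dropped)}")
for bad in dropped:
    print("dropped:", bad)
```

Returning the dropped items alongside the kept ones is the point of the design: a tester can assert on both lists, and nothing disappears without a trace.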

What is this SPARQL thing?

SPARQL, a term mentioned above, is worth defining carefully because it has a slightly confusing name and an easy analogy is available. The name stands for SPARQL Protocol and RDF Query Language. Yes, it’s a recursive acronym in the tradition of GNU and similar projects, which means the name doesn’t actually help you understand what it is. So, set the acronym aside.

The useful definition is this: SPARQL is to RDF graphs what SQL is to relational databases. (I’ll come to RDF graphs momentarily.) If you know SQL, you already understand the shape of what SPARQL does: you write a query that specifies a pattern you want to find, and the query engine returns the data that matches that pattern. The syntax looks different and the underlying data model is different, but the conceptual role is identical.

The analogy I’ve found that works well for readers who don’t know SQL is the detective framing. A SPARQL query is essentially a description of what you’re looking for with some pieces left blank. You tell the engine: “find me a triple where the subject is Chronicles, the predicate is anything, and the object is unknown, and then tell me what that unknown object turns out to be.” The query engine searches the graph for everything that fits that description and returns what it found in the blank spots.

That blank-filling intuition is actually quite close to what SPARQL literally does: any variables prefixed with ? in a query are the blanks, and the results are the values that make the pattern true. I’ll show you examples of this as we go.
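Before any real SPARQL appears, the blank-filling idea can be sketched in a few lines of plain Python. To be clear, this is a toy pattern matcher and not SPARQL itself. The match function and the sample triples are invented for illustration.

```python
TRIPLES = [
    ("Chronicles", "revises", "Samuel-Kings"),
    ("Jubilees", "extends", "Chronicles"),
    ("Chronicles", "establishes", "permission structure"),
]


def match(pattern, triples):
    """Return one {variable: value} binding per triple that fits the pattern.

    Terms starting with '?' are blanks; every other term must match exactly.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No mismatch occurred, so this triple fits the pattern.
            results.append(binding)
    return results


# "Subject is Chronicles, predicate is anything, object is unknown."
for row in match(("Chronicles", "?predicate", "?object"), TRIPLES):
    print(row)
```

Two of the three triples have Chronicles as their subject, so two rows come back, each one filling in the blanks for the predicate and the object.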

SPARQL is a W3C standard, which means standard-compliant queries behave the same way whether your graph is stored in RDFLib’s in-memory engine, Oxigraph, or a large enterprise triplestore such as one built on Apache Jena. The queries you see in this post should run essentially unchanged against a production system, which is one of the genuine strengths of the RDF ecosystem.

What is this RDF thing?

RDF is another term where the expansion (Resource Description Framework) doesn’t do much work for someone learning. So, again, let’s set the acronym aside and go straight to the structure.

The core idea is disarmingly simple: RDF represents knowledge as a collection of three-part statements called triples. Every triple has the same shape: subject, predicate, object.

“Chronicles revises Samuel-Kings.”
“The permission structure was established by the Chronicler.”
“Jubilees belongs to the Second Temple period.”

Each of those is a triple, and an RDF graph is just a collection of them. The reason it’s called a graph rather than a table or a list comes down to what happens when you collect many triples together. Each entity that appears as a subject in one triple might appear as an object in another. Those overlapping references create a web of connections, which is exactly what a graph is in the mathematical sense: nodes connected by edges. The subjects and objects are the nodes; the predicates are the labeled edges between them.

The analogy that tends to land well here is a genealogy. A family tree is essentially a hand-drawn RDF graph. “Abraham is the father of Isaac” is a triple. “Isaac is the father of Jacob” is another. When you draw those connections on paper the graph structure emerges naturally, and you can traverse it: who are all the descendants of Abraham? Which figures appear in both the patriarchal and Mosaic periods? Those are graph queries.

What makes RDF more powerful than a simple list of facts is precisely that traversal capability. You can follow chains of relationships across many triples to find connections that no single triple states explicitly. That is what SPARQL gives you: a formal language for expressing those traversals.
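The genealogy example lends itself to a short sketch of traversal. This is plain Python rather than RDF. The fatherOf predicate name and the descendants helper are assumptions for illustration, and the Esau triple is added to the two from the text so that the tree branches.

```python
TRIPLES = [
    ("Abraham", "fatherOf", "Isaac"),
    ("Isaac", "fatherOf", "Jacob"),
    ("Isaac", "fatherOf", "Esau"),
]


def descendants(ancestor, triples):
    """Follow fatherOf edges transitively, starting from the ancestor."""
    found = []
    frontier = [ancestor]
    while frontier:
        current = frontier.pop(0)
        for subject, predicate, obj in triples:
            if subject == current and predicate == "fatherOf" and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


print(descendants("Abraham", TRIPLES))  # ['Isaac', 'Jacob', 'Esau']
```

No single triple says that Jacob descends from Abraham. That fact only appears when two triples are chained, which is the kind of connection traversal is meant to surface.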

Dependencies and Setup

First, I’m going to head off a possible issue that has to do with Pylance in Visual Studio Code. Or, rather, not an issue, but simply a fact of using this kind of logic. Back in the post on LangChain Templates, I had you create a pyrightconfig.json file. For this post, add one entry to that:

{
  "typeCheckingMode": "basic",
  "reportUnknownVariableType": "none",
  "reportUnknownMemberType": "none",
  "reportUnknownArgumentType": "none",
  "reportArgumentType": "none",
  "reportAttributeAccessIssue": "none"
}

You’ll also want to grab some Python dependencies (making sure you are in your virtual environment, which you’ve been using for this whole series):

  python -m pip install rdflib requests

You likely already have requests, but rdflib is definitely something we haven’t used up to this point.

Of course, as with all posts in this series, you also need Ollama running locally with a model available. For this post, I’m going to settle on one we’ve used before, which is qwen2.5:latest. This is a strong choice for structured extraction tasks but, realistically, any model with reliable JSON output discipline should work. If you don’t have that model, you can pull it down:

  ollama pull qwen2.5:latest

Just to be clear on this choice, Qwen2.5 does have that solid JSON discipline I mentioned. By this I mean, the model tends to close its brackets, respect the schema you show it in the prompt, and not wander into prose mid-output the way some models do. For the extraction call portion, where silent corruption is the failure mode to avoid, that reliability matters more than raw capability.

Qwen2.5 also handles academic register reasonably well, which matters here. My source text uses terms like “intertextual,” “Vorlage,” “pesher,” and “haśśāṭān.” A weaker model will either hallucinate relationships around unfamiliar terms or flatten them into something generic. Qwen2.5 tends to preserve specificity.

As a bit of an implementation note that may mean nothing to you right now, the RDF graph we will create lives entirely in memory for the duration of the pipeline run. RDFLib’s in-process SPARQL engine handles all queries without any external triplestore being needed. This is a deliberate design choice for this blog post, as it keeps the stack simple and the focus on the pipeline logic rather than infrastructure.

Assuming you have the model locally available, let’s do a quick check. Before running the logic we’ll get to in this post, it’s worth confirming that Ollama is responding correctly to the kind of request the pipeline will make. Create a script with the following code and run it:

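import requests

# Send one non-streaming generation request to the local Ollama endpoint.
response = requests.post(
  "http://localhost:11434/api/generate",
  json={
    "model": "qwen2.5:latest",
    "prompt": "hello",
    "stream": False,  # return a single JSON object rather than a stream
  },
)

# Expect 200, then a dictionary containing a "response" key with text in it.
print(response.status_code)
print(response.json())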

That script sends a minimal POST request to the local endpoint we’ll be hitting, with the same structure that the pipeline uses internally, just with a trivial prompt instead of a full extraction request. There are two things to verify in the output: the status code is 200, and the response dictionary contains a response key with text in it. That response field is exactly what the pipeline reads at Stage 1 of the code we’ll be working on.
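If you would rather have those two checks made for you than eyeball the output, something like the following sketch would do it. The helper name check_generate_response is my own invention for this example, and the payloads are hypothetical stand-ins for response.json(), so it runs without Ollama.

```python
def check_generate_response(status_code, payload):
    """Return a list of problems found; an empty list means the check passed."""
    problems = []
    if status_code != 200:
        problems.append(f"expected status 200, got {status_code}")
    text = payload.get("response") if isinstance(payload, dict) else None
    if not isinstance(text, str) or not text.strip():
        problems.append("missing or empty 'response' field")
    return problems


# Hypothetical payloads standing in for response.json() from the script above.
print(check_generate_response(200, {"response": "Hello! How can I help?"}))
print(check_generate_response(404, {"error": "hypothetical error message"}))
```

In the real script you would call it as check_generate_response(response.status_code, response.json()) and treat any non-empty list as a reason to stop before running the pipeline.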

Next Steps!

This post got you set up. You have the basis for the work we’ll be looking at and the work itself from the code repo. In the next post, we’ll start going through that code, executing it, and analyzing the results.

Stories from a Software Tester

Twice upon a time, in another space, no distance in any direction from here …