AI and Testing: A Knowledge Graph Pipeline in Practice

In the previous post we talked about the conceptual basis of knowledge graphs and ontologies and pointed toward the code we’ll be using. In this post, we’re going to dive into that code and put the concepts into action.

As a reminder, all of the code we’ll look at in this post is available in the kgllm_v1 portion of the repo. You probably have a working directory you’ve been using for the posts in this series. In that case, you can clone and/or download the repo and just copy that kgllm_v1 directory over to your working project to play with it locally.

As mentioned in the last post, this is the heaviest lift of code in this series. I recommend taking time to at least look at the code in some depth. You’ll be having a lot of it explained to you, running each piece individually, and then running all of it together as part of a pipeline. The focus here is on how to reason about knowledge graph and ontology systems, not so much on how to build them.

Central Configuration

All pipeline configuration lives in a single file. If you want to adapt this project to your own domain or model, you only need to touch config.py. One thing you’ll note in that code is the following:

KG_NAMESPACE = "http://example.org/faithfulness/"

The namespace is not a URL that anything connects to. RDF was designed to be a web-scale knowledge format where graphs from different sources could be merged without naming collisions. Namespaces make every term globally unique. In other words, the chronicles in the URI http://example.org/faithfulness/chronicles can’t collide with anyone else’s chronicles because the prefix is specific.

Realistically, for production use, you would use your domain. So, I could have done something like this: https://testerstories.com/faithfulness/chronicles. I didn’t do that because the example.org domain is reserved by IANA specifically for illustrative use in documentation, making it exactly right for this context. I wanted it clear that you’re hitting nothing real, certainly nothing related to my blog domain in particular, and that this code stands alone.

Note that changing the domain is a one-line edit in the config file, and since all other files import from that config file, that single change propagates everywhere.

The rest of the config defines the model name and endpoint, plus four lists that serve as controlled vocabularies for the extraction stage: the allowed entity types, predicate categories, confidence levels, and source claim types. Rather than letting the LLM freely invent its own terms, the prompt we use will constrain it to these specific values. That constraint is what makes the resulting graph queryable. You’ll see exactly how those lists get wired into the prompt when we look at extraction next.

Stage 1: Triple Extraction

Here you can look at extraction.py. This is the first substantive piece of the pipeline. The call_ollama() function is the primary pipeline execution path. This sends the text passage to the local model and asks it to identify entities and relationships as structured JSON triples.

Notice the imports at the top of this file: ENTITY_TYPES, CONFIDENCE_LEVELS, and SOURCE_CLAIM_TYPES. These are pulled directly from config. Those lists get interpolated into the SYSTEM_PROMPT via f-strings, so the model sees them as explicit constraints. The same lists appear again later during output validation; anything the model returns that doesn’t match an allowed value gets dropped.

This shows you that the config file isn’t just settings; it’s the contract between the prompt and the parser.
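To make that contract concrete, here is a minimal sketch of one list set feeding both the prompt and the output check. The prompt wording, the is_allowed() helper, and the values "Person" and "low" are assumptions for illustration; the real lists and SYSTEM_PROMPT live in config.py and extraction.py.

```python
# Hypothetical stand-ins for the lists in config.py. "Person" and "low"
# are assumed values; the others appear in this post's sample output.
ENTITY_TYPES = ["Text", "Concept", "Person"]
CONFIDENCE_LEVELS = ["high", "medium", "low"]
SOURCE_CLAIM_TYPES = ["stated", "inferred"]

# Join each list once so the f-string below stays easy to read.
entity_types = ", ".join(ENTITY_TYPES)
confidence_levels = ", ".join(CONFIDENCE_LEVELS)
source_claim_types = ", ".join(SOURCE_CLAIM_TYPES)

# Illustrative prompt wording, not the repo's actual SYSTEM_PROMPT.
# Doubled braces are f-string escapes: the model sees single braces.
SYSTEM_PROMPT = f"""Return one JSON object with "reasoning" and "triples" keys.
Allowed entity types: {entity_types}
Allowed confidence values: {confidence_levels}
Allowed source_claim values: {source_claim_types}
Example: {{"confidence": "high", "source_claim": "stated"}}
"""


def is_allowed(value, allowed):
    """The same list that shaped the prompt also checks the output."""
    return value in allowed


print(SYSTEM_PROMPT)
```

Because the prompt and the check read from the same lists, adding a value in one place updates both sides of the contract.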

The Two-Stage Prompt

I recommend taking some time to read the SYSTEM_PROMPT. Rather than asking for JSON directly, the prompt asks the model to reason first and then produce output, all within a single API call. The model returns one JSON object with two top-level keys: reasoning, which contains its step-by-step analysis, and triples, which contains the structured output. There’s no second request; the reasoning and the result arrive together. The key directions are:


Before outputting triples, reason through:
1. What are the key entities in this passage?
2. What are the most significant relationships between them?
3. Which relationships are explicitly stated versus inferred?

This exploits the reasoning orientation of instruction-tuned models and consistently produces better triples than cold JSON requests. The reasoning trace — the content of the reasoning key in the returned JSON — is printed by the pipeline before the triples are shown, so you can inspect what the model identified before seeing the structured result. This is genuinely useful for debugging extraction quality: if a triple is missing, the trace often shows whether the model saw the relationship and failed to format it, or never identified it at all.

When we get to the full pipeline run, you’ll see the trace and the triples side by side: a clean result worth contrasting with what a misaligned run looks like should you start experimenting with your own passages.

The Triple Schema

Each extracted triple carries more than just subject, predicate, and object. To understand the full shape of what the model returns, it helps to see the outer envelope first:

{
  "reasoning": "step-by-step analysis...",
  "triples": [ ... ]
}

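Each entry in the triples array looks like this, which you can see in the EXAMPLE TRIPLE portion of the code. The doubled braces are f-string escapes, since the example sits inside the interpolated SYSTEM_PROMPT; the model itself sees single braces: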

{{
  "subject": {{
    "id": "chronicles",
    "label": "Chronicles",
    "type": "Text"
  }},
  "predicate": {{
    "id": "revises",
    "label": "revises",
    "category": "textual"
  }},
  "object": {{
    "id": "samuel_kings",
    "label": "Samuel-Kings",
    "type": "Text"
  }},
  "confidence": "high",
  "source_claim": "stated",
  "section": "II.A"
}}

Three metadata fields are worth paying attention to. The predicate.category field classifies the relationship as textual, argumentative, or historical, which are the three categories defined in config. The confidence and source_claim fields describe extraction quality: stated means the relationship is explicitly present in the text, inferred means the model performed interpretive reasoning to identify it. The section field ties the triple back to a passage reference when the model can identify one. All of these distinctions become queryable in Stage 3.

Defensive Parsing

The parsing layer (the parse_extraction() function) is intentionally lenient. Rather than failing hard on a malformed triple, it drops the offending entry with a warning and continues. Validation (the validate_triples() function) checks for two things: required keys being present, and confidence and source_claim values matching the lists defined in config.

If the model invents a value outside those allowed sets (say, “very high” for confidence) the triple is dropped rather than passed downstream with bad data. This behavior can be more instructive than an outright crash because you see exactly which triples were problematic and why, without losing everything the model extracted correctly.
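The drop-and-continue behavior can be sketched as follows. This is not the repo's validate_triples(); the function name, the set of required keys, the warning text, and the "low" confidence value are assumptions made for illustration.

```python
CONFIDENCE_LEVELS = ["high", "medium", "low"]   # "low" is an assumed value
SOURCE_CLAIM_TYPES = ["stated", "inferred"]
# Assumed set of required keys; "section" is treated as optional here.
REQUIRED_KEYS = ("subject", "predicate", "object", "confidence", "source_claim")


def validate_triples_sketch(triples):
    """Split triples into (valid, dropped); dropped entries carry a reason."""
    valid, dropped = [], []
    for triple in triples:
        if not isinstance(triple, dict):
            dropped.append((triple, "not a JSON object"))
            continue
        missing = [key for key in REQUIRED_KEYS if key not in triple]
        if missing:
            dropped.append((triple, f"missing keys: {missing}"))
        elif triple["confidence"] not in CONFIDENCE_LEVELS:
            dropped.append((triple, f"confidence not allowed: {triple['confidence']!r}"))
        elif triple["source_claim"] not in SOURCE_CLAIM_TYPES:
            dropped.append((triple, f"source_claim not allowed: {triple['source_claim']!r}"))
        else:
            valid.append(triple)
    return valid, dropped


good = {"subject": {"id": "chronicles"}, "predicate": {"id": "revises"},
        "object": {"id": "samuel_kings"}, "confidence": "high",
        "source_claim": "stated"}
bad_value = dict(good, confidence="very high")   # value outside the allowed list
bad_shape = {"subject": {"id": "chronicles"}}    # required keys missing

valid, dropped = validate_triples_sketch([good, bad_value, bad_shape])
for _, reason in dropped:
    print(f"WARNING: dropped triple ({reason})")
print(f"Extracted {len(valid)} valid triples ({len(dropped)} dropped).")
```

Returning the dropped entries with a reason, rather than raising, is what lets you see which triples were problematic while keeping the good ones.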

I’ve set up extraction.py to run independently so you can test without the full pipeline.

  python extraction.py

The sample passage in the __main__ block of extraction.py is a condensed version of the census account from Section II.C of the paper. It’s short enough to produce fast results and rich enough to return several meaningful triples.

The __main__ block at the bottom of the file is a Python convention: code inside if __name__ == "__main__": runs only when the file is executed directly, not when it’s imported by another module. Each file in this pipeline has this block, containing just enough synthetic data to exercise that stage in isolation. It works as a lightweight smoke test for each stage without pulling in a test framework.

Let’s consider sample output from one run. Since this calls the live model, your exact triples and reasoning trace will differ, but the structure will be the same.


--- RAW MODEL OUTPUT ---
{
  "reasoning": "Identified key entities and relationships: David, YHWH, satan, census, First Chronicles, Second Samuel, Balaam narrative, Numbers, Stokes. Analyzed the relationships between them to determine which are explicitly stated or inferred.",
  "triples": [
    {
      "subject": {
        "id": "second_samuel",
        "label": "Second Samuel",
        "type": "Text"
      },
      "predicate": {
        "id": "opens_with",
        "label": "opens with",
        "category": "textual"
 ...
------------------------

  Extracted 4 valid triples (0 dropped).
Reasoning trace:
Identified key entities and relationships: David, YHWH, satan, census, First Chronicles, Second Samuel, Balaam narrative, Numbers, Stokes. Analyzed the relationships between them to determine which are explicitly stated or inferred.

Triples extracted: 4
  Second Samuel --[opens with]--> YHWH inciting David to take the census (high, stated)
  First Chronicles --[grounds revision in]--> the Balaam narrative of Numbers (high, stated)
  the Balaam narrative of Numbers --[features the angel of YHWH as an adversary]--> the angel of YHWH acting as an adversary (high, stated)
  Stokes (2009) --[identifies structural parallels between the two narratives as uncanny]--> structural parallels between the two narratives (high, stated)

Examining the Output

As I mentioned earlier, the reasoning trace is visible and coherent. The model correctly identified the key entities and flagged the relationships before producing structured output. That two-stage behavior is exactly what the prompt was designed to elicit. All four triples parsed successfully with zero dropped, meaning the JSON structure was valid and the schema constraints were met.

What’s worth examining is the predicate labels. “Opens with” and “grounds revision in” are readable English but verbose, and neither is drawn from the predicate vocabulary defined in the system prompt. “Features the angel of YHWH as an adversary” is essentially a sentence rather than a predicate. The fourth triple uses the full citation “Stokes (2009)” as a subject identifier rather than a clean entity id. These are exactly the kinds of extraction artifacts worth ferreting out.

The output isn’t wrong, per se, but it reflects the model making its own predicate choices rather than selecting from the defined vocabulary. That matters for query consistency: “grounds revision in” and “draws on” describe the same relationship but would not match in a SPARQL query. If you want tighter predicate control, the extraction prompt is where you tighten it. An instruction like “prefer short predicate labels of two to four words drawn from the suggested vocabulary” is a reasonable first experiment.
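Before touching the prompt, it can help to measure the problem. The following is a small diagnostic sketch, not part of the repo: flag_verbose_predicates() is a hypothetical helper, and the four-word threshold simply mirrors the suggested instruction above.

```python
def flag_verbose_predicates(labels, max_words=4):
    """Return the labels with more than max_words whitespace-separated words."""
    return [label for label in labels if len(label.split()) > max_words]


# Predicate labels copied from the sample run shown earlier.
labels = [
    "opens with",
    "grounds revision in",
    "features the angel of YHWH as an adversary",
    "identifies structural parallels between the two narratives as uncanny",
]

flagged = flag_verbose_predicates(labels)
for label in flagged:
    print(f"verbose predicate: {label!r}")
```

A check like this only catches length. It would not notice that “grounds revision in” and “draws on” mean the same thing, which still needs a vocabulary comparison or a tighter prompt.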

Stage 2: Graph Construction

Here we can consider the code in graph.py. This stage (with the build_graph() function) takes the validated JSON from extraction and turns it into an RDF graph using RDFLib. The config namespace imported here is the same one defined in the config file. Every entity and relationship the model identified becomes a URI rooted at that namespace.

The Reification Pattern

Standard RDF triples are three-part statements: subject, predicate, object. They can’t carry metadata. You can’t, for example, attach a confidence level or a section reference to a triple directly because there’s nowhere to put it in the basic data model.

Reification solves this by treating each triple as a node in its own right. Instead of a direct edge between two entities, you create a named node representing the relationship and attach properties to that node:


# Without reification — a direct edge, no metadata possible:
chronicles --revises--> samuel_kings

# With reification — the relationship is itself a node:
:triple_001  a             rdf:Statement
:triple_001  kg:subject    :chronicles
:triple_001  kg:predicate  :revises
:triple_001  kg:object     :samuel_kings
:triple_001  kg:confidence 'high'
:triple_001  kg:section    'II.A'

Think of it as the footnote mechanism for a knowledge graph. The main text makes a claim (“Chronicles revises Samuel-Kings”) and the footnote carries metadata about it: confidence, location in the argument, whether it’s stated or inferred. Standard RDF gives you the claim. Reification gives you the footnote by promoting the relationship itself to a first-class node that can be pointed at, labeled, and annotated.

One practical consequence of this is that a single extracted triple produces up to thirteen RDF statements in the graph once you count the type assertions, labels, reification links, and metadata properties: seven on the triple node itself, plus up to six on the entity and predicate nodes it links to (fewer when those nodes are already in the graph). When the pipeline prints “Graph constructed: N RDF statements,” that N will be several times larger than the number of triples that extraction identified. That’s expected. It’s the cost of carrying metadata on every relationship.

So, yes, this is more verbose, but it’s what makes Queries 2 and 4 possible. (These refer to queries that we’ll look at soon.) Filtering triples by confidence level or source claim requires metadata on relationships, not just on entities. Reification is the price of that expressiveness.
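The pattern is easy to see in miniature. In the sketch below a plain Python set of tuples stands in for the RDFLib graph, and reify() is a hypothetical helper rather than the repo's build_graph(); only the property names are taken from the generated Turtle file.

```python
def reify(triple_id, subject, predicate, obj, **metadata):
    """Expand one logical triple into statements about a triple node."""
    node = f"kg:{triple_id}"
    statements = {
        (node, "rdf:type", "rdf:Statement"),
        (node, "kg:subject", f"kg:{subject}"),
        (node, "kg:predicate", f"kg:{predicate}"),
        (node, "kg:object", f"kg:{obj}"),
    }
    # Metadata hangs off the triple node, which a bare edge could not carry.
    for key, value in metadata.items():
        statements.add((node, f"kg:{key}", value))
    return statements


statements = reify(
    "triple_000", "chronicles", "revises", "samuel_kings",
    confidence="high", sourceClaim="stated", section="II.A",
)

# Filtering by confidence is now an ordinary lookup on the triple node.
high_confidence = [s for (s, p, o) in statements
                   if p == "kg:confidence" and o == "high"]
print(len(statements), high_confidence)
```

The count covers only the triple node; the entity and predicate nodes it points to add their own statements on top.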

Etymologically, reification comes from the Latin word res (“thing”) and facere (“to make”). Literally, it means “thing-making.” Conceptually, it’s the process of treating an abstract idea, a relationship, or a process as if it were a concrete, physical thing.

Building Entity and Predicate Nodes

Before a triple node can be constructed, its subject, object, and predicate each need to become URIs in the graph. The id field from the extracted JSON (already in snake case from the extraction prompt; “id”: “snake_case_identifier”,) gets appended to the config namespace. So, “chronicles” becomes http://example.org/faithfulness/chronicles, and “revises” becomes http://example.org/faithfulness/revises. Each entity also gets a type assertion and a human-readable label attached to it.
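As a rough illustration of that mapping, the sketch below appends an id to the namespace. The to_snake_case() normalizer is hypothetical: the pipeline as described relies on the model supplying snake_case ids, and a guard like this is only one way you might handle ids such as the “Stokes (2009)” case seen earlier.

```python
import re

KG_NAMESPACE = "http://example.org/faithfulness/"


def to_snake_case(label):
    """Hypothetical normalizer: lowercase, runs of non-alphanumerics become '_'."""
    return re.sub(r"[^a-z0-9]+", "_", label.lower()).strip("_")


def to_uri(identifier):
    """Append an id to the namespace, as the post describes."""
    return KG_NAMESPACE + identifier


print(to_uri("chronicles"))
print(to_uri(to_snake_case("Samuel-Kings")))
print(to_uri(to_snake_case("Stokes (2009)")))
```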

One property of this approach is worth noting: entity nodes are shared across triples, not duplicated. If Chronicles appears as the subject of one triple and the object of another, the same URI node is reused in both reified statements. This is a meaningful feature of the graph model. Chronicles is one thing that participates in multiple relationships, not two separate things that happen to share a name. RDFLib handles this naturally because adding the same URI twice is idempotent.

Idempotent here means that adding to the graph, by calling g.add(), with a triple that already exists in the graph produces no change; the graph stays the same. RDFLib simply ignores the duplicate. So when the _add_entity() function is called for Chronicles a second time, the type assertion and label it tries to add are already present, and nothing is written twice.
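An RDF graph behaves as a set of statements, so a Python set makes a reasonable stand-in for seeing the idempotency. The add_entity() helper below is a hypothetical analogue of _add_entity(), not the repo's code.

```python
graph = set()  # plain-Python stand-in for an RDFLib Graph


def add_entity(graph, entity_id, label, entity_type):
    """Hypothetical analogue of _add_entity(): one type assertion, one label."""
    node = f"kg:{entity_id}"
    graph.add((node, "rdf:type", f"kg:{entity_type}"))
    graph.add((node, "kg:label", label))


add_entity(graph, "chronicles", "Chronicles", "Text")
size_after_first_call = len(graph)

# Second call for the same entity: both statements already exist.
add_entity(graph, "chronicles", "Chronicles", "Text")
size_after_second_call = len(graph)

print(size_after_first_call, size_after_second_call)
```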

Running graph.py independently against its synthetic test data produces this:


Graph constructed: 35 RDF statements.

--- GRAPH SUMMARY ---
Total RDF statements: 35

Entity types:
kg:Text: 3
kg:Concept: 1
---------------------

Graph saved to ./faithfulness_kg.ttl (turtle format).

Two things are worth pausing on here to re-emphasize what I said above. First, the arithmetic: 35 RDF statements from 3 logical triples, an average of just under twelve apiece. Each relationship expands into a named node with subject, predicate, object, confidence, source claim, and section all attached as separate statements, and the entities and predicates it links to carry statements of their own. That’s the concrete cost of metadata expressiveness that I referred to above.

Second, the entity type summary shows four entities across three triples. Three triples have six subject and object slots between them, but only four distinct entities fill those slots, because Chronicles and the permission structure each take part in two relationships. That reuse is a signal about graph density: a graph where relationships accumulate faster than entities has many relationships between a small number of entities, which is a richly connected structure. For an argumentative scholarly text like my source paper, you would expect that pattern to intensify as extraction scales up.

The .ttl file (Turtle; short for Terse RDF Triple Language) makes the same graph readable as text, the RDF equivalent of pretty-printed JSON. Each logical triple appears as a named node (kg:triple_000, kg:triple_001, kg:triple_002) with all its properties listed beneath it.

Consider kg:triple_000 from the generated file. This directly shows reification in practice. The single claim “Chronicles revises Samuel-Kings” expands to thirteen RDF statements. Seven of those live on the triple node itself. You can count them directly in the file, which I’ll annotate here:


kg:triple_000 a rdf:Statement ;   # type assertion
    kg:confidence "high" ;        # metadata
    kg:object kg:samuel_kings ;   # object link
    kg:predicate kg:revises ;     # predicate link
    kg:section "II.A" ;           # metadata
    kg:sourceClaim "stated" ;     # metadata
    kg:subject kg:chronicles .    # subject link

The remaining six come from the entity and predicate nodes that the triple links to, two statements apiece: “Chronicles” and “Samuel-Kings” each get a type assertion and a label, while “revises” gets a label and the category value that the queries read back. That structure is what allows the SPARQL queries in the next stage to filter by confidence or category rather than just by entity name.
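The counts in this section can be checked with a few lines of arithmetic. The function below is just a calculator for the figures already quoted (seven statements per triple node, two per linked node); the entity and predicate counts are read off the synthetic data's summary and query output.

```python
TRIPLE_NODE_STATEMENTS = 7  # type, subject, predicate, object, confidence, sourceClaim, section
LINKED_NODE_STATEMENTS = 2  # two statements apiece on each entity or predicate node


def statement_count(n_triples, n_entities, n_predicates):
    """Total RDF statements when entity and predicate nodes are shared."""
    return (n_triples * TRIPLE_NODE_STATEMENTS
            + (n_entities + n_predicates) * LINKED_NODE_STATEMENTS)


# kg:triple_000 on its own: two entities and one predicate.
first_triple_alone = statement_count(1, 2, 1)

# The synthetic graph: three triples, four entities, three distinct predicates.
synthetic_graph = statement_count(3, 4, 3)

print(first_triple_alone, synthetic_graph)
```

The total comes in under three times the single-triple figure because shared nodes such as Chronicles are only written once.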

Stage 3: SPARQL Queries

Here you can reference the queries.py file.

Four queries are defined, each designed to answer a different kind of question about the graph. They range from basic entity neighborhood lookup to two-hop traversal, demonstrating what graph queries can do that vector similarity search cannot. Each query is implemented as a Python function:

query_entity_neighborhood()
query_argumentative_triples()
query_multihop()
query_inferred_triples()

Each of those functions takes the RDF graph and returns a plain list of dictionaries. A shared format_results() helper handles the consistent output formatting you’ll see across all four results below, which is why the output looks uniform regardless of what each query is actually asking.

The queries are really important so let’s dig into those a bit.

Query 1: Entity Neighborhood

This query effectively asks: “What is Chronicles directly connected to, and through what kind of relationship?”

QUERY_1 = """
SELECT ?targetLabel ?predicateLabel ?predicateCategory
WHERE {
  ?subject kg:label "Chronicles" .

  ?triple kg:subject   ?subject ;
          kg:predicate ?predicate ;
          kg:object    ?target .

  ?predicate kg:label    ?predicateLabel ;
             kg:category ?predicateCategory .
  
  ?target kg:label ?targetLabel .
}
ORDER BY ?predicateCategory ?targetLabel
"""

Since this is the first time we’re looking at a query together, let me break this down a bit. SPARQL queries follow a SELECT/WHERE pattern similar to SQL, but instead of table columns, you’re selecting named variables. Anything prefixed with ? is a variable that the query engine will try to bind to matching values in the graph.

The WHERE block describes a pattern the graph must match: here, find a node whose kg:label is “Chronicles”, then find any triple node that links to it as a subject, then follow that triple’s predicate and object to retrieve their labels and categories.

Notice that the query doesn’t hardcode URIs or assume anything about graph structure beyond what the reification pattern established in Stage 2. It finds Chronicles by label, walks its outgoing relationships, and returns whatever it finds. The ORDER BY sorts on the category string, so argumentative relationships group together first, then historical, then textual, making the result readable without post-processing.

Those three categories come from config.py and were injected into the extraction prompt as the allowed values. However, which category applies to any given relationship is the model’s judgment, not the pipeline’s. The pipeline constrains the vocabulary; the model supplies the meaning.

This is the most basic graph traversal and serves as a first validation that extraction captured the core relationships correctly.

One design detail worth noting: the query is not hardcoded to Chronicles. The query_entity_neighborhood() function accepts any entity label as a parameter, defaulting to Chronicles. The same traversal pattern works for any node in the graph. Swap the label and you get the neighborhood of a different entity without changing anything else.

Query 2: Predicate Filter

This query effectively asks: “Which relationships in the graph are argumentative rather than textual or historical?”

QUERY_2 = """
SELECT ?subjectLabel ?predicateLabel ?objectLabel ?confidence
WHERE {
  ?triple kg:subject   ?subject ;
          kg:predicate ?predicate ;
          kg:object    ?object ;
          kg:confidence ?confidence .

  ?predicate kg:category "argumentative" ;
             kg:label    ?predicateLabel .

  ?subject kg:label ?subjectLabel .
  ?object  kg:label ?objectLabel .
}
ORDER BY ?confidence ?subjectLabel
"""

This query uses the predicate category metadata and surfaces the extraction’s most interpretively demanding work. I say that because argumentative relationships in scholarly prose require the model to do more than read. It has to understand what is being claimed.

Query 3: Multi-hop Traversal

This query effectively asks: “What chains of relationships lead out from the permission structure concept?”

QUERY_3 = """
SELECT ?hopOneLabel ?hopOnePredicateLabel
       ?hopTwoLabel ?hopTwoPredicateLabel
       ?terminalLabel
WHERE {
  ?concept kg:label "permission structure" .

  ?tripleOne kg:subject   ?concept ;
             kg:predicate ?hopOnePredicate ;
             kg:object    ?hopOne .

  ?hopOnePredicate kg:label ?hopOnePredicateLabel .
  ?hopOne kg:label ?hopOneLabel .

  ?tripleTwo kg:subject   ?hopOne ;
             kg:predicate ?hopTwoPredicate ;
             kg:object    ?terminal .

  ?hopTwoPredicate kg:label ?hopTwoPredicateLabel .
  ?terminal kg:label ?terminalLabel .

  FILTER(?terminal != ?concept)
}
ORDER BY ?hopOneLabel ?terminalLabel
"""

This is the query that earns its place in this post! It walks two relationship hops and returns the full chain, something a vector similarity search could not replicate because it requires following typed relationships rather than retrieving semantically similar passages.

What you would expect to see here is:


permission structure → established by → Chronicles → revises → Samuel-Kings

And:


permission structure → extended by → Jubilees → addresses → Hellenization crisis

Whether those chains actually appear depends on what the extraction captured, which is itself instructive.

Query 4: Confidence Diagnostic

This query effectively asks: “Which triples were marked as inferred rather than directly stated, and where in the paper do they cluster?”

QUERY_4 = """
SELECT ?subjectLabel ?predicateLabel ?objectLabel
       ?sourceClaim ?section

WHERE {
  ?triple kg:subject    ?subject ;
          kg:predicate  ?predicate ;
          kg:object     ?object ;
          kg:sourceClaim "inferred" .

  ?subject  kg:label ?subjectLabel .
  ?predicate kg:label ?predicateLabel .
  ?object   kg:label ?objectLabel .

  OPTIONAL { ?triple kg:section ?section . }
}
ORDER BY ?section ?subjectLabel
"""

This is a diagnostic query. Consider a hypothesis worth testing against any argumentative scholarly text: that inferred triples cluster toward the later sections, where an author is no longer presenting primary evidence but synthesizing it into broader claims. Those later sections are where the model has to do more interpretive work because the relationships are no longer explicitly stated but constructed from the argument’s momentum. The ORDER BY ?section in this query makes that pattern immediately visible if it exists: inferred triples will sort to the sections where they appear, and any clustering will show up in the output without further analysis.

You can run queries.py independently against its synthetic graph. The synthetic graph in the __main__ block of queries.py contains just three triples, which is enough to exercise all four queries and verify the SPARQL logic without needing extraction to have run first. The output you’ll see:


Graph constructed: 35 RDF statements.

--- GRAPH SUMMARY ---
Total RDF statements: 35

Entity types:
kg:Text: 3
kg:Concept: 1
---------------------


=== Query 1: Entity Neighborhood (Chronicles) ===
[Q1] — 1 result(s):
1.   target: Samuel-Kings | predicate: revises | category: textual

=== Query 2: Argumentative Triples ===
[Q2] — 2 result(s):
1.   subject: permission structure | predicate: established by | object: Chronicles | confidence: high
2.   subject: Jubilees | predicate: extends | object: permission structure | confidence: medium

=== Query 3: Multi-hop from Permission Structure ===
[Q3] — 1 result(s):
1.   hop_one: Chronicles | hop_one_predicate: established by | hop_two: None | hop_two_predicate: revises | terminal: Samuel-Kings

=== Query 4: Inferred Triples ===
[Q4] — 1 result(s):
1.   subject: Jubilees | predicate: extends | object: permission structure | section: IV.C

Let me be clear what’s happening here: running queries.py independently executes all four SPARQL queries against the same synthetic three-triple graph used in graph.py. The results are modest by design (three triples produce limited traversal depth) but each query demonstrates its intended behavior clearly.

Query 1 returns a single result: Chronicles is connected to Samuel-Kings through a textual relationship labeled “revises.” That’s exactly what the first synthetic triple contains, and seeing it returned correctly confirms that the entity neighborhood traversal is working. The query found Chronicles by label, followed its outgoing relationships, and returned the connected entity with the predicate and category attached.

Query 2 returns two results, both argumentative relationships: the permission structure established by Chronicles, and Jubilees extending the permission structure. These are the two synthetic triples whose predicates carry the argumentative category. They appear ordered by confidence, with the medium-confidence Jubilees triple after the high-confidence one, which confirms that the predicate category metadata is both stored and queryable. Note that ORDER BY compares the confidence values as strings, so “high” sorts before “medium” alphabetically rather than by rank.
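One caveat about Query 2's ORDER BY ?confidence is that it sorts the literal strings, not a ranking. Python's string sort is a close analogy, so the sketch below shows what happens once a third level appears. The value "low" and the RANK mapping are assumptions; the post only shows "high" and "medium".

```python
confidences = ["medium", "high", "low"]  # "low" is an assumed third level

# String ordering, analogous to ORDER BY on plain string literals.
alphabetical = sorted(confidences)

# Hypothetical rank mapping for a true high-to-low ordering.
RANK = {"high": 0, "medium": 1, "low": 2}
by_rank = sorted(confidences, key=RANK.__getitem__)

print(alphabetical)
print(by_rank)
```

With only "high" and "medium" in play the two orderings agree, which is why the synthetic output looks correctly ranked. If you need rank order with more levels, re-sorting the returned list of dictionaries in Python is one option.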

Query 3 shows the multi-hop traversal in action. Starting from the permission structure concept, it follows one hop to Chronicles via “established by,” then a second hop from Chronicles to Samuel-Kings via “revises.”

The hop_two: None in the result is worth a brief explanation. In SPARQL, a variable in the SELECT clause that has no corresponding pattern in the WHERE clause is simply unbound. The query engine has nothing to assign to it. Here ?hopTwoLabel is selected but never appears in a WHERE pattern, so it produces a blank column in every result row. It’s not missing data; the second-hop information is fully present in the adjacent columns: the predicate is “revises” and the terminal node is Samuel-Kings.

The two example chains mentioned earlier — permission structure → established by → Chronicles → revises → Samuel-Kings, and permission structure → extended by → Jubilees → addresses → Hellenization crisis — are not both present in the synthetic data. The first chain appears because both its hops exist in the synthetic triples. The second does not appear because the Jubilees triple connects Jubilees to the permission structure as object, not as subject, so there’s no outgoing hop from Jubilees to follow.

That asymmetry is itself instructive: the multi-hop query only traverses relationships where the intermediate node appears as a subject in a second triple. When the full paper’s extraction produces a richer graph with more outgoing relationships per entity, this query would return the longer and more varied chains the synthetic data can only hint at.
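The direction dependence is easier to see with the reification layer stripped away. The sketch below restates the two-hop join over plain (subject, predicate, object) label tuples that mirror the three synthetic triples; two_hop() is a hypothetical helper, not the repo's query_multihop().

```python
# Labels mirror the three synthetic triples in the query output above.
triples = [
    ("Chronicles", "revises", "Samuel-Kings"),
    ("permission structure", "established by", "Chronicles"),
    ("Jubilees", "extends", "permission structure"),
]


def two_hop(triples, start):
    """Follow subject -> object twice, skipping chains that loop back to start."""
    chains = []
    for s1, p1, o1 in triples:
        if s1 != start:
            continue
        for s2, p2, o2 in triples:
            # The intermediate node must be the SUBJECT of the second triple.
            if s2 == o1 and o2 != start:
                chains.append((start, p1, o1, p2, o2))
    return chains


chains = two_hop(triples, "permission structure")
for chain in chains:
    print(" -> ".join(chain))
```

Only the Chronicles chain comes back, because Jubilees never appears as the object of a triple whose subject is the permission structure.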

You can probably see how complicated this can all get, even with visuals. I was struggling with how to adapt the descriptions, and the outputs, to some visual format for these posts.

Like Query 1, Query 3 is parameterized. The query_multihop() function accepts any concept label, defaulting to “permission structure.” Any entity in the graph can serve as the traversal starting point. The two-hop chain logic stays the same regardless of where you begin.

One more point about Query 4 before looking at its result: the OPTIONAL clause in this query is important. Without it, any triple lacking a section reference would be silently excluded. A standard triple pattern either matches completely or contributes nothing to the result set. OPTIONAL makes section a nullable column instead of a filter condition: all inferred triples appear in the results, with the section field blank where none was identified. For a diagnostic query whose purpose is to surface everything the model inferred, excluding triples just because section is absent would undermine the point.
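If you know SQL, OPTIONAL behaves much like a left join. The sketch below imitates both behaviors over plain dictionaries; the second row (t2, with no section) is invented for illustration and does not exist in the synthetic data.

```python
# Inferred triples; t2 is a hypothetical row that has no section recorded.
inferred = [
    {"triple": "t1", "subject": "Jubilees"},
    {"triple": "t2", "subject": "Chronicles"},
]
sections = {"t1": "IV.C"}

# Required pattern: a row survives only if a section exists for it.
required = [dict(row, section=sections[row["triple"]])
            for row in inferred if row["triple"] in sections]

# OPTIONAL pattern: every row survives; a missing section becomes None.
optional = [dict(row, section=sections.get(row["triple"]))
            for row in inferred]

print(len(required), len(optional))
```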

Query 4 returns the one inferred triple in the synthetic data: Jubilees extends the permission structure, sourced from section IV.C. The section ordering confirms that the diagnostic query is working as intended. In a full extraction run, this is where you would look first to see where the model was doing interpretive work rather than direct reading.

The Full End-to-End Run

For this, you can check out the pipeline.py file, which wires all the stages together and prints a readable narrative at each transition. The output is designed to tell a story: you see the reasoning trace before the triples, the graph summary before the query results, and the grounded answer last, clearly labelled as derived from the graph, not from the model’s weights.

The pipeline accepts four arguments. A basic run uses the defaults for everything, including the default passage.txt and the default question:

  python pipeline.py

The two most useful flags for experimentation let you supply your own source material:

  python pipeline.py --passage my_passage.txt
  python pipeline.py --question "What texts does Chronicles revise?"

Two further flags control output. The --verbose flag prints the full raw model output at each stage. The --save-graph flag serializes the RDF graph to faithfulness_kg.ttl after construction:

  python pipeline.py --save-graph --verbose
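
For reference, a command-line interface with these four arguments could be declared with argparse along the following lines. This is a sketch, not a copy of pipeline.py: the build_parser() name and the help strings are my own, and I'm assuming the default question is the one that appears in the run output later in this post.

```python
import argparse

# Assumed to match the default question shown in the run output.
DEFAULT_QUESTION = (
    "Based on the knowledge graph, what is the relationship between "
    "Chronicles and the permission structure, and which later texts are "
    "connected to that structure through the graph?"
)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Run the knowledge graph pipeline.")
    parser.add_argument("--passage", default="passage.txt",
                        help="Path to the source passage file.")
    parser.add_argument("--question", default=DEFAULT_QUESTION,
                        help="Question to answer from the query results.")
    parser.add_argument("--save-graph", action="store_true",
                        help="Serialize the RDF graph to faithfulness_kg.ttl.")
    parser.add_argument("--verbose", action="store_true",
                        help="Print the full raw model output at each stage.")
    return parser


# argparse exposes --save-graph as the attribute save_graph.
args = build_parser().parse_args(["--save-graph", "--verbose"])
print(args.passage, args.save_graph, args.verbose)
```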

The individual scripts we ran earlier all operated against synthetic test data defined in their own __main__ blocks. Running the full pipeline is different: it loads the actual passage (from passage.txt), calls the model, builds a real graph from the extracted triples, and runs all four queries against that graph before asking the model to synthesize an answer.
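
The shape of that flow can be sketched in a few lines. To be clear, this is not the code in pipeline.py. The run_pipeline() function and its parameters are hypothetical, and I'm passing each stage in as a callable purely so the sketch runs without a model or a graph library.

```python
from typing import Any, Callable, Dict, List


def run_pipeline(
    passage_text: str,
    question: str,
    extract: Callable[[str], List[dict]],
    build_graph: Callable[[List[dict]], Any],
    queries: Dict[str, Callable[[Any], List[dict]]],
    answer: Callable[[str, Dict[str, List[dict]]], str],
) -> Dict[str, Any]:
    """Wire the four stages together; every stage is supplied by the caller."""
    triples = extract(passage_text)        # Stage 1: first model call
    graph = build_graph(triples)           # Stage 2: graph construction
    results = {name: run(graph) for name, run in queries.items()}  # Stage 3
    grounded = answer(question, results)   # Stage 4: second model call
    return {"triples": triples, "results": results, "answer": grounded}


# Usage with stand-in stages in place of the real model and graph.
demo = run_pipeline(
    passage_text="(passage text)",
    question="(question)",
    extract=lambda text: [
        {"subject": "Chronicles", "predicate": "revises", "object": "Samuel-Kings"}
    ],
    build_graph=lambda triples: list(triples),
    queries={"q1": lambda graph: graph, "q3": lambda graph: []},
    answer=lambda q, results: f"{sum(len(r) for r in results.values())} result(s) available",
)
print(demo["answer"])
```

Because each stage only sees the output of the one before it, you can swap in a stub for any stage and exercise the rest in isolation.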

It’s worth noting that the pipeline makes a second model call (the call_grounded_answer() function), but one of a deliberately different kind than the call_ollama() call in Stage 1. The system prompt for this stage instructs the model to answer only from the formatted query results it receives: no drawing on its own knowledge, explicit acknowledgment of which query result supports each claim, and a hard requirement to say clearly when the results are insufficient rather than fill gaps from inference.

This second call also omits the "format": "json" constraint used in extraction. That’s done because this stage wants natural language prose, not structured output. The design intent is a model that’s honest about what the graph contains rather than what it already knows. Whether that intent holds is part of what the output below demonstrates.
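
Here is a hedged sketch of how the two request payloads might differ. The field names follow the request shape of Ollama's /api/generate endpoint, but whether the repo uses that endpoint is an assumption on my part. The two builder functions are hypothetical, and the system prompt text is my paraphrase of the description above, not the prompt from the repo.

```python
import json

MODEL = "qwen2.5:latest"  # the model name shown in the run output

# Paraphrase of the grounding rules described above; not the repo's wording.
GROUNDED_SYSTEM_PROMPT = (
    "Answer only from the query results provided. Do not use your own "
    "knowledge. State which query result supports each claim. If the results "
    "are insufficient to answer, say so clearly."
)


def build_extraction_payload(passage: str, system_prompt: str) -> dict:
    """Stage 1 style request: structured output is required, so format is json."""
    return {
        "model": MODEL,
        "system": system_prompt,
        "prompt": passage,
        "format": "json",
        "stream": False,
    }


def build_grounded_payload(question: str, formatted_results: str) -> dict:
    """Stage 4 style request: no format key, because prose is wanted."""
    prompt = f"Query results:\n{formatted_results}\n\nQuestion: {question}"
    return {
        "model": MODEL,
        "system": GROUNDED_SYSTEM_PROMPT,
        "prompt": prompt,
        "stream": False,
    }


payload = build_grounded_payload("What does Chronicles revise?", "[Q1] target: Samuel-Kings")
print(json.dumps(payload, indent=2))
```

The only structural difference between the two is the presence of the format key. Everything else that makes the second call "grounded" lives in the system prompt and in what you choose to put in the prompt body.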

What follows is a complete end-to-end run against passage.txt with the default question. The output is worth examining carefully, because it illustrates both what the pipeline does well and where its current limitations lie.


============================================================
  STAGE: 1 of 4 — Triple Extraction
============================================================
  Loaded passage: passage.txt (8145 characters)

  Sending passage to model for triple extraction...
  Model: qwen2.5:latest

--- RAW MODEL OUTPUT ---
{
  "reasoning": "The passage discusses revisions made by the Chronicler in Chronicles compared to earlier texts like Samuel-Kings. It focuses on the changes to the census account and how these changes are justified through intertextual connections with other canonical texts, particularly the Balaam narrative.",
  "triples": [
    {
      "subject": {
        "id": "chronicles",
        "label": "Chronicles",
        "type": "Text"
      },
      "predicate": {
        "id": "revises",
        " ...
------------------------

  Extracted 7 valid triples (0 dropped).

  Reasoning trace (first 400 chars):
    The passage discusses revisions made by the Chronicler
    in Chronicles compared to earlier texts like Samuel-
    Kings. It focuses on the changes to the census account
    and how these changes are justified through intertextual
    connections with other canonical texts, particularly the
    Balaam narrative.

  Triples extracted: 7

============================================================
  STAGE: 2 of 4 — Graph Construction
============================================================

  Building RDF graph from extracted triples...
  Graph constructed: 85 RDF statements.

--- GRAPH SUMMARY ---
Total RDF statements: 85

Entity types:
  kg:Text: 7
  kg:Concept: 4
  kg:Community: 1
---------------------

  Graph saved to faithfulness_kg.ttl (turtle format).

============================================================
  STAGE: 3 of 4 — SPARQL Queries
============================================================

  Running Query 1: Entity Neighborhood (Chronicles)...

  [Q1 — Entity Neighborhood] — 1 result(s):
    1.   target: Samuel-Kings | predicate: revises | category: textual

  Running Query 2: Argumentative Triples...

  [Q2 — Argumentative Triples] — 1 result(s):
    1.   subject: Satan of 1 Chronicles 21:1 | predicate: identifies as | object: Heavenly Adversary | confidence: medium

  Running Query 3: Multi-hop from Permission Structure...

  [Q3 — Multi-hop Traversal]: No results found.

  Running Query 4: Inferred Triples...

  [Q4 — Inferred Triples] — 1 result(s):
    1.   subject: Satan of 1 Chronicles 21:1 | predicate: identifies as | object: Heavenly Adversary | section: II.C

============================================================
  STAGE: 4 of 4 — Grounded Answer
============================================================

  Question:
  Based on the knowledge graph, what is the relationship between Chronicles and the permission structure, and which later texts are connected to that structure through the graph?

  Sending query results to model for grounded answer...

--- RAW GROUNDED ANSWER ---
Based on the provided query results, there is no direct information about the relationship between "Chronicles" and a permission structure or which later texts are connected to that structure through the graph.

The only relevant result comes from Query 1, which states:
- **Target**: Samuel-Kings | Predicate: revises | Category: textual**

This indicates that the text "Samuel-Kings" (which likely refers to the Books of Samuel and Kings) has a revisionary relationship with some other text or texts. However, this does not directly connect to "Chronicles" in terms of permission structures or later texts.

Queries 2 and 4 both refer to "Satan of 1 Chronicles 21:1," but neither of these provides information about the permission structure or related texts. Query 3 returned no results, meaning there is no multi-hop traversal available from the given data points.

Therefore, based on the provided query results, we do not have enough information to answer the question regarding the relationship between Chronicles and the permission structure or which later texts are connected to that structure through the graph.
---------------------------


============================================================
  ANSWER (grounded in graph query results)
============================================================
  Based on the provided query results, there is no direct
  information about the relationship between "Chronicles"
  and a permission structure or which later texts are
  connected to that structure through the graph.  The only
  relevant result comes from Query 1, which states: -
  **Target**: Samuel-Kings | Predicate: revises | Category:
  textual**  This indicates that the text "Samuel-Kings"
  (which likely refers to the Books of Samuel and Kings) has
  a revisionary relationship with some other text or texts.
  However, this does not directly connect to "Chronicles" in
  terms of permission structures or later texts.  Queries 2
  and 4 both refer to "Satan of 1 Chronicles 21:1," but
  neither of these provides information about the permission
  structure or related texts. Query 3 returned no results,
  meaning there is no multi-hop traversal available from the
  given data points.  Therefore, based on the provided query
  results, we do not have enough information to answer the
  question regarding the relationship between Chronicles and
  the permission structure or which later texts are
  connected to that structure through the graph.

============================================================

  Pipeline complete.
  Triples extracted : 7
  RDF statements    : 85
  Query 1 results   : 1
  Query 2 results   : 1
  Query 3 results   : 0
  Query 4 results   : 1

Analyzing the Pipeline Output

Stage 1 extracts seven valid triples from the passage with zero dropped, and the reasoning trace is coherent: the model correctly identified the core subject matter as the Chronicler’s revision of the census account and its intertextual grounding in the Balaam narrative. That’s an accurate characterization of what Section II.C is doing! Seven triples from a passage of roughly 1,500 words is a modest yield, but a deliberate one: the extraction prompt favors fewer high-confidence triples over many uncertain ones. At that density, running the full paper (roughly 20,000 words across seven sections) would likely produce somewhere between sixty and one hundred triples, which is where multi-hop traversal starts returning genuinely surprising chains rather than the single-hop results visible here.

Stage 2 constructs 85 RDF statements from those seven triples: roughly twelve per triple, which is consistent with what the reification pattern produces when all metadata fields are populated and entity nodes are shared across triples rather than duplicated.
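
Both of those figures are back-of-envelope arithmetic, and it's easy to check them. The sketch below assumes extraction density scales linearly with word count, which is my simplification; real sections will vary.

```python
triples = 7
passage_words = 1_500    # "roughly 1,500 words"
paper_words = 20_000     # "roughly 20,000 words"
rdf_statements = 85

# Linear scaling of the Stage 1 yield to the full paper.
projected_triples = triples * paper_words / passage_words

# Average number of RDF statements produced per extracted triple in Stage 2.
statements_per_triple = rdf_statements / triples

print(f"Projected triples for the full paper: {projected_triples:.0f}")  # 93
print(f"RDF statements per triple: {statements_per_triple:.1f}")         # 12.1
```

The linear projection lands near the top of the sixty-to-one-hundred range, which leaves room for sections that are less dense than this one.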

Stage 3 is where things get instructive. Query 1 returns a single result (Chronicles revises Samuel-Kings), which is correct but thin. Query 2 surfaces one argumentative triple: the Satan of 1 Chronicles 21:1 identified as a Heavenly Adversary, marked medium confidence. Query 3 returns no results at all, and Query 4 confirms the same triple from Query 2 as the one inferred relationship in the graph.

The empty result from Query 3 is the most telling finding. The multi-hop traversal starts from the permission structure concept and finds nothing. And that’s because the model did not extract “permission structure” as an entity from this passage. That absence is not a pipeline failure. It’s an accurate reflection of what the extraction captured: Section II.C is a case study demonstrating the permission structure through a specific textual example, but it doesn’t use the phrase “permission structure” explicitly. The model extracted what was present rather than inferring the concept from context.
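
A quick way to confirm that kind of diagnosis is to check whether the starting entity exists at all before blaming the traversal. The sketch below uses only the two triples that surface in the query results above, since the other five aren't shown in the output. The helper names are hypothetical, and I've kept just the label fields from the triple schema.

```python
# The two triples visible in the run's query results; labels only.
RUN_TRIPLES = [
    {"subject": {"label": "Chronicles"},
     "predicate": {"label": "revises"},
     "object": {"label": "Samuel-Kings"}},
    {"subject": {"label": "Satan of 1 Chronicles 21:1"},
     "predicate": {"label": "identifies as"},
     "object": {"label": "Heavenly Adversary"}},
]


def entity_labels(triples):
    """Collect every subject and object label, case-folded for comparison."""
    labels = set()
    for triple in triples:
        labels.add(triple["subject"]["label"].casefold())
        labels.add(triple["object"]["label"].casefold())
    return labels


def has_entity(triples, label):
    """True if the label appears as a subject or object anywhere in the triples."""
    return label.casefold() in entity_labels(triples)


print(has_entity(RUN_TRIPLES, "permission structure"))
print(has_entity(RUN_TRIPLES, "Chronicles"))
```

If the starting entity is missing, an empty multi-hop result tells you about the extraction, not about the query.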

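Stage 4 responds to this correctly. The grounded answer acknowledges that the query results don’t contain enough information to answer the question about the permission structure, and it walks through what the results do show rather than confabulating an answer. It isn’t flawless: it reads the Query 1 row as a statement about Samuel-Kings rather than as Chronicles’ own neighborhood, so it misses that Chronicles is the text doing the revising. Even so, refusing to answer beyond the results is the behavior the system prompt was designed to produce. Seeing it work as intended is more valuable than a confident but unsupported answer would have been.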

What this run demonstrates is a genuine property of knowledge graph pipelines: the quality of the answers you can retrieve is bounded by the quality of the extraction. The graph is only as rich as what the model found. Running the same pipeline against the full paper rather than a single section, or adjusting the extraction prompt to explicitly ask for the permission structure concept, would produce a different graph and correspondingly richer query results. That might be a natural next experiment.

What All This Demonstrates

The pipeline we’ve built and played around with here is not a production system. I think that’s clear. However, it is a demonstration of a pattern, and the pattern is worth being precise about.

A local language model, given a well-structured prompt, can read a scholarly passage and produce typed, metadata-carrying triples with reasonable fidelity. It will make its own predicate choices when the vocabulary is underspecified, it will miss concepts that are present implicitly rather than explicitly, and it will occasionally produce entity identifiers that need sanitization. None of that is surprising. What matters is that those failure modes are visible, debuggable, and correctable (all qualities!) and this is because the extraction output is structured, the graph is inspectable, and the queries are explicit. When something is wrong, you can see exactly where it went wrong.

That visibility is the core property this approach buys you. A language model answering a question directly gives you fluent prose with no audit trail. A language model operating as the constructor and interface of a knowledge graph gives you something you can interrogate: which triples were extracted, which were marked inferred, which queries returned results, and what the grounded answer was actually derived from.

The answer in Stage 4 that said “we do not have enough information” was not a failure. It was the system working correctly! And knowing the difference between those two outcomes is only possible because the pipeline makes its reasoning explicit at every stage.

The natural next experiments would follow directly from what the run revealed. Running the full paper rather than a single section would produce a graph dense enough to exercise the multi-hop query meaningfully. Tightening the extraction prompt’s predicate vocabulary would improve query consistency. Adjusting the question to match what the graph actually contains would demonstrate the grounded answer at its best rather than at its limit.

The key thing for a tester to note is that each of those is a one-variable change against a system whose stages are independently testable, which is, in the end, the entire testing argument for this architecture.

The Chronicler revised Samuel-Kings to make it answerable to his community’s present need. This pipeline does something structurally similar: it takes a text, makes its relationships explicit, and makes those relationships queryable in service of a present question. The method, as noted in the previous post, is the same.

Next Steps

This post made a specific claim: that the pipeline’s failure modes are visible, debuggable, and correctable. The natural continuation here is the one that actually demonstrates that claim systematically rather than just asserting it. That points toward evaluation, and our old friend DeepEval is a reasonable fit, but with a specific angle worth considering.

That specific angle ties into the interesting testing question for this pipeline. That question is not “did the model get the right answer” in the conventional sense. It’s “did the extraction produce a graph that faithfully represents the source text.” That is a faithfulness evaluation problem, which happens to be exactly what DeepEval’s faithfulness metric is designed for.

The recursive irony of running a faithfulness evaluator against a pipeline built on a paper called “Revision as Faithfulness” is almost too good to ignore.

Join me in the next post as we continue this journey.

Stories from a Software Tester

Twice upon a time, in another space, no distance in any direction from here …
