AI and Testing: From Ontology to Implementation

In the previous post, we looked at setting up an ontology based on a Z-Machine specification. Our goal was to get this in place so that we could have an LLM generate code to implement the portion of the ontology that we described. In this post, we’ll attempt exactly that.

The Context

Let’s remind ourselves of our operating context: you’re working at a company that wants to put retrogaming back on the map. You all want to create a Z-Machine interpreter to play the Infocom games from the 1980s. Your company is investing heavily in AI generation and wants to use models to generate the code base for this project. Further, let’s say this project will be Python-based.

Which Model?

Generally, you’re going to want to use a model that’s specifically geared to coding. Qwen2.5, which we’ve looked at in other posts in this series, is a reasonable choice for code generation. That said, we were using general-purpose Qwen2.5 models, and there are coder-specific variants available. For this post, I recommend the following:

  ollama pull qwen2.5-coder:14b

Note that the download can take a while depending on your connection: the 14B version is around nine gigabytes. If you want something smaller to test the pipeline, qwen2.5-coder:7b is about half the size and still reasonable for code generation.

Taking Stock of What We Have

The pre-steps completed so far are:

  • Gold standard ontology authored and validated ✔
  • Spec generated in text readable form ✔
  • Zcode files acquired and version-verified ✔

Let’s break this down a little further, based on work we did in the last post. At this point in the project, we have three artifacts in hand:

  • A normative spec (extraction_input_sect11.txt): the structured
    plain-text rendering of Section 11 of the Z-Machine Standard, which defines what a conforming interpreter must handle.
  • A conventional spec (extraction_input_appb.txt): the structured rendering of Appendix B, which documents fields that real-world Infocom and Inform story files use by practice rather than by mandate.
  • A formal ontology (zmachine_header.ttl): the reconciled, machine-readable knowledge graph we built in the previous step, which captures both sources and adds structured relationships: version applicability, access authority, pointer semantics, and the supersession relationships that the spec describes only informally.

The natural next question is: what can you do with this? One answer is what we did in the last post: validate the ontology itself, checking for structural consistency, coverage, and correct version applicability ranges. That’s valuable, but it’s internal: it confirms the ontology is well-formed, not that it’s useful.

A more interesting test is to ask whether the ontology can serve as a specification for generating working code. Specifically: can a local language model read these three sources, reconcile them, and produce a Python module that correctly parses the Z-Machine header from a real story file?

This is a stronger claim than extraction. Extraction asks whether a model can find structure in prose. Code generation from a formal spec asks whether a model can translate structured knowledge into a verifiable artifact; in this case, one that either works against a real story file or doesn’t.

The instinct might be to just give the model the ontology. It’s already reconciled and structured, isn’t it? So, why not use it directly?

The problem is that the ontology captures what but not always why. When the model encounters the supersededBy relationship linking, say, field_font_width_v5 to field_font_height_v6_at26, the ontology alone doesn’t explain that this represents a deliberate swap of meaning at the same byte offset across a version boundary. The specification text, on the other hand, makes the rationale concrete.

Similarly, the conventional fields in Appendix B (the release number, serial code, Tandy bit) are present in the ontology but their practical significance (that ignoring them means misreading real Infocom files) comes from the prose context.

Think of it this way: the ontology is the architect’s formal drawing. The specification texts are the building code and the contractor’s field notes. A good engineer reads all three: the drawing tells you what to build, the building code tells you the constraints it must satisfy, and the field notes tell you where the drawing made assumptions that real-world practice has since clarified.

Giving the model all three sources also creates an implicit test of the ontology itself: if the model produces code that contradicts the specification text, we learn something about a gap or ambiguity in the ontology. If the model produces code that contradicts only the conventional fields, we learn something about how well Appendix B was captured. The reconciliation task surfaces discrepancies.

The Generation Script

With our three source artifacts in hand, the next step is to give them to a model and ask it to produce working code. That work is handled by generate.py. Before running it, it’s worth understanding a few design choices baked into the prompts you will see in that file, because they reflect decisions about how to use the ontology rather than just pass text to a model.

The script uses two prompts: a system prompt and a user prompt. I recommend taking a look at those. This is a standard pattern for chat-style models, but the division of labor here is deliberate. The system prompt establishes what kind of agent the model is and what it will be working from. The user prompt delivers the actual task, with the three source documents injected at runtime. Think of the system prompt as the job description and the user prompt as the work order.
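That division of labor can be sketched as a chat-style message list. This is a minimal illustration, not the actual generate.py code: the prompt text and the `build_messages` helper are hypothetical stand-ins for what lives in that file.

```python
# Sketch of the system/user prompt split. The prompt strings here are
# placeholders; the real prompts are in generate.py.
SYSTEM_PROMPT = (
    "You are a Python engineer writing a Z-Machine header parser. "
    "You work from three sources: a normative spec, a conventional spec, "
    "and a formal ontology."
)

USER_PROMPT_TEMPLATE = """Normative spec:
{normative}

Conventional spec:
{conventional}

Ontology:
{ontology}

Write a complete Python module that parses the Z-Machine header."""


def build_messages(normative, conventional, ontology):
    """Assemble the chat payload: job description first, then the work order
    with the three source documents injected at runtime."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": USER_PROMPT_TEMPLATE.format(
            normative=normative, conventional=conventional, ontology=ontology)},
    ]
```

The key design point survives the simplification: the system message is fixed, and only the user message varies with the source documents.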

The System Prompt

The system prompt opens by telling the model it is working from three sources, and it names them in order of authority: normative spec first, conventional spec second, ontology third. That ordering matters. It primes the model to treat the sources as a hierarchy rather than a flat pile of text.

The most consequential instruction in the system prompt is the reconciliation rule:


Where they appear to differ, prefer the normative spec text, but add a comment noting the discrepancy.

This is the normative/conventional distinction from the ontology, encoded directly as a decision rule in the model’s operating instructions. Rather than leaving the model to guess which source to trust when they conflict, the prompt makes the tiebreaker explicit. The ontology itself captures this distinction via the hasSourceAuthority property, tagging each individual as either zm:normative or zm:conventional. The prompt translates that structural fact into plain English so the model can act on it.

The system prompt also enumerates the version-conditional branching cases explicitly: the Flags 1 split between Versions 1-3 and Version 4+, the font dimension swap at offsets 0x26 and 0x27, and the header extension table. These are exactly the cases the ontology encodes structurally: the supersededBy relationships, the version-gated bit individuals, the pointerTo link. Notice that naming them in plain English in the prompt is not redundant. The ontology expresses structure; the prompt expresses intent. A model reading a supersededBy triple knows two fields are related. A model reading “in Version 6 these meanings are SWAPPED” knows what to do about it.

Finally, the system prompt ends with a hard output constraint: the entire response must be valid Python, with no prose, no markdown fences, nothing but code. This is a reliability measure that reveals something real about working with local models. A hosted model will generally follow formatting instructions consistently. A local model, especially under a large context load, will sometimes slip into explanation mode and wrap its output in markdown. The extract_code function in the script handles that case defensively, but the prompt constraint is the first line of defense.
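The defensive stripping can be as simple as a regular expression that pulls the body out of a fenced block when one is present and otherwise returns the text untouched. This is a sketch of the idea, not necessarily the exact `extract_code` in the script:

```python
import re

# Matches an optional "```python" (or bare "```") fence and captures its body.
FENCE_RE = re.compile(r"```(?:python)?\s*\n(.*?)```", re.DOTALL)


def extract_code(response):
    """Strip markdown fences a model may add despite the code-only constraint.

    If no fence is found, assume the whole response is code.
    """
    match = FENCE_RE.search(response)
    return match.group(1).strip() if match else response.strip()
```

Running it on a fenced response yields just the code body; clean responses pass through unchanged.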

The User Prompt

The user prompt is where the three source documents are actually injected, formatted into a template at runtime. But before asking the model to write any code, the prompt asks it to do something first:


Before writing code, work through the following:
  1. List all fields you will parse, grouped as: scalar fields, flags
     registers, extension table fields, conventional fields.
  2. Identify the cases where version-conditional branching is required.
  3. Then write the complete module.

This is a chain-of-thought nudge: ask the model to reason before it acts. Steps 1 and 2 mirror exactly what a careful human engineer would (hopefully!) do before writing a parser from scratch. By making that reasoning explicit in the output, we also get something useful for debugging: if the generated code handles the font swap incorrectly, we can look at step 2 of the model’s reasoning and see whether it identified the swap at all. The mistake traces back to a specific point in the reasoning chain rather than appearing as an unexplained error in the output.

What the Script Does

To run this script, you will first need the artifacts generated in the previous post. At runtime, the script loads all three source files from disk, formats them into the user prompt template, and sends the result to a locally running Ollama instance. The model, context window, and endpoint are all configurable at the top of the file. The default is qwen2.5-coder:14b with a context window of 32,768 tokens, which is necessary because the combined prompt (system instructions plus all three source documents) is large.

The response streams back token by token, so you see the model working in real time rather than waiting for a complete response. Once the full response has been received, extract_code strips any markdown fences the model may have added despite the output constraint, then does a basic sanity check: if the output contains no Python indicators at all (def, import, class), it reports an error rather than writing unusable text to disk. If the output passes that check, it’s written to zmachine_header.py in the same directory.

If you run this and the model produces prose instead of code, the most likely cause is context length pressure: the model has run out of effective context window and is summarizing rather than generating. The script’s error output will suggest trying a larger model or reducing the context. The qwen2.5-coder:7b model is a reasonable fallback if the 14b version is taxing your hardware.

The Generated Output

Running the script produced a Python module that we’ll use as our reference implementation for the rest of this post. You can view it here: zmachine_header.py.

If you run the script yourself, your output may (and likely will) differ. Local models are not deterministic by default, and small variations in token sampling mean two runs of the same prompt against the same model can produce structurally similar but textually different code. That’s expected. The reference file is here so you have a concrete artifact to follow along with as we evaluate what the model produced, not as the one correct answer.

If you simply want to follow along with this post, feel free to use my generated example; that’s why I provide it.

What we care about is whether the output is functionally correct: does it parse the right fields, handle version-conditional branching, and produce sensible output against real story files? That’s what the next steps will test.

Testing the Generated Parser

With a generated parser in hand, the next question is whether it actually works. That’s the job of test.py. What makes this script interesting is what it uses as its source of truth: not a hardcoded list of expected values, but the ontology itself.

This is the payoff of having built the ontology in the first place. Rather than writing a test that says “for a Version 3 file, expect these fields at these offsets,” the script queries the ontology at runtime to derive what fields should be present for whatever version it finds in the file.

This is a crucial point! The ontology’s version applicability ranges, the applicableFrom and applicableUntil properties we defined, become the test oracle.
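To make the oracle idea concrete, here is a minimal sketch with the applicability ranges inlined as a dictionary. The real script queries the ontology file; the field names and ranges below are illustrative examples, not the full set.

```python
# Illustrative subset of the ontology's applicability ranges:
# (applicableFrom, applicableUntil), with None meaning "no upper bound".
FIELD_APPLICABILITY = {
    "version_number":       (1, None),
    "status_line_type_bit": (1, 3),     # Flags 1, V1-3 layout only
    "font_width_v5":        (5, 5),     # superseded at the V6 boundary
    "routines_offset":      (6, None),  # V6 addition
}


def expected_fields(version):
    """Derive which fields should be present for a given Z-Machine version."""
    return {
        name
        for name, (low, high) in FIELD_APPLICABILITY.items()
        if low <= version and (high is None or version <= high)
    }
```

The payoff is that the test has no hardcoded per-version expectations: change the ontology and the oracle changes with it.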

The script’s approach for each story file is worth walking through, because it reflects a specific philosophy about what “testing” means in this context:

  1. Read the version byte directly from the raw file bytes, bypassing the parser entirely. This establishes ground truth independently; if the parser misreads the version, the test still knows the correct one.
  2. Query the ontology to determine which fields and bits are expected for that version, filtering out superseded fields and respecting version boundaries.
  3. Run the parser and collect its output.
  4. Compare expected against actual, flagging missing fields, unexpected fields, and the specific tricky cases the ontology encodes: the Flags 1 split, the font field swap, and the extension table.
  5. Report per-file results and a final summary table across all files tested.

The tricky case checks deserve particular attention, because they are not generic field coverage checks. They probe the specific version-conditional branching decisions that reveal whether the parser genuinely used the ontology’s structure or merely pattern-matched on the specification text. A parser that gets the Flags 1 split wrong has mishandled the most consequential version boundary in the header. A parser that gets the font field swap wrong has missed a deliberate meaning reversal at the same byte offset across a version boundary.

These aren’t edge cases; they’re the cases the ontology was specifically built to capture. A crucial testing moment!

One other detail worth noting: the script’s parse function discovery is deliberately flexible. Because the generated parser comes from a language model, the name of the top-level parse function may vary between runs. The script tries a set of common names first, then probes all module-level functions looking for one that accepts a single argument and returns a dictionary. This is a practical acknowledgment that the generated artifact is not fully predictable and a good example of writing test infrastructure that is robust to that kind of variation.

With that as the frame, let’s run it against our test files and look at what we get.

Running the Tests

Zork 1 (Release 88, Version 3)

The first file is zork1-r88-s840726.z3, a Version 3 story file. The output is clean:


  File   : zork1-r88-s840726.z3
  Version: 3

--- Field Coverage (ontology-derived expectations) ---
  PASS [normative] Version Number (offset 0x00)
  PASS [conventional] Release Number (offset 0x02)
  PASS [normative] High Memory Base (offset 0x04)
  PASS [normative] Initial Program Counter (V1-5) (offset 0x06)
  PASS [normative] Dictionary Location (offset 0x08)
  PASS [normative] Object Table Location (offset 0x0A)
  PASS [normative] Global Variables Table Location (offset 0x0C)
  PASS [normative] Static Memory Base (offset 0x0E)
  PASS [normative] Abbreviations Table Location (offset 0x18)
  PASS [normative] File Length (offset 0x1A)
  PASS [normative] File Checksum (offset 0x1C)
  PASS [normative] Standard Revision Number (offset 0x32)
  PASS [conventional] Inform Compiler Version (offset 0x3C)

--- Tricky Case Checks ---
  PASS: Flags 1 version split appears correct
  SKIP: Font fields only present from V5
  PASS: No extension table in V3 output (correct)
  PASS: Conventional field present — release number (0x02)
  PASS: Conventional field present — serial code (0x12)
  PASS: Conventional field present — Inform compiler version (0x3C)
  PASS: Output contains a 'conventional' grouping key

  Summary: 19 PASS  |  0 WARN  |  0 FAIL

A few things are worth noting here. The field coverage section shows the ontology’s normative/conventional distinction surfacing directly in test output: each field is labeled with its source authority, so it’s immediately visible which fields come from the Z-Machine Standard itself and which come from Infocom practice. The serial code at offset 0x12 passes even though it doesn’t appear in the field coverage list above it: it’s caught by the conventional field check in the tricky cases section, which probes for it explicitly.

The font field check reads SKIP rather than PASS or FAIL. This is correct behavior: font fields don’t exist in Version 3, and the script knows this because it derives expectations from the ontology rather than a fixed list. It’s not ignoring the check; it’s recognizing that the check doesn’t apply here.

The Flags 1 split pass is the most meaningful result in this run. Version 3 uses a different set of bit meanings at offset 0x01 than Version 4 and above, and the parser handled that branch correctly. That’s the tricky case the ontology was most specifically built to capture, and it holds on the first file.

Zork 1 Invisiclues (Release 52, Version 5)

The second file is zork1-invclues-r52-s871125.z5, a Version 5 story file. The result is 32 PASS, 2 WARN, 0 FAIL:


  File   : zork1-invclues-r52-s871125.z5
  Version: 5

--- Field Coverage (ontology-derived expectations) ---
  PASS [normative] Version Number (offset 0x00)
  ...
  PASS [normative] Font Width in units (V5, offset 0x26) (offset 0x26)
  PASS [normative] Font Height in units (V5, offset 0x27) (offset 0x27)
  ...
  WARN [normative] Header Extension Table Address (offset 0x36)
  PASS [conventional] Inform Compiler Version (offset 0x3C)

--- Tricky Case Checks ---
  PASS: Flags 1 version split appears correct
  PASS: Font width label present in V5 output
  PASS: Font height label present in V5 output
  WARN: No extension table in V5 output — expected for V5+ (may be absent if pointer is 0)
  ...

  Summary: 32 PASS  |  2 WARN  |  0 FAIL

The increased field count compared to the Version 3 run reflects the ontology doing its job: Version 5 files carry additional normative fields (interpreter number, screen dimensions, colour defaults, terminating characters) and the test script derived all of them from the ontology’s version applicability ranges rather than from a hardcoded list.

The font field results are worth pausing on. In the Version 3 run, those checks were marked SKIP. Here they pass: the parser correctly identified font width at offset 0x26 and font height at offset 0x27 for a Version 5 file. This is the field that the ontology encodes as a swap across the Version 5/6 boundary: in Version 6, those two offsets reverse their meanings. The parser handled the Version 5 side of that swap correctly.

The two WARNs both point at the same underlying issue: the header extension table. The field coverage check flagged the extension table address at offset 0x36 as WARN, and the tricky case check followed with a note that no extension table appeared in the output, adding that the pointer may simply be zero. These two warnings are not independent failures: they are two views of the same observation.

This is actually the expected behavior for this particular file. The header extension table is normative for Version 5 and above, meaning the field pointing to it must be present in the header. But the table itself is optional: a story file that doesn’t use the extension features will set that pointer to zero, and a correct parser should report the pointer value without attempting to read a table that isn’t there. The WARN rather than FAIL reflects that ambiguity: the test script can see that the parser didn’t report extension table contents, but it can’t determine from the output alone whether that’s because the pointer was zero or because the parser silently skipped it. The overall status of WARN rather than OK is honest: something worth looking at, but not a demonstrated failure.
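The correct parser behavior for this case is easy to state in code. This is a sketch of the zero-pointer rule, not the generated parser itself; the returned dictionary shape is illustrative.

```python
import struct


def read_extension_table(data):
    """Follow the header extension pointer at offset 0x36.

    A zero pointer means the story file doesn't use extension features,
    so there is no table to read -- that's valid output, not an error.
    """
    addr = struct.unpack_from(">H", data, 0x36)[0]
    if addr == 0:
        return None
    # First word at the table address is the number of data words that follow.
    length = struct.unpack_from(">H", data, addr)[0]
    words = struct.unpack_from(">%dH" % length, data, addr + 2)
    return {"table_length": length, "words": list(words)}
```

Returning `None` for a zero pointer, rather than raising or fabricating an empty table, is what distinguishes "table absent" from "parser skipped it".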

Beyond Zork (Release 57, Version 5)

The third file is beyondzork-r57-s871221.z5, also Version 5. The result is 34 PASS, 0 WARN, 0 FAIL:


  File   : beyondzork-r57-s871221.z5
  Version: 5

--- Field Coverage (ontology-derived expectations) ---
  ...
  PASS [normative] Header Extension Table Address (offset 0x36)
  PASS [conventional] Inform Compiler Version (offset 0x3C)

--- Tricky Case Checks ---
  PASS: Flags 1 version split appears correct
  PASS: Font width label present in V5 output
  PASS: Font height label present in V5 output
  PASS: Extension table present in V5 output
  ...

  Summary: 34 PASS  |  0 WARN  |  0 FAIL

This run is two checks richer than the Invisiclues run (34 versus 32) and the difference is exactly the extension table. Where the previous file produced two warnings around that feature, Beyond Zork produces clean passes on both the field coverage check and the tricky case check. The extension table pointer in this file is non-zero, the parser followed it, and the output reflects that correctly.

This is a useful pair of results to have side by side. The Invisiclues WARNs were not evidence of a parser defect; they were evidence that the previous file simply didn’t use the extension table. Beyond Zork confirms the parser can handle the extension table when it’s actually present. Together, the two runs cover both cases: pointer absent (or zero) and pointer populated.

Note: this is a case of choosing our test and data conditions carefully!

It’s also worth noting what stayed consistent across both Version 5 runs: the Flags 1 split, the font field checks, and all conventional fields passed in both. The parser’s handling of Version 5 is stable across two structurally different story files.

Zork Zero (Release 393, Version 6)

The fourth file is zork0-r393-s890714.z6, the only Version 6 file in our test set. The result is 34 PASS, 0 WARN, 0 FAIL:


  File   : zork0-r393-s890714.z6
  Version: 6

--- Field Coverage (ontology-derived expectations) ---
  ...
  PASS [normative] Routines Offset (offset 0x28)
  PASS [normative] Static Strings Offset (offset 0x2A)
  ...
  PASS [conventional] Player Username (offset 0x38)
  PASS [conventional] Inform Compiler Version (offset 0x3C)

--- Tricky Case Checks ---
  PASS: Flags 1 version split appears correct
  PASS: Both font dimension labels present in V6 output (manual check: 0x26=height, 0x27=width in V6)
  PASS: Extension table present in V6 output
  ...

  Summary: 34 PASS  |  0 WARN  |  0 FAIL

Several things in this output are specific to Version 6 and worth naming explicitly.

The field list includes two fields that didn’t appear in either Version 5 run: Routines Offset at 0x28 and Static Strings Offset at 0x2A. These are Version 6 additions that the ontology tracks with their own applicability ranges, and the test script derived them correctly without any hardcoded version logic of its own.

The conventional field list gains a new entry as well: Player Username at offset 0x38. This is a field that appears in Version 6 Infocom files by practice rather than by mandate, which is exactly the kind of conventional field the ontology captures via Appendix B. Here, the parser surfaced it correctly.

The font field check is the most meaningful result in this run. The tricky case message reads: Both font dimension labels present in V6 output (manual check: 0x26=height, 0x27=width in V6). That parenthetical is significant. In Version 5, offset 0x26 is font width and offset 0x27 is font height. In Version 6, those meanings are reversed: 0x26 becomes font height and 0x27 becomes font width. The test script can confirm that both labels are present in the output, but it notes explicitly that verifying the swap itself requires a manual check. This is important! The script cannot determine from label presence alone whether the parser assigned the correct meaning to each offset. The PASS here reflects what automated checking can confirm; the parenthetical is honest about what it cannot.

A crucial point here: this was all designed to show where human work still has to intersect with AI work.

This is a good place to pause and take stock of what the four runs together have demonstrated.

Running Against the Full Directory

Rather than running each file individually, the script accepts a directory argument and discovers all zcode files within it automatically:

python test.py .\zmachine\zcode\

The per-file output is the same as we’ve already walked through. The part worth showing here is the summary table:


======================================================================
  SUMMARY TABLE
======================================================================
  File                             V   PASS   WARN   FAIL  Status
  ------------------------------  ---  -----  -----  -----  ----------
  beyondzork-r57-s871221.z5        5     34      0      0  OK
  zork0-r393-s890714.z6            6     34      0      0  OK
  zork1-invclues-r52-s871125.z5    5     32      2      0  WARN
  zork1-r88-s840726.z3             3     19      0      0  OK

  Versions tested: [3, 5, 6]
  Files tested   : 4

Three files come back OK and one comes back WARN. As we saw when we ran the Invisiclues file individually, those two warnings both trace to the extension table, a normative feature of Version 5 that this particular file doesn’t populate. No file produced a FAIL.

The bottom two lines of the table are worth reading as a unit: three Z-Machine versions covered across four files. That’s not exhaustive: Infocom versions 1, 2, and 4 aren’t represented here, and neither are the Inform-era versions 7 and 8. However, we do cover the most structurally distinct cases: a Version 3 file with the simpler header layout, two Version 5 files that exercise the extended fields, and a Version 6 file that exercises the font swap. The gaps are real but the coverage is meaningful.

A Reference Implementation

The generated parser gives us something to test. But testing requires a baseline: some notion of what correct output actually looks like. That’s the role of reference.py. This is a hand-written implementation of the same header parser, written directly from the specification and ontology rather than generated from them.

The distinction matters. The generated parser was produced by a model working from three sources under a set of prompt constraints. The reference implementation was written by a human (me!) with the same three sources, making every decision explicit and traceable. Where the generated parser might arrive at a correct result through pattern matching on the specification text, the reference implementation is designed so that every line of code has a visible reason (a spec section, an ontology individual, a version boundary) that can be audited directly.

This is a common pattern in safety-critical engineering: you build the thing, and you also build a simpler, more transparent version of the same thing whose correctness is easier to establish independently. The reference implementation doesn’t have to be fast or clever. It has to be obviously right. Then you can ask whether the thing you actually built agrees with it. This is a great example of where the tester and developer skill sets have to merge.

A few design choices in the reference implementation are worth naming before we run it. Fields absent for a given version are absent from the output entirely (not present as None or zero), which means the output structure itself encodes version applicability rather than leaving it implicit. Conventional fields are grouped under a separate key, mirroring the ontology’s hasSourceAuthority distinction. And the font field swap at offsets 0x26 and 0x27 is handled with an explicit comment block that names it as the critical version-conditional case, cites the ontology individuals by name, and makes the V5/V6 branching impossible to miss on a code review.
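The font swap branch, the critical version-conditional case, can be sketched like this. The field names are illustrative; the offsets and the V5/V6 reversal are from the specification as described above.

```python
def parse_font_fields(data, version):
    """Handle the font dimension swap at offsets 0x26/0x27.

    V5:  0x26 = font width,  0x27 = font height
    V6+: 0x26 = font HEIGHT, 0x27 = font width  (meanings deliberately
         reversed at the version boundary)
    """
    first, second = data[0x26], data[0x27]
    if version == 5:
        return {"font_width": first, "font_height": second}
    if version >= 6:
        return {"font_height": first, "font_width": second}
    # Font fields don't exist before V5: absent from the output entirely,
    # not present as None or zero.
    return {}
```

Note how the pre-V5 branch returns an empty dictionary rather than zeroed fields, matching the design choice that absence itself encodes version applicability.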

With that in mind, let’s run it against the same four story files and see what we get.

Zork 1 (Release 88, Version 3)

Running the reference implementation against the Version 3 file produces clean, structured JSON:


{
  "version": 3,
  "scalar_fields": {
    "high_memory_base": 20023,
    "initial_pc": 20229,
    ...
    "standard_revision": { "major": 0, "minor": 0 }
  },
  "flags1": {
    "status_line_type_score_turns": true,
    "story_split_across_two_discs": false,
    "status_line_not_available": false,
    "screen_splitting_available": false,
    "variable_pitch_font_default": false,
    "tandy_bit": false
  },
  "flags2": {
    "transcripting_on": false,
    "force_fixed_pitch": false
  },
  "extension_table": null,
  "conventional": {
    "release_number": 88,
    "serial_code": "840726",
    "serial_code_date": "1984-07-26",
    "inform_compiler_version": null
  }
}

Several things in this output are worth pausing on, because they connect directly to decisions made in the ontology.

The flags1 block uses the Version 1-3 bit layout: the one with status line type, disc split, screen splitting, and the Tandy bit. There’s no overlap with the Version 4+ layout. That branching is exactly what the ontology models as two distinct sets of BitField individuals sharing a single FlagsRegister parent, and the reference implementation makes the same cut at the same version boundary.

The flags2 block contains only two fields: transcripting and force fixed pitch. The Version 5+ bits (pictures, undo, mouse, colours, sound) are simply absent. This is the design choice mentioned earlier: fields that don’t apply to this version don’t appear as None or zero, they don’t appear at all. The output structure mirrors the ontology’s version applicability ranges directly.
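The Version 1-3 Flags 1 decode behind that first block can be sketched as straightforward bit tests. The bit positions below are my reading of Section 11 (bit 1 is the status line type, where 0 means score/turns); verify against the spec before relying on them.

```python
def parse_flags1_v1to3(flags):
    """Decode the Flags 1 byte using the Version 1-3 bit layout.

    Bit positions per my reading of Section 11 of the Z-Machine Standard:
    bit 1 = status line type (0: score/turns, 1: hours:mins),
    bit 2 = story split across two discs, bit 3 = Tandy bit (conventional),
    bit 4 = status line not available, bit 5 = screen splitting available,
    bit 6 = variable-pitch font default.
    """
    return {
        "status_line_type_score_turns": not (flags >> 1 & 1),
        "story_split_across_two_discs": bool(flags >> 2 & 1),
        "tandy_bit": bool(flags >> 3 & 1),
        "status_line_not_available": bool(flags >> 4 & 1),
        "screen_splitting_available": bool(flags >> 5 & 1),
        "variable_pitch_font_default": bool(flags >> 6 & 1),
    }
```

For the Zork 1 file, a flags byte with bit 1 clear yields `status_line_type_score_turns: True`, matching the output shown above.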

The extension_table is null. Version 3 has no extension table, and the reference implementation represents that honestly rather than as an empty dictionary or a zero value.

The conventional block is where the filename pays off. The serial code reads 840726, which the reference implementation parses as a compilation date: 1984-07-26. The release number is 88. Both values appear verbatim in the filename zork1-r88-s840726.z3: the filename convention encodes exactly the fields the header contains, and the parser recovers them correctly. The inform_compiler_version is null, confirming this is an original Infocom file rather than one compiled with Inform.
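The serial-to-date conversion is a small piece of logic worth showing. A sketch, assuming the Infocom YYMMDD convention with two-digit years mapping into the 1900s:

```python
from datetime import date


def parse_serial_code(serial):
    """Interpret a six-digit Infocom serial code (YYMMDD) as a date.

    Returns None for non-numeric serials (e.g. some Inform-compiled files)
    or digit strings that don't form a valid date.
    """
    if len(serial) != 6 or not serial.isdigit():
        return None
    yy, mm, dd = int(serial[:2]), int(serial[2:4]), int(serial[4:6])
    try:
        return date(1900 + yy, mm, dd)
    except ValueError:
        return None
```

Returning `None` rather than raising keeps the conventional block populated even when a serial code doesn't decode as a date.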

Zork 1 Invisiclues (Release 52, Version 5)

The Version 5 output is considerably richer in structure, reflecting the expanded header layout that Version 5 introduces:


{
  "version": 5,
  "scalar_fields": {
    ...
    "font_width": 0,
    "font_height": 0,
    ...
    "header_extension_table_address": 0
  },
  "flags1": {
    "boldface_available": false,
    "italic_available": false,
    "fixed_space_style_available": false,
    "timed_keyboard_input_available": false,
    "colours_available": false
  },
  "flags2": {
    "transcripting_on": false,
    "force_fixed_pitch": false,
    "game_wants_pictures": false,
    "game_wants_undo": false,
    "game_wants_mouse": false,
    "game_wants_colours": false,
    "game_wants_sound_effects": false
  },
  "extension_table": null,
  "conventional": {
    "release_number": 52,
    "serial_code": "871125",
    "serial_code_date": "1987-11-25",
    "inform_compiler_version": null
  }
}

The flags1 block has switched entirely to the Version 4+ layout. The Version 1-3 fields (status line type, disc split, Tandy bit) are gone, replaced by capability flags the interpreter sets to advertise what it supports. All of them read false here because this is a story file at rest, not a file being run by an interpreter that has filled in those values.

The flags2 block has gained the full set of Version 5+ game request bits: pictures, undo, mouse, colours, sound effects. All false for the same reason.

The extension table is where this run becomes most informative. The header_extension_table_address in the scalar fields is zero, and the extension_table key is null. This is the reference implementation being precise: the pointer exists in the header because Version 5 requires it, but its value is zero, so there’s no table to read. This resolves the ambiguity left open by the two WARNs in the test run. The test script could see that the generated parser wasn’t reporting extension table contents, but it couldn’t determine whether that was because the pointer was zero or because the parser had silently skipped it. The reference implementation makes the answer explicit: the pointer is zero, the table is absent, and null is the correct output.

The conventional block again confirms the filename: release 52, serial code 871125, parsed as 1987-11-25, no Inform compiler signature.
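
The serial-code-to-date conversion shown in that block is a convention, not a mandate, so a sketch of it should be equally careful. The function name and the century pivot at 70 are my assumptions; the six-digit YYMMDD reading of the serial code is the documented Infocom practice.

```python
from datetime import date

def parse_serial_code(serial: str):
    """Interpret a six-digit serial code as a YYMMDD date, or return None.

    The two-digit year is pivoted: values from 70 upward read as 19xx,
    lower values as 20xx. The pivot is an assumption of this sketch,
    not something the Standard mandates.
    """
    if len(serial) != 6 or not serial.isdigit():
        return None
    yy, mm, dd = int(serial[0:2]), int(serial[2:4]), int(serial[4:6])
    year = 1900 + yy if yy >= 70 else 2000 + yy
    try:
        return date(year, mm, dd)
    except ValueError:
        return None  # all digits, but not a real calendar date
```

Returning `None` rather than raising keeps the conventional block honest: a serial code that isn't a date (some test files use arbitrary strings) simply yields no `serial_code_date`.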

Beyond Zork (Release 57, Version 5)

The Beyond Zork output shares the same Version 5 structure as the Invisiclues run, but several values are now non-zero, and the extension table is present:


{
  "version": 5,
  "scalar_fields": {
    ...
    "terminating_chars_address": 23570,
    ...
    "header_extension_table_address": 23599
  },
  "flags2": {
    "transcripting_on": false,
    "force_fixed_pitch": false,
    "game_wants_pictures": true,
    "game_wants_undo": true,
    "game_wants_mouse": true,
    "game_wants_colours": true,
    "game_wants_sound_effects": false
  },
  "extension_table": {
    "table_length": 2,
    "mouse_x": 0,
    "mouse_y": 0,
    ...
  },
  "conventional": {
    "release_number": 57,
    "serial_code": "871221",
    "serial_code_date": "1987-12-21",
    "inform_compiler_version": null
  }
}

The flags2 block is the most interesting part of this output. Where the Invisiclues file had all game request bits set to false, Beyond Zork has four of them set to true: pictures, undo, mouse, and colours. This is the game declaring its requirements to the interpreter: Beyond Zork was a more ambitious production than the Invisiclues version, and its header reflects that directly. The sound effects bit is false, which is consistent with Beyond Zork's lack of audio features.
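
Those named booleans are just individual bits of the Flags 2 word at header offset 0x10. A small sketch of the decoding, with bit positions taken from Section 11 of the Standard and field names matching the JSON output above (the function and table names are mine):

```python
# Bit positions in the Flags 2 word (header offset 0x10), per Section 11
# of the Z-Machine Standard. Names mirror the JSON keys shown above.
FLAGS2_BITS = {
    "transcripting_on": 0,
    "force_fixed_pitch": 1,
    "game_wants_pictures": 3,
    "game_wants_undo": 4,
    "game_wants_mouse": 5,
    "game_wants_colours": 6,
    "game_wants_sound_effects": 7,
}

def decode_flags2(word: int) -> dict:
    """Expand the Flags 2 word into named booleans."""
    return {name: bool(word & (1 << bit)) for name, bit in FLAGS2_BITS.items()}
```

The Beyond Zork pattern above corresponds to a word with bits 3 through 6 set and everything else clear.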

The extension table is now populated rather than null. The header_extension_table_address is 23599, the reference implementation followed that pointer, and what it found is a table of length 2, meaning the table contains two data words beyond the length word itself. Mouse coordinates are zero, as expected for a file at rest. The unicode translation table address is zero, meaning Beyond Zork uses the default Z-Machine alphabet. This is the complementary case to the Invisiclues WARN: here the pointer is non-zero, the table exists, and the reference implementation read it correctly.

The terminating characters address at 23570 is also non-zero here, where it was zero in the Invisiclues file. Beyond Zork used a custom terminating characters table (part of its more complex input handling) and that address points to it in the story file’s static memory.
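
For completeness, here's a sketch of reading that table. Per the Standard, the terminating characters table is a list of ZSCII codes ending with a zero byte, and a zero address means the game relies on the default terminator (the return key). The function name is mine.

```python
def read_terminating_chars(story: bytes, table_addr: int) -> list:
    """Collect ZSCII codes from a zero-terminated terminating characters table.

    A table address of zero means the game defines no custom terminators.
    """
    if table_addr == 0:
        return []
    chars = []
    pos = table_addr
    while story[pos] != 0:
        chars.append(story[pos])
        pos += 1
    return chars
```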

The conventional block again confirms the filename: release 57, serial 871221, parsed as 1987-12-21.

Zork Zero (Release 393, Version 6)

The Version 6 output introduces several fields that haven’t appeared in any previous run:


{
  "version": 6,
  "scalar_fields": {
    ...
    "initial_pc_packed": 14449,
    ...
    "font_height": 0,
    "font_width": 0,
    "routines_offset": 7462,
    "static_strings_offset": 27740,
    ...
    "header_extension_table_address": 26470
  },
  "flags1": {
    ...
    "colours_available": false,
    "picture_displaying_available": false,
    "sound_effects_available": false
  },
  "flags2": {
    ...
    "screen_redraw_request": false,
    "game_wants_menus": false
  },
  "extension_table": {
    "table_length": 2,
    ...
    "flags3": {
      "game_wants_transparency": false
    }
  },
  "conventional": {
    "release_number": 393,
    "serial_code": "890714",
    "serial_code_date": "1989-07-14",
    "player_username": null,
    "inform_compiler_version": null
  }
}

The initial PC field has changed its key name: where every previous run showed initial_pc, this one shows initial_pc_packed. This is the ontology’s supersession relationship made visible in output. In Versions 1 through 5, offset 0x06 holds a direct byte address for the initial program counter. In Version 6, the same offset holds a packed address of the main routine. This is a different encoding requiring a different multiplier to resolve to a byte address. The reference implementation uses a different key name to make that distinction impossible to miss.

The font fields are present (font_height at 0x26 and font_width at 0x27) with their Version 6 meanings. Both are zero in this file at rest, so the values themselves don’t demonstrate the swap. What demonstrates it is the key ordering: in the Version 5 runs, font_width appeared before font_height in the scalar fields, because 0x26 is width in Version 5. Here, font_height appears before font_width, because 0x26 is height in Version 6. The reference implementation assigned the correct label to each offset, and the output reflects that silently.
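
That key ordering is the observable trace of a version-conditional labeling step, which might look something like this sketch (the function name is hypothetical; the offset meanings are from the Standard):

```python
def font_field_names(version: int) -> dict:
    """Map header offsets 0x26/0x27 to field names, by Z-Machine version.

    In Version 5, offset 0x26 is the font width and 0x27 the height.
    In Version 6 the two meanings swap, which is why the key order
    differs between the Version 5 and Version 6 runs.
    """
    if version == 6:
        return {0x26: "font_height", 0x27: "font_width"}
    return {0x26: "font_width", 0x27: "font_height"}
```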

Version 6 adds two new scalar fields that didn’t appear in Version 5: routines_offset and static_strings_offset. These are used to resolve packed addresses in Version 6 story files, where the packing constants differ from earlier versions. Their presence in this output and absence from all previous runs is the ontology’s version applicability ranges working as intended.

The flags1 block gains two Version 6 bits that were absent from the Version 5 runs: picture_displaying_available and sound_effects_available. Both false here, but present in the structure. The flags2 block similarly gains screen_redraw_request and game_wants_menus. The game request bits follow the same pattern as Beyond Zork: pictures, undo, mouse, and colours all true; sound and menus false.

The extension table’s flags3 block now contains an entry: game_wants_transparency. This is a Version 6 feature tracked in the ontology as a per-bit individual with its own version applicability, and it appears in the output only here because it only applies here. Its value is false (Zork Zero doesn’t request transparency) but its presence in the structure is correct.

The conventional block adds player_username, which is null. This field exists in Version 6 story files by convention, holds a username when set by a running interpreter, and is all zero bytes in a shipped file. The reference implementation represents that correctly as null rather than an empty string or a string of null characters.
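
The all-zero-bytes-means-absent rule is a one-liner worth spelling out, since the tempting alternatives (empty string, string of NULs) are exactly what the reference implementation avoids. A sketch, with a hypothetical function name:

```python
def read_player_username(raw: bytes):
    """Decode the conventional username field, treating all-zero as absent.

    A shipped story file leaves these bytes zeroed; only a running
    interpreter fills them in, so None is the honest value at rest.
    """
    if not any(raw):
        return None
    return raw.rstrip(b"\x00").decode("ascii", errors="replace")
```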

The filename delivers again: release 393, serial 890714, parsed as 1989-07-14.

What This Demonstrates

Running two implementations of the same parser against the same four story files, across three Z-Machine versions, produces a result worth stating plainly: the ontology worked. Not just as a knowledge artifact, but as an active participant in the pipeline: as specification, as oracle, and as consistency check, depending on which stage of the process you were in.

That’s the claim the previous post set up and this one has tried to make good on. An ontology that only validates itself is internally useful but externally inert. The test here was whether it could drive something: a code generation prompt, a test script’s expectations, a reference implementation’s design decisions. Across all three, it held up.

I actually can’t stress enough how proud I was of this moment as I was working through this post!

The generated parser and the reference implementation aren’t formally compared here, and that’s intentional. The point isn’t to declare one better than the other. The point is that having both reveals something neither could reveal alone. Where they agree, confidence increases. Where they differ (and running them side by side against your own files may surface differences this post didn’t) the ontology gives you a third point of reference to arbitrate between them.

A few honest observations about what the testing didn’t fully resolve. The font swap at offsets 0x26 and 0x27 passed its automated check in the Version 6 run, but the test script noted explicitly that verifying the swap requires a manual check: label presence in output doesn’t confirm correct assignment. The extension table WARNs on the Invisiclues file were explained by the reference implementation, but that explanation required a human to connect the two outputs. And the WARN versus FAIL distinction throughout the test results reflects a deliberate choice to be precise about what automated checking can actually confirm, rather than inflating confidence by calling uncertain results either a pass or a failure.

These aren’t defects in the approach. They’re the approach being honest. A test suite that claims more certainty than it has is more dangerous than one that surfaces its own limits. The ontology-driven testing here is valuable precisely because it derives its expectations from a structured artifact rather than from intuition or hardcoded assumptions. But it still has edges, and naming those edges is part of the work.

Next Steps!

I have an idea worth exploring next: take this same pipeline and put an LLM directly into action, actually playing one of those games. It's a bit involved to set up, but you'll see a working implementation in the next post, and hopefully you'll find it a genuinely interesting demonstration of where all of this was heading.

This article was written by Jeff Nyman

Anything I put here is an approximation of the truth. You're getting a particular view of myself ... and it's the view I'm choosing to present to you. If you've never met me before in person, please realize I'm not the same in person as I am in writing. That's because I can only put part of myself down into words. If you have met me before in person then I'd ask you to consider that the view you've formed that way and the view you come to by reading what I say here may, in fact, both be true. I'd advise that you not automatically discard either viewpoint when they conflict or accept either as truth when they agree.