Field note

Local Image Generation with Bonsai Image 4B

May 30, 2026

[Image: locally generated smiling koala portrait]

I am watching the current development of AI models with growing fascination and, honestly, a bit of concern. The best frontier models are becoming ridiculously good. Tools built on top of Opus-class models or GPT-5.5, which I use daily through Codex, are no longer "interesting demos" for me. They are part of the normal engineering loop: I plan with them, implement with them, debug with them, review with them, and increasingly treat them as the default interface for serious work. The models are only part of the story, because the surrounding capabilities are improving just as quickly. Computer Use is a good example. It deserves a dedicated post, and I will probably write one soon, because it changes the way agents interact with real applications rather than only repositories and APIs.

At the same time, the cost side is getting harder to ignore. For a long time, I advised people not to spend too much energy on local model setups. The quality gap was obvious. If you cared about output quality, tool reliability, reasoning depth, and day-to-day productivity, the answer was simple: use the best hosted models and focus on workflow, not local infrastructure. I still think that quality gap is real, but the economics are changing the conversation. Companies are starting to see serious AI bills. With API-key billing, the dollars can disappear very quickly, especially when agents are exploring, retrying, reading large contexts, generating plans, and iterating through implementation. If you use AI heavily enough, cost stops being an abstract concern and becomes part of the architecture.

That made me revisit an older workflow idea. Maybe the future is not "local models replace frontier models." Maybe the practical future is hybrid. A strong hosted model reads broad project context, reasons through the hard parts, and produces a plan. Then a cheaper model, possibly local, implements well-scoped pieces of that plan, runs checks, and reports back. In other words: use expensive intelligence where it matters most, and cheaper execution where the work is narrower. I do not know yet if that is the right answer, but it is worth testing.

So I bought a machine that makes local experimentation realistic: a MacBook Pro with an Apple M5 Pro chip, 18 CPU cores, a 20-core Apple GPU, Metal 4 support, and 48 GB of RAM. It is not a dedicated GPU workstation, but it is a strong laptop, and that is exactly the point. I want to understand what becomes practical on hardware that can still sit on a desk, travel with me, and participate in normal software work.

Local Image Generation with Bonsai

For the first experiment, I started with images, partly because image generation has a brutally honest feedback loop. A command can succeed, a PNG can be written, metadata can look correct, and the result can still be ugly or useless. You have to look at it. That makes it a nice testing ground for local AI: it forces you to separate "the pipeline ran" from "the output is good enough."

That is why Bonsai Image 4B caught my attention.

I ran into it through PrismML's launch post, and it looked like exactly the kind of model worth trying first: small enough to be relevant for local devices, but still ambitious enough not to feel like a toy. PrismML describes Bonsai Image 4B as a compact family of image-generation models designed for local hardware. The model comes in 1-bit and ternary variants. The 1-bit version is the maximum-compression path; the ternary version keeps a little more representational flexibility and is positioned as the better quality/prompt-fidelity trade-off. According to PrismML, the ternary diffusion transformer is 1.21 GB, compared with 7.75 GB for the full-precision FLUX.2 Klein 4B transformer. Their post also frames local generation in the way that interests me most: once a model fits on the device, every iteration no longer has to be a remote request with round-trip latency and marginal serving cost.

So I decided to try it properly, and I used Codex to drive the setup. This was another case where a top model in a coding agent, in my case GPT-5.5 inside Codex, was extremely useful. I did not just ask it for a theoretical recipe. I let the agent work through the setup like an engineer would: inspect the available options, install dependencies, run commands, look at errors, adjust the plan, and keep iterating until there was a working path. The first route was the official-looking Apple Silicon path through MLX. That was the natural thing to try on a new Mac. It required the usual native-tooling work: Xcode, Python environment details, package installation, model files, and enough command-line glue to actually run an image generation pipeline.

That path technically ran, but it failed the only check that really matters for image generation: the output was basically gray-brown noise. This is where agentic coding is useful, but also where human judgement still matters. Codex could get the commands to execute and reason through likely causes, but I still had to look at the generated image and say: no, this is not a successful result.

The working path came from changing the deployment route rather than endlessly polishing the broken one. I switched to the unpacked Diffusers version of the ternary Bonsai model and ran it through PyTorch on MPS. That meant accepting a larger local model directory, roughly 15 GB on disk, instead of the elegant compressed Apple Silicon payload described in the announcement. But it produced real images. From there Codex helped wrap the working pipeline into a reusable CLI, expose it as bonsai-generate, save generation metadata next to each image, and finally turn the workflow into a Codex skill so I can call it from any project later. That is the practical story behind this post: not just "I tried a local image model," but "a coding agent helped me fight through the local setup until the model became a normal command-line tool."
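
For readers who want to try the same PyTorch route, here is a minimal sketch of the loading step. It is not the script Codex produced. It assumes the unpacked model directory follows the standard Diffusers layout (a model_index.json at its root) and that the generic DiffusionPipeline loader can open it; the helper names pick_device, looks_like_diffusers_dir, and load_pipeline are hypothetical.

```python
from pathlib import Path


def pick_device() -> str:
    """Prefer Apple's MPS backend, then CUDA, then CPU."""
    try:
        import torch  # optional here so the helper also runs without PyTorch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"


def looks_like_diffusers_dir(model_dir: Path) -> bool:
    """Diffusers pipelines keep a model_index.json at the directory root."""
    return (model_dir / "model_index.json").is_file()


def load_pipeline(model_dir: Path):
    """Assumption: the generic Diffusers loader can open the unpacked model."""
    from diffusers import DiffusionPipeline  # imported lazily; heavy dependency

    if not looks_like_diffusers_dir(model_dir):
        raise FileNotFoundError(f"no model_index.json in {model_dir}")
    pipe = DiffusionPipeline.from_pretrained(str(model_dir))
    return pipe.to(pick_device())
```

Checking the directory before loading gives a clearer error than a failed load when the model path is wrong, which matters once the command is called from arbitrary project directories.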

CLI and Skill Workflow

What I wanted from this experiment was a local workflow: a command that works from any project directory, accepts a prompt and output path, saves metadata, and is simple enough for Codex to call without knowing the internal Python implementation. Thanks to Codex, I ended up with a small CLI tool wrapped in a reusable AI skill.

The CLI supports the knobs I actually care about: prompt, size, seed, output file, output directory, and optional reference image. For example, for a slide deck I could generate a neutral section divider like this:

PROMPT=$(
  cat <<'PROMPT'
Minimal cinematic illustration of a local AI model running inside a laptop,
soft dark background, subtle blue and amber highlights, no text, no logo,
generous empty space for slide title
PROMPT
)

bonsai-generate \
  --prompt "$PROMPT" \
  --size 1792x1024 \
  --seed 7101 \
  --output ./slides/assets/local-ai-divider.png
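
The flags in the command above could be parsed along the following lines. This is a sketch of the command contract, not the actual bonsai-generate source: the flag names come from this post, while the 512x512 default, the mutual exclusion of --output and --output-dir, and the helper names parse_size and build_parser are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def parse_size(value: str) -> tuple[int, int]:
    """Turn 'WIDTHxHEIGHT' (for example '1792x1024') into a pair of ints."""
    width, sep, height = value.lower().partition("x")
    if not sep or not width.isdecimal() or not height.isdecimal():
        raise argparse.ArgumentTypeError(f"expected WIDTHxHEIGHT, got {value!r}")
    return int(width), int(height)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="bonsai-generate")
    parser.add_argument("--prompt", required=True)
    # Assumed default: the preview size recommended for quick iterations.
    parser.add_argument("--size", type=parse_size, default=(512, 512))
    parser.add_argument("--seed", type=int, default=None)
    # Two spellings for the same optional conditioning image.
    parser.add_argument(
        "--image", "--reference-image", dest="reference_image", type=Path
    )
    parser.add_argument("--reference-size", type=int, default=None)
    # Assumption: an exact file path or a target folder, never both.
    target = parser.add_mutually_exclusive_group()
    target.add_argument("--output", type=Path)
    target.add_argument("--output-dir", type=Path)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Validating the size string in the parser means a typo such as 1792*1024 fails immediately, before any model load time is spent.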

The metadata is saved as JSON next to the output, which is surprisingly useful. I can inspect the prompt, seed, size, device, model directory, load time, and generation time later. That turns image generation from a one-off manual action into something closer to a repeatable build artefact.
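
A sidecar file of this kind can be written with a few lines of standard-library Python. The sketch below is illustrative only: the field names, the GenerationRecord class, and the naming scheme (same file stem with a .json suffix) are assumptions, not the actual format my CLI writes.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class GenerationRecord:
    """Hypothetical record of one generation run."""

    prompt: str
    seed: Optional[int]
    width: int
    height: int
    device: str
    model_dir: str
    load_seconds: float
    generation_seconds: float


def sidecar_path(image_path: Path) -> Path:
    """Place the JSON next to the image: koala.png -> koala.json."""
    return image_path.with_suffix(".json")


def write_sidecar(image_path: Path, record: GenerationRecord) -> Path:
    """Serialize the record as indented JSON beside the image file."""
    target = sidecar_path(image_path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(asdict(record), indent=2), encoding="utf-8")
    return target
```

Storing the seed and size together with the prompt is what makes a render reproducible later: the same three values can be fed straight back into the command.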

The hero image at the top of this post came from the same workflow, just with a more playful prompt:

PROMPT=$(
  cat <<'PROMPT'
Impressive cheerful koala portrait, ultra realistic wildlife photography with
a slightly whimsical expression, koala smiling gently while holding a eucalyptus
branch, soft golden backlight, detailed fur, bright natural eyes, lush green
bokeh forest background, premium editorial animal portrait, no text, no logo
PROMPT
)

bonsai-generate \
  --prompt "$PROMPT" \
  --size 1536x1536 \
  --seed 8101 \
  --output ./public/images/blog/bonsai-hero-smiling-koala-1536.png

That command shape also makes the tool agent-friendly. A coding agent does not need to know the internal Python script; it only needs the command contract. This is why I added a global Codex skill. The skill is not complicated. It tells Codex when to use the generator, where the shared model lives, which sizes make sense, how to pass a prompt, and where to put outputs. I wrote more about the broader idea in AI Testing Skills: a skill is procedural memory, not just documentation.

📘 local-image-generation skill

---
name: local-image-generation
description: >
  Use when the user asks to generate raster images locally with the Bonsai
  generator, especially from any project directory, with configurable prompt,
  resolution, seed, optional reference image, optional reference size, and
  output path.
---

# Local Image Generation

Use the global CLI:

```bash
bonsai-generate \
  --prompt "..." \
  --size 512x512 \
  --output-dir ./generated-images
```

The CLI uses the shared model at:

```text
~/Models/bonsai-image-4B-ternary-unpacked/
```

## Workflow

1. Choose an output location in the current project unless the user gives one.
2. Use `512x512` for quick previews unless the user requests another size.
3. Use `1024x1024`, `832x1248`, or `1248x832` for final outputs.
4. Pass the prompt via `--prompt`.
5. Pass the resolution via `--size`.
6. Use `--image` or `--reference-image` for optional conditioning.
7. Use `--reference-size` when the reference dominates the prompt.
8. Pass `--output` for an exact PNG path or `--output-dir` for a folder.
9. Include `--seed` when the user wants reproducibility.
10. After generation, report the output path and mention the metadata file.

## Notes

- First run in a process pays model load time, usually 20-30 seconds.
- Generation after load is roughly 4 seconds for `512x512`.
- Generation after load is roughly 20 seconds for `1024x1024`.
- `--image` and `--reference-image` are aliases.
- `--reference-size` sets the maximum side of the conditioning image.
- Do not copy the model into projects. Use the shared model path.

In practice, I now create skills regularly. When I notice that I have invoked the same command four or five times, that is usually a sign that the workflow should stop living only in my head or shell history. A short skill gives Codex the local convention and lets me reuse it from another repository later:

$local-image-generation
Generate a 512x512 product-style image for this blog post.

Quality Assessment

My initial quality impression is cautiously positive, but with a very specific use case in mind. I am not trying to replace top hosted image models for everything. I am trying to understand whether a local model can give me useful visual material for blog posts, slide decks, workshops, prototypes, and internal notes. For that job, Bonsai Image 4B is much more useful than I expected.

For reference, a fresh process usually pays around 20-30 seconds of model load time on my machine. Smaller 512x512 previews are fast enough for prompt exploration, while larger 1024x1024, 1536x1536, and wide renders are better candidates for final assets. The 1536x1536 koala image used at the top of this post took 31.21 seconds to load the model and 85.83 seconds to generate. That is not instant, but it is acceptable for an asset workflow where I generate a few candidates, pick one, and move on.
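
To put those timings in perspective, the arithmetic can be written out as a small sketch. The function names are hypothetical; the inputs are the figures quoted in this post, and the 512x512 and 1024x1024 values are the rough estimates from the skill notes rather than measured runs.

```python
def total_seconds(load_s: float, generation_s: float) -> float:
    """Wall-clock cost of a cold run: model load plus generation."""
    return load_s + generation_s


def megapixels_per_second(width: int, height: int, generation_s: float) -> float:
    """Generation throughput, ignoring model load time."""
    return (width * height) / 1_000_000 / generation_s


# Figures quoted in this post for the 1536x1536 koala render.
koala_total = total_seconds(31.21, 85.83)
koala_rate = megapixels_per_second(1536, 1536, 85.83)

# Rough estimates from the skill notes, not measured runs.
preview_rate = megapixels_per_second(512, 512, 4.0)
final_rate = megapixels_per_second(1024, 1024, 20.0)

print(f"cold koala run: {koala_total:.2f} s")
print(f"throughput 512: {preview_rate:.4f} MP/s")
print(f"throughput 1024: {final_rate:.4f} MP/s")
print(f"throughput 1536: {koala_rate:.4f} MP/s")
```

A cold koala run therefore costs just under two minutes (117.04 seconds) end to end. On these numbers, throughput falls as resolution grows, so extrapolating linearly from preview times would underestimate the cost of large renders.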

The model works best when the task is forgiving but still visually rich: nature scenes, interiors, architecture, product-style shots, macro details, abstract concepts, glowing objects, atmospheric backgrounds, and playful illustrations. That maps well to my actual needs. I often need an image that sets a tone, fills a slide, illustrates a blog section, or gives a prototype a little more life. In those cases I do not need perfect photorealistic human anatomy. I need a good enough visual direction that I can generate locally and reuse later.

I would not use it as my first choice for generating people. Sometimes the results are fine, but this is where the quality gap to top models is obvious. The usual problems appear: eyes can look slightly off, fingers can become strange, faces can feel almost right but not quite there. If the image depends on a believable person, hands, expression, or detailed anatomy, I would still reach for a stronger hosted model or expect a lot more manual selection.

Showcase Gallery

Below are examples generated with this local Bonsai workflow. I deliberately kept them varied: product-style images, architectural scenes, macro shots, interiors, landscapes, and more playful concepts. I think the easiest way to judge the model is simply to browse the outputs, inspect the prompts, and decide where the quality is already good enough for real work.

Alpine Glass Library

1792x1344 · generation 79.35 s · load 29.60 s · seed 2201

Award-winning architectural visualization of a remote glass alpine library built into snowy cliffs at blue hour, warm amber reading rooms visible through transparent facades, pine trees in foreground, dramatic mountain silhouettes, crisp snow texture, realistic global illumination, elegant composition, ultra detailed, no people, no text, no logo

