GGML
Sources:
- A brief introduction to GGML by HuggingFace
- A series of in-depth explanations (in Chinese) from Zhihu
- GGML Tutorial github repo
- GGML official repo
GGML
GGML is a deep learning tensor library written in C and C++ with a focus on Transformer inference.
Key concepts (expanded)
ggml_context
See this article for a diagram illustration.
A container that owns tensors and graphs. Think of it as an arena allocator: all tensors created inside share the same context, and freeing the context frees everything.
- With `.no_alloc = true`, tensors only store metadata, not data.
- Later you bind them to real memory via a backend.
- This separation allows you to first build the graph (structure), then decide memory placement.
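To make the two modes concrete, here is a minimal sketch (the arena size and tensor count are arbitrary placeholders):

```c
#include "ggml.h"

// Sketch: two ways to create a context.
void context_example(void) {
    // (a) default: the context arena also holds tensor data
    struct ggml_init_params p1 = {
        .mem_size   = 16 * 1024 * 1024,   // 16 MB arena
        .mem_buffer = NULL,               // let ggml allocate the arena
        .no_alloc   = false,              // tensor data lives inside the arena
    };
    struct ggml_context *ctx_data = ggml_init(p1);

    // (b) metadata-only: tensors get no data; a backend buffer is bound later
    struct ggml_init_params p2 = {
        .mem_size   = ggml_tensor_overhead() * 128,  // room for 128 tensor headers
        .mem_buffer = NULL,
        .no_alloc   = true,
    };
    struct ggml_context *ctx_meta = ggml_init(p2);

    // freeing a context frees every tensor created inside it
    ggml_free(ctx_data);
    ggml_free(ctx_meta);
}
```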
ggml_tensor
Definition: n-dimensional tensor metadata.
```c
struct ggml_tensor {
    int64_t ne[GGML_MAX_DIMS];              // shape: elements per dimension
    size_t  nb[GGML_MAX_DIMS];              // strides in bytes
    enum ggml_op op;                        // producer operation
    struct ggml_tensor * src[GGML_MAX_SRC]; // inputs of op
    struct ggml_backend_buffer * buffer;    // memory block the tensor lives in
    void * data;                            // address inside buffer
    // ...
};
```
- Shape/stride: `ne[]`, `nb[]`
- Placement: `buffer` (memory block) + `data` (address inside it)
- Graph linkage: `op` + `src[]`
Minimal example:
```c
struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsA, rowsA);
struct ggml_tensor *B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsB, rowsB);
```
👉 At this stage, A and B exist as symbols in the graph, but not yet bound to real memory.
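As a sketch of what that metadata looks like, the snippet below creates a small F32 tensor and prints its shape and strides; it assumes a `ggml_context *ctx` created as above (with `no_alloc = true`, `data` stays NULL but `ne[]`/`nb[]` are already filled in):

```c
#include <stdio.h>
#include "ggml.h"

// ne[] holds the shape (ne[0] is the fastest-varying dimension); nb[] holds
// byte strides: for a contiguous F32 tensor nb[0] == sizeof(float) and
// nb[1] == ne[0] * sizeof(float).
void inspect_tensor(struct ggml_context *ctx) {
    struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3); // 4 cols x 3 rows

    printf("shape : %lld x %lld\n", (long long) A->ne[0], (long long) A->ne[1]);
    printf("stride: nb[0]=%zu nb[1]=%zu bytes\n", A->nb[0], A->nb[1]);
    printf("total : %zu bytes\n", ggml_nbytes(A));
    printf("op    : %d (GGML_OP_NONE for a fresh leaf)\n", (int) A->op);
}
```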
ggml_cgraph
Definition: computation graph = nodes + leafs.
```c
struct ggml_cgraph {
    int n_nodes;                 // number of computed tensors
    int n_leafs;                 // number of constant/input tensors
    struct ggml_tensor ** nodes; // intermediate results
    struct ggml_tensor ** leafs; // weights, inputs
    // ...
};
```
Minimal example:
```c
struct ggml_cgraph *gf = ggml_new_graph(ctx);
struct ggml_tensor *C  = ggml_mul_mat(ctx, A, B);   // creates a node tensor
ggml_build_forward_expand(gf, C);                   // collect C and its dependencies
```
- Leaf tensors = model weights, inputs
- Node tensors = intermediate results (activations)
👉 Building the graph is analogous to JAX jit tracing, but done manually.
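A small sketch of inspecting the finalized graph (note: in newer ggml versions `struct ggml_cgraph` is opaque, so you would use accessors such as `ggml_graph_n_nodes()` / `ggml_graph_node()` instead of touching the fields directly):

```c
#include <stdio.h>
#include "ggml.h"

// Build a tiny graph and list what ended up in it.
// Assumes A and B are F32 tensors with compatible shapes for mul_mat.
void describe_graph(struct ggml_context *ctx,
                    struct ggml_tensor *A, struct ggml_tensor *B) {
    struct ggml_cgraph *gf = ggml_new_graph(ctx);
    struct ggml_tensor *C  = ggml_mul_mat(ctx, A, B);
    ggml_build_forward_expand(gf, C);

    printf("leafs: %d, nodes: %d\n", gf->n_leafs, gf->n_nodes);
    for (int i = 0; i < gf->n_nodes; i++) {
        printf("node %d: op = %s\n", i, ggml_op_name(gf->nodes[i]->op));
    }
    // ggml_graph_print(gf);  // built-in dump of the same information
}
```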
ggml_backend
Definition: executor interface. A backend knows how to run ops, but not how to allocate memory by itself.
```c
ggml_backend_t backend = ggml_backend_cpu_init();       // CPU backend
// ggml_backend_t backend = ggml_backend_cuda_init(0);  // or a GPU backend
```
Run graph:
```c
ggml_backend_graph_compute(backend, gf);
```
👉 Multiple backends can coexist (CPU, CUDA, Metal, RPC). Each gets a logical backend id when scheduling.
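A minimal sketch of holding several backend handles at once; the CUDA part assumes a build with CUDA enabled (the `GGML_USE_CUDA` guard and the exact header layout depend on how ggml was built):

```c
#include <stdio.h>
#include "ggml-backend.h"
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"
#endif

// Several executors can be alive at once; each is just a ggml_backend_t handle.
void list_backends(void) {
    ggml_backend_t cpu = ggml_backend_cpu_init();
    printf("backend: %s\n", ggml_backend_name(cpu));

#ifdef GGML_USE_CUDA
    ggml_backend_t gpu = ggml_backend_cuda_init(0);   // device 0
    printf("backend: %s\n", ggml_backend_name(gpu));
    ggml_backend_free(gpu);
#endif

    ggml_backend_free(cpu);
}
```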
ggml_backend_buffer_type
Allocator type for a backend: describes how memory should be allocated on this device.
```c
ggml_backend_buffer_type_t buft = ggml_backend_get_default_buffer_type(backend);
```
- CPU → malloc/free
- CUDA → cudaMalloc/cudaFree
- Metal → Metal buffer alloc
👉 Think of it as a malloc implementation tied to a backend.
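A short sketch of querying a buffer type and allocating through it (the 1024-byte size is an arbitrary placeholder):

```c
#include <stdio.h>
#include "ggml-backend.h"

// A buffer type is the "malloc implementation" of a backend.
void describe_buffer_type(ggml_backend_t backend) {
    ggml_backend_buffer_type_t buft = ggml_backend_get_default_buffer_type(backend);

    printf("buffer type : %s\n", ggml_backend_buft_name(buft));
    printf("alignment   : %zu bytes\n", ggml_backend_buft_get_alignment(buft));

    // allocating through the buffer type uses the device-appropriate allocator
    ggml_backend_buffer_t buf = ggml_backend_buft_alloc_buffer(buft, 1024);
    ggml_backend_buffer_free(buf);
}
```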
ggml_backend_buffer
The actual allocated memory block. All tensor->data pointers ultimately live inside a buffer.
Two common cases:
A. Allocate leaf tensors (inputs/weights)
```c
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
```
B. Manual allocation
```c
size_t size = 1024;
ggml_backend_buffer_t buf = ggml_backend_buft_alloc_buffer(buft, size);
```
👉 Buffers are device-specific: CPU RAM, CUDA VRAM, Metal heap…
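A sketch of case A: after `ggml_backend_alloc_ctx_tensors`, every tensor created in `ctx` points into the returned buffer, whose size can be queried:

```c
#include <stdio.h>
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// After allocation, every tensor in the context points into one backend buffer.
void describe_buffer(struct ggml_context *ctx, ggml_backend_t backend) {
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    printf("buffer size: %zu bytes\n", ggml_backend_buffer_get_size(buf));

    // ggml_backend_buffer_free(buf) releases the device memory;
    // the tensor metadata in ctx stays valid, but its data pointers do not.
}
```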
ggml_gallocr
Definition: Graph allocator = temporary memory manager for node tensors.
```c
ggml_gallocr_t gallocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
ggml_gallocr_alloc_graph(gallocr, gf);   // allocate data for every node tensor in gf
```
- Leafs: allocated directly by backend buffer (long-term, weights/inputs)
- Nodes: allocated per graph execution via gallocr (short-term, activations)
👉 In short:
- leaf → buffer
- node → gallocr → buffer
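Putting the two paths together, here is a hedged sketch of one compute pass: leaves go into a long-lived backend buffer, nodes are allocated by a `ggml_gallocr`, and the graph runs on a single backend (`n`, `a_data`, `b_data` are caller-provided placeholders):

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// Sketch of the leaf-vs-node split: leaves get a long-lived backend buffer,
// nodes are handled by a graph allocator.
void run_once(ggml_backend_t backend, int n, const float *a_data, const float *b_data) {
    struct ggml_init_params params = {
        .mem_size   = ggml_tensor_overhead() * 64 + ggml_graph_overhead(),
        .mem_buffer = NULL,
        .no_alloc   = true,
    };
    struct ggml_context *ctx = ggml_init(params);

    // leafs: weights / inputs -> backend buffer
    struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
    struct ggml_tensor *B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
    ggml_backend_buffer_t weights = ggml_backend_alloc_ctx_tensors(ctx, backend);
    ggml_backend_tensor_set(A, a_data, 0, ggml_nbytes(A));   // caller-provided data
    ggml_backend_tensor_set(B, b_data, 0, ggml_nbytes(B));

    // nodes: intermediate results -> gallocr
    struct ggml_cgraph *gf = ggml_new_graph(ctx);
    struct ggml_tensor *C  = ggml_mul_mat(ctx, A, B);
    ggml_build_forward_expand(gf, C);

    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);          // allocates C (and any other nodes)

    ggml_backend_graph_compute(backend, gf);

    ggml_gallocr_free(galloc);
    ggml_backend_buffer_free(weights);
    ggml_free(ctx);
}
```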
ggml_backend_sched
Scheduler for multi-backend execution.
```c
ggml_backend_t backends[2] = { cuda_backend, cpu_backend };  // e.g. from ggml_backend_cuda_init(0) / ggml_backend_cpu_init()
ggml_backend_sched_t sched = ggml_backend_sched_new(
        backends, /*bufts=*/NULL, /*n_backends=*/2,
        /*graph_size=*/GGML_DEFAULT_GRAPH_SIZE, /*parallel=*/false);
```
Steps:
- Traverse graph, assign backend id to each tensor
- Split into subgraphs per backend
- Insert data copies across backends if needed
Why not just use the buffer? Although ggml_tensor has a `buffer` pointer, the buffer only tells you where the data sits, not who executes the op.
- The same CUDA buffer might be used by multiple streams (different executors).
- A CPU buffer might be computed on locally or remotely via RPC.
Therefore, we need backend id as routing tag: which executor runs this op.
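A minimal sketch of driving a graph through the scheduler; `gpu` and `cpu` are backend handles created elsewhere, and the exact `ggml_backend_sched_new` parameter list varies slightly across ggml versions, so treat the call below as illustrative:

```c
#include "ggml.h"
#include "ggml-backend.h"

// Schedule one graph across a GPU backend and a CPU backend.
// gpu/cpu: backend handles created elsewhere (e.g. ggml_backend_cuda_init / ggml_backend_cpu_init).
void compute_with_sched(ggml_backend_t gpu, ggml_backend_t cpu,
                        struct ggml_cgraph *gf) {
    // order matters: earlier backends are preferred when both can run an op
    ggml_backend_t backends[2] = { gpu, cpu };

    ggml_backend_sched_t sched = ggml_backend_sched_new(
            backends, /*bufts=*/NULL, /*n_backends=*/2,
            /*graph_size=*/GGML_DEFAULT_GRAPH_SIZE, /*parallel=*/false);

    // splits the graph, allocates node buffers per split, runs each split,
    // and inserts cross-backend copies where needed
    ggml_backend_sched_graph_compute(sched, gf);

    ggml_backend_sched_free(sched);
}
```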
Step 1. How GGML creates and loads tensors
The first step is to create tensors (model weights or inputs) and bind them to real memory.
```
[ggml_context]  -- ggml_new_tensor_* -->  tensors (metadata only)
       |
       |  ggml_backend_alloc_ctx_tensors(ctx, backend)
       v
[ggml_backend_buffer]  -->  tensor->data points into real device memory
```
1. Create tensors in a context
```c
struct ggml_init_params params = {
    .mem_size   = ggml_tensor_overhead() * 2,   // room for tensor metadata only
    .mem_buffer = NULL,
    .no_alloc   = true,                         // do not allocate tensor data
};
struct ggml_context *ctx = ggml_init(params);

struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsA, rowsA);
struct ggml_tensor *B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsB, rowsB);
```
At this point:
- `A->data == NULL`
- `A->buffer == NULL`
- only metadata (shape/stride/op) is stored
2. Allocate backend memory for leaf tensors
```c
ggml_backend_t backend = ggml_backend_cpu_init();   // or CUDA/Metal
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
```
Now:
- `A->buffer = buf`
- `A->data != NULL` (points inside `buf`)
- same for `B`
👉 This is the step that turns "symbolic tensors" into real memory-backed tensors.
3. Load data into tensors
```c
float host_A[rowsA*colsA] = { ... };
ggml_backend_tensor_set(A, host_A, 0, ggml_nbytes(A));   // host -> backend buffer
```
This copies from host memory into the backend buffer (CPU RAM or GPU VRAM).
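The three sub-steps can be combined into one small, self-contained program (CPU backend and a 2×2 tensor chosen arbitrarily for illustration):

```c
#include <stdio.h>
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// End-to-end sketch of Step 1 on the CPU backend: create metadata-only tensors,
// bind them to a backend buffer, upload data, and read it back to verify.
int main(void) {
    struct ggml_init_params params = {
        .mem_size   = ggml_tensor_overhead() * 8,
        .mem_buffer = NULL,
        .no_alloc   = true,                 // step 1: metadata only
    };
    struct ggml_context *ctx = ggml_init(params);

    struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 2);

    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);  // step 2: bind memory

    const float host_A[4] = { 1, 2, 3, 4 };
    ggml_backend_tensor_set(A, host_A, 0, ggml_nbytes(A));                     // step 3: upload

    float check[4];
    ggml_backend_tensor_get(A, check, 0, sizeof(check));                       // download back
    printf("A[0] = %.1f, A[3] = %.1f\n", check[0], check[3]);

    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}
```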
Step 2. How GGML builds computation graphs
GGML is similar to JAX in that both are static graph frameworks. The key difference is when and how the computation graph is built:
- JAX: you write a Python function, and at the first `jit` call, JAX traces the Python operations to construct a computation graph (HLO).
- GGML: you explicitly call C APIs (`ggml_mul_mat`, `ggml_add`, …) at runtime, and these calls directly create `ggml_tensor` objects that represent nodes. Later, you invoke `ggml_build_forward_expand` to traverse dependencies and finalize the graph.
So in GGML the graph is still built at runtime, but unlike PyTorch eager mode it is built once and reused (not rebuilt on every forward).
1. Computation graph (ggml_cgraph)
Each graph records tensors that participate in evaluation.
```c
struct ggml_cgraph {
    int n_nodes;
    int n_leafs;
    struct ggml_tensor ** nodes; // activations, op results
    struct ggml_tensor ** leafs; // weights, inputs
    // ...
};
```
- leafs = constant tensors (model weights, inputs)
- nodes = activations, results of operations
The GGML cgraph is a graph of tensors: every node and every leaf is itself a `ggml_tensor`. Suppose our model is
```
y = relu(matmul(x, W) + b)
```
Then the GGML cgraph structure is
```
(x) ──┐
      ├─> [matmul_out] ──┐
(W) ──┘                  ├─> [add_out] ──> [relu_out] = y
(b) ─────────────────────┘
```
Each intermediate tensor (matmul_out, add_out, relu_out) is a node, which records how it is calculated (the operation and the pointers to its inputs).
2. Tensor metadata (ggml_tensor)
Each tensor already knows:
- its shape (`ne[]`)
- strides (`nb[]`)
- producer operation (`op`)
- input tensors (`src[]`)
```c
struct ggml_tensor {
    int64_t ne[GGML_MAX_DIMS];              // shape
    size_t  nb[GGML_MAX_DIMS];              // strides (bytes)
    enum ggml_op op;                        // producer operation
    struct ggml_tensor * src[GGML_MAX_SRC]; // input tensors
    // ...
};
```
Thus, when the graph is finalized, most information (shape, data type, dependencies) is already available.
3. Build the forward graph
Graphs are finalized using ggml_build_forward_expand:
```c
void ggml_build_forward_expand(struct ggml_cgraph *cgraph,
                               struct ggml_tensor *tensor);
```
- `ggml_visit_parents`: depth-first search (DFS) through `tensor->src[]`; it adds unseen parents to `leafs` or `nodes`, guaranteeing topological order.
- The final graph is a static DAG of tensors.
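For intuition, here is a simplified sketch of that DFS. It is not the actual ggml source: the real implementation uses a hash set for visited tensors, and in newer versions `struct ggml_cgraph` is not accessed directly like this. `seen`, `n_seen`, and `seen_before` are illustrative stand-ins:

```c
#include <stdbool.h>
#include "ggml.h"

// Illustrative stand-in for ggml's visited hash set.
static bool seen_before(struct ggml_tensor **seen, int n_seen, struct ggml_tensor *t) {
    for (int i = 0; i < n_seen; i++) if (seen[i] == t) return true;
    return false;
}

// Simplified sketch of the DFS performed by ggml_build_forward_expand /
// ggml_visit_parents: inputs are visited before the tensors that consume them,
// which yields a topological order.
static void visit(struct ggml_cgraph *g, struct ggml_tensor *t,
                  struct ggml_tensor **seen, int *n_seen) {
    if (t == NULL || seen_before(seen, *n_seen, t)) return;
    seen[(*n_seen)++] = t;

    for (int i = 0; i < GGML_MAX_SRC; i++) {
        visit(g, t->src[i], seen, n_seen);     // parents first
    }

    if (t->op == GGML_OP_NONE) {
        g->leafs[g->n_leafs++] = t;            // constants / inputs
    } else {
        g->nodes[g->n_nodes++] = t;            // computed tensors
    }
}
```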
4. Minimal example
```c
struct ggml_cgraph *gf = ggml_new_graph(ctx);

struct ggml_tensor *matmul_out = ggml_mul_mat(ctx, W, x);
struct ggml_tensor *add_out    = ggml_add(ctx, matmul_out, b);
struct ggml_tensor *relu_out   = ggml_relu(ctx, add_out);

ggml_build_forward_expand(gf, relu_out);   // DFS from the output collects leafs + nodes
```
👉 GGML vs JAX graph building
| Framework | When graph is built | How graph is built | User perspective |
|---|---|---|---|
| PyTorch eager | Every forward | Dynamic operator recording | Immediate execution |
| JAX | At first `jit` call | Trace Python ops → build HLO graph | User writes Python, tracing is automatic |
| GGML | Runtime, once | Explicit C API calls build tensors + DFS to collect graph | User manually builds the graph |
Step 3. How GGML schedules graph execution
After building the computation graph, GGML must:
- Allocate memory for intermediate nodes
- Partition the graph into per-backend subgraphs
- Assign backends and insert data copies when needed
- Execute subgraphs in dependency order
This is the job of the scheduler (ggml_backend_sched).
1. Entry point: ggml_backend_sched_graph_compute
```c
enum ggml_status ggml_backend_sched_graph_compute(
        ggml_backend_sched_t   sched,
        struct ggml_cgraph   * graph);
```
- Calls the async version to launch computation
- Then synchronizes (waits for completion)
2. Async execution and allocation
```c
enum ggml_status ggml_backend_sched_graph_compute_async(
        ggml_backend_sched_t   sched,
        struct ggml_cgraph   * graph);
```
- If memory is not allocated yet → call `ggml_backend_sched_alloc_graph`
- Then run `ggml_backend_sched_compute_splits` to execute each subgraph
3. Graph allocation: ggml_backend_sched_alloc_graph
```c
bool ggml_backend_sched_alloc_graph(
        ggml_backend_sched_t   sched,
        struct ggml_cgraph   * graph);
```
Two core tasks:
- Split the graph (`ggml_backend_sched_split_graph`)
  - Traverse the graph in topological order
  - Group ops into subgraphs by backend id
  - Insert copy ops automatically across backends (CPU ↔ CUDA, CUDA ↔ Metal, etc.)
- Allocate memory (`ggml_backend_sched_alloc_splits`)
  - For each subgraph, create a `ggml_gallocr` tied to the backend's buffer type
  - Assign buffers to all intermediate node tensors
  - Leafs remain unchanged (already allocated earlier)
4. Execution: ggml_backend_sched_compute_splits
After allocation, the scheduler executes each subgraph in order:
- Calls the correct backend (CPU, CUDA, …) for each split
- Ensures cross-device copies are completed before dependent ops run
- Supports multiple model copies (`sched->cur_copy` / `next_copy`) for parallelism
5. Worked example
Suppose:
```
C = A @ B + D
```
- `A @ B` (matmul) should run on CUDA
- `+ D` (add) should run on CPU
Scheduler steps:
- Split graph
```
Split 1 (CUDA):  T = A @ B
   └─ copy T: VRAM -> host RAM
Split 2 (CPU):   C = T + D
```
- Allocate memory
- CUDA split → VRAM buffers for the matmul output
- CPU split → host RAM for the addition output
- Execute
- Run CUDA matmul kernel
- Copy result back to CPU buffer
- Run CPU addition kernel
👉 Summary
- Leaf tensors → allocated directly via a backend buffer
- Node tensors → allocated by the scheduler (`gallocr`) per subgraph
- Scheduler workflow:
  - Split the graph into subgraphs (`ggml_backend_sched_split_graph`)
  - Allocate buffers (`ggml_backend_sched_alloc_splits`)
  - Execute each subgraph (`ggml_backend_sched_compute_splits`)
- backend id ensures the correct executor is chosen, even if multiple backends share the same memory.
👉 Compared to Step 1 (load tensors) and Step 2 (build graph), Step 3 is where the graph is materialized into memory and mapped to devices for execution.