GGML

Sources:

  • A brief introduction to GGML by HuggingFace
  • A series of in-depth explanations (in Chinese) from Zhihu
  • GGML Tutorial GitHub repo
  • GGML official repo

GGML

GGML is a deep learning tensor library written in C and C++ with a focus on Transformer inference.

Key concepts (expanded)

ggml_context

See this article for a diagram illustration.

A container that owns tensors and graphs. Think of it as an arena allocator: all tensors created inside share the same context, and freeing the context frees everything.

#include "ggml.h"

struct ggml_init_params params = {
    .mem_size   = ggml_tensor_overhead() * num_tensors, // metadata-only arena
    .mem_buffer = NULL,
    .no_alloc   = true, // do NOT allocate tensor->data now
};

struct ggml_context *ctx = ggml_init(params);
  • With .no_alloc = true, tensors only store metadata, not data.
  • Later you bind them to real memory via a backend.
  • This separation allows you to first build the graph (structure), then decide memory placement.
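For sizing the metadata arena, here is a minimal sketch of a common pattern, assuming the same context will also hold one computation graph (num_tensors is just an illustrative count, not an API requirement):

// metadata arena sized for both tensor descriptors and one graph (illustrative)
const int num_tensors = 16;

size_t mem_size = ggml_tensor_overhead() * num_tensors // tensor descriptors
                + ggml_graph_overhead();               // one default-sized ggml_cgraph

struct ggml_init_params params = {
    .mem_size   = mem_size,
    .mem_buffer = NULL,
    .no_alloc   = true,
};
struct ggml_context *ctx = ggml_init(params);
// ... create tensors, build the graph, run ...
ggml_free(ctx); // releases every tensor/graph descriptor created in ctx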

ggml_tensor

Definition: n-dimensional tensor metadata.

struct ggml_tensor {
    enum ggml_type type;                    // dtype: F32, F16, Q4, …
    struct ggml_backend_buffer *buffer;     // owning memory block
    int64_t ne[GGML_MAX_DIMS];              // shape (#elements per dim)
    size_t  nb[GGML_MAX_DIMS];              // stride in bytes
    enum ggml_op op;                        // producer operation
    struct ggml_tensor *src[GGML_MAX_SRC];  // inputs
    void *data;                             // pointer inside buffer
    // ...
};
  • Shape/stride: ne[], nb[]
  • Placement: buffer (memory block) + data (address inside)
  • Graph linkage: op + src[]
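To make the ne[]/nb[] relationship concrete, here is an illustrative sketch for a contiguous, non-quantized F32 tensor (the helper function is not part of the ggml API):

// For a 2D F32 tensor created as ggml_new_tensor_2d(ctx, GGML_TYPE_F32, ne0, ne1):
//   ne[0] = ne0, ne[1] = ne1
//   nb[0] = sizeof(float)     // byte step between neighbors along dim 0
//   nb[1] = nb[0] * ne[0]     // byte step between "rows"
static float get_f32_2d(const struct ggml_tensor *t, int64_t i0, int64_t i1) {
    // assumes t->type == GGML_TYPE_F32 and t->data has been bound to a buffer
    const char *base = (const char *) t->data;
    return *(const float *) (base + i0*t->nb[0] + i1*t->nb[1]);
}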

Minimal example:

struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsA, rowsA);
struct ggml_tensor *B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsB, rowsB);
// with no_alloc = true → A->data == NULL

👉 At this stage, A and B exist as symbols in the graph, but are not yet bound to real memory.


ggml_cgraph

Definition: computation graph = nodes + leafs.

struct ggml_cgraph {
    int n_nodes, n_leafs;
    struct ggml_tensor **nodes; // activations (values change)
    struct ggml_tensor **leafs; // inputs & weights (constant)
    // ...
};

Minimal example:

struct ggml_cgraph *gf = ggml_new_graph(ctx);
struct ggml_tensor *C = ggml_mul_mat(ctx, A, B); // define op
ggml_build_forward_expand(gf, C);                // DFS, topological order
  • Leaf tensors = model weights, inputs
  • Node tensors = intermediate results (activations)

👉 Building the graph = JAX jit tracing, but done manually.


ggml_backend

Definition: executor interface. A backend knows how to run ops, but not how to allocate memory by itself.

#include "ggml-backend.h"
#include "ggml-cpu.h"
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"
#endif

ggml_backend_t backend = NULL;
#ifdef GGML_USE_CUDA
backend = ggml_backend_cuda_init(0); // try GPU first
#endif
if (!backend) backend = ggml_backend_cpu_init();

Run graph:

ggml_backend_graph_compute(backend, gf);

👉 Multiple backends can coexist (CPU, CUDA, Metal, RPC). Each gets a logical backend id when scheduling.
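A small usage sketch (assuming a CPU-only build; ggml_backend_name and ggml_backend_cpu_set_n_threads come from the backend headers included above):

#include <stdio.h>

ggml_backend_t cpu = ggml_backend_cpu_init();
printf("backend: %s\n", ggml_backend_name(cpu)); // e.g. "CPU"
ggml_backend_cpu_set_n_threads(cpu, 8);          // backend-specific option

// ... ggml_backend_graph_compute(cpu, gf); ...

ggml_backend_free(cpu);                          // backends are released explicitly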


ggml_backend_buffer_type

Allocator type for a backend: describes how memory should be allocated on this device.

ggml_backend_buffer_type_t buft =
    ggml_backend_get_default_buffer_type(backend);
  • CPU → malloc/free
  • CUDA → cudaMalloc/cudaFree
  • Metal → Metal buffer allocation

👉 Think of it as a malloc implementation tied to a backend.
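A hedged sketch of inspecting a buffer type; the ggml_backend_buft_* query helpers below are assumed from recent ggml-backend.h and may shift between versions:

#include <stdio.h>

ggml_backend_buffer_type_t buft = ggml_backend_get_default_buffer_type(backend);

printf("buffer type : %s\n",  ggml_backend_buft_name(buft));              // e.g. "CPU", "CUDA0"
printf("alignment   : %zu\n", ggml_backend_buft_get_alignment(buft));     // required alignment
printf("bytes for A : %zu\n", ggml_backend_buft_get_alloc_size(buft, A)); // padded size for tensor A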


ggml_backend_buffer

The actual allocated memory block. All tensor->data pointers ultimately live inside a buffer.

Two common cases:

A. Allocate leaf tensors (inputs/weights)

ggml_backend_buffer_t buf =
    ggml_backend_alloc_ctx_tensors(ctx, backend);

// now leaf tensors: A->data != NULL, A->buffer == buf
ggml_backend_tensor_set(A, host_A, 0, ggml_nbytes(A));

B. Manual allocation

size_t size = 1024;
ggml_backend_buffer_t buf =
    ggml_backend_buft_alloc_buffer(buft, size);
void *ptr = ggml_backend_buffer_get_base(buf);

👉 Buffers are device-specific: CPU RAM, CUDA VRAM, Metal heap, …
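A short sketch of managing a buffer object directly (size query, zero-fill, release; assumes buf was allocated as in case A or B above):

#include <stdio.h>

printf("buffer size: %zu bytes\n", ggml_backend_buffer_get_size(buf)); // total capacity
ggml_backend_buffer_clear(buf, 0);   // device-side memset to 0
// ... use the tensors living in buf ...
ggml_backend_buffer_free(buf);       // releases the device memory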


ggml_gallocr

Definition: Graph allocator = temporary memory manager for node tensors.

ggml_gallocr_t gallocr =
    ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));

ggml_gallocr_reserve(gallocr, gf);     // optional: estimate and reserve up front
ggml_gallocr_alloc_graph(gallocr, gf); // assign memory to node tensors
  • Leafs: allocated directly by backend buffer (long-term, weights/inputs)
  • Nodes: allocated per graph execution via gallocr (short-term, activations)

👉 In short:

  • leaf → buffer
  • node → gallocr → buffer
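Putting the pieces together, here is a minimal end-to-end sketch on a single CPU backend (C = A·B; the sizes and host arrays are illustrative and error handling is omitted):

#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cpu.h"

int main(void) {
    // 1. metadata-only context + leaf tensors
    struct ggml_init_params ip = {
        .mem_size   = ggml_tensor_overhead() * 8 + ggml_graph_overhead(),
        .mem_buffer = NULL,
        .no_alloc   = true,
    };
    struct ggml_context *ctx = ggml_init(ip);
    struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2); // ne = {4, 2}
    struct ggml_tensor *B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3); // ne = {4, 3}

    // 2. backend + long-lived buffer for the leafs
    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_backend_buffer_t leaf_buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // 3. graph: C = mul_mat(A, B), result ne = {2, 3}
    struct ggml_cgraph *gf = ggml_new_graph(ctx);
    struct ggml_tensor *C  = ggml_mul_mat(ctx, A, B);
    ggml_build_forward_expand(gf, C);

    // 4. short-lived buffers for the node tensors
    ggml_gallocr_t galloc =
        ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);

    // 5. upload inputs, compute, download the result
    float a_data[4 * 2] = {0}, b_data[4 * 3] = {0}, c_data[2 * 3];
    ggml_backend_tensor_set(A, a_data, 0, ggml_nbytes(A));
    ggml_backend_tensor_set(B, b_data, 0, ggml_nbytes(B));
    ggml_backend_graph_compute(backend, gf);
    ggml_backend_tensor_get(C, c_data, 0, ggml_nbytes(C));

    // 6. cleanup
    ggml_gallocr_free(galloc);
    ggml_backend_buffer_free(leaf_buf);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}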

ggml_backend_sched

Scheduler for multi-backend execution.

ggml_backend_t backends[2] = {
    ggml_backend_cpu_init(),
    ggml_backend_cuda_init(0),
};

// note: the exact signature of ggml_backend_sched_new varies across ggml versions;
// recent versions also take per-backend buffer types, a graph size, and a parallel flag
ggml_backend_sched_t sched = ggml_backend_sched_new(backends, 2);
ggml_backend_sched_graph_compute(sched, gf);

Steps:

  1. Traverse graph, assign backend id to each tensor
  2. Split into subgraphs per backend
  3. Insert data copies across backends if needed

Why not just use the buffer? Although ggml_tensor has a buffer pointer, the buffer only tells you where the data sits, not who executes the op.

  • The same CUDA buffer might be used by multiple streams (different executors).
  • A CPU buffer might be computed on locally or remotely via RPC.

Therefore, we need a backend id as a routing tag: it tells the scheduler which executor runs each op.
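If you need to steer this routing yourself, the scheduler exposes per-tensor pinning. A hedged sketch, taking the C = ggml_mul_mat(ctx, A, B) node from the earlier graph; ggml_backend_sched_set_tensor_backend and ggml_backend_sched_get_tensor_backend are assumed from recent ggml-backend.h:

#include <stdio.h>

// pin the matmul node to the CPU backend (backends[0] above) before allocation;
// the scheduler will still insert any cross-backend copies it needs around it
ggml_backend_sched_set_tensor_backend(sched, C, backends[0]);

ggml_backend_sched_graph_compute(sched, gf);

// inspect where the scheduler actually placed the tensor
printf("C runs on: %s\n",
       ggml_backend_name(ggml_backend_sched_get_tensor_backend(sched, C)));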

Step1. How GGML creates/loads tensors

The first step is to create tensors (model weights or inputs) and bind them to real memory.

[ggml_context]
      │
      ▼
┌───────────────┐       ┌────────────────────┐
│  ggml_tensor  │ --->  │  ggml_backend_buf  │ ---> [device memory]
│   (A, B, …)   │       │    (CPU/GPU/…)     │
└───────────────┘       └────────────────────┘
      │                           ▲
      │                           │ (ggml_backend_tensor_set)
      ▼                           │
[host data: float[]] ─────────────┘

1. Create tensors in a context

struct ggml_init_params params = {
    .mem_size   = ggml_tensor_overhead() * num_tensors,
    .mem_buffer = NULL,
    .no_alloc   = true, // only create metadata
};

struct ggml_context *ctx = ggml_init(params);

// create 2D tensors
struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsA, rowsA);
struct ggml_tensor *B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsB, rowsB);

At this point:

  • A->data == NULL
  • A->buffer == NULL
  • only metadata (shape/stride/op) is stored.

2. Allocate backend memory for leaf tensors

ggml_backend_t backend = ggml_backend_cpu_init(); // or CUDA/Metal

// allocate memory for all tensors in ctx (here A, B)
ggml_backend_buffer_t buf =
    ggml_backend_alloc_ctx_tensors(ctx, backend);

Now:

  • A->buffer = buf
  • A->data != NULL (points inside buf)
  • same for B

👉 This is the step that turns "symbolic tensors" into real memory-backed tensors.


3. Load data into tensors

float host_A[rowsA * colsA] = { ... };
ggml_backend_tensor_set(A, host_A, 0, ggml_nbytes(A));

float host_B[rowsB * colsB] = { ... };
ggml_backend_tensor_set(B, host_B, 0, ggml_nbytes(B));

This copies from host memory into the backend buffer (CPU RAM or GPU VRAM).
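The reverse direction uses ggml_backend_tensor_get. A short sketch for reading a computed tensor back to the host after the graph has run (C is assumed to be an output tensor of the graph):

#include <stdlib.h>

// copy from the backend buffer (CPU RAM or GPU VRAM) back into host memory
float *host_C = malloc(ggml_nbytes(C));
ggml_backend_tensor_get(C, host_C, 0, ggml_nbytes(C));
// ... consume host_C ...
free(host_C);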


Step2. How GGML builds computation graphs

GGML is similar to JAX in that both are static graph frameworks. The key difference is when and how the computation graph is built:

  • JAX: you write a Python function, and at the first jit call, JAX traces the Python operations to construct a computation graph (HLO).
  • GGML: you explicitly call C APIs (ggml_mul_mat, ggml_add, …) at runtime, and these calls directly create ggml_tensor objects that represent nodes. Later, you invoke ggml_build_forward_expand to traverse dependencies and finalize the graph.

So in GGML the graph is still built at runtime, but unlike PyTorch eager mode it is built once and reused (not rebuilt on every forward).


1. Computation graph (ggml_cgraph)

Each graph records tensors that participate in evaluation.

struct ggml_cgraph {
    int n_nodes, n_leafs;
    struct ggml_tensor **nodes; // intermediate tensors
    struct ggml_tensor **leafs; // inputs & weights
    // grads / grad_accs are used for training
};
  • leafs = constant tensors (model weights, inputs)
  • nodes = activations, results of operations

The GGML cgraph is a tensor graph where each node or leaf is a tensor. Suppose our model is

y = relu(matmul(x, W) + b)

Then the GGML cgraph structure is

(x) ─┐
     ├─> (matmul_out) ─┐
(W) ─┘                 ├─> (add_out) ──> (relu_out) = y
(b) ───────────────────┘

Each intermediate tensor (matmul_out, add_out, relu_out) is a node, which records how it is calculated (the operation and the pointers to its inputs).
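Expressed with ggml ops, the same model might look like the sketch below (x, W, b are assumed to be tensors created earlier in the same context; with ggml's layout, ne[0] is the innermost dimension, so W would be stored as ne = {n_in, n_out}):

// y = relu(matmul(x, W) + b)
struct ggml_tensor *matmul_out = ggml_mul_mat(ctx, W, x);      // ne = {n_out, batch}
struct ggml_tensor *add_out    = ggml_add(ctx, matmul_out, b); // bias is broadcast
struct ggml_tensor *relu_out   = ggml_relu(ctx, add_out);      // this is y

struct ggml_cgraph *gf = ggml_new_graph(ctx);
ggml_build_forward_expand(gf, relu_out); // x, W, b become leafs; the three results become nodes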


2. Tensor metadata (ggml_tensor)

Each tensor already knows:

  • its shape (ne[])
  • strides (nb[])
  • producer operation (op)
  • input tensors (src[])
struct ggml_tensor {
    enum ggml_type type;
    int64_t ne[GGML_MAX_DIMS];              // shape
    size_t  nb[GGML_MAX_DIMS];              // stride in bytes
    enum ggml_op op;                        // producer op
    struct ggml_tensor *src[GGML_MAX_SRC];  // inputs
    void *data;                             // pointer to real buffer
};

Thus, when the graph is finalized, most information (shape, data type, dependencies) is already available.


3. Build the forward graph

Graphs are finalized using ggml_build_forward_expand:

void ggml_build_forward_expand(struct ggml_cgraph *cgraph,
                               struct ggml_tensor *tensor) {
    ggml_build_forward_impl(cgraph, tensor, true);
}

static void ggml_build_forward_impl(struct ggml_cgraph *cgraph,
                                    struct ggml_tensor *tensor, bool expand) {
    const int n0 = cgraph->n_nodes;
    ggml_visit_parents(cgraph, tensor); // DFS
    const int n_new = cgraph->n_nodes - n0;
    if (n_new > 0) {
        GGML_ASSERT(cgraph->nodes[cgraph->n_nodes - 1] == tensor);
    }
}
  • ggml_visit_parents: depth-first search (DFS) through tensor->src[], adds unseen parents to leafs or nodes, guarantees topological order.
  • The final graph is a static DAG of tensors.

4. Minimal example

struct ggml_cgraph *gf = ggml_new_graph(ctx);

// computation: C = A @ B
struct ggml_tensor *C = ggml_mul_mat(ctx, A, B);

// build graph (DFS from C)
ggml_build_forward_expand(gf, C);

// now gf->leafs = {A, B}, gf->nodes = {C}

🔎 GGML vs JAX graph building

Framework     | When graph is built | How graph is built                                       | User perspective
------------- | ------------------- | -------------------------------------------------------- | ----------------------------------------
PyTorch eager | Every forward       | Dynamic operator recording                               | Immediate execution
JAX           | At first jit call   | Trace Python ops → build HLO graph                       | User writes Python, tracing is automatic
GGML          | At runtime, once    | Explicit C API calls build tensors + DFS collects graph  | User manually builds the graph

Step3. How GGML schedules graph execution

After building the computation graph, GGML must:

  1. Allocate memory for intermediate nodes
  2. Partition the graph into per-backend subgraphs
  3. Assign backends and insert data copies when needed
  4. Execute subgraphs in dependency order

This is the job of the scheduler (ggml_backend_sched).


1. Entry point: ggml_backend_sched_graph_compute

enum ggml_status ggml_backend_sched_graph_compute(
        ggml_backend_sched_t sched, struct ggml_cgraph * graph) {
    enum ggml_status err = ggml_backend_sched_graph_compute_async(sched, graph);
    ggml_backend_sched_synchronize(sched);
    return err;
}
  • Calls the async version to launch computation
  • Then synchronizes (waits for completion)

2. Async execution and allocation

enum ggml_status ggml_backend_sched_graph_compute_async(
        ggml_backend_sched_t sched, struct ggml_cgraph * graph) {

    if (!sched->is_reset && !sched->is_alloc) {
        ggml_backend_sched_reset(sched);
    }

    if (!sched->is_alloc) {
        if (!ggml_backend_sched_alloc_graph(sched, graph)) {
            return GGML_STATUS_ALLOC_FAILED;
        }
    }

    return ggml_backend_sched_compute_splits(sched);
}
  • If memory is not yet allocated → call ggml_backend_sched_alloc_graph
  • Then run ggml_backend_sched_compute_splits to execute each subgraph

3. Graph allocation: ggml_backend_sched_alloc_graph

bool ggml_backend_sched_alloc_graph(
        ggml_backend_sched_t sched, struct ggml_cgraph * graph) {
    GGML_ASSERT((int)sched->hash_set.size >= graph->n_nodes + graph->n_leafs);
    GGML_ASSERT(!sched->is_alloc);

    // step 1: split into per-backend subgraphs
    ggml_backend_sched_split_graph(sched, graph);

    // step 2: allocate memory for each subgraph
    if (!ggml_backend_sched_alloc_splits(sched)) {
        return false;
    }

    sched->is_alloc = true;
    return true;
}

Two core tasks:

  1. Split the graph (ggml_backend_sched_split_graph)
    • Traverse the graph in topological order
    • Group ops into subgraphs by backend id
    • Insert copy ops automatically across backends (CPU ↔ CUDA, CUDA ↔ Metal, etc.)
  2. Allocate memory (ggml_backend_sched_alloc_splits)
    • For each subgraph, create a ggml_gallocr tied to the backend's buffer type
    • Assign buffers to all intermediate node tensors
    • Leafs remain unchanged (already allocated earlier)
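To see the result of the split, the scheduler can be queried after allocation. A hedged sketch (ggml_backend_sched_get_n_splits is assumed from recent ggml-backend.h):

#include <stdio.h>

if (ggml_backend_sched_alloc_graph(sched, gf)) {
    // number of per-backend subgraphs produced by ggml_backend_sched_split_graph
    printf("graph was split into %d subgraphs\n", ggml_backend_sched_get_n_splits(sched));
}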

4. Execution: ggml_backend_sched_compute_splits

After allocation, the scheduler executes each subgraph in order:

  • Calls the correct backend (CPU, CUDA, …) for each split
  • Ensures cross-device copies are completed before dependent ops run
  • Supports multiple model copies (sched->cur_copy / next_copy) for parallelism
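When the same graph is executed repeatedly (e.g. once per generated token), a typical reuse pattern is the following sketch, built only from the scheduler calls named above (n_steps is a placeholder):

for (int step = 0; step < n_steps; step++) {
    // ... update inputs / rebuild gf for this step ...

    ggml_backend_sched_reset(sched);             // clear previous assignments so the graph can be re-split
    ggml_backend_sched_alloc_graph(sched, gf);   // split the graph and allocate node memory
    ggml_backend_sched_graph_compute(sched, gf); // run every split in dependency order
}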

5. Worked example

Suppose:

C = A @ B + D
  • A @ B (matmul) should run on CUDA
  • + D (add) should run on CPU

Scheduler steps:

  1. Split graph
Split1 (CUDA):
    tmp = A @ B

[insert copy CUDA → CPU]

Split2 (CPU):
    C = tmp + D
  2. Allocate memory
    • CUDA split → VRAM buffers for matmul output
    • CPU split → host RAM for addition output
  3. Execute
    • Run CUDA matmul kernel
    • Copy result back to CPU buffer
    • Run CPU addition kernel

🔎 Summary

  • Leaf tensors → allocated directly via backend buffer
  • Node tensors → allocated by scheduler (gallocr) per subgraph
  • Scheduler workflow:
    1. Split graph into subgraphs (ggml_backend_sched_split_graph)
    2. Allocate buffers (ggml_backend_sched_alloc_splits)
    3. Execute each subgraph (ggml_backend_sched_compute_splits)
  • backend id ensures the correct executor is chosen, even if multiple backends share the same memory.

👉 Compared to Step1 (load tensors) and Step2 (build graph), Step3 is where the graph is materialized into memory and mapped to devices for execution.