GGML
Sources:
- A brief introduction to GGML by HuggingFace
- A series of in-depth explanations (in Chinese) from Zhihu
- GGML Tutorial github repo
- GGML official repo
GGML
GGML is a deep learning tensor library written in C and C++ with a focus on Transformer inference.
Key concepts (expanded)
ggml_context
See this article for a diagram illustration.
A container that owns tensors and graphs. Think of it as an arena allocator: all tensors created inside share the same context, and freeing the context frees everything.
- With `.no_alloc = true`, tensors only store metadata, not data.
- Later you bind them to real memory via a backend.
- This separation allows you to first build the graph (structure), then decide memory placement.
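To make the two modes concrete, here is a minimal sketch (the arena size and tensor count are arbitrary placeholders):

```c
#include "ggml.h"

// Sketch: two ways to create a context.
void context_example(void) {
    // (a) default: the context arena also holds tensor data
    struct ggml_init_params p1 = {
        .mem_size   = 16 * 1024 * 1024,   // 16 MB arena
        .mem_buffer = NULL,               // let ggml allocate the arena
        .no_alloc   = false,              // tensor data lives inside the arena
    };
    struct ggml_context *ctx_data = ggml_init(p1);

    // (b) metadata-only: tensors get no data; a backend buffer is bound later
    struct ggml_init_params p2 = {
        .mem_size   = ggml_tensor_overhead() * 128,  // room for 128 tensor headers
        .mem_buffer = NULL,
        .no_alloc   = true,
    };
    struct ggml_context *ctx_meta = ggml_init(p2);

    // freeing a context frees every tensor created inside it
    ggml_free(ctx_data);
    ggml_free(ctx_meta);
}
```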
ggml_tensor
Definition: n-dimensional tensor metadata.
```c
struct ggml_tensor {
    int64_t ne[GGML_MAX_DIMS];              // shape: elements per dimension
    size_t  nb[GGML_MAX_DIMS];              // strides in bytes
    enum ggml_op op;                        // producer operation
    struct ggml_tensor * src[GGML_MAX_SRC]; // inputs of op
    struct ggml_backend_buffer * buffer;    // memory block the tensor lives in
    void * data;                            // address inside buffer
    // ...
};
```
- Shape/stride: `ne[]`, `nb[]`
- Placement: `buffer` (memory block) + `data` (address inside it)
- Graph linkage: `op` + `src[]`
Minimal example:
```c
struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsA, rowsA);
struct ggml_tensor *B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsB, rowsB);
```
👉 At this stage, A and B exist as symbols in the graph, but not yet bound to real memory.
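As a sketch of what that metadata looks like, the snippet below creates a small F32 tensor and prints its shape and strides; it assumes a `ggml_context *ctx` created as above (with `no_alloc = true`, `data` stays NULL but `ne[]`/`nb[]` are already filled in):

```c
#include <stdio.h>
#include "ggml.h"

// ne[] holds the shape (ne[0] is the fastest-varying dimension); nb[] holds
// byte strides: for a contiguous F32 tensor nb[0] == sizeof(float) and
// nb[1] == ne[0] * sizeof(float).
void inspect_tensor(struct ggml_context *ctx) {
    struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 3); // 4 cols x 3 rows

    printf("shape : %lld x %lld\n", (long long) A->ne[0], (long long) A->ne[1]);
    printf("stride: nb[0]=%zu nb[1]=%zu bytes\n", A->nb[0], A->nb[1]);
    printf("total : %zu bytes\n", ggml_nbytes(A));
    printf("op    : %d (GGML_OP_NONE for a fresh leaf)\n", (int) A->op);
}
```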
ggml_cgraph
Definition: computation graph = nodes + leafs.
```c
struct ggml_cgraph {
    int n_nodes;                 // number of computed tensors
    int n_leafs;                 // number of constant/input tensors
    struct ggml_tensor ** nodes; // intermediate results
    struct ggml_tensor ** leafs; // weights, inputs
    // ...
};
```
Minimal example:
```c
struct ggml_cgraph *gf = ggml_new_graph(ctx);
struct ggml_tensor *C  = ggml_mul_mat(ctx, A, B);   // creates a node tensor
ggml_build_forward_expand(gf, C);                   // collect C and its dependencies
```
- Leaf tensors = model weights, inputs
- Node tensors = intermediate results (activations)
👉 Building the graph is analogous to JAX jit tracing, but done manually.
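A small sketch of inspecting the finalized graph (note: in newer ggml versions `struct ggml_cgraph` is opaque, so you would use accessors such as `ggml_graph_n_nodes()` / `ggml_graph_node()` instead of touching the fields directly):

```c
#include <stdio.h>
#include "ggml.h"

// Build a tiny graph and list what ended up in it.
// Assumes A and B are F32 tensors with compatible shapes for mul_mat.
void describe_graph(struct ggml_context *ctx,
                    struct ggml_tensor *A, struct ggml_tensor *B) {
    struct ggml_cgraph *gf = ggml_new_graph(ctx);
    struct ggml_tensor *C  = ggml_mul_mat(ctx, A, B);
    ggml_build_forward_expand(gf, C);

    printf("leafs: %d, nodes: %d\n", gf->n_leafs, gf->n_nodes);
    for (int i = 0; i < gf->n_nodes; i++) {
        printf("node %d: op = %s\n", i, ggml_op_name(gf->nodes[i]->op));
    }
    // ggml_graph_print(gf);  // built-in dump of the same information
}
```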
ggml_backend
Definition: executor interface. A backend knows how to run ops, but not how to allocate memory by itself.
```c
ggml_backend_t backend = ggml_backend_cpu_init();       // CPU backend
// ggml_backend_t backend = ggml_backend_cuda_init(0);  // or a GPU backend
```
Run graph:
```c
ggml_backend_graph_compute(backend, gf);
```
👉 Multiple backends can coexist (CPU, CUDA, Metal, RPC). Each gets a logical backend id when scheduling.
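A minimal sketch of holding several backend handles at once; the CUDA part assumes a build with CUDA enabled (the `GGML_USE_CUDA` guard and the exact header layout depend on how ggml was built):

```c
#include <stdio.h>
#include "ggml-backend.h"
#ifdef GGML_USE_CUDA
#include "ggml-cuda.h"
#endif

// Several executors can be alive at once; each is just a ggml_backend_t handle.
void list_backends(void) {
    ggml_backend_t cpu = ggml_backend_cpu_init();
    printf("backend: %s\n", ggml_backend_name(cpu));

#ifdef GGML_USE_CUDA
    ggml_backend_t gpu = ggml_backend_cuda_init(0);   // device 0
    printf("backend: %s\n", ggml_backend_name(gpu));
    ggml_backend_free(gpu);
#endif

    ggml_backend_free(cpu);
}
```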
ggml_backend_buffer_type
Allocator type for a backend: describes how memory should be allocated on this device.
```c
ggml_backend_buffer_type_t buft = ggml_backend_get_default_buffer_type(backend);
```
- CPU → malloc/free
- CUDA → cudaMalloc/cudaFree
- Metal → Metal buffer alloc
👉 Think of it as a malloc implementation tied to a backend.
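A short sketch of querying a buffer type and allocating through it (the 1024-byte size is an arbitrary placeholder):

```c
#include <stdio.h>
#include "ggml-backend.h"

// A buffer type is the "malloc implementation" of a backend.
void describe_buffer_type(ggml_backend_t backend) {
    ggml_backend_buffer_type_t buft = ggml_backend_get_default_buffer_type(backend);

    printf("buffer type : %s\n", ggml_backend_buft_name(buft));
    printf("alignment   : %zu bytes\n", ggml_backend_buft_get_alignment(buft));

    // allocating through the buffer type uses the device-appropriate allocator
    ggml_backend_buffer_t buf = ggml_backend_buft_alloc_buffer(buft, 1024);
    ggml_backend_buffer_free(buf);
}
```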
ggml_backend_buffer
The actual allocated memory block. All tensor->data pointers ultimately live inside a buffer.
Two common cases:
A. Allocate leaf tensors (inputs/weights)
```c
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
```
B. Manual allocation
```c
size_t size = 1024;
ggml_backend_buffer_t buf = ggml_backend_buft_alloc_buffer(buft, size);
```
👉 Buffers are device-specific: CPU RAM, CUDA VRAM, Metal heap…
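A sketch of case A: after `ggml_backend_alloc_ctx_tensors`, every tensor created in `ctx` points into the returned buffer, whose size can be queried:

```c
#include <stdio.h>
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// After allocation, every tensor in the context points into one backend buffer.
void describe_buffer(struct ggml_context *ctx, ggml_backend_t backend) {
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    printf("buffer size: %zu bytes\n", ggml_backend_buffer_get_size(buf));

    // ggml_backend_buffer_free(buf) releases the device memory;
    // the tensor metadata in ctx stays valid, but its data pointers do not.
}
```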
ggml_gallocr
Definition: Graph allocator = temporary memory manager for node tensors.
```c
ggml_gallocr_t gallocr = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
ggml_gallocr_alloc_graph(gallocr, gf);   // allocate data for every node tensor in gf
```
- Leafs: allocated directly by backend buffer (long-term, weights/inputs)
- Nodes: allocated per graph execution via gallocr (short-term, activations)
👉 In short:
- leaf → buffer
- node → gallocr → buffer
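Putting the two paths together, here is a hedged sketch of one compute pass: leaves go into a long-lived backend buffer, nodes are allocated by a `ggml_gallocr`, and the graph runs on a single backend (`n`, `a_data`, `b_data` are caller-provided placeholders):

```c
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// Sketch of the leaf-vs-node split: leaves get a long-lived backend buffer,
// nodes are handled by a graph allocator.
void run_once(ggml_backend_t backend, int n, const float *a_data, const float *b_data) {
    struct ggml_init_params params = {
        .mem_size   = ggml_tensor_overhead() * 64 + ggml_graph_overhead(),
        .mem_buffer = NULL,
        .no_alloc   = true,
    };
    struct ggml_context *ctx = ggml_init(params);

    // leafs: weights / inputs -> backend buffer
    struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
    struct ggml_tensor *B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n, n);
    ggml_backend_buffer_t weights = ggml_backend_alloc_ctx_tensors(ctx, backend);
    ggml_backend_tensor_set(A, a_data, 0, ggml_nbytes(A));   // caller-provided data
    ggml_backend_tensor_set(B, b_data, 0, ggml_nbytes(B));

    // nodes: intermediate results -> gallocr
    struct ggml_cgraph *gf = ggml_new_graph(ctx);
    struct ggml_tensor *C  = ggml_mul_mat(ctx, A, B);
    ggml_build_forward_expand(gf, C);

    ggml_gallocr_t galloc = ggml_gallocr_new(ggml_backend_get_default_buffer_type(backend));
    ggml_gallocr_alloc_graph(galloc, gf);          // allocates C (and any other nodes)

    ggml_backend_graph_compute(backend, gf);

    ggml_gallocr_free(galloc);
    ggml_backend_buffer_free(weights);
    ggml_free(ctx);
}
```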
ggml_backend_sched
Scheduler for multi-backend execution.
```c
ggml_backend_t backends[2] = { cuda_backend, cpu_backend };  // e.g. from ggml_backend_cuda_init(0) / ggml_backend_cpu_init()
ggml_backend_sched_t sched = ggml_backend_sched_new(
        backends, /*bufts=*/NULL, /*n_backends=*/2,
        /*graph_size=*/GGML_DEFAULT_GRAPH_SIZE, /*parallel=*/false);
```
Steps:
- Traverse graph, assign backend id to each tensor
- Split into subgraphs per backend
- Insert data copies across backends if needed
Why not just use the buffer? Although ggml_tensor has a `buffer` pointer, the buffer only tells you where the data sits, not who executes the op.
- The same CUDA buffer might be used by multiple streams (different executors).
- A CPU buffer might be computed on locally or remotely via RPC.
Therefore, we need backend id as routing tag: which executor runs this op.
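A minimal sketch of driving a graph through the scheduler; `gpu` and `cpu` are backend handles created elsewhere, and the exact `ggml_backend_sched_new` parameter list varies slightly across ggml versions, so treat the call below as illustrative:

```c
#include "ggml.h"
#include "ggml-backend.h"

// Schedule one graph across a GPU backend and a CPU backend.
// gpu/cpu: backend handles created elsewhere (e.g. ggml_backend_cuda_init / ggml_backend_cpu_init).
void compute_with_sched(ggml_backend_t gpu, ggml_backend_t cpu,
                        struct ggml_cgraph *gf) {
    // order matters: earlier backends are preferred when both can run an op
    ggml_backend_t backends[2] = { gpu, cpu };

    ggml_backend_sched_t sched = ggml_backend_sched_new(
            backends, /*bufts=*/NULL, /*n_backends=*/2,
            /*graph_size=*/GGML_DEFAULT_GRAPH_SIZE, /*parallel=*/false);

    // splits the graph, allocates node buffers per split, runs each split,
    // and inserts cross-backend copies where needed
    ggml_backend_sched_graph_compute(sched, gf);

    ggml_backend_sched_free(sched);
}
```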
Step 1. How GGML creates and loads tensors
The first step is to create tensors (model weights or inputs) and bind them to real memory.
```
[ggml_context]  -- ggml_new_tensor_* -->  tensors (metadata only)
       |
       |  ggml_backend_alloc_ctx_tensors(ctx, backend)
       v
[ggml_backend_buffer]  -->  tensor->data points into real device memory
```
1. Create tensors in a context
```c
struct ggml_init_params params = {
    .mem_size   = ggml_tensor_overhead() * 2,   // room for tensor metadata only
    .mem_buffer = NULL,
    .no_alloc   = true,                         // do not allocate tensor data
};
struct ggml_context *ctx = ggml_init(params);

struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsA, rowsA);
struct ggml_tensor *B = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, colsB, rowsB);
```
At this point:
- `A->data == NULL`
- `A->buffer == NULL`
- only metadata (shape/stride/op) is stored
2. Allocate backend memory for leaf tensors
```c
ggml_backend_t backend = ggml_backend_cpu_init();   // or CUDA/Metal
ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
```
Now:
- `A->buffer = buf`
- `A->data != NULL` (points inside `buf`)
- same for `B`
👉 This is the step that turns "symbolic tensors" into real memory-backed tensors.
3. Load data into tensors
```c
float host_A[rowsA*colsA] = { ... };
ggml_backend_tensor_set(A, host_A, 0, ggml_nbytes(A));   // host -> backend buffer
```
This copies from host memory into the backend buffer (CPU RAM or GPU VRAM).
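The three sub-steps can be combined into one small, self-contained program (CPU backend and a 2×2 tensor chosen arbitrarily for illustration):

```c
#include <stdio.h>
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

// End-to-end sketch of Step 1 on the CPU backend: create metadata-only tensors,
// bind them to a backend buffer, upload data, and read it back to verify.
int main(void) {
    struct ggml_init_params params = {
        .mem_size   = ggml_tensor_overhead() * 8,
        .mem_buffer = NULL,
        .no_alloc   = true,                 // step 1: metadata only
    };
    struct ggml_context *ctx = ggml_init(params);

    struct ggml_tensor *A = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 2, 2);

    ggml_backend_t backend = ggml_backend_cpu_init();
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);  // step 2: bind memory

    const float host_A[4] = { 1, 2, 3, 4 };
    ggml_backend_tensor_set(A, host_A, 0, ggml_nbytes(A));                     // step 3: upload

    float check[4];
    ggml_backend_tensor_get(A, check, 0, sizeof(check));                       // download back
    printf("A[0] = %.1f, A[3] = %.1f\n", check[0], check[3]);

    ggml_backend_buffer_free(buf);
    ggml_backend_free(backend);
    ggml_free(ctx);
    return 0;
}
```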
Step 2. How GGML builds computation graphs
GGML is similar to JAX in that both are static graph frameworks. The key difference is when and how the computation graph is built:
- JAX: you write a Python function, and at the first `jit` call, JAX traces the Python operations to construct a computation graph (HLO).
- GGML: you explicitly call C APIs (`ggml_mul_mat`, `ggml_add`, …) at runtime, and these calls directly create `ggml_tensor` objects that represent nodes. Later, you invoke `ggml_build_forward_expand` to traverse dependencies and finalize the graph.
So in GGML the graph is still built at runtime, but unlike PyTorch eager mode it is built once and reused (not rebuilt on every forward).
1. Computation graph (ggml_cgraph)
Each graph records tensors that participate in evaluation.
```c
struct ggml_cgraph {
    int n_nodes;
    int n_leafs;
    struct ggml_tensor ** nodes; // activations, op results
    struct ggml_tensor ** leafs; // weights, inputs
    // ...
};
```
- leafs = constant tensors (model weights, inputs)
- nodes = activations, results of operations
The GGML cgraph is a graph of tensors: every node and every leaf is itself a `ggml_tensor`. Suppose our model is
```
y = relu(matmul(x, W) + b)
```
Then the GGML cgraph structure is
```
(x) ──┐
      ├─> [matmul_out] ──┐
(W) ──┘                  ├─> [add_out] ──> [relu_out] = y
(b) ─────────────────────┘
```
Each intermediate tensor (matmul_out, add_out, relu_out) is a node, which records how it is calculated (the operation and the pointers to its inputs).
2. Tensor metadata (ggml_tensor)
Each tensor already knows:
- its shape (`ne[]`)
- strides (`nb[]`)
- producer operation (`op`)
- input tensors (`src[]`)
```c
struct ggml_tensor {
    int64_t ne[GGML_MAX_DIMS];              // shape
    size_t  nb[GGML_MAX_DIMS];              // strides (bytes)
    enum ggml_op op;                        // producer operation
    struct ggml_tensor * src[GGML_MAX_SRC]; // input tensors
    // ...
};
```
Thus, when the graph is finalized, most information (shape, data type, dependencies) is already available.
3. Build the forward graph
Graphs are finalized using ggml_build_forward_expand:
```c
void ggml_build_forward_expand(struct ggml_cgraph *cgraph,
                               struct ggml_tensor *tensor);
```
- `ggml_visit_parents`: depth-first search (DFS) through `tensor->src[]`; it adds unseen parents to `leafs` or `nodes`, guaranteeing topological order.
- The final graph is a static DAG of tensors.
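For intuition, here is a simplified sketch of that DFS. It is not the actual ggml source: the real implementation uses a hash set for visited tensors, and in newer versions `struct ggml_cgraph` is not accessed directly like this. `seen`, `n_seen`, and `seen_before` are illustrative stand-ins:

```c
#include <stdbool.h>
#include "ggml.h"

// Illustrative stand-in for ggml's visited hash set.
static bool seen_before(struct ggml_tensor **seen, int n_seen, struct ggml_tensor *t) {
    for (int i = 0; i < n_seen; i++) if (seen[i] == t) return true;
    return false;
}

// Simplified sketch of the DFS performed by ggml_build_forward_expand /
// ggml_visit_parents: inputs are visited before the tensors that consume them,
// which yields a topological order.
static void visit(struct ggml_cgraph *g, struct ggml_tensor *t,
                  struct ggml_tensor **seen, int *n_seen) {
    if (t == NULL || seen_before(seen, *n_seen, t)) return;
    seen[(*n_seen)++] = t;

    for (int i = 0; i < GGML_MAX_SRC; i++) {
        visit(g, t->src[i], seen, n_seen);     // parents first
    }

    if (t->op == GGML_OP_NONE) {
        g->leafs[g->n_leafs++] = t;            // constants / inputs
    } else {
        g->nodes[g->n_nodes++] = t;            // computed tensors
    }
}
```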
4. Minimal example
```c
struct ggml_cgraph *gf = ggml_new_graph(ctx);

struct ggml_tensor *matmul_out = ggml_mul_mat(ctx, W, x);
struct ggml_tensor *add_out    = ggml_add(ctx, matmul_out, b);
struct ggml_tensor *relu_out   = ggml_relu(ctx, add_out);

ggml_build_forward_expand(gf, relu_out);   // DFS from the output collects leafs + nodes
```
👉 GGML vs JAX graph building
| Framework | When graph is built | How graph is built | User perspective |
|---|---|---|---|
| PyTorch eager | Every forward | Dynamic operator recording | Immediate execution |
| JAX | At first `jit` call | Trace Python ops → build HLO graph | User writes Python, tracing is automatic |
| GGML | Runtime, once | Explicit C API calls build tensors + DFS to collect graph | User manually builds the graph |
Step 3. How GGML schedules graph execution
After building the computation graph, GGML must:
- Allocate memory for intermediate nodes
- Partition the graph into per-backend subgraphs
- Assign backends and insert data copies when needed
- Execute subgraphs in dependency order
This is the job of the scheduler (ggml_backend_sched).
1. Entry point: ggml_backend_sched_graph_compute
```c
enum ggml_status ggml_backend_sched_graph_compute(
        ggml_backend_sched_t   sched,
        struct ggml_cgraph   * graph);
```
- Calls the async version to launch computation
- Then synchronizes (waits for completion)
2. Async execution and allocation
```c
enum ggml_status ggml_backend_sched_graph_compute_async(
        ggml_backend_sched_t   sched,
        struct ggml_cgraph   * graph);
```
- If memory is not allocated yet → call `ggml_backend_sched_alloc_graph`
- Then run `ggml_backend_sched_compute_splits` to execute each subgraph
3. Graph allocation: ggml_backend_sched_alloc_graph
```c
bool ggml_backend_sched_alloc_graph(
        ggml_backend_sched_t   sched,
        struct ggml_cgraph   * graph);
```
Two core tasks:
- Split the graph (`ggml_backend_sched_split_graph`)
  - Traverse the graph in topological order
  - Group ops into subgraphs by backend id
  - Insert copy ops automatically across backends (CPU ↔ CUDA, CUDA ↔ Metal, etc.)
- Allocate memory (`ggml_backend_sched_alloc_splits`)
  - For each subgraph, create a `ggml_gallocr` tied to the backend's buffer type
  - Assign buffers to all intermediate node tensors
  - Leafs remain unchanged (already allocated earlier)
4. Execution: ggml_backend_sched_compute_splits
After allocation, the scheduler executes each subgraph in order:
- Calls the correct backend (CPU, CUDA, …) for each split
- Ensures cross-device copies are completed before dependent ops run
- Supports multiple model copies (`sched->cur_copy` / `next_copy`) for parallelism
5. Worked example
Suppose:
```
C = A @ B + D
```
- `A @ B` (matmul) should run on CUDA
- `+ D` (add) should run on CPU
Scheduler steps:
- Split graph
```
Split 1 (CUDA):  T = A @ B
   └─ copy T: VRAM -> host RAM
Split 2 (CPU):   C = T + D
```
- Allocate memory
- CUDA split → VRAM buffers for the matmul output
- CPU split → host RAM for the addition output
- Execute
- Run CUDA matmul kernel
- Copy result back to CPU buffer
- Run CPU addition kernel
👉 Summary
- Leaf tensors → allocated directly via a backend buffer
- Node tensors → allocated by the scheduler (`gallocr`) per subgraph
- Scheduler workflow:
  - Split the graph into subgraphs (`ggml_backend_sched_split_graph`)
  - Allocate buffers (`ggml_backend_sched_alloc_splits`)
  - Execute each subgraph (`ggml_backend_sched_compute_splits`)
- backend id ensures the correct executor is chosen, even if multiple backends share the same memory.
👉 Compared to Step 1 (load tensors) and Step 2 (build graph), Step 3 is where the graph is materialized into memory and mapped to devices for execution.