DEtection TRansformer (DETR)

Sources:

  1. DETR paper: End-to-End Object Detection with Transformers

Notation

| Symbol | Type | Explanation |
| --- | --- | --- |
| \(I\) | Input image | A batch of images, shape \((B, 3, H_{\text{img}}, W_{\text{img}})\) |
| \(B\) | Batch size | Number of images in a batch |
| \(F\) | Feature map | Backbone output feature map, shape \((B, D, H_f, W_f)\) |
| \(D\) | Feature dimension | Channel or embedding dimension (DETR default \(D=256\) after 1×1 conv) |
| \(H_f, W_f\) | Spatial size | Height/width of backbone feature map |
| \(HW\) | Token count | Flattened spatial token count, \(HW = H_f \times W_f\) |
| \(X\) | Image tokens | Flattened feature tokens, shape \((B, HW, D)\) |
| \(PE\) | Positional encoding | 2D positional encoding added to image tokens |
| \(M\) | Image memory | Encoder output tokens, shape \((B, HW, D)\) |
| \(N\) | # query slots | Fixed number of object queries or slots (DETR default \(N=100\)) |
| \(Q_0\) | Object queries | Learnable query embeddings, shape \((N, D)\) → broadcast to \((B, N, D)\) |
| \(Q_L\) | Final query features | Decoder output per-slot features, shape \((B, N, D)\) |
| \(C\) | # classes | Number of foreground classes (plus one no-object class) |
| \(\varnothing\) | No-object class | Special class meaning “this slot predicts no object” |
| \(K\) | # GT objects | Number of ground-truth objects in an image (variable, \(K \le N\)) |

1. What is DETR?

Fig. 1

Fig. 1: DETR directly predicts (in parallel) the final set of detections by combining a common CNN with a transformer architecture. During training, bipartite matching uniquely assigns predictions to ground-truth boxes. Predictions with no match should yield a “no object” (\(\varnothing\)) class prediction.

DETR is the first clean end-to-end object detector built around a Transformer. Its core idea is simple: Treat object detection as a direct set prediction problem.

Given an image, the model outputs a set of objects in one shot. Each object has:

  • a class label
  • a bounding box

So DETR does not first generate tons of candidate boxes and then filter/merge them. It predicts the fixed-size final set directly.

DETR achieves this mainly using:

  1. a Transformer encoder–decoder to reason globally and assign objects to slots
  2. a set loss with Hungarian matching to enforce unique, non-duplicated predictions

2. What improves over previous detectors?

A typical traditional detection pipeline looks like:

generate many candidates (anchors/proposals) → classify/regress → NMS to remove duplicates

This pipeline relies on many hand-designed components and heuristics.

DETR simplifies it in three key ways:

  1. No anchors / proposals
    • no dense candidate generation
    • directly predicts the final object set
  2. No NMS
    • deduplication is learned through the set loss + query interactions, not a post-processing step
  3. Fully end-to-end and unified
    • training objective matches inference outputs exactly
    • the pipeline is conceptually clean

One-liner:

DETR moves detection from dense+heuristic pipelines to set+matching.


3. Components and workflow (including object queries)

Fig. 2

Fig. 2: DETR uses a conventional CNN backbone to learn a 2D representation of an input image. The model flattens it and supplements it with a positional encoding before passing it into a transformer encoder. A transformer decoder then takes as input a small fixed number of learned positional embeddings, which we call object queries, and additionally attends to the encoder output. We pass each output embedding of the decoder to a shared feed forward network (FFN) that predicts either a detection (class and bounding box) or a “no object” class.

3.1 Components

DETR has a very structured three-part design:

  1. Backbone (CNN/ViT)
    • extracts an image feature map
  2. Transformer Encoder
    • takes only image tokens (with 2D positional encoding)
    • outputs a global image memory
  3. Transformer Decoder
    • takes a fixed set of object queries (N slots)
    • uses attention to read the image memory
    • each query outputs either an object prediction or no-object

3.2 Workflow (detailed)

It’s easiest to see DETR as two separate streams: an image stream (encoder path) and a slot/query stream (decoder path). They stay separate until the decoder’s cross-attention.


A. Image stream (goes into the Encoder only)

Step A1: Backbone feature extraction

    Image I → Backbone → Feature map F
    F shape: (B, D, H_f, W_f)

Typical scale example:

  • input ~640×640
  • stride ~32
  • so H_f ≈ 20, W_f ≈ 20 (640 / 32 = 20)
  • D=256
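The shape contract of this step can be sketched in PyTorch. The single strided conv below is only a stand-in for the real ResNet-50 backbone (what matters here is the shape flow, not the features), followed by the 1×1 projection to \(D=256\):

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the CNN backbone (real DETR uses ResNet-50).
# The only thing we care about: (B, 3, H_img, W_img) -> (B, C_bb, H_f, W_f) at stride 32.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=32, padding=3),  # crude stride-32 downsampling
    nn.ReLU(),
)
input_proj = nn.Conv2d(64, 256, kernel_size=1)  # 1x1 conv to the model dimension D=256

I = torch.randn(2, 3, 640, 640)   # batch of B=2 images
F = input_proj(backbone(I))       # feature map
print(F.shape)                    # torch.Size([2, 256, 20, 20])
```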

Step A2: Flatten into image tokens

    X = flatten(F)
    X shape: (B, HW, D), HW = H_f × W_f

Example:

  • H_f=W_f=20 → HW=400
  • X shape: (B, 400, 256)
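In PyTorch this flatten (plus moving channels last) is a one-liner:

```python
import torch

B, D, H_f, W_f = 2, 256, 20, 20
F = torch.randn(B, D, H_f, W_f)   # backbone feature map

# (B, D, H_f, W_f) -> (B, D, H_f*W_f) -> (B, HW, D)
X = F.flatten(2).transpose(1, 2)
print(X.shape)  # torch.Size([2, 400, 256])
```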

Step A3: Add 2D positional encoding (image tokens only)

    X' = X + PE
    shape unchanged: (B, HW, D)

Step A4: Transformer Encoder (global reasoning)

    M = Encoder(X')
    M shape: (B, HW, D)
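Steps A3 and A4 together can be sketched with the stock `nn.TransformerEncoder`. One simplification to note: real DETR re-adds the positional encoding inside every attention layer, whereas this sketch adds it once at the input; the random `PE` is a stand-in for the fixed 2D sine encoding:

```python
import torch
import torch.nn as nn

B, HW, D = 2, 400, 256
X = torch.randn(B, HW, D)     # flattened image tokens
PE = torch.randn(1, HW, D)    # stand-in for the fixed 2D sine positional encoding

layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

M = encoder(X + PE)           # global image memory
print(M.shape)                # torch.Size([2, 400, 256])
```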

Intuition:

The encoder “reads” the whole image and produces a global memory bank M.


B. Slot / Query stream (goes into the Decoder only)

Step B1: Prepare N object queries (independent learnable slots)

Formally:

    Q0 = nn.Embedding(N, D)

Example (DETR default):

  • N=100
  • D=256
  • Q0 shape: (100, 256)
  • broadcast to batch: (B, 100, 256)

Important:

  • queries are not computed from the image
  • they never enter the encoder
  • they are the decoder’s input sequence

Intuition:

These are N empty “slots” the model can fill with objects.
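A minimal sketch of the query table and its batch broadcast:

```python
import torch
import torch.nn as nn

N, D, B = 100, 256, 2
query_embed = nn.Embedding(N, D)   # the learnable object queries, independent of the image

# Broadcast the (N, D) table to the batch: (B, N, D)
Q0 = query_embed.weight.unsqueeze(0).expand(B, -1, -1)
print(Q0.shape)  # torch.Size([2, 100, 256])
```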

Step B2: Transformer Decoder (where queries meet memory)

Each decoder layer performs:

  1. Self-attention on queries
    • lets slots communicate so duplicate detections can be suppressed without NMS
  2. Cross-attention (queries read image memory)
    • each slot searches the memory for “its” object
  3. FFN
    • refines each slot’s representation

After L layers:

    QL shape: (B, N, D)
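Conveniently, a stock `nn.TransformerDecoderLayer` performs exactly these three sub-steps in order (query self-attention, cross-attention to memory, FFN), so the decoder path can be sketched as:

```python
import torch
import torch.nn as nn

B, N, HW, D = 2, 100, 400, 256
Q0 = torch.randn(B, N, D)   # object queries, broadcast to the batch
M = torch.randn(B, HW, D)   # encoder output (image memory)

layer = nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)

QL = decoder(tgt=Q0, memory=M)   # slots after reading the image
print(QL.shape)                  # torch.Size([2, 100, 256])
```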

(Optional note for SAM3)

SAM3 uses a DETR-like decoder too: learned queries cross-attend to conditioned image embeddings; prompts just condition the slots before they claim objects.

Step B3: Prediction heads

Per query output:

  • class logits (including a no-object class)
  • normalized box (cx, cy, w, h)
    class_logits shape: (B, N, C+1)
    boxes shape: (B, N, 4)
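A sketch of the two heads. The value C=91 is an assumption here (COCO-style class count); the 3-layer box MLP with a final sigmoid follows the paper's description:

```python
import torch
import torch.nn as nn

B, N, D, C = 2, 100, 256, 91   # C=91 is an assumed class count
QL = torch.randn(B, N, D)      # decoder output per slot

class_head = nn.Linear(D, C + 1)   # +1 for the no-object class ∅
box_head = nn.Sequential(          # small MLP, as in the paper
    nn.Linear(D, D), nn.ReLU(),
    nn.Linear(D, D), nn.ReLU(),
    nn.Linear(D, 4),
)

class_logits = class_head(QL)
boxes = box_head(QL).sigmoid()     # normalized (cx, cy, w, h) in [0, 1]
print(class_logits.shape, boxes.shape)
```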

3.3 What exactly are object queries (slots)?

Object queries are N learnable embeddings. They function as:

N object slots in the decoder.

After training:

  • some slots align to real objects
  • remaining slots learn to output no-object (∅)

Intuition:

Give the model N empty seats; it decides which seats correspond to objects and which stay empty.


3.4 Concrete example (to make “N slots” tangible)

Suppose an image contains only 3 objects:

  1. person
  2. dog
  3. ball

So GT count is K=3, but DETR still outputs N=100 slots.

Inference output is always length 100:

    pred = [(class_1, box_1), ..., (class_100, box_100)]

A typical learned allocation might look like:

| slot id | predicted class | meaning |
| --- | --- | --- |
| q5 | person | slot #5 claims the person |
| q17 | dog | slot #17 claims the dog |
| q63 | ball | slot #63 claims the ball |
| other 97 slots | no-object (∅) | empty slots |

So:

Out of 100 slots, only 3 are “filled” with objects; the rest predict ∅.
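With random stand-in predictions, filtering out the ∅ slots at inference looks like this; there is no NMS anywhere, only a check against the no-object class (conventionally the last class index):

```python
import torch

B, N, C = 1, 100, 91                      # C=91 is an assumed class count
class_logits = torch.randn(B, N, C + 1)   # stand-in for the model's class logits
boxes = torch.rand(B, N, 4)               # stand-in for the normalized boxes

probs = class_logits.softmax(-1)
labels = probs.argmax(-1)    # (B, N); value C means the slot predicts ∅
keep = labels != C           # which of the N slots are "filled"

detections = [(int(l), b) for l, b, k in zip(labels[0], boxes[0], keep[0]) if k]
print(f"{len(detections)} filled slots out of {N}")
```

In practice a confidence threshold on `probs` is usually applied as well, but the set structure itself requires no post-processing.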


4. Training (very short)

Training has two essential steps:

  1. Hungarian matching (1-to-1 assignment)
    • match K GT objects to K of the N predicted slots
    • remaining N−K slots are matched to ∅
  2. Loss on matched pairs
    • matched slots: classification loss + box loss (L1 + GIoU)
    • unmatched slots: supervised as no-object (∅)
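A toy version of step 1, using scipy's Hungarian solver on a hypothetical \((N, K)\) cost matrix. The numbers below are made up; real DETR fills this matrix with a weighted sum of class-probability, L1-box, and GIoU terms:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

N, K = 5, 2   # tiny example: 5 slots, 2 ground-truth objects

# Hypothetical matching cost of assigning slot i to GT object j
cost = np.array([
    [9.0, 1.0],   # slot 0
    [0.5, 8.0],   # slot 1
    [7.0, 7.0],   # slot 2
    [6.0, 0.8],   # slot 3
    [8.0, 9.0],   # slot 4
])  # shape (N, K)

# 1-to-1 assignment minimizing total cost; the 3 unmatched slots get ∅
row, col = linear_sum_assignment(cost)
print(dict(zip(row.tolist(), col.tolist())))  # {1: 0, 3: 1}
```

Here slot 1 is matched to object 0 and slot 3 to object 1 (total cost 1.3); slots 0, 2, and 4 would be supervised as no-object.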

One-liner:

DETR trains by assigning objects to slots, not by filtering candidates.


Appendix

Box vs. normalized box

A box is an axis-aligned rectangle for an object. DETR predicts normalized boxes, meaning coordinates are scaled to [0, 1] by image width/height, so training is stable and resolution-independent.

DETR uses the center-size format:

\((cx, cy, w, h)\)

  • \(cx, cy\): box center (relative to image size)
  • \(w, h\): box width/height (relative to image size)

Why 4 numbers?

A rectangle needs 4 degrees of freedom: 2 for position + 2 for size. This is equivalent to the corner format \((x_1, y_1, x_2, y_2)\); DETR uses center-size because it’s smooth/easy to regress and avoids constraints like \(x_1 < x_2\).
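The center-size → corner conversion is a few lines; this mirrors a helper like `box_cxcywh_to_xyxy` from the reference implementation:

```python
import torch

def box_cxcywh_to_xyxy(b):
    """Convert (cx, cy, w, h) boxes to corner format (x1, y1, x2, y2)."""
    cx, cy, w, h = b.unbind(-1)
    return torch.stack(
        [cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h], dim=-1
    )

b = torch.tensor([[0.5, 0.5, 0.2, 0.4]])   # centered box, 20% wide, 40% tall
print(box_cxcywh_to_xyxy(b))               # tensor([[0.4000, 0.3000, 0.6000, 0.7000]])
```

Note the output automatically satisfies \(x_1 < x_2\) whenever \(w > 0\), which is exactly the constraint the corner format would otherwise need to enforce.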