jumping back into computer vision - pt.1
Date: Nov 16th 2025
This is the first of a few blog posts dabbling back into computer vision for fun.
This weekend I read the following papers:
A big shift that I learned about: abandoning anchor-based methods for query-based methods
This philosophical shift changes the question from the one anchors tried to answer, "which of these predefined boxes contains an object?", to the more scalable "where are the objects in this image?", letting the model find them itself. The other important takeaway from this line of thinking is that the model learns its own "detectors": it figures out for itself which attributes of the image content to focus on (edges, corners, long lines, colors, etc.).
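A minimal numpy sketch of the query idea, in the DETR spirit: a set of learned query vectors cross-attends over the whole flattened feature map instead of scoring predefined anchor boxes, and each query becomes one "object slot". All shapes and names here are illustrative, not the actual architecture.

```python
import numpy as np

def cross_attention(queries, features):
    """Each learned query attends over all image features (no anchors),
    producing one object-slot embedding per query."""
    scores = queries @ features.T / np.sqrt(queries.shape[-1])   # [Q, HW]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax over locations
    return weights @ features                                    # [Q, D]

rng = np.random.default_rng(0)
D = 16
features = rng.normal(size=(64, D))   # flattened backbone feature map (HW = 64)
queries = rng.normal(size=(10, D))    # 10 learned object queries
slots = cross_attention(queries, features)
print(slots.shape)  # (10, 16) — one embedding per query, fed to box/class heads
```

In the real models the queries are trained end-to-end, so each one specializes in finding objects rather than being tied to a fixed box grid.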
3D Mood really impressed me; it is a model specifically focused on 3DOD (3D Object Detection). From my notes, the pipeline is a DETR (DEtection TRansformer), but unlike something like the SOTA RF-DETR, it produces 2D queries that feed both a 2D bounding-box detection head and a 3D box head, which then uses metric depth to produce the 3D detection. I was a bit confused about how camera intrinsics work with this model, but I was able to get a decent result from a hosted space.

Parameters for the above:
- fx: 1400
- fy: 800
- cx: 1024
- cy: 576
- prompt: car.surfboard.sign.moutain
- threshold: 0.1
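For anyone else confused by the intrinsics fields: fx/fy are focal lengths in pixels and cx/cy is the principal point (roughly the image center). A quick pinhole-camera sketch of how they map a metric 3D point in camera coordinates to a pixel, which is what lets a depth head turn 2D detections into 3D boxes (the values below are just the ones I entered, not calibrated ones):

```python
# Pinhole projection with the intrinsics used above.
fx, fy = 1400.0, 800.0   # focal lengths in pixels
cx, cy = 1024.0, 576.0   # principal point (roughly image center for 2048x1152)

def project(X, Y, Z):
    """Project a 3D camera-frame point (meters) to pixel coordinates."""
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return u, v

# A point on the optical axis lands exactly at the principal point:
print(project(0.0, 0.0, 10.0))  # (1024.0, 576.0)
```

Halving Z doubles how far off-axis points land from (cx, cy), which is why the model needs intrinsics to recover metric 3D boxes at all.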
Notably it missed all the non-car classes; I assume that's because there are fewer examples of these in the training set? Not entirely sure, but I'd love to understand how to make it work for these other classes, maybe through some prompting or slight finetuning.
A list of cool techniques I've just read about:
- NAS (neural architecture search) - specifically weight-sharing NAS, which trains a single supernet and then selects subnetworks from it based on a number of heuristics (notably, for realtime image models, latency versus accuracy)
- batch norm versus layer norm - batch norm has a tendency to collapse at small batch sizes because its statistics are computed over the batch (lol); this is especially important when training on consumer GPUs, where memory forces small batches
- predicting offsets versus predicting absolute values - when lifting a 2D coordinate estimate to 3D, for example, it is much more stable to keep the 2D estimate and regress a small offset from it than to predict the absolute value directly
- logarithmic depth prediction - to avoid vanishing or exploding gradients, especially when training on diverse "open world" image datasets with wildly varying depth ranges, it makes sense to regress logarithmic depth and then invert the log to recover the metric value
- CLIP (lol) - contrastive learning pushes apart the embeddings of dissimilar objects while pulling together the embeddings of similar ones
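To make the batch norm point concrete, here is a toy numpy comparison. Batch norm computes per-feature statistics across the batch axis, so with a batch of 2 each mean/variance is estimated from just two samples; layer norm computes per-sample statistics across features and doesn't care about batch size at all. This is a sketch, not either framework's exact implementation (no learned scale/shift, no running stats).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 8))  # tiny batch of 2, feature dim 8

# Batch norm: per-feature stats computed ACROSS the batch —
# with batch size 2, each estimate comes from only two samples (hence the collapse).
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# Layer norm: per-sample stats computed across features — batch-size independent.
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)

print(bn.shape, ln.shape)  # (2, 8) (2, 8)
```

Layer norm gives the same result whether the sample arrives in a batch of 2 or 256, which is exactly why it behaves better on memory-constrained consumer GPUs.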
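The offset and log-depth tricks combine naturally in the 2D-to-3D lifting setting. A hedged sketch of the decode step (this is my illustration, not the actual 3D-MOOD parameterization): the head outputs a small pixel offset from an existing 2D estimate plus a log-depth, and decoding undoes both.

```python
import numpy as np

def decode(center_2d, pred_offset, pred_log_depth):
    """Recover a pixel location and metric depth from well-scaled regression targets."""
    u, v = center_2d + pred_offset   # small offset from the 2D estimate: stable target
    z = np.exp(pred_log_depth)       # undo the log to get metric depth
    return u, v, z

# Example: refine an estimated center of (510, 300) px, depth around 12 m.
u, v, z = decode(np.array([510.0, 300.0]), np.array([2.0, -1.5]), np.log(12.0))
print(u, v, z)
```

The point is that both regression targets stay in a small, roughly unit-scale range whether the object is 2 m or 200 m away, which keeps gradients sane on mixed "open world" data.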
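And a toy contrastive objective in the CLIP spirit (a sketch with random embeddings, not CLIP's actual training code): each image embedding is scored against every text embedding, and cross-entropy makes the matched (diagonal) pair's similarity go up while mismatched pairs get pushed down.

```python
import numpy as np

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric-ish toy InfoNCE: row i's correct 'class' is text i."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)  # unit-normalize
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                      # [N, N]; diagonal = matched pairs
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(img))
    return -logp[idx, idx].mean()                           # cross-entropy over each row

rng = np.random.default_rng(2)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(contrastive_loss(img, txt))
```

Minimizing this pulls each matched image/text pair together and pushes the rest apart, which is the "pull similar, push dissimilar" behavior in the bullet above.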