jumping back into computer vision - pt.1
Date: Nov 16th 2025
This is the first of a few blog posts dabbling back into computer vision for fun.
This weekend I read the following papers:
A big shift that I learned about: abandoning anchor-based methods for query-based methods
This philosophical shift changes the question from the one anchors tried to answer, "which of these predefined boxes contains an object?", to the more scalable "where are the objects in this image?", letting the model find them itself. The other important takeaway from this line of thinking is that the model learns its own "detectors": it figures out for itself which attributes of the image content to focus on (edges, corners, long lines, colors, etc.).
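A minimal numpy sketch of the query idea, in the DETR spirit: a set of learned query vectors cross-attends over the whole flattened feature map instead of scoring predefined anchor boxes, and each query becomes one "object slot". All shapes and names here are illustrative, not the actual architecture.

```python
import numpy as np

def cross_attention(queries, features):
    """Each learned query attends over all image features (no anchors),
    producing one object-slot embedding per query."""
    scores = queries @ features.T / np.sqrt(queries.shape[-1])   # [Q, HW]
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # softmax over locations
    return weights @ features                                    # [Q, D]

rng = np.random.default_rng(0)
D = 16
features = rng.normal(size=(64, D))   # flattened backbone feature map (HW = 64)
queries = rng.normal(size=(10, D))    # 10 learned object queries
slots = cross_attention(queries, features)
print(slots.shape)  # (10, 16) — one embedding per query, fed to box/class heads
```

In the real models the queries are trained end-to-end, so each one specializes in finding objects rather than being tied to a fixed box grid.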
3D Mood really impressed me; it is a model specifically focused on 3DOD (3D Object Detection). From my notes, the pipeline is a DETR (DEtection TRansformer), but unlike something like the SOTA RF-DETR, it produces 2D queries that feed both a 2D bounding-box detection head and a 3D box head, which then uses metric depth to produce the 3D detection. I was a bit confused about how camera intrinsics work with this model, but I was able to get a decent result from a hosted space.

Parameters for the above:
- fx: 1400
- fy: 800
- cx: 1024
- cy: 576
- prompt: car.surfboard.sign.moutain
- threshold: 0.1
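For anyone else confused by the intrinsics fields: fx/fy are focal lengths in pixels and cx/cy is the principal point (roughly the image center). A quick pinhole-camera sketch of how they map a metric 3D point in camera coordinates to a pixel, which is what lets a depth head turn 2D detections into 3D boxes (the values below are just the ones I entered, not calibrated ones):

```python
# Pinhole projection with the intrinsics used above.
fx, fy = 1400.0, 800.0   # focal lengths in pixels
cx, cy = 1024.0, 576.0   # principal point (roughly image center for 2048x1152)

def project(X, Y, Z):
    """Project a 3D camera-frame point (meters) to pixel coordinates."""
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    return u, v

# A point on the optical axis lands exactly at the principal point:
print(project(0.0, 0.0, 10.0))  # (1024.0, 576.0)
```

Halving Z doubles how far off-axis points land from (cx, cy), which is why the model needs intrinsics to recover metric 3D boxes at all.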
Notably it missed all the non-car classes; I assume that's because there are fewer examples of these in the training set? Not entirely sure, but I'd love to understand how to make it work for these other classes, maybe through some prompting or slight finetuning.
A list of cool techniques I've just read about:
- NAS (neural architecture search) - specifically weight-sharing NAS, which trains a single supernet and then selects subnetworks from it based on a number of heuristics (notably, for realtime image models, latency versus accuracy)
- batch norm versus layer norm - batch norm has a tendency to collapse at small batch sizes because its statistics are computed over the batch (lol); this is especially important when training on consumer GPUs, where memory forces small batches
- predicting offsets versus predicting absolute values - when lifting a 2D coordinate estimate to 3D, for example, it is much more stable to keep the 2D estimate and regress a small offset from it than to predict the absolute value directly
- logarithmic depth prediction - to avoid vanishing or exploding gradients, especially when training on diverse "open world" image datasets with wildly varying depth ranges, it makes sense to regress logarithmic depth and then invert the log to recover the metric value
- CLIP (lol) - contrastive learning pushes apart the embeddings of dissimilar objects while pulling together the embeddings of similar ones
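To make the batch norm point concrete, here is a toy numpy comparison. Batch norm computes per-feature statistics across the batch axis, so with a batch of 2 each mean/variance is estimated from just two samples; layer norm computes per-sample statistics across features and doesn't care about batch size at all. This is a sketch, not either framework's exact implementation (no learned scale/shift, no running stats).

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 8))  # tiny batch of 2, feature dim 8

# Batch norm: per-feature stats computed ACROSS the batch —
# with batch size 2, each estimate comes from only two samples (hence the collapse).
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# Layer norm: per-sample stats computed across features — batch-size independent.
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)

print(bn.shape, ln.shape)  # (2, 8) (2, 8)
```

Layer norm gives the same result whether the sample arrives in a batch of 2 or 256, which is exactly why it behaves better on memory-constrained consumer GPUs.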
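The offset and log-depth tricks combine naturally in the 2D-to-3D lifting setting. A hedged sketch of the decode step (this is my illustration, not the actual 3D-MOOD parameterization): the head outputs a small pixel offset from an existing 2D estimate plus a log-depth, and decoding undoes both.

```python
import numpy as np

def decode(center_2d, pred_offset, pred_log_depth):
    """Recover a pixel location and metric depth from well-scaled regression targets."""
    u, v = center_2d + pred_offset   # small offset from the 2D estimate: stable target
    z = np.exp(pred_log_depth)       # undo the log to get metric depth
    return u, v, z

# Example: refine an estimated center of (510, 300) px, depth around 12 m.
u, v, z = decode(np.array([510.0, 300.0]), np.array([2.0, -1.5]), np.log(12.0))
print(u, v, z)
```

The point is that both regression targets stay in a small, roughly unit-scale range whether the object is 2 m or 200 m away, which keeps gradients sane on mixed "open world" data.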
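And a toy contrastive objective in the CLIP spirit (a sketch with random embeddings, not CLIP's actual training code): each image embedding is scored against every text embedding, and cross-entropy makes the matched (diagonal) pair's similarity go up while mismatched pairs get pushed down.

```python
import numpy as np

def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric-ish toy InfoNCE: row i's correct 'class' is text i."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)  # unit-normalize
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                      # [N, N]; diagonal = matched pairs
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(img))
    return -logp[idx, idx].mean()                           # cross-entropy over each row

rng = np.random.default_rng(2)
img, txt = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
print(contrastive_loss(img, txt))
```

Minimizing this pulls each matched image/text pair together and pushes the rest apart, which is the "pull similar, push dissimilar" behavior in the bullet above.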