Date: Nov 16th 2025

This is the first blog post of a few, dabbling back into computer vision for fun.

This weekend I read the following papers:

A big shift I learned about: abandoning anchor-based methods for query-based methods

This philosophical shift changes the question anchors tried to answer, "which of the predefined boxes contains an object?", into the more scalable question "where are the objects in this image?", letting the model find them itself. The other important takeaway from this way of thinking is that the model learns its own "detectors": it figures out for itself which attributes of the image content to focus on (edges, corners, long lines, colors, etc.)
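To make the query idea concrete, here is a minimal numpy sketch of the core mechanism: a small set of learned object queries cross-attends over flattened image features, and each query emits its own box and class prediction. All dimensions, weights, and the toy class set are my own assumptions for illustration, not taken from any specific paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions): learned queries, image feature tokens, embed dim.
num_queries, num_tokens, d = 4, 16, 8

queries = rng.normal(size=(num_queries, d))   # learned object queries (trainable in practice)
features = rng.normal(size=(num_tokens, d))   # flattened CNN/ViT feature map

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: each query decides which image tokens to look at --
# the model learns its own "detectors" instead of scanning predefined anchors.
attn = softmax(queries @ features.T / np.sqrt(d))   # (num_queries, num_tokens)
attended = attn @ features                          # (num_queries, d)

# Tiny prediction heads per query: a box (cx, cy, w, h in [0, 1]) and class logits.
W_box = rng.normal(size=(d, 4))
W_cls = rng.normal(size=(d, 3))                     # e.g. {car, sign, no-object}
boxes = 1 / (1 + np.exp(-(attended @ W_box)))       # sigmoid -> normalized boxes
logits = attended @ W_cls

print(boxes.shape, logits.shape)   # (4, 4) (4, 3)
```

Each query ends up answering "where is *an* object?" rather than "is there an object in *this* anchor?", which is the whole philosophical shift in miniature.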

3D-MOOD really impressed me; it is a model specifically focused on 3DOD (3D Object Detection). From my notes, the pipeline is a DETR (DEtection TRansformer), but unlike something like the SOTA RF-DETR, it produces a 2D query that feeds both a 2D bounding-box detection head and a 3D box head, which then uses metric depth to produce the 3D detection. I was a bit confused about how camera intrinsics work with this model, but I was able to get a decent result from a hosted space.
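As I understood the design, the interesting part is that one shared query drives two heads. Here is a numpy sketch of that idea; the head shapes, the 3D parameterization, and the log-depth output are my assumptions for illustration, not the model's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
query = rng.normal(size=(d,))   # a decoded 2D object query (hypothetical values)

# One shared query, two heads (a sketch of the idea, not the real weights):
W_2d = rng.normal(size=(d, 4))  # 2D head -> (cx, cy, w, h), normalized
W_3d = rng.normal(size=(d, 7))  # 3D head -> (dx, dy, log_depth, w, h, l, yaw)

box2d = 1 / (1 + np.exp(-(query @ W_2d)))   # sigmoid -> 2D box in [0, 1]
raw3d = query @ W_3d
depth = np.exp(raw3d[2])                    # metric depth via log-depth (assumption)

print(box2d, depth)
```

The 2D head keeps the usual DETR-style detection objective, while the 3D head's depth output is what lets the model lift each detection into metric 3D space.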

Car Detection

Parameters for above: fx: 1400, fy: 800, cx: 1024, cy: 576, prompt: car.surfboard.sign.mountain, threshold: 0.1

Notably, it missed all the non-car classes. I assume that is because it has fewer examples of these in the training set? Not entirely sure, but I would love to understand how to make it work for these other classes, maybe through some prompting or slight finetuning.


A list of cool techniques I've just read about: