Original article was published by /u/gold_twister on Deep Learning
I’m working on a project where I need to estimate the 6DOF pose of a known 3D CAD object in a single RGB image – i.e. this task: https://paperswithcode.com/task/6d-pose-estimation. There are several constraints on the problem:
- Must be usable commercially (licensed under BSD, MIT, Boost, etc.), not GPL.
- The CAD object is known and we do NOT aim for generality (e.g. recognizing the whole class of chairs).
- The CAD object can be uploaded by a user, so it may have symmetries and a range of textures.
- Inference step will be run on a smartphone, and should be able to run at >30fps.
- The inference step can either a) find the pose of the object once, after which I can write code to keep tracking it, or b) find the pose of the object continuously. That is, the model doesn’t need to have any continuous refinement steps after the initial pose estimate is found.
- Can be anywhere on the scale from a single instance of a single object to multiple instances of multiple objects (MiMO). MiMO is preferred, but not required.
- If a deep learning approach is used, the training time required for a new CAD object should be on the order of hours, not days.
- Can either 1) just find the initial pose of an object and not have any refinement steps after or 2) find the initial pose of the object and also have refinement steps after.
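To make option a) above concrete, here is a minimal sketch of a detect-once-then-track loop. It is only an illustration under assumptions: `run_pipeline`, `detect`, `track`, and the `Pose` alias are hypothetical names, not part of any of the libraries mentioned in this post, and the re-detection-on-loss policy is one possible choice.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Placeholder pose type (hypothetical), e.g. (rx, ry, rz, tx, ty, tz).
Pose = Tuple[float, ...]

# Time available per frame at the 30 fps target: about 33.3 ms.
FRAME_BUDGET_MS = 1000.0 / 30.0


def run_pipeline(
    frames: Sequence[object],
    detect: Callable[[object], Optional[Pose]],
    track: Callable[[object, Pose], Optional[Pose]],
) -> List[Optional[Pose]]:
    """Detect the pose once, then track it; re-detect when tracking is lost."""
    pose: Optional[Pose] = None
    poses: List[Optional[Pose]] = []
    for frame in frames:
        if pose is None:
            # Expensive step: initial pose estimate (learned model or matching).
            pose = detect(frame)
        else:
            # Cheap step: frame-to-frame update; returns None if the object is lost.
            pose = track(frame, pose)
        poses.append(pose)
    return poses
```

With this structure only the tracker has to fit inside the per-frame budget; the detector can be slower as long as it runs rarely.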
I am open to traditional approaches (e.g. establishing 2D-3D correspondences and then solving with PnP), but it seems like deep learning approaches outperform them (classical methods appear to be too slow; see the thread "Real time 6D pose estimation of known 3D CAD objects from a single 2D image or point clouds from RGBD Camera when objects are one on top of the other?"). Looking at deep learning approaches (PoseCNN, HybridPose, Pix2Pose, CosyPose), it seems most of them match these constraints, except that they require model training time. Perhaps I could use a single pre-trained model and then specialize it for each new CAD object with a shorter training step, but I am not sure of this, and I think success probably depends on the specific model chosen. For example, this project says it requires 3 hours of training time: https://github.com/DLR-RM/AugmentedAutoencoder.
So, my question: does anybody know what the state-of-the-art, commercially usable implementation is that doesn’t require extensive training time for a new CAD object?