HVPR: Hybrid Voxel-Point Representation for
Single-stage 3D Object Detection


Qualitative results on the validation split of KITTI. Our predictions and ground-truth bounding boxes are shown in green and red, respectively. Our method localizes small and/or occluded objects well, except for heavily occluded ones, e.g., in the bottom-left of the top-middle image. We also show 2D bounding boxes projected from the 3D detection results.


We address the problem of 3D object detection, that is, estimating 3D object bounding boxes from point clouds. 3D object detection methods exploit either voxel-based or point-based features to represent 3D objects in a scene. Voxel-based features are efficient to extract, but they fail to preserve the fine-grained 3D structures of objects. Point-based features, on the other hand, represent the 3D structures more accurately, but extracting them is computationally expensive. We introduce in this paper a novel single-stage 3D detection method that has the merits of both voxel-based and point-based features. To this end, we propose a new convolutional neural network (CNN) architecture, dubbed HVPR, that integrates both features into a single 3D representation effectively and efficiently. Specifically, we augment the point-based features with a memory module to reduce the computational cost. We then aggregate the features in the memory that are semantically similar to each voxel-based one, obtaining a hybrid 3D representation in the form of a pseudo image, which allows us to localize 3D objects efficiently in a single stage. We also propose an Attentive Multi-scale Feature Module (AMFM) that extracts scale-aware features considering the sparse and irregular patterns of point clouds. Experimental results on the KITTI dataset demonstrate the effectiveness and efficiency of our approach, achieving a better compromise in terms of speed and accuracy.
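The core idea of retrieving memory items that are semantically similar to each voxel-based feature can be pictured as soft attention over a small memory bank. The NumPy sketch below is only an illustration of that retrieval step under our own assumptions (function name, cosine similarity, and concatenation into a hybrid feature are ours, not the paper's exact design):

```python
import numpy as np

def aggregate_memory(voxel_feats, memory, temperature=1.0):
    """Hypothetical sketch: softly retrieve memory items similar to
    each voxel-based feature, then form a hybrid representation."""
    # Cosine similarity between voxel features and memory items.
    v = voxel_feats / (np.linalg.norm(voxel_feats, axis=1, keepdims=True) + 1e-8)
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    sim = v @ m.T / temperature                     # (num_voxels, num_items)
    # Softmax over memory items gives per-voxel retrieval weights.
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    retrieved = w @ memory                          # (num_voxels, feat_dim)
    # Hybrid feature: voxel-based feature plus its memory retrieval.
    return np.concatenate([voxel_feats, retrieved], axis=1)
```

At test time, a scheme like this avoids running the expensive point-based branch: only the (fixed-size) memory is queried, regardless of how many points the scene contains.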

Runtime and accuracy comparison of detection results on the KITTI test set. We compare our model with voxel-based methods on the car class for three difficulty levels. Voxel-based methods using pseudo image representations (Voxel-PI) are shown as circles, and 3D voxel-based methods (Voxel-3D) are plotted as triangles. Our method gives a better compromise in terms of accuracy and runtime for all cases.
SE: SECOND; PP: PointPillars; TA: TANet; AS: Associate-3D; 3D: 3DIoULoss; SA: SA-SSD; HS: HotSpotNet.


An overview of our framework. The HVPR network takes point clouds as input and generates two types of hybrid 3D features via a two-stream encoder: voxel-point and voxel-memory representations. The former are obtained by aggregating point-based features for individual voxel-based ones. For the latter, we perform the same aggregation but with memory items instead of point-based features. That is, we augment the point-based features using a memory module, and exploit the voxel-memory representations, i.e., hybrid 3D features, at test time for fast inference. The backbone network with AMFM takes the voxel-memory representations as input to extract multiple scale-aware features, and the detection head predicts 3D bounding boxes and object classes.
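After the encoder, the per-voxel hybrid features are scattered back to their bird's-eye-view locations to form the pseudo image consumed by the 2D backbone. A minimal sketch of that scatter step, assuming a simple (row, col) coordinate layout of our own choosing:

```python
import numpy as np

def to_pseudo_image(voxel_feats, coords, grid_hw):
    """Hypothetical sketch: scatter per-voxel features onto a BEV grid,
    yielding a (C, H, W) pseudo image for a 2D CNN backbone.

    voxel_feats: (N, C) features of non-empty voxels.
    coords:      (N, 2) integer (row, col) BEV indices per voxel.
    grid_hw:     (H, W) size of the BEV grid.
    """
    C = voxel_feats.shape[1]
    H, W = grid_hw
    canvas = np.zeros((C, H, W), dtype=voxel_feats.dtype)
    # Empty voxels stay zero; occupied cells receive their features.
    canvas[:, coords[:, 0], coords[:, 1]] = voxel_feats.T
    return canvas
```

Because only non-empty voxels are stored and scattered, the dense pseudo image is built cheaply even though most of the grid is empty, which is what makes single-stage 2D-backbone detection on point clouds fast.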


J. Noh, S. Lee, B. Ham
HVPR: Hybrid Voxel-Point Representation for Single-stage 3D Object Detection
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021
[Paper on arXiv] [Code will be released soon]




This research was partly supported by the R&D program for Advanced Integrated-intelligence for Identification (AIID) through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2018M3E3A1057289), and by the Institute for Information and Communications Technology Promotion (IITP) funded by the Korean Government (MSIP) under Grant 2016-0-00197.