Real-time 3D Object Detection on Point Clouds
Tartu Ülikool
The demand for precise and fast object detection frameworks has increased since the
autonomous vehicle industry started to attract more attention. While the progress made
so far in the 2D object detection task with state-of-the-art approaches such as convolutional
neural networks seems promising, we still struggle to obtain the same level of performance
in 3D modalities such as lidar point clouds. The main reasons are that point clouds
are sparse and three-dimensional, whereas state-of-the-art 2D object detection models work
on dense camera images. Some early works tried to ease these challenges using
either 3D convolutional neural networks or bird's eye view approaches; nevertheless,
they were not able to achieve the desired level of performance in 3D perception.
PointPillars is one of the recent models that runs fast with good accuracy on point
clouds. Its main advantage arises from the way it encodes the points in each pillar into
spatial features using PointNet. It divides the whole point cloud into a grid of vertical
pillars and applies a state-of-the-art 2D detection network to the resulting top-down view,
in which the spatial features are encoded. Although this operation enables the network to
keep the positional information of the points within each pillar, it does not take into
account the point densities in different parts of the point cloud.
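The pillar grouping described above can be illustrated with a minimal sketch in plain Python. It is not the PointPillars implementation: the function name `pillarize`, the grid ranges, and the pillar size are hypothetical choices made for illustration, and the PointNet feature encoding is omitted. The sketch only assigns points to vertical pillars on a top-down grid and counts them, which makes the variation in point density between pillars visible.

```python
from collections import defaultdict


def pillarize(points, x_range=(0.0, 4.0), y_range=(0.0, 4.0), pillar_size=1.0):
    """Group (x, y, z) points into vertical pillars on a top-down grid.

    Returns a dict mapping (col, row) grid indices to the list of points
    inside that pillar. Points outside the given ranges are dropped.
    All parameter values here are illustrative assumptions.
    """
    pillars = defaultdict(list)
    for x, y, z in points:
        inside_x = x_range[0] <= x < x_range[1]
        inside_y = y_range[0] <= y < y_range[1]
        if not (inside_x and inside_y):
            continue
        # The z coordinate is ignored for the grid index: a pillar spans
        # the full height of the scene.
        col = int((x - x_range[0]) // pillar_size)
        row = int((y - y_range[0]) // pillar_size)
        pillars[(col, row)].append((x, y, z))
    return dict(pillars)


# Toy point cloud: two points close together, one far away, one out of range.
points = [(0.2, 0.3, 1.0), (0.8, 0.1, 0.5), (3.5, 3.9, 0.2), (9.0, 9.0, 0.0)]
pillars = pillarize(points)

# Number of points per non-empty pillar, a simple proxy for local density.
counts = {index: len(pts) for index, pts in pillars.items()}
```

In this toy example the pillar at grid index (0, 0) holds two points and the pillar at (3, 3) holds one, while all other pillars are empty; this is the kind of density difference that the encoding itself does not expose to the network.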
This thesis aims to improve the PointPillars network by utilizing positional encoding
and extending the detection area. Positional encoding helps the network utilize positional
features by introducing two additional input channels before each convolutional and
deconvolutional layer. Additionally, different positional encoding schemes are compared
to gain more insight into the effectiveness of the positional channels introduced.
Moreover, this thesis presents a simple scheme to train a 360-degree model with
ground truths provided only for the camera field of view (FOV).
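One possible reading of the two additional input channels is sketched below in plain Python. The abstract does not specify what the channels contain, so this sketch assumes that they hold the normalized x and y coordinates of each cell in the feature map. The function name `add_positional_channels` and the scaling to the range [-1, 1] are hypothetical and may differ from the schemes compared in the thesis.

```python
def add_positional_channels(feature_map):
    """Append two coordinate channels to a C x H x W feature map.

    The feature map is a nested list: channels, then rows, then columns.
    The appended channels hold the column (x) and row (y) position of
    each cell, scaled to [-1, 1]. This encoding is an assumption made
    for illustration only.
    """
    height = len(feature_map[0])
    width = len(feature_map[0][0])

    def normalize(index, size):
        # Map index 0..size-1 linearly onto [-1, 1]; a single cell maps to 0.
        return 0.0 if size == 1 else 2.0 * index / (size - 1) - 1.0

    x_channel = [[normalize(col, width) for col in range(width)]
                 for _ in range(height)]
    y_channel = [[normalize(row, height) for _ in range(width)]
                 for row in range(height)]
    # A new outer list is returned; the input list is left unmodified.
    return feature_map + [x_channel, y_channel]


# One feature channel with 2 rows and 3 columns.
features = [[[0.5, 0.1, 0.9],
             [0.3, 0.7, 0.2]]]
augmented = add_positional_channels(features)
```

Applying such a function to the input of a layer increases its channel count by two, so the layer that follows would need to accept two extra input channels.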
The positional encoding scheme provides better accuracy at a speed similar to that of the
original network. On the other hand, even though a 360-degree model is supposedly the type
of model that should be used with lidar, the experiments show that it outputs many
false positives (FPs).
Object detection, 3D human detection, Positional encoding, Data augmentation