In multimodal multi-object tracking (MOT) based on point clouds and images, current research predominantly focuses on improving tracking accuracy and often neglects computational efficiency. Consequently, these models struggle to track well in scenarios that demand high real-time performance. To address this, this paper introduces MF-Net, a fast multi-object tracking model based on multimodal fusion. The model is divided into three primary modules: object detection, multimodal fusion, and trajectory matching. First, a 2D detector identifies objects in the image and computes their posterior estimate, and a 3D classification network extracts the foreground points of each object from the point cloud. Next, a perspective projection module determines the transformation matrix and the minimum number of vertex pairs needed to map the coordinates of the foreground points onto a 2D plane. On this basis, a Planar Gaussian Function (PGF) model is constructed to fit, from the foreground points, small and difficult objects that were missed in the image, thus compensating for the limitations of 2D detectors and ensuring accuracy while reducing training time. Finally, trajectory matching is performed on the merged objects. The performance of MF-Net has been verified through extensive experiments on the publicly available KITTI and nuScenes datasets. Compared with existing competitive models, our algorithm substantially improves both detection and tracking performance, achieving satisfactory accuracy while delivering superior real-time efficiency.
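The projection of foreground points onto the image plane can be illustrated with a minimal sketch. It assumes the points are already expressed in the camera frame and that a 3 x 4 projection matrix is available, as in typical camera calibration files. The function name `project_points` and the matrix values are hypothetical and are not taken from MF-Net; how MF-Net itself determines the transformation matrix is not shown here.

```python
import numpy as np

def project_points(points_xyz, P):
    """Project N x 3 camera-frame points to pixel coordinates.

    P is an assumed 3 x 4 perspective projection matrix (illustrative only).
    Returns the projected (u, v) pairs and a mask of points in front of the camera.
    """
    pts = np.asarray(points_xyz, dtype=float)
    P = np.asarray(P, dtype=float)
    # Homogeneous coordinates: append a column of ones -> N x 4
    homo = np.hstack([pts, np.ones((pts.shape[0], 1))])
    # Apply the projection -> N x 3 rows of (u*w, v*w, w)
    uvw = homo @ P.T
    depth = uvw[:, 2]
    # Keep only points with positive depth (in front of the camera)
    in_front = depth > 0
    uv = uvw[in_front, :2] / depth[in_front, None]
    return uv, in_front

# Hypothetical intrinsics: focal length 100 px, principal point (50, 40)
P_example = [[100.0, 0.0, 50.0, 0.0],
             [0.0, 100.0, 40.0, 0.0],
             [0.0, 0.0, 1.0, 0.0]]
uv_example, mask_example = project_points([[1.0, 2.0, 10.0], [0.0, 0.0, -5.0]], P_example)
```

Points behind the camera are dropped before the perspective division, since dividing by a non-positive depth would yield meaningless pixel coordinates.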