Fast 3D reconstruction with semantic information is in high demand for autonomous navigation on road scenes, and it involves both geometry and appearance in the field of computer vision. In this work, we propose a method for fast 3D semantic mapping based on monocular vision. Owing to their low cost and easy installation, monocular cameras are widely mounted on modern vehicles for advanced driver assistance, which makes it possible to acquire both semantic information and a 3D map from the same sensor. The monocular image sequence is used to estimate the camera pose, compute depth, and predict the semantic segmentation; 3D semantic mapping is then realized by combining the techniques of localization, mapping, and scene parsing. Our method builds the 3D semantic map by incrementally transferring 2D semantic labels to the 3D point cloud, and a global optimization further improves labeling accuracy by enforcing spatial consistency. In our framework, there is no need to run semantic inference on every frame of the sequence, since the semantically labeled mesh corresponds only to sparse reference frames. This saves a large amount of computation and allows the mapping system to run online. We evaluate the system on naturalistic road scenes, e.g., KITTI, and observe a significant speed-up in the inference stage by labeling on the mesh.
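The core step of transferring 2D semantic labels to a 3D point cloud can be sketched as back-projecting each labeled pixel through the depth map and camera intrinsics. The following is a minimal illustrative sketch, not the paper's implementation: the helper name `backproject_labels` and the toy depth, segmentation, and intrinsics values are all assumptions for demonstration.

```python
import numpy as np

def backproject_labels(depth, seg, K):
    """Back-project a per-pixel depth map into camera-frame 3D points and
    attach each pixel's 2D semantic label (illustrative helper, not the
    authors' code). depth and seg are HxW arrays; K is the 3x3 intrinsics."""
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.ravel()
    valid = z > 0                        # skip pixels without a depth estimate
    # Pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy
    x = (us.ravel() - K[0, 2]) * z / K[0, 0]
    y = (vs.ravel() - K[1, 2]) * z / K[1, 1]
    pts = np.stack([x, y, z], axis=1)[valid]
    labels = seg.ravel()[valid]
    return pts, labels

# Toy example: a 2x2 depth map with one missing depth value.
K = np.array([[100.0, 0.0, 1.0],
              [0.0, 100.0, 1.0],
              [0.0, 0.0, 1.0]])
depth = np.array([[2.0, 0.0],
                  [4.0, 2.0]])
seg = np.array([[1, 1],
                [2, 3]])               # per-pixel semantic class ids
pts, labels = backproject_labels(depth, seg, K)
print(pts.shape, labels.tolist())      # (3, 3) [1, 2, 3]
```

In a full pipeline, the returned points would be transformed into the world frame via the estimated camera pose and fused across the sparse reference frames; the global optimization over spatial consistency then reconciles conflicting labels where points from different frames overlap.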