In the Industry 4.0 era, video applications such as surveillance systems, video conferencing, and video broadcasting play a vital role. In these applications, the quality of the decoded video should be consistent when manipulating and tracking objects, because it largely affects the performance of machine analysis. To address this problem, we propose a novel perceptual video coding (PVC) solution in which a full-reference quality metric, Video Multimethod Assessment Fusion (VMAF), is employed together with a deep convolutional neural network (CNN) to obtain consistent quality while still achieving high compression performance. First, to meet the consistent-quality requirement, we propose a CNN model that takes an expected VMAF score as input to adaptively adjust the quantization parameter (QP) for each coding block. Second, to increase compression performance, the Lagrange multiplier of the rate-distortion optimization (RDO) mechanism is adaptively computed from Rate-QP and Quality-QP models. Experimental results show that the proposed PVC solution achieves both targets simultaneously: the quality of the video sequence is kept consistent with the expected quality level, and the bitrate saving is higher than that of traditional video coding standards and a relevant benchmark, with around 10% bitrate saving on average.