This article identifies, using a zero-shot method (Gen6D), the 3D bounding box of a target at a far distance from a UAV. Furthermore, it infers the pose of the camera attached to the drone, based on the underlying training on the visual data. These visual data are used within a YOLO framework to detect targets belonging to a given class. The vertices of the orthogonal 3D box are then used in a visual-servoing scheme for the gimbal mounted on the UAV. The camera has a varying focal length (zoom), and the indirect objective is to move the UAV closer to the target while reducing the zoom factor. Initially, the UAV starts with a large zoom factor (36×) at a far distance (100 m) from the target. The UAV then approaches the target using the visual-servoing scheme, reducing its zoom in discrete steps while maintaining focus. Experimental results demonstrate the effectiveness of the proposed method.
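The approach-with-zoom-reduction loop described above can be sketched as follows. This is a minimal illustrative mock, not the paper's implementation: the detection and pose-estimation outputs (Gen6D, YOLO) are simulated, the distance thresholds for each discrete zoom step are assumed, and the servo law is reduced to a simple proportional controller on the pixel error of the projected box center (a stand-in for the full vertex-based scheme).

```python
# Hypothetical sketch of the zoom-scheduled visual-servoing loop.
# All thresholds, gains, and function names are illustrative assumptions.

# Assumed mapping: (minimum distance in meters, zoom factor) per discrete step,
# starting at 36x for the initial 100 m stand-off.
ZOOM_SCHEDULE = [(70, 36), (45, 24), (25, 12), (12, 6), (5, 3), (0, 1)]

def zoom_for_distance(d):
    """Pick the discrete zoom factor for the current target distance."""
    for threshold, zoom in ZOOM_SCHEDULE:
        if d >= threshold:
            return zoom
    return 1

def servo_step(box_center_px, image_center_px, gain=0.01):
    """Proportional gimbal command (pan/tilt rates) from the pixel error of
    the projected 3D-box center relative to the image center."""
    ex = box_center_px[0] - image_center_px[0]
    ey = box_center_px[1] - image_center_px[1]
    return (-gain * ex, -gain * ey)

# Simulated approach: the UAV closes in from 100 m while zoom steps down.
distance = 100.0
image_center = (320, 240)
trace = []
while distance > 5.0:
    zoom = zoom_for_distance(distance)
    box_center = (360.0, 210.0)           # mock detection (would come from Gen6D/YOLO)
    pan_rate, tilt_rate = servo_step(box_center, image_center)
    distance -= 5.0                       # UAV advances one step toward the target
    trace.append((distance, zoom))
```

The schedule keeps the target large enough in the image for reliable detection at long range, then relaxes the magnification as the UAV closes in, so the zoom factor in the trace decreases monotonically from 36× toward 1×.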