AI Video Editing: A Survey

Abstract: Video editing is a demanding job: it requires skilled artists with plentiful physical stamina and multidisciplinary knowledge, such as cinematography and aesthetics. Consequently, a growing body of research focuses on semi-automatic and even fully automatic solutions to reduce the workload. Since conventional methods are usually designed to follow a few simple guidelines, they lack the flexibility and capability to learn complex ones. Fortunately, advances in computer vision and machine learning make up for the shortcomings of traditional approaches and make AI editing feasible. No survey has yet drawn these emerging studies together. This paper summarizes the development history of automatic video editing, and especially the applications of AI in partial and full workflows. We emphasize video editing and discuss related work from multiple aspects: modality, type of input video, methodology, optimization, dataset, and evaluation metric. Besides, we also summarize progress in the image editing domain, e.g., style transfer, retargeting, and colorization, and examine the possibility of transferring those techniques to the video domain. Finally, we give a brief conclusion and discuss some open problems.


Introduction
Video has become a prime medium in our daily life. We see video anywhere, at any time: in elevators and cinemas, and on Reddit and Facebook. In the third quarter of 2021, there were 213 million Netflix subscribers and 200 million Amazon Prime subscribers, and over 14 million daily active users of TikTok. And [1] predicts that by 2022, video viewing will account for 82% of all internet traffic. Video consumption is also likely to be underestimated, because COVID-19 makes offline gatherings risky and international travel especially inconvenient, which boosts the demand for online information exchange. Video is becoming indispensable.
Consequently, there is a surge in video editing driven by diverse motivations. Editing raw footage into a watchable video is the basic aim [2][3]. Extracting highlights from videos is a trade-off between saving costs, e.g., time, storage space and internet traffic, and preserving the important information [4][5][6]. Such videos are typically long and most of their content is uneventful, such as surveillance videos [7] and egocentric videos [8]. Videos are always generated and edited for a purpose; when they are reused on different occasions, it is necessary to re-edit them. For example, films are usually cut to a certain aspect ratio, e.g., 16:9 or 4:3, and retargeting is needed when users play them on displays of various ratios and sizes [9][10]. Another motivation to edit video is to help retrieve information in videos, since traditional video retrieval requires viewers to watch the video, which is very time-consuming. Sivic et al. [11] first retrieve information from videos in the manner Google does for text: they summarize the features of objects into a 128-dimensional vector and transfer approaches from the natural language processing (NLP) field to retrieve certain objects. Besides, low-level features include likeness, regularity, color histogram, edge histogram, roughness, and so on [45]. A number of traditional algorithms address classical feature extraction: the MPEG-7 color layout descriptor [46], CENTRIST [47], VLFeat [48], SIFT [49] and HSV color moments [50]. Gist [51] emphasizes learning spatial structures and works well in describing natural scenes. The Histogram of Oriented Gradients (HOG) [52] contains five steps: gamma or colour normalization, gradient computation, spatial and orientation binning, local contrast normalization, and descriptor block partition. High-level features are machine-learning based; feature extractors generate high-level features directly from input images and videos, like C3D [53], or from low-level features, like the Viola-Jones face detector [54]. Feature detectors can be divided by their outputs. Saliency maps can be generated by CNN-based models [23][55][56][13] and LSTM-based models [57][58][59][24], while [26][60][61] explore 360° video saliency. Facial features are detected by the VGG-Face model [62], the Viola-Jones face detector [54], and other models [63]. Human pose detectors include [64], [65], and Poselets [66]. Object detection can be realized with Detect-and-Track [67], Faster R-CNN [68], the VGG model [69], the GoogLeNet model [70], and BVLC CaffeNet [71]. Jiang et al. [72] give a comprehensive study on bag-of-visual-words. Some works analyze the sentiment of videos [73].
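As a concrete illustration of low-level feature extraction, the following minimal sketch computes a HOG descriptor for a single video frame; it assumes scikit-image and OpenCV are installed, and the video path is a placeholder.

```python
import cv2
from skimage.feature import hog

# Read one frame from a (hypothetical) input video.
cap = cv2.VideoCapture("input.mp4")
ok, frame = cap.read()
cap.release()

if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # HOG pipeline: gradient computation, orientation binning, block normalization.
    descriptor = hog(gray,
                     orientations=9,
                     pixels_per_cell=(8, 8),
                     cells_per_block=(2, 2),
                     block_norm="L2-Hys")
    print(descriptor.shape)  # one low-level feature vector per frame
```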
Fartash et al. [21] propose VSE++, a visual-semantic embedding that provides convenience across modalities. Let $\phi(i; \theta_\phi) \in \mathbb{R}^{D_\phi}$ denote the feature of image $i$ and $\psi(c; \theta_\psi) \in \mathbb{R}^{D_\psi}$ the feature of text $c$, and let the joint embedding space be formed by linear projections $f(i; W_f, \theta_\phi) = W_f^T \phi(i; \theta_\phi)$ and $g(c; W_g, \theta_\psi) = W_g^T \psi(c; \theta_\psi)$, where $W_f \in \mathbb{R}^{D_\phi \times D}$ and $W_g \in \mathbb{R}^{D_\psi \times D}$. The similarity between image $i$ and text $c$ is then the inner product $s(i, c) = f(i; W_f, \theta_\phi) \cdot g(c; W_g, \theta_\psi)$.
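A minimal sketch of this joint embedding, assuming the image and text features have already been extracted (the dimensions, random stand-in features and L2 normalization are illustrative assumptions, not the exact VSE++ training setup):

```python
import numpy as np

rng = np.random.default_rng(0)
D_phi, D_psi, D = 2048, 300, 512          # placeholder feature/embedding dimensions

phi_i = rng.standard_normal(D_phi)        # pre-extracted image feature phi(i)
psi_c = rng.standard_normal(D_psi)        # pre-extracted text feature psi(c)

W_f = rng.standard_normal((D_phi, D))     # learned projection for images
W_g = rng.standard_normal((D_psi, D))     # learned projection for text

def l2_normalize(x):
    return x / np.linalg.norm(x)

f_i = l2_normalize(W_f.T @ phi_i)         # image embedding in the joint space
g_c = l2_normalize(W_g.T @ psi_c)         # text embedding in the joint space

similarity = float(f_i @ g_c)             # inner-product similarity s(i, c)
print(similarity)
```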

Text
Text acts as a main or supporting modality in video editing. For example, Pavel et al. [74] propose a semi-automatic system for creating digests of informational lecture videos using transcript-based interactions, and they also propose SceneSkim to summarize videos based on their lines. Transcripts are provided with the video or obtained from rev.com. Leake et al. [20] take a perfect transcript and multiple video recordings of a dialogue-driven scene as inputs and segment the recordings into pieces for each line of dialogue. [18] obtains transcripts from audio using rev.com and spots locations and visually significant entities (VSEs) using the Google NLP toolkit. [75] uses titles as important cues to summarize videos.
There are also many works utilizing multiple modalities. Wang et al. [32] propose a text-driven video editing algorithm that finds keywords in the input text; it encodes text and images with the visual-semantic embedding VSE++ [21] and then calculates the similarity between the two modalities. [76] is able to capture the relations between words in a text corpus, and Alayrac et al. [77] use it to extract the main steps in instructional videos. Xu et al. [78] propose a topic extraction model based on non-negative matrix factorization. Sener et al. [33] aim at understanding videos using visual and language hints to parse a video into semantic steps with a textual description for each, in an unsupervised way. Malmaud et al. [79] perform an alignment between textual instructions and instructional cooking videos using text and speech and create two corpora, one of aligned recipe-video pairs and the other of short video clips labeled with a cooking action and a noun phrase. WaveNet [80], a stacked dilated causal convolutional network, transforms text to speech and generates music. Truong et al. [81] combine the transcript, sound source analysis and mouth motion to identify the speaking face [82]. Cour et al. [83] use dialogue information to assign names to faces on screen. Similar work is [84], which uses subtitle and script text as weak supervision.

Input Video Domain
The typical workflow of editing a video consists of clip segmentation, feature extraction and tagging, clustering by theme, and selection according to some criteria. Dubbing needs to keep pace with the scene and the underscore should be appropriate. While editing multiple input tracks, cinematography is an important factor: scene changes must obey certain rules, such as the 180° rule, otherwise the output will make audiences dizzy. Some studies have already investigated these rules; for example, Serrano et al. [25] conduct user experiments on movie editing continuity in space, time and action.
The above describes the general workflow of video editing. For specific types of videos, editing solutions should adapt, because each genre has its own fixed patterns. Truong et al. point out that makeup videos always focus on the face, and the parts manipulated progress from the whole face to the eyes, lips and eyebrows [34]; besides, the makeup products appear at certain steps. Similarly, Chi et al. conclude that instructional videos for physical tasks progress from a mess to a finished product, and the unused components decrease over the procedure [85]. Thus, many editing solutions are designed for a certain genre. Here, we introduce editing algorithms according to their input types.

Instructional Videos
Malmaud et al. [86] propose a discrete-time, partially observed, object-oriented Markov Decision Process to reveal the latent context in a cooking recipe, in text, image or video, which often elides keypoints such as actions or state transitions. This model captures three conditional probabilities: 1) the probability of the next state of objects given the temporal action of a semantic frame and a state; 2) the probability of an action given temporal states and text sentences, based on semantic role labeling; 3) the probability of an action appearing in a video clip. In 2015, Malmaud extended this work, performing an alignment between textual instructions and instructional cooking videos using text and speech and creating two corpora, one of aligned recipe-video pairs and the other of short video clips labeled with a cooking action and a noun phrase [79]. He pre-processes videos and their user-uploaded textual recipes using a Bayes classifier model to obtain ASR tokens and parses the recipe text with an NLP model to resolve the zero-anaphora problem. There are several ways to align: 1) an HMM is trained to align each step of the recipe to a sequence of words in the ASR transcript and the Viterbi algorithm estimates the MAP sequence, after which the extracted recipe is aligned to video segments; 2) another approach to labeling video segments is keyword spotting, namely searching for verbs in the ASR transcript and finding their corresponding positions in the video with a fixed-size sliding window; due to errors in ASR, keyword spotting does not work well. The third way is to perform keyword spotting for the action in the ASR transcript as before, but use the HMM alignment to infer the corresponding object. This can still be refined further: a deep-learning-based visual food detector is trained, input video clips are fed to it to obtain probabilities for all candidates produced by the methods above, and the matching object is selected.
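The HMM alignment step relies on Viterbi decoding of the MAP state sequence. A generic sketch is given below; the state and observation spaces are placeholders, not the actual models from [79].

```python
import numpy as np

def viterbi(log_start, log_trans, log_emit, observations):
    """MAP state sequence of an HMM, with all parameters in log space.

    log_start: (S,)    log P(first state)
    log_trans: (S, S)  log P(next state | current state)
    log_emit:  (S, V)  log P(observation | state)
    observations: sequence of observation indices
    """
    S, T = log_start.shape[0], len(observations)
    dp = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    dp[0] = log_start + log_emit[:, observations[0]]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_trans      # scores[prev, cur]
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_emit[:, observations[t]]
    path = [int(dp[-1].argmax())]                    # backtrack the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```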
DemoCut [85], a semi-automatic editing tool for instructional videos of physical tasks, provides a lightweight annotation-based interface and adds temporal and visual effects as markers. The system processes single-take, single-camera footage. Users need to provide five types of markers: step, action, closeup, supply and cut-out. Based on those markers, DemoCut automatically applies effects. Alayrac et al. propose a novel unsupervised learning approach that combines the features of the input video and the associated narration to discover and locate the main steps in instructional videos [77].
Sener et al. [33] aim at understanding videos using visual and language hints to parse a video into semantic steps with a textual description for each, in an unsupervised way. This work discovers semantic steps from a video category and adopts a joint multimodal vision-language model for video parsing. Visual atoms are clusters obtained by a joint extension of spectral clustering over object proposals generated with the Constrained Parametric Min-Cuts algorithm, whereas language atoms are frequent salient words found with the tf-idf measure, and each frame is represented by the occurrence of atoms. The method uses Markov Chain Monte Carlo to learn and infer Beta-process hidden Markov models for understanding the time-series information.
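Frequent salient words can be scored with the tf-idf measure; a small sketch using scikit-learn follows, where the transcripts are invented placeholders rather than data from [33].

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder narrations, one per video of the same category.
transcripts = [
    "crack the eggs and whisk them in a bowl",
    "whisk the eggs then pour them into the hot pan",
    "pour the batter into the pan and flip it",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(transcripts)            # (documents, vocabulary)

# Rank the words of the first transcript by tf-idf weight.
vocab = vectorizer.get_feature_names_out()
weights = tfidf[0].toarray().ravel()
salient = sorted(zip(vocab, weights), key=lambda p: -p[1])[:5]
print(salient)
```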
Truong et al. [34] propose a multimodal method to automatically generate two-level hierarchical tutorials from instructional makeup videos, which allows users to navigate by clicking and by voice commands. The two levels are coarse events about objects and fine events about the actions that manipulate those objects. In this sense, the high level of a makeup video concerns facial parts: lips, eyes and face. Within each coarse-level event, each fine-grained action step consists of a sequence of demonstration and narration; besides, there are non-instructional introductions and conclusions. The system takes videos and their aligned transcripts as inputs, over-segments and labels them, and builds shot-phrase pairs. It constructs action steps according to product-introduction pairs, and constructs facial part groupings according to the number of times each facial part appears.

First-person and Sports Videos
First-person videos are generated from wearable or portable cameras that provide great convenience for non-professional users, so those videos contain motion blur, object occlusion, shake and other factors that degrade video quality. Thus, camera stabilization is a necessary step. Neel et al. [87] jointly smooth camera motion and speed up videos from hand-held cameras: they first evaluate the matching degree of each frame with its adjacent frames based on sparse feature matching [88][89], select the optimal path with a dynamic-time-warping algorithm, and smooth the path and render the output video. As Kopf does in [90], scene reconstruction, path planning and image-based rendering are the basic steps. Sun aims at generating montages from unconstrained videos that are inconsistent and probably contain motion blur and shake, such as egocentric footage [91][92]. The output montageable image contains the salient person with his salient actions drawn from multiple frames. The challenges behind this algorithm are human body detection and tracking, salient person detection, and action composition. Hamza et al. [93] focus on addressing videos generated from wireless capsule endoscopy.
The two most characteristic cues of sports videos are audio and motion. For example, [31][35] only use audio features such as MFCCs and energy to extract highlights of basketball games, while Li et al. [37] analyze football video based on deterministic reasoning and probabilistic inference with additional visual cues. However, such algorithms only identify and capture simple rules; latent and complex rules remain unknown. Thanks to advances in computer vision and machine learning, more and more algorithms and networks are designed to learn those rules and bring huge improvements in video editing. Hanjalic et al. first propose an excitement model for highlight extraction from sports videos [94]; this model utilizes motion, scene cut frequency, and the energy of the audio to fit a smooth curve, and the most exciting moments are the desired highlights. Sun et al. [95] propose a novel algorithm to score the highlightness of sports video clips.
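As a rough illustration of these audio cues, the sketch below uses librosa to extract MFCCs and frame-wise RMS energy and flags high-energy moments as highlight candidates; the file name and the 90th-percentile threshold are arbitrary assumptions, not the rule used in [31][35].

```python
import numpy as np
import librosa

# Load the game audio track (hypothetical file).
y, sr = librosa.load("basketball_game.wav", sr=None)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # (13, n_frames) timbre features
rms = librosa.feature.rms(y=y)[0]                     # frame-wise energy

# Flag frames whose energy exceeds the 90th percentile as highlight candidates.
threshold = np.percentile(rms, 90)
candidate_frames = np.where(rms > threshold)[0]
candidate_times = librosa.frames_to_time(candidate_frames, sr=sr)
print(candidate_times[:10])
```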

Animation and Film
As animation and film are both likely to be well edited already, re-editing mainly involves retargeting to fit different aspect ratios or super-resolution for old movies of poor quality.
Galvane et al. [96] present a detailed formalization of continuity editing for 3D animation, propose an automatic editing method based on a semi-Markov assumption whose parameters can be controlled, and validate the method through a user evaluation. Given a 3D animated scene, rushes from M cameras, and manually annotated time-aligned actions (subject, verb, object), a semi-Markov chain is built on an editing graph whose node actions and edge costs are decided by three guidelines (errors in conveying unfolding actions in each shot, violations of continuity editing rules in each cut, and errors in shot duration). Editing is then an optimization problem that can be solved by dynamic programming. A scene from "Back to the Future" was chosen as input; a study with 21 viewers shows that editing has an impact on the perceived quality of the observed video stimulus, but the perceived quality of the version done by an expert cinematographer is not significantly higher than that of the new method.
Pavel et al. [19] develop SceneSkim, a system with a UI that generates captions, scripts, and plot summaries of movies to support quick searching and retrieval. It aligns speech audio with the caption text using P2FA [28], and uses search time and accuracy as evaluation metrics. Jain et al. [97] also crop movies, and [98] also edits videos to different aspect ratios. Khoenkaw et al. [99] use the film grammar of each shot to guide cinematic feature extraction and generate an importance map at the server end; at the client end, a cropper retargets the film to the desired size. Shots are classified according to the camera behavior, and objects in each shot are further classified by their movements.

Lecture Videos
Lectures take place in a limited space and follow a fixed process: the speaker stands at the platform, back to a blackboard, facing a group of audience members, probably playing slides. For broadcasting, remote seminars and so on, many researchers have devoted themselves to fully automatic pipelines covering lecture recording, editing and post-processing.
Mukhopadhyay et al. [2] come up with the Cornell Lecture Browser, an automatic system that records lectures and generates multimedia representations. An overview camera records the entire lecture dais, and a tracking camera captures closeup shots of the speaker. After recording the lecture, editing requires the system to solve two key problems: synchronization and automatic editing. The synchronization between the two video tracks is done by artificially adding a synchronization point in one sound channel; the synchronization between slides and videos, and that between the slide titles and the slides, is based on feature matching. The Cornell Lecture Browser defines three principles to constrain the length of shots, so an edit decision list of shots can be calculated in two passes. At last, the Dalí algorithm [100] selects the final shots from the edit decision list.
Gleicher et al. [101] give a detailed analysis of the challenges and requirements of a virtual videography framework in the lecture domain. Later, Heck et al. [38] present an automatic lecture video editing system, Virtual Videography, consisting of four phases: media analysis, an attention model, a computational cinematographer and an image synthesis component. In the media analysis, the input video is segmented into foreground and background based on color to obtain a clean board stream, region objects on the board are identified, the instructor's gestures are recognized as one of three types (pointing left, pointing right and reaching up), and audio analysis determines whether the instructor is speaking at a given frame. The attention model determines which regions are important using a few guidelines; no complex models are used. In the computational cinematographer, a virtual camera determines the type of each shot in the source footage, camera tracking is applied, two kinds of video effects (ghosting shots and picture-in-picture shots) are added, the best shot sequence is found by solving a graph-based optimization problem, and transitions such as fades, pans and zooms are applied to some shots. In the image synthesis, bicubic interpolation is used to obtain high-quality results and other parameters are adjusted to keep coherence. Note that the video aspect ratio changes. There are several points to improve: first, the length of the video is not changed, while editing should remove unnecessary clips; second, which regions are important is decided simply, and more sophisticated algorithms could be used. Zhang et al. [3] propose iCam2, a fully automatic lecture capture system that supports capturing, broadcasting, viewing, archiving and searching of presentations. It is equipped with two microphones recording the speaker and the audience, and three cameras for the speaker, the audience and the slides, and uses a five-state finite state machine to switch shots among the speaker, the audience and the slides.
Pavel et al. [74] propose a semi-automatic system for creating digests of informational lecture videos using transcript-based interactions. The generated digest affords browsing and skimming through a textbook-inspired chapter/section organization of the video content. The input video is segmented into chapters, and each chapter is further segmented into sections; every section consists of several video segments with corresponding keyframes and brief text summaries. The text transcript of the input video is supplied with the video or obtained from rev.com. Users select video points or text points to segment videos, or the system applies BSeg twice to automatically segment the text, and hence the video, into sections and chapters. However, summaries are written by humans.
Shin et al. [44] present Visual Transcripts, a system that transforms a blackboard-style lecture video with transcript into a visual transcript interleaving visual content with the corresponding text. Ranjan et al. [102] propose a system that takes the outputs of several cameras and microphones and a motion capture system recording a meeting as inputs and generates an edited output video. The system is iteratively refined according to three criteria and advice from experts. The three criteria are: 1) it must capture enough visual information; 2) it must be compelling to watch; 3) it must not require substantial human effort. In the final prototype design, for an informal meeting scenario, three participants, each with a microphone to record audio and a headband to track location and motion, sit around a table with a whiteboard close by, and three cameras record the meeting. Four types of shots are defined: close-up, two-person, overview and shots of artifacts, and the transitions among the first three types are also fixed. Note that gaze and speaker history are leveraged for prediction, and camera control is restricted.

Dialogue Videos
Dialogue videos usually contain closeup shots of speakers and are driven by lines. There are two keypoints of dialogue video editing: cinematography for humans and text-driven editing. He et al. [103] outline several guidelines explicitly. For example, there are five useful camera distances defined by where actors are cut: cutting at the neck is an extreme closeup, under the chest or at the waist a closeup, at the crotch or under the knees a medium view, the entire person a full view, and a distant perspective a long view; cutting at the ankles, however, looks very ugly. They also list some constraints: don't cross the line, avoid jump cuts, use establishing shots, let the actor lead, and break movement.
Leake et al. [20] take a perfect transcript and multiple video recordings of a dialogue-driven scene as inputs. Their system segments the recordings into pieces for each line of dialogue and then concatenates them according to the idioms selected by users. There are 13 basic idioms: avoid jump cuts, change zoom gradually, emphasize character, intensify emotion, mirror position, peaks and valleys, performance fast/slow, performance loud/quiet, short lines, speaker visible, start wide, zoom consistent, and zoom in/out. As the concatenation process is modelled as a hidden Markov model, each idiom corresponds to different start, transition or emission probabilities. If several basic idioms are used, the corresponding HMM parameters are combined by a weighted element-wise product and then normalized. Besides, the silence before and after a line can control the video's style and tone.
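The idiom combination step can be sketched as a weighted element-wise product of the per-idiom HMM parameters followed by renormalization; the toy transition matrices below are illustrative, not the actual idiom definitions from [20].

```python
import numpy as np

def combine_idioms(transition_mats, weights):
    """Weighted element-wise product of idiom transition matrices, renormalized.

    transition_mats: list of (S, S) row-stochastic matrices, one per idiom
    weights: list of positive idiom weights
    """
    combined = np.ones_like(transition_mats[0])
    for T, w in zip(transition_mats, weights):
        combined *= T ** w                              # weighted element-wise product
    combined /= combined.sum(axis=1, keepdims=True)     # re-normalize each row
    return combined

# Two toy 2-state idioms combined with equal weights.
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.5, 0.5], [0.5, 0.5]])
print(combine_idioms([A, B], [1.0, 1.0]))
```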
Berthouzoz et al. [104] present a semi-automatic system with an interface to help place cuts and transitions in interview video. Given an interview video, users can obtain its transcription from castingwords.com and align text to audio with Virage. The system suggests appropriate cut locations by calculating a cut suitability score and performs cutting where users delete text. The cut suitability score of frame $i$ combines an audio term $S_a(i)$ and a video term $S_v(i)$: $S_a(i)$ is set to 1 when frame $i$ lies between two words and to 0 otherwise, and since interview video mainly contains people, $S_v(i)$ is defined in terms of the distance of the mouth, eyes and body between two frames [52][63]. Once cutting is finished, the system generates visible transitions, pauses and hidden transitions, and users decide which one to use. A hidden transition is seamless and requires the system to compute dense optical flow and determine the number of frames to interpolate with a data-driven approach. Pauses are generated similarly to hidden transitions, and their corresponding audio is copied from the background noise of the environment. Visible transitions result in noticeable changes and consist of jump cuts, fades and so on.
Truong et al. [81] propose ConvCut to generate shareable highlights of 360° videos of social conversations without requiring headsets. The transcript is obtained from rev.com and is aligned to the audio tracks with a phoneme-mapping method. ConvCut then splits the raw 360° video into one clip per sentence. With multimodal analysis, ConvCut gathers information about the clips, such as the spatial location of faces [82], the topics of conversation [78], laughter, gestural motion [64][65], and facial expression changes. Users select lines and ConvCut automatically edits the corresponding clips into a normal-field-of-view (NFOV) video.
Cheng et al. [105] propose the CREATE model to restore dropped frames of talking-head video streams whose audio is complete. Given an input video with dropped frames and its complete audio, CREATE first aligns frames with the audio and determines which frames are lost using mouth shape and motion. Based on the remaining frames, a GAN is used to generate frames that correspond to the audio.

360° Videos
Sitzmann et al. [26] apply saliency prediction algorithms for 2D videos to VR and find that equirectangular projection works better than cube-map and patch-based methods, and that ML-Net with an equator bias predicts better than a naive equator bias or SalNet plus an equator bias. The work also explores predicting time-dependent saliency with a window and speed, achieving an average CC score of 0.57. Based on those insights and a test with a CC score of 0.50, it finds that head orientation can be used for saliency prediction. Fourth and last, it explores several simple applications of VR saliency prediction: automatic alignment of cuts in VR video by maximizing the correlation between the saliency maps of the last frame of the first segment and the first frame of the second, panorama thumbnails, panorama video synopsis, and saliency-aware VR image compression.
In [14], Su et al. define the problem of automatically generating NFOV videos from 360° videos as Pano2Vid and propose AUTOCAM, a system that takes a dynamic panoramic 360° video as input and outputs several natural-looking NFOV videos. The minimal clip is defined as a glimpse with a fixed horizontal angle (e.g., 65.5°), a fixed aspect ratio (4:3), and a length of 5 seconds. AUTOCAM learns the latent essence of capture-worthiness from a large set of NFOV videos downloaded from YouTube using convolutional 3D features and then predicts the capture-worthiness of all candidates of a 360° video with a classifier. In the first step, AUTOCAM determines the worthiest glimpse, then finds the next best glimpse within a certain area around it to avoid large jumps and preserve continuity; the algorithm is weakly supervised. [12] extends AUTOCAM in three aspects. First, it generalizes the Pano2Vid task to allow spatial selections within the 360° video and multiple FOVs (104.3°, 65.5°, 46.4°): the 360° video is divided into many ST-glimpses and their capture-worthiness scores are predicted by logistic regression. Second, it presents a coarse-to-fine search that iteratively refines the camera control while reducing the search space in each iteration to find the best trajectory. Third, it explores sampling a time window and forbidding the trajectories of the current iteration from selecting the same ST-glimpses as the solutions of previous iterations within the window, so as to generate a diverse set of plausible NFOV outputs. Pavel et al. [13] explore two kinds of shot orientation controls for 360° videos: a viewpoint-oriented technique and an active reorientation technique. Given a spherical 360° video, cut times for each shot boundary, and the location in the panorama of one or more important points within each shot, the viewpoint-oriented technique guarantees that viewers initially see the most important content in the shot at each shot change, whereas the active reorientation technique allows viewers to reorient the shot by pressing a button so that the important content lies in their field of view. [15] proposes a deep learning-based agent for automatically piloting through 360° sports videos, which leverages a state-of-the-art object detector, Faster R-CNN, to propose a few candidate objects of interest, selects the main object with a recurrent neural network, and regresses a shift in viewing angle to move to the next one.
Lai et al. [106] propose a system for converting a 360° video into an NFOV hyperlapse sampled non-uniformly in space and time using visual saliency and semantics. The input 360° video is first stabilized and the focus of expansion (FOE) is estimated to track the forward camera motion; the video is then over-segmented into temporal superpixels (TSPs), whose saliency scores are computed from their color and motion distance to the other TSPs. A fully convolutional network (FCN) [107] is used to label the TSPs, and the top-3 TSPs with maximum scores are detected as ROIs; alternatively, users can choose ROIs through the UI. After analyzing the video content, camera path planning consists of three phases. Given the detected ROIs and the FOE, a smooth camera path is extracted by minimizing a cost function over them. In the second phase, the 360° video is rendered into an NFOV video with a fixed field of view and a set of optimal frames is selected in terms of saliency scores, frame alignment errors, and speed and acceleration penalties. Once the optimal frames are selected, a zooming effect is added according to user preferences and the size of the interesting regions. To stabilize the 2D NFOV video, a set of feature trajectories is extracted using Harris corners and BRIEF descriptors [89], three motion models are computed for each frame by RANSAC [88], AIC [108] selects the best motion model for each frame, and a single-path scheme polishes the camera motion [109].
Tang et al. [110] come up with a solution to direct 360° videos consisting of five steps: feature tracking, keyframe selection, motion estimation, keyframe path planning with cinematographic constraints, and joint optimization. They propose a new motion estimation algorithm based on feature correspondences to handle the rotation and translation between adjacent keyframes. Besides, the system provides a unified framework for defining constraints on the outputs by clicking on the ER projection or by recording a guided viewing session.

Surveillance and Multi-view Videos
Surveillance cameras are fixed, so their videos are free of motion blur and shake; the main problem is to reduce the redundancy in surveillance videos. Panda et al. [7] aim at summarizing multi-view surveillance videos without any prior assumptions. Multi-camera surveillance networks usually have overlapping fields of view. The method first segments videos into shots according to changes in the RGB and HSV color spaces and extracts their C3D features using [53]. The shot features are embedded using subspace clustering [111] and then a sparse representative selection is performed; these two steps are carried out jointly, and half-quadratic optimization techniques [112][113] are used to solve the resulting optimization problem. From the optimal sparse coefficient matrix, a weight curve is generated using the L2 norms of its rows, and the optimal summary segments are extracted at the local maxima of the curve. The evaluation uses six multi-view datasets with 36 videos from [6][5].
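The weight-curve step can be sketched as taking the L2 norm of each row of the sparse coefficient matrix and keeping the local maxima; the matrix below is a random stand-in for the output of the joint embedding and selection optimization.

```python
import numpy as np
from scipy.signal import find_peaks

rng = np.random.default_rng(0)
C = rng.random((100, 20))             # stand-in sparse coefficient matrix (shots x atoms)

weights = np.linalg.norm(C, axis=1)   # L2 norm of each row -> importance curve
peaks, _ = find_peaks(weights)        # local maxima -> candidate summary shots
print(peaks[:10])
```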
Truong et al. [39] design a semi-automatic editing tool, QuickCut. It takes as inputs a collection of raw video footage, an audio recording of the narrated voiceover, and an audio recording in which users speak out the editing actions and the objects in the scene while watching the footage. These audio annotations are fed to rev.com or Google's free Web Speech API to obtain text transcripts, and QuickCut time-aligns the text to the raw video using Rubin's algorithm. The QuickCut interface provides a transcript view pane, a footage selector pane and a structured timeline. QuickCut segments video based on motion derived from luminance and then refines the segmentation with the audio annotations; TF-IDF is used to search for relevant raw footage with text queries. Given a set of alignment constraints, the problem is to place aesthetically pleasing cuts so as to minimize a combined cost in terms of frame quality (blurry footage, camera shake and jump cuts) and transitions, and dynamic programming is used to solve this constrained problem. The system is not fully automatic and is mainly based on text and audio rather than the video itself; further, it does not propose a feasible speech-to-text solution, so extra labor is needed.
Heck et al. [38] take several recordings from unattended, stationary video cameras and microphones. Arev et al. [114] present automatic editing of footage from multiple social cameras. The overall pipeline is: 3D camera pose estimation using a standard structure-from-motion algorithm; 3D gaze concurrence estimation using the gaze clustering algorithm of Park [60] to extract 3D points of joint attention (JA-points); construction of a trellis graph whose nodes are camera-JA pairs with costs in terms of stabilization, camera roll, joint attention and the global vector, and whose edges carry weights concerning the transition angle, the distance between two cameras, the speed of the JA-point and the shot size; graph contraction according to user preferences on length, multiple sub-scenes, first-person viewpoint and algorithm parameters; path computation with an adaptive dynamic programming algorithm to control the output video length; and finally rendering. Ten different scenes from three datasets, using 3 to 18 cameras, were edited by this automatic algorithm, by a baseline that cuts every three seconds to a randomly chosen camera, and by a professional movie editor. The authors conclude that the videos produced by the automatic algorithm and by the professional editor are similar in spirit although, understandably, not identical. However, the JA-point is not always the most important point, and audio is not used in this algorithm.

Web Videos
Web videos come from users around the world, and their quality, size, aspect ratio and content vary greatly; investigating their potential value is a hot topic. Sun et al. [95] take advantage of the huge volume of video data online and treat edited videos as better examples for training a ranking model. Browsing Companion [115], with a UI, collects videos on the same topics online and discovers the relationships among videos using an HMM trained with bag-of-words, spatial pyramid and color histogram similarities of frames, so that while watching a video users can shift to other videos related to the current frame. Besides, Browsing Companion provides a new solution for abstracting highlights from a collection of videos: highlights are unique within videos but common among all videos. Huber et al. [116] propose a transcript-driven B-roll insertion system with a recommendation algorithm for vlogs. First, it analyzes popular vlogs online to learn the appropriate locations to insert B-roll and the relationship between words in the transcript and B-roll using an SVM; then it recommends start words and the corresponding B-roll from Giphy and Adobe Stock.
Write-A-Video [32] is a system with a UI that takes user-edited text as input, automatically searches for semantically matching candidate shots from an input video repository, and assembles a video montage with a hybrid optimization objective consisting of shot-wise, cut-wise and segment-wise energies. For each themed text, the keywords used to label the segmented text and to index video shots are given by users. Input videos are segmented into shots using the difference of their histograms in HSV color space. The similarity between text segments and video shots is computed using the visual-semantic embedding with hard negatives (VSE++) approach. The system proposes a cinematography-aware assembly algorithm that depends on 2D camera motion estimation and tone analysis, and a dynamic programming solver is used to find the optimal shot sequences. QuickCut [39] supports two main modes, an alternatives mode and an ordered-grouping mode: the alternatives mode only assembles one shot for a segmented script, and the ordered-grouping mode optimizes cut positions for a manually determined shot sequence. In contrast, the optimization method of Write-A-Video automatically decides both the shot sequence order and the cut positions. In addition to what QuickCut considers, Write-A-Video also accounts for saturation and brightness, opposite movements, and shot duration.
Wang et al. [117] propose a novel algorithm to edit online short videos driven by a paragraph. Given an input paragraph, its sentences are encoded by a bi-directional LSTM. Web videos are represented by their salient objects encoded by LSTMs or NetVLAD [118], and a matching model between videos and sentences is trained in a supervised way. A proposal module recommends the top-k matching videos for each sentence, and a sorting module arranges the matched videos according to the storyline of the input paragraph using the Sinkhorn network.

Editing Manipulation
There are several ways to categorize video editing. Editing ranges from the volume level and frame level to the object level. Volume-level editing inserts or deletes clips of the input video, selects a set of frames or shots, or removes volume patches; frame-level editing includes colorization, style transfer and so on; object-level editing mainly modifies objects within frames. Editing also varies with the input video type, as discussed above. Here, we classify editing algorithms according to their manipulation: retargeting, which changes the size and aspect ratio of the video; summarization, which greatly shortens the length of the video; and adding special effects, which mainly changes the content within frames.

Video Retargeting
Different displays require different sizes and aspect ratios of the same video, so retargeting becomes a practical need. The manipulations of retargeting can be divided into five classes: cropping, scaling, browsing, seam carving, and warping [120]. Figure 3 shows the differences between these retargeting approaches.
Figure 4. Retargeting a widescreen recording to smaller aspect ratios. The original recording with overlaid eye gaze data from multiple users (each viewer is a unique color) and the results computed by [98] are shown.

Cropping
Before cropping, the ROI or saliency map of each frame must be determined. Once the ROI is known, the cropper should smooth the results by considering on-screen motion and virtual cinematography. Feng et al. [9] crop each frame and pan it to the desired size. They combine motion and extended saliency with object identification methods from still images to determine the ROI, but do not consider the temporal dependency among frames over the full video. They define the retargeting loss as the sum of information loss, scaling, pixel aspect ratio, face cut cost, edge crowding cost, pan and cut costs, and user hint costs, and use brute-force search to find the optimal solution. Jain et al. [97] optimize the path of a cropping window with three primary operations (pan, cut and zoom) based on collected eye-tracking data; the ROI is selected by learning from viewers' gaze data with a RANSAC algorithm [88], and to evaluate the algorithm they compare the eye-tracking data of the original films and the retargeted ones. After optimizing the cropping window path, Kiess et al. [121] add some seam carving operations. Rachavarapu et al. [98] retarget videos using eye tracking: they first determine where the new cuts are from gaze data via dynamic programming, then optimize the cropping window path according to the principles of cinematography with an L1-regularized convex optimization solver. Khoenkaw et al. [99] use the film grammar of each shot to guide cinematic feature extraction and generate an importance map at the server end; at the client end, a cropper retargets the film to the desired size, shots are classified according to the camera behavior, and objects in each shot are further classified by their movements. Li et al. [122] seek an optimal cropping window path that preserves spatio-temporal saliency and faces with a Max-Flow/Min-Cut method. Liu et al. [123] take more factors into account to generate the saliency map: the rate of focused attention, the total saliency score, and a bias-from-center penalty.
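Many of these methods boil down to choosing a window position per frame that captures as much saliency as possible while penalizing jitter. The simplified dynamic-programming sketch below (1D horizontal positions only, with an illustrative jitter penalty) is a generic toy version of that idea, not any specific method cited above.

```python
import numpy as np

def crop_path(saliency, win_w, jitter_penalty=1.0):
    """saliency: (T, W) per-column saliency for each frame.
    Returns one window x-offset per frame maximizing captured saliency minus jitter."""
    T, W = saliency.shape
    n_pos = W - win_w + 1
    # Saliency captured by a window starting at each column, per frame.
    gain = np.array([[saliency[t, x:x + win_w].sum() for x in range(n_pos)]
                     for t in range(T)])
    move_cost = jitter_penalty * np.abs(np.arange(n_pos)[:, None]
                                        - np.arange(n_pos)[None, :])
    dp = gain[0].copy()
    back = np.zeros((T, n_pos), dtype=int)
    for t in range(1, T):
        scores = dp[:, None] - move_cost      # scores[prev_pos, new_pos]
        back[t] = scores.argmax(axis=0)
        dp = scores.max(axis=0) + gain[t]
    path = [int(dp.argmax())]                 # backtrack the best window path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```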

Scaling
Scaling modifies width and height while keeping the ratio of width to height unchanged. Li et al. [124] retarget videos by segmenting them into spatio-temporal grids; they use grid flow to select keyframes and resize the grid flows in those keyframes, while the remaining frames are resized by simply interpolating their grid contents from the two nearest retargeted keyframes. This algorithm has low computational complexity and preserves the shapes of salient objects along the time axis. Wang et al. [125] resize each frame of a video independently based on its salient objects and then optimize the motion along each pathline of the optical flow. Wang et al. [126] first align frames to the same coordinate system by estimating camera motion, resize every frame in a spatially and temporally coherent way, and then map the resulting frames back into the original coordinate system.

Warping
Warping builds a mapping between each pair of frames in the source video and the target video. Krahenbuhl et al. [10] propose a real-time, pixel-accurate warping retargeting method based on a 2D variant of EWA splatting [127]. Zhang et al. [128] retarget videos in the compressed domain to save runtime: the video is first partially decoded, cropped with the saliency map, warped based on a column mesh to the desired size, and finally re-encoded. Yan et al. [129] focus on eliminating jittery artifacts by considering spatial and temporal coherence simultaneously. [130] calculates the saliency map based on gradient magnitude, face detection and motion detection, and treats the mapping between source and retargeted pixels as a sparse linear system solved by a least squares algorithm. [131] points out that a saliency map with motion information is the cause of waving and squeezing artifacts.
To avoid such artifacts, it takes motion into consideration, crops temporally recurring content, and warps homogeneous regions to mask deformations and preserve motion. Nie et al. [132] propose an interactive retargeting system that warps video using a mean-value-coordinate warping method and refines the results by patch-based summarization of the temporal output to eliminate distortion; consequently, this algorithm supports video summarization, completion and reshuffling. To avoid repeated computation, Zhang et al. [133] compute an importance map for each pixel of every frame, calculate cumulative shrinkability maps for the x and y directions, and store them; given any target resolution, videos can then be warped quickly according to the shrinkability maps. Liu et al. [134] warp stereo 3D videos with disparity constraints, and Shao et al. [135] attempt to warp multi-view videos with depth.

Seam Carving
Seam carving reduces or expands image size by removing or inserting seams [137]. For a 1D seam, the image energy function in [137] is
$e_{HoG}(\mathbf{I}) = \frac{\left|\frac{\partial}{\partial x}\mathbf{I}\right| + \left|\frac{\partial}{\partial y}\mathbf{I}\right|}{\max\left(HoG(\mathbf{I}(x,y))\right)}$,
where $\mathbf{I}$ is an $n \times m$ image and HoG is taken from [52]. A vertical seam is a connected path of pixels, one per row, $s^x = \{(x(i), i)\}_{i=1}^{n}$ with $|x(i) - x(i-1)| \le k$ for some $k \in \mathbb{R}$, where $x$ is a mapping $x : [1, \ldots, n] \rightarrow [1, \ldots, m]$. To reduce the image size, the seam with the least energy is removed; to enlarge it, low-energy seams are duplicated and the pixels of each inserted seam are averaged with their neighbors.
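A minimal dynamic-programming sketch for finding one vertical seam is shown below; it uses a plain gradient-magnitude energy rather than the HoG-weighted energy above.

```python
import numpy as np

def find_vertical_seam(gray):
    """Return, for each row, the column index of the minimum-energy vertical seam."""
    gy, gx = np.gradient(gray.astype(float))
    energy = np.abs(gx) + np.abs(gy)
    h, w = energy.shape
    cost = energy.copy()
    back = np.zeros((h, w), dtype=int)
    for i in range(1, h):
        for j in range(w):
            lo, hi = max(j - 1, 0), min(j + 2, w)
            k = lo + int(np.argmin(cost[i - 1, lo:hi]))   # best connected predecessor
            back[i, j] = k
            cost[i, j] += cost[i - 1, k]
    seam = [int(np.argmin(cost[-1]))]                     # backtrack from the last row
    for i in range(h - 1, 0, -1):
        seam.append(int(back[i, seam[-1]]))
    return seam[::-1]
```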
Rubinstein et al. [136] extend 1D seams in 2D images to 2D seam manifolds in the 3D space-time volumes of videos, as shown in Figure 5. Unlike 1D seams, which are calculated by dynamic programming, the 2D seams are found as a minimal cut in a graph; additionally, a novel energy function is proposed that emphasizes the energy introduced by removing seams. [138] optimizes backward and forward energy jointly and proposes isosurface protection and encoding of the opacity transfer function. Kaur et al. [139] use a Kalman filter to optimize seam carving, which is theoretically simple and hardware friendly. Hsin et al. [140] manipulate the saliency histogram and propose the saliency histogram equalisation-seam carving (SHE-SC) algorithm: they retarget the first frame of a video using SHE-SC and adapt the algorithm for the remaining frames according to their difference from the previous frame. As 3D seam carving has a high computational cost, Furuta et al. [141] propose finding a suboptimal volume seam carving using multi-pass dynamic programming.
Retargeting stereo videos should follow three guidelines: keeping temporal coherence, preventing depth distortion, and minimizing shape distortions of the retargeted video [142]. Nguyen et al. [142] propose segmenting stereo videos into groups of frames according to an energy cost that includes saliency and stereoscopic confidence, with the seam carving fixed within each group; the left-view seam carving is calculated first. Refer to [120] for more works on video retargeting. There are also several solutions for reducing the length of videos. One is reducing the frame rate by dropping frames, which is called hyperlapse; another is extracting one or multiple keyframes and generating still images; and another is selecting salient subshots and outputting a shortened video clip. In all cases, two main criteria should be followed: 1) preserve as much information as possible; 2) reduce the number of frames or the size as much as possible. Thus, video summarization seeks a balance between information volume and output size according to the needs of different scenarios. Similar to the general editing workflow above, video summarization usually consists of segmentation, scoring the importance of each frame or shot, selecting the most significant frames or shots, and optionally output optimization and rendering. For certain videos, pre-processing is a must; for example, generating hyperlapse videos in real time for hand-held cameras requires stabilization [87].

Video Summary
Segmentation is a key step. Zhao et al. [144] evenly segment input videos into clips of 50 frames to speed up processing. Poleg et al. [145] propose novel Cumulative Displacement Curves to segment egocentric videos involving complex motion, and show that integrated motion vectors work better than instantaneous motion vectors. A simple solution is to calculate a color histogram for each frame and detect a shot boundary when the difference between two adjacent frames exceeds some threshold [146]. Potapov et al. [147] propose the Kernel Temporal Segmentation (KTS) algorithm. Pavel et al. [13] segment video according to shot changes and saliency maps. Abdelati Malek Amel et al. [148] detect shot boundaries by calculating the motion intensity between frames throughout the video using the adaptive rood pattern search algorithm and then deciding the threshold. Luo et al. [149] segment videos based on camera motion, e.g., pan, zoom, pause. Zhang et al. [143] propose two LSTM networks to model the temporal dependency among frames and extract keyframes or key subshots; the structures of vsLSTM and dppLSTM are shown in Figure 6.
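A straightforward sketch of the histogram-based shot boundary detector with OpenCV follows; the correlation threshold of 0.6 is an arbitrary assumption.

```python
import cv2

def detect_shot_boundaries(path, threshold=0.6):
    """Mark a boundary when the correlation between adjacent frame histograms drops."""
    cap = cv2.VideoCapture(path)
    boundaries, prev_hist, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            sim = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL)
            if sim < threshold:          # low similarity = large change = cut
                boundaries.append(idx)
        prev_hist, idx = hist, idx + 1
    cap.release()
    return boundaries
```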
Evaluating the importance of frames and shots varies a lot. Zhou et al. [150] propose time-mapping, namely transforming a high-frame-rate video into a low-frame-rate video. That paper introduces a novel saliency method, a re-timing technique that temporally resamples based on the frame importance produced by the saliency method, and two new temporal filters (an adaptive box filter and a saliency-based motion-blur filter) to enhance the rendering of salient motion. The saliency method is bottom-up and utilizes semantic segmentation and optical flow; the results show that the saliency-based motion-blur filter works best. Sun et al. [95] propose a pair-wise ranking model that learns from online videos and scores the highlightness of unconstrained video clips. LiveLight [144] scans and segments the input video based on a dictionary: it builds a dictionary of video segments by adding each new segment that cannot be sparsely reconstructed [151] with the current dictionary. Li et al. [152] also propose a dictionary learning approach to segment videos that considers reconstruction loss, group sparsity regularization, and patch-level and frame-level structure preservation regularizers. The outputs can be a keyframe, multiple keyframes (static storyboards) [154][93], a static storyboard [155] or a shorter video clip [95]. Many classical video summarization works have been reviewed by Truong et al. [156]. Since 2006, more deep-learning-based methods have been proposed to summarize videos. Pavel et al. [19] summarize videos based on their lines. Cong et al. [50] treat keyframe extraction and video skimming as a dictionary selection problem, and Hamza proposes to extract keyframes from wireless capsule endoscopy [93]. [92] proposes a novel human detection and tracking method based on poselets and a new saliency detector trained with gaze data. Gong et al. [157] propose the sequential determinantal point process (seqDPP) to summarize videos in a supervised manner. That work also provides evaluation metrics: given two summaries, the matched frames whose visual distance falls under a threshold are found, and precision, recall and F-score are computed; a ground-truth summary is synthesized per video by greedily maximizing the F-score between the ground truth and multiple human-annotated summaries.
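The matching-based precision/recall/F-score protocol can be sketched as follows; the nearest-neighbor matching and the distance threshold are simplifications of the procedure in [157].

```python
import numpy as np

def summary_f_score(feats_a, feats_b, dist_threshold=0.5):
    """feats_a, feats_b: (Na, D) and (Nb, D) features of the frames in two summaries.
    A frame counts as matched when its nearest frame in the other summary is close enough."""
    dists = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    matched_a = int((dists.min(axis=1) < dist_threshold).sum())
    matched_b = int((dists.min(axis=0) < dist_threshold).sum())
    precision = matched_a / len(feats_a)
    recall = matched_b / len(feats_b)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```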

The methods for summarizing different genres of videos also vary; see Figure 7. Take egocentric videos as an example. Wearable video cameras were first introduced by Steve Mann in 1998 [158]. In 2001, Aizawa et al. proposed a system to summarize wearable videos using brain wave cues [30]. Since then, a lot of research on egocentric video summarization has arisen. Huang et al. [159] use a support vector machine to summarize wearable videos. Ghosh focuses on discovering high-level saliency in egocentric video and forming a storyboard [160]. Lu et al. [8] come up with a new video editing solution that is story-driven and emphasizes causality between subshots. [161] uses web images as a prior to extract keyframes from user-generated videos of poor quality. [162] selects superframes according to their interestingness [163] and proposes the SumMe benchmark, and Gygli further improves the summarization algorithm in a supervised way, jointly optimizing multiple objectives [164]. In another work [165], Gygli stabilizes egocentric videos and utilizes their shifts to turn 2D videos into stereo. [166] proposes a series of graph-based algorithms to detect shot boundaries, select keyframes, capture the characteristics of a frame, extract features, and cluster keyframes. Xiong et al. [167] use web images as a prior to detect snap points in egocentric videos. Refer to [168] for more reviews on egocentric videos. Apart from egocentric videos, Zhou et al. [169] utilize human face detection and tracking to address character-oriented summarization, whereas Mindek et al. [170] summarize multiplayer games in 3D scenes based on game rules. More reviews on summarization of different video genres can be found in [153]. Beyond a single input video, there is also a practical need to summarize multiple videos; Figure 8 shows the general workflow of a multi-view video summarization system. Fu et al. [6] first study the multi-view video summarization problem systematically and treat it as a graph labeling task. They segment videos into shots and compute their importance scores based on low-level features (color histogram, edge histogram and wavelet features) and high-level features (faces detected by the Viola-Jones face detector [54]). By selection, a spatio-temporal shot graph of important shots is constructed, whose edges represent the similarity between nodes; shots are then clustered by random walks and multi-objective optimization. More works on multi-view video summarization can be found in [171]. Meng et al. [172] propose a centroid co-regularization (MSDS-CC) method to select representative visual elements, whereas Li et al. [154] use an SVM algorithm to abstract keyframes and further reduce them with rough sets. De et al. [173] aim at reducing the inclusion of patches in the resulting summary videos for fixed-viewpoint multiple cameras via optimal reconstruction. Kuanar et al. [174] also regard the multi-view video summarization problem as a graph-theoretic one: apart from regular steps such as segmentation and feature extraction, they use Gaussian entropy to drop redundant frames, utilize bipartite graph matching to calculate inter-view dependencies, and apply the optimum-path forest algorithm to cluster keyframes. Wang et al. [175] learn metrics and output keyframes rather than videos. To reduce the compression and transmission power consumption of wireless video sensors, Ou et al. [5] design a simple but useful algorithm to summarize multi-view videos.
The algorithm needs to be online, with low computational complexity and memory requirements; additionally, as it targets multi-view sensor networks, the communication overhead between nodes must be as small as possible. The algorithm consists of two stages: an intra-view stage and an inter-view stage. In the intra-view stage, frame features are extracted by the MPEG-7 color layout descriptor [46] and clustered using a simplified Gaussian mixture model (GMM), and the frames with smaller GMM weights and larger variances are selected. In the inter-view stage, only one frame of the same event is kept. [176] also considers the intra- and inter-view correlations and focuses on exploiting sparsity. Zhong et al. [177] use a hypergraph-based dominant set clustering method to locate keyframes and utilize web images to further reduce redundancy.
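The intra-view stage can be sketched with scikit-learn's Gaussian mixture model; the feature matrix is a random stand-in for the MPEG-7 color layout descriptors, and the selection rule follows the description above only loosely.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
features = rng.random((500, 12))            # stand-in per-frame color-layout features

gmm = GaussianMixture(n_components=5, covariance_type="diag").fit(features)

# Prefer components with small mixing weight and large total variance.
variance = gmm.covariances_.sum(axis=1)
rank = np.argsort(gmm.weights_ / (variance + 1e-8))
selected_components = rank[:2]

labels = gmm.predict(features)
keyframe_ids = np.where(np.isin(labels, selected_components))[0]
print(keyframe_ids[:10])
```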
Apart from multi-view videos with overlapping fields of view, Chu et al. [4] abstract visually co-occurring shots from many unconstrained videos on a certain topic. They segment videos by the color changes between two frames and extract frame features using CENTRIST [47], VLFeat [48] and HSV color moments [50]. The shots are organized as a bipartite graph, and a Maximal Biclique Finding algorithm is proposed to discard the shots that appear only within a single video.

Video Synopsis
Whereas video summarization extracts consecutive or non-consecutive frames from the original video, video synopsis further modifies the extracted frames; in addition, shifting activities of interest in the time domain is a way to further condense the input. Irani et al. first generated synopses for video retrieval [178], and Alex et al. [179] were the first to formalize the concept. Video synopsis rearranges moving objects from different times into the same frames while keeping their spatial locations unchanged. The proposed workflow is: activity generation, tube rearrangement, background generation, object stitching, and solving an energy cost function within a given time interval by simulated annealing [180]. Obviously, compared with keyframe extraction, the keypoint of video synopsis lies in finding the optimal temporal positions of the selected activities. This work has many followers; for example, Pritch et al. [181] do similar work for web cameras and can generate a synopsis of limited length for a given time interval, and Kemal et al. give a thorough review of video synopsis methodology [182]. Sun et al. [91] extract the salient person performing an action and generate a photo montage.
Multi-view video synopsis is also an explored problem. Mahapatra et al. [183] present a solution to generate multi-view video synopses consisting of five steps: common background creation, common plane correspondence, object detection, action recognition and dynamic video synopsis. The common background is the top-view representation of the multiple cameras and can be obtained with the help of Google Maps for outdoor scenes or by modifying a previous draft sketch. Common plane correspondence is achieved by mapping all cameras to a common coordinate system [184]. Object detection is implemented with [185] and action recognition with SimpleMKL [186]. An energy function in terms of information loss, collisions and the length of the resulting synopsis is optimized by simulated annealing [180]. Generating a storyboard requires tracking, semantic segmentation, keyframe extraction, frame layout extension, annotation layout, compositing and rendering [155]. Beyond compression, video synopsis can also serve as an indexing method. Tang et al. [110] generate montages in which icons or sprites represent salient events and function as indexes into the input video; those sprites are extracted by incrementally building a Gaussian mixture model with conjugate priors for the background, and foreground elements are segmented with fast morphological operators.

Video Mosaics
Video mosaics also rearrange the space-time volume of input videos, but they manipulate a sequence of frames without modifying the content of any single frame. Rav-Acha et al. [187] propose Dynamosaics, which stitches the space-time volumes of 2D videos captured by a moving camera to generate 360° videos. [188] seeks to stitch panoramas or activity synopses from web videos: it first filters videos by camera motion, moving objects, and visual quality, then synthesizes scene panoramas with a feathering fusion algorithm [189], optionally adding moving objects back onto the panoramas.
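As a small illustration of feathering-style fusion, the sketch below blends two images already warped onto a shared canvas by weighting each pixel with its distance to the valid-region border; the function name and mask format are assumptions for illustration, not the algorithm of [189].

```python
import cv2
import numpy as np

def feather_blend(img_a, img_b, mask_a, mask_b):
    # masks are uint8 {0, 255} valid-pixel maps on the shared canvas
    w_a = cv2.distanceTransform(mask_a, cv2.DIST_L2, 3).astype(np.float32)
    w_b = cv2.distanceTransform(mask_b, cv2.DIST_L2, 3).astype(np.float32)
    total = w_a + w_b
    total[total == 0] = 1.0                   # avoid division by zero
    w_a = (w_a / total)[..., None]
    w_b = (w_b / total)[..., None]
    blended = img_a.astype(np.float32) * w_a + img_b.astype(np.float32) * w_b
    return blended.astype(np.uint8)
```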

Special Effects
This section introduces some interesting applications that do not fit the previous subsections. Special effects range over object manipulation, style transfer, colorization, and more.
A cliplet is a form between video and image: a small part of it is dynamic while the rest is static. [16] first proposes the concept and builds an interactive system supporting four idioms of iconic time-mapping from the input video to the target cliplet: still, play, mirror, and loop. After mapping, it refines results with automatic alignment, looping optimization and feathering, simultaneous matting and compositing, and Laplacian blending. A cinemagraph is a special case in which the movement is periodic. Bazin et al. [190] extract a target rigid object in an input video or image and let users change its physical parameters; the new shape is then simulated and fit back into the original video.
Davis et al. [17] present a novel algorithm to find visual rhythm and beats in a video and then warp the video to an audio track to form a dance video. The audio is processed by STFT to obtain its power spectrogram, onset envelope, tempogram, and beats. Similarly, the video's directogram is obtained from optical flow, from which an impact envelope, visual tempogram, and visual beats are derived. An interpolation strategy keeps the audio and video synchronized. Besides generating dance videos, the method can also change the dance beats to fit another song; in this sense, music-driven video editing is possible. Bai et al. [191] propose a semi-automated technique for selectively de-animating video to remove the large-scale motions of one or more objects. The user draws three kinds of strokes: green strokes indicate which regions of the video should be de-animated, red strokes which regions should be held static, and blue strokes which should remain dynamic. The technique consists of two stages: warping and compositing. Kanade-Lucas-Tomasi (KLT) tracking [192] follows distinctive points in the input video, and tracks shorter than 15% of the input duration are removed. Each input frame is divided into a 64x32 rectilinear grid mesh. In the initial warp, only anchor tracks are warped to minimize an energy function; in the refined warp, the output of the initial warp and the floating tracks are used to solve the final energy function by least squares. At the compositing stage, graph cuts perform Markov Random Field optimization.
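For a concrete picture of the audio analysis side, the sketch below computes an onset envelope, tempogram, and beat times with librosa as a stand-in implementation; the input filename is a placeholder and this is not the authors' code.

```python
import librosa

y, sr = librosa.load("song.wav")                      # placeholder input
onset_env = librosa.onset.onset_strength(y=y, sr=sr)  # impact/onset envelope
tempogram = librosa.feature.tempogram(onset_envelope=onset_env, sr=sr)
tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
# `beat_times` would then be aligned with the video's visual beats by warping.
```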
Zhang et al. [193] propose Vid2Player, a system that learns the behaviors of tennis players from a large annotated database and can generate interactively controllable video sprites that behave and appear like professional players. Chang et al. [194] present a system that supports object-level video editing. It segments objects, generates alpha mattes, and estimates 3D scene information with a structure-from-motion (SfM) algorithm [195]. Users can apply 3D transformations to objects, duplicate objects, or even transfer objects across videos; the system then models the 3D scene, renders frames using sparse structure points, and composites layers. Kasten et al. [196] propose to edit videos through atlases. Given a natural video with a coarse mask of the objects of interest, the method estimates a set of atlases for the background and the objects of interest. Users manipulate one or more atlases, e.g., changing colors or adding texture; the algorithm estimates a mapping from each pixel in the video to a 2D point in each atlas, together with its opacity, and propagates the change consistently across the whole video. Coordinate-based multilayer perceptron (MLP) representations are used for the mappings, atlases, and alphas. The algorithm is self-supervised, and its loss function consists of a rigidity loss, a consistency loss, and a sparsity loss.
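As a minimal sketch of the coordinate-based MLP idea in [196], the network below maps a video coordinate (x, y, t) to a 2D atlas coordinate and an opacity; the layer sizes are illustrative, and the positional encoding and training losses of the paper are omitted.

```python
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    def __init__(self, hidden=256, layers=6):
        super().__init__()
        dims = [3] + [hidden] * layers
        blocks = []
        for i in range(layers):
            blocks += [nn.Linear(dims[i], dims[i + 1]), nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*blocks)
        self.head = nn.Linear(hidden, 3)     # (u, v) atlas coords + alpha logit

    def forward(self, xyt):                  # xyt: (N, 3) coordinates in [-1, 1]
        out = self.head(self.body(xyt))
        uv = torch.tanh(out[:, :2])          # atlas coordinates in [-1, 1]
        alpha = torch.sigmoid(out[:, 2:])    # opacity in (0, 1)
        return uv, alpha

# uv can be used with torch.nn.functional.grid_sample to look up atlas colors.
```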
Fried et al. [197] present two methods for puppet dubbing, one semi-automatic and appearance-based, the other fully automatic and audio-based. The paper also proposes three guidelines for puppet speech: 1) each syllable in speech should match one closed-open-closed segment of puppet lip motion, called a visual syllable; 2) lips should be still and closed when the puppet is not speaking, called a visual silence syllable; 3) in rapid speech, several spoken syllables may correspond to a single visual syllable. The inputs of both methods are a puppet video and a new speech audio track shorter than the video. Both approaches proceed in four steps. First, segment the new speech audio into a sequence of syllables by transcribing it into text with closed captions from YouTube, aligning the transcript to the audio using P2FA, combining phonemes into syllables, and merging short syllables. Second, segment the puppet video into a sequence of visual syllables with the appearance-based or audio-based method: in the appearance-based method, frames are classified into open-mouth, closed-mouth, and invalid by a network based on pre-trained GoogLeNet, by hand, or both; in the audio-based method, the visual syllables correspond to those of the first step. Third, align audio syllables to visual syllables under three basic guidelines, 1) silence matches silence, 2) non-silence matches non-silence, and 3) syllable lengths should be as similar as possible, by solving a variant of dynamic time warping. Finally, retime audio syllables using Waveform Similarity Overlap-Add, retime visual syllables using nearest-neighbor sampling or optical flow interpolation, and retime audio and visual together by setting the new length to their geometric mean. The paper refers to results in supplemental materials, but we could not locate them; after all, puppet dubbing is much easier than human dubbing.
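The alignment step can be pictured with the toy dynamic-programming sketch below, which matches audio syllables to visual syllables while penalizing silence/non-silence pairs and length mismatches; the cost terms and penalties are assumptions rather than the paper's exact dynamic time warping variant.

```python
import numpy as np

def match_cost(a, v, silence_penalty=10.0):
    # a, v are (duration_seconds, is_silence) tuples
    cost = abs(a[0] - v[0])                  # prefer similar lengths
    if a[1] != v[1]:
        cost += silence_penalty              # discourage silence/non-silence pairs
    return cost

def align(audio_syls, visual_syls, skip_penalty=5.0):
    n, m = len(audio_syls), len(visual_syls)
    dp = np.full((n + 1, m + 1), np.inf)
    dp[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i < n and j < m:              # match one audio to one visual syllable
                dp[i + 1, j + 1] = min(dp[i + 1, j + 1],
                                       dp[i, j] + match_cost(audio_syls[i],
                                                             visual_syls[j]))
            if i < n:                        # several spoken syllables, one visual
                dp[i + 1, j] = min(dp[i + 1, j], dp[i, j] + skip_penalty)
            if j < m:                        # unmatched visual syllable
                dp[i, j + 1] = min(dp[i, j + 1], dp[i, j] + skip_penalty)
    return dp[n, m]   # backtracking would recover the actual correspondence
```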
Video restoration is also grouped under special effects here. Lu et al. [198] use both first-order and second-order nonlocal regularization terms to restore videos of poor quality. Li et al. [199] propose a multiplanar autoregressive (AR) model to exploit the correlation in cross-dimensional planes of groups of similar patches from neighboring frames, and a joint multiplanar AR and low-rank based algorithm reconstructs each group; a random field then smooths the temporally adjacent patches. Bai et al. [200] accelerate a Monte Carlo simulation denoiser with the help of GPUs.

Datasets and Metrics
This section gives a brief review of the datasets and evaluation metrics used in editing algorithms.

Datasets
Standard datasets free researchers from laborious data collection and cleaning. They save time and money, and provide benchmarks for the algorithms of this domain. Here, we sort 29 datasets from two main sources: challenges and researchers. Table 1 lists the collected datasets. Sports-360 [15] consists of 342 360° videos belonging to five sports categories: basketball, parkour, BMX, skateboarding, and dance. The videos of Pano2Vid [14] are collected from YouTube. Jiang et al. [57] build LEDOV, consisting of 32 subjects' fixations on 538 videos of at least 720p resolution across 158 sub-categories. Hodosh et al. [214] establish a corpus of images, each with 5 simple captions. Swedish leaf [201], KTH IDOL [211], the 15-class scene category dataset [215], the 8-class sports event dataset [207], and the 67-class indoor scene recognition dataset [202] are image datasets for feature extraction. Alexander et al. [212] build a Kodak consumer video dataset of 25 concepts; the videos come from users and YouTube, and the concept labels are annotated by hand. The concepts include activities (dancing, singing), occasions (wedding, birthday, and so on), scene, object, people, and sound. Luo et al. [149] select 100 video clips from the Kodak consumer video dataset and annotate their keyframes. Xiao et al. [206] construct a dataset of 360° panoramic images in 26 place categories to train Support Vector Machines to learn the place category and scene viewpoint of images, as well as the symmetry of objects.

Evaluation Metrics
Evaluation metrics play an important role in algorithm design, for they give feedback and expose the drawbacks of the tested solution. Several metrics are commonly used in the video editing domain, such as the degree of automation, speed, and memory requirements. Precision, recall, F-measure (e.g., F1-score), accuracy, and CC score are popular metrics [7]. The F1-score is defined as $F_1 = \frac{2PR}{P + R}$, where $P$ denotes precision and $R$ recall. It is still difficult to evaluate the quality of generated videos, since, as discussed above, beauty and logic are hard to program. Even so, researchers provide two solutions. One is user studies, where user feedback is the main metric. Su et al. [12] collect human visual saliency while viewers watch the output videos as an index of how well the proposed algorithm works. Viewers' preferences are also a kind of feedback [13] [44].
The other solution is to define metrics for specific experiments. Zhang defines precision (P) and recall (R) as in equation (2), where A and B are video clips. Su et al. [14] define two classes of metrics: HumanCam-based metrics, including distinguishability, HumanCam-likeness, and transferability, and HumanEdit-based metrics, consisting of mean cosine similarity, frame pooling, and mean overlap. Davis et al. [17] come up with synchro-saliency, which measures the synchronization of visual and audible events. Potapov gives a method to evaluate the importance of a video segment by considering whether the segment contains evidence of the given event category [147]. Sener et al. [33] use intersection over union (IoU) and mean average precision (mAP) to evaluate temporal segmentation, $\mathrm{IoU} = \frac{1}{N}\sum_{i=1}^{N} \frac{|\tau^*_i \cap \tau'_i|}{|\tau^*_i \cup \tau'_i|}$, where $N$ is the number of segments, $\tau^*_i$ is the ground-truth segment and $\tau'_i$ is the predicted segment. To adapt to unsupervised algorithms, Sener uses a cluster similarity measure that assigns ground truth in a brute-force searching manner. Liu et al. [216] propose a metric for image retargeting quality based on SIFT [49].
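For the temporal IoU above, a minimal sketch is given below under the assumption that each ground-truth segment is paired with one predicted segment and represented as a (start, end) interval.

```python
def temporal_iou(gt_segments, pred_segments):
    # gt_segments, pred_segments: paired lists of (start, end) intervals
    ious = []
    for (gs, ge), (ps, pe) in zip(gt_segments, pred_segments):
        inter = max(0.0, min(ge, pe) - max(gs, ps))
        union = (ge - gs) + (pe - ps) - inter
        ious.append(inter / union if union > 0 else 0.0)
    return sum(ious) / len(ious)

# Example: temporal_iou([(0, 10), (12, 20)], [(2, 11), (13, 22)])
```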

Image Editing
Though our focus is video editing, image editing is still a valuable research field worth exploring, because algorithms designed for images might be transplanted to the video domain. Here we introduce recent research on image editing, but we do not cover algorithms that generate images from scratch, such as rendering [217].
Enhancing the quality of images and videos is a valuable problem, since many of them have a long history. Super-resolution is not simply scaling images by interpolation, but emphasizes improving their quality; it is an underdetermined inverse problem. Dong et al. [218] use a deep CNN to learn the mapping from a low-resolution image to a high-resolution one. [219] [220] give more detailed reviews of super-resolution techniques. Yu et al. [221] propose a noise prior learner, NEGAN, that denoises, inpaints, and colorizes legacy photos. More image denoising works using deep networks can be found in [222].
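As an illustration of this mapping approach, the sketch below implements an SRCNN-style three-layer network (patch extraction, non-linear mapping, reconstruction) in PyTorch applied to a bicubically upscaled low-resolution input; the 9-1-5 filter sizes and channel counts follow a common configuration and are not guaranteed to match [218] exactly.

```python
import torch.nn as nn

class SRCNN(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, kernel_size=9, padding=4),  # patch extraction
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=1),                   # non-linear mapping
            nn.ReLU(inplace=True),
            nn.Conv2d(32, channels, kernel_size=5, padding=2),  # reconstruction
        )

    def forward(self, lr_upscaled):
        # lr_upscaled: low-resolution image already upscaled to the target size
        return self.net(lr_upscaled)
```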
Arar et al. [223] apply seam carving in the feature map of the input image, then reconstruct the resulting image with a CNN. Retargeting a pair of stereo images needs to keep their pixel pairs and stereo structure unchanged through joint optimization [224]. Dawei et al. [225] use seam carving and warping to retarget a pair of stereoscopic images utilizing their disparity consistency. Colorization (as shown in Figure 10) and style transfer are prevalent fields. Cheng et al. [227] first explore colorization with CNNs, and (DE)²CO [228] is an early work to apply a deep network to this task. Zhang et al. [229] propose an automatic colorization solution that treats colorization as a classification task and tries to increase the diversity of resulting colors. Some works need input from users: for example, [230] colorizes grayscale images with a CNN trained on large datasets plus sparse hints from users. Different from previous colorization methods that focus on the entire image, Su et al. [226] propose fusing object-level and image-level features from two models to determine the final color of each object; objects are detected and cropped by Mask R-CNN [231], and the instance model, image-level model [230], and a fusion module are trained in three steps. More reviews of AI colorization can be found in [203] [232]. Style transfer extracts a texture from a source image domain and transfers it to a target image domain using a deep neural network. Jiang et al. [233] introduce a novel Ghost module into the GANILLA architecture [234] to learn and transfer image styles. Wang et al. [235] propose an interactive tool for extracting alpha mattes of foreground objects, namely Soft Scissors, which combines incremental matte estimation, incremental foreground color estimation, an intelligent user interface, and a robust matting algorithm so that it runs efficiently and in real time. Kim et al. [236] grab several photo streams of the same theme, align them by similarity, and jointly co-segment the shared regions of the aligned images based on an image graph; this work aims at finding common patterns among a sea of web images.
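For reference, the sketch below shows classic pixel-space seam carving, the operation [223] applies in feature space: a dynamic program finds one minimal-energy vertical seam, which is then removed. The gradient-magnitude energy is a standard simplification, not any particular paper's formulation.

```python
import numpy as np

def energy_map(gray):
    # gray: 2D grayscale image; energy = L1 gradient magnitude
    gy, gx = np.gradient(gray.astype(np.float64))
    return np.abs(gx) + np.abs(gy)

def find_vertical_seam(energy):
    h, w = energy.shape
    cost = energy.copy()
    for i in range(1, h):                     # accumulate minimal path costs
        for j in range(w):
            lo, hi = max(0, j - 1), min(w, j + 2)
            cost[i, j] += cost[i - 1, lo:hi].min()
    seam = [int(np.argmin(cost[-1]))]
    for i in range(h - 2, -1, -1):            # backtrack from the bottom row
        j = seam[-1]
        lo, hi = max(0, j - 1), min(w, j + 2)
        seam.append(lo + int(np.argmin(cost[i, lo:hi])))
    return seam[::-1]                          # seam[i] = column removed in row i

def remove_seam(img, seam):
    return np.array([np.delete(img[i], seam[i], axis=0)
                     for i in range(img.shape[0])])
```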
To avoid repetitive effort, Grabler et al. [237] propose a novel system to generate photo manipulation tutorials and content-dependent macros by recording a demonstration. It consists of a demonstration recorder, an image labeler, and a tutorial generator. The demonstration recorder records all changes in the interface and the resulting changes in application state, which are later grouped according to parameters and operations. The image labeler leverages existing computer vision recognition techniques to label semantically important regions in images. The tutorial generator produces text descriptions in a fixed style from the grouped changes, screenshot annotations, and generated text, following guidelines such as step-by-step presentation, succinctness, combined text and images, and grid-based layout.
Some works make a step towards image understanding. For example, given an image with several simple captions, Hodosh tries to resolve entity coreference [214]; however, it only achieves a precision of 46.0% with an F-score of 49.0%. [163] investigates the interestingness of images.

Summary
Video editing research has emerged over the last two decades and has grown rapidly with the help of computer vision, natural language processing, and other disciplines. We have given an introduction to the development history of AI video editing above, dividing editing tasks into groups according to the number of input tracks, the target output, the features used, and the kinds of input videos, and sorting common tools and datasets. As we can see, video editing systems have evolved from simple to complex, from single inputs to multiple inputs, and from single features to multiple features. The range of input video domains is wide, from sports games to surveillance, and from films to short videos on social platforms.
However, video editing is still not fully automatic. Current editing algorithms rely heavily on clean inputs and often require manual pre-processing. The features extracted from audio, images, the temporal dependencies of consecutive frames, and subtitles are neither easy for machines to understand nor sufficient for the algorithms. Besides, designing editing plans that are logically consistent is very hard, and producing a truly impressive edited video remains out of reach today. We hope this survey can serve researchers well and promote the development of video editing.

Future Work
Even though video editing has achieved impressive improvements over the decades, it is still not intelligent enough, and the degree of automation still needs to increase. Hopefully, several recent lines of research that perform well can bring insights into AI video editing. More research now focuses on video understanding, which would greatly help intelligent video editing. Ramanathan et al. [238] attempt to resolve names for each figure. Haurilet et al. [239] propose a method to label all character appearances in TV series using only subtitles, and [240] also identifies characters in TV. Du et al. [53] achieve strong results on action, scene, and object recognition. Jean-Baptiste Alayrac [241] tries to discover actions and the corresponding object states in videos. Huang et al. [242] try to resolve references in instructional videos. [243] then proposes a visual understanding task for the video domain, presents a novel visual grounding model that is both reference-aware and weakly supervised, and provides reference-grounding test set annotations for the YouCook2II and RoboWatch instructional video benchmarks. Xia et al. [244] propose an online multi-modal searching machine (OMS) to search for persons in videos using their face, body, and voice features. Similar work on person search in video includes [245] [246].
Video saliency gives hints about what audiences want to see. Scherer et al. [208] investigate the impact of audiovisual features on the communication and judgement of politicians in zero-acquaintance situations; they run several experiments to explore the disparity in the perception of speakers with and without audio using eye-tracking data, and the influence of audio features, together with common visual features, on the perception of the speakers' qualities. Jiang et al. [57] report three findings: 1) there is a high correlation between objectness and human attention; 2) objects, especially moving objects or parts, are more attractive to human attention; 3) saliency maps transition smoothly across frames. Based on those findings, an object-to-motion model, OM-CNN, is developed to learn motion features for predicting intra-frame saliency, and a saliency-structured convolutional long short-term memory network, SS-ConvLSTM, is proposed to learn pixel-wise eye-fixation transitions across frames and center bias for inter-frame saliency maps. Liu follows this work in [209], proposing a deep learning model to predict the salient face, with transitions across frames, in multiple-face videos, and builds a new multi-face video database. By analyzing the database, two findings can be concluded: 1) faces accounting for 5% of pixels draw about 80% of attention in multi-face videos, and one face in each frame draws most subjects' attention; 2) humans tend to focus on the face close to the center of the video. Xia et al. [59] explore saliency with user bias, and [247] also investigates video saliency.

Conflicts of Interest:
The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.