AI generated highlights — Machine learning for automatic video summary generation

Farhang Tarlan
7 min read · Oct 9, 2018

Videos constitute a large portion of our information intake these days. In fact, video streaming accounted for 73% of the internet traffic in 2016 and is projected to take up as much as 82% by 2021.

This increase in video usage calls for a robust method to store visual content in a searchable database.

In particular, YouTube reported in 2017 that searches for sports highlight videos had increased by 80% over the previous year.

I watch a lot of sports highlights — tennis, soccer, volleyball, basketball, you name it. I thought I could write software to automatically generate these highlights for myself! The rest of this blog post describes how I wrote a piece of software that takes in a full tennis match video (hours long) from YouTube and turns it into a highlights video (minutes long).

I try to provide an overview of how I go about summarizing a tennis match, along with some intuition for why the pipeline here is suitable for the task. Each module of the algorithm (see Figure 1) is described in more detail below.

Figure 1 | Pipeline of the software, taking in a video of a full tennis match from YouTube, and outputting a summary highlights video from it. The software is composed of a Scene change detection module and a Human detection and tracking module.

The information in a tennis match video comes in two streams: the audio stream and the video stream. Notice that there is a pattern in the video stream: a series of frames that correspond to match play is followed by a series of frames that correspond to a break. This pattern repeats (with some outliers).

The task here is to segment the full tennis match video into smaller segments, such that each segment contains only one scene (i.e. no sudden camera movement within the segment). Some of these scenes will correspond to tennis points that will eventually make it into the final highlights video, while most of the others are scenes irrelevant to the final highlights video.

Once we have the tennis match segmented into a series of video segments, we need a way to decide which segments are tennis points. We also need a way to decide which tennis points make it into the final summary highlights. The Human detection and tracking module does exactly that: it defines a metric, based on the players' movement, to determine whether a video segment shows a tennis point and whether that point is worthy of the final highlights summary.

Scene Change Detection

The scene change detection module takes in a video and returns a list of frame numbers, after which a scene change happens. Figure 2 shows two sets of consecutive frames. The left panel shows two consecutive frames that do not represent a scene change. The right panel shows two consecutive frames that do represent a scene change. Comparing corresponding pixels in consecutive frames gives an indication of similarity between frames (Equation 1).

Figure 2 | shows two sets of consecutive frames. The panel on the left shows two consecutive frames that do not represent a scene change. Any two corresponding pixels show minimal difference. The panel on the right shows two consecutive frames that do represent a scene change. Any two corresponding pixels are considerably different.

Let's make the empirical observation that most pairs of consecutive frames do not represent a scene change. So the scene change detection algorithm first rejects — with high probability — the frames that do not indicate a scene change. This saves computational time, avoids unnecessary calculations, and allows us to afford a robust human detection algorithm later (see below).

This step is done by comparing corresponding pixels in consecutive frames. In mathematical terms, this is the discrete first derivative of the video with respect to time.

Equation 1: D_t = Σ_{(i,j)} | f_t(i,j) − f_{t−1}(i,j) |

where f_t(i,j) is the intensity of frame t at location (i,j). Frames that do not represent a scene change end up with a small D_t. You can see below that ~90% of the frames have a relatively small D_t and also do not represent a scene change.

Figure 3 | shows the distribution of D_t values in a video. ~90% of the frames have a relatively small D_t and also do not represent a scene change.
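To make this step concrete, here is a minimal sketch of the D_t computation using OpenCV and NumPy. It assumes grayscale intensities and sums absolute pixel differences between consecutive frames; the original implementation may differ in details such as color handling or downsampling.

```python
import cv2
import numpy as np

def frame_differences(video_path):
    """Compute D_t = sum over (i,j) of |f_t(i,j) - f_{t-1}(i,j)| for consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    D, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if prev is not None:
            D.append(np.abs(gray - prev).sum())  # small D_t suggests no scene change
        prev = gray
    cap.release()
    return D
```

A simple cut on these values (e.g. keeping only the largest ~10%) then yields the scene change candidates that move on to the next step.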

Using D_t, we can reject most of the frames as non-scene-change frames and only further investigate the remaining ~10% as potential scene change frames. We do this by contrast-enhancing the ~10% scene change candidate frames through a histogram equalization process. The resulting normalized images are used to compute the rate of change of frames (D*), the rate of the rate of change of frames (dD*), and the deviation of the frame from the mean image (var*). The variables above are superscripted with a * to indicate that the operations are performed on the histogram-normalized image. These operations are mathematically defined below.

Equations 2 through 5

A frame is considered to represent a scene change based on the relative values of D*, dD*, and var* compared to a threshold. For more information on scene change detection, you can read [1].
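A hedged sketch of this refinement step is below: histogram-equalize the candidate frames, then compare D*, dD*, and var* against thresholds. The specific formulas (sums of absolute differences) and the combined decision rule are my reading of the description above, not the exact Equations 2 through 5; see [1] for the original formulation.

```python
import cv2
import numpy as np

def refine_candidates(gray_frames, candidate_idx, thr_D, thr_dD, thr_var):
    """gray_frames: list of uint8 grayscale frames; candidate_idx: indices (>= 1) that passed the D_t test."""
    eq = [cv2.equalizeHist(f) for f in gray_frames]   # histogram equalization of every frame
    mean_img = np.mean(eq, axis=0)                    # mean image over the video
    cuts, prev_D = [], None
    for t in candidate_idx:
        D_star = np.abs(eq[t].astype(np.float32) - eq[t - 1]).sum()     # rate of change (D*)
        dD_star = abs(D_star - prev_D) if prev_D is not None else 0.0   # rate of rate of change (dD*)
        var_star = np.abs(eq[t].astype(np.float32) - mean_img).sum()    # deviation from mean image (var*)
        prev_D = D_star
        if D_star > thr_D and dD_star > thr_dD and var_star > thr_var:  # decision rule is an assumption
            cuts.append(t)
    return cuts
```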

In a nutshell, this operation segments the entire video into many small segments (~10³ for an hour-long video), some of which correspond to tennis points.

Our task is now to determine which of these video segments represent a tennis point, and further decide which video segments make it into a highlights summary video. The Human Detection and Tracking module below is a means of doing just that.

Human Detection and Tracking

The goal of the Human Detection and Tracking module is to decide which video segments will make it into the final highlights video. We do this by detecting and tracking the players within each video segment and choosing the segments with the most movement. The hypothesis is that the video segments (i.e. tennis points) with the highest amount of movement are the most interesting ones; the amount of movement in a video segment is taken as a proxy for its merit to make it into the highlights summary video.

The amount of movement within a video segment is computed using a human tracking algorithm (more below). The tracker, however, is initialized with the initial locations of its objects (i.e. the people to track) using a human detection algorithm. I used the faster_rcnn_inception_v2_coco convolutional neural network (CNN) from the open source TensorFlow detection model zoo.

A note on computational complexity: this CNN is relatively intensive, especially when no GPU is available. To make this manageable, I only run the human detection module on the first few frames of each video segment. I suggest you experiment with other CNN architectures.
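As a sketch of how the detector can be queried, the snippet below loads the frozen faster_rcnn_inception_v2_coco graph (TensorFlow 1.x style) and keeps only boxes whose COCO class is "person". The file path, score threshold, and one-session-per-call structure are placeholders for illustration, not the author's exact code.

```python
import numpy as np
import tensorflow as tf  # TensorFlow 1.x, matching the detection model zoo frozen graph

# Hypothetical path to the downloaded frozen graph
PATH_TO_FROZEN_GRAPH = "faster_rcnn_inception_v2_coco/frozen_inference_graph.pb"

detection_graph = tf.Graph()
with detection_graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

def detect_people(frame_rgb, score_threshold=0.5):
    """Return normalized [ymin, xmin, ymax, xmax] boxes for people in one RGB frame."""
    with tf.Session(graph=detection_graph) as sess:
        boxes, scores, classes = sess.run(
            ["detection_boxes:0", "detection_scores:0", "detection_classes:0"],
            feed_dict={"image_tensor:0": np.expand_dims(frame_rgb, axis=0)})
    keep = (scores[0] >= score_threshold) & (classes[0] == 1)  # COCO class 1 = person
    return boxes[0][keep]
```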

Figure 4 below shows the human detection algorithm on a single frame. The two players (Federer & Nadal) are correctly detected. The algorithm also detected ball kids and lines people.

Figure 4 | shows the output of the human detection algorithm on a frame of a video. The two players (Federer & Nadal) are detected as well as other people that appear in the frame. The algorithm was not able to pick up every human.

In every subsequent frame, the positions of the bounding boxes are tracked, and a score corresponding to the amount of movement observed is assigned to the video segment. Figure 5 shows the path of the bounding boxes throughout a video segment.

Figure 5 | shows the track traversed by the detected bounding boxes in a video segment. Some of the bounding boxes do not move — and hence do not contribute to the movement score of the video segment — while other bounding boxes (tennis players) do.
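The post does not name a specific tracker, so the sketch below uses OpenCV's KCF tracker purely as an example: each detected box is tracked through the segment, and the total distance travelled by the box centres is used as the segment's movement score.

```python
import cv2
import numpy as np

def movement_score(frames, initial_boxes):
    """frames: list of BGR frames for one segment; initial_boxes: (x, y, w, h) boxes from the detector."""
    trackers = []
    for box in initial_boxes:
        t = cv2.TrackerKCF_create()   # in newer OpenCV builds: cv2.legacy.TrackerKCF_create()
        t.init(frames[0], tuple(box))
        trackers.append(t)

    centres = [np.array([[x + w / 2.0, y + h / 2.0] for (x, y, w, h) in initial_boxes])]
    for frame in frames[1:]:
        current = []
        for t, prev in zip(trackers, centres[-1]):
            ok, (x, y, w, h) = t.update(frame)
            current.append([x + w / 2.0, y + h / 2.0] if ok else prev)  # keep last position if lost
        centres.append(np.array(current))

    # total distance travelled by all tracked boxes = the segment's movement score
    return sum(np.linalg.norm(centres[i] - centres[i - 1], axis=1).sum()
               for i in range(1, len(centres)))
```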

The result of the human detection and tracking module is a set of video segments, each with an associated score. The segments are sorted by score, and the top video segments are put together into a highlights video. The video below shows an example.

A final note on the current state of the software: notice that in the first point of the above video, the algorithm picks up the movement of the ball kid and counts it towards the total movement in the video segment. This is not desirable, but it reflects the current state of the algorithm, which does not distinguish between different types of movement.
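For completeness, here is a hedged sketch of how the scored segments could be sorted and stitched into the final clip. The use of moviepy (1.x imports), the number of segments kept, and the restoration of chronological order are my choices for illustration; the post does not say which tool was used.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips  # moviepy 1.x imports

def build_highlights(match_path, scored_segments, top_n=20, out_path="highlights.mp4"):
    """scored_segments: list of (start_sec, end_sec, score) tuples from the tracking module."""
    top = sorted(scored_segments, key=lambda s: s[2], reverse=True)[:top_n]
    top.sort(key=lambda s: s[0])  # play the chosen points in chronological order
    match = VideoFileClip(match_path)
    clips = [match.subclip(start, end) for start, end, _ in top]
    concatenate_videoclips(clips).write_videofile(out_path)
```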

How to improve

In this project, I took the amount of movement as the proxy for how interesting a tennis point is. While this does make sense as a measure of interest, it is not comprehensive. Sometimes a tennis point makes it into the highlights just by virtue of when it is played during the match: a break point is perhaps more crucial than a 15–15 point. The message is that other criteria, in addition to or instead of human detection and tracking, may be better suited to decide which video segments should be included in the final summary highlights.

One approach here is to use optical character recognition (OCR): note that there is a score box at the bottom right corner of the screen, which tells us where in the match each point is played. Also note that there are a ton of highlight videos publicly available on YouTube. These highlight videos could be used as a training data set for a machine learning model, possibly a neural network, to pick out the features that led a human to include a particular video segment.
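As a small illustration of the OCR idea, the snippet below crops a hand-picked scoreboard region and runs Tesseract on it. The box coordinates and the pytesseract dependency are assumptions for the sketch, not part of the current software.

```python
import cv2
import pytesseract

def read_scoreboard(frame_bgr, box):
    """box = (x, y, w, h): a hand-picked region around the score box in the bottom-right corner."""
    x, y, w, h = box
    crop = cv2.cvtColor(frame_bgr[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    return pytesseract.image_to_string(crop)
```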

References

[1] Xiaoquan Yi et al., "Fast pixel-based video scene change detection," Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), 2005.
