Reduced-reference Video Quality Metric Using Spatio-temporal Activity Information

Monitoring and maintaining an acceptable Quality of Experience is of great importance to video service providers. The perceived visual quality of video transmitted over wireless networks can be degraded by transmission errors. This paper presents a reduced-reference (RR) video quality metric of very low complexity and overhead that uses frame-based spatial (SI) and temporal (TI) activity levels to monitor the effect of channel errors on video transmitted over error-prone networks. The performance of the metric is evaluated relative to that of a number of full- and reduced-reference metrics. The proposed metric outperforms some of the most popular full-reference metrics whilst requiring very little overhead.

Spatial activity has been used before in [13], [14] to form an RR method that estimates the PSNR of a received sequence. This RR metric was extended to NR video quality estimation in [15]. The method of [17] employs perceptual weighting parameters to estimate the quality of the received video through activity difference values between the transmitted and received videos. This method is fairly complex and produces sizeable side information (one value per block of pixels). The rest of the paper is organised as follows. Section 2 describes the proposed RR metric (STIRR). Section 3 describes the evaluation procedure followed and presents the collected results, including performance comparisons. Finally, conclusions and suggestions for future work are given in Section 4.

Research Method
One of the factors affecting perceptual video quality is the amount of spatial and temporal detail in a video [16]. ITU Recommendation P.910 specifies how to calculate SI and TI for the purpose of characterizing video complexity. To calculate the SI value of one video frame, a Sobel filter is first applied to the luminance values. The SI value of frame F_n at time n is then equal to the standard deviation, over space, of the image resulting from convolving frame F_n with the Sobel kernel:

SI_n = std_space[Sobel(F_n)].

TI is calculated by subtracting two successive frames and taking the standard deviation of the resulting residual frame:

TI_n = std_space[F_n(i, j) − F_{n−1}(i, j)],

where F_n(i, j) is the pixel value at row i and column j of the n-th frame. At the receiver end, the SI and TI values of the received video are calculated. The STIRR value for the n-th frame is equal to the Euclidean distance between the (SI, TI) feature vector of the transmitted video frame and that of the received video frame:

STIRR_n = sqrt((q_1 − p_1)^2 + (q_2 − p_2)^2),

where q = (q_1, q_2) are the coordinates of the received frame's (SI, TI) vector and p = (p_1, p_2) are those of the transmitted frame. Frame-by-frame STIRR values are averaged over the length of a Group of Pictures (GOP), with the IDR frames excluded.
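The per-frame computation above can be sketched as follows. This is a minimal illustration based on the ITU-T P.910 definitions of SI and TI, not the authors' implementation; the function names and the GOP-averaging helper are ours.

```python
import numpy as np
from scipy import ndimage

def spatial_information(frame):
    """SI per ITU-T P.910: std-dev of the Sobel gradient magnitude of the luma frame."""
    f = frame.astype(float)
    gx = ndimage.sobel(f, axis=1)  # horizontal gradient
    gy = ndimage.sobel(f, axis=0)  # vertical gradient
    return float(np.std(np.hypot(gx, gy)))

def temporal_information(frame, prev_frame):
    """TI per ITU-T P.910: std-dev of the difference between successive luma frames."""
    return float(np.std(frame.astype(float) - prev_frame.astype(float)))

def stirr(p, q):
    """Euclidean distance between (SI, TI) vectors p (transmitted) and q (received)."""
    return float(np.hypot(q[0] - p[0], q[1] - p[1]))

def gop_stirr(per_frame_values, gop_size=10):
    """Average per-frame STIRR over each GOP, skipping the IDR frame that opens it."""
    vals = np.asarray(per_frame_values, dtype=float)
    return [vals[g + 1 : g + gop_size].mean()
            for g in range(0, len(vals), gop_size)]
```

Only the two scalars SI_n and TI_n per frame need to be conveyed to the receiver, which is what keeps the side information so small.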

Results and Analysis
We are interested in the use of the STIRR metric as an indicator of the effects that packet errors (missing packets) have on the quality of compressed video. We assume that the video is transmitted over error-prone wireless channels and that the video decoder at the receiver end has error concealment capabilities. The performance of the decoder's error concealment module depends on the actual concealment method, the affected video content, the error resilience of the compressed video, and the severity of the errors (packet error rate and nature of errors). In effect, we wish to estimate whether the quality of the video after concealment is acceptable, so that the network link can adapt to a more robust mode when it is not.

Simulation Setup
We simulated wireless video transmission over IEEE 802.11n wireless networks by dropping packets according to error patterns produced by a compliant IEEE 802.11n PHY-layer simulator [17]. The received video streams were decoded and concealed using previous frame copy (PFC) as well as motion copy (MC) concealment. The resolution of the test sequences used (CrowdRun, PrincessRun and DanceKiss) was 1920x1080 at 50 frames per second. Figure 1 shows a plot of the spatio-temporal activity indicators of the three test sequences (HD). DanceKiss at the bottom-left has the lowest spatio-temporal activity while PrincessRun at the top-right has the highest. All sequences contained 500 frames and were encoded using the JM H.264/AVC reference software (JM18.0, high profile) with an IPPP GOP of size 10. IDR frames were assumed to be error free, and one slice was set to be equal to one row of blocks.
The transmission modes and the video bit rates tested are summarized in Table 1. The settings used for the IEEE 802.11n simulation were as follows: MMSE detection, 800ns guard interval, channel model B (Non Line-of-Sight residential environment). For each of the three transmission modes, we tested three packet error rates (1%, 2%, and 4%), corresponding to three different channel signal-to-noise ratios. For each packet error rate, ten simulation runs (ten error patterns) were performed with a different starting point for the errors.
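The loss-application step can be sketched as follows. This is a simplified illustration, not the simulator of [17]: the helper names are ours, and the random pattern generator draws i.i.d. losses at a target packet error rate, whereas a PHY-layer simulator produces bursty, channel-dependent patterns.

```python
import random

def apply_error_pattern(packets, pattern):
    """Drop every packet whose position is marked 1 in the binary error pattern."""
    return [pkt for pkt, err in zip(packets, pattern) if not err]

def random_pattern(n_packets, per, seed=0):
    """Bernoulli loss pattern at packet error rate `per` (illustrative only;
    real 802.11n loss traces are bursty and channel-dependent)."""
    rng = random.Random(seed)
    return [1 if rng.random() < per else 0 for _ in range(n_packets)]
```

Varying the seed plays the role of the different error-pattern starting points used in the ten simulation runs.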

Results
The hypothesis behind this experiment is that increases in the distortion of the received video due to channel errors result in increased differences between the STIRR values of the transmitted and the received (concealed) video. To test this hypothesis, we compared the STIRR difference values with three established objective quality metrics: PSNR, SSIM and VIFP. More specifically, we measured the Pearson correlation coefficient between the STIRR difference values and the quality of the received video as measured by the three selected metrics. Table 2 and Table 3 show correlation results for the cases of previous frame copy concealment and motion copy concealment respectively.
The average correlation for all sequences and all metrics was 0.8 for PFC and 0.78 for MC, with values ranging from highs of 0.952 (CrowdRun, SSIM, MC) to lows of 0.515 (PrincessRun, SSIM, MC).
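For reference, the Pearson coefficient used in this comparison can be computed as below. This is a generic sketch, not the paper's evaluation script; note that since STIRR measures distortion while PSNR/SSIM/VIFP measure quality, the raw coefficient between them would typically be negative.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two score vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

The same quantity is available as `np.corrcoef(x, y)[0, 1]`; the explicit form above just makes the centering and normalisation visible.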
We additionally evaluated the performance of STIRR on the LIVE Video Quality Database [18], [19] (wireless transmission errors, motion copy concealment). Six different reference videos were used (Station, Tractor, River Bed, Shield, Mobile & Calendar and Blue Sky) with four error patterns per reference video. The strength of the wireless distortion in these videos was manually adjusted so that the distorted videos are separated by clearly different levels of perceptual distortion. The SI and TI values of these sequences are also shown in Figure 1. Table 4 presents a summary of the performance results obtained with the tested FR and RR quality metrics using the LIVE database. The results show that despite its very low complexity, STIRR is able to outperform some of the FR metrics tested (PSNR, SSIM, VIFP). The reduced-reference metrics STRRED and STIS-SSIM perform better than STIRR but generate significantly more side information and thus incur much more overhead. Overhead is normalised with respect to the number of pixels in one frame (P). In addition, our method exhibits very little complexity relative to all other methods (except PSNR), as shown in Table 4. Complexity was measured as the average execution time on an Intel i7-2600 CPU @ 3.40GHz PC and was normalised relative to the execution time of PSNR. All tested metrics were realised in Matlab except MOVIE, which is realised in C.

Conclusion
In this paper we described STIRR, a very low overhead reduced-reference metric that uses the spatio-temporal activity values of a transmitted sequence to estimate the quality of the received video in the presence of errors. STIRR was found to correlate adequately with quality values estimated by a number of full-reference objective quality metrics. STIRR was also shown to outperform some full-reference metrics when tested on the wireless distortion part of the LIVE video database. Future work will concentrate on improving the performance of the metric through the use of further information regarding the channel and the SI/TI levels of the transmitted sequence.