This paper presents a series of experiments aimed at analysing the QoE for different media-enriched web-based services in a mobile context. An extensive subjective testing campaign was carried out to gauge the satisfaction expressed by end users. Twenty subjects (13 male, 7 female) participated in the tests, with ages ranging from 21 to 44 years (average 28). The context of use was explored in terms of (1) video properties and length, (2) mobile access and (3) the methodology for assessing user experience. Users were asked to evaluate the impact of similar visual impairments in two contexts: (1) using common test video sequences with typical wireless degradations, and (2) introducing similar loss conditions into video sequences extracted from the considered online services.
Video lengths were inferred from the analysis of different online sources, namely BBC News Mobile, YouTube Mobile and Facebook. The duration of clips ranged from 20 s to 600 s, with average values between 94 s and 197 s. In [6], the authors found average durations for videos hosted on Daum (a popular service for user-generated content in Korea) ranging from 30 s for advertisements to 203 s for music videos. Concerning viewing conditions, all sequences were displayed on a mobile handset (screen size of 2.8 inches, resolution of 320×240 pixels) instead of a normalised LCD display, and users were asked to hold the handset in their hands at a free viewing distance (commonly 6-8 times the height of the display). For quality assessments, we decided not to use the recommended continuous assessment method for long video sequences, since quality evaluation tasks may distort the results from a service perspective. Instead, absolute category rating (ACR) was used at the end of each sequence. In addition to quantitative assessments, qualitative evaluations were collected, allowing users to add comments and to stop the play-out if quality was perceived as unacceptable.
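As a concrete illustration of the ACR aggregation used throughout the tests, the following sketch computes a MOS and a 95% confidence interval from 5-point ACR ratings. The ratings shown are hypothetical, not the actual test data:

```python
import math

def mos_with_ci(ratings):
    """Aggregate 5-point ACR ratings into a mean opinion score (MOS)
    with a 95% confidence interval (normal approximation)."""
    n = len(ratings)
    mos = sum(ratings) / n
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)  # sample variance
    ci95 = 1.96 * math.sqrt(var / n)
    return mos, ci95

# Hypothetical ratings from 20 subjects on the 5-point ACR scale
ratings = [4, 3, 4, 5, 3, 4, 2, 3, 4, 4, 3, 5, 4, 3, 2, 4, 3, 4, 5, 3]
mos, ci = mos_with_ci(ratings)  # mos = 3.6 for these sample ratings
```

The per-sequence MOS values reported in the following sections were obtained with this standard aggregation over the panel's ACR votes.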
Short video sequences
In the initial phase, users were asked to evaluate the visual quality of impaired short video sequences. Common test video sequences from traditional visual quality studies were selected, namely "football", "stefan", "carphone" and "suzie". According to the classification in [7], these four sequences provide a good sample of the different quadrants in the spatio-temporal complexity grid, thus taking different content types into account. The video sequences were degraded based on the wireless error model presented therein for mobile Internet connections. Figure 1 illustrates the set of impaired video sequences selected for the aims of this paper and the results from the quality evaluations. For each video clip, we show the evolution of the structural similarity index (SSIM) with respect to the original sequence, as a means of estimating the severity of the impairments from an objective quality metric standpoint. In the rightmost subplots, we present the results obtained from the subjective tests in terms of mean opinion score (MOS). All the obtained quality assessments are quite poor and would be considered unacceptable for a commercial service.
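The per-frame SSIM traces above compare each degraded frame against its original. As a sketch of the underlying computation, the snippet below implements a simplified single-window SSIM (full-frame statistics, no sliding window, standard constants); production tools apply the windowed variant, but the formula is the same:

```python
import numpy as np

def global_ssim(x, y, data_range=255.0):
    """Simplified single-window SSIM over a whole frame: luminance,
    contrast and structure terms computed from full-frame statistics."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

# Identical frames yield SSIM = 1.0; a degraded frame falls below 1
rng = np.random.default_rng(0)
frame = rng.integers(0, 256, size=(240, 320)).astype(float)  # QVGA-sized frame
noisy = np.clip(frame + rng.normal(0, 25, frame.shape), 0, 255)
```

Running this per frame against the reference sequence yields the kind of SSIM-over-time trace plotted in Figure 1.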
Long video sequences
During the second phase, two different types of videos were considered in order to capture different spatio-temporal characteristics. We first analyse the results for a severe degradation of 6 s in the middle of a "high complexity"-"high motion" video sequence, corresponding to an "nba top ten plays of the week" clip of 100 s. Figure 2 illustrates the degradation pattern and the subjective assessments provided by users. In general, the quality evaluations are considerably higher than for short video sequences. In terms of MOS the clip scores 3.31, which can be considered in the lower range of acceptability for commercial services. However, the variability of the quality assessments is substantial compared to short clips with the same group of individuals. These results indicate that user segmentation will be necessary for an accurate inclusion of users' expectations, as described in [3]. Considering the qualitative evaluations provided by users, some of them stated "Very good quality, except a severe degradation in the middle" or "If repeated, it would be unacceptable". As a result, one isolated severe impairment in such service contexts is not enough to make users abort the session, but the provider should enforce quality policies that assure adequate network performance for the rest of the session lifetime.
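The user segmentation suggested by the score variability can be sketched as a simple split of subjects according to their offset from the panel-wide MOS. The per-user ratings and user identifiers below are synthetic, purely for illustration:

```python
def segment_users(ratings_per_user):
    """Split subjects into 'critical' and 'tolerant' segments according to
    whether their mean rating falls below or above the panel-wide MOS."""
    means = {u: sum(r) / len(r) for u, r in ratings_per_user.items()}
    panel_mos = sum(means.values()) / len(means)
    critical = sorted(u for u, m in means.items() if m < panel_mos)
    tolerant = sorted(u for u, m in means.items() if m >= panel_mos)
    return panel_mos, critical, tolerant

# Synthetic per-user ratings over several clips (1-5 ACR scale)
ratings = {"u1": [2, 3, 2], "u2": [4, 4, 5], "u3": [3, 3, 3], "u4": [5, 4, 4]}
mos, critical, tolerant = segment_users(ratings)
```

A provider could then apply stricter quality policies for sessions belonging to the critical segment, along the lines of the segmentation discussed in [3].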
The second experiment introduces diverse degradations in a "low complexity" clip, namely a "talkshow" sketch of 120 s. Figure 3 illustrates the different levels of degradation used in the subjective tests. The top-left plot shows three different sequences, ranging from several light 2 s degradations to one severe 10 s degradation. The associated boxplot (top right) gathers the statistics for the aggregated quality assessments. The obtained results differ considerably from those for short video clips: the visual quality is perceived as fair to excellent, with no comments about acceptability. Hence, results from traditional quality studies are not directly applicable to these media-enriched services. The central plots illustrate the evolution of the image impairments with additional degradations (left) and the associated subjective quality scores (right). Although the obtained quality scores are lower, the experienced quality is still higher than for short clips with less severe degradations. Two individuals stated "I would stop the video if the degradations persist". Thus, once again the variability of the quality scores suggests user segmentation for optimal management. Finally, the bottom-left plot shows the SSIM for a highly degraded video sequence. 80% of the subjects who evaluated this sequence stopped playback before the end of the transmission, and the remaining subjects provided the lowest quality score. However, as illustrated with vertical dotted lines, the acceptability threshold is variable and difficult to gauge with the limited set of tests.
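The abandonment behaviour observed for the highly degraded sequence can be summarised as in the following sketch, which computes the abandonment rate and the mean score over the subjects who finished playback. The session tuples are hypothetical, mirroring the 80% abandonment reported above rather than reproducing the actual logs:

```python
def acceptability_stats(sessions):
    """Summarise abandonment over viewing sessions.
    Each session is (watched_s, clip_s, score), where score is None
    when the subject aborted playback before the end of the clip."""
    aborted = [s for s in sessions if s[0] < s[1]]
    abandonment_rate = len(aborted) / len(sessions)
    scores = [s[2] for s in sessions if s[2] is not None]
    mean_score = sum(scores) / len(scores) if scores else None
    return abandonment_rate, mean_score

# Hypothetical sessions for a 120 s clip: 8 of 10 subjects abort early,
# the remaining 2 watch to the end and give the lowest ACR score (1)
sessions = [(35, 120, None)] * 8 + [(120, 120, 1), (120, 120, 1)]
rate, mean_score = acceptability_stats(sessions)
```

Tracking where in the timeline the aborts occur (the `watched_s` values) would be one way to narrow down the variable acceptability threshold noted above, given a larger test set.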