
Lehrstuhl für Kommunikationsnetze I

Prof. Dr.-Ing. C. Wietfeld

Diploma Thesis

Design, Realization and Evaluation of a Push-to-Video Service

Sulejman Mundzic

Advisor: Dipl.-Ing. Jörn Seger


Submitted on: July 9, 2006
Declaration

I hereby declare that I have written this thesis independently and without outside help. All literature and other resources used are fully acknowledged.

Dortmund, July 9, 2006



Contents

Declaration III

Abbreviations VII

Abstract VIII

1 Introduction 1

2 Preliminaries 3
2.1 Linux Support for Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 X Window System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 Linux Kernel API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 V4L and V4L2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 MPEG Video Codecs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 MPEG Compression Techniques . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 H.261 and H.263 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3 AVC/H.264 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Real Time Data in IP Networks . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 Push-to-X System 21
3.1 System Architecture Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2 Execution Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 ProtoSim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Push-to-X Client . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.3.1 Program Flow during Media Transmission . . . . . . . . . . . . . . . 28
3.4 Push-to-X Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4 Design Approach and Hardware Components 31


4.1 System Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 Hardware Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.1 Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3.2 Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5 Realization 39
5.1 Video Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.1.1 Video Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5.1.2 Initiation and Finalizing of Video Transmission . . . . . . . . . . . . 43

5.2 Capture Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43


5.2.1 Opening the Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.2 I/O Method Selection . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.3 Memory Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.4 Format Negotiation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.5 Video Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.3.1 GUI - Backend Communication . . . . . . . . . . . . . . . . . . . . . 49
5.3.2 Identifying Pixel Format . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.3 Video Data Transfer to X Window Server . . . . . . . . . . . . . . . 51
5.3.4 Video Data Transfer from Video Layer . . . . . . . . . . . . . . . . . 52
5.4 Codec Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.4.1 Bit Rate Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.5 Supported Pixel Formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.6 Network Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.6.1 Peer-to-Peer versus Multicast . . . . . . . . . . . . . . . . . . . . . . 61
5.6.2 Push-to-X Proxy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

6 Tests and Results 67


6.1 Bandwidth Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1.1 Video Traffic Profile in Comparison to Voice . . . . . . . . . . . . . . 68
6.2 Video Quality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

7 Conclusion and Future Work 76

Bibliography 78

Abbreviations

ALSA Advanced Linux Sound Architecture
AVC Advanced Video Coding
API Application Programming Interface
DCT Discrete Cosine Transform
GOP Group of Pictures
GPRS General Packet Radio Service
GUI Graphical User Interface
IDCT Inverse Discrete Cosine Transform
IP Internet Protocol
ISDN Integrated Services Digital Network
ITU International Telecommunication Union
I/O Input/Output
LAN Local Area Network
MAC Media Access Control
MPEG Moving Picture Experts Group
OMA Open Mobile Alliance
PTT Push to Talk
PTX Push to X
QoS Quality of Service
UDP User Datagram Protocol
UMTS Universal Mobile Telecommunications System
V4L Video for Linux
V4L2 Video for Linux Two
VCEG Video Coding Experts Group
VLC Variable Length Coding
VoIP Voice over Internet Protocol
WAN Wide Area Network
WLAN Wireless Local Area Network

Abstract

The main objective of this master thesis is the development of a Push-to-Video service. The Push-to-Video service is part of the Push-to-X system being developed at the Communication Networks Institute at the University of Dortmund. The video functionality is implemented as a Push-to-X client component which collects source video data from a web camera attached to the system, compresses it and transmits it over the Internet. It also decompresses received video data and displays it on the receiver's screen.

The Push-to-X client is designed to run under the Linux operating system. The video component uses the Video for Linux kernel API to interface with the video capture device and the X Window System to display the video data. The display functionality is embedded in the Qt-based graphical user interface. Video data is compressed using the MPEG family of codecs. A proxy server was developed to support the use of the service in narrow-bandwidth networks. The developed system was evaluated with regard to the quality of the video sequence depending on the available bandwidth and the packet loss rate.

1 Introduction

The emergence of camera-equipped cellular phones and hand-held PCs, together with the steadily growing bandwidth capacity of new-generation IP-converged communication systems, opens up opportunities for the development of new Internet applications providing real-time video services that are affordable for everyone. The objective of this master thesis is to develop an application providing a Push-to-Video service that enables an instant "see what I see" experience. The Push-to-Video service extends the Push-to-X system being developed at the Communication Networks Institute at the University of Dortmund. The current Push-to-X system provides a VoIP-based Push-to-Talk service for group communication. The system has a client-server architecture, where the server plays a supporting role providing user and group administration services. The Push-to-X client application uses a "buddy list" style GUI with user-defined groups and provides presence and availability information. Users registered with the server for a group communicate in a peer-to-peer half-duplex manner using a single dedicated push button.

The voice and video services represent two modes of communication, both from a user and a technical perspective. From the user perspective they differ in their speed of information delivery and in the amount of attention required to use them. In contrast to the voice service, the video service conveys large amounts of information almost instantly but requires the full attention of the receivers. From the technical perspective they differ in their real-time requirements, data fragmentation and bandwidth demand. With regard to the real-time requirement, voice is more sensitive to delay and data loss than video. For example, voice data is delivered every 20 ms for telephone-quality voice, while video data captured at a frame rate of 25 fps provides broadcast video quality. However, representing video data requires a large amount of information, leading to a high bandwidth demand. Applications deal with the high bit rates by using one of the video compression techniques, which increases computational complexity and power dissipation. Thus, battery-powered wireless devices with limited computational resources can represent a bottleneck in the communication system. In addition, the wireless link usually has a high transmission error rate because of external conditions affecting the transmission channel.

Collecting the source video data from the camera and displaying it on the screen is no easier than transmitting it over the network. In the Linux PC system used as the target platform for the Push-to-X service, video data captured by the camera typically passes through the system's USB port and the camera device driver located in kernel memory space before it reaches the Push-to-X application. In order to meet the real-time requirement, the data must be transferred from the camera and encoded in less time than one video frame interval. On the receiving side the video data passes through the GUI to be displayed by the X server. The Push-to-X application ensures interoperability by providing support for the major video data formats used by the underlying video hardware and software.

In order to deal with the complexity of video data processing, the Push-to-Video software is divided into four major components. VideoCapture interfaces with the web camera, VideoDisplay is responsible for displaying video frames, AVCodecHandle is responsible for encoding and decoding video data, and VideoLayer is the controlling component providing the video functionality to the Push-to-Video system. The video functionality of the Push-to-X system is described in the following chapters. A summary of the contents of the chapters is as follows. Chapter 2 provides the background information necessary to understand the Linux development platform, along with the interfaces it provides to applications for using the video hardware, including the X Window graphical system used to present the video data. It also points out the specifics of real-time data handling in IP networks and gives a review of the MPEG family of codecs and compression techniques. Chapter 3 gives a brief survey of the Push-to-X system with an emphasis on the system architecture and the Push-to-Talk functionality. Further, it presents the PTX protocol and its role in communication during the signaling phase and multimedia transmission. It also gives a brief summary of the ProtoSim system from the perspective of the Push-to-Video implementation. Chapter 4 approaches the software development, describing the system requirements and the software engineering process. It also describes the hardware components used during the implementation phase. Chapter 5 describes in detail the Push-to-Video software components, including a Push-to-X proxy server providing an access link to Push-to-X clients from a narrow-band network. Chapter 6 presents the results with a focus on video quality and bit rate. Finally, Chapter 7 presents the conclusion of the thesis together with several interesting open problems and directions for future research.

2 Preliminaries

This chapter discusses the major components involved in capturing, transmitting and displaying video data on the Linux operating system, as well as the concepts and terms necessary to understand video communication in IP networks.

2.1 Linux Support for Video

Linux provides support for video applications through two Application Programming Interfaces (APIs), Video for Linux and Video for Linux Two. Within the Linux developer community these two interfaces are referred to as V4L and V4L2 for short, and this document uses these shorter names in the remaining text. As the names suggest, V4L2 is the newer interface and is designed to replace the original V4L interface. However, many older video drivers that implement only the V4L API are still in use, so new applications implement support for both APIs. The two APIs are described in section 2.1.3 in more detail.

The V4L interface does not provide display functionality; instead, Linux video applications use the X Window System, the standard Linux graphical system, to display bitmaps. Section 2.1.1 deals with the X Window System in more detail.

2.1.1 X Window System

The X Window System is a network-oriented graphical system developed by the X Consortium. Figure 2.1 shows the X system architecture.

Figure 2.1: X System Architecture

The X system has a client-server architecture, allowing X applications running remotely or locally to use graphical resources on a local X server host by communicating over the X protocol. The X protocol uses the reliable TCP protocol as the underlying transport protocol for communication over the network, while for applications running on the same host as the X Window server it uses UNIX sockets for performance reasons. Video applications, however, require the X server to run on the local host because of the high data throughput. For that purpose the X system provides numerous extensions enabling efficient data transfer from client to server. In particular, for systems with video display hardware the X protocol provides the so-called Xv extension to the standard X protocol, further improving the transfer and handling of video data.

The X server manages the host I/O hardware resources (a keyboard, a pointing device and one or more screens) and provides the higher-level graphical elements (windows, fonts, lines etc.), the so-called X server side resources, over the X protocol in a hardware-independent way. The X server side resources used by video applications to display natural images are described below.

• Windows and pixmaps, commonly called drawables, denote resources used as sources and destinations in graphics operations. Pixmaps are off-screen entities used for data manipulation. A window contains the data shown on the screen when the window is mapped; in X terminology, mapping denotes the act of showing graphics on the screen. An important property of drawables is their visual class, which is also an X server resource, described below.

• Visuals specify the way image data is presented by the server to the display hardware. Among other things, they define how the image color components are allocated in memory. There are three basic types of visuals, which belong to one of six visual classes; only the color visuals falling into the TrueColor and DirectColor visual classes are of concern for the Push-to-X client application. These visuals do not reinterpret the pixel values being displayed on the screen. Another important property of a visual is its color depth, the number of significant bits used to encode the value of a pixel. The visual data structure is not visible to client programs directly; rather, it is represented by another data structure called XVisualInfo.

• Graphics contexts specify parameters such as the line style and foreground color used by the X server for drawing graphics primitives such as lines, rectangles and text. They are of no concern for displaying natural images; however, the X system requires a graphics context when displaying images in general.

A client references an X server side resource by its ID as shown in figure 2.2.

Figure 2.2: X Resource References

Applications communicate with a server via calls to an intermediary low-level library of C language routines called Xlib. The client-server communication is asynchronous, meaning that an application sending a request to the server via Xlib does not wait for the server's reply before continuing program execution. When the server responds, Xlib converts the X message into a C union structure called XEvent and stores it in an internal queue. The application typically polls Xlib for incoming events in a so-called event loop. Xlib provides an image structure that is capable of storing all the data corresponding to a screen area or pixmap. The major difference between an image and a pixmap is that an image is a structure on the client side, so its contents can be manipulated directly by the client instead of solely through X protocol requests. Xlib provides the routines XGetImage() and XPutImage(), which use the X protocol to transfer the contents of a window or pixmap into an image structure and to write the contents of an image structure back into a window or pixmap. Video applications running locally on the server machine may use the IPC X extension for transferring images to the server if Xlib, the X server and the kernel support it. Using the IPC mechanism avoids copying data between client and server memory space and increases system performance.
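As a rough illustration of this path, the sketch below pushes one client-side image into a window with XPutImage(). It is a minimal example, not code from the Push-to-X sources, and it assumes a 24-bit TrueColor default visual; a real client would also run an event loop and wait for an Expose event before drawing.

/* Minimal sketch: transfer one client-side image to the X server. */
#include <X11/Xlib.h>
#include <stdlib.h>

int main(void)
{
    Display *dpy = XOpenDisplay(NULL);          /* connect to the X server      */
    if (!dpy) return 1;

    int scr = DefaultScreen(dpy);
    int width = 352, height = 288;              /* CIF-sized preview window     */
    Window win = XCreateSimpleWindow(dpy, RootWindow(dpy, scr), 0, 0,
                                     width, height, 0,
                                     BlackPixel(dpy, scr), BlackPixel(dpy, scr));
    XMapWindow(dpy, win);                       /* "map" = show on the screen   */

    /* Client-side image buffer; 4 bytes per pixel for a 24-bit depth visual.   */
    char *data = calloc((size_t)width * height, 4);
    XImage *img = XCreateImage(dpy, DefaultVisual(dpy, scr), DefaultDepth(dpy, scr),
                               ZPixmap, 0, data, width, height, 32, 0);

    /* ... fill 'data' with decoded video samples here ...                      */

    /* Write the image into the window via the X protocol. In a real client     */
    /* this happens inside the event loop, after the first Expose event.        */
    XPutImage(dpy, win, DefaultGC(dpy, scr), img, 0, 0, 0, 0, width, height);
    XFlush(dpy);

    XDestroyImage(img);                         /* also frees 'data'            */
    XCloseDisplay(dpy);
    return 0;
}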

2.1.2 Linux Kernel API

Linux application programs access peripheral hardware devices through a device driver. A device driver is a piece of software, usually executed in kernel memory space, which provides a mechanism for using the hardware devices attached to the system. A device driver and an application communicate independently of the hardware by implementing a set of C functions referred to as the kernel API or system calls. The kernel API defines the way data and information are exchanged between a user application and the kernel.

The most prominent system calls are open and close for establishing and closing communication between a device and the process running the user application, read and write for transferring data between the device and the user application, and ioctl and fcntl for controlling the device behavior by querying and setting device parameters. In addition, a device driver usually implements the select and poll system calls. These two system calls take advantage of the Linux multitasking execution model for performing I/O operations, improving overall system performance.
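The sketch below illustrates these system calls on a character device. The device path /dev/video0 is only an example, and real V4L2 capture code uses ioctl() and memory mapping rather than plain read() (see section 2.1.3).

/* Illustrative sketch: wait with select() until a device has data, then read(). */
#include <fcntl.h>
#include <unistd.h>
#include <sys/select.h>
#include <stdio.h>

int main(void)
{
    int fd = open("/dev/video0", O_RDONLY);     /* open the capture device      */
    if (fd < 0) { perror("open"); return 1; }

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    struct timeval tv = { 2, 0 };               /* give up after two seconds    */

    /* Block until the driver signals that data is available.                   */
    if (select(fd + 1, &rfds, NULL, NULL, &tv) > 0) {
        char buf[4096];
        ssize_t n = read(fd, buf, sizeof(buf)); /* transfer data to user space  */
        printf("read %zd bytes\n", n);
    }

    close(fd);
    return 0;
}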

2.1.3 V4L and V4L2

The part of the Linux kernel API for controlling video devices and transferring video data between a device driver and user applications is called the Video for Linux API. This application interface is going through a transitional phase: the Video for Linux Two (V4L2) specification is intended to replace the initial V4L API. According to the V4L2 API specification, V4L2 was introduced to improve the original V4L design and to simplify the video device driver implementation, while at the same time providing an easy-to-use interface to user applications. Otherwise both interfaces are functionally equivalent, so in the remaining text the main focus lies on the use of V4L2.

Linux video applications rely on a driver implementing one of these two APIs. As noted in section 2.1.2, V4L and V4L2 device drivers are based on the standard Linux system calls such as open, close, read, mmap, etc. and their associated data structures, where ioctl operates on data structures specific to the device it controls. V4L and V4L2 define the data structures used in the ioctl system call, enabling an efficient data exchange between driver and applications, as described below.

ioctl is a generic and extensible system call for performing different I/O operations, defined as:

int ioctl(int fd, int request, void* data);

The V4L and V4L2 APIs use this property of the ioctl system call as the primary means of defining the Linux video interface. ioctl takes three arguments: a file descriptor denoting the character device driver with which the application wants to communicate, a device-dependent request code, and a pointer to a data structure containing the information exchanged between the calling application and the device driver. The pointer to the data structure is a void pointer; its type is implicitly defined by the device-dependent request code. The request code also determines whether the information is passed IN to the driver or OUT from the driver. The memory for the data structures used in the ioctl system call is always allocated by the application. The V4L and V4L2 APIs define a set of device-dependent request codes and associated data structures, which are exhaustively elaborated in the Video4Linux Programming and Video for Linux Two API Specification documents. These documents also specify the V4L and V4L2 standard digital image data formats used to exchange images between drivers and applications, which both sides should interpret the same way.
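As a concrete illustration of this convention, the sketch below issues the standard V4L2 VIDIOC_QUERYCAP request: the application allocates the structure, passes its address to ioctl(), and the driver fills it OUT. It is a minimal example, not code from the thesis.

/* Query the capabilities of a V4L2 capture device via ioctl(). */
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/videodev2.h>

int main(void)
{
    int fd = open("/dev/video0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct v4l2_capability cap;
    memset(&cap, 0, sizeof(cap));               /* memory owned by the application */

    if (ioctl(fd, VIDIOC_QUERYCAP, &cap) == 0) {
        printf("driver: %s, card: %s\n", cap.driver, cap.card);
        if (cap.capabilities & V4L2_CAP_VIDEO_CAPTURE)
            printf("device supports video capture\n");
    } else {
        perror("VIDIOC_QUERYCAP");              /* driver probably supports only V4L */
    }

    close(fd);
    return 0;
}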

The use of the V4L2 API for the implementation of the Push-to-Video service is described in detail in section 5.2. The remainder of this section reviews the pixel formats supported by the V4L and V4L2 APIs.

Pixel Formats

In general, a pixel format specifies all aspects of color encoding, including the color model, the pixel depth, the scheme used for sampling color values, and the arrangement of pixel component values in computer main memory.

With regard to the color model, the pixel formats supported by the V4L and V4L2 APIs fall into two broad categories: RGB and YUV formats. RGB formats use the familiar additive RGB color model to specify red, green and blue color component values. These formats are designed to match the pixel formats of typical PC graphics frame buffers. They occupy 8, 16, 24 or 32 bits per pixel. They are all packed-pixel formats, meaning that all the data for a pixel lies next to each other in memory.

YUV formats are the formats native to TV broadcasting. They separate the brightness information from the color information. Y represents the pixel brightness or luminance, while the two chrominance components U (or Cb) and V (or Cr) represent the blue and red color differences, respectively. The green color is encoded in the luminance component. The two color models are related by the following transform equations:

Y = 0.299 × R + 0.587 × G + 0.114 × B
U = −0.299 × R − 0.587 × G + 0.886 × B
V = 0.701 × R − 0.587 × G − 0.114 × B

R = Y + V
G = Y − 0.19421 × U − 0.50937 × V
B = Y + U

The ITU-R 601 and ITU-R 656 recommendations specify digital YUV formats for broadcasting systems, which are also used by most other video applications, including video compression systems.

Digital systems allocate one byte to represent each component. The analog YUV components are then recovered from the coded values according to the following relations:

Y = (255/219)(Y' − 16)
U = (127/112)(Cb − 128)
V = (127/112)(Cr − 128)
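The following small helper implements the color-model transform given above (with U = B − Y and V = R − Y) for components normalized to the range [0, 1]. It is purely illustrative and is not the color conversion used by the codecs, which operate on the 8-bit ITU-R 601 form.

/* RGB to YUV and back, directly following the equations above. */
void rgb_to_yuv(double r, double g, double b, double *y, double *u, double *v)
{
    *y =  0.299 * r + 0.587 * g + 0.114 * b;   /* luminance                */
    *u = -0.299 * r - 0.587 * g + 0.886 * b;   /* blue difference, = b - y */
    *v =  0.701 * r - 0.587 * g - 0.114 * b;   /* red difference,  = r - y */
}

void yuv_to_rgb(double y, double u, double v, double *r, double *g, double *b)
{
    *r = y + v;
    *g = y - 0.19421 * u - 0.50937 * v;
    *b = y + u;
}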

YUV formats are either packed formats, with all the data for a pixel lying next to each other in memory, or planar formats, with the three components separated into three sub-images or planes.

2.2 MPEG Video Codecs

The purpose of a video codec is to compress digital data by removing redundant information from the data source prior to transmission and to insert it back at the receiver side. In addition, most codecs also irreversibly remove information that is less likely, or harder, to be perceived by the observer at the receiving end.

This section discusses the video codecs developed by the Moving Picture Experts Group (MPEG) committee and gives an overview of the compression techniques used in the MPEG codec family.

2.2.1 MPEG Compression Techniques

The MPEG video codecs treat video data coming from a digital camera under two distinct aspects. The first aspect accounts for the correlation between neighboring pixels, both within a single frame and across frames. The second aspect accounts for the human eye's inability to distinguish fine spatial detail and its diminished sensitivity to detail near object edges.

Figure 2.3 shows the relation between the properties of the source video data and the major operations performed in an MPEG encoder, whose block diagram is shown in figure 2.4.

Figure 2.3: MPEG Processing Stages (temporal redundancies → motion estimation/compensation, spatial redundancies → DCT, human eye sensitivity → quantization, source data entropy → VLC)

The encoder operates in two modes, the so-called INTRA mode and INTER mode, dividing a continuous sequence of video frames into so-called Groups of Pictures (GOPs), shown in figure 2.5. GOPs are decoded independently of each other, contributing to the robustness of an encoded sequence. Pictures encoded in INTRA mode, so-called I-pictures, are coded for spatial redundancy using the discrete cosine transform (DCT). Pictures encoded in INTER mode, so-called P-pictures, are coded with reference to the previous I- or P-picture, accounting for temporal redundancies. If the coded picture also contains references to a following future picture it is called a B-picture, where B stands for bidirectionally predicted. Though B-pictures may substantially reduce the bit rate of the video sequence, they introduce additional latency and require the video sequence to be reordered. Time-sensitive applications therefore avoid the use of B-pictures.

Figure 2.4: MPEG Encoder Block Diagram

Figure 2.5: MPEG Group of Pictures (e.g. I P P P P P | I P ...)

In order to perform coding, the pixels of a picture are arranged in 8×8 blocks, as shown in figure 2.6. A two-dimensional DCT is performed on each picture block, translating the 64 spatial sample values into 64 DCT coefficients according to the DCT transform equation:

F(u, v) = (2/N) C(u) C(v) Σ_{x=0}^{N−1} Σ_{y=0}^{N−1} f(x, y) cos((2x+1)uπ / 2N) cos((2y+1)vπ / 2N)

with C(u), C(v) = 1/√2 for u, v = 0 and C(u), C(v) = 1 otherwise.

The inverse discrete cosine transform (IDCT) is defined as:

f(x, y) = (2/N) Σ_{u=0}^{N−1} Σ_{v=0}^{N−1} C(u) C(v) F(u, v) cos((2x+1)uπ / 2N) cos((2y+1)vπ / 2N)

where x, y are spatial coordinates in the image block and u, v are coordinates in the coefficient block.

The value of each DCT coefficient indicates the contribution of a particular combination of horizontal and vertical spatial frequencies to the original picture block. The coefficient corresponding to zero horizontal and vertical frequency is called the DC coefficient. The DCT in itself does not reduce the number of bits required to represent the block; in fact, an accurate inverse DCT of 8-bit sample values requires 11 bits to be allocated for each coefficient. The reduction in the number of bits follows from the fact that, for typical blocks from natural images, the distribution of coefficients is non-uniform. The transform tends to concentrate the energy into the low-frequency coefficients, while most of the other coefficients have a small value near zero. The non-uniform coefficient distribution is a result of the spatial redundancy present in the original image block.

The compression is achieved by selecting a quantization scheme which discards the insignificant, near-zero high-frequency coefficient values and by using a variable length code (VLC) to encode the remaining coefficients. The quantization process takes advantage of the fact that the human eye is not able to distinguish fine spatial detail and is less sensitive to detail near object edges, which are represented by the high-frequency coefficients. Consequently, the high-frequency coefficients are quantized more coarsely than the low-frequency coefficients. In the quantization process the DCT coefficients are divided by a parameter called QUANT, in the range 1 to 31. By varying the value of the QUANT parameter the codec controls the bit rate: large QUANT values produce less bandwidth-demanding output, while smaller QUANT values produce higher quality pictures.
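A minimal sketch of this step is shown below: every coefficient of an 8×8 block is divided by QUANT, so small high-frequency values collapse to zero. Real MPEG and H.26x codecs use a somewhat more elaborate rule and treat the DC coefficient separately; this only illustrates the principle.

/* Uniform quantization of an 8x8 DCT coefficient block (illustration only). */
void quantize_block(const int dct[64], int out[64], int quant)   /* quant: 1..31 */
{
    for (int i = 0; i < 64; i++)
        out[i] = dct[i] / quant;      /* integer division: many values become 0  */
}

void dequantize_block(const int q[64], int out[64], int quant)
{
    for (int i = 0; i < 64; i++)
        out[i] = q[i] * quant;        /* approximate reconstruction at the decoder */
}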

The transformed block with DCT coefficients is scanned in a diagonal zigzag pattern starting
at the DC coefficient to produce a list of quantized coefficient values. The scan pattern is
shown in figure 2.6.

Figure 2.6: DCT Coefficients Zigzag Scanning Scheme (8×8 luminance sample values with 8-bit precision transformed into 8×8 DCT coefficients with 11-bit precision)

The list of DCT coefficient values is then entropy encoded using a variable length code (VLC). The VLC accounts for likely runs of zeros and for the fact that small coefficient values are more likely than large ones. The VLC is based on a Huffman code, which has the property that no complete code word is a prefix of any other.

In INTER mode, motion-compensated inter-frame prediction exploits temporal redundancy by attempting to predict the frame to be coded from a previous reference frame. The encoder first searches the previous reference picture for a block containing the information most closely matching the information in the block being coded. Two common measures for finding the best matching block in the previous frame are the sum of absolute differences (SAD) and the mean squared error (MSE). Blocks containing fast moving objects tend to produce large SAD and MSE values, respectively. If the selected block is shifted relative to the block being coded, a so-called motion vector representing the translational motion of the object is prepended to the predictively encoded DCT block coefficients. Predictively encoded blocks of DCT coefficients represent the differences in sample values between the chosen block in the previous frame and the current coded block, the so-called DCT coefficient residuals.
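The sum of absolute differences for one candidate position can be computed as in the short sketch below; the names and the 8×8 block size are illustrative, not taken from the thesis implementation.

/* SAD between the 8x8 block being coded and a candidate block in the
 * reference frame; 'stride' is the frame width in samples. */
#include <stdlib.h>

int block_sad(const unsigned char *cur, const unsigned char *ref, int stride)
{
    int sad = 0;
    for (int y = 0; y < 8; y++)
        for (int x = 0; x < 8; x++)
            sad += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;   /* the candidate with the smallest SAD defines the motion vector */
}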

The decoder uses the DCT coefficient residuals and the decoded reference picture to restore the current coded picture. In order to ensure accurate decoding, the encoder contains an internal decoder to perform the residual calculation on the decoded reference picture. The decoded picture and the source picture are not identical because of the small distortions introduced during the quantization process.

The figure 2.7 shows the block diagram of an MPEG decoder.

Figure 2.7: MPEG Decoder Block Diagram

2.2.2 H.261 and H.263

H.261

The ITU-T standard H.261 is the first in a series of video codec standards and provides the basis for the more sophisticated H.263, H.263+ and H.264 codecs that followed. The codec comprises an encoder and a decoder, of which only the decoder is specified by the standard, leaving it up to implementations to develop an encoder suitable for a particular application.

The source coder operates on progressively scanned pictures with an approximate frame rate of 30 fps. Lower frame rates are supported by having non-transmitted pictures between transmitted ones. It uses the so-called Common Intermediate Format (CIF) and its quarter-resolution variant Quarter-CIF (QCIF). The dimensions of the pictures are given in table 2.1.

Picture Format Width in Pixels Height in Pixels


CIF 352 288
QCIF 176 144

Table 2.1: H.261 Picture Formats

The pixel format is expressed as luminance and two color difference components (Y, Cb and Cr) according to CCIR Recommendation 601. The Cb and Cr planes are sampled at half the Y rate. This format, known as the YUV 4:2:0 planar format, is described in more detail in section 5.5.
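A sketch of how such a 4:2:0 planar frame is laid out in memory is given below: a full-resolution Y plane followed by quarter-resolution Cb and Cr planes. The structure and names are illustrative only; the QCIF numbers follow from table 2.1.

/* YUV 4:2:0 planar layout: Y plane, then Cb plane, then Cr plane. */
#include <stddef.h>

typedef struct {
    unsigned char *y, *cb, *cr;   /* start of each plane within one buffer */
    int width, height;
} Yuv420Frame;

size_t yuv420_size(int w, int h)
{
    return (size_t)w * h + 2 * ((size_t)(w / 2) * (h / 2));
}

void yuv420_setup(Yuv420Frame *f, unsigned char *buf, int w, int h)
{
    f->width  = w;
    f->height = h;
    f->y  = buf;                               /* w x h luminance samples   */
    f->cb = buf + (size_t)w * h;               /* (w/2) x (h/2) Cb samples  */
    f->cr = f->cb + (size_t)(w / 2) * (h / 2); /* (w/2) x (h/2) Cr samples  */
}
/* For QCIF (176x144): 25344 + 2 * 6336 = 38016 bytes per frame. */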

Figure 2.8 shows a picture structured into Groups of Blocks, macroblocks and blocks, corresponding to the four H.261 layers, where the top layer is the picture layer.

Figure 2.8: Picture Data Structures

The smallest picture structure is the 8×8 pixel block. Four Y blocks, one Cb block and one Cr block form a macroblock. 33 macroblocks are arranged in a Group of Blocks (GOB), where a GOB comprises 16 lines of a CIF picture.

The codec operates at bit rates between approximately 40 kbit/s and 2 Mbit/s.

H.263

The ITU-T standard H.263 specifies no constraints on the frame rate, which in general may be variable. The standard also extends the set of supported picture formats to the formats given in table 2.2. It also enhances the motion-compensated INTER-frame prediction used by the H.261 coding algorithm by using half-pixel precision for the motion compensation. In addition to H.261, the H.263 standard includes the following four negotiable coding options.

Picture Format Width in Pixels Height in Pixels

sub-QCIF 128 96
QCIF 176 144
CIF 352 288
4CIF 704 576
16CIF 1408 1152

Table 2.2: H.263 Picture Formats

• The unrestricted motion vector mode allows motion vectors to point outside of the picture, which is especially useful with camera movement and small pictures.

• The syntax-based arithmetic coding mode uses arithmetic coding instead of variable length coding, decreasing the bit rate at the encoder output.

• The advanced prediction mode uses Overlapped Block Motion Compensation (OBMC) for the luminance part of P-pictures, with 8x8 vectors instead of one 16x16 vector for some macroblocks in the picture, resulting in fewer blocking artifacts.

• The PB-frames mode encodes two pictures, a P- and a B-picture, as one unit, increasing the picture rate considerably with only a small increase in bit rate, but also contributing to the video stream latency.

The maximum number of generated kilobits per picture (BPPmaxKb) for each picture format specified by the standard is given in table 2.3. However, encoder and decoder may negotiate a different limit by some external means.

Picture Format BPPmaxKb

sub-QCIF 64
QCIF 64
CIF 256
4CIF 512
16CIF 1024

Table 2.3: Maximum Bit Rates for H.263 Picture Formats



2.2.3 AVC/H.264

Advanced Video Coding (AVC) is the latest video compression technology in the MPEG-4 standard, also known as MPEG-4 Part 10. The standard was developed jointly by the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group (VCEG). H.264 is its ITU-T name, and the codec is commonly referred to as AVC/H.264.

The new standard provides far more efficient compression algorithms, preserving good video quality at substantially lower bit rates compared with its predecessors. It puts emphasis on a simplified design, making it relatively inexpensive to implement, for example by using a minimal number of VLC tables for all parameters to be coded. H.264 is, however, not backward compatible with H.263, meaning that an existing H.263 decoder is not able to decode the bit stream produced by an H.264 encoder; instead, the standard provides a set of profiles and levels defining codec capabilities and conformance points.

H.264 contains a number of new features providing a better cost/benefit ratio as well as more flexibility for application in a wide variety of network environments. The features of interest for the Push-to-Video service are briefly described in the following.

• Variable block-size motion compensation (VBSMC) with block sizes of 4x4, 8x8, 8x16, 16x8 and 16x16, enabling very precise segmentation of moving regions.

• A new integer transform on 4x4 blocks contributes to both codec efficiency and accuracy.

• An additional 2x2 Hadamard transform for the DC coefficients of a macroblock achieves a higher compression ratio in smooth regions.

• An in-loop deblocking filter prevents the blocking artifacts typical for DCT-based image compression techniques, improving the video quality.

• Substantially extended use of spatial prediction increases the compression ratio.

• Tree-structured segmentation for motion estimation increases the compression ratio and adds to the codec's robustness in a network environment.

• More accurate quarter-pixel motion vectors enable a very precise description of the displacements of moving areas.

• Multiple reference pictures allow up to 32 past and future reference pictures to be used, providing precise inter-prediction as well as improved robustness against packet loss when streaming over the network.

• The Context-Adaptive Binary Arithmetic Coding (CABAC) method continually updates the frequency statistics of the incoming data and adaptively adjusts the algorithm, improving compression performance.

• A variety of data partitioning, structuring and reordering tools improves codec robustness against errors or data losses.

• Redundant Slices (RS) allow an encoder to send an extra representation of a picture region that can be used if the primary representation is corrupted or lost.

• User-definable motion estimation settings control the balance between encoder speed and quality.

To allow different classes of applications to use only a subset of this broad range of features, H.264 uses the concept of profiles and levels. A profile refers to a set of available tools or syntactical elements that may be used to perform encoding or decoding of a data stream, defining encoder and decoder capabilities. Within a profile, a level sets limits such as the bit rate, memory requirements, etc. The standard defines six profiles, of which the baseline profile is suitable for a Push-to-Video service in mobile networks requiring minimal latency and processing power.

2.3 Real Time Data in IP Networks

In a traditional circuit-switched network, voice data is transmitted in a reserved voice channel occupying a unique time slot with fixed bandwidth, guaranteeing a connection with consistent delay characteristics.

IP makes no guarantees concerning

• reliability,

• flow control,

• error detection or error correction.

The result is that datagrams may arrive at the receiver out of order, with errors, or not arrive at all. The effects of these IP properties on real-time applications can be described in terms of latency, jitter and packet loss.

An application may improve the Quality of Service (QoS) beyond the IP best-effort policy by carefully striking a balance between the parameters describing latency, jitter and packet loss.

Jitter is the variation in the inter-arrival times of data packets, expressed as a standard deviation. Figure 2.9 shows data packets affected by fluctuating network conditions.

Figure 2.9: Jitter in IP Network

Jitter usually arises from congestion in the network. For delay-sensitive traffic, higher values of jitter result in packets being discarded by the application at the receiver. The application may alleviate the effects of jitter by implementing a jitter buffer at the receiver, which temporarily holds the incoming packets but also introduces additional delay into the transmission. If the buffer size is too small, dropped packets degrade the quality of service; if the buffer size is too large, the delayed packets also degrade the quality of service. A buffer size introducing a delay of 30 to 50 ms is usually acceptable for voice conversation.
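A minimal sketch of such a fixed-delay jitter buffer is given below: each packet is released to the decoder only once its sender timestamp plus a fixed playout delay has been reached. All names and the 40 ms value are illustrative (chosen from the 30 to 50 ms range mentioned above), not taken from the Push-to-X implementation.

/* Fixed-delay playout decision for a receiver-side jitter buffer (sketch). */
#define JITTER_DELAY_MS 40          /* within the 30..50 ms range given above */

typedef struct {
    unsigned timestamp_ms;          /* media time stamped by the sender       */
    const void *payload;
    int length;
} MediaPacket;

/* Returns nonzero when the packet may be handed to the decoder. */
int ready_for_playout(const MediaPacket *p, unsigned first_timestamp_ms,
                      unsigned now_ms, unsigned start_ms)
{
    /* Play the packet JITTER_DELAY_MS after its nominal position in the
     * stream; packets arriving later than this are dropped by the caller. */
    unsigned playout = start_ms + (p->timestamp_ms - first_timestamp_ms)
                       + JITTER_DELAY_MS;
    return now_ms >= playout;
}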

Packet loss is a result of temporary environmental disturbances on the physical media (copper and optical cables) or of temporary errors and disturbances in network routers. In order to minimize the effects of packet loss, the sender may choose to partition the data frames into smaller units so as to minimize the amount of data affected by a single packet loss. The receiver may choose to use the packet received just before the lost packet instead of the lost one, or perform more sophisticated interpolation to eliminate interruptions in the audio or video stream. The codec may use shorter self-contained data sequences at the expense of audio or video quality.

Latency or delay is the time interval that a packet needs to arrive at the receiver. The ITU-T G.114 recommendation specifies a one-way latency of 150 ms as the desired value for high-quality voice, based on the fact that most full-duplex users notice round-trip delays once they exceed 250 ms. The latency requirement can be relaxed for instant voice and video messaging using half-duplex or one-way communication; however, it should not extend far beyond 500 ms for good quality.

The figure 2.10 shows most important contributors to overall end-to-end latency.

Figure 2.10: Latency in a VoIP Network (codec, packetization and outgoing queue at the sender; uplink, backbone and downlink transmission in the network; incoming queue, jitter buffer and decoder at the receiver)

Network latency is the time needed to traverse the WAN. In general, this delay depends on the number of router hops traversed between the end-points. For estimating the router hops, utilities such as traceroute can be used.

Sender latency is introduced by the codec. Most codecs, especially audio codecs, use compression algorithms which require buffering a certain number of samples before encoding them and sending the compressed data out to the network. This added delay is usually in the range of 30 ms. Reducing the codec latency results either in reduced quality or in increased bandwidth usage.

Receiver latency is introduced by the jitter buffer used to smooth the effects of fluctuating network conditions. Delaying the arriving packets in a jitter buffer increases the resilience of the codec to packet loss and delayed packets, but it also increases the latency. Typical jitter buffer delays are in the range of 80 ms.
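As a rough worked example using the figures above, a transmission that spends about 30 ms in the encoder and 80 ms in the receiver jitter buffer has already consumed 110 ms of its latency budget; within the relaxed half-duplex bound of roughly 500 ms mentioned earlier, about 500 − 110 = 390 ms remain for packetization, queuing and network transit.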

In addition, a very important factor affecting voice and video quality is the total network load. When the network load is high, and especially for networks with statistical access such as Ethernet, jitter and frame loss typically increase. For example, when using Ethernet, higher load leads to more collisions. Even if the collided frames are eventually sent over the network, they are not sent when intended, resulting in excessive jitter. Beyond a certain level of collisions, significant frame loss occurs.

3 Push-to-X System

This chapter presents the Push-to-X system. It starts with a system architecture overview and then reviews the design concepts of the Push-to-X run-time environment. Further, it gives an in-depth presentation of the Push-to-X client. Finally, it describes the Push-to-X protocol.

3.1 System Architecture Overview

The figure 3.1 shows the Push-to-X system architecture.

The architecture is to be viewed from two distinct perspectives. From the perspective of the underlying IP network it is a network of peer-to-peer clients involved in group communication aided by a server. The server provides users with access to group cells, while peer-to-peer communication allows great flexibility with respect to the usage of network resources. Unlike, for example, the proposed PoC architecture, this architecture does not use an application server for multiplexing the client media data, which would introduce possible network traffic hot spots.

From the service perspective it provides the functions enabling the exchange of media streams in group cells of Push-to-X clients administered by a Push-to-X server. The functional units can be described in terms of three hierarchically structured contexts within which the Push-to-X communication occurs.

The top-level Server Context defines the operations at server domain level, determining a client's registration status, group membership and presence status. Within the Server Context, clients register, request membership in a group, announce their presence, which is exported by the server to the group members, or invite other clients to join the communication in a group cell.

The Peer Context and the Media Context define the operations at group cell level. Within the Peer Context a client exchanges messages to gain control over the communication channel, to examine the readiness of group members to receive the media message and to synchronize the delivery of media messages, while within the lowest Media Context the client controlling the communication channel transmits the media data to the group cell members.

Figure 3.1: Push-to-X System Architecture

3.2 Execution Environment

The Push-to-X run-time environment is based on two simple yet powerful concepts: Events and Layers. Events are the means by which an execution time slice is requested from the process-controlling scheduler. Layers represent functional units performing a specific task, such as the processing of a protocol data packet or the processing of media data. Applications access the protocol stack through the main application layer.

The figure 3.2 shows the order of execution when processing a data packet.

Figure 3.2: Runtime Execution Order

The layers are stacked, where a layer can have one lower and several upper layers. Each layer has two queues in which the lower and upper layers enqueue TransferItems representing the data packets to be processed by one of the layer's handling routines. The purpose of TransferItems is to avoid copying data packets when they are passed up or down the protocol stack. This improves system performance, based on the fact that lower layers do not deal with the data packet payload; rather, they add or remove the layer protocol header to or from the packet. Queuing a data packet generates an event signal informing the scheduler to call the layer's data packet processing routine later on, once the currently executing routine has finished its data processing.
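The idea can be sketched as follows; the structure and function names are purely illustrative and are not the ProtoSim definitions. Each TransferItem keeps an offset into one shared packet buffer, so a layer prepends or strips its header by moving the offset instead of copying the payload.

/* Illustration only: header handling without copying the packet payload. */
typedef struct {
    unsigned char *buffer;   /* one allocation shared by all layers             */
    int capacity;
    int offset;              /* start of the data visible to the current layer  */
    int length;              /* bytes from 'offset' that belong to the packet   */
} TransferItem;

/* Going down the stack: reserve 'hdr_len' bytes in front of the payload. */
unsigned char *push_header(TransferItem *t, int hdr_len)
{
    t->offset -= hdr_len;
    t->length += hdr_len;
    return t->buffer + t->offset;   /* layer writes its header here              */
}

/* Going up the stack: hide this layer's header from the upper layer. */
void pop_header(TransferItem *t, int hdr_len)
{
    t->offset += hdr_len;
    t->length -= hdr_len;
}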

The execution environment also provides a mechanism to pass information across layers by means of control messages. The sending layer passes the message using the Control Plane, which is part of the network stack shown in figure 3.3, to the receiving layer, which must register with the Control Plane to receive the specific control message.

The handling routines are not preempted by the scheduler. Once a processing routine gains control it executes in its time slice until the data packet is processed. This mechanism prevents possible race conditions, but it also requires that the handling routines do not exceed the execution time slice of 20 ms imposed by the execution environment.

3.2.1 ProtoSim

Push-to-X systems are designed to run on numerous packet-switched networks. The simulation environment makes it easy to track and model changes in client behavior depending on the underlying network, as well as to measure the impact of Push-to-X generated traffic on network behavior.

The central simulation component is ProtoSim. It is an integral part of every Push-to-X host. ProtoSim follows the OSI layered network model, where the network is described in terms of seven functional layers forming a protocol stack. Figure 3.3 shows how a ProtoSim application builds its protocol stack from the heap of protocol layers available with the ProtoSim development library.

Figure 3.3: ProtoSim Stack Building

The NetworkStack connects an upper layer to a lower layer by its address ID. Based on the protocol address ID the layers pass data packets up and down the protocol stack. With this approach, changing the network type is as easy as adding or removing layers from the protocol stack.

The figure 3.4 shows how the ProtoSim simulates data transmission.
Figure 3.4: Transmission Simulation

Before serializing a packet and sending it over the physical medium, the ProtoSim bottom layer introduces an intermediate step, wrapping the packet to be transmitted in an IP packet and then multicasting it over the network. On the receiving side, the Routing Unit examines the destination addresses of the incoming packet. If the packet destination address matches the address of the host network interface, the packet is passed to the Channel Modeling Unit; otherwise it is ignored. The Channel Modeling Unit accounts for the characteristics of the connecting network (packet delay and packet losses) before forwarding the packet to the upper layers.

3.3 Push-to-X Client

The figure 3.5 shows the Push-to-X client structure.

The Push-to-X client consists of two main components, a Backend and a Frontend. The Backend, in turn, consists of a component handling the network communication towards both the group peers and the server, and two media components interfacing the audio and video hardware and processing the media data. The Frontend presents the Push-to-X service to the user in the form of a Graphical User Interface (GUI) realized with Qt. Qt is described in more detail in section 5.3. Here we only mention that the use of Qt requires the Backend to be executed in a separate thread, which communicates with the GUI by means of internal messages using the Linux kernel pipe facilities and the Qt mechanism for monitoring activity on a file descriptor.

Figure 3.5: Push-to-X Client Structure

Since the Backend design is based on the ProtoSim execution environment concept of layers, each component is realized as a layer, where the Backend itself is the main application layer. All layers are connected to the main application layer. The PTX layer handles the Push-to-X protocol described in section 3.4, where the server and peer communication functionalities are implemented as two independent modules. The video layer handles the video hardware using the V4L kernel API, while the audio layer handles the audio hardware using the ALSA kernel API. Both media layers are independent of the other Backend layers, supporting the client's modular structure. They exchange media data with the main application layer as data packets representing the PTX message body. The main application layer uses the PTX layer to transmit the media packets over the network. The main application layer controls the media layers by means of control messages. This communication differs from the communication model described in section 3.2 in that control messages immediately transfer execution control to the message receiver. The control messages are used to control the recording or playing of media streams, including the starting and stopping of the media flow.

Figure 3.6 depicts the Push-to-X client GUI.

Figure 3.6: Push-to-X Client

The GUI comprises the following main elements:

• The select field with the name of the active group.

• The buddy list containing the names of the members of the active group along with each member's presence status.

• The push button, which is the central element allowing the user to utilize the Push-to-X service.

The buddy list is maintained by the module handling the communication with the server,
which manages all data structures representing the groups for which the user has been
registered.

3.3.1 Program Flow during Media Transmission

A user who wishes to start a Push-to-X conversation presses the push button. This action causes the GUI to send an internal message through the GUI-backend communication channel to the PTX layer peer communication module, which is responsible for signaling within a group cell. The peer communication module attempts to reserve a channel for media transmission by exchanging Push-to-X protocol messages with the group peers. A successful channel reservation is indicated by a beep tone to the sending user, while the receiving clients display a red dot in the buddy list next to the sending user's name. On successful channel reservation the peer communication module passes control to the main application layer. The main application layer first instructs the sound layer to start audio recording by sending a control message, which causes the sound layer to initialize the sound card for audio recording. After initializing the sound card, the audio layer returns control to the main application layer, which then sends a control message to the video layer. The video layer initializes the web camera and returns control to the main application layer. During the initialization process both the audio and the video layer pass the file descriptors returned by the audio and video device drivers, respectively, to the scheduler provided with the ProtoSim run-time environment. Based on these file descriptors the scheduler uses the kernel select mechanism to pass control to the audio and video layers when audio and video data are available from the audio and video devices. The media layers then transfer the data from the device drivers and deliver it to the main application layer using the queuing mechanism described in section 3.2. The audio data is recorded in intervals of 20 ms, while video data is captured at a frame rate of 16 fps. The main application layer packages the media data in Push-to-X protocol messages for each peer in the group and passes them to the PTX layer, which sends them over the network. All media data is compressed prior to packetization.

The media packets arriving at the receiving client are first handled by the PTX layer, which delivers them to the media layers using the queuing mechanism described in section 3.2. The media layers then present the media content to the user.

3.4 Push-to-X Protocol

The Push-to-X protocol is an application layer protocol primarily designed to minimize the protocol overhead when delivering small, time-critical voice data packets. It uses UDP as the underlying transport protocol.

A Push-to-X message comprises a message header and a message body. The message header is only one byte long, while the message body size is limited by the maximum size of a UDP datagram.

Figure 3.7: PTX Message (one-byte PTX header with Type, Subtype and Ack fields, followed by the self-contained message body)

The message header is divided into three fields. The type field identifies the application functional component responsible for processing the message, while the subtype field describes the message body payload. A one-bit acknowledge field indicates that the message is sent in response to a preceding message.
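A possible packing of this one-byte header is sketched below. Only the one-bit acknowledge field is fixed by the description above; the 4-bit type / 3-bit subtype split is an assumption made here purely for illustration.

/* Illustrative packing of the one-byte PTX header (field widths assumed). */
typedef unsigned char ptx_header_t;

static ptx_header_t ptx_pack(unsigned type, unsigned subtype, unsigned ack)
{
    return (ptx_header_t)(((type & 0x0F) << 4) | ((subtype & 0x07) << 1) | (ack & 0x01));
}

static unsigned ptx_type(ptx_header_t h)    { return (h >> 4) & 0x0F; }
static unsigned ptx_subtype(ptx_header_t h) { return (h >> 1) & 0x07; }
static unsigned ptx_ack(ptx_header_t h)     { return h & 0x01; }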

A Push-to-X message belongs either to signaling messages or media payload messages.

Figure 3.8: PTX Message Types (control, transport and media payload)

The signaling messages are exchanged in Push-to-X client-server communication as well as in Push-to-X peer-to-peer communication, while the media payload messages are sent in the context of a peer-to-peer signaling sequence. The purpose of a peer-to-peer signaling sequence is to provide the so-called floor control mechanism and to avoid data packet collisions. The floor control signaling sequence between two peers is shown in figure 3.9.

Figure 3.9: Floor Control Signaling Sequence (peer A sends occRequest(sender name); peer B answers with occRequest:Ack; peer A sends occupy(), transmits the media stream and finally sends release())

The Push-to-X protocol defines the following message types:

• serverCom messages are used in the client-server signaling process for user, group and session administration, not affecting the media data flow.

• peerCom messages are used in the signaling process between two clients to initiate and close down a media data stream.

• audioData messages carry the voice data.

• videoData messages carry the video data.

The multiplexing of audio and video data streams takes place at the PTX protocol message level, with video and voice data transmitted in separate messages. This simplifies the filtering of video data in mixed groups which include clients with the video service disabled.

4 Design Approach and Hardware Components

This chapter describes the design approach followed for the realization of the Push-to-X video service. It starts with a rough system requirements specification. Next, it presents the design flow. Finally, it gives an overview of the hardware components.

4.1 System Requirements

The figure 4.1 shows the requirements the video components must meet in the form of a
mind map.

A Push-to-X client should preserve the capability for communication across multiple network types; in particular, it should provide seamless communication between wireless (UMTS, GPRS and WLAN) and wire-line users over short and long distance networks (LAN and WAN).

A Push-to-X client will not rely on the underlying network providing a guaranteed Quality of Service; instead, it should provide QoS at the application level by its own means. In particular, it should deal with delayed packets and be tolerant of packet losses. It should also smooth out fluctuations in network conditions and overcome short-term network disturbances such as a temporary loss of network connectivity.

A Push-to-X client should be capable of adjusting the video stream output bit rate depending
on the available network bandwidth. In narrow-band networks it will generate a low bit-rate
output, while users with access to High Speed Data Networks (HSDN) will benefit from
the best video quality.

A Push-to-X client will take power dissipation in wireless devices into account by choosing video
compression algorithms that are not computationally intensive.

The Push-to-X client should equally support web cams and displays from different vendors.
The support of different web cam and display models should be transparent to the user.

Figure 4.1: Push-to-X Client Requirements as a Mind Map (branches: Configurable, Multi-Platform, Multi-Network Capable, QoS, Bandwidth Demand, Power Dissipation, Modularity, Display Dependencies, Camera Dependencies)

The Push-to-X client will detect and run the hardware components without any user intervention.

The Push-to-X Client should be configurable and provide the user with an option to enable
or disable the video service.

The Push-to-X client should have a modular structure to allow optional installation of the
video components on systems with no hardware support for the video service. Additionally,
the modular software structure will ensure a smooth migration from a system not supporting
the video service to a full-featured system while preserving the settings used in the previous version. It
will also provide a basis for easy porting to platforms other than Linux.

4.2 Design Flow

During the design of the Push-to-Video service we followed a software engineering
process based on an incremental model, which generally includes planning, development and
release of software in a sequence of increments or stages, where each additional increment adds
operational functionality or capability not available in previous releases. In this process we
identified four major components comprising the video functionality. The VideoLayer is the
main component exposing the video functionality to the Push-to-X client main application. The
VideoLayer relies on the VideoCapture, VideoDisplay and AVCodecHandler components. The
VideoCapture controls the web camera, which delivers the video content. The VideoDisplay,
which is embedded in the Push-to-X client GUI, is responsible for displaying the video content.
AVCodecHandler performs the compression and decompression of video data by using an
MPEG codec.

We divide the development process into two incremental stages. The goal of the
first stage is to implement a preview functionality showing the captured frames
on the local display as they are delivered from the camera, by realizing the VideoCapture and
VideoDisplay components. The goal of the second and final stage is to enable the
transmission of video data over the network by realizing AVCodecHandler, which ensures
the bit rate control and robustness of the video transmission. Figure 4.2 shows the overall
incremental design flow during the development of the whole Push-to-Video functionality.

The design flow follows the path of the video data transmission, starting at the sender capture
component and ending at the receiver display. The advantage of this approach is that a mini-
malist version can be released quickly and then evolved to meet the real needs of the user.
For example, before considering the codec and networking aspects we first implement the video
preview functionality at the sender. We test the functionality and assess the perceived smooth-
ness of the motion contained in the captured video sequence depending on the capture frame rate.
Then we measure how the new functionality impacts the overall system performance. Based
on this feedback we refine the requirements specification for the frame rate of the input
video.

4.3 Hardware Components

4.3.1 Camera

The source video data is captured by a Philips PCVC740K camera, shown in figure 4.3.
The camera is attached to the PC via a USB 1.1 port.

Figure 4.2: Incremental Design Flow (1: capture, 2: display, 3: encoding and storing video to a file, 4: decoding and reading video from a file, 5: transmission)

Table 4.1 gives the technical specification of the Philips PCVC740K camera.

Figure 4.3: Philips PCVC740K camera

Optical specifications
  Sensor:                  CCD
  Pixels:                  640 (H) x 480 (V)
  Still image resolution:  1280 (H) x 960 (V)
  Illumination:            less than 1 lux
  Integrated lens:         F2.0

Resolution / performance
  Output resolution        Frame rate (fps)
  VGA  (640 x 480)         up to 30
  CIF  (352 x 288)         up to 30
  SIF  (320 x 240)         up to 60
  QCIF (176 x 144)         up to 60
  QSIF (160 x 120)         up to 60
  Pixel formats:           I420 (Philips internal), IYUV (YUV 4:2:0 planar)

Table 4.1: PCVC740K Technical Specifications



The most important entries in table 4.1 are the picture formats and pixel formats supported
by the camera. Both the CIF picture formats and the YUV pixel formats required by
an MPEG source coder are supported, so the source video can be fed
directly to the MPEG encoder. For cameras which do not support the YUV 4:2:0
planar format, a video transformation (picture cropping, pixel format
conversion) must be performed, which requires additional processing power. Especially under real-
time conditions CPU time is a valuable resource, as all operations on the current frame, including
source encoding, must be completed before the next frame arrives.

The Linux support for the camera is provided by the pwc-10.0.7a device driver. The driver
implements both Linux video interfaces, V4L and V4L2. It is available as a Linux kernel module
which can be dynamically loaded and bound into a running system using the Linux command
modprobe -a pwc. Optionally, an image format and frame rate can also be specified on the
command line when loading the module. The Push-to-X client accesses the capture driver
as a character device located at /dev/video0 in the Linux file system, with 81 and 0 as major and
minor device numbers.

4.3.2 Display

The display hardware is represented by the X Window server. In order to examine the
capabilities and properties of the X Window server used in our development environment we
use the xdpyinfo utility. A shortened xdpyinfo output is shown below.

linux:# xdpyinfo

name of display: :0.0


version number: 11.0
vendor string: The X.Org Foundation
X.Org version: 6.8.2
bitmap unit, bit order, padding: 32, LSBFirst, 32
image byte order: LSBFirst
number of supported pixmap formats: 7
supported pixmap formats:
depth 1, bits_per_pixel 1, scanline_pad 32
depth 4, bits_per_pixel 8, scanline_pad 32
depth 8, bits_per_pixel 8, scanline_pad 32
depth 15, bits_per_pixel 16, scanline_pad 32
depth 16, bits_per_pixel 16, scanline_pad 32
depth 24, bits_per_pixel 32, scanline_pad 32
depth 32, bits_per_pixel 32, scanline_pad 32

number of extensions: 29
Extended-Visual-Information
GLX
MIT-SHM
X-Resource
XFree86-DGA
XFree86-Misc
XFree86-VidModeExtension
XVideo
default screen number: 0
number of screens: 1
screen #0:
dimensions: 1024x768 pixels (347x260 millimeters)
resolution: 75x75 dots per inch
depths (7): 24, 1, 4, 8, 15, 16, 32
depth of root window: 24 planes
number of colormaps: minimum 1, maximum 1
default colormap: 0x20
default number of colormap cells: 256
preallocated pixels: black 0, white 16777215
options: backing-store NO, save-unders NO
current input event mask: 0xfa4031
KeyPressMask EnterWindowMask LeaveWindowMask
KeymapStateMask StructureNotifyMask SubstructureNotifyMask
SubstructureRedirectMask FocusChangeMask PropertyChangeMask
ColormapChangeMask
number of visuals: 8
default visual id: 0x23
visual:
visual id: 0x23
class: TrueColor
depth: 24 planes
available colormap entries: 256 per subfield
red, green, blue masks: 0xff0000, 0xff00, 0xff
significant bits in color specification: 8 bits

visual:
visual id: 0x27
class: DirectColor
depth: 24 planes
available colormap entries: 256 per subfield
red, green, blue masks: 0xff0000, 0xff00, 0xff
significant bits in color specification: 8 bits

The xdpyinfo utility provides information about the supported pixmap formats, the extensions to
the X standard specification, the screen resolution, the available visuals, as well as other information.
X visuals are described in more detail in section 2.1.1. Here we recall that visuals specify
the way the image data is represented by the server to the display hardware.

The xdpyinfo output shows that the X Window server cannot display bitmaps in the YUV420P
pixel format coming from an MPEG decoder. Thus a conversion into a chosen RGB format
must be performed prior to forwarding the decoded frames to the X Window server. In order
to avoid degradation in the quality of the displayed images we choose a pixmap format using
24 bits per pixel, i.e. 8 bits per pixel component, for the TrueColor visual class. Eight bits per
component is also the number of bits allocated to represent a pixel component in the
MPEG standard.

The xdpyinfo output also shows that the X Window server supports the MIT-SHM extension.
The MIT-SHM extension enables efficient exchange of data between an X Window client and an X
Window server when they run on the same host, using the MIT shared memory interprocess
communication (IPC) mechanism. Xlib provides C functions allowing an application
to map the physical memory location used by the X Window server into its own virtual memory
space. After the mapping, both the application and the server can access the shared physical
memory location through their own virtual address spaces. Xlib also synchronizes the
access to the shared memory location. In order to improve the Push-to-X client performance
we chose to use the X MIT shared memory extension if it is available on the target host.

5 Realization

This chapter describes in detail the realization of the Push-to-Video service in an existing
Push-to-X based system, enhancing the multimedia functionality provided by the design base
supporting voice transmission.

5.1 Video Layer

The Video Layer is a Push-to-X client component which encapsulates the entire video function-
ality of the Push-to-X service. Its main task is to collect video data coming from a camera
attached to the system on the sending side, to encode and adapt the video data for transport
over the network, and to decode and convert the received data into a format that can be presented on the
receiving system's screen and displayed to the user. For performing these operations
the Video Layer relies on a camera device driver implementing the Video for Linux API and on the two
video codec libraries provided by the ffmpeg project, libavcodec and libavformat. It also uses the X
Window system to display the video to the end user.

The Video Layer relies on the three supporting components shown in figure 5.1 to perform four major
operations along the data path from the camera on the sender side to the display on the receiver
side.

All components, including the video layer, are implemented as C++ classes. The VideoLayer class
controls the camera and the display hardware via the VideoCapture and VideoDisplay classes.
It is connected to the network using facilities provided by the ProtoSim development library. The main
application layer controls the VideoLayer class via control messages, also provided by the ProtoSim devel-
opment library. The VideoFrame class represents video data used by the video hardware. The FrameSlice
class represents video data packetized for network transmission.

Figure 5.1: Video Layer Structure (data flows from VideoCapture via AVCodecHandler and the PTX/UDP/IP stack to the receiving AVCodecHandler and VideoDisplay; the control path carries startCapture/stopCapture and the transfers to and from the PTX layer)

The arrows in figure 5.1 represent the flow of data. Starting at the VideoLayer sending side, VideoCapture
collects the source video data captured by the camera by using the V4L video driver installed on the system.
The collected frames, represented by the VideoFrame class, are passed to AVCodecHandler, which encodes
them using the libavcodec and libavformat API. In the process, the encoded frames are packetized in a file
format suitable for transport over the network. The packets with video data, represented by the FrameSlice
class, are passed to the Video Layer, which performs a further fragmentation to make them fit into a PTX
message. The Video Layer then transfers the data packet to the main application layer by using the ProtoSim
transferItem class. The main application layer uses the PTX layer to prepend the PTX message header to
the data packet, indicating video as the message payload type, and transfers it with the PTX protocol
running on top of UDP/IP to the receiving peer client.

The receiving side performs the reverse operations. The main application layer uses the PTX layer to
determine the video layer as the destination media layer of a PTX message, based on the information
provided in the PTX message header. It removes the PTX message header and sends the packet to the
video layer, which extracts the video data. The video data are decoded by the AVCodecHandler class using
a decoder provided by the libavcodec and libavformat libraries. In the decoding process the decoder gathers
the decoded video data in order to perform the subsequent de-fragmentation of the video data into video
frames represented by the VideoFrame class. AVCodecHandler converts the video frames into a format
supported by the X server. The Video Layer uses a messaging mechanism based on Linux kernel pipe
facilities to pass the video frames to the VideoDisplay class, which is embedded in the Push-to-X client
GUI realized with the Qt library. The VideoDisplay class uses the Xlib API of the X Window system to
transfer the video frames to the X server, which displays the video on the receiving client's screen.

5.1.1 Video Parameters

The Video Layer does not directly manipulate the video data; rather, it controls the sup-
porting components, which perform the actual operations on the video stream based on the
following set of significant parameters.

• Frame rate denotes the number of video frames captured in a time interval, expressed
in frames per second (fps).

• Picture format denotes the width and height of a picture, expressed in number of pixels.

• Pixel format denotes the color space used to describe the color components, the organization of
pixel values in main memory, and the depth, i.e. the number of bits representing a pixel value.

• Codec ID denotes the codec used to compress the video stream.

• Bit rate denotes the number of bits per second produced by the source coder.

The first three parameters determine the data throughput of the source video, while the bit
rate pertains to the number of bits produced by the video encoder. Frame rate, picture format
and bit rate strike the balance between the desire for the best video quality, the demand on network
bandwidth, and the available hardware resources.

The frame rate plays the critical part in perceiving a discrete sequence of pictures showing a moving
object as continuous motion; higher frame rates are more bandwidth demanding
while providing video sequences with smoother motion.

We choose a frame rate of 16 fps based on the experience that 16 fps is the threshold
frame rate for perceiving smooth motion. Exceptionally, we allow a frame rate of 15 fps along
with the MPEG-1 video codec, which operates at 30 fps and supports lower frame
rates by having non-transmitted pictures between transmitted ones. The upper frame rate
limit follows from the constraint given by the requirement (our desire) to support video
transmission over narrow-band networks with a bandwidth of 64 kbit/s. Choosing a higher
frame rate results in a higher source video data throughput and consequently demands more
bandwidth. In general, a higher source video data throughput also requires more processing
power, which increases the unwanted power dissipation.

The picture format is the key factor contributing to the data throughput of the source video, having
the strongest impact on the bit rate at the encoder output. The Video Layer provides the picture format
to the VideoCapture component in the format negotiation procedure during camera setup. The
format of the captured pictures, apart from being more comfortable for the viewer, should also be
in accordance with the picture format supported by the video codec which encodes the captured
pictures. Otherwise, libavcodec encoders either refuse to process pictures that are too large,
or picture cropping must be performed to fit the smaller format supported by the encoder,
wasting precious computational power. An exhaustive list of the formats supported by the MPEG
family of codecs is given in section 2.2.2. Here, we recall that the MPEG family of codecs
supported by the libavcodec library operate on pictures in the so-called CIF format (352x288) and
its integer multiples. In addition, the libavcodec library also supports the so-called SIF format
(320x240), which is supported by many web cameras available on the market.

We choose the SIF (320x240) format as the default format used in the negotiation procedure
during camera setup, and allow the user to choose a different format if the Push-to-Video
service is used in a network with sufficient available bandwidth. The actual format
used for source video capturing then depends on the negotiation outcome with the VideoCapture
component.

In general, the pixel format describes the way the graphics hardware and software interpret
a piece of memory containing a digital image. The pixel formats supported by the video layer
are described in detail in section 5.5.

Because the video layer components may generally operate on pictures in different pixel for-
mats, we take the following approach to enable the exchange of video data between them.

In order to enable the exchange of video data between AVCodecHandler and VideoDisplay,
the video layer queries VideoDisplay at program start for the pixel format supported by the
underlying X Window system and passes this information to AVCodecHandler, which ensures
that decoded video frames are made available in the format supported by the X Window system.

The exchange of video data between VideoCapture and AVCodecHandler is easier, due to
AVCodecHandler's support for a variety of pixel formats. VideoCapture simply indicates the used
pixel format in the VideoFrame class representing a captured video frame. AVCodecHandler
then performs a conversion of the video frames, if needed, before passing them to the encoder. The pixel
format used by VideoCapture is negotiated in the camera setup procedure described in section
5.2.4.

The video layer provides support for the H.261, H.263 and H.264 codecs; the codec actually used is
provided to AVCodecHandler when it is initialized.

The source video data can be encoded in two ways with regard to bit rate and video quality.
Either the encoder produces a video sequence with constant picture quality at a variable bit
rate, or the encoder ensures a constant bit rate at the expense of video quality by varying the
quantization scale factor. In order to enable the use of the video service in narrow-bandwidth
networks, we choose to encode the source video data at a constant bit rate, with the option for the
user to specify the desired bit rate, choosing between 40, 64 and 128 kbit/s.
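The parameter set discussed above can be pictured as a small value object handed from the Video Layer to the supporting components. The following sketch is purely illustrative; the names and types are assumptions and not taken from the actual implementation.

// Illustrative collection of the video parameters discussed above;
// names and types are assumptions, not the actual VideoLayer interface.
struct VideoParameters {
    int codecId;        // H.261, H.263 or H.264
    int frameRate;      // frames per second, default 16 (15 for MPEG-1)
    int pictureWidth;   // default 320 (SIF)
    int pictureHeight;  // default 240
    int pixelFormat;    // e.g. PIXFMT_YUV420P
    int bitRateKbit;    // constant bit rate: 40, 64 or 128 kbit/s
};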

5.1.2 Initiation and Finalizing of Video Transmission

A video transmission is initiated and ended by means of control messages. A Push-to-Video
transmission is started when the video layer receives a startCapture control message sent by the main
application layer. startCapture causes the video layer to set up the camera for the capturing
process via the VideoCapture interface. If the camera setup succeeds, VideoCapture returns to the
video layer a file descriptor representing the camera. The VideoLayer then passes the returned file
descriptor to the scheduler, which in turn returns control to the video layer each time
video data captured by the camera is made available by the VideoCapture interface. If the
camera setup does not succeed, the video layer cancels the operation with an error message.
The main application layer ends an ongoing Push-to-Video transmission using the same
mechanism. It sends a stopCapture control message to the video layer. stopCapture causes the
Video Layer first to stop the capturing process via the VideoCapture interface and subsequently to
instruct the scheduler to invalidate the previously passed file descriptor representing the capture
device. On the receiving side the main application layer immediately passes incoming video
data via the PTX layer to the video layer, which presents it to the user.

5.2 Capture Interface

The capture interface transfers the source video data from the camera to the application by using
a video device driver implementing either the V4L or the V4L2 Linux kernel API. The support
for V4L is provided by the v4lCapture class, while the support for V4L2 is provided
by the v4l2Capture class. Both classes inherit from the VideoCapture class, which provides
a generic interface to the user application. v4lCapture and v4l2Capture differ in the C data
structures they use, while they provide identical functionality. Thus, the following text
focuses on implementation principles common to both classes. It uses the term VideoCapture
to refer to both the v4lCapture and the v4l2Capture implementation.

Programming a V4L device involves the following steps:

• Opening the device and querying the device driver capabilities.

• Selecting the I/O method for transferring the captured data from driver to application
memory space.

• Buffer allocation for data transfer.

• Picture format negotiation.



• Transferring the captured data from the driver to application.

• Closing the device

5.2.1 Opening the Device

VideoCapture opens the video device driver in its open method. It opens the video device
driver, referenced by the /dev/video0 file in the Linux file system, using the Linux open() system call. A
successful open call returns a file descriptor which is used in subsequent calls to the driver;
otherwise an error code is returned, which is passed to the caller. After successfully opening
the video device, VideoCapture queries the video driver capabilities with the ioctl() system call to determine whether
the device driver supports the V4L API. In addition, this call
determines the I/O methods supported by the driver. If the device driver does not support
the requested V4L or V4L2 API, it returns an error code; VideoCapture then closes the device driver
and returns the error code to the caller. If the ioctl() system call querying the video device
capabilities succeeds, VideoCapture is ready for the capturing process.
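A minimal V4L2 sketch of this step could look as follows; error handling is shortened, and the specific capability bits tested here are an assumption about what VideoCapture checks rather than a transcript of its code.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstring>
#include <linux/videodev2.h>

// Open /dev/video0 and query the driver capabilities (V4L2 variant).
int openCaptureDevice() {
    int fd = open("/dev/video0", O_RDWR);
    if (fd < 0)
        return -1;                        // error code is passed to the caller

    v4l2_capability cap;
    std::memset(&cap, 0, sizeof(cap));
    if (ioctl(fd, VIDIOC_QUERYCAP, &cap) < 0) {
        close(fd);                        // driver does not speak V4L2
        return -1;
    }
    if (!(cap.capabilities & V4L2_CAP_VIDEO_CAPTURE)) {
        close(fd);                        // not a capture device
        return -1;
    }
    // cap.capabilities also tells whether streaming (mmap) I/O is available.
    bool canStream = (cap.capabilities & V4L2_CAP_STREAMING) != 0;
    (void)canStream;
    return fd;                            // file descriptor used in later ioctls
}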

5.2.2 I/O Method Selection

V4L defines the following I/O methods to read from or write to a device.

• Read/Write

• Memory Mapping (Streaming I/O)

• User Pointer (Streaming I/O)

The three methods may differ significantly in data transfer efficiency, as the streaming
I/O methods, memory mapping and user pointer, avoid copying data from driver to application
memory space. However, if the driver supports DMA (Direct Memory Access), the read/write
method need not be less efficient with regard to CPU time consumption than the streaming
methods.

The read/write I/O method is the default I/O method selected by the driver when the driver
supports it. However, in order to avoid copying data from driver to
application memory space, VideoCapture switches to the memory mapping I/O method after
querying the driver capabilities, provided the driver supports this I/O method.
An additional advantage of the memory mapping I/O method is that the captured frame
is timestamped with the current system time using nanosecond resolution timers. This
saves a call to gettimeofday() for each captured frame, which is otherwise needed with the read/write
I/O method, while providing better time accuracy. VideoCapture switches to the memory
mapping I/O method by mapping the buffers preallocated in the driver into user memory space
using the mmap system call. This procedure is described in detail in section 5.2.3.

5.2.3 Memory Buffers

VideoCapture allocates the buffers for storing the captured video data depending on the
selected I/O method. The read I/O method uses one memory buffer allocated on the
fly in user memory space just before the read routine is called, while the memory mapping I/O
method shares the buffers used by the driver to store the video data transferred from the camera.
The shared memory buffers reside in kernel memory space. The use of memory buffers
in conjunction with the memory mapping I/O method is described in the following on the example
of the V4L2 implementation. When the driver supports the memory mapping I/O method,
VideoCapture requests and initializes the buffers and maps them into the application
memory space after successfully opening the driver. First, it passes to the driver the number
of buffers it wishes to map in the v4l2_requestbuffers data structure with an ioctl() system call.
VideoCapture uses two buffers, as a larger number of buffers introduces longer latencies between
picture capture time and picture display time. It then uses another structure, the v4l2_buffer,
to represent the buffers to be mapped in the subsequent mmap system call; in fact this is an
array of two elements, each representing a mapped buffer. It then maps each of the two buffers. The
mmap system call returns a pointer to the buffer in driver memory. VideoCapture
stores the returned pointers in an internally maintained array as references used later
when transferring the video data from the driver to the application. The v4l2_buffer has a
data field storing the video frame capture time. Figure 5.2 shows the mapped memory buffers.

Figure 5.2: Mapped Buffer (the device driver buffers in kernel space are mapped into the application memory space)
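The buffer setup for the V4L2 memory mapping path can be sketched as follows, using two buffers as in the implementation; the member arrays holding the mapped pointers are assumed names, not the actual class members.

#include <sys/ioctl.h>
#include <sys/mman.h>
#include <cstring>
#include <linux/videodev2.h>

// Request two driver buffers and map them into application memory (V4L2).
// 'buffers' and 'lengths' stand in for the internally maintained array.
static void  *buffers[2];
static size_t lengths[2];

bool mapBuffers(int fd) {
    v4l2_requestbuffers req;
    std::memset(&req, 0, sizeof(req));
    req.count  = 2;                         // two buffers keep the latency low
    req.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    req.memory = V4L2_MEMORY_MMAP;
    if (ioctl(fd, VIDIOC_REQBUFS, &req) < 0 || req.count < 2)
        return false;

    for (unsigned i = 0; i < 2; ++i) {
        v4l2_buffer buf;
        std::memset(&buf, 0, sizeof(buf));
        buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index  = i;
        if (ioctl(fd, VIDIOC_QUERYBUF, &buf) < 0)   // ask for offset and length
            return false;
        lengths[i] = buf.length;
        buffers[i] = mmap(NULL, buf.length, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, buf.m.offset);
        if (buffers[i] == MAP_FAILED)
            return false;
    }
    return true;                            // buffers are initially dequeued
}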



5.2.4 Format Negotiation

The V4L API requires the application to negotiate the streaming format prior to initiating
streaming by the video device driver. VideoCapture implements the format negotiation with the driver
in its setformat method. The method takes the wished image format as an input parameter. The
format specifies the image size in pixels and the pixel format. The pixel formats are described in
detail in section 5.5. VideoCapture tries to negotiate the format with the video driver by passing
the picture format in an ioctl() system call. If the device driver does not support the specified
pixel format, it returns an error code. If it supports the pixel format but cannot capture
pictures with the wished picture width and height, it indicates the smaller supported picture
width and height to the application. The outcome of the format negotiation is returned to the
caller.
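In the V4L2 case the negotiation boils down to a single VIDIOC_S_FMT call. The sketch below requests YUV 4:2:0 frames in the wished size and reads back what the driver actually granted; it is a simplification of the setformat method, not its actual code.

#include <sys/ioctl.h>
#include <cstring>
#include <linux/videodev2.h>

// Negotiate picture size and pixel format with the V4L2 driver.
// The driver may adjust width/height downwards; the adjusted values
// are returned in the same structure.
bool negotiateFormat(int fd, unsigned &width, unsigned &height) {
    v4l2_format fmt;
    std::memset(&fmt, 0, sizeof(fmt));
    fmt.type                = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    fmt.fmt.pix.width       = width;                 // wished picture width
    fmt.fmt.pix.height      = height;                // wished picture height
    fmt.fmt.pix.pixelformat = V4L2_PIX_FMT_YUV420;   // planar YUV 4:2:0
    fmt.fmt.pix.field       = V4L2_FIELD_ANY;

    if (ioctl(fd, VIDIOC_S_FMT, &fmt) < 0)
        return false;                                // pixel format unsupported

    width  = fmt.fmt.pix.width;                      // what the driver granted
    height = fmt.fmt.pix.height;
    return true;
}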

5.2.5 Video Data Transfer

VideoCapture uses the read/write and memory mapped I/O methods provided by the V4L API
to realize the transfer of the video data from the video driver to the application. VideoCapture
chooses the memory mapped I/O method as the preferred way to transfer the video data from
the driver to the application. This method allows applications to directly access the buffers
residing in kernel memory space used by the driver to store the video data coming from the
video device. The driver synchronizes access to the buffers by maintaining two states for
each buffer: a buffer in dequeued state is exclusively accessible by the application, and a buffer
in enqueued state is exclusively accessible by the driver. Both buffers allocated by VideoCapture
are initially in dequeued state, inaccessible by the driver. Figure 5.3 shows the
sequence diagram of the data transfer.

Figure 5.3: Sequence Diagram of Data Transfer (setFormat and formatSet, startVideo, a loop of dataAvailable and nextFrame calls, and stopVideo, exchanged between VideoLayer and VideoCapture)

VideoCapture initiates the data transfer when its startvideo method is called. VideoCapture
instructs the driver to put each buffer into enqueued state, which triggers the driver to transfer
the data from the video capture device into the buffers. The startvideo method returns the device
driver file descriptor to the caller. The caller uses the file descriptor in the select system
call, which waits to be signaled when video data is available in one of the buffers. When
the driver has transferred captured video data from the capture device into one of the buffers,
it signals this to the waiting select system call. VideoCapture provides the video data with its
nextframe method. In the nextframe method, VideoCapture first instructs the driver to put the
buffers previously used by the application (being in dequeued state), if any, back into enqueued state.
This operation creates a loop effect, triggering the driver to transfer the next captured
frame from the capture device. It then instructs the driver to put the buffer containing the
captured video data into dequeued state, making it available to the application. The nextframe
method returns a pointer to the buffer containing the captured video data to the caller.

If the video driver does not implement the memory mapping I/O method, VideoCapture falls
back to the conventional read method. VideoCapture does not use the user pointer I/O method
to transfer the video data. Applications using that I/O method allocate the buffers in
virtual or shared memory and then pass the pointers to the allocated memory to the driver
using the Linux ioctl system call. In this way the application gains control over the allocated
memory layout or may pass the buffers on in an interprocess communication. Neither feature is
used by the Push-to-X client.
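The enqueue/dequeue cycle described above corresponds roughly to the following V4L2 sketch. The loop structure is simplified compared to the scheduler-driven implementation, and the streaming is switched on with VIDIOC_STREAMON, a V4L2 step not detailed in the text above.

#include <sys/ioctl.h>
#include <sys/select.h>
#include <cstring>
#include <linux/videodev2.h>

// Simplified capture cycle: enqueue both buffers, start streaming,
// wait with select() and dequeue the filled buffer.
bool captureOneFrame(int fd, v4l2_buffer &filled) {
    for (unsigned i = 0; i < 2; ++i) {               // hand both buffers to the driver
        v4l2_buffer buf;
        std::memset(&buf, 0, sizeof(buf));
        buf.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
        buf.memory = V4L2_MEMORY_MMAP;
        buf.index  = i;
        if (ioctl(fd, VIDIOC_QBUF, &buf) < 0) return false;
    }
    int type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    if (ioctl(fd, VIDIOC_STREAMON, &type) < 0) return false;

    fd_set fds;
    FD_ZERO(&fds);
    FD_SET(fd, &fds);
    if (select(fd + 1, &fds, NULL, NULL, NULL) <= 0) // wait for a filled buffer
        return false;

    std::memset(&filled, 0, sizeof(filled));
    filled.type   = V4L2_BUF_TYPE_VIDEO_CAPTURE;
    filled.memory = V4L2_MEMORY_MMAP;
    // the dequeued buffer carries the capture timestamp in filled.timestamp
    return ioctl(fd, VIDIOC_DQBUF, &filled) == 0;
}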

5.3 Display

Figure 5.4 shows the Push-to-X client display. The display is integrated into the Push-to-X
client Graphical User Interface (GUI), which is realized with Qt. Qt is a cross-platform
graphical widget toolkit produced by the Norwegian company Trolltech. On platforms with
the X Window graphical system, Qt is built on top of the Xlib library, providing higher-level graphical
constructs to user applications. However, the Push-to-X client realizes the video functionality
by direct use of the X Window system graphical primitives within a Qt widget. Figure
5.5 shows the interactions between Qt, Xlib and the Push-to-X application.

Figure 5.4: Push-to-X Client Display

The display component is implemented in the VideoDisplay class, which inherits from Qt's
QWidget class. An important property of the QWidget class is that it does not hide the X Window system
resources, including an ID identifying the X server connection and an ID identifying the widget window, thus
allowing the use of the Xlib API.

Programming the VideoDisplay involves the following steps:

• Querying the X server properties and determining the pixel formats supported by the
server

• Establishing the communication with the Push-to-X client backend

• Transferring the video data from the Video Layer to VideoDisplay

• Transferring the video data from VideoDisplay to the X server



Figure 5.5: Interactions between Qt, X Window System and Display (the Qt application with GUI, Display, Backend and Video Layer uses Xlib on top of the Linux kernel to communicate with the X server)

5.3.1 GUI - Backend Communication

Though Qt is a convenient tool for building the graphical elements, it introduces
complications in conjunction with the single-threaded Push-to-X client execution environment
running the backend code, which, like Qt, is also built around a main event loop. The Push-
to-X client design solves this by executing the single-thread-dependent backend code in a
separate Qt thread, forming the two contexts shown in figure 5.6.

As a consequence of using multiple threads, the Push-to-X client must also define a commu-
nication mechanism between components executing in separate threads.

The communication between the GUI and the backend is implemented by means of so-called
transfer messages using an asynchronous communication model. The transfer message commu-
nication is based on the UNIX pipe mechanism on the backend side and the QSocketNotifier mechanism
on the GUI side. On the backend side the mechanism is implemented in the TransferInterface mod-
ule, which is part of the main application layer. The TransferInterface module uses an incoming
queue to store messages sent from the GUI to the backend and an outgoing queue to store messages
sent from the backend to the GUI. The modules involved in the GUI - backend communication define
their own sets of transfer messages, which are ordinary C++ classes storing the information they
want to exchange. Each message has an identifier, which enables the TransferInterface mod-
ule to associate the message with a message handling routine. A simplified use of transfer
messages is given in the following.

Figure 5.6: GUI - Backend Communication (the backend and GUI contexts exchange transfer messages via a shared incoming message queue; the backend interface reads and writes a signal over a UNIX pipe, which triggers the signal/event handling routine)

GUI components communicate with backend components by performing the following four-step
procedure. The GUI component first instantiates the transfer message it wants to send to a
backend component. Next, it fills the message with the information it wants to pass. Then it
enqueues the message in the TransferInterface module's incoming queue. Finally, it signals the
enqueuing action to the backend TransferInterface module by writing a signal code to the UNIX
pipe. This action results in an "event" which triggers the backend scheduler to call the
incoming-queue handling routine the next time the backend gets an execution time slice. The
incoming-queue handling routine then delivers the message to the destination backend module by
calling the message handling routine associated with the transfer message identifier.

At program start the VideoDisplay component uses the transfer message mechanism to pass
the pixel format supported by the X Window server to the Video Layer. The Video Layer uses this
information to transfer the video frames to the VideoDisplay component in the pixel format supported by
the X Window server. The communication in the backend-to-GUI direction is implemented
in a similar way, using the TransferInterface module's outgoing queue and the QSocketNotifier
mechanism. The principles of this mechanism are described in more detail in section 5.3.4.

5.3.2 Identifying Pixel Format

Determining the pixel format used by the X server to display the video data takes place in the
constructor of the VideoDisplay class. For that purpose we use the Xlib functions XGetVisualInfo
and XListPixmapFormats. We first call the XGetVisualInfo function, which returns a list
of XVisualInfo structures describing the visuals available on the X Window server. The
X Window server uses a visual to display bitmaps (see section 2.1.1). The first, default XVisualInfo
in the list specifies the visual used by the QWidget parent class, which is also used for displaying
the video data.

Next we call the XListPixmapFormats function, which returns a list of XPixmapFormatValues
structures specifying the pixel formats the server is able to display. Then we look in the
XPixmapFormatValues list for the pixel format used by the default visual by comparing the
depth field of each XPixmapFormatValues structure with the depth field of the default XVisualInfo
structure. If we find an XPixmapFormatValues entry whose depth field matches the depth field
of the XVisualInfo structure, then this depth identifies the pixel format used by the server to
display pixmaps. The identified pixel format is passed to the Video Layer in the first message
after the communication between VideoDisplay and VideoLayer described in section 5.3.1
has been established.
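A condensed sketch of this lookup, using the Xlib calls named above; how the display connection and the default visual ID are obtained from the QWidget is assumed here and not shown.

#include <X11/Xlib.h>
#include <X11/Xutil.h>

// Find the pixmap format matching the depth of the default visual.
// 'display' and 'defaultVisualId' are assumed to come from the QWidget.
int findPixmapDepth(Display *display, VisualID defaultVisualId) {
    XVisualInfo tmpl;
    tmpl.visualid = defaultVisualId;
    int nVisuals = 0;
    XVisualInfo *vis = XGetVisualInfo(display, VisualIDMask, &tmpl, &nVisuals);
    if (!vis || nVisuals == 0)
        return -1;

    int matchedDepth = -1;
    int nFormats = 0;
    XPixmapFormatValues *fmts = XListPixmapFormats(display, &nFormats);
    for (int i = 0; i < nFormats; ++i) {
        if (fmts[i].depth == vis[0].depth) {   // depth identifies the pixel format
            matchedDepth = fmts[i].depth;      // fmts[i].bits_per_pixel is also available
            break;
        }
    }
    XFree(fmts);
    XFree(vis);
    return matchedDepth;
}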

Apart from being used as a means to identify the pixel format, the default XVisualInfo is
also used in the call to the XCreateImage() function described in section 5.3.3.

5.3.3 Video Data Transfer to X Window Server

The VideoDisplay class transfers the video data to the X server using the XImage Xlib data structure,
where the XImage fully specifies the format of the video frame to be displayed by the X server. The Xlib
API also provides C functions for creating and destroying XImage structures. Figure 5.7
shows the transfer of a video frame to the X server, where the video frame is a pixmap from the X
Window system perspective.

First we allocate a memory buffer for storing the video frame data. Next we create the XImage
structure with the XCreateImage() function. The function takes, among others, the visual and the
pointer to the previously allocated memory buffer as its arguments. The visual is used by Xlib
to set the XImage fields describing the format of the bitmap. Then we copy the video data
into the allocated memory buffer and call the XPutImage() function, which causes Xlib to transfer
the data to the X Window server.
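The sequence just described corresponds roughly to the following Xlib sketch without the shared memory extension; the assumption of 4 bytes per pixel matches a depth-24 TrueColor pixmap format, and the window, GC and visual are taken as given.

#include <X11/Xlib.h>
#include <X11/Xutil.h>
#include <cstdlib>
#include <cstring>

// Wrap an RGB frame in an XImage and push it to the X server.
// 'display', 'window', 'gc' and 'visual' are assumed to be obtained
// from the QWidget; 4 bytes per pixel matches a depth-24 TrueColor visual.
void showFrame(Display *display, Window window, GC gc, Visual *visual,
               const unsigned char *rgbData, int width, int height) {
    char *buffer = static_cast<char *>(std::malloc(width * height * 4));
    std::memcpy(buffer, rgbData, width * height * 4);   // copy the frame data

    XImage *image = XCreateImage(display, visual, 24, ZPixmap, 0,
                                 buffer, width, height, 32, 0);
    XPutImage(display, window, gc, image, 0, 0, 0, 0, width, height);
    XFlush(display);

    XDestroyImage(image);    // also frees 'buffer'
}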

Figure 5.7: Video Frame Transfer to X Server (the pixmap is copied into an XImage and transferred to the X server with XPutImage over the X protocol)

Since the X client and the X server are two processes, each running in its own virtual memory
address space, transferring the data between them implies copying the data across the two
processes. In order to increase the performance of the data transfer for X applications running on
the same machine as the X server and to avoid data copying across client and server memory space,
Xlib provides the XShmSegmentInfo data structure, which uses operating system facilities for IPC such
as shared memory. The VideoDisplay class uses the XShmSegmentInfo data structure when creating
the XImage data structure.
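When the MIT-SHM extension is available, the image buffer can instead be placed in a shared memory segment. A minimal sketch of that path follows, under the assumption that the availability of the extension has already been checked; the image is later displayed with XShmPutImage() instead of XPutImage().

#include <X11/Xlib.h>
#include <X11/extensions/XShm.h>
#include <sys/ipc.h>
#include <sys/shm.h>

// Create an XImage backed by a shared memory segment (MIT-SHM path).
XImage *createSharedImage(Display *display, Visual *visual, int depth,
                          int width, int height, XShmSegmentInfo *shminfo) {
    XImage *image = XShmCreateImage(display, visual, depth, ZPixmap,
                                    NULL, shminfo, width, height);
    shminfo->shmid = shmget(IPC_PRIVATE,
                            image->bytes_per_line * image->height,
                            IPC_CREAT | 0600);
    shminfo->shmaddr = image->data =
        static_cast<char *>(shmat(shminfo->shmid, NULL, 0));
    shminfo->readOnly = False;
    XShmAttach(display, shminfo);    // the server maps the same segment
    return image;                    // later transferred with XShmPutImage()
}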

5.3.4 Video Data Transfer from Video Layer

Figure 5.8 shows the components involved in the transfer of video frames, along with the
sequence diagram.

The transfer of video data to VideoDisplay is based on Qt's mechanism for monitoring activity
on a file descriptor, implemented in Qt's QSocketNotifier class. VideoDisplay uses QSocket-
Notifier to associate the file descriptor provided by the VideoLayer at program start with its
vidDevMessageReader method. QSocketNotifier ensures that the vidDevMessageReader method
is called whenever a write operation occurs on the file represented by the file descriptor
previously passed by the VideoLayer. The VideoLayer does not use the file itself, which is an
unnamed UNIX pipe, to pass the bulk of the video data to VideoDisplay; rather, it uses the file to signal
the queuing of the video frames to be displayed by VideoDisplay in a FIFO queue allocated
in global memory space. Since the queue resides in global memory space, it is accessible
both by the VideoLayer executing in the backend thread context and by VideoDisplay executing in
the GUI thread context.

The transfer of video frames occurs as follows: the VideoLayer, executing in the backend thread, first
stores the video frame to be displayed in the FIFO queue located in global memory space.
It then signals the queuing of the video frame by writing a one-byte signal code to the
UNIX pipe. This results in a call to the VideoDisplay vidDevMessageReader method the next
time VideoDisplay, executing in the GUI thread, receives an execution time slice from the Linux
kernel. VideoDisplay safely dequeues the video frame and transfers it to the X Window server
to be displayed on the user's screen.

Figure 5.8: Transfer of Video Frames (at program start the VideoLayer creates the pipe and passes its file descriptor; it then enqueues a video frame and writes to the pipe, and the resulting signal causes VideoDisplay to dequeue the frame)
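A reduced sketch of the GUI side of this mechanism, using Qt's QSocketNotifier; the class name, the queue access and the display call are simplifications and assumptions rather than the actual VideoDisplay code, and the class needs the usual moc processing because of Q_OBJECT.

#include <qobject.h>
#include <qsocketnotifier.h>
#include <unistd.h>

// GUI side: react to the one-byte signal written to the pipe by the backend
// and dequeue the next video frame from the shared FIFO queue.
class VideoDisplaySketch : public QObject {
    Q_OBJECT   // required for the signal/slot connection (moc)
public:
    explicit VideoDisplaySketch(int pipeReadFd) {
        notifier = new QSocketNotifier(pipeReadFd, QSocketNotifier::Read, this);
        connect(notifier, SIGNAL(activated(int)),
                this, SLOT(vidDevMessageReader(int)));
    }
public slots:
    void vidDevMessageReader(int fd) {
        char code;
        read(fd, &code, 1);              // consume the one-byte signal code
        // dequeue the frame from the shared queue and hand it to the X server,
        // e.g. VideoFrame *frame = frameQueue.dequeue(); putFrameToXServer(frame);
    }
private:
    QSocketNotifier *notifier;
};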

5.4 Codec Interface

The core components of the video compression are the libavcodec and libavformat libraries. The
two libraries are distributed as part of the ffmpeg project and are available as source code
under the GNU license.

The libavcodec library is optimized for use on a variety of hardware platforms with multi-
media extensions such as Intel's MMX and AMD's 3DNow, which enable efficient execution
of the very time-consuming encoding and decoding operations. The time needed to perform
encoding and decoding operations is critical for Push-to-X clients. As the time needed to
encode a frame may not exceed the time interval between two frames, a too long encoding
operation causes the Push-to-X client to lower the frame rate, which decreases the temporal
quality of a sequence. At the same time, too long decoding operations lead to congestion,
eventually resulting in data packet drops and consequently in a decrease in video quality.

libavcodec supports a very large number of video codecs, including the elementary H.261
and the very fast H.263 video conferencing codecs. It also provides, to a large extent, support
for the most advanced JVT/H.264 codec, which, because of its finer spatial and temporal
granularity of video data, is especially suitable for use in a Push-to-X system running on top of
an IP network where network fluctuations and packet losses are commonplace.

libavformat supports a wide range of multimedia file formats, from the very elementary mpg format
to the most advanced, very flexible and extensible mp4 multimedia file format, defined as part
of the MPEG-4 standard (MPEG-4 part 12). The Push-to-X client uses file format packetization
to include timing, fragmentation and continuity information in the video streams.

Figure 5.9 shows the main libavformat data structures in relation to the mpg file format.

Figure 5.9: libavformat Data Structures in Relation to the mpg File Format (an AVOutputFormat describes an AVStream with its time base; coded frames are grouped into GOPs and terminated by an end-of-sequence code)

mpg files contain one or more media streams, where a video stream is a set of independent
video sequences. Each video sequence starts with a sequence header, includes one or more
GOPs, and ends with an end-of-sequence code. The sequence header contains the timing
information, including the sequence time base and the sequence bit rate. The sequence time base is
the fundamental unit of time, expressed in seconds, in terms of which frame timestamps are
interpreted. For content with a constant frame rate, the time base is 1/frame-rate and the
timestamp increments are identically 1. The decoder uses the time base information to restore
the timing of the video sequence and to handle packet losses.

The mp4 file format is based on Apple's QuickTime multimedia storage format. One of the
aspects addressed in the mp4 file format is the separation of data and metadata, where metadata
denotes any additional information about the media data that are to be sent or stored. It
introduces the notion of hints, which are instructions necessary for the correct fragmentation
and time stamping of the data to be streamed. The essential feature of mp4 files for the Push-
to-X system is that the format is independent of any particular delivery protocol while enabling
efficient support for the Push-to-X protocol, which provides a thin layer between UDP and the media
streams.

libavcodec is capable of encoding raw video frames coming from a live source such as a web
camera. The encoder operates on frames in the YUV420P pixel format, while the support for
other pixel formats is provided by a set of format conversion routines. The encoded
frames are passed to libavformat, which performs the packetization. The parameters of
the input and output video streams as well as of the coding process are controlled by a set of
parameters passed to the encoder and decoder in the two C data structures AVCodecContext and
AVFormatContext.

The Push-to-X client incorporates the two libraries, libavcodec and libavformat, in the AVCodecHandler
class. Its main purpose is to hide the video compression complexity by providing a generic,
easy-to-use interface towards the VideoLayer. Figure 5.10 shows the interface between
VideoLayer and AVCodecHandler.

Figure 5.10: AVCodecHandler Interfaces towards VideoLayer (initialization of control parameters; raw video frames and a contiguous flow of video data packets are exchanged with the encoding and decoding functions)



VideoLayer and AVCodecHandler exchange video data via the VideoFrame data structure repre-
senting raw video frames and the FrameSlice data structure representing contiguous, equally sized
video data packets. The VideoLayer instantiates AVCodecHandler after identifying the pixel format
supported by the display hardware. In the process of initialization, AVCodecHandler sets the
key parameters controlling the encoding and decoding processes. It initializes the encoder
by instantiating the AVFormatContext structure controlling the packetization of the encoded video
frames, which in turn instantiates the AVCodec structure responsible for encoding the raw video
frames and the AVCodecContext, which is the central structure for controlling the coding process.
The key codec parameters such as bit rate, frame rate, picture width and height,
pixel format, group of pictures size, quantization level (quantizer), motion estimation al-
gorithm and prediction method are set in this structure. The issues to be considered when
setting these parameters, their impact on video quality and their interdependencies are discussed
in section 5.1.1. The decoder initialization then follows the same steps.
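A compressed sketch of such an encoder initialization, showing the key parameters named above, might look as follows. It uses the libavcodec/libavformat API of that generation; function and field names changed between ffmpeg releases, so this is only indicative and not the actual AVCodecHandler code.

extern "C" {
#include <ffmpeg/avcodec.h>
#include <ffmpeg/avformat.h>
}

// Indicative encoder setup with the key parameters named in the text.
// API names follow the libavcodec generation used at the time and may
// differ in other releases.
AVCodecContext *initEncoder(int width, int height, int fps, int bitRateKbit) {
    av_register_all();                              // registers codecs and formats

    AVCodec *codec = avcodec_find_encoder(CODEC_ID_H263);
    if (!codec)
        return NULL;

    AVCodecContext *ctx = avcodec_alloc_context();
    ctx->bit_rate      = bitRateKbit * 1000;        // constant target bit rate
    ctx->width         = width;                     // e.g. 320
    ctx->height        = height;                    // e.g. 240
    ctx->time_base.num = 1;                         // time base = 1 / frame rate
    ctx->time_base.den = fps;
    ctx->gop_size      = 12;                        // example GOP size, an assumption
    ctx->pix_fmt       = PIX_FMT_YUV420P;           // encoder input format

    if (avcodec_open(ctx, codec) < 0) {
        av_free(ctx);
        return NULL;
    }
    return ctx;    // raw frames are then fed to avcodec_encode_video()
}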

Figure 5.11 shows the interactions between the major libavcodec and libavformat components
involved in the coding process.

Figure 5.11: libavcodec and libavformat Control and Data Flow (AVFormatContext/AVOutputFormat and AVCodecContext control bit rate and time base/fps; AVFrame/AVPicture YUV raw data pass through AVCodec into the output buffer as coded frames)

AVOutputFormat is the structure representing the output format of a video stream, which is
in turn represented by the AVStream structure. AVFormatContext is the structure defining
the set of parameters associated with the output format and the set of functions manipulating the
AVOutputFormat and AVStream structures.

5.4.1 Bit Rate Control

The bit rate of compressed video streams is the result of several interdependent factors shown
in the figure 5.12. The amount of details in a single frame of a sequence is represented by
DCT coefficients. Quantization parameters, quantization table and a quantization squealing
factor QUANT define the spatial quality of sequence. Frame rate defines the temporal quality
of a sequence. The amount of motion in a sequence is represented by motion vector in a
single P-frame. Group of Picture is parameter defining the fraction of the I- and P-frames
in a sequence.
Figure 5.12: Bit Rate Interdependencies (the bit rate depends on frame rate, picture format, amount of motion, amount of detail, GOP size, VLC tables and QUANT, trading off robustness against video quality)

Pictures containing more detail generate shorter runs of zeros in the DCT encoded picture blocks,
resulting in longer VLC sequences and an increase in bit rate. In the quantization process the
DCT coefficients are divided by a parameter called QUANT and quantized using one of the
quantization tables. Higher quantizer values reduce the number of bits required to encode a
picture and introduce more distortion into the coded video, while lower quantizer values increase
the number of bits needed to encode the picture and increase the picture quality. Using a quantization
table which coarsely quantizes the DCT coefficients decreases the bit rate and degrades the
quality of the picture.

A higher frame rate increases the bit rate and improves the temporal quality, while a lower frame
rate decreases the bit rate and degrades the temporal quality of a sequence.

A video sequence containing fast moving objects generates larger motion vector values in the
P-frames, represented by longer VLC sequences, resulting in an increase in bit rate.

Increasing the GOP size increases the fraction of less bandwidth-demanding P-frames in the
video stream, but makes the system more vulnerable to packet losses. Choosing the right
GOP size is therefore a tradeoff between the robustness of the video stream against packet losses and
the demand on network bandwidth.

Considering the above-mentioned factors, it is obvious that a constant bit rate can be achieved
only at the expense of video quality. libavcodec uses a mechanism which keeps the bit rate at an
approximately constant level by modifying the quantization scaling factor QUANT. The quan-
tization scaling factor QUANT is incremented or decremented in the range 1 to 31 at the
macroblock level, depending on the fill level of the output buffer.

5.5 Supported Pixel Formats

Table 5.1 specifies the pixel formats used by the Push-to-X client to exchange image data
between the VideoLayer, VideoCapture, AVCodecHandler and VideoDisplay components.

Format ID          Color Space   Arrangement of Pixel    Sampling   Depth
                                 Components in Memory    Schema
PIXFMT RGB15       RGB           Packed                  4:4:4      15
PIXFMT RGB16       RGB           Packed                  4:4:4      16
PIXFMT RGB24       RGB           Packed                  4:4:4      24
PIXFMT BGR24       RGB           Packed                  4:4:4      24
PIXFMT RGB32       RGB           Packed                  4:4:4      32
PIXFMT BGR32       RGB           Packed                  4:4:4      32
PIXFMT YUYV        YUV           Packed                  4:2:2      16
PIXFMT YUV422P     YUV           Planar                  4:2:2      16
PIXFMT YUV420P     YUV           Planar                  4:2:0      12
PIXFMT UYVY        YUV           Packed                  4:2:2      16

Table 5.1: PTX Client Supported Pixel Formats

The Push-to-X client uses the RGB formats primarily to pass digital images to the X Window sys-
tem. The running X Window system should support one of these formats, as they match
the pixel formats of typical PC graphics frame buffers. The conversion between the RGB formats
and the PIXFMT YUV420P format is performed according to the equations given in section
2.1.3.

The PIXFMT YUV420P pixel format, shown in figure 5.13, has an outstanding role, as the used com-
pression algorithms operate on this format. This is a planar format with the color components
(Cb and Cr) sampled at half the horizontal and vertical resolution of the Y component. The three
components are separated into three sub-images or planes. The Y plane comes first and
has one byte per pixel. The chrominance planes immediately follow the Y plane in memory;
each is half the width and half the height of the Y plane (and of the image).
Each chrominance sample belongs to four pixels, a two-by-two square of the image. For example, V00 belongs
to Y00, Y01, Y10, and Y11. If the Y plane has pad bytes after each row, then the Cr and Cb planes
have half as many pad bytes after their rows.

Figure 5.13: PIXFMT YUV420P Sampling Pattern and YUV Planes (luminance samples Y00..Y33 followed by the chrominance sample planes V00..V11 and U00..U11)
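For a W x H frame the plane sizes follow directly from the sampling schema. The following sketch computes the sizes and offsets, assuming no row padding; the chroma planes are named generically since only their relative position after the Y plane matters here.

#include <cstddef>

// Plane offsets of a PIXFMT_YUV420P frame without row padding:
// the full-resolution Y plane is followed by the two quarter-size
// chrominance planes, giving 12 bits per pixel on average.
struct Yuv420Planes {
    size_t ySize, cSize;           // bytes in the Y plane and in each chroma plane
    size_t yOffset, chroma1Offset, chroma2Offset;
    size_t totalSize;
};

Yuv420Planes layoutYuv420(size_t width, size_t height) {
    Yuv420Planes p;
    p.ySize         = width * height;             // one byte per pixel
    p.cSize         = (width / 2) * (height / 2); // half width, half height
    p.yOffset       = 0;
    p.chroma1Offset = p.ySize;                    // first chroma plane follows Y
    p.chroma2Offset = p.ySize + p.cSize;          // second chroma plane follows it
    p.totalSize     = p.ySize + 2 * p.cSize;      // e.g. 320x240 -> 115200 bytes
    return p;
}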

The Push-to-X client tries to negotiate the PIXFMT YUV420P format with the video device driver as
its preferred choice, to avoid a format conversion in the subsequent image compression stage. If the
X Window system and the graphics hardware provide support for the PIXFMT YUV420P format,
it is also used for displaying.

5.6 Network Aspects

The PTX group communication model assumes a 1:n communication relation having one
sender and one or more media stream receivers. The base design uses a simple communication
model by establishing separate peer-to-peer communications with each group member and
multiplexing the media streams on the sender side. In comparison to voice representing video
data requires a much larger amount of information emphasizing the issue of scalability.

We now consider the Push-to-X communication model in the IP network shown in figure
5.14 from a scalability perspective.

Figure 5.14: Push-to-X Communication Model (clients a to f connected via subnetwork A, 1000 kbit/s shared, and subnetwork B, 100 kbit/s per host, joined by a router; the media streams require 40 kbit/s each)

The network consists of subnetwork A and subnetwork B connected by a router. Subnetwork A
is an Ethernet-based network with a total available bandwidth of 1000 kbit/s
shared between its hosts, while subnetwork B provides a bandwidth of 100 kbit/s to each
of its hosts.

We see that the streams carrying the same data content partly follow the same path on
their way to the receivers, and accordingly each consumes its part of the network bandwidth.
Client a communicates with the group members using media streams requiring 40 kbit/s of
bandwidth each. In the process, client a uses 200 kbit/s of the subnetwork A bandwidth and
40 kbit/s of the subnetwork B bandwidth.

We now consider the scenario shown in figure 5.15, where client e wants to communicate
with the group members.

Figure 5.15: Connection Originating from the B Network

It is easy to see that in this deployment client e cannot communicate with all group
members using media streams requiring 40 kbit/s of bandwidth. Subnetwork B limits the
size of the group to three group members when communicating with media streams requiring
40 kbit/s of bandwidth.

In order to enable group communication where the underlying network type does
not limit the size of the group, we consider two approaches: one using IP multicasting and
the other introducing a proxy server into the Push-to-X network.

5.6.1 Peer-to-Peer versus Multicast

The advantage of using the IPv4 protocol multicasting capabilities is obvious. The sending
Push-to-X client does not send multiple media streams to each and every Push-to-X group
member; rather, it sends a single stream to an IP multicast address and lets the underlying
IP protocol take care of delivering the PTX packets to the receivers. In order to receive
the media stream, the receiving clients of the Push-to-X group must then join an IP multicast
group identified by the IP multicast address. IPv4 allots the address space for multicast
communication in the range 224.0.0.0 through 239.255.255.255, from which the range 224.0.0.0
through 224.0.0.255 is reserved for internal use by the IP layer.

However, IPv4 multicast communication has severe limitations in WAN networks. The IPv4
multicast address space is scoped to define the number of routers the multicast packets may
cross. IPv4 provides two ways of specifying the scope that IP packets destined to a
multicast address can reach. Traditionally, the TTL IP header field has a scoping meaning
when used in the context of an IP multicast address, while administrative scoping uses
ranges of the IPv4 multicast address space. Either way, in the end the scoping relies on the
administrative network configuration, which defines the following five multicast scopes.

1. node-local

2. link-local

3. site-local

4. organization-local

5. global

Node-local scoped packets have a meaning only on a multi-homed host, as they are not
sent to the data link layer, while link-local scoped packets are never forwarded by a router.
Site-local and organization-local scopes are defined by the administrator configuring the routers
not to forward packets having site- or organization-local scope. In addition, we mention
here that the IPv6 protocol does not bring any changes, as it also uses the same administrative
scoping.

In general, we can say that the use of the IP protocol multicasting capabilities for the Push-to-X
service limits the service to administrative network boundaries. Because we want to provide
a global service available to everyone with access to an IP network, we do not use IP
multicast as the underlying technology in the Push-to-X system.

5.6.2 Push-to-X Proxy

Figure 5.16 shows the Push-to-X Proxy Server in the network described in section
5.6.

The PTX Proxy Server is an intermediate PTX host along the clients' communication path for
the purpose of multiplexing and forwarding the media streams. Deploying the PTX Proxy
Server in subnetwork A decreases the traffic load in subnetwork B and saves link bandwidth
between client f and the router.

Figure 5.16: PTX Proxy Server in a Push-to-X Network

However, its deployment is also not without costs. The precondition for this communication is
client addressing. In order to keep the protocol overhead to a minimum, the PTX protocol
does not foresee transport protocol independent addressing; rather, it extends the underlying
IP/UDP addressing scheme to route the PTX packets to the destination PTX layer. This type of
addressing is not suitable for use together with a proxy server, thus we extend the protocol to make
it suitable for use with a proxy, based on an analysis of the sequence of Push-to-X messages
exchanged during peer-to-peer communication.

Figure 5.17 shows the sequence of PTX protocol messages along with the media payload in
a peer-to-peer communication.

Figure 5.17: Push-to-X Sequence Diagram (occRequest(sender name), occRequest: Ack, occupy(), media stream and release() exchanged between clients f, a and b)

The occRequest message starting the message sequence contains only the sender name
in its body. This information is not sufficient to route the messages via a proxy to the destination
address. The proxy requires either the destination client's IP address and UDP port to be provided
directly in the message body, or the message body must additionally contain at least the
destination client's user name, allowing the proxy to query the administration server for the
user's destination IP address and UDP port. Because we do not want to put additional
strain on the administration server, we extend the occRequest message body to include
the destination client's IP address and UDP port. Though this extension now allows the proxy
server to forward the occRequest message to the destination client, the sending client still

must send a message for each single group member. Thus we extend further the body of
occRequest to include the IP addresses and UDP ports for each group member. As this change
does not affect the handling of occRequest message on the receiving client the occRequest
acknowledgment message remains unchanged.
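
To make the extension concrete, the occRequest body can be thought of as the sender name followed by a list of member endpoints. The following sketch is purely illustrative; the type and field names are hypothetical, and the actual PTX message encoding is defined by the protocol implementation.

    // Hypothetical sketch of the extended occRequest body: the sender name plus
    // one IP address/UDP port pair per group member, so that a PTX Proxy Server
    // can forward the request without querying the administration server.
    #include <cstdint>
    #include <string>
    #include <vector>

    struct MemberEndpoint {
        uint32_t ipv4Address;  // destination IP address (network byte order)
        uint16_t udpPort;      // destination UDP port
    };

    struct OccRequestBody {
        std::string sender;                   // sender name, as in the original message
        std::vector<MemberEndpoint> members;  // added: endpoints of all group members
    };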

The handling of the subsequent occupy, release and media messages is determined by the fact
that Push-to-X is a state-oriented protocol. All following messages are exchanged in reference
to the first occRequest message. This means that these messages remain unchanged, but the
PTX Proxy Server must maintain the protocol states. The figure 5.18 shows the state diagram
of the Push-to-X protocol.
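
How a proxy could track these states per group is sketched below; the enum values mirror the states in figure 5.18, while the handler names are hypothetical and not taken from the actual PTX Proxy implementation.

    // Hypothetical sketch: protocol state tracked by a PTX proxy for one group,
    // with the states of figure 5.18: idle, pending, occupied and playing.
    enum class PtxState { Idle, Pending, Occupied, Playing };

    class ProxyGroupState {
    public:
        void onOccRequest() { if (state_ == PtxState::Idle)    state_ = PtxState::Pending;  }
        void onOccupy()     { if (state_ == PtxState::Pending) state_ = PtxState::Occupied; }
        void onMedia()      { if (state_ == PtxState::Occupied ||
                                  state_ == PtxState::Playing) state_ = PtxState::Playing;  }
        void onRelease()    { state_ = PtxState::Idle; }
        void onTimeout()    { state_ = PtxState::Idle; }  // time-outs fall back to idle

    private:
        PtxState state_ = PtxState::Idle;
    };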

The figure 5.19 shows the sequence of Push-to-X protocol messages along with the media payload
in a peer-to-peer communication using an intermediate Proxy Server.

Figure 5.18: Push-to-X State Diagram (states idle, pending, occupied and playing; transitions
triggered by occRequest, occupy, media streams, release and time-outs)

In order to further investigate the effects of the use of a PTX Proxy Server, we reconsider
the first scenario shown in the figure 5.15, where client a wants to communicate with the group
members, now using the intermediate Proxy Server shown in the figure 5.20.

Obviously, the use of a Proxy Server increases the bandwidth consumption in network A, as
client a must accommodate an additional data stream to the PTX Proxy: instead of five unicast
streams to the group members it now sends six, resulting in a total bandwidth consumption of
240 kbit/s in network A.

In order to avoid this unnecessary additional bandwidth consumption for Push-to-X communication
originating from networks that already provide sufficient bandwidth, the use of the PTX Proxy
Server is optional and defined by the client configuration. The user configures the Push-to-X
client to use the Proxy Server by providing the proxy host location (IP address) on the command
line at program start. The UDP port number is not provided, as the PTX Proxy Server uses the
standard port 5799. An alternative is that the client configuration is directed by the
Push-to-X administration server as part of the user registration process. In this case the
server provides the client with the proxy server IP address based on the client location
(client IP) and the location of the group members (their client IP addresses). This more
flexible approach relieves the user of the need to know the network configuration, but adds
significant complexity to the PTX administration server implementation. In particular, it
requires communication between the PTX administration server and the PTX proxy server. Because
of its complexity this solution requires a project of its own and might be considered in
future work.
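
A minimal sketch of the command-line part of this configuration is given below; the option name --proxy and the helper function are hypothetical, only the default port 5799 is taken from the PTX Proxy Server.

    // Hypothetical sketch: reading an optional proxy address from the command line.
    // Without a --proxy option the client sends its streams directly to the group
    // members; the port always defaults to the PTX Proxy Server's standard port 5799.
    #include <cstdint>
    #include <optional>
    #include <string>

    struct ProxyConfig {
        std::string address;     // proxy host IP address given by the user
        uint16_t    port = 5799; // standard PTX Proxy Server port, not configurable
    };

    std::optional<ProxyConfig> parseProxyOption(int argc, char *argv[])
    {
        for (int i = 1; i + 1 < argc; ++i) {
            if (std::string(argv[i]) == "--proxy")
                return ProxyConfig{argv[i + 1]};
        }
        return std::nullopt; // no proxy configured: peer-to-peer operation
    }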

Figure 5.19: Push-to-X Sequence Diagram with a Proxy Server (client f sends occRequest(sender
list) to the PTX Proxy, which forwards occRequest(sender name) to clients a and b, relays the
acknowledgments back to f, and subsequently forwards occupy(), the media streams and release()
to both receivers)

Figure 5.20: Connection Originating from B Network using a PTX Proxy Server (same network as
in figure 5.16, now with the additional 240 kbit/s traffic load in subnetwork A)

6 Tests and Results

This chapter presents the tests performed with the developed Push-to-X system. The tests
cover the two most significant factors for network video applications: the quality of the
transmitted video sequence and the bandwidth consumption.

6.1 Bandwidth Consumption

The Push-to-Video service is evaluated with ProtoSim. The tests are performed in a simulated
Ethernet-based IP network providing sufficient bandwidth to accommodate the traffic load
generated by the Push-to-X clients participating in a six-user group communication,
administrated by a Push-to-X server connected to the same network.

Table 6.1 shows the client’s parameters affecting the bandwidth consumption of the output
video stream.

Video Codec                    MPEG-4
Picture Format                 SIF 320 × 240
Frame Rate                     16 fps
GOP Size                       16
Target (Wished) Bit Rate       128000 bit/s
Maximum Payload Packet Size    2000 Byte

Table 6.1: Client’s Parameters
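
For illustration, the following sketch shows how these parameters could be applied to an MPEG-4 encoder context, assuming the encoder is driven through FFmpeg's libavcodec; current API names are used for readability, and this is a configuration sketch rather than the client's original source code.

    // Sketch: MPEG-4 encoder configuration matching Table 6.1 (libavcodec assumed).
    extern "C" {
    #include <libavcodec/avcodec.h>
    }

    AVCodecContext *configureEncoder()
    {
        const AVCodec *codec = avcodec_find_encoder(AV_CODEC_ID_MPEG4);
        AVCodecContext *ctx  = avcodec_alloc_context3(codec);

        ctx->width     = 320;                // SIF picture format
        ctx->height    = 240;
        ctx->time_base = AVRational{1, 16};  // 16 frames per second
        ctx->gop_size  = 16;                 // one INTRA frame per 16 frames
        ctx->bit_rate  = 128000;             // target ("wished") bit rate in bit/s
        ctx->pix_fmt   = AV_PIX_FMT_YUV420P;

        avcodec_open2(ctx, codec, nullptr);  // error handling omitted in this sketch
        return ctx;
    }

The 2000 byte maximum payload packet size from Table 6.1 does not concern the encoder; it is assumed to be enforced later, when the encoded bit stream is split into PTX packets.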

The test lasts approximately 180 s and starts with all six clients participating in the group
communication. The figure 6.1 shows the traffic load generated in the test scenario, where
three clients leave the group one at a time after sending a Bye message to the remaining group
members.

Figure 6.1: Traffic Load in the Six Member Group Communication

In order to emphasize the effects of the relatively short control messages exchanged between
the peers and with the server, the diagram shows the traffic load expressed as the number of
IP packets sent over the network, which provides sufficient accuracy for a qualitative analysis
of the test results. The test results match our expectations. The diagram shows the typical
slight increase in traffic load prior to the bulk media data flow, caused by the exchange of
floor control messages between the clients. The diagram also shows a sharp rise in bandwidth
demand at the start of the media data flow, followed in the same way by a sharp drop. This is
due to the bit rate control mechanism, which gradually increments the quantization scale factor
QUANT the moment the bit rate exceeds its limits. The quantization mechanism also compensates
for the bit overflow, delivering the bit stream at an approximately constant average bit rate.
The same holds for fast movements, which appear as high peaks in the traffic load diagram.
Furthermore, the diagram shows two peaks not related to peer communication; they are the result
of the regular client online status interrogations by the server. Finally, the diagram shows
the linear decrease in network bandwidth demand as the group size shrinks.
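
The principle behind this behaviour can be sketched as a simple feedback loop over the quantization scale. The following fragment only illustrates the idea; the actual rate control is performed inside the MPEG-4 encoder and is considerably more elaborate.

    // Illustrative sketch of bit rate control by adapting the quantizer QUANT:
    // coarser quantization when the bit budget is exceeded, finer when there
    // is headroom, keeping the average output bit rate roughly constant.
    int adaptQuant(int quant, long producedBits, long targetBitsPerFrame)
    {
        const int kMinQuant = 2;   // finest quantization (best quality)
        const int kMaxQuant = 31;  // coarsest quantization (lowest bit rate)

        if (producedBits > targetBitsPerFrame && quant < kMaxQuant)
            ++quant;               // bit budget exceeded: quantize more coarsely
        else if (producedBits < targetBitsPerFrame && quant > kMinQuant)
            --quant;               // budget underrun: spend bits on better quality

        return quant;
    }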

6.1.1 Video Traffic Profile in Comparison to Voice

The following tests show the differences between the voice and video traffic profiles. The
table 6.2 shows the client’s parameters affecting the traffic profile used during the tests.

Audio Capture Interval    20 ms
Frame Rate                16 fps

Table 6.2: Client’s Traffic Profile Parameters

In the test scenario the first three clients join the six-user group one at a time by sending
a voice Hello message to the remaining group members; then the same test is performed with
the clients using the video service.

Packet Rate

The figure 6.2 shows the IP packet rate of voice data and the figure 6.3 shows the IP packet
rate of video data.

Figure 6.2: Voice Packet Rate

Figure 6.3: Video Packet Rate

The traffic load is expressed as the number of IP packets in order to emphasize the effect of
the small voice packets that are sent frequently to avoid longer latency contributed by the
sender. The voice service shows a significantly higher network access rate than the video
traffic. In networks with statistical access, such as Ethernet, this traffic profile leads to
more collisions and thus to higher jitter.
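
A rough check with the parameters from Tables 6.1 and 6.2 confirms this ratio: a voice capture interval of 20 ms yields 1/0.02 s = 50 packets per second per active sender, whereas video at 16 fps and a target bit rate of 128 kbit/s produces on average 128000/16 = 8000 bit = 1000 byte per frame, which fits into a single packet below the 2000 byte payload limit, i.e. roughly 16 packets per second, about a third of the voice packet rate.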

Bit Rate

The figure 6.4 shows the bit rate of voice data and the figure 6.5 shows the bit rate of video
data.

Figure 6.4: Voice Bandwidth Consumption

Figure 6.5: Video Bandwidth Consumption

The traffic load expressed as the number of bytes shows that the video service places a
significantly higher demand on the network bandwidth than voice. The test results also
indicate large IP packet overheads for the voice data. Multiplexing voice and video data in a
single packet could decrease the IP packet overheads.
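
The voice overhead can be estimated roughly from the 20 ms capture interval (the exact figures depend on the voice codec used): each packet carries a 20 byte IPv4 header, an 8 byte UDP header and the one byte PTX header, so at 50 packets per second about 50 × 29 byte ≈ 1450 byte/s ≈ 11.6 kbit/s are spent on headers alone, which is of the same order of magnitude as the voice payload itself at low codec bit rates.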

6.2 Video Quality

Although a number of methods for an objective evaluation of a video sequence exist, the
results obtained with these methods are not sufficient to indicate the quality of compressed
video. The subjective assessment of compressed video is still the only reliable way to evaluate
its quality. We present the quality of the compressed video data delivered by the Push-to-X
client by the example of two “Andreas” video sequences. The two video sequences are in SIF
format, which contains 320x240 pixels per frame. We use the MPEG-4 codec to encode the source
video data. The first sequence is encoded at a bit rate of 120 kbit/s, while the second video
sequence is encoded at 40 kbit/s. An INTER frame is a good choice for observing the effects of
motion-compensated INTER-frame prediction. Both sequences are low-motion sequences containing
a small amount of motion in the head and face area, with the camera moving slightly for the
duration of the video. These two sequences represent a typical video data stream from a
Push-to-X client.

The figure 6.6 shows the screen capture of the original source picture and the figure 6.7
shows the screen capture of the decoded INTER picture, encoded at 120 kbit/s.

The INTER Coded Video Frame shows some degradation in video quality; the unpleasant blocking
artifacts are visible, but not disturbing for the receiver. Encoding the video frame at lower
bit rates results in more blocking artifacts.

The figure 6.8 shows the screen capture of the original source picture and the figure 6.9
shows the screen capture of the decoded INTER picture, encoded at 40 kbit/s.

The INTER Coded Video Frame shows significant degradation in video quality; the disturbing
blocking artifacts are clearly visible to the video receiver.

Figure 6.6: The Source Video Frame



Figure 6.7: INTER Coded Video Frame at 120 kbit/s



Figure 6.8: The Source Video Frame



Figure 6.9: INTER Coded Video Frame at 40 kbit/s



7 Conclusion and Future Work

The Push-to-Video service has been successfully implemented in the Push-to-X system, enhancing
its multimedia functionality. The Push-to-X system with the integrated Push-to-Video service
was tested in multi-type networks. The performed tests show that the system enables seamless
video communication between wireless (UMTS, GPRS and WLAN) and wire-line users over short and
long distance networks (LAN and WAN). The use of the video service in networks providing a
bandwidth of 120 kbit/s delivers very good video quality, also for video sequences containing
fast moving objects, while in networks providing a bandwidth of 40 kbit/s the service was still
acceptable. The tests were performed with web-cams and displays from different vendors. The
Push-to-X client was capable of detecting the video hardware components without any user
intervention. Through the modular software structure and support for a wide range of pixel
formats, the use of different web-cam and display models is transparent to the user.

The Push-to-Video service uses the PTX protocol as its application protocol, with a very small
protocol overhead of one byte for transferring the video packets over the network. However,
the traffic profile of the Push-to-X system is strongly determined by the PTX protocol design,
which separates voice and video data on the PTX protocol message level. This gives rise to a
high disproportion in packet rate and bit rate between voice and video data. Multiplexing
voice and video data in a single packet could further decrease the overall packet overheads,
including the IP and UDP packet overheads. In addition, the PTX protocol provides
unidirectional flow control from sender to receiver. Providing bidirectional flow control,
which allows the media stream receiver to send feedback information to the sender, could
increase the protocol's robustness against jitter and packet losses.

The Push-to-X system implements a group communication model by multiplexing the Push-to-X
messages on the sender side. This limits the service to users with sufficient bandwidth
available to accommodate multiple media streams for all group members. In order to provide
support for users in narrow bandwidth networks, the Push-to-X system introduces a PTX Proxy
Server which is deployed in a wide bandwidth network. The PTX Proxy Server enables shifting of
the traffic load from a narrow bandwidth network to a network providing sufficient bandwidth
to accommodate multiple media streams, by performing the multiplexing and forwarding of
Push-to-X messages on behalf of the Push-to-X clients.

