Fragment mp4 Box

4 Object-structured File Organization
4.2 Object Structure

所有 Box 都包含 size、type 这两个基本字段。
syntax
aligned(8) class Box (unsigned int(32) boxtype,

optional unsigned int(8)[16] extended_type) {
unsigned int(32) size;
unsigned int(32) type = boxtype;
if (size==1) {
unsigned int(64) largesize;
} else if (size==0) {
// box extends to end of file
}
if (boxtype==‘uuid’) {
unsigned int(8)[16] usertype = extended_type;
}
}
Semantics
size
整数，指定这个 box 所占空间的字节数大小，包含自身 size、type 等字段所需空间加上内嵌

的其它 box 所占空间；
如果 size 为 1，那么实际大小放在 largesize 字段；
如果 size 为 0，那么表示这个 box 是文件的最后一个 box，它的内容直至文件尾（一般只在

Media Data Box 才会这样）。
type
指定 box 的类型；
标准 box 的类型使用 4 个可打印字符来表示，便于识别；
当类型为 uuid 时，用户可进行类型拓展。
很多 Box 都包含 version、flags 字段。
syntax
aligned(8) class FullBox(unsigned int(32) boxtype, unsigned int(8) v, bit(24) f)
extends Box(boxtype) {
unsigned int(8) version = v;
bit(24) flags = f;
}
Semantics
version
box 版本号；
flags
flag 列表；
4.3 File Type Box

指示该 MP4 文件符合的规格，必须在第一个，而且只有一个，并且只能被包含在文件层，不能被
其他 box 包含。
syntax
aligned(8) class FileTypeBox

extends Box(‘ftyp’) {
unsigned int(32) major_brand;
unsigned int(32) minor_version;
unsigned int(32) compatible_brands[]; // to end of the box
}
Semantics
major_brand
brand 主版本号；
用 4 个可打印字符表示；
minor_version
brand 次版本号；
compatible_brands
兼容的 brand 版本列表；

8 Box Structures
8.1 File Structure and general boxes
8.1.1 Media Data Box
存放最终要解码的音视频等媒资 raw 数据
Syntax
aligned(8) class MediaDataBox extends Box(‘mdat’) {
bit(8) data[];
}
Semantics
data
媒资 raw 数据；
8.2 Movie Structure
8.2.1 Movie Box
该 box 包含了文件媒体的 metadata 信息，moov 是一个 container box，具体内容信息由子 box

诠释。
Syntax
aligned(8) class MovieBox extends Box(‘moov’){

}
moov 在 mp4 文件中有且只有一个，moov 中会包含 1 个 mvhd 和若干个 trak。
8.2.2 Movie Header Box
mvhd 定义了整个 movie 的特性，例如播放时长、播放速度、声量大小等
Syntax
aligned(8) class MovieHeaderBox extends FullBox(‘mvhd’, version, 0) {
if (version==1) {
unsigned int(64) creation_time;
unsigned int(64) modification_time;
unsigned int(32) timescale;
unsigned int(64) duration;
} else { // version==0
}
template int(32) rate = 0x00010000; // typically 1.0
template int(16) volume = 0x0100; // typically, full volume
const bit(16) reserved = 0;
const unsigned int(32)[2] reserved = 0;
template int(32)[9] matrix =
{ 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 };
// Unity matrix
bit(32)[6] pre_defined = 0;
unsigned int(32) next_track_ID;
}
Semantics
version
box 版本号（0 或者 1）
creation_time
创建 presentation 的 UTC 时间。从 1904 年开始算起, 用秒来表示；
modification_time
最近修改 presentation 的 UTC 时间。从 1904 年开始算起, 用秒来表示时间；
timescale
时间刻度；
将 1s 切割后的时间单位，例如 timescale 为 1000，则一个时间单位时长是 1ms；
duration
presentation 时长，值为 tracks 中时长最长的那个 track 的时长；
rate
播放速度；
1.0 (0x00010000) 表示正常播放速度；
volume
音量大小；
1.0 (0x0100) 为最大音量；
matrix
视频转换矩阵；
(u,v,w) 值是 (0,0,1)，对应 16 进制 (0,0,0x40000000).
next_track_ID
presentation 的下一个 track id，应该比当前的 track id 大；
8.3 Track Structure
8.3.1 Track Box
track 表示一些 sample 的集合，对于媒体数据来说，track 表示一个视频或音频序列。
Syntax
aligned(8) class TrackBox extends Box(‘trak’) {

}
一个 mp4 文件可以包含一个或多个 tracks，它们之间相互独立，各自有各自的时间和空间信息。

每个 track box 都有与之关联的 media box。trak box 要求必须有一个 trak header box(tkhd) 和一
个 media box(mdia)。
8.3.2 Track Header Box
这里面有一个字段，track_ID，用于唯一的表示当前track。另一个字段，duration，用于记录当前
track 的播放长度。
Syntax
aligned(8) class TrackHeaderBox
extends FullBox(‘tkhd’, version, flags){
if (version==1) {
unsigned int(32) track_ID;
const unsigned int(32) reserved = 0;
}
template int(16) layer = 0;
template int(16) alternate_group = 0;
template int(16) volume = {if track_is_audio 0x0100 else 0};
template int(32)[9] matrix=
{ 0x00010000,0,0,0,0x00010000,0,0,0,0x40000000 };
// unity matrix
unsigned int(32) width;
unsigned int(32) height;
}
Semantics
version
version is an integer that specifies the version of this box (0 or 1 in this specification)
box 版本号（0 或 1）；
flags
flags is a 24‐bit integer with flags; the following values are defined:
flags，24 bits；
Track_enabled
Indicates that the track is enabled. Flag value is 0x000001. A disabled track (the low bit
is zero) is treated as if it were not present.
是否启用这个 track。最低位为 1 时表示启用。

Track_in_movie
Indicates that the track is used in the presentation. Flag value is 0x000002.
presentation 中是否使用到这个 track；第二位为 1 时表示是。
Track_in_preview
Indicates that the track is used when previewing the presentation. Flag value is
0x000004.
在预览 presentation 时，是否用到此 track；第三位为 1 时表示是。
Track_size_is_aspect_ratio
Indicates that the width and height fields are not expressed in pixel units. The values
have the same units but these units are not specified. The values are only an indication
of the desired aspect ratio. If the aspect ratios of this track and other related tracks are
not identical, then the respective positioning of the tracks is undefined, possibly
defined by external contexts. Flag value is 0x000008.
表示宽度和高度字段没有用像素单位表示。这些值具有相同的单位，但这些单位没有被
指定。这些数值只是对所需长宽比的指示。如果此轨道和其他相关轨道的长宽比不一
致，那么轨道的各自定位就没有定义，可能是由外部环境定义的。标志值为0x000008。
creation_time
creation_time is an integer that declares the creation time of this track (in seconds since
midnight, Jan. 1, 1904, in UTC time).
创建 track 的 UTC 时间。从 1904 年开始算起, 用秒来表示。
modification_time
modification_time is an integer that declares the most recent time the track was modified
(in seconds since midnight, Jan. 1, 1904, in UTC time).
最近修改 track 的 UTC 时间。从 1904 年开始算起, 用秒来表示。
track_ID
track_ID is an integer that uniquely identifies this track over the entire life‐time of this
presentation. Track IDs are never re‐used and cannot be zero.
用于唯一标识此 track，作用域是 presentation 的整个生命周期；不可以复用，并且不能为

0；
duration
duration is an integer that indicates the duration of this track (in the timescale indicated in
the Movie Header Box). The value of this field is equal to the sum of the durations of all of
the track’s edits. If there is no edit list, then the duration is the sum of the sample durations,
converted into the timescale in the Movie Header Box. If the duration of this track cannot
be determined then duration is set to all 1s.
duration（持续时间）是一个整数，表示该轨道的持续时间（在 "电影标题框 "中显示的时间

刻度）。该字段的值等于该轨道所有编辑的持续时间之和。如果没有编辑列表，那么持续时
间就是样本持续时间的总和，并转换为 "电影标题框 "中的时间刻度。如果无法确定该轨道的
持续时间，那么持续时间将被设置为所有1秒。
layer
layer specifies the front‐to‐back ordering of video tracks; tracks with lower numbers are
closer to the viewer. 0 is the normal value, and ‐1 would be in front of track 0, and so on.
layer 指定了视频轨道的前后顺序；数字越小的轨道越靠近观众。0 是正常值，而 -1 会在轨

道 0 的前面，以此类推。
alternate_group
alternate_group is an integer that specifies a group or collection of tracks. If this field is 0,

there is no information on possible relations to other tracks. If this field is not 0, it should be
the same for tracks that contain alternate data for one another and different for tracks
belonging to different such groups. Only one track within an alternate group should be
played or streamed at any one time, and must be distinguishable from other tracks in the
group via attributes such as bitrate, codec, language, packet size etc. A group may have
only one member.
alternate_group 是一个整数，用于指定轨道的组或集合。如果该字段为 0，则没有关于与其

他轨道的可能关系的信息。如果这个字段不是 0，那么对于包含彼此交替数据的轨道来说，
它应该是相同的，而对于属于不同此类组的轨道来说，它应该是不同的。在一个交替组中，
任何时候都只能有一条轨道被播放或流传，并且必须能够通过比特率、编解码器、语言、数
据包大小等属性与该组中的其他轨道区分开来。一个组可以只有一个成员。
volume
volume is a fixed 8.8 value specifying the track's relative audio volume. Full volume is 1.0
(0x0100) and is the normal value. Its value is irrelevant for a purely visual track. Tracks may
be composed by combining them according to their volume, and then using the overall
Movie Header Box volume setting; or more complex audio composition (e.g. MPEG‐4 BIFS)
may be used.
volume 是一个固定的 8.8 值，指定轨道的相对音频音量。全音量是1.0（0x0100），是正常

值。它的值对于纯视觉轨道来说是不相关的。音轨可以根据音量组合，然后使用整个电影标
题框的音量设置；也可以使用更复杂的音频组合（如 MPEG-4 BIFS）来组成。
matrix
matrix provides a transformation matrix for the video; (u,v,w) are restricted here to (0,0,1),
hex (0,0,0x40000000).
矩阵提供了一个视频的变换矩阵；（u,v,w）在这里被限制为（0,0,1），十六进制
（0,0,0x40000000）。
width and height
width and height fixed‐point 16.16 values are track‐dependent as follows:
For text and subtitle tracks, they may, depending on the coding format, describe the
suggested size of the rendering area. For such tracks, the value 0x0 may also be used to
indicate that the data may be rendered at any size, that no preferred size has been
indicated and that the actual size may be determined by the external context or by reusing
the width and height of another track. For those tracks, the flag track_size_is_aspect_ratio
may also be used. 对于文本和字幕轨道，它们可以根据编码格式，描述渲染区域的建议尺
寸。对于这类轨道，值 0x0 也可以用来表示数据可以以任何尺寸呈现，没有指明首选尺寸，
实际尺寸可以由外部环境决定，或者通过重新使用另一个轨道的宽度和高度。对于这些轨
道，也可以使用标志 track_size_is_aspect_ratio。
For non‐visual tracks (e.g. audio), they should be set to zero. 对于非视觉轨道（例如音
频），它们应该被设置为零。
For all other tracks, they specify the track's visual presentation size. These need not be the
same as the pixel dimensions of the images, which is documented in the sample
description(s); all images in the sequence are scaled to this size, before any overall
transformation of the track represented by the matrix. The pixel dimensions of the images
are the default values.
对于所有其他轨道来说，它们指定了轨道的视觉呈现尺寸。这些尺寸不需要与与图像的像素
尺寸不同，这一点在样本描述中已有记载；序列中的所有图像都被缩放为序列中的所有图像
都被缩放为这一尺寸，然后再对矩阵所代表的轨道进行任何整体转换。矩阵所代表的轨道进
行任何整体转换之前，序列中的所有图像都被缩放为这一尺寸。图像的像素尺寸是默认值。
8.4 Track Media Structure
8.4.1 Media Box
Syntax
aligned(8) class MediaBox extends Box(‘mdia’) {

}
8.4.2 Media Header Box
Syntax
aligned(8) class MediaHeaderBox extends FullBox(‘mdhd’, version, 0) {

if (version==1) {
}
bit(1)
pad = 0;
unsigned int(5)[3] language; // ISO-639-2/T language code
unsigned int(16) pre_defined = 0;
}
Semantics
version
version is an integer that specifies the version of this box (0 or 1)
版本是一个整数，指定这个 box 的版本（0或1）。
creation_time
creation_time is an integer that declares the creation time of the media in this track (in
seconds since midnight, Jan. 1, 1904, in UTC time).
创建 meida 的 UTC 时间。从 1904 年开始算起, 用秒来表示；

modification_time
modification_time is an integer that declares the most recent time the media in this track
was modified (in seconds since midnight, Jan. 1, 1904, in UTC time).
最近修改 meida 的 UTC 时间。从 1904 年开始算起, 用秒来表示；
timescale
timescale is an integer that specifies the time‐scale for this media; this is the number of
time units that pass in one second. For example, a time coordinate system that measures
time in sixtieths of a second has a time scale of 60.
timescale 是一个整数，用于指定该媒体的时间尺度；这是一秒钟内经过的时间单位的数量。
例如，一个以 60 秒为单位测量时间的时间坐标系，其时间尺度为 60。
duration
duration is an integer that declares the duration of this media (in the scale of the timescale).
If the duration cannot be determined then duration is set to all 1s.
duration 是一个整数，用于声明该媒体的持续时间（在时间尺度的范围内）。如果不能确定
持续时间，那么持续时间将被设置为所有 1 秒。
language
language declares the language code for this media. See ISO 639‐2/T for the set of three
character codes. Each character is packed as the difference between its ASCII value and
0x60.
language 声明了该媒体的语言代码。关于三个字符代码的集合，请参见 ISO 639-2/T。每个

字符都被包装成其 ASCII 值与 0x60 之间的差值。
Since the code is confined to being three lower‐case letters, these values are strictly
positive.
由于代码被限制为三个小写字母，这些数值严格来说是正数。
8.4.3 Handler Reference Box
Syntax
aligned(8) class HandlerBox extends FullBox(‘hdlr’, version = 0, 0) {
unsigned int(32) pre_defined = 0;
unsigned int(32) handler_type;
string name;
}
Semantics
version
version is an integer that specifies the version of this box
版本是一个整数，指定这个盒子的版本。
handler_type
when present in a media box, contains a value as defined in clause 12, or a value from a
derived specification, or registration.
当出现在媒体框中时，包含一个在条款 12 中定义的值，或一个来自衍生规范的值，或注
册。
when present in a meta box, contains an appropriate value to indicate the format of the
meta box contents. The value ‘null’ can be used in the primary meta box to indicate that it is
merely being used to hold resources.
当出现在一个元框中时，包含一个适当的值来表示元框内容的格式。值 "null "可以在主元框

中使用，以表明它只是被用来容纳资源。
name
name is a null‐terminated string in UTF‐8 characters which gives a human‐readable name

for the track type (for debugging and inspection purposes).
name 是一个以 UTF-8 字符为空尾的字符串，它为轨道类型提供了一个人类可读的名称（用

于调试和检查）。
8.4.4 Media Information Box
包含用于声明当前 track 信息的所有对象，它定义了 track 媒体类型以及 sample 数据，来描述

sample 信息，一般 mdia 包含一个 mdhd(media header box),一个 hdlr(handler reference box) 和
一个 minf(media information box)。
Syntax
aligned(8) class MediaInformationBox extends Box(‘minf’) {
}
8.4.5 Media Information Header Boxes
8.4.5.2 Null Media Header Box
Syntax
aligned(8) class NullMediaHeaderBox

extends FullBox(’nmhd’, version = 0, flags) {
}
Semantics
version
version is an integer that specifies the version of this box.
版本是一个整数，用于指定这个盒子的版本。
flags
flags is a 24‐bit integer with flags (currently all zero).
flags 是一个带有标志的 24 位整数（目前都是 0）。
8.5 Sample Tables
8.5.1 Sample Table Box
video sample 即为一帧视频或者一组连续视频帧，audio sample 即为一段连续的压缩音频。
The sample table contains all the time and data indexing of the media samples in a track. Using
the tables here, it is possible to locate samples in time, determine their type (e.g. I‐frame or not),
and determine their size, container, and offset into that container.
样本表包含了一个轨道中媒体样本的所有时间和数据索引。使用这里的表格，可以在时间上定位
样本，确定它们的类型（例如，是否为 I-frame），并确定它们的大小、容器以及在该容器中的偏
移。
stsc stsz stco 三个 box 用于保存每帧视频或音频数据在文件中的保存位置。
stts stss ctts 三个 box 用于保存媒体数据和时间戳的对应关系。

Syntax
aligned(8) class SampleTableBox extends Box(‘stbl’) {

}
8.5.2 Sample Description Box
Syntax
aligned(8) abstract class SampleEntry (unsigned int(32) format)

extends Box(format){
unsigned int(16) data_reference_index;
}
class BitRateBox extends Box(‘btrt’){

unsigned int(32) bufferSizeDB;
unsigned int(32) maxBitrate;
unsigned int(32) avgBitrate;
}
aligned(8) class SampleDescriptionBox (unsigned int(32) handler_type)

extends FullBox('stsd', version, 0){
int i ;
unsigned int(32) entry_count;
for (i = 1 ; i <= entry_count ; i++){
SampleEntry(); // an instance of a class derived from SampleEntry
}
}
Semantics
version
version is set to zero unless the box contains an AudioSampleEntryV1, whereupon version
must be 1
如果 box 含有 AudioSampleEntryV1，那么 version 为 1，否则为 0；
entry_count
entry_count is an integer that gives the number of entries in the following table
整数，指定了 sample entry 的数量；
SampleEntry
SampleEntry is the appropriate sample entry.
SampleEntry 是适当的样本条目。
data_reference_index
data_reference_index is an integer that contains the index of the data reference to use to
retrieve data associated with samples that use this sample description. Data references are
stored in Data Reference Boxes. The index ranges from 1 to the number of data references.
data_reference_index 是一个整数，包含数据参考的索引，用于检索与使用此样本描述的样本
相关的数据。数据引用被存储在数据引用框中。索引范围从 1 到数据引用的数量。
bufferSizeDB
bufferSizeDB gives the size of the decoding buffer for the elementary stream in bytes.
bufferSizeDB 给出基本流的解码缓冲区的大小，单位是字节。
maxBitrate
maxBitrate gives the maximum rate in bits/second over any window of one second.
maxBitrate 给出任何一秒窗口的最大速率，单位是比特/秒。
avgBitrate
avgBitrate gives the average rate in bits/second over the entire presentation.
avgBitrate 给出了整个演示过程中的平均速率，单位为比特/秒。
8.6 Track Time Structures
8.6.1 Time to Sample Boxes
8.6.1.2 Decoding Time to Sample Box
Syntax
aligned(8) class TimeToSampleBox
extends FullBox(’stts’, version = 0, 0) {
int i;
for (i=0; i < entry_count; i++) {
unsigned int(32) sample_count;
unsigned int(32) sample_delta;
}
}
Semantics
version
entry_count
entry_count is an integer that gives the number of entries in the following table.
entry_count 是一个整数，它给出了以下表格中的条目数。
sample_count
sample_count is an integer that counts the number of consecutive samples that have the
given duration.
sample_count 是一个整数，用于计算具有给定持续时间的连续样本的数量。
sample_delta
sample_delta is an integer that gives the delta of these samples in the time‐scale of the
media.
sample_delta 是一个整数，它给出了这些样本在媒体的时间尺度中的 delta。
8.6.1.3 Composition Time to Sample Box
Syntax
aligned(8) class CompositionOffsetBox
extends FullBox(‘ctts’, version, 0) {
int i;
if (version==0) {
unsigned int(32) sample_offset;
}
}
else if (version == 1) {
signed int(32) sample_offset;
}
}
}
Semantics
version
entry_count
entry_count is an integer that gives the number of entries in the following table.
entry_count 是一个整数，它给出了以下表格中的条目数。
sample_count
sample_count is an integer that counts the number of consecutive samples that have the
given offset.
sample_count 是一个整数，用于计算具有给定偏移量的连续样本的数量。
sample_offset
sample_offset is an integer that gives the offset between CT and DT, such that CT(n) = DT(n)
+ CTTS(n).
sample_offset 是一个整数，给出 CT 和 DT 之间的偏移量，这样 CT(n)=DT(n)+CTTS(n)。
8.6.2 Sync Sample Box

Syntax
aligned(8) class SyncSampleBox

extends FullBox(‘stss’, version = 0, 0) {
int i;
unsigned int(32) sample_number;
}
}
Semantics
version
entry_count
entry_count is an integer that gives the number of entries in the following table. If
entry_count is zero, there are no sync samples within the stream and the following table is
empty.
entry_count 是一个整数，给出下表的条目数。如果 entry_count 为零，则说明流内没有同步

样本，下表为空。
sample_number
sample_number gives the numbers of the samples that are sync samples in the stream.
sample_number 给出了流中同步样本的编号。
8.7 Track Data Layout Structures
8.7.1 Data Information Box
Syntax
aligned(8) class DataInformationBox extends Box(‘dinf’) {

}
8.7.2 Data Reference Box

Syntax
aligned(8) class DataEntryUrlBox (bit(24) flags)

extends FullBox(‘url ’, version = 0, flags) {
string location;
}
aligned(8) class DataEntryUrnBox (bit(24) flags)

extends FullBox(‘urn ’, version = 0, flags) {
string name;
string location;
}
aligned(8) class DataReferenceBox

extends FullBox(‘dref’, version = 0, 0) {
for (i=1; i <= entry_count; i++) {
DataEntryBox(entry_version, entry_flags) data_entry;
}
}
Semantics
version
entry_count
entry_count is an integer that counts the actual entries
entry_count 是一个整数，用来计算实际条目。
entry_version
entry_version is an integer that specifies the version of the entry format
entry_version 是一个整数，它指定了条目格式的版本。
entry_flags
entry_flags is a 24‐bit integer with flags; one flag is defined (x000001) which means that the
media data is in the same file as the Movie Box containing this data reference.
entry_flags 是一个带有标志的24位整数；定义了一个标志（x000001），这意味着一个标志

被定义为(x000001)，这意味着媒体数据与包含该数据参考的电影盒在同一个文件中。
data_entry
data_entry is a URL or URN entry. Name is a URN, and is required in a URN entry. Location
is a URL, and is required in a URL entry and optional in a URN entry, where it gives a
location to find the resource with the given name. Each is a null‐terminated string using UTF
‐8 characters. If the self‐contained flag is set, the URL form is used and no string is present;
the box terminates with the entry‐flags field. The URL type should be of a service that
delivers a file (e.g. URLs of type file, http, ftp etc.), and which services ideally also permit
random access. Relative URLs are permissible and are relative to the file containing the
Movie Box that contains this data reference.
data_entry 是一个 URL 或 URN 条目。名称是一个 URN，在 URN 条目中是必需的。位置是

一个 URL，在 URL 条目中是必需的，在 URN 条目中是可选的，它给出了一个位置来寻找具
有给定名称的资源。每个都是一个使用 UTF-8 字符的空尾字符串。如果设置了自含标志，则
使用 URL 形式，不存在字符串；该框以 entry-flags 字段结束。URL 类型应该是传递文件的
服务（例如，文件、http、ftp 等类型的 URL），这些服务最好也允许随机访问。相对的 URL
是允许的，是相对于包含该数据参考的电影盒的文件而言的。
8.7.3 Sample Size Boxes
8.7.3.2 Sample Size Box
Syntax
aligned(8) class SampleSizeBox extends FullBox(‘stsz’, version = 0, 0) {

unsigned int(32) sample_size;
if (sample_size==0) {
for (i=1; i <= sample_count; i++) {
unsigned int(32) entry_size;
}
}
}
Semantics
version
sample_size
sample_size is integer specifying the default sample size. If all the samples are the same
size, this field contains that size value. If this field is set to 0, then the samples have different
sizes, and those sizes are stored in the sample size table. If this field is not 0, it specifies the
constant sample size, and no array follows.
sample_size 是一个整数，指定默认的样本大小。如果所有的样本都是相同的尺寸，这个字段
就包含这个尺寸值。如果这个字段被设置为 0，那么样本有不同的尺寸，这些尺寸被存储在
样本尺寸表中。如果这个字段不是 0，那么它指定的是恒定的样本大小，后面没有数组。
sample_count
sample_count is an integer that gives the number of samples in the track; if sample‐size is 0,
then it is also the number of entries in the following table.
sample_count 是一个整数，它给出了音轨中的样本数；如果样本大小为 0，那么它也是以下

表格中的条目数。
entry_size
entry_size is an integer specifying the size of a sample, indexed by its number.
entry_size 是一个整数，指定了一个样本的大小，以其编号为索引。
8.7.3.3 Compact Sample Size Box
Syntax
aligned(8) class CompactSampleSizeBox extends FullBox(‘stz2’, version = 0, 0) {

unsigned int(24) reserved = 0;
unsigned int(8) field_size;
for (i=1; i <= sample_count; i++) {
unsigned int(field_size) entry_size;
}
}
Semantics
version
field_size
field_size is an integer specifying the size in bits of the entries in the following table; it shall
take the value 4, 8 or 16. If the value 4 is used, then each byte contains two values: entry[i]
<<4 + entry[i+1]; if the sizes do not fill an integral number of bytes, the last byte is padded
with zeros.
field_size是一个整数，指定下表中条目的比特大小；它的值是4、8或16。如果使用4的值，
那么每个字节包含两个值：
entry[i]<<4 + entry[i+1]；如果大小没有填满一个整数的字节，最后一个字节就会用零填
充。
sample_count
sample_count is an integer that gives the number of entries in the following table
sample_count 是一个整数，它给出了以下表格中的条目数
entry_size
entry_size is an integer specifying the size of a sample, indexed by its number.
entry_size 是一个整数，指定了一个样本的大小，以其编号为索引。
8.7.4 Sample To Chunk Box
一个 track 的几个 sample 组成的单元。
Syntax
aligned(8) class SampleToChunkBox

extends FullBox(‘stsc’, version = 0, 0) {
unsigned int(32) first_chunk;
unsigned int(32) samples_per_chunk;
unsigned int(32) sample_description_index;
}
}
Semantics
version
entry_count
entry_count是一个整数，给出以下表格中的条目数
first_chunk
first_chunk is an integer that gives the index of the first chunk in this run of chunks that
share the same samples‐per‐chunk and sample‐description‐index; the index of the first
chunk in a track has the value 1 (the first_chunk field in the first record of this box has the
value 1, identifying that the first sample maps to the first chunk).
first_chunk 是一个整数，它给出了这个运行中的第一个块的索引，这些块共享相同的
samples-per-chunk 和 samples-description-index；一个轨道中的第一个块的索引值为 1（这
个盒子的第一个记录中的 first_chunk 字段的值为 1，表明第一个样本映射到第一个块）。
samples_per_chunk
samples_per_chunk is an integer that gives the number of samples in each of these chunks
samples_per_chunk 是一个整数，给出了这些块中每个样本的数量。
sample_description_index
sample_description_index is an integer that gives the index of the sample entry that
describes the samples in this chunk. The index ranges from 1 to the number of sample
entries in the Sample Description Box
sample_description_index 是一个整数，它给出了描述该块中样本的样本条目的索引。该索引
的范围是 1 到样本描述框中的样本条目的数量。
8.7.5 Chunk Offset Box
Syntax
aligned(8) class ChunkOffsetBox
extends FullBox(‘stco’, version = 0, 0) {
unsigned int(32) chunk_offset;
}
}
aligned(8) class ChunkLargeOffsetBox

extends FullBox(‘co64’, version = 0, 0) {
unsigned int(64) chunk_offset;
}
}
Semantics
version
entry_count
entry_count 是一个整数，给出以下表格中的条目数
chunk_offset
chunk_offset is a 32 or 64 bit integer that gives the offset of the start of a chunk into its
containing media file.
chunk_offset 是一个 32 或 64 位的整数，它给出了一个块在其包含的媒体文件中开始的偏

移。
8.8 Movie Fragments

Fragmented MP4 以 Fragment 的方式存储视频信息。每个 Fragment 由一个 moof box 和一个
mdat box 组成:
mdat（media data box）
和普通 MP4 文件的 mdat 一样，用于存放媒体数据，不同的是普通 MP4 文件只有一个 mdat

box，而 Fragmented MP4 文件中，每个 fragment 都会有一个 mdat 类型 box。
moof（movie fragment box）
该类型的 box 存放的是 fragment-level 的 metadata 信息，用于描述所在 fragment。该类型

的 box 在普通的 MP4 文件中是不存在的，而在 Fragmented MP4 文件中，每个 fragment 都
会有一个 moof 类型的 box。moof 和 moov 非常像，它包含了当前片段中 mp4 的相关元信
息。
一个 moof 和一个 mdat 组成 Fragmented MP4 文件的一个 fragment，这个 fragment 包含一个

video track 或 audio track，并且包含足够的metadata以保证这部分数据可以单独解码。
Fragmented MP4 中的 moov box 只存储文件级别的媒体信息，因此 moov box 的体积比传统的
MP4 中的 moov box 体积要小很多。
8.8.1 Movie Extends Box
Syntax
aligned(8) class MovieExtendsBox extends Box(‘mvex’){

}
8.8.2 Movie Extends Header Box
Syntax
aligned(8) class MovieExtendsHeaderBox extends FullBox(‘mehd’, version, 0) {

if (version==1) {
unsigned int(64) fragment_duration;
unsigned int(32) fragment_duration;
}
}
Semantics
fragment_duration
fragment_duration is an integer that declares length of the presentation of the whole movie
including fragments (in the timescale indicated in the Movie Header Box). The value of this
field corresponds to the duration of the longest track, including movie fragments. If an MP4
file is created in real‐time, such as used in live streaming, it is not likely that the
fragment_duration is known in advance and this box may be omitted.
fragment_duration 是一个整数，宣布了整个电影的演示长度，包括片段（在电影标题框中显
示的时间尺度）。这个字段的值对应于最长的轨道的持续时间，包括电影片段。如果一个
MP4 文件是实时创建的，如用于实时流媒体，它不可能事先知道片段的时间，这个框可以省
略。
8.8.3 Track Extends Box
Syntax
aligned(8) class TrackExtendsBox extends FullBox(‘trex’, 0, 0){

unsigned int(32) default_sample_description_index;
unsigned int(32) default_sample_duration;
unsigned int(32) default_sample_size;
unsigned int(32) default_sample_flags;
}
Semantics
track_id
track_id identifies the track; this shall be the track ID of a track in the Movie Box
track_id 标识轨道；这应该是 "电影盒 "中的一个轨道的 ID。
default_*
default_ these fields set up defaults used in the track fragments.
default_* 这些字段设置了轨道片段中使用的默认值。
8.8.4 Movie Fragment Box
Syntax
aligned(8) class MovieFragmentBox extends Box(‘moof’){

}
8.8.5 Movie Fragment Header Box
Syntax
aligned(8) class MovieFragmentHeaderBox
extends FullBox(‘mfhd’, 0, 0){
unsigned int(32) sequence_number;
}
Semantics
sequence_number
sequence_number a number associated with this fragment
sequence_number 一个与此片段相关的数字
8.8.6 Track Fragment Box
Syntax
aligned(8) class TrackFragmentBox extends Box(‘traf’){

}
8.8.7 Track Fragment Header Box
Syntax
aligned(8) class TrackFragmentHeaderBox

extends FullBox(‘tfhd’, 0, tf_flags){
// all the following are optional fields
unsigned int(64) base_data_offset;
unsigned int(32) sample_description_index;
unsigned int(32) default_sample_duration;
unsigned int(32) default_sample_size;
unsigned int(32) default_sample_flags
}
Semantics
base_data_offset
base_data_offset the base offset to use when calculating data offsets
base_data_offset 在计算数据偏移量时使用的基本偏移量。
8.8.8 Track Fragment Run Box

Syntax
aligned(8) class TrackRunBox

extends FullBox(‘trun’, version, tr_flags) {
// the following are optional fields
signed int(32) data_offset;
unsigned int(32) first_sample_flags;
// all fields in the following array are optional
{
unsigned int(32) sample_duration;
unsigned int(32) sample_size;
unsigned int(32) sample_flags
if (version == 0)
{ unsigned int(32) sample_composition_time_offset; }
else
{ signed int(32) sample_composition_time_offset; }
}[ sample_count ]
}
Semantics
sample_count
sample_count the number of samples being added in this run; also the number of rows in
the following table (the rows can be empty)
sample_count 本次运行中添加的样本数；也是下表的行数（行数可以为空）。
data_offset
data_offset is added to the implicit or explicit data_offset established in the track fragment
header.
data_offset 被添加到轨道片段头中建立的隐式或显式 data_offset 中。
first_sample_flags
first_sample_flags provides a set of flags for the first sample only of this run.
first_sample_flags 为本次运行的第一个样本提供一组标志。
8.8.9 Movie Fragment Random Access Box
Syntax
aligned(8) class MovieFragmentRandomAccessBox
extends Box(‘mfra’)
{
}
8.8.10 Track Fragment Random Access Box
Syntax
aligned(8) class TrackFragmentRandomAccessBox

extends FullBox(‘tfra’, version, 0) {
unsigned int(2) length_size_of_traf_num;
unsigned int(2) length_size_of_trun_num;
unsigned int(2) length_size_of_sample_num;
unsigned int(32) number_of_entry;
for(i=1; i <= number_of_entry; i++){
if(version==1){
unsigned int(64) time;
unsigned int(64) moof_offset;
}else{
unsigned int(32) time;
unsigned int(32) moof_offset;
}
unsigned int((length_size_of_traf_num+1) * 8) traf_number;
unsigned int((length_size_of_trun_num+1) * 8) trun_number;
unsigned int((length_size_of_sample_num+1) * 8) sample_number;
}
}
Semantics
track_ID
track_ID is an integer identifying the track_ID.
track_ID 是一个识别 track_ID 的整数。
length_size_of_traf_num
length_size_of_traf_num indicates the length in byte of the traf_number field minus one.
length_size_of_traf_num 表示 traf_number 字段的字节长度减去 1。
length_size_of_trun_num
length_size_of_trun_num indicates the length in byte of the trun_number field minus one.
length_size_of_trun_num 表示 trun_number 字段的字节长度减去 1。
length_size_of_sample_num
length_size_of_sample_num indicates the length in byte of the sample_number field minus

one.
length_size_of_sample_num 表示样本号字段的长度，以字节为单位，减去 1。
number_of_entry
number_of_entry is an integer that gives the number of the entries for this track. If this value
is zero, it indicates that every sample is a sync sample and no table entry follows.
number_of_entry 是一个整数，给出了该轨道的条目数。如果这个值为零，表示每个样本都
是同步样本，后面没有表项。
time
time is 32 or 64 bits integer that indicates the presentation time of the sync sample in units
defined in the ‘mdhd’ of the associated track.
time 是 32 或 64 位的整数，表示同步样本的呈现时间，单位是相关轨道的 "mdhd "中定义

的。
moof_offset
moof_offset is 32 or 64 bits integer that gives the offset of the ‘moof’ used in this entry.
Offset is the byte‐offset between the beginning of the file and the beginning of the ‘moof’.
moof_offset 是 32 或 64 位的整数，给出此条目中使用的 'moof' 的偏移量。偏移量是文件的

开头和 'moof' 的开头之间的字节偏移。
traf_number
traf_number indicates the ‘traf’ number that contains the sync sample. The number ranges
from 1 (the first ‘traf’ is numbered 1) in each ‘moof’.
traf_number 表示包含同步采样的 'traf' 编号。这个数字的范围是每个 'moof' 中的 1（第一个

'traf' 的编号是 1）。
trun_number
trun_number indicates the ‘trun’ number that contains the sync sample. The number ranges
from 1 in each ‘traf’.
trun_number 表示包含同步采样的 'trun' 号码。这个数字的范围是每个 "traf " 中的 1。
sample_number
sample_number indicates the sample number of the sync sample. The number ranges from
1 in each ‘trun’.
sample_number 表示同步样本的样本号。这个数字在每个 "traf "中从 1 开始。
8.8.11 Movie Fragment Random Access Offset Box
Syntax
aligned(8) class MovieFragmentRandomAccessOffsetBox

extends FullBox(‘mfro’, version, 0) {
unsigned int(32) size;
}
Semantics
size
size is an integer gives the number of bytes of the enclosing ‘mfra’ box. This field is placed
at the last of the enclosing box to assist readers scanning from the end of the file in finding
the ‘mfra’ box.
size 是一个整数，给出包围 'mfra' 框的字节数。这个字段被放置在包围盒的最后，以帮助读

者从文件的末尾开始扫描，找到 'mfra' 盒。
8.8.12 Track fragment decode time
Syntax
aligned(8) class TrackFragmentBaseMediaDecodeTimeBox

extends FullBox(‘tfdt’, version, 0) {
if (version==1) {
unsigned int(64) baseMediaDecodeTime;
unsigned int(32) baseMediaDecodeTime;
}
}
Semantics
version
version is an integer that specifies the version of this box (0 or 1 in this specification).
版本是一个整数，指定这个盒子的版本（在本规范中为 0 或 1）。
baseMediaDecodeTime
baseMediaDecodeTime is an integer equal to the sum of the decode durations of all earlier
samples in the media, expressed in the media's timescale. It does not include the samples
added in the enclosing track fragment.
baseMediaDecodeTime 是一个整数，等于媒体中所有早期样本的解码时间之和，以媒体的
时间尺度表示。它不包括包围轨道片段中添加的样本。
12 Media-specific definitions
12.1 Video media
12.1.1 Video media header
Syntax
aligned(8) class VideoMediaHeaderBox

extends FullBox(‘vmhd’, version = 0, 1) {
template unsigned int(16) graphicsmode = 0; // copy, see below
template unsigned int(16)[3] opcolor = {0, 0, 0};
}
Semantics
version
graphicsmode
graphicsmode specifies a composition mode for this video track, from the following
enumerated set, which may be extended by derived specifications: copy = 0 copy over the
existing image
graphicsmode 指定了这个视频轨道的组成模式，来自于以下的枚举集，可以通过派生规格来
扩展： copy = 0 在现有图像上复制。
opcolor
opcolor is a set of 3 colour values (red, green, blue) available for use by graphics modes
opcolor 是一组3个颜色值（红、绿、蓝），可供图形模式使用。
12.2 Audio media
12.2.2 Sound media header
Syntax
aligned(8) class SoundMediaHeaderBox

extends FullBox(‘smhd’, version = 0, 0) {
template int(16) balance = 0;
}
Semantics
version
balance
balance is a fixed‐point 8.8 number that places mono audio tracks in a stereo space; 0 is
centre (the normal value); full left is ‐1.0 and full right is 1.0.
balance 是一个8.8 的定点数字，用于将单声道音轨置于立体声空间中；0 是中心（正常

值）；全左是 -1.0，全右是 1.0。
12.4 Hint media
12.4.2 Hint media header
12.4.2.1 Hint Media Header Box
Syntax
aligned(8) class HintMediaHeaderBox
extends FullBox(‘hmhd’, version = 0, 0) {
unsigned int(16) maxPDUsize;
unsigned int(16) avgPDUsize;
unsigned int(32) maxbitrate;
unsigned int(32) avgbitrate;
unsigned int(32) reserved = 0;
}
Semantics
version
maxPDUsize
maxPDUsize gives the size in bytes of the largest PDU in this (hint) stream
maxPDUsize 给出了这个（提示）流中最大的 PDU 的字节大小。
avgPDUsize
avgPDUsize gives the average size of a PDU over the entire presentation
avgPDUsize 给出了整个演示过程中 PDU 的平均大小。
maxbitrate
maxbitrate gives the maximum rate in bits/second over any window of one second
maxbitrate 给出了任何一秒窗口的最大速率（比特/秒）。
avgbitrate
avgbitrate gives the average rate in bits/second over the entire presentation
avgbitrate 给出了整个演示的平均速率（比特/秒）。
Other
MP4 seek 过程
这里不考虑 edit list box (edts)

给定时间，找到对应的文件偏移量。
1. 知道 seek 时间，可在 stts 中找到 decoding 时间最大且小于 seek 时间的 sample（time-to-

sample box (stts) 记录了每个 sample 的 decoding 时间）；
2. 找到时间对应的 sample 后，由于该 sample 不一定是关键帧，可在 stss 找到编号最大且小于

该 sample 的 sync sample（sync sample box(stss) 记录了所有的 sync sample 的编号）
3. 找到 sync sample 后，可在 stsc 找到该 sample 属于哪个 chunck（sample‐to‐chunk box

(stsc) 记录了 sample 怎么被归到不同的 chunk）
4. 拿到 chunk 后，可在 stco 中找到该 chuck 对应的文件偏移（chunk offsets box(stco) 记录了

每个 chunk 的文件偏移）
5. 知道 chunk 的文件偏移，再根据 stsc 和 stsz 找到 sync sample 的文件偏移（sample size box

(stsz) 记录了所有 sample 的大小）

Fragment mp4 Box

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Fragment mp4 Box

Uploaded by

Copyright:

Available Formats

4 Object-structured File Organization

4.2 Object Structure

aligned(8) class Box (unsigned int(32) boxtype,

整数，指定这个 box 所占空间的字节数大小，包含自身 size、type 等字段所需空间加上内嵌

如果 size 为 1，那么实际大小放在 largesize 字段；

如果 size 为 0，那么表示这个 box 是文件的最后一个 box，它的内容直至文件尾（一般只在

标准 box 的类型使用 4 个可打印字符来表示，便于识别；

当类型为 uuid 时，用户可进行类型拓展。

很多 Box 都包含 version、flags 字段。

4.3 File Type Box

aligned(8) class FileTypeBox

兼容的 brand 版本列表；

8.1 File Structure and general boxes

8.1.1 Media Data Box

8.2 Movie Structure

8.2.1 Movie Box

该 box 包含了文件媒体的 metadata 信息，moov 是一个 container box，具体内容信息由子 box

aligned(8) class MovieBox extends Box(‘moov’){

moov 在 mp4 文件中有且只有一个，moov 中会包含 1 个 mvhd 和若干个 trak。

8.2.2 Movie Header Box

mvhd 定义了整个 movie 的特性，例如播放时长、播放速度、声量大小等

创建 presentation 的 UTC 时间。从 1904 年开始算起, 用秒来表示；

最近修改 presentation 的 UTC 时间。从 1904 年开始算起, 用秒来表示时间；

将 1s 切割后的时间单位，例如 timescale 为 1000，则一个时间单位时长是 1ms；

presentation 时长，值为 tracks 中时长最长的那个 track 的时长；

1.0 (0x00010000) 表示正常播放速度；

1.0 (0x0100) 为最大音量；

(u,v,w) 值是 (0,0,1)，对应 16 进制 (0,0,0x40000000).

presentation 的下一个 track id，应该比当前的 track id 大；

8.3 Track Structure

8.3.1 Track Box

track 表示一些 sample 的集合，对于媒体数据来说，track 表示一个视频或音频序列。

aligned(8) class TrackBox extends Box(‘trak’) {

一个 mp4 文件可以包含一个或多个 tracks，它们之间相互独立，各自有各自的时间和空间信息。

8.3.2 Track Header Box

box 版本号（0 或 1）；

是否启用这个 track。最低位为 1 时表示启用。

presentation 中是否使用到这个 track；第二位为 1 时表示是。

在预览 presentation 时，是否用到此 track；第三位为 1 时表示是。

创建 track 的 UTC 时间。从 1904 年开始算起, 用秒来表示。

最近修改 track 的 UTC 时间。从 1904 年开始算起, 用秒来表示。

用于唯一标识此 track，作用域是 presentation 的整个生命周期；不可以复用，并且不能为

duration（持续时间）是一个整数，表示该轨道的持续时间（在 "电影标题框 "中显示的时间

layer 指定了视频轨道的前后顺序；数字越小的轨道越靠近观众。0 是正常值，而 -1 会在轨

alternate_group is an integer that specifies a group or collection of tracks. If this field is 0,

alternate_group 是一个整数，用于指定轨道的组或集合。如果该字段为 0，则没有关于与其

volume 是一个固定的 8.8 值，指定轨道的相对音频音量。全音量是1.0（0x0100），是正常

width and height

width and height fixed‐point 16.16 values are track‐dependent as follows:

8.4.1 Media Box

aligned(8) class MediaBox extends Box(‘mdia’) {

8.4.2 Media Header Box

aligned(8) class MediaHeaderBox extends FullBox(‘mdhd’, version, 0) {

version is an integer that specifies the version of this box (0 or 1)

版本是一个整数，指定这个 box 的版本（0或1）。

创建 meida 的 UTC 时间。从 1904 年开始算起, 用秒来表示；

最近修改 meida 的 UTC 时间。从 1904 年开始算起, 用秒来表示；

language 声明了该媒体的语言代码。关于三个字符代码的集合，请参见 ISO 639-2/T。每个

8.4.3 Handler Reference Box

version is an integer that specifies the version of this box

当出现在一个元框中时，包含一个适当的值来表示元框内容的格式。值 "null "可以在主元框

name is a null‐terminated string in UTF‐8 characters which gives a human‐readable name

name 是一个以 UTF-8 字符为空尾的字符串，它为轨道类型提供了一个人类可读的名称（用

entry_flags 是一个带有标志的24位整数；定义了一个标志（x000001），这意味着一个标志