GStreamer and in-band metadata
From RidgeRun Developer Connection
As video data moves through a GStreamer pipeline, it can be convenient to attach information related to a specific frame of video, such as the GPS location, so that receivers that understand how to extract metadata can access that information while keeping it associated with the correct video data. Conversely, if a receiver doesn't understand in-band metadata, the inclusion of such data will not affect the receiver.
MISP Motion Imagery Standards Profile
The key statement in the specification is: "Within the media container, all metadata must be in SMPTE KLV (Key-Length-Value) format." In MISP, the metadata is tagged with a timestamp so that it can be associated with the right video frame. This is important because GStreamer handles the association slightly differently. To understand the difference, you need to see how MISP combines video frames and metadata so you can compare it to GStreamer. The following diagram is a modified version from the MISBTRM0909 MISP spec.
KLV Key Length Value Metadata
For this discussion, we care about timestamping and transporting KLV data, not what it means. Stated another way, KLV data is any binary data (plus a length indication) that we need to move from one end to the other while keeping the data associated with the correct video frame. It is up to the user of the video encoding stream and the user of the video decoding stream to understand the meaning and encoding of the KLV data.
To give a concrete KLV encoding example, here is a terse description of the SMPTE 336M-2007 Data Encoding Protocol Using Key-Length-Value, which is used by the MISB standard:
Key: Fixed length (1, 2, 4, or 16 bytes), with the size known to both sender and receiver, encoding the key. There are very specific rules on how keys are encoded and how both the sender and receiver know the meaning of the encoded key.
Length: Fixed or variable length (1, 2, 4, or BER) indication of the number of bytes of data used to encode the value.
Value: Variable-length value whose meaning is agreed to by both the sender and the receiver.
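The key, length, and value fields described above can be sketched in a few lines of Python. This is a minimal illustration, not a MISB-compliant encoder: it assumes the BER length option (short form for lengths below 128, long form otherwise) and treats the key as opaque bytes supplied by the caller. The function names are hypothetical.

```python
def encode_ber_length(n: int) -> bytes:
    """Encode a length in BER: short form below 128, long form otherwise."""
    if n < 0:
        raise ValueError("length must be non-negative")
    if n < 0x80:
        # Short form: a single byte holds the length itself.
        return bytes([n])
    # Long form: first byte is 0x80 | number of length bytes that follow.
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body


def encode_klv(key: bytes, value: bytes) -> bytes:
    """Concatenate key, BER-encoded length, and value into one KLV triplet."""
    return key + encode_ber_length(len(value)) + value


# A 1-byte key (0x2A) carrying a 2-byte value.
packet = encode_klv(bytes([0x2A]), bytes([0x00, 0x03]))
print(packet.hex(" "))  # 2a 02 00 03
```

A value of 200 bytes would get the two-byte length 0x81 0xC8, since 200 no longer fits in the short form.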
As an example (from the Wikipedia KLV entry), a key of 42 with a length of 2 and a value of 0x00 0x03 could be passed in as a 4-byte binary blob of 0x2A 0x02 0x00 0x03. The transport of the KLV doesn't need to know the actual encoding, only that the blob is 4 bytes long and what those bytes are.
As another example (not MISB compliant), you could have the length be 8 and the data be 0x46 0x4F 0x4F 0x3D 0x42 0x41 0x52 0x00, which works out to be the NUL-terminated ASCII string FOO=BAR. The transport doesn't care about the encoding, as long as the sender and receiver are in agreement.
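To make the two examples concrete, here is a hedged Python sketch of what a receiver might do with them. It assumes a 1-byte key and a BER-encoded length for the first blob, and treats the second byte sequence purely as an 8-byte value. The `decode_klv` helper is hypothetical, not part of any GStreamer or MISB library.

```python
def decode_klv(blob: bytes, key_size: int = 1):
    """Split one KLV triplet into (key, value), assuming a BER length."""
    key = blob[:key_size]
    first = blob[key_size]
    offset = key_size + 1
    if first < 0x80:
        # Short form: the byte is the length.
        length = first
    else:
        # Long form: the low 7 bits give the number of length bytes.
        count = first & 0x7F
        length = int.from_bytes(blob[offset:offset + count], "big")
        offset += count
    value = blob[offset:offset + length]
    if len(value) != length:
        raise ValueError("truncated KLV value")
    return key, value


key, value = decode_klv(bytes([0x2A, 0x02, 0x00, 0x03]))
print(key.hex(), value.hex())  # 2a 0003

# The second example: 8 bytes that both ends agree to read as ASCII.
text = bytes([0x46, 0x4F, 0x4F, 0x3D, 0x42, 0x41, 0x52, 0x00])
print(len(text), text.rstrip(b"\x00").decode("ascii"))  # 8 FOO=BAR
```

Only the last two lines interpret the value. Everything before that handles the bytes without knowing what they mean, which is all the transport has to do.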
In addition to being able to provide out-of-band information from the sender to the receiver, the information includes a timestamp that allows the data to maintain a time relationship with the video frames that also include a timestamp. Both the metadata and the video frame timestamps are generated by the same source clock.
Since both the flow of the metadata and the flow of the video frames can be viewed as data streaming through a pipe, the time relationship between the two is maintained most accurately when both metadata and video frames are assigned a timestamp value as soon as the data is generated. Any delay or variability in associating the timestamp with either the video frames or the metadata will add error to the time relationship.
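The effect of a late timestamp can be illustrated with a small Python sketch. It assumes the receiver pairs each metadata item with the video frame whose timestamp is nearest. That matching rule, the 30 fps frame spacing, and the nanosecond values are assumptions made for the illustration, not something MISP or GStreamer prescribes.

```python
from bisect import bisect_left


def nearest_frame(frame_ts, meta_ts):
    """Return the index of the frame timestamp closest to meta_ts.

    frame_ts must be a non-empty list sorted in ascending order.
    """
    if not frame_ts:
        raise ValueError("no frames to match against")
    i = bisect_left(frame_ts, meta_ts)
    if i == 0:
        return 0
    if i == len(frame_ts):
        return len(frame_ts) - 1
    # Pick whichever neighbour is closer; ties go to the earlier frame.
    before = meta_ts - frame_ts[i - 1]
    after = frame_ts[i] - meta_ts
    return i - 1 if before <= after else i


# Frame timestamps in nanoseconds at roughly 30 fps (hypothetical values).
frames = [n * 33_333_333 for n in range(5)]

# Metadata stamped promptly, about 3 ms after frame 2 was captured.
print(nearest_frame(frames, 70_000_000))  # 2

# The same metadata stamped 20 ms late lands on the following frame.
print(nearest_frame(frames, 90_000_000))  # 3
```

A 20 ms stamping delay is enough to attach the metadata to the wrong frame at this frame rate, which is why both streams should be stamped from the same clock as early as possible.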
MPEG-2 Transport Stream
The MPEG-2 Transport Stream protocol adds a TS header to video data, audio data, and metadata. The video data, audio data, and metadata are termed elementary streams. The TS header followed by data is called a packet. The receiving side uses the PID (Packet ID) field in the TS header to demultiplex the elementary streams. There are many other fields in a TS header besides the PID field.
For this discussion, the important point is that the Transport Stream protocol definition already supports the notion of including timestamped metadata in a transport stream.
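As a rough illustration of PID-based demultiplexing, the Python sketch below groups fixed-size TS packets by the 13-bit PID found in bytes 1 and 2 of the 4-byte header. The packet size (188 bytes) and sync byte (0x47) come from the MPEG-2 Transport Stream format. The PID values and the `make_packet` helper are invented for the demonstration, and a real demultiplexer would also handle the program tables, adaptation fields, and continuity counters that are ignored here.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def packet_pid(packet: bytes) -> int:
    """Extract the 13-bit PID from the 4-byte header of one TS packet."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid TS packet")
    return ((packet[1] & 0x1F) << 8) | packet[2]


def demux(stream: bytes) -> dict:
    """Group the packets of a byte stream by PID."""
    by_pid = {}
    for start in range(0, len(stream) - TS_PACKET_SIZE + 1, TS_PACKET_SIZE):
        packet = stream[start:start + TS_PACKET_SIZE]
        by_pid.setdefault(packet_pid(packet), []).append(packet)
    return by_pid


def make_packet(pid: int) -> bytes:
    """Build a dummy packet for the demonstration (payload is zero padding)."""
    # 0x10 marks "payload only" with a continuity counter of 0.
    header = bytes([SYNC_BYTE, (pid >> 8) & 0x1F, pid & 0xFF, 0x10])
    return header + bytes(TS_PACKET_SIZE - len(header))


# Hypothetical PIDs: 0x100 for video, 0x102 for metadata.
stream = make_packet(0x100) + make_packet(0x102) + make_packet(0x100)
streams = demux(stream)
print({hex(pid): len(packets) for pid, packets in streams.items()})
# {'0x100': 2, '0x102': 1}
```

A receiver that does not know about the metadata PID simply never looks at those packets, which is how in-band metadata stays invisible to receivers that do not understand it.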
Metadata and GStreamer
GStreamer models streaming audio/video/data as moving through a pipeline from source to sink. Adding support for metadata involves adding a new metadata source element and a new sink pad to the transport stream multiplexer element.
A simplified textual representation of the pipeline would be:
gst-launch v4l2src ! dmaienc_h264 ! mux. \
    alsasrc ! dmaienc_aac ! mux. \
    metasrc ! queue ! \
    mpegtsmux name=mux mux. ! rtpmp2tpay ! udpsink port=5004 host=$HOST
A decorated pipeline for DM36x would be:
gst-launch \
    v4l2src queue-size=6 always-copy=FALSE input-src=composite chain-ipipe=true ! \
    capsfilter caps=video/x-raw-yuv,format=\(fourcc\)NV12,width=640,height=480 ! \
    dmaiaccel ! \
    dmaienc_h264 name=video_encoder targetbitrate=1000000 idrinterval=90 intraframeinterval=30 ratecontrol=2 encodingpreset=2 ! \
    queue ! mux. \
    alsasrc buffer-time=800000 latency-time=30000 ! \
    dmaiperf ! \
    capsfilter caps=audio/x-raw-int,channels=1,width=16,depth=16,rate=16000 ! \
    dmaienc_aac name=aac_encode outputBufferSize=131072 maxbitrate=64000 bitrate=32000 ! \
    queue ! mux. \
    metasrc ! queue ! \
    mpegtsmux name=mux mux. ! rtpmp2tpay ! udpsink port=5004 host=$HOST
GStreamer supports an event called a tag. When an element receives a tag it doesn't understand, it simply passes it downstream. Tags are either independent of the stream encoding (like the title of the song for an audio stream) or information that affects how the stream is processed (like the stream bitrate).