After our announcement of an in-house SRT implementation, we received many responses saying essentially “Why not fix Haivision’s libsrt, it’s open source”. Unfortunately, the reality is more complicated: while we sent fixes for many of the issues we saw, we can’t replace third-party equipment running older versions of libsrt. Furthermore, this dependence on a single codebase is one of the reasons Internet Standards are required to have a minimum of two implementations. This leads us to the counterintuitive situation that we must maintain bug-for-bug compatibility with libsrt because it’s so widespread, i.e. if libsrt has a bug, we must also implement that bug.
This blog post will describe a few of the bugs we had to reimplement:
Historical file transfer baggage breaking live video
The SRT protocol was built around the UDT file transfer protocol, and unfortunately some of the historical baggage around file transfer affects the behaviour of live video. The most problematic piece is the “Available Buffer Size” field in the ACK message. In live video, the buffer size normally means the number of packets buffered, and this is what we initially set the field to. But we found that this caused libsrt to stop sending packets at the beginning of a stream, when the buffer contained only a few packets.
After a lot of painful investigation we determined that this field actually represents the available file-writing buffer of UDT, and allows UDT to slow down file transmission if the receiver is having problems writing to disk. This is of course meaningless in live video: there isn’t a disk to write to, nor does it make any sense for a live video sender like libsrt to “slow down” video sending (and somehow make up for it later?).
A fixed buffer doesn’t really make sense in live video because it’s dependent on the incoming content which you don’t know beforehand. In the end we just put a magic value of 8192 that made it not slow down in the real world and moved on.
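A minimal sketch of the difference between the two approaches is shown here. The function and constant names are hypothetical, and treating the field as a packet count is an assumption; only the value 8192 comes from the text above.

```python
# Hypothetical names for illustration; only the value 8192 is from the post.
LIVE_AVAILABLE_BUFFER_SIZE = 8192  # fixed value reported in every ACK (assumed to be in packets)


def naive_available_buffer(buffered_packets: int) -> int:
    """First attempt: report how many packets the receiver currently holds.

    Early in a stream this is a small number, which a UDT-style sender
    reads as "the receiver has almost no room left".
    """
    return buffered_packets


def live_available_buffer(buffered_packets: int) -> int:
    """Workaround: ignore receiver occupancy and always report the constant."""
    return LIVE_AVAILABLE_BUFFER_SIZE


if __name__ == "__main__":
    for buffered in (0, 3, 500):
        print(buffered, naive_available_buffer(buffered), live_available_buffer(buffered))
```

With the naive version, a receiver holding three packets reports 3. With the workaround it always reports 8192, so the sender never sees a reason to throttle.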
RTT0 is impossible to implement
Another challenge was how to implement RTT0. RTT (round trip time) matters in any packet retransmission protocol because it determines how retransmissions are spaced out to get the best chance of recovery. The SRT specification describes how to calculate RTT0, an initial estimate of RTT. According to the specification, RTT0 is calculated by sending a handshake packet, waiting for a response, and measuring the time between the first packet sent and the response. However, this is UDP and the response could be lost. SRT therefore repeats the handshake response to be sure that it arrives. But a design flaw means you cannot identify whether the handshake response is the first one or a repetition. If it’s a repetition, this vastly inflates the RTT value, causing problems down the line.
The outbound handshake could also be lost and a second one sent 250ms later. Matching a packet to its response is normally done with a sequence number, which lets you identify the packet pairs; without one, it’s impossible to match a given handshake with its response. On long-latency links (RTT > 250ms), further care is needed to ensure that a repeated outbound handshake is not associated with the response to the first, which would make the RTT estimate too low.
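The arithmetic behind both failure modes can be sketched as follows. This is a toy model with made-up link delays, not a handshake implementation; the 250ms retry interval is the only number taken from the text, and all names are hypothetical.

```python
RETRY_INTERVAL_MS = 250.0  # retry interval mentioned in the post


def measured_rtt0(assumed_send_ms: float, response_arrival_ms: float) -> float:
    """RTT0 as the measurer sees it: response arrival minus whichever
    handshake it assumes the response belongs to."""
    return response_arrival_ms - assumed_send_ms


# Case 1: short link (true RTT 40 ms, made up). The first outbound handshake
# is lost, the retry at t=250 ms is answered, but the response is measured
# against the first send at t=0.
true_rtt_short = 40.0
inflated = measured_rtt0(0.0, RETRY_INTERVAL_MS + true_rtt_short)

# Case 2: long link (true RTT 300 ms, made up). Nothing is lost, but a retry
# goes out at t=250 ms before the response to the first handshake arrives at
# t=300 ms. Pairing that response with the retry underestimates the RTT.
true_rtt_long = 300.0
deflated = measured_rtt0(RETRY_INTERVAL_MS, 0.0 + true_rtt_long)

if __name__ == "__main__":
    print(f"true {true_rtt_short} ms, measured {inflated} ms")
    print(f"true {true_rtt_long} ms, measured {deflated} ms")
```

In the first case a 40 ms link is measured as 290 ms; in the second a 300 ms link is measured as 50 ms. Without an identifier tying a response to a specific handshake, the receiver cannot tell which case it is in.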
Interestingly, libsrt acknowledges this is not implemented in their own code: https://github.com/Haivision/srt/blob/697dce0978c9e8c2f8fff4d1443f2cb69941bdc2/srtcore/tsbpd_time.cpp#L115
Instead, we calculate RTT during the session rather than at the beginning, using the ACKACK message, which has a sequence number. libsrt does the same thing.
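A rough sketch of that idea: remember when each numbered ACK was sent, and take an RTT sample when the ACKACK carrying the same number comes back. The class and method names are hypothetical, and the smoothing weights (7/8 and 1/8 for RTT, 3/4 and 1/4 for variance) and the 100 ms starting value are assumed TCP-style defaults, not taken from our code or from libsrt.

```python
from typing import Dict, Optional


class AckRttEstimator:
    """Toy RTT estimator keyed on ACK numbers (hypothetical API)."""

    def __init__(self, initial_rtt_us: float = 100_000.0) -> None:
        self.rtt_us = initial_rtt_us
        self.rtt_var_us = initial_rtt_us / 2
        self._sent_at: Dict[int, float] = {}  # ACK number -> send time (us)

    def on_ack_sent(self, ack_number: int, now_us: float) -> None:
        """Record the time at which a numbered ACK left."""
        self._sent_at[ack_number] = now_us

    def on_ackack_received(self, ack_number: int, now_us: float) -> Optional[float]:
        """Return the RTT sample for this ACKACK, or None if it matches no
        outstanding ACK (unknown number or duplicate)."""
        sent = self._sent_at.pop(ack_number, None)
        if sent is None:
            return None
        sample = now_us - sent
        # Update the variance against the previous smoothed RTT, then the RTT.
        self.rtt_var_us = 0.75 * self.rtt_var_us + 0.25 * abs(self.rtt_us - sample)
        self.rtt_us = 0.875 * self.rtt_us + 0.125 * sample
        return sample


if __name__ == "__main__":
    est = AckRttEstimator()
    est.on_ack_sent(1, 1_000.0)
    print(est.on_ackack_received(1, 21_000.0), est.rtt_us, est.rtt_var_us)
    print(est.on_ackack_received(1, 22_000.0))  # duplicate is ignored
```

Because every sample is tied to a specific ACK number, a lost or repeated message cannot be paired with the wrong send time, which is exactly what the handshake-based RTT0 cannot guarantee.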
Extreme bursting of packets (unsolved)
One mystery that still remains is the extreme bursting of traffic on some third-party devices running libsrt. Data was collected from a third-party sender over several days and bursts were observed:
Sequence Number | Written Timestamp (us) | Timestamp Delta (us) | Arrival Time (us) | Arrival Delta (us) |
26879936 | 465753134 | | 667749101015 | |
26879937 | 465754313 | 1179 | 667749202359 | 101343.6 |
26879938 | 465754351 | 38 | 667749207245 | 4886.074 |
26879939 | 465754379 | 28 | 667749207515 | 270.7778 |
26879940 | 465755301 | 922 | 667749207659 | 143.3333 |
26879941 | 465755337 | 36 | 667749207742 | 83.40741 |
26879942 | 465756122 | 785 | 667749207844 | 102.1481 |
26879943 | 465757290 | 1168 | 667749207937 | 93.03704 |
26879944 | 465757330 | 40 | 667749208028 | 90.37037 |
26879945 | 465757362 | 32 | 667749208086 | 57.74074 |
26879946 | 465757863 | 501 | 667749208132 | 46.77778 |
26879947 | 465758816 | 953 | 667749208186 | 53.77778 |
26879948 | 465758841 | 25 | 667749208227 | 41.14815 |
26879949 | 465759857 | 1016 | 667749208269 | 41.2963 |
26879950 | 465759878 | 21 | 667749208309 | 40 |
You can see from the written timestamps that libsrt intended to send these packets out in a (largely) smooth fashion, but instead there was a 101ms gap and then a lot of packets were sent in one go, bursting heavily on the wire at up to line rate.
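The comparison can be reproduced from the first few rows of the table. The sketch below computes both sets of deltas and flags packets whose arrival gap exceeds their timestamp gap by more than a threshold; the helper names and the 50ms threshold are arbitrary choices for illustration. The arrival deltas are integer differences of the arrival times as printed, so they differ slightly from the fractional values in the table.

```python
from typing import List, Tuple

# (sequence number, written timestamp in us, arrival time in us),
# copied from the first five rows of the table.
ROWS: List[Tuple[int, int, int]] = [
    (26879936, 465753134, 667749101015),
    (26879937, 465754313, 667749202359),
    (26879938, 465754351, 667749207245),
    (26879939, 465754379, 667749207515),
    (26879940, 465755301, 667749207659),
]


def deltas(values: List[int]) -> List[int]:
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def find_stalls(rows: List[Tuple[int, int, int]], threshold_us: int) -> List[int]:
    """Sequence numbers whose arrival gap exceeds the intended
    (timestamp) gap by more than threshold_us."""
    send = deltas([r[1] for r in rows])
    arrival = deltas([r[2] for r in rows])
    return [
        rows[i + 1][0]
        for i, (s, a) in enumerate(zip(send, arrival))
        if a - s > threshold_us
    ]


send_deltas = deltas([r[1] for r in ROWS])
arrival_deltas = deltas([r[2] for r in ROWS])

if __name__ == "__main__":
    print(send_deltas)
    print(arrival_deltas)
    print(find_stalls(ROWS, threshold_us=50_000))
```

For these rows the intended gap before packet 26879937 is 1179 us but the observed gap is about 101 ms, so that packet is the only one flagged.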
There are a few potential causes for this bursting. The most likely is a thread scheduling issue in libsrt, which we have covered before. It could also be an overloaded CPU in the source encoder. Many encoders built around “embedded design” principles have extremely weak CPUs (more on that another day). It could also be the network, but this is unlikely because such bursting would have caused at least some packet loss.
In our SRT implementation we had not expected such extreme bursting and had to rewrite tolerances.
Encryption key rotation
We also spent substantial time debugging Key Rotation. SRT changes keys after some number of packets and gives you the option to send both keys at the beginning or one key at a time. Only one key at a time actually worked.
While we did submit a fix that was accepted, we still had to implement the broken behaviour to work with older versions of libsrt.
This was covered in the previous blog post: https://www.obe.tv/why-did-we-write-an-in-house-srt-implementation/
Conclusion
We are asked regularly if our SRT implementation is compatible with Haivision’s libsrt, and the answer is that it is compatible all the way down to matching the same bugs. We believe only in-house implementations of technologies such as SRT, SDI and ST-2110 give the performance and control needed (and yes, we have implemented third-party vendor bugs in ST-2110 too). This is how we can offer the highest reliability in our low-latency encoders and decoders.