Adventures in Video Conferencing Part 2: Fun with FaceTime

Posted by Natalie Silvanovich, Project Zero

FaceTime is Apple’s video conferencing application for iOS and Mac. It is closed source, and does not appear to use any third-party libraries for its core functionality. I wondered whether fuzzing the contents of FaceTime’s audio and video streams would lead to similar results as WebRTC.

Fuzzing Set-up


Philipp Hancke performed an excellent analysis of FaceTime’s architecture in 2015. It is similar to WebRTC, in that it exchanges signalling information in SDP format and then uses RTP for audio and video streams. Looking at the FaceTime implementation on a Mac, it seemed that the bulk of FaceTime’s calling functionality is in a daemon called avconferenced. Opening the binary that supports its functionality, AVConference, in IDA shows that it contains a function called SRTPEncryptData. This function then calls CCCryptorUpdate, which appeared to encrypt RTP packets below the header.


To do a quick test of whether fuzzing was likely to be effective, I hooked this function and altered the underlying encrypted data. Normally, this can be done by setting the DYLD_INSERT_LIBRARIES environment variable, but since avconferenced is a daemon that restarts automatically when it dies, there wasn’t an easy way to set an environment variable. I eventually used insert_dylib to alter the AVConference binary to load a library on startup, and restarted the process. The loaded library used DYLD_INTERPOSE to replace CCCryptorUpdate with a version that fuzzed every input buffer (using fuzzer q from Part 1) before it was processed. This implementation had a lot of problems: it fuzzed both encryption and decryption, it affected every call to CCCryptorUpdate from avconferenced, not just the ones involved in SRTP, and there was no way to reproduce a crash. But using the modified FaceTime to call an iPhone led to video output that looked corrupted, and the phone crashed within a few minutes. This confirmed that this function was indeed where FaceTime calls are encrypted, and that fuzzing was likely to find bugs.
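
As a rough illustration of the kind of mutation such a hook applies to each buffer, here is a minimal Python sketch. It is not fuzzer q from Part 1 and not the code used in the hook; the function name, the mutation rate and the use of a seeded generator are all assumptions made for the example.

```python
import random


def fuzz_buffer(data: bytes, rng: random.Random, rate: float = 0.01) -> bytes:
    """Returns a copy of data in which roughly `rate` of the bytes are changed.

    Hypothetical stand-in for a real fuzzer: each selected byte is XOR-ed
    with a nonzero value, so a selected byte always differs from the original.
    """
    out = bytearray(data)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] ^= rng.randrange(1, 256)
    return bytes(out)


if __name__ == "__main__":
    # A seeded generator makes a run repeatable, which the first hook was not.
    rng = random.Random(1234)
    print(fuzz_buffer(b"\x00" * 32, rng, rate=0.25).hex())
```

Passing the generator in explicitly keeps the mutation deterministic for a given seed, which is one simple way to make a fuzzing run reproducible.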


I made a few changes to the function that hooked CCCryptorUpdate to try to solve these problems. I limited fuzzing of the input buffer to the two threads that write audio and video output to RTP, which also solved the problem of decrypted packets being fuzzed, as these threads only ever encrypt. I then added functionality that wrote the encrypted, fuzzed contents of each packet to a series of log files, so that test cases could be replayed. This required altering the sandbox of avconferenced so that it could write files to the log location, and adding spinlocks to the hook, as calling CCCryptorUpdate is thread safe, but logging packets isn’t.
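
The locking requirement can be sketched as follows. This is a simplified Python illustration rather than the hook itself: it uses a single log file and a threading.Lock where the hook used several files and spinlocks, and the class name, the length-prefixed record format and the helper read_log are assumptions.

```python
import struct
import threading


class PacketLogger:
    """Appends length-prefixed packets to one log file, one writer at a time."""

    def __init__(self, path):
        self._path = path
        self._lock = threading.Lock()  # stands in for the spinlock in the hook

    def log(self, packet: bytes) -> None:
        # The length prefix lets a replay tool split the file back into packets.
        record = struct.pack(">I", len(packet)) + packet
        with self._lock:
            with open(self._path, "ab") as f:
                f.write(record)


def read_log(path):
    """Returns the logged packets in the order they were written."""
    packets = []
    with open(path, "rb") as f:
        while True:
            prefix = f.read(4)
            if len(prefix) < 4:
                break
            (length,) = struct.unpack(">I", prefix)
            packets.append(f.read(length))
    return packets
```

Without the lock, records written by the audio and video threads could interleave and the log would no longer split cleanly into packets.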

Call Replay


I then wrote a second library that hooks CCCryptorUpdate and replays packets logged by the first library by copying the logged packets in sequence into the packet buffers passed into the function. Unfortunately, this required a small modification to the AVConference binary, as the SRTPEncryptData function does not respect the length returned by CCCryptorUpdate; instead, it assumes that the length of the encrypted data is the same as the length of the plaintext data, which is reasonable when CCCryptorUpdate isn’t being hooked. Since SRTPEncryptData always uses a large fixed-size buffer for encryption, and encryption is in-place, I changed the function to retrieve the length of the encrypted buffer from the very end of the buffer, which was set in the hooked CCCryptorUpdate call. This memory is unlikely to be used for other purposes, because RTP packets are typically much shorter than the buffer. Unfortunately though, even though the same encrypted data was being replayed to the target, it wasn’t being processed correctly by the receiving device.
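
The trick of passing the length through the tail of the buffer can be modelled in a few lines of Python. This is only a sketch of the idea: the buffer size, the 4-byte little-endian length field and both function names are assumptions, since the text does not give the real values.

```python
import struct

BUFFER_SIZE = 2048                  # hypothetical; the real size is not given
LENGTH_FIELD = struct.Struct("<I")  # 4-byte length kept in the last bytes


def hooked_update(buffer: bytearray, replayed: bytes) -> None:
    """Overwrites the packet in place and records its length at the buffer end."""
    if len(replayed) > len(buffer) - LENGTH_FIELD.size:
        raise ValueError("replayed packet would overlap the length field")
    buffer[: len(replayed)] = replayed
    LENGTH_FIELD.pack_into(buffer, len(buffer) - LENGTH_FIELD.size, len(replayed))


def patched_caller(buffer: bytearray) -> bytes:
    """Reads the length from the tail instead of assuming the plaintext length."""
    (length,) = LENGTH_FIELD.unpack_from(buffer, len(buffer) - LENGTH_FIELD.size)
    return bytes(buffer[:length])
```

The guard in hooked_update reflects why the approach is safe only when packets stay well short of the buffer size.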


Understanding why requires an explanation of how RTP works. An RTP packet has the following format.

[Figure: RTP packet header layout]



It contains several fields that affect how its payload is interpreted. The SSRC is a random identifier that identifies a stream. For example, in FaceTime the audio and video streams have different SSRCs. SSRCs can also help differentiate between streams in a situation where a user could potentially have an unlimited number of streams, for example, multiple participants in a video call. RTP packets also have a payload type (PT in the diagram) which is used to differentiate between types of data in the payload. The payload type for a certain data type is consistent across calls. In FaceTime, the video stream has a single payload type for video data, but the audio stream has two payload types, likely one for audio data and the other for synchronization. The marker (M in the diagram) field of RTP is also used by FaceTime to indicate when a packet is fragmented and needs to be reassembled.
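
For reference, the fields discussed above sit in the 12-byte fixed RTP header defined by RFC 3550. The Python sketch below decodes them; the class and function names are chosen for this example, and nothing in it is specific to FaceTime.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decodes the 12-byte fixed RTP header (network byte order)."""
    if len(packet) < 12:
        raise ValueError("RTP fixed header is 12 bytes")
    b0, b1, seq, ts, ssrc = struct.unpack_from("!BBHII", packet)
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=b0 & 0x0F,
        marker=bool(b1 & 0x80),       # M: top bit of the second byte
        payload_type=b1 & 0x7F,       # PT: low seven bits of the second byte
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )
```

The marker and payload type share one byte, which is why a replay tool has to mask the marker bit before comparing payload types.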


From this it is clear that simply copying logged data into the current encrypted packet won’t work, because the data needs to have the correct SSRC, payload type and marker, or it won’t be interpreted correctly. This wasn’t necessary in WebRTC, because I had enough control over WebRTC that I could create a connection with a single SSRC and payload type for fuzzing purposes. But there is no way to do this in FaceTime: even muting a video call leads to silent audio packets being sent, as opposed to the audio stream shutting down. So these values needed to be manually corrected.


An RTP feature called extensions made correcting these fields difficult. An extension is an optional header that can be added to an RTP packet. Extensions are not supposed to depend on the RTP payload to be interpreted, and are often used to transmit network or display features. Some examples of supported extensions include the orientation extension, which tells the endpoint the orientation of the receiving device, and the mute extension, which tells the endpoint whether the receiving device is muted.


Extensions mean that even if it is possible to determine the payload type, marker and SSRC of the data, this is insufficient to replay the exact packet that was sent. Moreover, FaceTime creates extensions after the packet is encrypted, so it is not possible to create the complete RTP packet by hooking CCCryptorUpdate, because extensions could be added later.
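
To see why extensions complicate header handling, note that RFC 3550 places the optional extension after the fixed header and any CSRC identifiers, so the payload offset varies from packet to packet. A minimal Python sketch of finding that offset follows; the function name is an assumption, and the sketch handles only the generic RFC 3550 layout, not any FaceTime-specific extension contents.

```python
import struct


def rtp_header_length(packet: bytes) -> int:
    """Returns the payload offset: fixed header + CSRC list + extension."""
    if len(packet) < 12:
        raise ValueError("RTP fixed header is 12 bytes")
    first = packet[0]
    offset = 12 + 4 * (first & 0x0F)      # CSRC identifiers, 4 bytes each
    if first & 0x10:                      # X bit: one header extension follows
        if len(packet) < offset + 4:
            raise ValueError("truncated extension header")
        _profile, words = struct.unpack_from("!HH", packet, offset)
        offset += 4 + 4 * words           # length counts 32-bit words of data
    if len(packet) < offset:
        raise ValueError("truncated packet")
    return offset
```

A tool that swaps headers in a send hook would need this offset to know where the logged header ends and the encrypted payload begins.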


At this point, it seemed necessary to hook sendmsg as well as CCCryptorUpdate. This would allow the outgoing RTP header to be modified once it is complete. There were a few challenges in doing this. To start, audio and video packets are sent by different threads in FaceTime, and can be reordered between the time they are encrypted and the time they are sent by sendmsg. So I couldn’t assume that an RTP packet received by sendmsg was necessarily the last one that was encrypted. There was also the problem that SSRCs are dynamic, so replaying an RTP packet with the SSRC it was recorded with won’t work; it needs to have the new SSRC for the audio or video stream.


Note that in macOS Mojave, FaceTime can call sendmsg via either the AVConference binary or the IDSFoundation binary, depending on the network configuration. So to capture and replay unencrypted RTP traffic on newer systems, it is necessary to hook CCCryptorUpdate in AVConference and sendmsg in IDSFoundation (AVConference calls into IDSFoundation when it calls sendmsg). Otherwise, the process is the same as on older systems.


I ended up implementing a solution that recorded each packet’s unencrypted payload, then recorded its RTP header, and used a snippet of the encrypted payload to pair each header with the correct unencrypted payload. Then to replay packets, the packets encrypted in CCCryptorUpdate were replaced with the logged packets, and once the encrypted payload came through to sendmsg, the header was replaced with the logged one for that payload. Fortunately, the two streams with unique SSRCs used by FaceTime do not share any payload types, so it was possible to determine the new SSRC for each stream by waiting for an incoming packet with the correct payload type. Then in each subsequent packet, the SSRC was replaced with the correct one.
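
The SSRC correction described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the replay library: the payload type numbers in the table are hypothetical (the text gives no actual values), and the class and method names are made up for the example.

```python
import struct

# Hypothetical payload types; the text gives no actual FaceTime values.
STREAM_FOR_PAYLOAD_TYPE = {100: "video", 104: "audio", 105: "audio"}


class SsrcRewriter:
    def __init__(self):
        self.live_ssrc = {}  # stream name -> SSRC seen in the current call

    def observe(self, live_packet: bytes) -> None:
        """Learns the SSRC of a stream from a packet of the current call."""
        stream = STREAM_FOR_PAYLOAD_TYPE.get(live_packet[1] & 0x7F)
        if stream is not None:
            (self.live_ssrc[stream],) = struct.unpack_from("!I", live_packet, 8)

    def rewrite(self, logged_packet: bytes):
        """Returns the logged packet with the current SSRC, or None if unknown."""
        stream = STREAM_FOR_PAYLOAD_TYPE.get(logged_packet[1] & 0x7F)
        if stream not in self.live_ssrc:
            return None  # nothing learned yet for this stream: drop the packet
        patched = bytearray(logged_packet)
        struct.pack_into("!I", patched, 8, self.live_ssrc[stream])
        return bytes(patched)
```

Because the two streams share no payload types, the payload type alone is enough to decide which learned SSRC a logged packet should carry.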


Unfortunately, this still did not replay a FaceTime call correctly, and calls often experienced decryption failures. I eventually determined that audio and video on FaceTime are encrypted with different keys, and updated the replay script to queue the CCCryptor used by the CCCryptorUpdate function based on whether it was audio or video content. Then in sendmsg, the entire logged RTP packet, including the unencrypted payload, was copied into the outgoing packet, the SSRC was fixed, and then the payload was encrypted with the next CCCryptor out of the appropriate queue. If a CCCryptor wasn’t available, outgoing packets were dropped until a new one was created. At this point, it was possible to stop using the modified AVConference binary, as all the packet modification was now happening in sendmsg. This implementation still had reliability problems.


Digging more deeply into how FaceTime encryption works, packets are encrypted in CTR mode, which requires a counter. The counter is initialized to a unique value for each packet that is sent. During the initialization of the RTP stream, the peers exchange two 16-byte random tokens, one for audio and one for video. The counter value for each packet is then calculated by exclusive or-ing the token with several values found in the packet, including the SSRC and the sequence number. Only one value in this calculation, the sequence number, changes between packets. So it is possible to calculate the counter value for each packet by knowing the initial counter value and sequence number, which can be retrieved by hooking CCCryptorCreateWithMode. The sequence number is xor-ed with the random token at index 0x12 when FaceTime constructs a counter, so by xor-ing this location with the initial sequence number and then a packet’s sequence number, the counter value for that packet can be calculated. The key can also be retrieved by hooking CCCryptorCreateWithMode. This allowed me to dispense with queuing cryptors, as I now had all the information I needed to construct a cryptor for any packet. This allowed for packets to be encrypted faster and more accurately.
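
The counter derivation comes down to two XORs, which the following Python sketch illustrates. It is a model of the arithmetic only: the function name, the 16-bit big-endian treatment of the sequence number and the offset used in the usage example are assumptions, and the offset is left as a parameter because the exact counter layout is not described here.

```python
import struct


def counter_for_packet(initial_counter: bytes, initial_seq: int,
                       seq: int, offset: int) -> bytes:
    """Derives a packet's counter from the first counter of the stream.

    Assumes the sequence number was XOR-ed into a 16-bit big-endian field at
    `offset`; both the width and the byte order are assumptions of this sketch.
    """
    counter = bytearray(initial_counter)
    (field,) = struct.unpack_from("!H", counter, offset)
    # XOR out the first sequence number, then XOR in this packet's one.
    field ^= (initial_seq ^ seq) & 0xFFFF
    struct.pack_into("!H", counter, offset, field)
    return bytes(counter)


if __name__ == "__main__":
    first = bytes(16)  # placeholder for a counter captured from the first packet
    print(counter_for_packet(first, 0x1000, 0x1001, offset=14).hex())
```

Since XOR is its own inverse, no other input to the counter has to be known: everything except the sequence number cancels out.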


Sequence numbers still posed a problem though, as the initial sequence number of an RTP stream is randomly generated at the beginning of the call, and is different between subsequent calls. Also, sequence numbers are used to reconstruct video streams in order, so they need to be correct. I altered the replay tool to determine the starting sequence number of each stream, then calculate the difference between the starting sequence number of each logged stream and the sequence number of the logged packet, and add that difference to the new starting value. These two changes finally made the replay tool work, though replay gets slower and slower as a stream is replayed due to dropped packets.
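
The sequence number correction is simple modular arithmetic. A minimal Python sketch follows; the function and parameter names are assumptions, and the 16-bit wraparound follows the width of the RTP sequence number field.

```python
def rebase_sequence(logged_seq: int, logged_start: int, live_start: int) -> int:
    """Maps a logged sequence number onto the current call's numbering."""
    # Distance of this packet from the start of the logged stream,
    # wrapped to 16 bits like the RTP sequence number field.
    delta = (logged_seq - logged_start) & 0xFFFF
    return (live_start + delta) & 0xFFFF


if __name__ == "__main__":
    # The sixth packet of a logged stream that started at 100,
    # replayed into a call whose stream started at 5000.
    print(rebase_sequence(105, 100, 5000))
```

Masking both the difference and the sum keeps the result valid even when either the logged stream or the live stream wraps past 65535.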

Results


Using this setup, I was able to fuzz FaceTime calls and reproduce the crashes. I reported three bugs in FaceTime based on this work. All of these issues have been fixed in recent updates.


CVE-2018-4366 is an out-of-bounds read in video processing that occurs on Macs only.


CVE-2018-4367 is a stack corruption vulnerability that affects iOS and Mac. There are a fair number of variables on the stack of the affected function before the stack cookie, and several fuzz crashes due to this issue caused segmentation faults as opposed to stack_chk crashes, so it is likely exploitable.


CVE-2018-4384 is a kernel heap corruption issue in video processing that affects iOS. It is likely similar to a vulnerability found by Adam Donenfeld of Zimperium.


All of these issues took less than 15 minutes of fuzzing to find on a live device. Unfortunately, this was the limit of the fuzzing that could be performed on FaceTime, as it is closed source and it would be difficult to create a command line fuzzing tool with coverage like we did for WebRTC.


In Part 3, we will look at video calling in WhatsApp.