The answer is timing resolution. For continuous signals, a 48kHz system is perfectly capable of reproducing time relationships with very high accuracy. But when it comes to the start and end of audio signals, the timing resolution breaks down to about 21 microseconds - the reciprocal of the sampling rate. The challenge: the human hearing system is actually capable of detecting time differences smaller than 21 microseconds (known as dichotic difference). For example, experienced listeners, in ideal listening conditions, have been reported to localise sounds to within five degrees. Since localisation is partially dependent on the arrival time difference between the left and right ears, taking an average head size of 20cm sets the time difference between left and right for a sound coming from the side (90 degrees) at about half a millisecond. Five degrees then accounts for about six microseconds.
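As a rough check of two of the figures above, the short Python sketch below computes the sample period at 48kHz and the arrival time difference across a 20cm head for a sound coming directly from the side. The speed of sound (343 m/s) and the straight-line path across the head are assumptions made for illustration, and the function names are hypothetical.

```python
# Rough check of the figures quoted in the text.
# Assumption: speed of sound of 343 m/s (air at roughly 20 degrees C).
SPEED_OF_SOUND_M_PER_S = 343.0


def sample_period_us(sample_rate_hz: float) -> float:
    """Time between two consecutive samples, in microseconds."""
    return 1e6 / sample_rate_hz


def side_arrival_difference_us(head_width_m: float) -> float:
    """Arrival time difference between the ears for a source at 90 degrees.

    Assumption: the extra path to the far ear equals the head width
    (a straight line through the head, ignoring diffraction around it).
    """
    return head_width_m / SPEED_OF_SOUND_M_PER_S * 1e6


period_48k = sample_period_us(48_000)
side_delay = side_arrival_difference_us(0.20)

print(f"Sample period at 48kHz: {period_48k:.1f} microseconds")
print(f"Arrival difference, source at the side: {side_delay:.0f} microseconds")
```

Under these assumptions the sample period comes out just under 21 microseconds and the side arrival difference at a little over half a millisecond, in line with the rounded values in the text.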
To be able to capture this resolution, a sampling rate of 192kHz would do the trick. However, most listening is done in less controlled circumstances such as a home living room, a hotel room, a car, or a rock concert with a large PA system and thousands of fellow listeners. In those cases, a resolution of 21 microseconds - corresponding to 48kHz - may be more than enough. Only for controlled listening situations - with perfect acoustics, perfect speakers and a single listener in the perfect sweet spot - is 96kHz worth going for.
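The link between sampling rate and timing resolution can be tabulated with a few lines of Python. This is only a sketch: it treats the sample period as the timing resolution, as the text does, and considers just the three rates mentioned here. The function names and the list of rates are illustrative choices.

```python
# Sample period per sampling rate, and the lowest rate that meets a target.
# Assumption: only the 48kHz-family rates mentioned in the text are considered.
RATES_HZ = (48_000, 96_000, 192_000)


def sample_period_us(sample_rate_hz: float) -> float:
    """Time between two consecutive samples, in microseconds."""
    return 1e6 / sample_rate_hz


def lowest_rate_resolving(target_us: float, rates=RATES_HZ):
    """Return the lowest rate whose sample period is at or below target_us.

    Returns None when none of the listed rates is fast enough.
    """
    for rate in sorted(rates):
        if sample_period_us(rate) <= target_us:
            return rate
    return None


for rate in RATES_HZ:
    print(f"{rate // 1000}kHz -> {sample_period_us(rate):.1f} microseconds")

print("Needed for 6 microseconds:", lowest_rate_resolving(6.0))
print("Needed for 21 microseconds:", lowest_rate_resolving(21.0))
```

With these three rates, a six microsecond target is only met at 192kHz, while a 21 microsecond target is already met at 48kHz.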