In the beginning, when the Internet Protocol was first designed, no one was thinking about the possibility of sending audio and video. Real-time communication was not an issue. Perhaps the greatest single problem is that the Internet Protocol (the IP part of VoIP - Voice over Internet Protocol) wasn't designed to ensure that packets are delivered in the correct order. When information is transmitted using IP, the data is broken up into packets, each of which is sent separately. Sequence information can travel with each packet (higher-level protocols such as TCP and RTP add sequence numbers), but nothing in IP itself makes sure that packets are delivered, let alone received in the proper order.
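To make the idea of numbered packets concrete, here is a minimal Python sketch. It is only an illustration, not how any real IP stack works: the function names `packetize` and `reassemble`, the 8-byte packet size and the shuffled "network" are all assumptions made up for this example.

```python
import random

def packetize(data: bytes, size: int) -> list[tuple[int, bytes]]:
    """Split data into (sequence_number, payload) packets."""
    return [(seq, data[i:i + size])
            for seq, i in enumerate(range(0, len(data), size))]

def reassemble(packets: list[tuple[int, bytes]]) -> bytes:
    """Rebuild the byte stream by sorting packets on their sequence number."""
    return b"".join(payload for _, payload in sorted(packets))

message = b"packets may arrive in any order"
packets = packetize(message, size=8)  # 8 bytes per packet is an arbitrary choice

# Simulate a network that delivers every packet, but out of order.
delivered = packets[:]
random.Random(7).shuffle(delivered)

restored = reassemble(delivered)
```

Sorting like this only works once every packet has arrived, which is fine for a file or a web page but not for a live conversation.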
Now this isn't usually a significant issue for web pages, email, etc. Why? Because these aren't real-time applications. Audio and video, however, especially live audio and live video, are definitely real-time applications. For a real-time conversation to work, packets have to arrive pretty much in order and within certain time limits.
The first challenge, and one of the major ones, is to put incoming packets back into the correct order and to somehow cope with lost or corrupted packets. Face it, the Internet does not provide a quality of service guarantee. If enough packets are lost, an audio or video stream rapidly turns into a useless mess. While packets can be resent, which is the standard way lost or corrupted packets are dealt with, real-time communication means that you just can't wait around forever. After a certain time, it's simply too late to maintain a coherent stream.
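The reorder-and-give-up behaviour described above is often handled by what is called a jitter buffer. The following Python sketch shows the idea under simplifying assumptions: the class `JitterBuffer`, its `max_wait` deadline of 50 ms and the hand-fed clock values are hypothetical choices for illustration, and a lost packet is simply reported as `None` rather than concealed the way a real audio codec would.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class JitterBuffer:
    """Toy playout buffer: reorders packets and gives up on late ones."""
    max_wait: float                        # seconds to wait for a missing packet
    next_seq: int = 0                      # next sequence number due for playout
    pending: dict = field(default_factory=dict)  # seq -> payload, held until due
    waiting_since: Optional[float] = None  # when we began waiting for next_seq

    def push(self, seq: int, payload: bytes) -> None:
        """Store an arriving packet; ignore it if its slot has already passed."""
        if seq >= self.next_seq:
            self.pending[seq] = payload

    def pop_ready(self, now: float) -> list:
        """Return payloads playable at time `now`; None marks a lost packet."""
        out = []
        while self.pending:
            if self.next_seq in self.pending:
                out.append(self.pending.pop(self.next_seq))
            else:
                # A later packet is here, but the one due next is missing.
                if self.waiting_since is None:
                    self.waiting_since = now
                if now - self.waiting_since < self.max_wait:
                    break                  # still worth waiting a little longer
                out.append(None)           # too late: leave a gap and move on
            self.next_seq += 1
            self.waiting_since = None
        return out

buf = JitterBuffer(max_wait=0.05)
buf.push(0, b"a")
buf.push(2, b"c")                   # packet 1 is delayed
first = buf.pop_ready(now=0.00)     # plays 0, then starts waiting for 1
buf.push(1, b"b")
second = buf.pop_ready(now=0.02)    # 1 arrived in time: plays 1, then 2
buf.push(4, b"e")                   # packet 3 never shows up
third = buf.pop_ready(now=0.04)     # nothing playable yet, waiting for 3
fourth = buf.pop_ready(now=0.10)    # deadline passed: gap for 3, then 4
buf.push(3, b"d")                   # arrives far too late and is ignored
```

The trade-off sits in `max_wait`: a longer wait rescues more delayed packets but adds delay to the conversation, while a shorter wait keeps the conversation snappy at the cost of more gaps.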
What we've seen over the last few years is a gradual and now nearly explosive growth in the use of VoIP and of streaming audio and video. The reason is the decline of dial-up and the growth of ISDN, DSL, ADSL, cable and other high-speed, high-bandwidth access modes. Bandwidth is the answer to most of the problems posed by IP. End-to-end high-speed links can ensure high-quality sound. The sole remaining problem is latency.