In general terms, latency is a measure of the delay between the start of a transaction and its completion.
In gaming, for example, latency is the delay between clicking a mouse button and seeing the result of your action happen on screen.
This measure of performance is also a key part of AI technology. The best AI model in the world is useless unless it can deliver results in a timely fashion.
This is especially true when the AI is being used in real-time applications, such as customer service or telephone support.
In artificial intelligence systems, then, latency reflects the gap between the moment a user initiates a request and the moment the system responds. This delay can be compounded by several factors.
These include congestion on your internet connection, the processing power of the local or cloud computing system, and even the complexity of the request being made and the size of the model being addressed.
All of these can affect how quickly a user receives a response when interacting with an AI model.
The importance of measuring latency
Latency is typically measured in units of time, such as seconds, milliseconds or nanoseconds.
Several different aspects of latency matter in AI. Inference latency is particularly important, as are compute latency and network latency.
The goal in any AI environment is to make latency as low as possible, in other words to deliver a response to the user as quickly as possible.
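As a rough illustration, the sketch below shows how end-to-end latency is commonly measured around a model call in Python. The run_inference function is a hypothetical stand-in for whatever local model or API request an application actually makes.

```python
import time

def run_inference(prompt: str) -> str:
    # Hypothetical stand-in for a real model call or API request;
    # the sleep simply simulates processing time.
    time.sleep(0.25)
    return "response"

start = time.perf_counter()
result = run_inference("Hello, model")
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"End-to-end latency: {elapsed_ms:.1f} ms")
```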
A good example of the importance of low latency is real-time security.
Face unlock and fingerprint recognition both have to deliver near real-time performance if they’re to be useful in security applications. Waiting even a few seconds for your phone to unlock, or for a door to unlatch after a scan, is unacceptable.
Low latency is also crucial for mission-critical applications such as telemedicine, where slow transmission of vital data to and from an AI model can result in the catastrophic failure of an operation.
AI-assisted transportation, where an autonomous vehicle’s model is tasked with recognizing traffic signals and other road features, is another area where low latency is crucial.
A wrong split-second decision caused by a delay in processing can mean the difference between avoiding an accident and disaster.
But sometimes slow is good
However, not every application needs low latency. Complex batch industrial processes, for example, are unlikely to have stringent real-time conditions imposed on them. In such cases, saving a second or two here and there is unimportant.
Similarly, applications where the human is the slowest link in the chain rarely demand ultra-low-latency performance.
This is particularly true of consumer-grade needs such as image or music generation, or mobile apps that use AI for entertainment. In these cases, most people can wait a few seconds.
Optimizing for low latency generally takes two main approaches. Compute latency, which reflects the speed at which a computer runs the neural network, is typically tackled by increasing the power of the host computer, throwing more memory and processors at the problem.
The other approach is to optimize the model itself, reducing its complexity to improve throughput and responsiveness.
This is often done by fine-tuning a model for a specific, more tightly controlled requirement, so it can respond more efficiently to requests in its subject area.
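Quantization is one widely used example of this kind of model-side optimization, storing a network’s weights at lower numerical precision so it runs faster on the same hardware. The sketch below applies PyTorch’s dynamic quantization to a toy network; the layer sizes are arbitrary placeholders, not taken from any particular model.

```python
import torch
import torch.nn as nn

# A toy network standing in for a much larger model;
# the layer sizes here are arbitrary placeholders.
model = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)
model.eval()

# Dynamic quantization stores Linear weights as int8, which
# typically reduces model size and CPU inference latency.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 512)
with torch.no_grad():
    print(quantized(x).shape)  # torch.Size([1, 10])
```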