In the context of artificial intelligence (AI) and machine learning (ML), “serving” refers to the deployment and operation of trained models so they can make predictions or inferences on new, previously unseen data. Serving is the bridge between building a model in the lab and putting it to work in the real world. Once a model has been trained and validated, it needs to be made accessible to applications, systems, or end-users—this is where serving comes in.
Serving typically involves packaging the trained model and exposing it through an API (often a REST or gRPC endpoint) or integrating it directly into a production system. This allows external clients, such as web or mobile applications, to send data to the model and receive predictions in real time or batch mode. For example, a model trained to recognize objects in images can be served via an API, enabling users to upload photos and instantly receive labels or descriptions.
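As a concrete illustration, the sketch below wraps a trained model in a minimal REST endpoint using FastAPI. The model file `model.joblib`, the flat feature-vector input, and the `/predict` route are illustrative assumptions, not a prescribed layout:

```python
# Minimal REST serving sketch with FastAPI. The model file name, input
# schema, and route are assumptions for illustration only.
from typing import List

import joblib
import numpy as np
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # a previously trained, validated model


class PredictRequest(BaseModel):
    features: List[float]  # one flat feature vector per request


@app.post("/predict")
def predict(req: PredictRequest):
    # Reshape into a single-row batch and run inference on the loaded model.
    x = np.asarray(req.features, dtype=np.float32).reshape(1, -1)
    prediction = model.predict(x)
    return {"prediction": prediction.tolist()}
```

Running this with `uvicorn serve:app` (assuming the file is saved as `serve.py`) exposes the endpoint; any HTTP client can then POST a JSON feature vector to `/predict` and read back the prediction.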
There are different approaches to serving, depending on scale, latency requirements, and infrastructure. “Online serving” refers to making predictions in real time, responding to user requests within milliseconds; this is crucial for interactive applications like chatbots or recommendation systems. “Batch serving” processes large volumes of data at once, which is suitable for tasks like periodically scoring millions of records overnight. “Offline inference” typically describes computing predictions ahead of time, without a user waiting on the result, and storing them for later lookup.
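The batch case can be as simple as a scheduled script that streams records through the model in chunks. The file names, feature columns, and chunk size below are assumptions for illustration:

```python
# Batch-serving sketch: score a large CSV in chunks, as in a nightly job.
# File names, feature columns, and chunk size are illustrative assumptions.
import joblib
import pandas as pd

model = joblib.load("model.joblib")
feature_cols = ["f1", "f2", "f3"]  # columns the model was trained on (assumed)

with open("scores.csv", "w") as out:
    header_written = False
    # Stream the input in 100k-row chunks so the full file never sits in memory.
    for chunk in pd.read_csv("records.csv", chunksize=100_000):
        chunk["score"] = model.predict(chunk[feature_cols])
        chunk[["record_id", "score"]].to_csv(out, header=not header_written, index=False)
        header_written = True
```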
Serving infrastructure is often managed by specialized tools or frameworks, such as TensorFlow Serving, TorchServe, or custom solutions built on cloud platforms. These systems take care of model versioning, scaling to absorb high request volumes, monitoring prediction performance, and rolling out new models safely. Efficient serving is critical for maintaining low latency, high throughput, and reliability in production AI systems.
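From the client's side, a dedicated server such as TensorFlow Serving exposes a documented REST API; the sketch below calls its `:predict` endpoint, pinning an explicit model version to show versioning in action. The host, port, model name, version, and input shape are assumptions:

```python
# Client-side sketch of querying a model hosted by TensorFlow Serving via
# its REST API. Host, model name, version, and input are assumptions.
import requests

# Pinning /versions/2 requests a specific deployed model version.
url = "http://localhost:8501/v1/models/image_classifier/versions/2:predict"
payload = {"instances": [[0.1, 0.2, 0.3, 0.4]]}  # one flattened input example

resp = requests.post(url, json=payload, timeout=5)
resp.raise_for_status()
print(resp.json()["predictions"])
```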
Security and access control are important considerations in serving models, especially when dealing with sensitive data or models that encode proprietary information. Logging and monitoring are also essential to ensure that the model continues to perform as expected and to detect issues like concept drift or data quality problems.
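On the monitoring side, one simple pattern is to log every prediction and compare a rolling statistic of live inputs against a training-time baseline; a large, sustained gap is a crude drift signal. The baseline value, window size, and threshold below are illustrative assumptions, and real systems typically use richer statistical tests:

```python
# Monitoring sketch: log each prediction and flag a naive drift signal when
# a feature's live mean departs from its training-time baseline. Baseline,
# window size, and threshold are illustrative assumptions.
import logging
from collections import deque

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("serving")

TRAIN_MEAN = 0.42            # feature mean measured on the training set (assumed)
DRIFT_THRESHOLD = 0.15       # tolerated absolute shift before alerting (assumed)
recent = deque(maxlen=1000)  # rolling window of the feature in live traffic


def record_prediction(feature_value: float, prediction: float) -> None:
    logger.info("prediction=%s feature=%s", prediction, feature_value)
    recent.append(feature_value)
    live_mean = sum(recent) / len(recent)
    if len(recent) == recent.maxlen and abs(live_mean - TRAIN_MEAN) > DRIFT_THRESHOLD:
        logger.warning("possible drift: live mean %.3f vs training mean %.3f",
                       live_mean, TRAIN_MEAN)
```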
In summary, serving is a key part of the machine learning lifecycle. It turns the output of the training process into a usable service, closing the loop between model development and real-world impact. As AI adoption grows, robust and scalable serving solutions become increasingly important for delivering intelligent features and experiences to users.