Network monitoring is the practice of continuously observing a computer network to ensure that it is functioning properly and is available to users. It covers network devices, such as routers, switches, and servers, as well as the performance of the network as a whole. Network monitoring is important because it helps teams identify and resolve network issues before they cause significant disruptions to users. It is part of the broader field of IT observability.
Modern networks are running new types of workloads, including machine learning models. A machine learning model behaves differently from traditional IT infrastructure and requires different types of monitoring. In this article, I'll explain the differences between traditional network monitoring and ML model monitoring and provide a quick overview of how to assist data science teams in monitoring their production models.
Types of Network Monitoring
When choosing a network monitoring solution, it’s important to understand the differences between the three main types of network monitoring:
Traffic Flow Monitoring
Traffic flow-based network monitoring tools collect and analyze data on network traffic flows, such as the amount of data being transferred, the number of packets being sent, and the sources and destinations of the traffic. This type of monitoring can provide a detailed view of network usage and can be used to identify bottlenecks, troubleshoot connectivity issues, and plan for future network capacity needs. Common flow export protocols that these tools rely on include NetFlow and sFlow.
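As a rough illustration of the kind of analysis a flow collector performs, the sketch below aggregates flow records by source address to find the heaviest senders. The `FlowRecord` fields and the sample data are hypothetical; real NetFlow or sFlow exports carry many more fields and arrive through a collector rather than a Python list.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """Hypothetical, simplified flow record (real exports have more fields)."""
    src: str
    dst: str
    bytes_sent: int
    packets: int


def top_talkers(flows, n=3):
    """Return the n source addresses that sent the most bytes."""
    totals = Counter()
    for flow in flows:
        totals[flow.src] += flow.bytes_sent
    return totals.most_common(n)


# Made-up sample data for illustration only.
flows = [
    FlowRecord("10.0.0.5", "10.0.1.9", 1_200_000, 900),
    FlowRecord("10.0.0.7", "10.0.1.9", 300_000, 250),
    FlowRecord("10.0.0.5", "10.0.2.4", 800_000, 600),
]

for src, total in top_talkers(flows):
    print(f"{src} sent {total} bytes")
```

The same grouping could be done by destination or by source and destination pair, depending on whether the goal is finding bottlenecks or planning capacity.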
SNMP-Based Monitoring
Simple Network Management Protocol (SNMP) is a widely used protocol for managing and monitoring network devices. SNMP-based tools use this protocol to gather information from network devices such as routers, switches, and servers and can provide detailed statistics on device performance, including CPU and memory usage, interface statistics, and error rates. Examples of SNMP-based tools include Nagios and PRTG Network Monitor.
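Interface statistics in SNMP are typically exposed as ever-increasing counters (for example, `ifInOctets` in the standard interfaces MIB), so a monitoring tool derives a rate from two successive readings. Here is a minimal sketch of that calculation, assuming the polling itself happens elsewhere. The function name and the sample readings are my own, not part of any SNMP library.

```python
def interface_utilization(prev_octets, curr_octets, interval_s, link_bps,
                          counter_bits=32):
    """Estimate link utilization (0.0-1.0) from two octet-counter samples.

    Assumes the counter wrapped at most once between the two samples.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    modulus = 2 ** counter_bits
    # Modular subtraction handles a single counter wrap-around.
    delta_octets = (curr_octets - prev_octets) % modulus
    bits_per_second = delta_octets * 8 / interval_s
    return bits_per_second / link_bps


# Invented readings: taken 60 seconds apart on a 100 Mbit/s link.
util = interface_utilization(1_000_000, 376_000_000, 60, 100_000_000)
print(f"utilization: {util:.1%}")
```

On fast links a 32-bit counter can wrap more than once between polls, which is why 64-bit counters such as `ifHCInOctets` are generally preferred; pass `counter_bits=64` in that case.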
Active Network Monitoring
Active monitoring tools send probes, such as pings and traceroutes, to check the availability, reachability, and responsiveness of network elements. They can also check whether a service is available on a specific port and measure its response time. This gives a more detailed picture of the network and its devices, which plays an important role in network security. Examples of active network monitoring tools include SolarWinds and Paessler PRTG.
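A port-level availability check can be sketched with nothing but the Python standard library. The example below attempts a TCP connection and times it. The function name, the timeout, and the target address are placeholders of mine, and a real tool would add retries, scheduling, and alerting.

```python
import socket
import time


def check_tcp_port(host, port, timeout=2.0):
    """Try to open a TCP connection; return (is_up, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            is_up = True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        is_up = False
    return is_up, time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder target; replace with a host and port you are allowed to probe.
    up, elapsed = check_tcp_port("127.0.0.1", 80)
    print(f"up={up} elapsed={elapsed * 1000:.1f} ms")
```

Note that the elapsed time for a failed check mostly reflects the timeout or the refusal, so it should not be mixed into response-time statistics for healthy checks.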
What Is Machine Learning Model Monitoring?
Machine learning model monitoring refers to the process of tracking and analyzing the performance and behavior of a machine learning model over time. This can include monitoring the model's accuracy, data drift, and other performance metrics, as well as monitoring the input data and the environment in which the model is being used.
The goal of machine learning model monitoring is to ensure that the model is continuing to perform well and to identify and address any issues that may arise. Organizations can take corrective action to improve the model's performance or detect when the model's performance starts to deteriorate and retraining is needed.
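One simple way to notice deteriorating performance is to track accuracy over a sliding window of recent predictions for which the true outcome is known. The sketch below is an illustration under assumptions: the class name, window size, and alert threshold are hypothetical, and it presumes that ground-truth labels eventually arrive for live predictions.

```python
from collections import deque


class RollingAccuracy:
    """Track accuracy over the most recent `window` labeled predictions."""

    def __init__(self, window=100, alert_below=0.8):
        self.outcomes = deque(maxlen=window)
        self.alert_below = alert_below

    def record(self, predicted, actual):
        """Store whether one prediction matched its true label."""
        self.outcomes.append(predicted == actual)

    @property
    def accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True once the window is full and accuracy is under the threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy < self.alert_below


monitor = RollingAccuracy(window=4, alert_below=0.75)
# Invented (predicted, actual) pairs standing in for live feedback.
for predicted, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(predicted, actual)
print(monitor.accuracy, monitor.needs_attention())
```

Waiting for a full window before alerting avoids reacting to the first few predictions, at the cost of a slower signal.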
Machine learning model monitoring can be done in various ways. Common techniques include:
- Tracking the performance of the model on a validation set to detect overfitting or underfitting.
- Monitoring the real-time performance of the model on live data to detect data drift, which occurs when the distribution of the input data changes over time, and concept drift, which occurs when the relationship between the inputs and the target changes.
- Monitoring the input data to detect any anomalies or changes in the data that may affect the model's performance.
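As an example of the input data checks in the last item above, the following sketch reports how many values in a batch of one numeric feature are missing or fall outside the range seen during training. The function names, the feature, and the training range are assumptions made for illustration.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def input_quality_report(values, train_min, train_max):
    """Summarize missing and out-of-range values in a batch of one feature."""
    if not values:
        raise ValueError("values must not be empty")
    present = [v for v in values if not is_missing(v)]
    missing_rate = (len(values) - len(present)) / len(values)
    out_of_range = sum(1 for v in present
                       if not (train_min <= v <= train_max))
    return {
        "missing_rate": missing_rate,
        "out_of_range_rate": out_of_range / len(values),
    }


# Invented batch of an "age" feature; the training data spanned 18 to 90.
batch = [34, 51, None, 29, 140, float("nan"), 47, 38]
report = input_quality_report(batch, train_min=18, train_max=90)
print(report)
```

A sudden rise in either rate often points to an upstream pipeline problem rather than a change in the real world, which is worth ruling out before retraining.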
How Is ML Monitoring Different from Network Monitoring?
Machine learning (ML) monitoring is the process of monitoring and managing the performance and reliability of machine learning models in production environments. It involves tracking key metrics and observing the behavior of the model to ensure that it is functioning as expected and meeting the desired performance goals.
ML monitoring is different from network monitoring in a few key ways:
- Scope: Network monitoring is focused on the performance and availability of the network infrastructure, while ML monitoring is focused on the performance and reliability of machine learning models.
- Metrics: Network monitoring typically involves monitoring performance metrics such as bandwidth usage, packet loss, and latency, while ML monitoring involves tracking metrics such as model accuracy, prediction errors, and model drift.
- Techniques: Network monitoring typically involves using tools such as ping, traceroute, and network mapping, while ML monitoring may involve techniques such as data drift detection and performance tracking on labeled data, with online learning, retraining, or hyperparameter optimization as possible follow-up actions.
How Machine Learning Model Performance Monitoring Works
The responsibility for machine learning model performance monitoring can be shared among different stakeholders such as data scientists, machine learning engineers, operations and IT teams, model owners, and compliance and legal teams.
Monitoring System Performance
Another element of monitoring is tracking the performance and behavior of the systems and infrastructure on which the machine learning models are deployed. This includes resources such as CPU, memory, storage, and network usage, as well as the health and availability of the systems and services that support the machine learning models.
By monitoring the system performance, it is possible to:
- Identify and troubleshoot any issues that may arise with the systems and infrastructure supporting the machine learning models.
- Understand the resource utilization and identify any bottlenecks that may affect the performance of the machine learning models.
- Ensure the availability and reliability of the systems and infrastructure supporting the machine learning models.
- Monitor the security of the systems and infrastructure to detect and prevent any potential security threats.
- Monitor the scalability of the systems and infrastructure to ensure they can handle the increasing load and traffic.
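The checks above usually boil down to comparing collected metrics against thresholds. Here is a minimal sketch, assuming metrics arrive as a dictionary. The threshold values and metric names are hypothetical, and only disk usage is actually read from the machine; the CPU and latency numbers are invented stand-ins for real telemetry.

```python
import shutil


def collect_disk_metrics(path="/"):
    """Read disk usage for the given path using only the standard library."""
    usage = shutil.disk_usage(path)
    fraction = usage.used / usage.total if usage.total else 0.0
    return {"disk_used_fraction": fraction}


def find_breaches(metrics, thresholds):
    """Return the names of metrics that exceed their configured threshold."""
    return sorted(name for name, limit in thresholds.items()
                  if name in metrics and metrics[name] > limit)


# Hypothetical thresholds; tune these for your own serving environment.
THRESHOLDS = {
    "disk_used_fraction": 0.90,
    "cpu_fraction": 0.85,
    "p95_latency_ms": 250,
}

metrics = collect_disk_metrics()
# Invented values standing in for real CPU and latency telemetry.
metrics.update({"cpu_fraction": 0.92, "p95_latency_ms": 180})
print(find_breaches(metrics, THRESHOLDS))
```

In practice a team would more likely feed these metrics into an existing monitoring stack than write the comparison by hand, but the logic is the same.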
Monitoring Model Drift
Model drift happens when the input data a model sees in production no longer follows the distribution of the data it was trained on. In other words, the model's assumptions about the input data are no longer valid, and the model's predictions or decisions may become less accurate. Model drift can occur for various reasons, such as changes in the underlying data distribution, concept drift, or changes in the way the model is used.
There are several types of model drift that can occur:
- Instant drift: There is a sudden and significant change in the distribution of the input data. This can happen, for example, when a new data source is added or when there is a sudden change in the environment in which the data is collected.
- Gradual drift: There is a gradual change in the distribution of the input data over time. This can happen, for example, when the data collection process changes or when the environment in which the data is collected changes over time.
- Recurring drift: The distribution of the input data changes in a repeating pattern. This can happen, for example, when the data collection process changes periodically or when there are seasonal changes in the environment in which the data is collected.
- Temporary drift: The distribution of the input data changes for a short period and then returns to its previous state. This can happen, for example, when there is a temporary change in the data collection process or when there is a temporary change in the environment in which the data is collected.
Teams can monitor model drift by watching for data drift, tracking model performance, running online drift detection, and computing drift metrics. With these methods, organizations can detect model drift early and take appropriate action, such as retraining the model, to address the drift.
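One widely used drift metric is the population stability index (PSI), which compares how a feature is distributed in a reference sample (such as the training data) with how it is distributed in live data. The sketch below is a plain-Python illustration, not a definitive implementation: the equal-width binning, the small floor for empty bins, and the synthetic samples are all choices I made for the example.

```python
import math


def population_stability_index(expected, actual, bins=10):
    """Compute PSI between a reference sample and a live sample of one feature."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp values outside the reference range into the edge bins.
            idx = min(max(int((x - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic data: a reference sample and a copy shifted upward by 3.
reference = [i / 100 for i in range(1000)]
shifted = [x + 3 for x in reference]

psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, shifted)
print(f"no change: {psi_same:.3f}, shifted: {psi_shifted:.3f}")
```

A commonly quoted rule of thumb treats a PSI above 0.25 as a significant shift. Treat that cutoff as a convention to validate on your own data rather than a guarantee.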
Monitoring for Adversarial Attacks
Adversarial attacks on machine learning models are attempts to manipulate a model's inputs or training data so that it makes incorrect predictions or decisions. These attacks are often designed to evade detection, which makes them difficult to defend against. There are several types of adversarial attacks, including:
- Poisoning: This involves injecting malicious data into the training set in order to cause the model to learn incorrect patterns and make incorrect predictions.
- Evasion attacks: These manipulate the input data in a way that causes the model to misclassify it. For example, an attacker might add small perturbations to an image in order to cause a model to misidentify it as a different object.
- Impersonation: This involves creating synthetic data that is designed to be similar to real data but is used to impersonate a different class or user.
- Backdoor attacks: These inject a hidden trigger into the model that causes it to make incorrect predictions or decisions when presented with specific input.
There are several ways to detect and defend against adversarial attacks on machine learning models:
- Anomaly detection: Identify input data that is significantly different from the normal input data.
- Adversarial training: Train the model on adversarial examples so that it is more robust to these attacks.
- Ensemble methods: Use multiple models and compare their predictions. An attack crafted against one model may cause the models' predictions to diverge, which can signal that an attack is happening.
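The ensemble idea from the list above can be sketched as a disagreement rate: the fraction of inputs on which the models do not all agree. The function name, the labels, and the predictions are invented for illustration. A rising rate is only a hint that something is wrong, since models can also disagree on ordinary hard inputs.

```python
def disagreement_rate(predictions_by_model):
    """Fraction of inputs on which the models do not all predict the same label.

    predictions_by_model: list of equal-length lists, one per model.
    """
    if len(predictions_by_model) < 2:
        raise ValueError("need at least two models to compare")
    n_inputs = len(predictions_by_model[0])
    if n_inputs == 0 or any(len(p) != n_inputs for p in predictions_by_model):
        raise ValueError("prediction lists must be non-empty and equal length")
    disagreements = sum(1 for labels in zip(*predictions_by_model)
                        if len(set(labels)) > 1)
    return disagreements / n_inputs


# Invented predictions from three models on the same eight inputs.
model_a = ["cat", "dog", "cat", "cat", "dog", "dog", "cat", "dog"]
model_b = ["cat", "dog", "cat", "dog", "dog", "dog", "cat", "dog"]
model_c = ["cat", "dog", "cat", "cat", "dog", "cat", "cat", "dog"]

rate = disagreement_rate([model_a, model_b, model_c])
print(f"disagreement rate: {rate:.2f}")
```

Comparing this rate with a baseline measured on trusted traffic gives a threshold for alerting; the right threshold depends on how often the models disagree under normal conditions.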
Conclusion
Machine learning models are designed to make predictions or decisions based on input data, but over time, the performance of the model may degrade due to changes in the data or the environment in which the model is used. Monitoring the model's performance allows organizations to detect and address any issues that may arise and to take corrective action to improve the model's performance. Additionally, monitoring the input data and the model's performance can help detect data drift and take appropriate action to retrain the model.
Machine learning model monitoring also covers key areas such as segment drift, accuracy, data quality, concept soundness, fairness, and explainability. These aspects are crucial for maintaining the model's performance, keeping the model fair, transparent, and accountable, and making its behavior easier to understand.