The use of machine learning in network operations has become the next big thing, steadily displacing software-defined networks in the minds of many network engineers. But before the networking world enters the steep slope of the hype cycle, now is a good time to consider the realistic tradeoffs of using machine learning systems compared to human-only interactions.
The broad promise of machine learning in network engineering is that your network can be turned over to a machine learning system and left alone to run itself. Presumably, at some point, the network will tell you when and where you need more resources, so perhaps even network design will become simpler.
All this promise is offset to some degree, however, by real-world experience with machine learning systems. For example, the Amazon warehouse system is a wonder of machine learning being used at the intersection of human-to-machine interaction. In Amazon's case, warehouse packing is based on machine learning to determine where product shelves should be located at specific times.
That arrangement suits machines, not people: it's extremely inefficient for a human worker to try to find something in a warehouse organized by products commonly ordered together (see sidebar).
Amazon implemented machine learning systems in its New Jersey fulfillment center to more efficiently assemble customer orders. Instead of Amazon workers walking around the warehouse to find the necessary products to fill orders, Amazon uses machine learning and robots to arrange the shelves and even move the shelves to the worker.
In network engineering terms, imagine telling a machine learning system to find the optimal set of quality-of-service (QoS) parameters for a set of applications running on the network. The best choice will most likely be one QoS queue per application, with each queue set to make its application perform optimally, while also considering the traffic demands of the other applications. For humans, however, managing even four to eight QoS queues, or classes of service, is barely practical, let alone one queue per application. So, can we really build a network like Amazon builds a warehouse?
Moving forward with machine learning systems
Humans are not likely to be removed from the computer networking world for many years to come. The Amazon warehouse floor is likely a much simpler problem than a network: the movement of physical shelves is far less complex than the movement of packets through an already complex network. This implies that making room for human interaction is going to be far different in network operations than in building a warehouse.
So, the network operations world needs to find some way to include machine learning in its set of tools, while preserving an interaction surface humans can use. At least two potential paths forward present themselves.
First, machine learning tools could be scoped more tightly, instead of making a broad request that will generate a wide range of results. For example, instead of telling a machine learning system to find the optimal QoS settings for these applications, you could say: "Find the optimal division of a limited number of queues, and the settings for those queues, for this set of applications."
By limiting the complexity of the solution the machine learning system is allowed to create, it's possible to contain the complexity of what the machine learning system generates within a set of bounds humans can understand and manage. Even when humans are taken out of the loop in specific network management tasks, allowing complexity to proliferate uncontrollably is not a good idea.
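As a rough illustration of what bounding the solution might look like, the sketch below groups applications into at most a fixed number of queues. It is not a machine learning system: a simple largest-gap heuristic stands in for whatever optimizer would actually be used, and the function name `assign_queues`, the `max_queues` parameter, and the latency figures are all assumptions made up for this example. The point is only that the human-chosen limit is part of the request, so the result can never contain more queues than an operator is prepared to manage.

```python
from typing import Dict, List


def assign_queues(latency_targets_ms: Dict[str, float],
                  max_queues: int) -> List[List[str]]:
    """Group applications into at most max_queues queues.

    Hypothetical stand-in for a learned optimizer: applications are sorted
    by latency target, and the sorted list is split at the widest gaps so
    that applications with similar needs share a queue.
    """
    if max_queues < 1:
        raise ValueError("max_queues must be at least 1")

    apps = sorted(latency_targets_ms, key=latency_targets_ms.get)
    if not apps:
        return []

    # Gap between each application and its neighbor in sorted order.
    gaps = [
        (latency_targets_ms[apps[i + 1]] - latency_targets_ms[apps[i]], i)
        for i in range(len(apps) - 1)
    ]

    # The human-imposed bound: keep only the (max_queues - 1) widest gaps.
    n_cuts = min(max_queues - 1, len(gaps))
    cuts = sorted(i for _, i in sorted(gaps, reverse=True)[:n_cuts])

    queues: List[List[str]] = []
    start = 0
    for cut in cuts:
        queues.append(apps[start:cut + 1])
        start = cut + 1
    queues.append(apps[start:])
    return queues


# Illustrative latency targets in milliseconds (made-up values).
targets = {"voice": 20, "video": 40, "trading": 5,
           "web": 200, "backup": 2000, "email": 500}

for number, members in enumerate(assign_queues(targets, max_queues=4), 1):
    print(f"queue {number}: {', '.join(members)}")
```

Raising or lowering `max_queues` changes how finely the applications are split, but the ceiling itself stays under the operator's control rather than the optimizer's.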
Second, machine learning tools can be partitioned off in the network by choosing clear jobs the machine learning system should perform, with clear boundaries around those jobs. This is where the concept of the API comes into play. The idea behind an API in the software development field is much like the idea of route aggregation in network design. They are both generalized ways to divide complexity from complexity by abstracting out details of the subsystem behind the API.
Returning to the QoS example above, a network operator could allow the machine learning system to operate somewhat autonomously within a single sphere of the network. This could be accomplished by consciously dividing the work that needs to be done to provide QoS from the rest of the network operation. The sphere handed to the machine learning system could include the discovery of different kinds of traffic flowing through the network, the separation of traffic into different classes, marking traffic so it falls within a class of service, and the creation and management of the appropriate queues corresponding to these classes of service.
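One way to picture that boundary is as a narrow interface the rest of the network operation calls, with everything else hidden behind it. The sketch below is a minimal illustration under stated assumptions: the names `Flow`, `QosService`, `StaticQosService`, and `mark` are hypothetical, and the fixed lookup table merely stands in for a learned classifier. The only real-world values used are the DSCP code points 46 (Expedited Forwarding) and 0 (best effort).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """Hypothetical, heavily simplified description of a traffic flow."""
    app: str


class QosService(ABC):
    """The narrow surface the rest of the network is allowed to see."""

    @abstractmethod
    def classify(self, flow: Flow) -> str:
        """Return the class of service a flow belongs to."""

    @abstractmethod
    def dscp_for(self, service_class: str) -> int:
        """Return the DSCP marking used for a class of service."""


class StaticQosService(QosService):
    """Stand-in implementation using fixed rules.

    A machine learning system could replace this class without any caller
    changing, as long as it honors the same two methods.
    """

    _REALTIME_APPS = {"voice", "video"}
    _DSCP = {"realtime": 46, "default": 0}  # 46 = Expedited Forwarding

    def classify(self, flow: Flow) -> str:
        return "realtime" if flow.app in self._REALTIME_APPS else "default"

    def dscp_for(self, service_class: str) -> int:
        return self._DSCP[service_class]


def mark(flow: Flow, qos: QosService) -> int:
    """Caller-side code: depends only on the interface, not on how it works."""
    return qos.dscp_for(qos.classify(flow))


qos = StaticQosService()
print(mark(Flow("voice"), qos))
print(mark(Flow("backup"), qos))
```

How queues are created and tuned stays entirely inside the service; callers such as `mark` never see it, which is the kind of detail hiding the API comparison describes.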
These kinds of abstractions will always have leaks. For example, QoS will always overlap with traffic engineering and any sort of dynamic bandwidth mechanisms, so these factors will need to be taken into account.
Planning for machine learning's constraints
The primary shift in thinking is moving from what might be called places in the network -- or modularization along strictly topological boundaries -- to network as a service, which sees the network as a set of services interacting with one another. In the places-in-the-network model, APIs would be located where any set of places intersect, such as the data center and the core, or the campus and the core.
But in the network-as-a-service model, the total set of services required to support the business are divided up into logical subservices, like QoS or security. These services interact over the whole network to build a complete solution. There are still topological division points in this model, used to create failure domains and control state, but these intersect with services, rather than terminate them.
Machine learning systems can and will play a role in the future of network design and operation. The role they play needs to be well thought out, however. Machine learning needs to be constrained to prevent the growth of unnecessary complexity. Additionally, APIs need to be carefully positioned to provide the kinds of failure domain separation and problem scoping that have proven necessary to the successful operation of networks.