Quality of Service (QoS) and multiple paths
Quality of Service (QoS) is an issue that often arises in high-availability networks as well as in realtime control systems. In the Qnet context, QoS really boils down to transmission media selection: in a system with two or more network interfaces, Qnet chooses which one to use, according to the policy you specify.
QoS policies
Qnet provides the following policies that let you specify how it should select a network interface for transmission:
- loadbalance: Qnet is free to use all available network links, and shares transmission equally among them.
- preferred: Qnet uses one specified link, ignoring all other links (unless the preferred one fails).
- exclusive: Qnet uses one—and only one—link, ignoring all others, even if the exclusive link fails.
The default is loadbalance. Let's look at each policy in more detail:
- loadbalance
- Qnet decides which links to use for sending packets,
depending on current load and link speeds as determined by io-pkt*.
A packet is queued on the link that can
deliver the packet the soonest to the remote end.
This effectively provides greater bandwidth between nodes when
the links are up (the bandwidth is the sum of the bandwidths
of all available links) and allows a graceful degradation
of service when links fail.
If a link does fail, Qnet switches to the next available link. By default, this switch takes a few seconds the first time, because the network driver on the bad link will have timed out, retried, and finally died. But once Qnet knows that a link is down, it doesn't send user data over that link.
The time required to switch to another link can be set to whatever is appropriate for your application using Qnet's command-line options; see the entry for lsm-qnet.so in the Utilities Reference.
Using these options, you can achieve redundant behavior by minimizing the latency that occurs when Qnet switches to another interface after one of the interfaces fails.
While load-balancing among the live links, Qnet sends periodic maintenance packets on the failed link in order to detect recovery. When the link recovers, Qnet places it back into the pool of available links.
- preferred
- With this policy, you specify a preferred link to use for transmissions.
Qnet uses only that one link until it fails.
If your preferred link fails, Qnet then turns to
the other available links and resumes transmission, using the
loadbalance policy.
When your preferred link is available again, Qnet again uses only that link, ignoring all others (unless the preferred link fails).
- exclusive
- You use this policy when you want to lock transmissions to only one link.
Regardless of how many other links are available, Qnet latches onto the one interface you specify.
And if that exclusive link fails, Qnet doesn't use any other link.
Why would you want to use the exclusive policy? Suppose you have two networks, one much faster than the other, and you have an application that moves large amounts of data. You might want to restrict transmissions to only the fast network, in order to avoid swamping the slow network if the fast one fails.
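As a rough illustration of the three policies, the following shell sketch models which links would carry user data for a given policy and set of link states. The function name select_links and its LINK=STATE argument format are hypothetical and aren't part of Qnet. The real selection happens inside Qnet and, for loadbalance, also weighs current load and link speeds, which this toy model ignores.

```shell
# Toy model of Qnet's QoS policies (not Qnet code).
# Usage: select_links POLICY CHOSEN LINK=STATE...
#   POLICY  loadbalance | preferred | exclusive
#   CHOSEN  the link named in the policy ("" for loadbalance)
#   STATE   up | down
# Prints the space-separated links that would carry user data.
select_links() {
    policy=$1
    chosen=$2
    shift 2

    up=""
    chosen_up=no
    for entry in "$@"; do
        name=${entry%%=*}
        state=${entry#*=}
        if [ "$state" = up ]; then
            up="${up:+$up }$name"
            if [ "$name" = "$chosen" ]; then
                chosen_up=yes
            fi
        fi
    done

    case "$policy" in
        loadbalance)
            # All live links share the traffic.
            echo "$up"
            ;;
        preferred)
            # Only the chosen link while it's up; otherwise the other live links.
            if [ "$chosen_up" = yes ]; then echo "$chosen"; else echo "$up"; fi
            ;;
        exclusive)
            # Only the chosen link; nothing at all if it's down.
            if [ "$chosen_up" = yes ]; then echo "$chosen"; else echo ""; fi
            ;;
        *)
            echo "select_links: unknown policy: $policy" >&2
            return 1
            ;;
    esac
}

select_links loadbalance "" en0=up en1=up     # prints: en0 en1
select_links preferred en1 en0=up en1=down    # prints: en0
select_links exclusive en0 en0=down en1=up    # prints an empty line
```

The last call shows the trade-off of exclusive: with en0 down, no link is selected even though en1 is up.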
You specify the QoS policy as part of the pathname. For example, to access /net/node1/dev/ser1 with a QoS of exclusive, you could use the following pathname:
/net/node1~exclusive:en0/dev/ser1
The QoS parameter always begins with a tilde (~) character.
Here we're telling Qnet to lock onto the en0 interface exclusively, even if it fails.
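If you build such pathnames in scripts, a small helper can keep the format in one place. The sketch below is only an illustration: qos_path is a hypothetical function name, and it assumes the /net/node~policy:interface form shown above. Because loadbalance is the default, the helper simply omits the qualifier for that policy.

```shell
# Hypothetical helper: build a QoS-qualified Qnet pathname.
# Usage: qos_path NODE POLICY INTERFACE PATH
qos_path() {
    node=$1
    policy=$2
    iface=$3
    path=$4

    case "$policy" in
        preferred|exclusive)
            # Form shown in the text: /net/<node>~<policy>:<interface><path>
            printf '/net/%s~%s:%s%s\n' "$node" "$policy" "$iface" "$path"
            ;;
        loadbalance)
            # loadbalance is the default, so no qualifier is added.
            printf '/net/%s%s\n' "$node" "$path"
            ;;
        *)
            echo "qos_path: unknown policy: $policy" >&2
            return 1
            ;;
    esac
}

qos_path node1 exclusive en0 /dev/ser1    # prints: /net/node1~exclusive:en0/dev/ser1
```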
Symbolic links
You can set up symbolic links to the various QoS-qualified
pathnames:
ln -sP /net/node1~preferred:en1 /remote/sql_server
This assigns the abstract name /remote/sql_server to the node node1 with a QoS of preferred (i.e., over the en1 link whenever it's available).
Abstracting the pathnames by one level of indirection lets you have multiple servers in a network, all providing the same service. When one server fails, the abstract pathname can be remapped to point to the pathname of a different server. For example, if node1 fails, then a monitoring program could detect this and effectively issue:
rm /remote/sql_server
ln -sP /net/node2 /remote/sql_server
This removes the link to node1 and reassigns the service to node2. The real advantage here is that applications can be coded against the abstract service name rather than being bound to a specific node name.
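The remapping step that such a monitoring program performs could be sketched as a shell function. This is a minimal sketch, not a QNX utility: remap_service and the LN_CMD variable are hypothetical names, and the liveness check (does the target pathname currently resolve?) is an assumption; a real monitor would likely use a check suited to the service.

```shell
# Hypothetical failover helper (not part of QNX).
# Usage: remap_service LINK TARGET...
# Points LINK at the first TARGET that currently resolves.
# LN_CMD defaults to the "ln -sP" form used in the text; it's a variable
# only so that the sketch can be tried on systems where ln lacks -P.
remap_service() {
    link=$1
    shift
    for target in "$@"; do
        # Assumption: a pathname that exists means the node is reachable.
        if [ -e "$target" ]; then
            rm -f "$link"
            ${LN_CMD:-ln -sP} "$target" "$link"
            return
        fi
    done
    echo "remap_service: no target is reachable" >&2
    return 1
}

# Example (on a Qnet node), listing targets in order of preference:
#   remap_service /remote/sql_server /net/node1~preferred:en1 /net/node2
```

Listing node1 first means that rerunning the function after node1 recovers would point the abstract name back at it.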
For a real-world example of choosing an appropriate QoS policy in an application, see Designing a system using Qnet.
