You may have noticed some periods of instability with our Forum network recently. This has proved hard to diagnose, because the management interfaces on our network switches were among the first things to be hit, which made it difficult to obtain useful data (indeed, any data at all) during an event. However, we believe we have now accumulated enough evidence to come to a tentative conclusion.
The first thing to say is that there hasn’t been one single cause. Rather, instability has been due to several things happening to occur at the same time. Ultimately, though, the effect has been to overwhelm the CPUs in our older-model switches, which have then started to miss various important pieces of housekeeping. In particular:
- As mentioned, the management interfaces stopped responding. This resulted in us losing logging and traffic data, making it much harder to look back afterwards to see what had happened.
- The older switches missed some DHCP exchanges, which they normally track to help manage IP address use on the network for security reasons. As a result, self-managed machines which were trying to acquire or renew their leases would be blocked, though existing leases appeared to be honoured correctly.
- Ultimately, spanning-tree packets would be lost. Loops would then form in the network, and some links would be swamped as a result. This appeared to affect wireless users more than wired users. These high traffic levels then meant that it would take longer for the network to converge again afterwards.
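To illustrate the DHCP symptom described above, here is a minimal sketch of how a switch that tracks DHCP exchanges might end up blocking new leases while still honouring existing ones. It is a simplified model rather than a description of what our switches actually do internally: the `SnoopingPort` class, its methods, and the `cpu_overloaded` flag are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SnoopingPort:
    """Hypothetical model of one switch port that tracks DHCP leases."""

    # Maps a client MAC address to the IP address it was seen to lease.
    bindings: dict = field(default_factory=dict)

    def observe_ack(self, mac: str, ip: str, cpu_overloaded: bool) -> None:
        # Assumption: an overloaded CPU never gets to process the DHCP
        # acknowledgement, so no binding is recorded for that client.
        if not cpu_overloaded:
            self.bindings[mac] = ip

    def permits(self, mac: str, ip: str) -> bool:
        # Traffic is forwarded only if it matches a recorded binding.
        return self.bindings.get(mac) == ip


port = SnoopingPort()
port.observe_ack("aa:aa", "10.0.0.5", cpu_overloaded=False)  # lease seen
port.observe_ack("bb:bb", "10.0.0.6", cpu_overloaded=True)   # lease missed

print(port.permits("aa:aa", "10.0.0.5"))  # existing lease is still honoured
print(port.permits("bb:bb", "10.0.0.6"))  # newly acquired lease is blocked
```

In this model the second client did receive an address from the DHCP server, but because the switch never saw the exchange, it has no record that the address is legitimate and so drops the traffic.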
As for the various causes, we have identified the following. They are all somewhat variable in effect, and generally aren’t an issue individually. It’s when several happen to combine that problems occur.
- The older-model switches are underpowered. In the medium term they are due to be upgraded to newer, more powerful models. Meanwhile we have removed as much “unnecessary” processing as we can.
- In particular, we no longer give multicast traffic any special processing. The cost is that we now have to propagate it more widely across the network, including to end-systems which don’t particularly want it and have to process it only to throw it away.
- We identified unexpectedly high levels of IPv6 multicast traffic coming in from outside. This was on a subnet which we carried for E&B, and as it was no longer required by them we have removed it completely. We queried this with IS, and it turns out that they were also investigating poor performance on the same subnet, so it appears that whatever the machine generating this traffic was doing was affecting more than just Informatics. We now believe that this outside traffic was what finally tipped our older switches over the edge, and that this also explains the peculiar timing of some of the events.
- However, along the way we made quite a few configuration changes in order to remove potential sources of instability, and unfortunately it looks as though we were running into bugs, or at least features, in the way that the older switches’ management interfaces were implemented. Thus, some of the instability was self-induced, for which we apologise! It also took a bit longer than we would have hoped to identify this, due to the effects being confounded with the other ongoing instability. At least we now know better what to expect and how to try to minimise the effects.
- There is still one issue which we believe only affects older Macs with wired network connections, and which is resolved by a reboot. It’s still not at all clear where the fault lies.
We still have some network configuration changes to make, some of which may result in a few more short glitches. We’ll try to keep those to a minimum, and may be able to reduce the impact further with some code changes to our management tools.
As usual, our technical network documentation is available to browse for those of you who would like a more detailed picture of the Informatics network.