IT-Monitoring

Event Escalation

There is no such thing as fate; there are only wrong reactions to events...

Controlled Escalation Management

In many cases, redundant hardware reduces the need to eliminate faults within cloud infrastructures. However, redundancy cannot be realized at every operational level. At the software level in particular, processes for eliminating problems in a coordinated, controllable way should be defined early on.

Successful escalation management is characterized by identifying error scenarios in advance and proactively designing effective resolution processes. The logic of these workflows and the processes to be activated are defined by means of escalation schemes. The monitoring system then initiates and controls the workflows automatically whenever defined events occur.

Configurable Escalation Schemes
The reaction to monitoring events is realized by individually configurable schemes. Which scheme is chosen depends, among other things, on the event classification.
Specific Workflows for Event Remediation
Schemes contain specific workflows that depend on the affected infrastructure, the operational level and time specifications such as 24/7 operation, night hours or holidays.
Monitoring Event Prioritization
Prioritization logic allows for cascading workflows. Every single workflow is checked for effectiveness, and the escalation ends automatically as soon as a workflow has been effective.
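
The cascading behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the product's implementation: the names `Workflow`, `run_escalation` and the example workflows are hypothetical and do not come from the monitoring system itself.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Workflow:
    """One remediation step in a scheme (hypothetical structure)."""
    name: str
    action: Callable[[], None]        # remediation to execute
    is_effective: Callable[[], bool]  # check performed after the action


def run_escalation(workflows: List[Workflow]) -> Optional[str]:
    """Run workflows in priority order and stop at the first effective one.

    Returns the name of the workflow that resolved the event, or None if
    every workflow was tried without success.
    """
    for workflow in workflows:
        workflow.action()
        if workflow.is_effective():
            return workflow.name  # escalation ends here
    return None


# Example: the restart does not help, the failover does.
state = {"healthy": False, "executed": []}

def restart_service() -> None:
    state["executed"].append("restart-service")

def fail_over() -> None:
    state["executed"].append("fail-over")
    state["healthy"] = True

def notify_on_call() -> None:
    state["executed"].append("notify-on-call")

scheme = [
    Workflow("restart-service", restart_service, lambda: state["healthy"]),
    Workflow("fail-over", fail_over, lambda: state["healthy"]),
    Workflow("notify-on-call", notify_on_call, lambda: True),
]
resolved_by = run_escalation(scheme)
```

In this sketch the third workflow is never executed, because the escalation ends once the failover has been checked and found effective. A return value of `None` would be the point at which alerting takes over.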

Automatic recovery

Automatically initiated recovery processes significantly reduce the MTTR (mean time to recovery) and in many cases help to avoid interruptions in operation entirely. For the self-management of virtual cloud infrastructures, we have implemented automatic reactions to monitoring events in three categories: Self-Configuring, Self-Healing and Self-Optimizing.

These processes are part of our configuration management system and are incorporated into the escalation management workflows via interface modules.

Self-Configuring

This category contains processes that control the provisioning of system resources according to demand.

A classic example is adding and removing virtual systems in a Private Cloud infrastructure.

Self-Healing

These are processes that are activated proactively once an anomaly has been detected.

A controlled application restart after a memory leak has been detected is one form of Self-Healing. In a clustered setup, these actions can be carried out without interrupting service.
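
As a rough illustration of this idea, the sketch below flags a suspected memory leak from a window of memory samples and then restarts cluster nodes one at a time. Everything here is an assumption made for the example: the function names, the simple "never shrinks and grew by a threshold" heuristic and the 100 MB default are hypothetical, and a real anomaly detector would likely be more sophisticated.

```python
from typing import Callable, List, Sequence


def looks_like_leak(samples_mb: Sequence[float],
                    min_growth_mb: float = 100.0) -> bool:
    """Heuristic: memory never decreased across the window and grew by
    at least min_growth_mb from the first to the last sample."""
    if len(samples_mb) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples_mb, samples_mb[1:]))
    return never_shrinks and (samples_mb[-1] - samples_mb[0]) >= min_growth_mb


def rolling_restart(nodes: Sequence[str],
                    restart: Callable[[str], None],
                    is_healthy: Callable[[str], bool]) -> List[str]:
    """Restart nodes one by one so the remaining nodes keep serving.

    Stops at the first node that does not come back healthy and returns
    the nodes that were restarted successfully.
    """
    restarted: List[str] = []
    for node in nodes:
        restart(node)
        if not is_healthy(node):
            break  # leave the remaining nodes untouched
        restarted.append(node)
    return restarted


# Example: steadily growing memory triggers a rolling restart.
restart_log: List[str] = []
if looks_like_leak([500, 560, 640, 700]):
    done = rolling_restart(["app-1", "app-2"], restart_log.append,
                           lambda node: True)
```

Restarting only one node at a time is what keeps the service available in a cluster; stopping at the first unhealthy node avoids taking further capacity offline when a restart goes wrong.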

Self-Optimizing

These are functions to adapt system parameters and resources such as vCPUs, memory or disk capacities.

The adaptive load distribution performed by the local traffic management to achieve optimal resource utilization is also part of Self-Optimizing.

Alerting and communication

If the event analysis does not yield an unambiguous reaction that can be automated, classic escalation based on alerting and communication comes into play. These procedures also apply if automatic processes fail to bring the desired results.

Escalation schemes offer various configuration parameters such as calendar, repetition interval or throttling. In combination with event analysis, this ensures that only notable events are escalated. Schemes with an alert function feature the following:

  • An integrated NetChat module for direct communication among team members for documentation and processing
  • Detailed alerts via e-mail, including relevant information from the event analysis and the evaluated metrics
  • SMS alerts with a short version of the event and specific links for Web-Console access
  • Automatic ticket creation for further processing on other systems
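
The throttling and repetition interval mentioned above could work along the lines of the following sketch. It is only an assumption about how such a parameter might behave: the class `AlertThrottle`, its field names and the event keys are hypothetical and not taken from the product.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AlertThrottle:
    """Suppress repeated alerts for the same event within an interval."""
    repeat_interval_s: float  # minimum gap between alerts per event key
    _last_sent: Dict[str, float] = field(default_factory=dict)

    def should_send(self, event_key: str, now_s: float) -> bool:
        """Return True if an alert for event_key may be sent at now_s."""
        last = self._last_sent.get(event_key)
        if last is not None and now_s - last < self.repeat_interval_s:
            return False  # still inside the repetition interval
        self._last_sent[event_key] = now_s
        return True


# Example: at most one alert per event every five minutes.
throttle = AlertThrottle(repeat_interval_s=300)
first = throttle.should_send("db1/disk-full", now_s=0.0)
repeat = throttle.should_send("db1/disk-full", now_s=120.0)
later = throttle.should_send("db1/disk-full", now_s=300.0)
```

Timestamps are passed in explicitly here so the behaviour is easy to follow; each event key is throttled independently, so a different event on another system is not held back.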

Hierarchic control

Another advantage of the hierarchic data model is the inheritance of escalation schemes, which guarantees that the same events are handled identically on every system within the Virtual Cloud infrastructure. Specific schemes can also be anchored within the hierarchy in order to configure exceptional cases.

Generally speaking, hierarchic functions drastically simplify the control of the different monitoring processes: monitoring as a whole, event processing or the recording of metrics can be activated or deactivated for entire hierarchy levels with a single mouse click or via calendar-controlled schemes.
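
The inheritance of escalation schemes described in this section can be pictured as a lookup that walks up the hierarchy until it finds an anchored scheme. The sketch below is a minimal model under that assumption; `Node`, `effective_scheme` and the scheme names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """An element of the monitoring hierarchy (hypothetical model)."""
    name: str
    parent: Optional["Node"] = None
    scheme: Optional[str] = None  # set only where a scheme is anchored


def effective_scheme(node: Node) -> Optional[str]:
    """Return the nearest scheme anchored on the node or its ancestors."""
    current: Optional[Node] = node
    while current is not None:
        if current.scheme is not None:
            return current.scheme
        current = current.parent
    return None


# Example: one default scheme at the top, one exception further down.
cloud = Node("virtual-cloud", scheme="default-24x7")
cluster = Node("db-cluster", parent=cloud)
db1 = Node("db1", parent=cluster)
legacy = Node("legacy-app", parent=cloud, scheme="business-hours-only")

inherited = effective_scheme(db1)
overridden = effective_scheme(legacy)
```

Systems without their own scheme inherit the one anchored above them, while a scheme anchored lower in the hierarchy covers the exceptional case.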