An incident is any issue that needs to be worked on and resolved. Incidents often affect your customers. When we receive an incident, we alert and assign it to the right person using the escalation policy attached to its integration.
Examples of incidents include:
CPU utilization monitoring - alerts at high utilization, say 80%
Memory consumption monitoring - alerts at high consumption, say 90%
DB monitoring alerts
DB backup failures
Notification delivery failures
Errors when loading a dashboard
Website downtime detected by uptime monitoring
and many more.
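The threshold-based examples above can be sketched as a simple check. This is only an illustration: the metric names, the `THRESHOLDS` table, and the `breached_metrics` helper are hypothetical and are not part of Spike. The 80% and 90% figures are the example values from the list.

```python
# Hypothetical thresholds, using the example values from the list above.
THRESHOLDS = {"cpu_percent": 80.0, "memory_percent": 90.0}


def breached_metrics(sample: dict) -> list:
    """Return the names of metrics whose value meets or exceeds its threshold.

    Metrics missing from the sample are treated as 0 and never breach.
    """
    return [
        name
        for name, limit in THRESHOLDS.items()
        if sample.get(name, 0.0) >= limit
    ]


if __name__ == "__main__":
    # CPU is above 80%, memory is below 90%, so only CPU would raise an incident.
    print(breached_metrics({"cpu_percent": 85.0, "memory_percent": 40.0}))
```

A monitoring tool would typically run a check like this on each sample and report any breached metric as an incident through an integration.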
An incident can have one of the following statuses: triggered, acknowledged, or resolved.
When an incident is triggered, Spike loads the escalation policy attached to the integration and sends an alert. We continue to send automated alerts according to the escalation policy until the incident is acknowledged. Once the person has received the alert, they can choose to acknowledge or resolve the incident. We do not send alerts once an incident is acknowledged or resolved. While an incident is in the triggered state, repeat incidents are automatically suppressed and logged, which reduces alert fatigue.
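As a rough illustration of alerts that continue until acknowledgement, the sketch below models an escalation policy as a list of timed steps. The `EscalationStep` type, the `alerts_sent` function, the responder names, and the minute-based timing are all assumptions made for this example; they do not describe how Spike stores or evaluates escalation policies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationStep:
    responder: str      # hypothetical responder name
    after_minutes: int  # minutes after the trigger at which this step fires


def alerts_sent(
    policy: List[EscalationStep],
    now: int,
    acknowledged_at: Optional[int] = None,
) -> List[str]:
    """Return the responders alerted by minute `now`.

    A step fires once its delay has elapsed, but only if the incident
    had not yet been acknowledged at that moment.
    """
    alerted = []
    for step in policy:
        due = step.after_minutes <= now
        before_ack = acknowledged_at is None or step.after_minutes < acknowledged_at
        if due and before_ack:
            alerted.append(step.responder)
    return alerted


if __name__ == "__main__":
    policy = [
        EscalationStep("alice", 0),
        EscalationStep("bob", 10),
        EscalationStep("carol", 20),
    ]
    # Nobody acknowledges: every step has fired by minute 25.
    print(alerts_sent(policy, now=25))
    # Acknowledged at minute 5: later steps never fire.
    print(alerts_sent(policy, now=25, acknowledged_at=5))
```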
An acknowledged incident means that work on resolving the incident is ongoing. In this state, we do not send any alerts and the escalation policy is paused where it stands. You can optionally set a timeout for how long an incident may remain acknowledged without being resolved. Once this timeout is reached, we change the status back to triggered and start sending alerts again. In this state, repeat incidents are automatically suppressed and logged, which reduces alert fatigue.
Once an incident has been fixed, you can mark it as resolved. In this state, no alerts are sent, and the escalation policy resets so that it starts from the beginning if a new incident occurs.
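The three statuses and the optional acknowledge timeout can be pictured as a small state machine. The following is a minimal sketch under stated assumptions: the `Incident` class, its method names, and the use of integer minutes as timestamps are hypothetical and are not Spike's API. It models only the status changes, not the escalation policy itself.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    TRIGGERED = "triggered"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"


class Incident:
    """Hypothetical model of the incident lifecycle described above."""

    def __init__(self, ack_timeout_minutes: Optional[int] = None):
        self.status = Status.TRIGGERED
        self.ack_timeout_minutes = ack_timeout_minutes  # None means no timeout
        self.acknowledged_at: Optional[int] = None
        self.suppressed_repeats = 0

    def acknowledge(self, now: int) -> None:
        # Only a triggered incident can be acknowledged.
        if self.status is Status.TRIGGERED:
            self.status = Status.ACKNOWLEDGED
            self.acknowledged_at = now

    def resolve(self) -> None:
        self.status = Status.RESOLVED
        self.acknowledged_at = None

    def tick(self, now: int) -> None:
        """Move back to triggered if the acknowledge timeout has elapsed."""
        if (
            self.status is Status.ACKNOWLEDGED
            and self.ack_timeout_minutes is not None
            and now - self.acknowledged_at >= self.ack_timeout_minutes
        ):
            self.status = Status.TRIGGERED
            self.acknowledged_at = None

    def should_alert(self) -> bool:
        # Alerts are sent only while the incident is triggered.
        return self.status is Status.TRIGGERED

    def record_repeat(self) -> bool:
        """Count a repeat incident as suppressed while this one is still open."""
        if self.status is Status.RESOLVED:
            return False
        self.suppressed_repeats += 1
        return True


if __name__ == "__main__":
    incident = Incident(ack_timeout_minutes=30)
    incident.acknowledge(now=5)
    incident.tick(now=20)
    print(incident.status)   # still acknowledged, timeout not reached
    incident.tick(now=35)
    print(incident.status)   # back to triggered after 30 minutes
    incident.resolve()
    print(incident.should_alert())
```

Keeping the timeout check in a separate `tick` method is a design choice for the sketch: it makes the time-based transition explicit and easy to test without a real clock.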