August 10, 2022

False Positives: The Grey Area of Uptime Monitoring

Discover how to debug and fix uptime 'false positives'

Over the last 20 years, I have worked with many different monitoring solutions to ensure that applications and systems stay operational.

A few years back, working as a consultant, I was having a conversation with the infrastructure team for services hosted in a data center.

I suspected that things were not as stable as the team claimed, so I started collecting firsthand information to test that hypothesis. I truly believe in a data-driven approach, so a monitoring system external to, and independent of, the existing infrastructure was configured to collect information.

The monitoring system found short-lived issues happening very frequently: short enough not to trigger any existing alert, but frequent enough to be concerning. When I presented the data to the infrastructure team, they claimed that the issue was “in the external monitoring system”.

Since, technically speaking, that could be true, another monitoring system was set up in a different cloud, with a different internet provider and a different geographical location.

The readings were the same: small, constant, and repetitive short-lived issues were recorded.

When I presented this information to the infrastructure team, they used the same argument: the second monitoring system was also not working well.

Since that statement could also, technically, be true, a control group was set up. This control group monitored, under the same conditions, very large businesses such as Google, which are known for hiring very talented engineers. The data collected showed almost no issues in the control group, strongly suggesting that the original readings were correct.

In this situation, we are facing two problems:

  • A psychological problem, the infrastructure version of “select isn’t broken”, where an individual (or team) blames a potential problem on an external entity that is outside their control;
  • A communication problem: failing to present the information in a way that is easy to understand, clear enough to justify a deeper investigation, and easy to verify.

After this experience, I checked how other commercial monitoring systems were managing those use cases. In this context, clients classified short-lived issues as false positives.

But the million-dollar question is: is it really a problem of the monitoring system, or is it an indication of a problem in the systems that are being monitored?

This is where things start getting interesting, and it is why MeerkatWatch exists.

Let’s break down this problem a bit further!

State 1: Verifiable UP status

Let’s start easy. Let’s start agreeing!

We have a website, API, or some service connected to the public internet.

  • The monitoring system says that it works (it is UP)
  • You can check that the system works by using a web browser, Postman, or a similar tool. You see the results you expect, with a 2xx-family status code.

It is definitely working!

State 2: Verifiable DOWN status

We have a website, API, or some service connected to the public internet.

  • The monitoring system says that it does not work (it is DOWN)
  • You can check that the website or the API is down using a browser or Postman. You clearly see an error, for example, an error message and a 4xx or 5xx status code.
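If you prefer a script over a browser or Postman, the same manual check can be sketched in a few lines. This is a minimal illustration, not part of MeerkatWatch: the URL is a placeholder, and it simply classifies the response by status code family.

```python
# Minimal sketch of the manual check above: classify a URL by its status code.
# The URL is a placeholder; this is illustrative, not MeerkatWatch code.
import requests

url = "https://example.com/"  # hypothetical endpoint to verify

try:
    response = requests.get(url, timeout=15)
    if 200 <= response.status_code < 300:
        print(f"UP: HTTP {response.status_code}")    # State 1: a 2xx-family response
    else:
        print(f"DOWN: HTTP {response.status_code}")  # State 2: e.g. a 4xx or 5xx error
except requests.exceptions.RequestException as error:
    print(f"DOWN: no usable response ({error})")     # could not reach the service at all
```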

State 3: Intermittent Failures (aka WARNING)

This is the interesting area, the uptime monitoring version of Schrödinger’s cat, which is so easy to mix up with false positives.

MeerkatWatch uses multiple geographical locations: two different, independent systems, in different data centers.

Each monitoring system, on its own, is very reliable. However, at least two confirmations from independent monitoring systems, in different data centers, with different internet providers and geographical locations, need to happen at the same time for MeerkatWatch to report that a system is DOWN.

Readings from each monitoring system are cross-checked against a control group (such as google.com). If the control group readings succeed, the monitoring location is considered healthy.

When only one of those monitoring systems identifies a system as down, we show that activity by triggering a WARNING status. In MeerkatWatch, if you are experiencing short-lived issues, you will see multiple WARNINGs recorded over a period of time.
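To make that decision rule concrete, here is a rough sketch of the logic in Python. The function and location names are illustrative assumptions, not MeerkatWatch’s actual implementation:

```python
# Illustrative sketch of the WARNING / DOWN decision rule described above.
# Names and structure are hypothetical, not MeerkatWatch internals.

def classify(site_checks, control_checks):
    """site_checks: {location: True if the monitored site responded successfully}
    control_checks: {location: True if the control group (e.g. google.com) responded}"""
    # Only trust a failure reading if that location's control-group check passed,
    # i.e. the monitoring location itself is healthy.
    trusted_failures = [
        loc for loc, site_ok in site_checks.items()
        if not site_ok and control_checks.get(loc, False)
    ]
    if len(trusted_failures) >= 2:
        return "DOWN"      # confirmed by at least two independent locations
    if len(trusted_failures) == 1:
        return "WARNING"   # a single location saw a short-lived failure
    return "UP"

# One location saw a failure and both control checks passed -> WARNING
print(classify({"eu-west": False, "us-east": True},
               {"eu-west": True, "us-east": True}))
```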

It is very interesting to see how clients interact with the WARNING status. For example, over a period of 24 hours, a single monitor may collect 50 different WARNING states but only one brief DOWN, recorded when different monitoring locations detected the problem at the same time.

If no WARNINGs were collected for a while and then WARNINGs start showing up, the system is less stable than it was. Something is happening there, and an internal investigation should be opened. Be aware that when you are having short-lived issues, the application is working most of the time. Therefore, if you manually test access to the website or the API, chances are it will work as you expect. You need to read the data carefully and interpret what it means!

The problem is that search engines will also see that the website or API is not reliable, and there will be consequences: sales might not be as good as they should be. SEO might be affected: the client’s website might be outranked by competitors with more reliable solutions. The cost per click in paid advertising might increase, since the algorithms that calculate the actual cost per click use quality metrics, which include website speed and reliability.

Common Causes of Intermittent Failures

Intermittent failures due to HTTP Error Codes

This is the easiest kind of intermittent issue. If an HTTP error code has been collected (such as “Page Not Found” or “Internal Server Error”), the team should be able to trace the problem by searching the logs for the given timestamp.

MeerkatWatch displays the HTTP error code for DOWN and WARNING states, along with a timestamp, so a team should have enough information to trace the problem.
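For instance, if a WARNING reports a 500 error at a given time, the logs can be filtered to a narrow window around that timestamp. Below is a minimal sketch; the log path, log format, and timestamp are assumptions for illustration:

```python
# Sketch: filter an access log around the timestamp reported with a WARNING/DOWN.
# The log path, format (combined access log), and timestamp are assumptions.
from datetime import datetime, timedelta

event_time = datetime(2022, 8, 10, 14, 3, 27)   # timestamp reported by the monitor
window = timedelta(seconds=30)                   # search 30 s either side of it

with open("/var/log/nginx/access.log") as log:
    for line in log:
        # Typical line: ... [10/Aug/2022:14:03:27 +0000] "GET /api HTTP/1.1" 500 ...
        try:
            stamp = line.split("[", 1)[1].split(" ", 1)[0]
            logged_at = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S")
        except (IndexError, ValueError):
            continue                              # skip lines we cannot parse
        if abs(logged_at - event_time) <= window:
            print(line.rstrip())                  # candidate lines to inspect
```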

Intermittent failures due to Connection Timeout

On the Internet, speed matters. A lot. The ideal website load time for mobile sites is 1-2 seconds. 53% of mobile site visits are abandoned if pages take longer than 3 seconds to load. A 2-second delay in load time resulted in abandonment rates of up to 87%.

Since APIs feed data to applications and websites, API response times need to be even faster. In most cases, multiple API calls are required to collect enough information to render a single page, and that page should still load in 1 to 2 seconds.
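A quick, purely illustrative calculation shows why: if a page depends on several API calls made one after another, their response times add up against the 1-2 second budget. The numbers below are assumptions, not measurements:

```python
# Illustrative latency budget: sequential API calls eat into the page load target.
# All numbers are made up for the example.
api_calls_ms = [250, 300, 400, 350]   # hypothetical response time of each API call
page_budget_ms = 2000                 # roughly a 2-second target for the whole page

total_api_ms = sum(api_calls_ms)
print(f"APIs alone: {total_api_ms} ms of the {page_budget_ms} ms budget")
print(f"Left for rendering, assets and network: {page_budget_ms - total_api_ms} ms")
```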

MeerkatWatch has a timeout of 15 seconds for web and API monitoring.

Given the target of roughly 2 seconds to serve a visitor, 15 seconds is an eternity. If the resource being tested takes more than 15 seconds to respond, a timeout error is raised. If one location identifies a timeout, a WARNING status is displayed; if it is confirmed by two independent monitoring locations at the same time, a DOWN status is triggered.
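As a rough sketch of what a single probe might look like, the snippet below gives up after 15 seconds and classifies the result. The URL is a placeholder and the code is illustrative, not MeerkatWatch’s implementation:

```python
# Sketch of a single probe with a 15-second timeout, as described above.
# The URL is a placeholder; this is illustrative, not MeerkatWatch code.
import requests

def probe(url):
    try:
        response = requests.get(url, timeout=15)   # give up if no response in 15 s
    except requests.exceptions.Timeout:
        return "TIMEOUT"   # from a single location this would surface as a WARNING
    except requests.exceptions.RequestException:
        return "ERROR"     # DNS failure, connection refused, etc.
    return "OK" if response.ok else f"HTTP {response.status_code}"

print(probe("https://example.com/"))
```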

The problem is that the customer might see the application working when they manually test it. If the slowness is consistent, the request will simply take more than 15 seconds but eventually complete with a success status code. This makes the problem much harder to trace through error codes in the logs, unless more sophisticated application performance monitoring is implemented in the customer’s solution.

There might be more going on: maybe not all the requests were slow. Maybe the problem only affected 5% of the requests!

There are more possible causes: MeerkatWatch might never have reached the client’s application at all. Maybe there was a problem in the client’s data centre, the client’s cloud internet connection, or any other layer in front of the actual application.

Those problems are hard to identify, but if you see WARNINGs due to connection timeouts coming from different locations, there should be no doubt that something is happening.

At MeerkatWatch, we constantly check our monitoring systems against a control group. This control group closely monitors the network performance, so we avoid sending false positives to our clients.

What now?

We are currently very fortunate that we can collect statistical information from thousands of websites.

Our first short-term objective is to help with the grey area of uptime monitoring. But we would love your feedback!

  • What type of information would you like to see that would help you figure out what is going on when you are experiencing intermittent failures?
  • Could you share your experience in which you thought that your favourite monitoring system detected a false positive?

Stay tuned! At MeerkatWatch, we are actively working to surface more useful information when intermittent issues are happening. Among many other things, we are developing predictive algorithms that collect bits of warnings and show a much clearer picture of what happened, what is happening, and what is (likely) to happen!

We are looking forward to your feedback at support@meerkatwatch.com!