Sinks may implement a healthcheck as a means for validating their configuration against the envionment and external systems. Ideally, this allows the system to inform users of problems such as insufficient credentials, unreachable endpoints, non-existant tables, etc. They're not perfect, however, since it's impossible to exhaustively check for issues that may happen at runtime.
When implementing healthchecks, we prefer false positives to false negatives. This means we would prefer that a healthcheck pass and the sink then fail than to have the healthcheck fail when the sink would have been able to run successfully.
A common cause of false negatives in healthchecks is performing an operation that the sink itself does not need. For example, listing all of the available S3 buckets and checking that the configured bucket is in that list. The S3 sink doesn't need the ability to list all buckets, and a user that knows that may not have given it permission to do so. In that case, the healthcheck will fail due to bad credentials even through its credentials are sufficient for normal operation.
This leads to a general strategy of mimicking what the sink itself does. Unfortunately, the fact that healthchecks don't have real events available to them leads to some limitations here. The most obvious example of this is with sinks where the exact target of a write depends on the value of some field in the event (e.g. an interpolated Kinesis stream name). It also pops up for sinks where incoming events are expected to conform to a specific schema. In both cases, random test data is reasonably likely to trigger a potentially false negative result. Even in simpler cases, we need to think about the effects of writing test data and whether the user would find that surprising or invasive. The answer usually depends on the system we're interfacing with.
In some cases, like the Kinesis example above, the right thing to do might be nothing at all. If we require dynamic information to figure out what entity (i.e. Kinesis stream in this case) that we're even dealing with, odds are very low that we'll be able to come up with a way to meaningfully validate that it's in working order. It's perfectly valid to have a healthcheck that falls back to doing nothing when there is a data dependency like this.
With all that in mind, here is a simple checklist to go over when writing a new healthcheck:
Does this check perform different fallible operations from the sink itself?
Does this check have side effects the user would consider undesirable (e.g. data pollution)?
Are there situations where this check would fail but the sink would operate normally?
Not all of the answers need to be a hard "no", but we should think about the likelihood that any "yes" would lead to false negatives and balance that against the usefulness of the check as a whole for finding problems. Because we have the option to disable individual healthchecks, there's an escape hatch for users that fall into a false negative circumstance. Our goal should be to minimize the likelihood of users needing to pull that lever while still making a good effort to detect common problems.