[DAEMON-59] appcd-telemetry: Implement telemetry for abnormal health checks
GitHub Issue | n/a |
---|---|
Type | Improvement |
Priority | Low |
Status | Open |
Resolution | Unresolved |
Affected Version/s | n/a |
Fix Version/s | n/a |
Components | appcd-telemetry |
Labels | n/a |
Reporter | Chris Barber |
Assignee | Chris Barber |
Created | 2017-03-07T19:26:57.000+0000 |
Updated | 2020-02-13T18:05:45.000+0000 |
Description
Tracking the system hardware configuration, software versions, and actions doesn't give us enough technical insight into the daemon's inner workings. We should track memory, CPU, subprocess, and network I/O usage, watch for abnormal conditions such as prolonged CPU usage or high memory usage, and periodically send telemetry events.
A health check involves:
* Main process CPU and memory usage
* Plugin CPU and memory usage
When a threshold is exceeded for a sustained period of time, the system will send an event containing:
* System info
** OS and architecture
** How much memory does the system have?
** How much memory is free?
** How many CPUs are there?
* Daemon info
** Startup time?
** How big is the V8 heap?
** Config file settings
* Daemon runtime info
** Dispatcher requests
** How much CPU is the daemon consuming for the past minute, 5 minutes, 15 minutes?
** How much memory is the daemon consuming for the past minute, 5 minutes, 15 minutes?
** How many subprocesses are actively running?
** How many active client connections are there?
** How much I/O is caused by the client connections?
** What is the load average for the past minute, 5 minutes, 15 minutes?
** What is the resident set size for the past minute, 5 minutes, 15 minutes?
** Filesystem Watcher stats
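The "threshold exceeded for a period of time" rule above could be sketched roughly as follows. This is a minimal illustration, not the appcd implementation: `Sample`, `SustainedThreshold`, `limit`, and `holdMs` are hypothetical names, and the assumption is that samples arrive in timestamp order and that one event per episode is enough.

```typescript
// Hypothetical sketch: none of these names come from appcd itself.
interface Sample {
  timestamp: number; // milliseconds
  value: number;     // e.g. CPU percent or memory in bytes
}

class SustainedThreshold {
  private readonly limit: number;
  private readonly holdMs: number;
  private exceededSince: number | null = null;
  private reported = false;

  constructor(limit: number, holdMs: number) {
    this.limit = limit;
    this.holdMs = holdMs;
  }

  // Returns true once per episode: the first time the value has stayed
  // above `limit` for at least `holdMs`. Dropping back to or below the
  // limit resets the detector so a later episode can be reported again.
  update(sample: Sample): boolean {
    if (sample.value <= this.limit) {
      this.exceededSince = null;
      this.reported = false;
      return false;
    }
    if (this.exceededSince === null) {
      this.exceededSince = sample.timestamp;
    }
    const sustained = sample.timestamp - this.exceededSince >= this.holdMs;
    if (sustained && !this.reported) {
      this.reported = true;
      return true;
    }
    return false;
  }
}
```

One detector instance would be kept per metric and per process (main process, each plugin). When `update()` returns true, the caller would assemble the system, daemon, and runtime info listed above and send it as a single telemetry event.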
The appcd-core process (via the StatusMonitor) and every external plugin child host process each have an Agent that collects health data. Stats are emitted constantly, but the emitted stats do not include the historical data, which has to be fetched manually.
As part of this ticket, it would be great if the Agent had the option to store the collected data in the parent process instead of in the Agent itself. This would open the door to streaming this info, analyzing it, and sending out these telemetry events.
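One possible shape for parent-side storage is sketched below, assuming each Agent forwards its samples to the parent, which keeps a bounded history per source and answers the 1, 5, and 15 minute questions from it. `HealthSample`, `HealthHistory`, and the field names are hypothetical and not part of the existing Agent API.

```typescript
// Hypothetical sketch: a parent-side history store fed by the agents.
interface HealthSample {
  timestamp: number; // milliseconds
  cpu: number;       // percent
  rss: number;       // resident set size in bytes
}

class HealthHistory {
  private readonly samples = new Map<string, HealthSample[]>();
  private readonly maxAgeMs: number;

  constructor(maxAgeMs: number = 15 * 60 * 1000) {
    this.maxAgeMs = maxAgeMs;
  }

  // Store a sample for a source (e.g. "core" or a plugin id) and drop
  // samples older than maxAgeMs relative to the newest one.
  record(source: string, sample: HealthSample): void {
    const list = this.samples.get(source) ?? [];
    list.push(sample);
    const cutoff = sample.timestamp - this.maxAgeMs;
    while (list.length > 0) {
      const oldest = list[0];
      if (oldest === undefined || oldest.timestamp >= cutoff) break;
      list.shift();
    }
    this.samples.set(source, list);
  }

  // Mean of `field` over samples in the half-open window (now - windowMs, now].
  // Returns null when the window holds no samples.
  average(source: string, field: "cpu" | "rss", windowMs: number, now: number): number | null {
    const list = this.samples.get(source) ?? [];
    const recent = list.filter(
      (s) => s.timestamp > now - windowMs && s.timestamp <= now
    );
    if (recent.length === 0) return null;
    let total = 0;
    for (const s of recent) total += s[field];
    return total / recent.length;
  }
}
```

Keeping the history in the parent means a single place can compute the windowed averages for every process and feed them to whatever decides when to send a telemetry event, rather than fetching history from each Agent on demand.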
No comments