ElastiCache metrics

ref

Monitoring best practices with Amazon ElastiCache for Redis using Amazon CloudWatch

CPU

The tolerance to high EngineCPUUtilization differs for every use case and there is no universal threshold. However, as a best practice, make sure your EngineCPUUtilization is always below 90%.

We recommend setting multiple CloudWatch alarms at different levels for EngineCPUUtilization so you’re informed when each threshold is met (for example, 65% WARN, 90% HIGH) and before it impacts performance.

For smaller nodes with two or fewer CPU cores, monitoring the CPUUtilization is imperative. Because side operations such as snapshots and managed maintenance events need compute capacity and share the node's CPU cores with Redis, the CPUUtilization can reach 100% before the EngineCPUUtilization.

On nodes with few vCPUs (2 or fewer), CPUUtilization becomes the bottleneck first, so that is the metric to monitor there.

-> Set alarms at 65% (warning) and 90% (high).
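The two-level alarm setup could be sketched as parameter sets for CloudWatch's `put_metric_alarm` call. This is only a sketch under stated assumptions: the alarm naming scheme, the 60-second period, the 5 evaluation periods, and the cluster ID `my-redis-001` are hypothetical; only the metric names and the 65%/90% thresholds come from the notes above.

```python
# Build keyword arguments for CloudWatch put_metric_alarm.
# Nothing here calls AWS; the dicts are only constructed and printed.

def cpu_alarm_params(cluster_id, level, threshold,
                     metric="EngineCPUUtilization"):
    """Return put_metric_alarm kwargs for one threshold level."""
    return {
        "AlarmName": f"{cluster_id}-{metric}-{level}",  # hypothetical naming scheme
        "Namespace": "AWS/ElastiCache",
        "MetricName": metric,
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 60,            # assumed: 1-minute datapoints
        "EvaluationPeriods": 5,  # assumed: 5 consecutive breaching datapoints
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }

# Thresholds from the notes: 65% warning, 90% high.
CPU_LEVELS = {"warning": 65, "high": 90}

# "my-redis-001" is a placeholder cluster ID.
alarms = [cpu_alarm_params("my-redis-001", level, threshold)
          for level, threshold in CPU_LEVELS.items()]

if __name__ == "__main__":
    for alarm in alarms:
        print(alarm["AlarmName"], alarm["Threshold"])
```

For nodes with two or fewer vCPUs, the same helper can be called with `metric="CPUUtilization"`. Each dict could then be passed to a CloudWatch client, for example `boto3.client("cloudwatch").put_metric_alarm(**params)`, assuming boto3 and credentials are available.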

Memory

These default values are subject to the reserved memory. Because of this, the maxmemory of your cluster is reduced. For example, the cache.r5.large node type has a default maxmemory of 14037181030 bytes, but if you’re using the default 25% of reserved memory, the applicable maxmemory is 10527885772.5 bytes (14037181030×.75).

Taking reserved memory (25% by default) into account, the memory actually usable on r5.large is roughly 10 GiB (about 9.8 GiB).
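The arithmetic behind that figure can be checked with a few lines. This is a minimal sketch; the function name `usable_maxmemory` is made up for illustration, and the input value is the cache.r5.large default quoted above.

```python
GIB = 1024 ** 3  # bytes per GiB

def usable_maxmemory(maxmemory_bytes, reserved_percent=25):
    """Bytes left for data after subtracting the reserved-memory share."""
    return maxmemory_bytes * (100 - reserved_percent) / 100

# Default maxmemory of cache.r5.large, taken from the quoted text.
R5_LARGE_MAXMEMORY = 14037181030

usable = usable_maxmemory(R5_LARGE_MAXMEMORY)

if __name__ == "__main__":
    print(f"{usable} bytes = {usable / GIB:.2f} GiB")
```

Changing `reserved_percent` shows how much headroom a different reserved-memory setting would leave on the same node type.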

When your DatabaseMemoryUsagePercentage reaches 100%, the Redis maxmemory policy is triggered and, based on the policy selected (such as volatile-lru), evictions may occur. If no object in the cache is eligible for eviction (matching the eviction policy), the write operations fail and the Redis primary node returns the following message: (error) OOM command not allowed when used memory > 'maxmemory'.

When the memory percentage reaches 100%, objects are evicted according to the eviction policy in an attempt to free memory.

If nothing can be evicted, write operations fail with an OOM error.

If your workload isn’t designed to experience evictions, the recommended approach is to set CloudWatch alarms at different levels of DatabaseMemoryUsagePercentage to be proactively informed when you need to perform necessary scaling actions and provision more memory capacity. For cluster mode disabled, scaling up to the next available node type provides more memory capacity. However, for cluster mode enabled, scaling out to progressively increase the memory capacity is the most appropriate solution.

If the workload is not designed with evictions in mind, DatabaseMemoryUsagePercentage should be monitored.

→ Set alarms at 60% (warn) and 90% (high).

Finally, it’s also recommended to implement a CloudWatch alarm for the SwapUsage. This metric should not exceed 50 MB. If your cluster is consuming the swap, verify in the cluster’s parameter group that you have configured enough reserved memory.

Swap usage should not exceed 50 MB → set a notification at 50 MB.
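The memory and swap thresholds in this section can be written down as a small lookup that classifies a single datapoint. This is a rough sketch, not an actual alarm definition: the function name `severity` and the level labels are hypothetical, and it assumes SwapUsage is reported in bytes and that "50 MB" means 50 × 1024 × 1024 bytes.

```python
MIB = 1024 * 1024

# (warn, high) thresholds per metric; None means no warning level is defined.
THRESHOLDS = {
    "DatabaseMemoryUsagePercentage": (60.0, 90.0),  # percent, from the notes
    "SwapUsage": (None, 50 * MIB),                  # bytes, assumed unit
}

def severity(metric, value):
    """Classify one datapoint as OK, WARN, or HIGH."""
    warn, high = THRESHOLDS[metric]
    if value >= high:
        return "HIGH"
    if warn is not None and value >= warn:
        return "WARN"
    return "OK"

if __name__ == "__main__":
    print(severity("DatabaseMemoryUsagePercentage", 72.5))
    print(severity("SwapUsage", 10 * MIB))
```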

Network

ElastiCache and CloudWatch provide several host-level metrics to monitor the network utilization, similar to Amazon Elastic Compute Cloud (Amazon EC2) instances. NetworkBytesIn and NetworkBytesOut are the number of bytes the host has read from the network and sent out to the network. NetworkPacketsIn and NetworkPacketsOut are the number of packets received and sent on the network.

It is a problem if NetworkBytesIn and NetworkBytesOut get close to the 10 Gbit/s upper limit, so send a notification at around 90% of it.
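Because NetworkBytesIn and NetworkBytesOut are byte counts, comparing them against a bandwidth limit needs a unit conversion. The following sketch assumes the value is the sum of bytes over one CloudWatch period and that the limit is 10 Gbit/s taken as 10^10 bit/s; the function name and the sample numbers are made up, and the limit is a parameter because it depends on the node type.

```python
def network_utilization_percent(bytes_in_period, period_seconds,
                                limit_bits_per_sec=10e9):
    """Convert a byte count over a period into a percentage of the bandwidth limit."""
    bits_per_sec = bytes_in_period * 8 / period_seconds
    return bits_per_sec / limit_bits_per_sec * 100

# Hypothetical datapoint: 67.5 GB transferred during a 60-second period.
sample = network_utilization_percent(67_500_000_000, 60)

if __name__ == "__main__":
    print(f"{sample:.1f}% of the assumed limit")
    if sample >= 90:
        print("notify")
```

Expressed the other way round, a 90% threshold on a 10 Gbit/s limit corresponds to about 67.5 GB per 60-second period in each direction.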

Connections

CurrConnections – The number of concurrent and active connections registered by the Redis engine. This is derived from the connected_clients property in the Redis INFO command.

NewConnections – The total number of connections that have been accepted by Redis during a given period of time, regardless of whether these connections are still active or closed. This metric is also derived from the Redis INFO command.

To monitor the connections, you need to remember that Redis has a limit called maxclients. ElastiCache’s default and non-modifiable value is 65,000. In other words, you can use up to 65,000 simultaneous connections per node.

The upper limit for CurrConnections is 65,000.

Too many NewConnections means connections are not being reused, which is not good.

NewConnections can exceed 65,000 (the limit applies only to 65,000 simultaneous connections).
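One way to look at these two metrics together is to compute the headroom against maxclients and the ratio of new to current connections. This is an illustrative sketch only: the function name, the returned field names, and the sample values are hypothetical, and what counts as "too many" new connections is left to the reader.

```python
MAX_CLIENTS = 65_000  # maxclients per node, from the quoted text

def connection_report(curr_connections, new_connections):
    """Summarize connection usage for one period on one node."""
    if curr_connections > 0:
        churn_ratio = new_connections / curr_connections
    else:
        churn_ratio = 0.0 if new_connections == 0 else float("inf")
    return {
        "usage_percent": curr_connections / MAX_CLIENTS * 100,
        "headroom": MAX_CLIENTS - curr_connections,
        # A high ratio suggests connections are opened and closed
        # instead of being reused.
        "churn_ratio": churn_ratio,
    }

if __name__ == "__main__":
    # Hypothetical datapoints: 6,500 open connections, 13,000 accepted in the period.
    print(connection_report(6_500, 13_000))
```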

Latency

Looking at these only when investigating a performance problem seems sufficient.

#ElastiCache