Taking a Byte Out of Cloud Storage Costs

by Ben Hundley

Storage costs aren’t usually the first consideration in a new cloud deployment. For me, personally, compute costs are top of mind when sizing out a cluster. But for any project that needs long-term monitoring, logging, or persistent time series data, storage costs tend to creep up over time. Almost every long-term project I’ve worked on looked at storage first when the cost cutting started.

There are two main types of storage in the cloud: block and object. Block storage refers to the underlying hardware devices that write and read data in blocks or chunks -- hard disk drives (HDDs), solid-state drives (SSDs) -- as well as storage area networks (SANs) like Amazon’s EBS or Google’s Persistent Disks. Object storage is a newer, higher-level kind of storage where files are written and read in full (not in chunks) and accessed through a network protocol like HTTP instead of through drivers on the operating system. Amazon S3 and Google Cloud Storage are two examples.
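
To make the access-model difference concrete, here is a toy sketch: a block device is mounted by the OS and read like any other file, while an object store is reached over HTTP(S) through an API call. The device path, bucket, and key below are placeholders, and boto3 is assumed.

    import boto3

    # Block storage: the EBS volume (or local SSD) is mounted by the OS, so the
    # application just does ordinary file I/O against it.
    with open("/mnt/data/events.log", "rb") as f:
        first_chunk = f.read(4096)  # read one 4 KiB chunk

    # Object storage: the whole object is fetched over the network via an API call.
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket="my-bucket", Key="events.log")
    entire_file = obj["Body"].read()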

Block storage is used for databases, file systems, boot images, queues, and anything else that needs low latency or high throughput. Object storage has a wide variety of uses as well, but the most common are long-term backups and content delivery (CDNs). Most modern applications use a mix of both.

Remove orphaned volumes

One of the simplest ways to cut unnecessary storage costs is to check for orphaned volumes. An orphaned volume is a network block device (such as an EBS volume) that is unattached and not being used. There are several ways this situation arises -- for example, deleting K8S nodes before the pods using their volumes are cleaned up -- and while prevention is a solid strategy, it’s a lot easier to automate a recurring cleanup. One way to achieve this (described here) is to use a serverless function, through something like AWS Lambda, to delete any volumes that have been unattached for longer than X days.
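
Here’s a minimal sketch of that cleanup function, assuming boto3, an IAM role that allows ec2:DescribeVolumes, ec2:CreateTags, and ec2:DeleteVolume, and a daily schedule trigger. The tag name and the 14-day threshold are arbitrary choices; since the volume API doesn’t report a detach time directly, the function stamps a tag the first time it sees a volume unattached and deletes the volume once that stamp is old enough.

    import datetime

    import boto3

    RETENTION_DAYS = 14
    TAG_KEY = "observed-unattached"  # hypothetical tag name

    ec2 = boto3.client("ec2")


    def handler(event, context):
        today = datetime.date.today()
        paginator = ec2.get_paginator("describe_volumes")
        # Only consider volumes that are not attached to any instance.
        pages = paginator.paginate(Filters=[{"Name": "status", "Values": ["available"]}])
        for page in pages:
            for volume in page["Volumes"]:
                tags = {t["Key"]: t["Value"] for t in volume.get("Tags", [])}
                if TAG_KEY not in tags:
                    # First time we've seen this volume unattached: stamp it and move on.
                    ec2.create_tags(
                        Resources=[volume["VolumeId"]],
                        Tags=[{"Key": TAG_KEY, "Value": today.isoformat()}],
                    )
                elif (today - datetime.date.fromisoformat(tags[TAG_KEY])).days >= RETENTION_DAYS:
                    ec2.delete_volume(VolumeId=volume["VolumeId"])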

Reduce snapshot costs

For mission-critical applications, backups are essential. A common approach is to rely on the volume-level snapshots provided by the cloud. Amazon’s EBS snapshots, for example, make backup and restoration painless. They’re also incremental, meaning only the changes between snapshots are stored, which keeps them relatively cost-efficient.

However, for update-heavy datasets, there’s definitely money to be saved by gradually lowering the granularity of the backups as they age. For example, you might keep 24 hourly snapshots, but only 7 daily snapshots behind that, 4 weekly snapshots behind that, and so on. This could be automated with a serverless function, but AWS also has the Data Lifecycle Manager, which lets you define a schedule right from the UI.
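
As a sketch of the serverless-function route (assuming boto3 and permission to describe and delete snapshots), the snippet below keeps every snapshot from the last 24 hours, then one per volume per day out to a week, and one per volume per week beyond that. The retention windows are purely illustrative.

    import datetime

    import boto3

    ec2 = boto3.client("ec2")
    now = datetime.datetime.now(datetime.timezone.utc)


    def bucket_for(age):
        """Map a snapshot's age onto the granularity we want to keep at that age."""
        if age <= datetime.timedelta(hours=24):
            return None                    # keep every hourly snapshot
        if age <= datetime.timedelta(days=7):
            return ("daily", age.days)     # keep one snapshot per day
        return ("weekly", age.days // 7)   # keep one snapshot per week


    snapshots = []
    for page in ec2.get_paginator("describe_snapshots").paginate(OwnerIds=["self"]):
        snapshots.extend(page["Snapshots"])

    keep, delete = set(), []
    # Newest first, so the first snapshot seen in each bucket is the one we keep.
    for snap in sorted(snapshots, key=lambda s: s["StartTime"], reverse=True):
        bucket = bucket_for(now - snap["StartTime"])
        if bucket is None:
            continue
        key = (snap["VolumeId"],) + bucket
        if key in keep:
            delete.append(snap["SnapshotId"])
        else:
            keep.add(key)

    for snapshot_id in delete:
        ec2.delete_snapshot(SnapshotId=snapshot_id)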

For all the convenience of EBS snapshots, they come with a price. At the time of writing, snapshots cost over 2x as much as data stored in S3 (object storage). Amazon actually stores the snapshot data in S3, but charges a premium for the snapshot service. A simple way to save money is to offload older, infrequently accessed snapshots to S3 yourself. You pay for network transfer with S3, but since these are backups that are only touched during disaster recovery, the costs should be minimal.
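
One way to script that offload is with the EBS direct APIs (list_snapshot_blocks and get_snapshot_block). The sketch below is only illustrative: the snapshot ID, bucket, and one-object-per-block key layout are invented, and restoring would mean reassembling the blocks yourself -- dedicated backup tooling handles this more gracefully. Once the copy is verified, the original EBS snapshot can be deleted.

    import boto3

    SNAPSHOT_ID = "snap-0123456789abcdef0"   # placeholder
    BUCKET = "my-cold-backups"               # placeholder

    ebs = boto3.client("ebs")
    s3 = boto3.client("s3")

    next_token = None
    while True:
        kwargs = {"SnapshotId": SNAPSHOT_ID}
        if next_token:
            kwargs["NextToken"] = next_token
        page = ebs.list_snapshot_blocks(**kwargs)
        for block in page["Blocks"]:
            data = ebs.get_snapshot_block(
                SnapshotId=SNAPSHOT_ID,
                BlockIndex=block["BlockIndex"],
                BlockToken=block["BlockToken"],
            )["BlockData"].read()
            s3.put_object(
                Bucket=BUCKET,
                Key=f"{SNAPSHOT_ID}/{block['BlockIndex']:08d}",
                Body=data,
                StorageClass="STANDARD_IA",   # cheaper class for rarely-read backups
            )
        next_token = page.get("NextToken")
        if not next_token:
            break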

Additionally, many databases and applications can manage their own incremental snapshots directly against S3. This removes the need for EBS snapshots entirely, shaving a good chunk off your DR costs.
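
Elasticsearch is one example: with its repository-s3 plugin it takes incremental snapshots straight into a bucket. Below is a rough sketch of the REST calls using Python’s requests library; the host, repository name, snapshot name, and bucket are placeholders, and the cluster needs credentials for the bucket configured separately.

    import requests

    ES_URL = "http://localhost:9200"   # placeholder address for the cluster

    # Register an S3-backed snapshot repository (requires the repository-s3 plugin).
    requests.put(
        f"{ES_URL}/_snapshot/s3_backups",
        json={"type": "s3", "settings": {"bucket": "my-es-backups"}},
    ).raise_for_status()

    # Take a snapshot; later snapshots only upload new or changed segments.
    requests.put(f"{ES_URL}/_snapshot/s3_backups/nightly-1").raise_for_status()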

Use available ephemeral storage

A lot of containerized applications use network-attached storage by default. However, there are situations where ephemeral storage (local disks) is available for consumption by containers running on the host. This is almost always the case in bare-metal environments.
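
In Kubernetes, the simplest way for a container to tap the node’s local disk is an emptyDir volume (hostPath and local PersistentVolumes are the other common mechanisms). This is normally written as a YAML manifest; the Python client is used below only to keep all the examples in one language, and the names, image, and mount path are placeholders.

    from kubernetes import client, config

    config.load_kube_config()

    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name="scratch-example"),
        spec=client.V1PodSpec(
            containers=[
                client.V1Container(
                    name="app",
                    image="redis:7",
                    # The container sees /data, which lives on the node's own disk.
                    volume_mounts=[client.V1VolumeMount(name="scratch", mount_path="/data")],
                )
            ],
            # emptyDir is backed by node-local storage and disappears with the pod.
            volumes=[client.V1Volume(name="scratch", empty_dir=client.V1EmptyDirVolumeSource())],
        ),
    )

    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)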

There are some caveats to using local storage in Kubernetes, though. One problem is that if the node is lost, the data is lost. Another is that the pod that needs the data is pinned to whatever node holds it, which prevents it from moving freely in the case of a node outage.

Replication for ephemeral storage

Both of these can be mitigated by running extra live copies of the data via a ReplicaSet, provided the stateful application is capable of handling its own data replication. This comes at the cost of increased CPU and memory usage, but it can be a cost-effective trade-off if you have the space.

If you decide to go the ephemeral storage route on Kubernetes, check out OpenEBS -- it allows for working with local devices in a much more K8S-native way.

Storage tiering

A more advanced approach to saving money on storage is tiering. The idea is to partition your dataset so that less frequently accessed data lives on cheaper (or “cold”) disks, while more frequently accessed data stays on faster, more expensive (or “hot”) disks.

Storage tiering is generally handled at the application level. One example is PostgreSQL, where tablespaces can be mapped to specific storage devices. Another is Elasticsearch, where shards can be allocated to specific nodes (the hot/warm architecture).
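
Here is what the PostgreSQL side might look like, sketched with psycopg2. The mount point, tablespace, and table names are all invented, and the directory has to exist and be owned by the postgres user before the tablespace is created.

    import psycopg2

    conn = psycopg2.connect("dbname=metrics user=postgres")
    conn.autocommit = True  # CREATE TABLESPACE cannot run inside a transaction
    cur = conn.cursor()

    # A tablespace backed by a slower, cheaper volume mounted at /mnt/cold_hdd.
    cur.execute("CREATE TABLESPACE cold_tier LOCATION '/mnt/cold_hdd'")

    # Move an infrequently queried table onto the cold tier.
    cur.execute("ALTER TABLE events_2022 SET TABLESPACE cold_tier")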

Other considerations

  • Scale gradually by adding smaller disks and striping them together to avoid over-allocating up front (RAID, ZFS, or application-level striping)
  • Use compression (ZFS or application-based)
  • https://github.com/openebs/zfs-localpv (for ZFS on ephemeral disks)
  • Outside of replicating for performance, only replicate data when quick recovery time is needed; consider using backups on larger datasets that can tolerate slower disaster recovery
  • And finally, in some cases you can reduce TCO with a managed service (such as RDS) -- the higher per-GB cost may be offset by lower operational overhead and maintenance costs