During the past year I have migrated a number of Zimbra systems to Amazon Web Services (“AWS”).  These “real-world” experiences led to my being invited by Zimbra to present my AWS Hosting Best Practices at their North American Partners’ Conference in Dallas, Texas in October 2018.  (You can request a copy of my presentation here.)  Since that time, several additional engagements have enabled me to refine what I presented last October.

I am presenting here a preview of just the disk portions of my soon-to-be-updated Zimbra Hosting on AWS Reference Architectures, to encourage others to test these storage architectures and report back with suggested further refinements.

This blog post presumes the reader has a working familiarity with AWS’s various block and object storage options (if not, please request the presentation I made at last year’s Partner Conference, which contains an introduction to AWS storage and other AWS concepts).  At a minimum, you should understand the differences between gp2, st1 and io1 EBS volumes, their relative costs, and how S3 object storage is provisioned and priced.

Goals, Prerequisites and Disclaimer
The storage reference architectures below endeavor to provide a healthy balance between performance, reliability/redundancy/resiliency and cost.  My own view on what comprises a “healthy balance” may differ from your view, so feel free to adjust as needed.  Also, these architectures are only relevant for Zimbra systems running Network Edition with Backup NG enabled, or Zimbra Suite+ or Open Source Zimbra running the ZeXtras Suite, with ZeXtras Backup enabled.  Please consider these as a starting point, not as a be-all/end-all. You should do your own testing of course, and your unique Zimbra workloads may require something very different.

Let’s jump right in!

Single-Server Storage Reference Architecture
An m5.xlarge instance (4 cores and 16GB of RAM) is generally sufficient for environments of fewer than ~500 users with average mail flows and modest mailbox sizes.  Environments with smaller mail flows and/or less-active users can use this instance size with a larger number of mailboxes.  My current recommended storage layout for a single Zimbra server with a 1TB mailstore is as follows:

  • 50GB gp2 disk for /
  • 250GB gp2 disk for /opt
    • Provision a 16GB swapfile on this disk.
  • S3 Bucket for HSM
    • Deploy an automated HSM policy that moves only email objects older than a cutoff age, with the cutoff set so that ~100GB stays free in /opt.
    • Ensure Zimbra uses Infrequent Access for S3 to save on costs.
  • 1.5TB st1 disk for /opt/zimbra/backup
    • st1 disks are “throughput optimized” and don’t have a lot of IOPs (which is why they cost less than half as much as a gp2 SSD disk).  Above a certain volume of inbound/outbound email, an st1 disk won’t be able to keep up, and you will get warnings about RealTime Scanner slowdowns. If so, you’ll need to convert your backup disk from st1 to gp2 (no service interruption!).
  • 150GB gp2 disk for /opt/zimbra/backup/zextras/accounts
    • This directory uses a lot of inodes.  A backup of 1.5TB will have an accounts subdirectory of likely no more than 150GB, but can use more than 15 million inodes.  Running out of inodes is like running out of disk space; no more files can be written.  We recommend formatting this separate disk with at least 20 million inodes, or 15,000 x the number of GB in your ZeXtras backups – whichever is greater.
  • 1GB RAM disk for /opt/zimbra/data/amavisd/tmp
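
The inode rule of thumb above is easy to turn into a quick calculation. The sketch below is illustrative only: the function name accounts_inodes and the device path /dev/nvmeXn1 are placeholders of mine, and it assumes an ext4 filesystem, where mkfs.ext4 -N sets the inode count at format time. It prints the format command rather than running it.

```shell
#!/bin/sh
# Inodes to request for the accounts disk: the greater of 20 million
# or 15,000 per GB of ZeXtras backup data.
accounts_inodes() {
  backup_gb=$1
  per_gb=$((backup_gb * 15000))
  if [ "$per_gb" -gt 20000000 ]; then
    echo "$per_gb"
  else
    echo 20000000
  fi
}

# Example: a 1.5TB (1500GB) backup. This only prints the command;
# /dev/nvmeXn1 is a placeholder device name.
echo "mkfs.ext4 -N $(accounts_inodes 1500) /dev/nvmeXn1"
```

For the 1.5TB example the per-GB rule wins (22.5 million inodes); for smaller backups the 20 million floor applies.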

Multi-Server Mailbox Server Storage Reference Architecture
If you have a multi-server architecture, then by definition you have a lot of mailboxes and likely a lot of very active users.  In our experience an r5.xlarge instance (4 cores and 32GB of RAM) is sufficient for most mailbox servers and costs only ~20% more than an m5.xlarge. For a mailbox server with a mailstore size of X, we recommend:

  • 50GB gp2 disk for /
  • 250GB – 500GB gp2 disk for /opt
    • 750GB for /opt may be used when hosting very large (e.g. 75GB to 100GB) mailboxes, to ensure there is enough space to do the initial import/onboarding.  This is because HSM Policy execution moves mail blobs to S3 much more slowly than Zimbra accepts imported email.  At ten cents per GB provisioned, hosting an extra 250GB of gp2 disk will cost only US$25 more per month — and get you 750 more IOPs by doing so.
    • Provision a 16GB or larger swapfile on this disk.
  • S3 Bucket for HSM
    • Deploy an automated HSM policy that moves only email objects older than a cutoff age, with the cutoff set so that at least 25% of the disk space in /opt stays free.
    • Ensure Zimbra uses Infrequent Access for S3 to save on costs.
  • 1.3x the mailstore size st1 disk for /opt/zimbra/backup
    • st1 disks are “throughput optimized” and don’t have a lot of IOPs (which is why they cost less than half as much as a gp2 SSD disk).  Above a certain volume of inbound/outbound email, an st1 disk won’t be able to keep up, and you will get warnings about RealTime Scanner slowdowns. If so, you’ll need to convert your backup disk from st1 to gp2 (no service interruption!), or consider moving high-activity mailboxes to a different mailbox server.  In large environments of, say, 5K or more mailboxes, I suggest a cost calculation to compare provisioning a few larger, beefier mailbox servers with fast gp2 backup disks against a larger number of smaller mailbox servers whose mail volume won’t overwhelm an st1 disk.
  • 10% of the size of /opt/zimbra/backup, for a gp2 disk for /opt/zimbra/backup/zextras/accounts
    • This directory uses a lot of inodes.  A backup of 1.5TB will have an accounts subdirectory of likely no more than 150GB, but can use more than 15 million inodes.  Running out of inodes is like running out of disk space; no more files can be written.  We recommend formatting this separate disk with at least 20 million inodes, or 15,000 x the number of GB in your ZeXtras backups – whichever is greater.
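
As a rough sketch of the sizing ratios in this list, the helper below derives the backup and accounts disk sizes from a mailstore size. The function names and the 2TB example are my own illustrations, not part of any Zimbra or AWS tooling. Integer arithmetic rounds down, so round up when you actually provision.

```shell
#!/bin/sh
# Size the backup and accounts disks from the mailstore size (in GB):
# backup = 1.3 x mailstore, accounts = 10% of the backup disk.
backup_disk_gb()   { echo $(( $1 * 13 / 10 )); }
accounts_disk_gb() { echo $(( $(backup_disk_gb "$1") / 10 )); }

mailstore_gb=2000   # hypothetical 2TB mailstore
echo "st1 backup disk:   $(backup_disk_gb "$mailstore_gb") GB"
echo "gp2 accounts disk: $(accounts_disk_gb "$mailstore_gb") GB"
```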

Discussion
All disks and your S3 bucket IMHO should be encrypted, so that you preserve encryption in flight and encryption at rest.  Only the root disk cannot be encrypted, but the only Zimbra data here will be in /var/log (which you can relocate elsewhere). AWS provides a comprehensive key management system you can use if you wish to bring your own keys, or you can use AWS’s baked-in default keys.

Zimbra now supports S3 for Primary Volumes (what would normally be /opt/zimbra/store). We have not sufficiently tested this, so we are not (yet) recommending that customers deploy Zimbra Primary Volumes on anything but EBS gp2 volumes.

The /opt/zimbra/backup/zextras/accounts directory is used by ZeXtras (Backup NG) to hold the backup metadata for each mailbox.  This directory is very I/O intensive, because every little change recorded by the RealTime scanner triggers individual writes to this directory. We have observed that this directory consumes between 5% and 10% of the total ZeXtras (Backup NG) backup space.  Moving this directory from the backup’s st1 disk to a dedicated gp2 disk resulted in one system’s weekly ZeXtras Purges decreasing from two days to just over an hour.

Similarly, Zimbra’s Lucene engine, used for indexing mailboxes, is also very I/O intensive, but it doesn’t generally occupy a lot of disk space. AWS gives you 3 IOPs for every GB you provision in a gp2 disk, so as the number of mailboxes and mail items increases, we need more IOPs for Lucene (and for the rest of Zimbra as well).  On-premises systems often provision a separate, high-speed disk for /opt/zimbra/index to keep Lucene performant, but AWS’s io1 (Provisioned IOPs) EBS offering is so expensive, it’s cheaper to provision a larger-than-actually-needed gp2 disk for all of /opt and leave Lucene’s indexes there than it is to provision a smaller io1 disk just for Lucene.
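
To make the gp2-versus-io1 trade-off concrete, here is a back-of-the-envelope comparison. Treat the prices as assumptions: the gp2 figure is the ten cents per GB-month quoted in this post, while the io1 figures ($0.125 per GB-month plus $0.065 per provisioned IOPS-month) are list prices I am assuming for illustration, so check current AWS pricing for your Region. The 100GB index disk and the 1,800 IOPS target are hypothetical.

```shell
#!/bin/sh
# Monthly cost of a gp2 disk sized to deliver a target IOPS figure
# (3 IOPS per GB, $0.10 per GB-month as quoted in this post).
gp2_monthly_usd() {
  awk -v iops="$1" 'BEGIN { gb = int((iops + 2) / 3); printf "%.2f\n", gb * 0.10 }'
}

# Monthly cost of an io1 disk of a given size and provisioned IOPS.
# ASSUMED prices: $0.125 per GB-month and $0.065 per IOPS-month.
io1_monthly_usd() {
  awk -v gb="$1" -v iops="$2" 'BEGIN { printf "%.2f\n", gb * 0.125 + iops * 0.065 }'
}

echo "gp2 sized for 1,800 IOPS: \$$(gp2_monthly_usd 1800)"
echo "io1, 100GB at 1,800 IOPS: \$$(io1_monthly_usd 100 1800)"
```

Under these assumed prices the oversized gp2 disk comes out at less than half the cost of the small io1 disk, and you get the extra capacity as a bonus.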

If you have been using Zimbra’s traditional Network Edition backups and have not yet used the Backup NG backup engine from ZeXtras, you will be pleased to learn that ZeXtras backups consume only about 70% to 80% of the mailstore size, due to compression and deduplication on the fly. Traditional Network Edition backups over 30 days typically consume 2.5 times to 3.5 times the mailstore size. Several Zimbra customers (before I found them) had avoided doing Zimbra migrations to AWS because they thought they were going to have to spend 3x to 4x as much on backup storage as they actually would have.
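
A quick way to see the difference for your own mailstore is to apply the two ranges directly. This is just the arithmetic from the paragraph above, with a hypothetical 1TB mailstore and a helper name (backup_gb) of my own choosing.

```shell
#!/bin/sh
# Estimate backup space (GB) as a percentage of the mailstore size.
backup_gb() { echo $(( $1 * $2 / 100 )); }

mailstore_gb=1000   # hypothetical 1TB mailstore
echo "Backup NG:   $(backup_gb "$mailstore_gb" 70) to $(backup_gb "$mailstore_gb" 80) GB"
echo "Traditional: $(backup_gb "$mailstore_gb" 250) to $(backup_gb "$mailstore_gb" 350) GB"
```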

Disaster Recovery, RPO and RTO
We recommend deploying an AWS Lifecycle Policy to create snapshots to S3 of the /opt/zimbra/backup and /opt/zimbra/backup/zextras/accounts disks. The default policy takes snapshots every 12 hours.  More frequent snapshots shorten your RPO, but result in higher costs. At some point, if you need a zero or near-zero RPO, it is more cost-effective to stay with a 12- or 24-hour snapshot Lifecycle Policy and supplement with a Mimecast (which I sell), ProofPoint, or other third-party subscription for email Continuity services. FWIW, Mimecast also bundles in email archiving and email security, so if you have a lot of Zimbra Archive mailboxes, it can be cheaper to use Mimecast than to pay AWS for the storage associated with Zimbra Archive mailboxes.

The Lifecycle Policy snapshots enable you to recover Zimbra in the event an entire AWS data center (Availability Zone in AWS-speak) fails. Lifecycle Policy snapshots are made to S3, which is replicated Region-wide, across all of the Availability Zones in a Region.  So, if one Availability Zone failed, you would simply provision two new EBS gp2 disk volumes in another Availability Zone from your /opt/zimbra/backup and /opt/zimbra/backup/zextras/accounts snapshots, and then perform an Incremental Migration to a replacement Zimbra server (or Zimbra farm).  Your RTO is determined by how “warm” that replacement Zimbra server or system in that other Availability Zone is. Your RPO is at most the snapshot interval.  Again, you can shorten your RPO/RTO targets with a Mimecast subscription, or at the very least a backup MX subscription from DNS Made Easy or others.
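
For the volume-restore step, a minimal sketch using the AWS CLI might look like the following. The snapshot IDs and the Availability Zone are placeholders, restore_cmd is a helper name I made up, and the script only prints the commands so you can review them before running anything.

```shell
#!/bin/sh
# Build (but do not run) the AWS CLI command that restores a snapshot
# as a new gp2 volume in a chosen Availability Zone.
restore_cmd() {
  snapshot_id=$1
  az=$2
  echo "aws ec2 create-volume --snapshot-id $snapshot_id --availability-zone $az --volume-type gp2"
}

# Placeholder snapshot IDs and AZ; substitute your own.
restore_cmd snap-BACKUP_DISK   us-east-1b
restore_cmd snap-ACCOUNTS_DISK us-east-1b
```

Once the volumes exist you would attach them to the replacement server and mount them at the same paths before starting the Incremental Migration.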

How Will I Know When I Don’t Have Enough IOPs On AWS?
Zimbra’s own stats tools, iotop, vmstat, atop, top and other tools can give you insight into performance bottlenecks in your system. To get started, you can run the following command as root.  Once per second it lists the processes (including the kernel threads that service specific disks) that are resource starved and whose process state has changed to “D”, meaning uninterruptible sleep, which usually indicates waiting on I/O.

while true; do date; ps auxf | awk '$8 ~ /^D/'; sleep 1; done

If you have insufficient IOPs on one or more disks, not only will the above command list the specific disks impacted, but the Linux tool “top” will report wait state percentages consistently in the low to mid double digits.  This is the “wa” or “%wa” metric reported in top’s header.   If your disks aren’t fast enough, top will report that the CPUs are spending some significant percentage of their available cycles “waiting” to do real work.
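
If you would rather log the wait percentage than watch top, it can be computed from two samples of the first line of /proc/stat, where the fifth counter after the "cpu" label is iowait. The sketch below is one way to do that; the function names are mine, and it assumes a Linux kernel that exposes the standard /proc/stat layout.

```shell
#!/bin/sh
# Compute %iowait between two "cpu ..." lines taken from /proc/stat.
iowait_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, x, " "); split(b, y, " ")
    # Fields 2..9: user nice system idle iowait irq softirq steal
    for (i = 2; i <= 9; i++) { t0 += x[i]; t1 += y[i] }
    dt = t1 - t0
    if (dt <= 0) { print "0.0"; exit }
    printf "%.1f\n", 100 * (y[6] - x[6]) / dt
  }'
}

# Take two samples N seconds apart (default 5) and print %iowait.
sample_iowait() {
  first=$(head -n 1 /proc/stat)
  sleep "${1:-5}"
  second=$(head -n 1 /proc/stat)
  iowait_pct "$first" "$second"
}
```

Running sample_iowait 10 from cron or a loop gives you a number you can track over time, which makes it easier to tell a brief spike from the sustained double-digit wait described above.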

At least on Ubuntu 16.04, insufficient IOPs also results in a lot of swapping.  After remediating several mailbox servers with insufficient IOPs for the /opt/zimbra/backup/zextras/accounts directory, we noted that top-reported “buff/cache” values decreased from more than 20GB with swap file usage above 9GB, down to about 10GB total of buff/cache usage — and no swapping.
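
To capture a before-and-after number for swap usage, you can read it straight from /proc/meminfo. This is a small sketch with a helper name of my own (swap_used_mb); it relies only on the standard SwapTotal and SwapFree fields, which are reported in kB.

```shell
#!/bin/sh
# Report swap in use (MB) from /proc/meminfo-formatted text on stdin.
swap_used_mb() {
  awk '/^SwapTotal:/ { total = $2 } /^SwapFree:/ { free = $2 }
       END { printf "%d\n", (total - free) / 1024 }'
}

# On a Linux host, print the current figure.
if [ -r /proc/meminfo ]; then
  echo "Swap in use: $(swap_used_mb < /proc/meminfo) MB"
fi
```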

My Zimbra Server On AWS Is Already Built But I Need More IOPs… Now What?
No stress!  The remediation process comprises two steps: First, temporarily provision enough IOPs by converting your problematic gp2 disk to an io1 disk, gradually increasing IOPs until you eliminate needless wait states and excessive swapping.  If it took, say, 1,800 IOPs to remove the bottleneck, then you know that that gp2 disk should be ~600GB or larger in size.  (Recall that AWS gives you 3 IOPs for every 1GB of provisioned storage on a gp2 disk, so 1,800/3=600.)

If you have an st1 disk, convert it to a gp2 disk and see what happens.  The cost increase to do this kind of testing is quite small, and again, changing disk types can be done on the fly with no need to reboot the server nor restart Zimbra.
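
The two conversions in this first step can be sketched with the AWS CLI's modify-volume command. The volume ID below is a placeholder, gp2_gb_for_iops is a helper name of mine, and the script only prints the commands for review.

```shell
#!/bin/sh
# Smallest gp2 size (GB) whose 3-IOPS-per-GB baseline meets a target.
gp2_gb_for_iops() { echo $(( ($1 + 2) / 3 )); }

# Build (but do not run) the AWS CLI calls for the temporary io1 test
# and the later move back to a right-sized gp2 volume.
vol=vol-PLACEHOLDER   # hypothetical volume ID
iops=1800             # IOPS at which the bottleneck disappeared
echo "aws ec2 modify-volume --volume-id $vol --volume-type io1 --iops $iops"
echo "aws ec2 modify-volume --volume-id $vol --volume-type gp2 --size $(gp2_gb_for_iops $iops)"
```

Two things to keep in mind: EBS volumes can be grown but not shrunk, so the size passed to the second command must be at least the volume's current size, and after growing a volume you still need to extend the partition and filesystem inside the instance. AWS also enforces a waiting period between successive modifications of the same volume, so plan your test increments accordingly.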

The second step is then to take the results of the testing and implement a more permanent fix.  Generally speaking the common fixes comprise:

  1. If the st1 backup disk has insufficient IOPS, create a dedicated gp2 disk for /opt/zimbra/backup/zextras/accounts, then stop Real Time Scanning to shuffle the data in the accounts directory to the new disk, finalize new mount points and restart Real Time Scanning (be sure to update /etc/fstab accordingly).
  2. If the /opt gp2 disk has insufficient IOPS, and you have already deployed an S3 HSM volume, create a new gp2 disk for /opt/zimbra/store2 and add it as the second, and new Default, Primary Store volume.
  3. If the /opt gp2 disk has insufficient IOPS, and you have not yet deployed an S3 HSM volume, please do so and create a policy that moves only mail items to S3.  We recommend starting with a policy that moves email blobs older than ten years, as this is a low-risk method to confirm the operation works. Afterwards, you can shorten the policy in successive intervals; be prepared however for the first HSM job after a policy change to take several days to complete.
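
For the /etc/fstab update mentioned in the first fix, a sketch of the entries might look like this. The UUIDs are placeholders (blkid will show the real ones), ext4 is an assumption about your filesystem, and fstab_entry is a helper name of my own. The nofail option lets the instance boot even if a volume is not attached.

```shell
#!/bin/sh
# Print an /etc/fstab line for an EBS-backed filesystem. "nofail" lets
# the instance boot even if the volume is missing.
fstab_entry() {
  printf 'UUID=%s %s %s defaults,nofail 0 2\n' "$1" "$2" "$3"
}

# Placeholder UUIDs (find real ones with "blkid"). The parent mount
# must be listed before the nested accounts mount.
fstab_entry BACKUP-DISK-UUID   /opt/zimbra/backup                  ext4
fstab_entry ACCOUNTS-DISK-UUID /opt/zimbra/backup/zextras/accounts ext4
```

Because the accounts directory lives underneath the backup mount point, the order of the two lines matters: the backup disk has to be mounted first.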

When all is said and done, you’ll likely end with something that looks like this (note that S3 storage does not appear in the output of a “df” command):

zimbra@zimbra:~$ df -h | grep -v run | grep -v udev | grep -v cgroup | grep -v shm
Filesystem      Size  Used Avail Use% Mounted on
/dev/nvme0n1p1   39G  3.9G   35G  10% /                              <<== gp2 disk
/dev/nvme2n1p5  739G  414G  287G  60% /opt                           <<== gp2 disk
/dev/nvme1n1p5  1.3T  750G  447G  63% /opt/zimbra/backup3            <<== st1 disk
tmpfs           1.0G   30M  995M   3% /opt/zimbra/data/amavisd/tmp   <<== RAM disk
/dev/nvme4n1p5  296G   30G  251G  11% /opt/zimbra/backup3/accounts   <<== gp2 disk
zimbra@zimbra:~$

In a multi-server environment, you could also migrate mailboxes between mailbox servers.  Just add a new mailbox server with your new right-sizing, move the mailboxes over, and then destroy the original mailbox server.  Your AWS bill will be a little larger that month, based on how long you had the two mailbox servers up and running.  You could also build a more modest companion additional mailbox server, keeping the original and the new mailbox server and then distribute your mailboxes between them.   In either case, no downtime.

If you have a single-server environment with maxed-out disks, you could add a second mailbox server or replace your constrained server with a new, right-sized server, by leveraging the Backup NG Incremental Migration strategy (which you will use for Disaster Recovery as well).  You’ll be down for just a few minutes to complete the cutover, and your users will be greeted with empty mailboxes after the Provisioning-Only portion of the restore, but the replacement Zimbra server will be fully functional, and if you do the cutover on a Friday night, the remainder of the restore can run over the weekend.  And again, if you can’t afford to have your users without access to their older emails over a weekend, you should be talking to me about a Mimecast subscription!

Conclusions and Key Takeaways
AWS’s storage infrastructure, already incredibly performant, resilient, redundant and secure, provides a variety of ways to address storage bottlenecks and right-size your Zimbra storage infrastructure cost effectively.  The majority of the solution adjustments involve zero downtime, and can be adjusted as your Zimbra system grows or shrinks. In this article we recommend starting out with our Reference Architectures, testing, and then adjusting as needed.

Further, AWS’s storage infrastructure simplifies cost-effective Disaster Recovery plans and provides options for different RPO/RTO targets at different price points.  Zero RPO/RTO targets can be achieved by supplementing an AWS Disaster Recovery plan with a Mimecast subscription.

If you need help, or want to reserve a copy of our forthcoming Zimbra AWS Reference Architectures white paper, please fill out the form below.

Hope that helps,
L. Mark Stone
Mission Critical Email
14 March 2019

The information provided in this blog is intended for informational and educational purposes only. The views expressed herein are those of Mr. Stone personally. The contents of this site are not intended as advice for any purpose and are subject to change without notice. Mission Critical Email makes no warranties of any kind regarding the accuracy or completeness of any information on this site, and we make no representations regarding whether such information is up-to-date or applicable to any particular situation. All copyrights are reserved by Mr. Stone. Any portion of the material on this site may be used for personal or educational purposes provided appropriate attribution is given to Mr. Stone and this blog.