At Distil Networks, we have a lot of secrets, such as database passwords, certificates, and private keys, and we take the job of protecting them seriously. In order to ship code quickly and reliably, we needed a highly available secret storage system. Fortunately for us, HashiCorp released Vault in 2015. Later that year, with some help from the Vault Google group, Distil’s ops team implemented a highly available Vault cluster backed by Consul, also from HashiCorp.
Vault Basics and Cluster Setup
The Consul cluster we created has three machines, and it has been running smoothly in production for well over a year. Consul uses the Raft consensus algorithm to elect a leader and keep data in sync across all nodes. An odd number of nodes is recommended to avoid split votes during leader elections, should your cluster need to recover from a major outage.
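As a sketch, the server-side portion of each node's Consul configuration might look like the following (addresses and datacenter name are placeholders; `bootstrap_expect` tells Consul to wait for three servers before electing a leader):

```json
{
  "server": true,
  "bootstrap_expect": 3,
  "datacenter": "dc1",
  "retry_join": ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
}
```

With `retry_join`, a restarted node keeps attempting to rejoin its peers, which is what lets the cluster heal itself after machine restarts.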
For more detailed Consul setup instructions, take a look at DigitalOcean’s guide for Ubuntu 14.04.
Here are example bash commands to install the Vault client on macOS, authenticate with the Vault server, write a secret, and then read it back.
# install the Vault command line client
brew install vault
# set authentication variables (placeholder values shown)
export VAULT_ADDR=https://vault.example.com:8200
export VAULT_TOKEN=<your-vault-token>
# write a secret (file.txt should contain JSON key/value data)
vault write secret/file @file.txt
# read the secret back
vault read secret/file
Vault Behind HAProxy
Traffic directed to the Vault cluster needs to reach the current leader node. The usual way to handle this is the Consul DNS interface, which lets Consul route traffic to the leader for you. (If a request lands on a standby node instead, that node redirects it to the advertise address of the current leader so the client can retry.)
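For illustration, the Consul DNS interface can be queried directly from any node running a Consul agent. Vault registers its leader under a tagged service name, so a lookup like the following (assuming default service naming and ports) returns the active node:

```shell
# Ask the local Consul agent's DNS interface for the active Vault node.
# 8600 is Consul's default DNS port; "active.vault.service.consul" is the
# name under which Vault registers its current leader.
dig @127.0.0.1 -p 8600 active.vault.service.consul +short
```

Pointing your resolvers at Consul for the `.consul` domain makes this lookup transparent to clients.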
In our case, Distil wanted to use our existing DNS service rather than Consul DNS. To do this, we put the Vault nodes behind HAProxy, a lightweight load balancer with built-in health-check support. The health checks query each Vault node’s /v1/sys/leader endpoint and route traffic to whichever node reports that it is the leader. The same health checks are also useful for notifying our ops team if something is wrong with a server.
Sample health checks:
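A sketch of what such an HAProxy backend might look like (names, ports, and addresses are placeholders; the check matches the `is_self` field that /v1/sys/leader returns only on the leader):

```haproxy
# Route all traffic to whichever Vault node reports itself as leader.
listen vault
    bind 0.0.0.0:8200
    balance roundrobin
    option httpchk GET /v1/sys/leader
    http-check expect rstring "\"is_self\":true"
    server vault1 10.0.0.11:8200 check
    server vault2 10.0.0.12:8200 check
    server vault3 10.0.0.13:8200 check
```

Standby nodes fail the `http-check expect` match, so HAProxy marks them down and only the leader receives traffic. If Vault listens over TLS, the `check-ssl` option would also be needed on each server line.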
Distil has had this setup running in production for over a year without any issues. We often restart individual machines for system updates; each time, leadership fails over automatically, and every node rejoins the cluster once its restart is complete.
While this method worked best for our use case, you might look at Consul DNS before traveling down this path: the load balancer introduces a potential single point of failure.
Vault Behind an API
In many cases, it’s valuable to separate the user from direct access to Vault by giving them an interface, such as a browser or a command line tool. This moves complex code off local machines and into a remote API. It also enables single sign-on through a protocol such as LDAP or Google SSO.
In Distil’s case, authentication between a user's machine and the remote API is handled by LDAP. The API-to-Vault validation is handled by token authentication. Different API endpoints can use different tokens, each having access only to the Vault data needed by that endpoint. In some cases, that means creating tokens with a read-only Vault access control policy.
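As a sketch, a read-only policy and a token scoped to it might be created like this (the secret path and policy name are hypothetical, and the commands use the current `vault policy`/`vault token` CLI syntax, which differs slightly from older releases):

```shell
# Define a hypothetical policy granting read-only access to one path.
cat > endpoint-readonly.hcl <<'EOF'
path "secret/important_data" {
  capabilities = ["read"]
}
EOF

# Register the policy and mint a token limited to it.
vault policy write endpoint-readonly endpoint-readonly.hcl
vault token create -policy=endpoint-readonly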
Ruby code run by the API could look something like:
require "vault"

key = "-----BEGIN PRIVATE KEY-----......."
secret_path = "secret/important_data"

# secrets are written as key/value pairs
Vault.logical.write(secret_path, value: key)

# later on...
secret_returned = Vault.logical.read(secret_path)
puts secret_returned.data[:value] # prints key
In Distil’s setup, we ended up writing a Ruby command line application using the Commander gem. We bundled it as a gem and deployed it to our internal Gem in a Box server, which makes it easy for developers to update the tool and ship new versions to users.