A Decade of Using Ansible: Takeaways & Best Practices
My first encounter with Ansible was at a conference in Berlin in 2014, during a talk given by the great Jan-Piet Mens. The only automation tool I had been using up until then was Puppet (not counting homegrown Bash scripts). For the last ten years Ansible has been a constant in my toolbox, both at work and in private projects. The following points convinced me back in 2014:
- agent-less
- uses SSH as transport (delegates authentication to a proven and widely used mechanism)
- Python-based: easy to extend with custom modules, filters, lookup plugins etc.
- playbooks are easy to read, follow and understand
- a broad range of modules (even in 2014!)
The following blog post assumes at least some familiarity with Ansible - you should know your way around playbooks, roles, inventories, host- and group-vars. Some things might be old news to you, others might be new.
Please consider the following as an incomplete guide to “things I have gotten used to in the past”. It might contradict official best practices, and of course there may be better solutions to the problems described.
Use Roles, Environments and Possibly Inventory Plugins
I usually keep my Ansible roles, playbooks and variables in a single repository. Over the years I have defaulted to the following directory structure for an Ansible repository:
.
├── ansible.cfg
├── environments/
│   ├── dev/
│   │   ├── group_vars/
│   │   ├── host_vars/
│   │   └── inventory
│   └── live/
│       ├── group_vars/
│       ├── host_vars/
│       └── inventory
├── filter_plugins/
├── library/
├── lookup_plugins/
├── roles/
└── tasks/
Playbooks are stored at the top level. They never contain tasks directly but only include roles from the roles/ folder. Roles allow you to separate your tasks and reuse them easily in different playbooks. To support different environments (e.g. dev/live as shown), the inventories are separated by folder, so you would invoke Ansible from the top level as ansible-playbook -i environments/dev/inventory .... If you need to extend Ansible with custom filters or lookup plugins, simply drop them into the respective folders. library would be the place to drop custom Ansible modules. However, for this to work your ansible.cfg needs to contain the following lines:
[defaults]
library = ./library
In general it is a good habit to have an ansible.cfg in your Ansible repository to keep required/custom configurations close to your playbooks/environments.
Sooner or later you will find that several of your playbooks or roles share the same small snippets of tasks (e.g. “disable monitoring for a service” or “get a certificate”). If these snippets do not warrant a role on their own, you can drop them into small YAML files in the tasks folder and include them in any playbook or role like this:
- include_tasks: tasks/disable_monitoring.yml
  vars:
    services:
      - my_service
      - my_other_service
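For illustration, a hypothetical tasks/disable_monitoring.yml consuming the services variable could look like this (the downtime script is a placeholder, not a real tool):

# tasks/disable_monitoring.yml - hypothetical sketch; the downtime
# script below is a made-up placeholder
- name: Schedule monitoring downtime for the given services
  ansible.builtin.command:
    cmd: "/usr/local/bin/set-downtime.sh {{ item }}"
  loop: "{{ services }}"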
Ansible Vault
At some point you will need to store things in your host_vars or group_vars that must be kept secret, like API keys or passwords. This is where Ansible Vault comes in handy. If you are using Ansible for your personal environment, you will probably be fine with one global vault password. However, if you are in a work environment and possibly share your Ansible setup with many other teams, multiple separate vault IDs are a handy way to limit the blast radius when a single vault password has been compromised.
You can create a vault file with a specific vault ID (myteam) like this:
ansible-vault create --vault-id myteam@prompt vault.yml
You can also provide the same vault ID password interactively to Ansible like this:
ansible-playbook -i inventory --vault-id myteam@prompt playbook.yml
You can also query for multiple vault ID passwords if required by your playbook:
ansible-playbook -i inventory --vault-id myteam@prompt --vault-id otherteam@prompt playbook.yml
The @prompt suffix will cause Ansible to interactively prompt for the vault password. You can also create helper scripts which read the relevant password from some source (e.g. the pass password manager) and provide it to Ansible automatically:
ansible-playbook -i inventory --vault-id myteam@retrieve-myteam-password.sh playbook.yml
The following shell alias (saved in ~/.profile or ~/.bashrc) allows you to always have one or multiple vault passwords provided to Ansible:
alias ansible-playbook='ansible-playbook --vault-id myteam@retrieve-myteam-password.sh'
Where and How to Store Vault Data
While you can use so-called inline vaults, you really should not. The other option is to have the entire YAML document encrypted by Ansible Vault. The following YAML file is an example of an inline vault:
some_non_confidential_var: true
some_other_non_confidential_var: "yolo"
super_confidential_stuff: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  32656432386638396362303630666363653830633966663038643330306137643639336361333337
  6665323361333865653635633038316133316266653530610a653534313232363664363066303337
  61656531383861303232366464663137303931383531303236393838656239323765396261656565
  3536633165383762350a333761656664333739626335343563623461323137366531663234383137
  30363338383661646534366266646165313666633561613730353836666336323439
On the one hand, inline vaults ensure that all variables which belong together can be located in the same YAML file. If you are using grep or similar to locate super_confidential_stuff, you will find exactly where it has been defined (and where it is used). On the other hand, the vault part bloats your YAML file and there is no easy way to decrypt it without copy/pasting it somewhere else and running ansible-vault on that. Replacing the encrypted data also involves quite a bit of copy/pasting. Finally, if you provide the wrong vault password to Ansible, the playbook will run up to the point where it tries to read the inline vault data, fail to decrypt it and stop your entire playbook run.
To avoid the problems stated above but still retain the advantages, the following workflow has proven itself over the years: imagine you have a host_vars file for a server called app01.example.org, containing the above example YAML data. We will now transform this into the following structure:

host_vars/app01.example.org/main.yml:
some_non_confidential_var: true
some_other_non_confidential_var: "yolo"
super_confidential_stuff: "{{ vault_super_confidential_stuff }}"
host_vars/app01.example.org/vault.yml:
$ANSIBLE_VAULT;1.1;AES256
32656432386638396362303630666363653830633966663038643330306137643639336361333337
[...]
The decrypted content of this file will be:
---
vault_super_confidential_stuff: "secret data"
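To create or later modify that file in place, use ansible-vault with the vault ID from above:

ansible-vault create --vault-id myteam@prompt host_vars/app01.example.org/vault.yml
ansible-vault edit --vault-id myteam@prompt host_vars/app01.example.org/vault.yml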
This approach combines the following advantages:
- you can grep for all variables used in your playbooks or templates and find their definition within your host_vars or group_vars
- variables with secrets map directly to a similarly named variable in an encrypted file in the same folder
- you can directly view or edit your encrypted vault.yml with ansible-vault view|edit (or any vault integration in your IDE) - the vault data is located “near” its non-vault counterparts
- if you forget to provide a vault password or provide the wrong one to ansible-playbook, Ansible will fail early and not run the playbook at all (which is better than failing half-way through the execution, as with undecryptable inline vaults)
Use Handlers (Along with flush_handlers if Required)
Handlers have been around for quite a while. They allow the execution of Ansible tasks when other tasks have changed something (e.g. updated a configuration file or installed a package). Whenever you spot a combination of tasks with register: blah followed by when: blah is changed, you should immediately refactor that to a handler. This will greatly improve the readability of your roles and is especially handy if you have many tasks which require the same command to run at the end (e.g. a configuration split across many files).
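A minimal sketch of that refactoring (the nginx service and template are made up for illustration):

# Before: register + when
- name: Deploy nginx configuration
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  register: nginx_conf

- name: Reload nginx if the config changed
  ansible.builtin.service:
    name: nginx
    state: reloaded
  when: nginx_conf is changed

# After: the task merely notifies a handler ...
- name: Deploy nginx configuration
  ansible.builtin.template:
    src: nginx.conf.j2
    dest: /etc/nginx/nginx.conf
  notify: Reload nginx

# ... which lives in the role's handlers/main.yml and runs once at the end
- name: Reload nginx
  ansible.builtin.service:
    name: nginx
    state: reloaded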
If your playbook includes many roles or takes a long time, you sometimes want or even require your handlers to be executed earlier than at the very end of the playbook. This might be relevant if you configure service A followed by service B - but service B requires A to already be up and running. You can use the following meta task anywhere in your playbook/roles to execute all handlers that have been notified/queued up to this point:
- name: Flush all queued handlers now
  ansible.builtin.meta: flush_handlers
Use Meaningful and Scoped Variable Names
There is no glory in using short and ambiguous variable names like name or path. Unless your variable really has a global meaning for your entire environment, you should always prefix variables used in specific tasks or templates with e.g. your role’s name. This way you avoid variable name clashes and unpredictable behaviour across multiple roles. Even within one role, using verbose and concise variable names improves readability and maintainability a lot. As we are in a Python world, sticking to PEP 8 is a good idea:
Function names should be lowercase, with words separated by underscores as necessary to improve readability.
Variable names follow the same convention as function names.
mixedCase is allowed only in contexts where that’s already the prevailing style (e.g. threading.py), to retain backwards compatibility.
You must never use dashes in variable names though. Jinja interprets a dash as a minus operator, so {{ system-location }} is parsed as the subtraction system - location (two undefined variables) and causes all sorts of trouble.
Good:
---
apache_tls_listen_port: 443
snmpd_system_location: "closet"
myapplication_unpriv_service_user: "nobody"
Bad:
---
port: 80
system-location: "closet"
user: "nobody"
YAML, The Norway Problem and Octal Numbers
While YAML is an accessible and easy to read format, it is far from being uncomplicated. The famous “Norway Problem” is just one example of its quirks. As a rule of thumb: always quote strings to avoid unwanted type inference. While the yes|no|true|false dilemma should be well known by now, there are also lesser known issues: if you are using the file, copy or template modules, you may (should!) specify a file mode using a numeric/octal representation (e.g. 0755, 0644). However, if you do not quote this value, YAML will interpret it as an octal number and Ansible will end up with the decimal representation of said octal number.
Let’s do a quick Python test:
>>> import yaml
>>> yaml.safe_load("""
... ---
... number: 0755
... """)
{'number': 493}
If passed to Ansible as a file mode, this would lead to rather unexpected results. With proper quoting, the result looks as expected:
>>> import yaml
>>> yaml.safe_load("""
... ---
... number: '0755'
... """)
{'number': '0755'}
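The Norway Problem mentioned above is the same class of issue, just with booleans instead of octal numbers - a quick sketch:

country_codes_unquoted:
  norway: no       # YAML 1.1 parses the unquoted no as boolean false
country_codes_quoted:
  norway: "no"     # quoted, it stays the string "no"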
JSON Templating
If you find yourself writing a Jinja2 template for a JSON file, you will sooner or later stumble over proper quoting. Let’s assume the following template:
{
  "author": "{{ author }}",
  "title": "{{ title }}",
  "year": {{ year }}
}
We now need to specify author and title as variables in e.g. some host_vars or group_vars file:
---
author: 'Rudolph Bott'
title: 'My First Book'
year: 2024
This will work well. However, let’s assume a different title:
---
author: 'Rudolph Bott'
title: 'My "First" Book'
year: 2024
Ansible/Jinja will render the template just fine. But the result is broken JSON, because the double quotes inside the title string break the JSON syntax. You could easily solve this with backslashes in the YAML definition (e.g. 'My \"First\" Book'). But that might break the usage of the title variable in other places which do not need escaping. You could also use the replace filter. But that would just be re-inventing the wheel, because Ansible comes with a powerful filter which knows all about proper JSON encoding: ansible.builtin.to_json.
Just keep in mind that it will include the surrounding quotes, so your Jinja template should look like this:
{
  "author": {{ author | to_json }},
  "title": {{ title | to_json }},
  "year": {{ year }}
}
Just for the sake of completeness: to_json can also encode entire data structures, not just plain strings. The following will achieve exactly the same as above, without any JSON templating:
---
book_data:
  author: 'Rudolph Bott'
  title: 'My "First" Book'
  year: 2024
{{ book_data | to_json }}
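Rendered with the values above, both variants produce the same valid JSON with properly escaped quotes, roughly:

{"author": "Rudolph Bott", "title": "My \"First\" Book", "year": 2024}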
Working with many hosts
Ansible is not exactly known for being lightning fast. The Mitogen plugin was the single greatest improvement to Ansible execution times I have ever seen. However, I have not used it in recent years because, as far as I remember, it was broken with Ansible 2.10+ for quite a while (or still is?). Fortunately, there are other built-in ways to speed up your playbook execution times.
Use Persistent SSH Connections
Ansible supports persistent SSH connections. That means it will instruct SSH to open a connection to a server and keep it running in the background for a given time. If you execute a playbook against the same host again within that timeframe, it will reuse the existing connection instead of negotiating a new one. For this to work, you need to configure the timeout and a path to a socket which will be created by SSH per connection. This can be achieved by creating a file named ansible.cfg in your playbook repository with the following content (pipelining additionally reduces the number of SSH operations required per task):
[ssh_connection]
control_path=%(directory)s/CP-%%C
ssh_args=-o ControlPersist=60m -o ControlMaster=auto
pipelining = True
Raise Fork Limit
By default, Ansible limits itself to just 5 forks. That means that even when you set serial: 40 (or do not use serial at all) in your playbook, Ansible will not execute the same task on 40 hosts at the same time but rather in batches of 5. If you are working with many hosts and have a decent machine at hand to run your playbooks, you should raise this limit in your ansible.cfg to 50, 100 or even 200:
[defaults]
forks = 100
Making Ansible Playbooks More Robust
Ansible offers multiple ways to make your life with large playbooks, large numbers of hosts or even both easier.
Set An Error Margin For Your Playbooks
If you need to run a playbook against many hosts (let’s say 300), you will most likely instruct Ansible to process your hosts in batches (by setting serial to something like 40). However, a single failed task/host will end your entire playbook run, which might or might not be what you expect. In many cases the strategy “finish as much as possible and inspect anything that failed at the end” will greatly improve your day. You can instruct Ansible to allow a certain percentage of hosts to fail within each batch of hosts:
---
- hosts: all
  serial: 40
  max_fail_percentage: 20
  tasks:
    - ...
The above snippet will run your playbook in batches of 40 hosts and allow up to 8 failed hosts (20% of 40) per batch. Failed hosts will be listed in the summary output at the end of your playbook run, and you can take your time to examine the causes of your failed tasks/hosts. This ensures that your 2-hour playbook run actually finishes most of its hosts instead of being stopped in its tracks after 10 minutes by a single bad host.
Block / Rescue / Always Exception Handling
Especially with playbooks that run for a long time and span many hosts, you will find yourself in situations where you need to gracefully handle errors without stopping the entire playbook run. You might also have to ensure that certain cleanup tasks run if - or especially when - a step in the playbook fails. Luckily, Ansible has ported Python’s exception model (sort of) to the Ansible world:
Try…Except
Detect a failing task and execute some other tasks if that happens:
tasks:
  - block:
      - name: some stupid task which might fail
        service:
          name: someservice
          state: reloaded
    rescue:
      - name: Reloading failed, go for a restart
        service:
          name: someservice
          state: restarted
Try…Finally
Detect a failing task and always execute some other tasks:
tasks:
  - block:
      - name: Disable Monitoring For Deployment
        command:
          cmd: /usr/local/bin/disable-monitoring.sh
      - name: Restart Service
        service:
          name: someservice
          state: restarted
    always:
      - name: Enable Monitoring After Deployment
        command:
          cmd: /usr/local/bin/enable-monitoring.sh
Of course you can also use a combination of block, rescue, and always, as in the sketch below (reusing the made-up service and monitoring scripts from above; the rollback script is equally hypothetical). You can find more information on this subject in the official documentation.
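tasks:
  - block:
      - name: Restart Service
        service:
          name: someservice
          state: restarted
    rescue:
      - name: Restart failed, roll back to the previous release
        command:
          cmd: /usr/local/bin/rollback.sh
    always:
      - name: Enable Monitoring After Deployment
        command:
          cmd: /usr/local/bin/enable-monitoring.sh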
Useful links and resources
The Ansible documentation contains a list of mostly useful best practices; you should read and understand them. Red Hat has various blog posts about Ansible, e.g. 8 ways to speed up your Ansible playbooks or how to Find mistakes in your playbooks with Ansible Lint. If you use Netbox as your inventory system, you can most probably ditch your file-based inventory and use this inventory plugin to retrieve the list of hosts and their settings directly from Netbox, as sketched below. The same also works for AWS/EC2 and many other possible data sources.
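As a hedged sketch (assuming the netbox.netbox collection is installed; the endpoint is a placeholder, and the API token can be supplied via the NETBOX_TOKEN environment variable), a minimal inventory file could look like this:

# netbox.yml - minimal sketch for the netbox.netbox.nb_inventory plugin;
# the api_endpoint is a placeholder
plugin: netbox.netbox.nb_inventory
api_endpoint: https://netbox.example.org
validate_certs: true
group_by:
  - device_roles

You would then point Ansible at it just like at a file-based inventory: ansible-playbook -i netbox.yml playbook.yml.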
I hope you have learned something new while reading this blog post. If you have other suggestions or find some of the ideas questionable, please do not hesitate to contact me on Mastodon!