String interpolation in YAML with Python

String interpolation is not a feature of YAML. In this post, I will present a quick way to perform string interpolation in your configuration files written in YAML format. For that, I will use Jinja syntax to define the placeholders in key values and process the .yaml files with Python.

Goal

My goal is to

merge several .yaml configuration files into a single configuration object,
process the configuration files in order, so that later configs can overwrite the values of keys defined in earlier ones,
use placeholders in values that reference the values of other keys in the same or a different .yaml file.

Example

Let's see an example. If I have 2 .yaml files that are loaded in the following order

project:
  name: project
  environment: dev
storage:
  bucket: "{{ project.name }}-{{ project.environment }}-{{ aws.account_id }}"

project:
  name: yaml-interpolation
aws:
  account_id: "123456789"
user:
  username: "codiply"
  user_arn: "arn:aws:iam::{{ aws.account_id }}:user/{{ user.username }}"
  storage_path: "s3://{{ storage.bucket }}/{{ user.username }}"

I want the final result to be a Python dictionary (specifically a benedict dictionary) equivalent to the following YAML

aws:
  account_id: '123456789'
project:
  environment: dev
  name: yaml-interpolation
storage:
  bucket: yaml-interpolation-dev-123456789
user:
  storage_path: s3://yaml-interpolation-dev-123456789/codiply
  user_arn: arn:aws:iam::123456789:user/codiply
  username: codiply

The code

For the implementation, I am using benedict and Jinja2, specifically the following versions

python-benedict==0.32.1
Jinja2==3.1.2

The imports are

import os
import re
import typing

from benedict import benedict
from jinja2 import BaseLoader, Environment

I work with two representations:

A list of strings, each string containing the content of a YAML file. The order of this list is important when there are duplicate keys.
A merged nested dictionary with all settings combined. This will serve as the "context" for doing the string interpolation.

For loading the YAML files and merging them into a single dictionary I use benedict which already gives me the functionality for loading and merging dictionaries. The code is

def _merge_configs_to_dict(yaml_texts: typing.List[str]) -> benedict:
    merged_config = benedict()
    for text in yaml_texts:
        config = benedict.from_yaml(text)
        merged_config.merge(config, overwrite=True, concat=True)
    return merged_config

Notice that contents are processed in the order they are passed in, and due to the setting overwrite=True, duplicate keys are overwritten. The setting concat=True controls the behaviour for key values that are lists. In this case, I am appending elements to the list if they exist in multiple configs, but you can choose to overwrite the whole list with the new list.
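To make these merge semantics concrete, here is a minimal standard-library sketch of a recursive merge over plain dictionaries. It only illustrates the behaviour described above and is not how benedict implements merge; the function name deep_merge and its concat flag are hypothetical.

```python
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], new: Dict[str, Any], concat: bool = True) -> Dict[str, Any]:
    """Return a new dict with `new` merged into `base` (hypothetical helper)."""
    merged = dict(base)
    for key, value in new.items():
        current = merged.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            # Nested sections are merged key by key.
            merged[key] = deep_merge(current, value, concat)
        elif concat and isinstance(current, list) and isinstance(value, list):
            # Lists from both configs are joined, earlier items first.
            merged[key] = current + value
        else:
            # Scalars (and mismatched types) are overwritten by the later config.
            merged[key] = value
    return merged


first = {"project": {"name": "project", "environment": "dev"}, "tags": ["a"]}
second = {"project": {"name": "yaml-interpolation"}, "tags": ["b"]}
result = deep_merge(first, second)
```

Here the later config replaces project.name, keeps project.environment from the earlier one, and the two tags lists are joined. With concat=False the second list would replace the first.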

Once I have a context object loaded, I can attempt to render each one of the YAML texts with Jinja

def _render_jinja(text: str, context: benedict) -> str:
    template = Environment(loader=BaseLoader(), autoescape=False).from_string(text)
    return template.render(context)

def _render_yaml_texts(yaml_texts: typing.List[str], context: benedict) -> typing.List[str]:
    return [_render_jinja(yaml_text, context) for yaml_text in yaml_texts]
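One thing to be aware of with this rendering step: by default Jinja renders an undefined variable as an empty string, so a misspelled key in a placeholder silently disappears instead of being reported. A possible safeguard, shown here as a sketch rather than as part of the code above, is to pass undefined=StrictUndefined to the Environment. The helper name render and the plain dict context are only for illustration.

```python
from jinja2 import BaseLoader, Environment, StrictUndefined, UndefinedError


def render(text: str, context: dict, strict: bool = False) -> str:
    # `render` is a hypothetical helper; `strict` switches on StrictUndefined.
    options = {"undefined": StrictUndefined} if strict else {}
    env = Environment(loader=BaseLoader(), autoescape=False, **options)
    return env.from_string(text).render(context)


context = {"user": {"username": "codiply"}}
text = 'path: "home/{{ user.usernme }}"'  # note the typo in the key

# Default behaviour: the unknown key renders as an empty string.
lenient = render(text, context)

try:
    render(text, context, strict=True)
    strict_failed = False
except UndefinedError:
    # With StrictUndefined the typo is reported instead of being hidden.
    strict_failed = True
```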

To tell if there are more placeholders left in the YAML, it is easier to work with the text representation.

def _exists_string_to_interpolate(yaml_texts: typing.List[str]) -> bool:
    for text in yaml_texts:
        if "{{" in text:
            return True
    return False

The idea is to go back and forth between the two representations (YAML text and dictionary/context) making string interpolations until there are no more interpolations to be made. If there are cyclic dependencies, the stopping condition will never be met. For that reason, I set a maximum number of iterations and I stop after the maximum number of passes. I raise an exception if at that point there are still placeholders left.

def _combine_configs_with_string_interpolation(ordered_yaml_texts: typing.List[str], max_passes: int = 8) -> benedict:
    yaml_texts = ordered_yaml_texts
    pass_number = 1

    while pass_number <= max_passes and _exists_string_to_interpolate(yaml_texts):
        context = _merge_configs_to_dict(yaml_texts)
        yaml_texts = _render_yaml_texts(yaml_texts, context)
        pass_number += 1

    if _exists_string_to_interpolate(yaml_texts):
        remaining_expressions = _find_all_remaining_placeholders(yaml_texts)
        raise Exception(
            f"Unable to interpolate all strings after {max_passes} passes. "
            "Check for cyclic references. "
            f"Remaining expressions are {', '.join(remaining_expressions)}."
        )

    return _merge_configs_to_dict(yaml_texts)
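To get a feel for how many passes a chain of references needs, the loop can be simulated with the standard library alone. The sketch below is not the Jinja-based code above: it uses a flat dictionary and a regular expression in place of YAML and Jinja, and the names substitute_once and count_passes are hypothetical. In this simplified model, every pass replaces each placeholder with the current (possibly still unresolved) value of the key it points to.

```python
import re
from typing import Dict

PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def substitute_once(values: Dict[str, str]) -> Dict[str, str]:
    # Replace every placeholder with the current value of the referenced key.
    return {
        key: PLACEHOLDER.sub(lambda match: values[match.group(1)], value)
        for key, value in values.items()
    }


def count_passes(values: Dict[str, str], max_passes: int = 8) -> int:
    # Count the passes needed until no value contains a placeholder.
    passes = 0
    while passes < max_passes and any("{{" in v for v in values.values()):
        values = substitute_once(values)
        passes += 1
    return passes


# A chain of four references: k0 -> k1 -> k2 -> k3 -> k4
chain = {
    "k0": "{{ k1 }}",
    "k1": "{{ k2 }}",
    "k2": "{{ k3 }}",
    "k3": "{{ k4 }}",
    "k4": "value",
}
passes_needed = count_passes(chain)
```

In this simulation the chain of four references resolves in three passes, because a value substituted in one pass carries its own placeholder into the next.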

For better debugging of cyclic dependencies, I find and report all placeholders that have not been replaced with a value. This function is below

def _find_all_remaining_placeholders(yaml_texts: typing.List[str]) -> typing.List[str]:
    remaining = set()
    for text in yaml_texts:
        # Non-greedy match, so several placeholders on one line are reported separately.
        remaining.update(re.findall(r"\{\{.*?\}\}", text))
    return list(remaining)
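When a line contains more than one placeholder, the quantifier in the pattern decides whether they are reported separately. The following standalone check, using only the standard library and a made-up sample line, compares a greedy and a non-greedy pattern.

```python
import re

line = 'bucket: "{{ project.name }}-{{ project.environment }}"'

# Greedy: `.*` runs to the last `}}` on the line, giving one long match.
greedy = re.findall(r"\{\{.*\}\}", line)

# Non-greedy: `.*?` stops at the first `}}`, giving one match per placeholder.
non_greedy = re.findall(r"\{\{.*?\}\}", line)
```

The non-greedy form is the one that yields a readable list of individual placeholders in an error message.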

To load the YAML texts, given some paths, the code is

def _load_yaml_texts(ordered_paths: typing.List[str]) -> typing.List[str]:
    yaml_texts = []
    for path in ordered_paths:
        if os.path.isfile(path):
            with open(path, "r") as file:
                yaml_texts.append(file.read())
    return yaml_texts

def load_config(ordered_yaml_paths: typing.List[str]) -> benedict:
    yaml_texts = _load_yaml_texts(ordered_yaml_paths)
    config = _combine_configs_with_string_interpolation(yaml_texts)
    return config

Finally, to create the dictionary for a set of config filenames, I do

config = load_config(["/some/path/a.yaml", "/some/path/b.yaml"])

Most likely this config will be used in your Python code, so a dictionary is a good representation. Alternatively, you can pass the dictionary to the constructor of a more strongly typed object.
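As a sketch of passing the config to a typed object, sections of it could be handed to dataclasses. The class names below are hypothetical, and a plain dictionary stands in for the benedict returned by load_config, which is a dict subclass and should be usable the same way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectConfig:  # hypothetical typed view of the "project" section
    name: str
    environment: str


@dataclass(frozen=True)
class StorageConfig:  # hypothetical typed view of the "storage" section
    bucket: str


# Stand-in for the dictionary produced by load_config(...)
config = {
    "project": {"name": "yaml-interpolation", "environment": "dev"},
    "storage": {"bucket": "yaml-interpolation-dev-123456789"},
}

# Unknown or missing keys raise a TypeError here, which surfaces config mistakes early.
project = ProjectConfig(**config["project"])
storage = StorageConfig(**config["storage"])
```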

If you wish to get the rendered config as a single YAML file, you can simply do

config.to_yaml()

and store the result in a file.

Limitations

There are a few limitations

You cannot reference a key that contains a list. This is not exactly a limitation, because the goal is to do string interpolation. If you are here because you need to reference a list, then you should most likely be looking into anchors and aliases that are part of the YAML specification.
Depending on how deep the graph of references is, 8 passes might not be sufficient. In that case, you can raise the maximum number of passes.
If you have a cyclic dependency and a high number of maximum passes, the code is going to construct very large strings.

To demonstrate the last point, with the simplest cyclic dependency, this is what happens at each step

# Original
section:
  key1: "{{ section.key2 }}-a"
  key2: "{{ section.key1 }}-b"

# 1st pass
section:
  key1: "{{ section.key1 }}-b-a"
  key2: "{{ section.key2 }}-a-b"

# 2nd pass
section:
  key1: "{{ section.key1 }}-b-a-b-a"
  key2: "{{ section.key2 }}-a-b-a-b"

# 3rd pass
section:
  key1: "{{ section.key1 }}-b-a-b-a-b-a-b-a"
  key2: "{{ section.key2 }}-a-b-a-b-a-b-a-b"

# 4th pass
section:
  key1: "{{ section.key1 }}-b-a-b-a-b-a-b-a-b-a-b-a-b-a-b-a"
  key2: "{{ section.key2 }}-a-b-a-b-a-b-a-b-a-b-a-b-a-b-a-b"

and the strings for these 2 values grow exponentially in size.
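The growth can be reproduced with a small standard-library simulation. It is a simplified model of the passes, using a flat dictionary and a regular expression instead of YAML and Jinja, and the name substitute_once is hypothetical.

```python
import re
from typing import Dict, List

PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def substitute_once(values: Dict[str, str]) -> Dict[str, str]:
    # Replace every placeholder with the current value of the referenced key.
    return {k: PLACEHOLDER.sub(lambda m: values[m.group(1)], v) for k, v in values.items()}


values = {
    "section.key1": "{{ section.key2 }}-a",
    "section.key2": "{{ section.key1 }}-b",
}

suffix_counts: List[int] = []
for _ in range(4):
    values = substitute_once(values)
    # Count the "-a"/"-b" suffixes accumulated on key1 so far.
    suffix_counts.append(values["section.key1"].count("-"))
```

The number of suffixes on key1 doubles with every pass (2, 4, 8, 16), while the placeholder at the front never disappears.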

Deep Dive - Codiply.com