Parsing YAML in Python — PyYAML, ruamel.yaml, and Safe Loading

If you're working with YAML in Python, you're almost certainly using PyYAML. It's the de facto standard (a third-party package, not part of the standard library), it's been around since 2006, and its yaml.load() function can construct arbitrary Python objects from untrusted input, which has burned a lot of teams. The fix is one word, safe_load, but you need to understand why, what you trade off, and when the newer ruamel.yaml library is the better choice.

This guide covers practical YAML parsing in Python: safe loading, multi-document streams, dumping Python objects back to YAML, config file patterns with defaults, and error handling. All examples use real-world scenarios — no placeholder data.

Installation

bash

pip install pyyaml

# For ruamel.yaml (covered later)
pip install ruamel.yaml

yaml.safe_load() — The One You Should Always Use

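The most important thing to know about PyYAML is that yaml.load() can execute arbitrary Python code embedded in a YAML file when it runs with an unsafe loader (yaml.Loader or yaml.UnsafeLoader). Before PyYAML 5.1 the unsafe loader was the default; since PyYAML 6.0 the Loader argument is required, so you have to choose one explicitly. This is not a theoretical risk: it's a well-documented attack vector. Always use yaml.safe_load():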

python

import yaml

# DANGEROUS: never use an unsafe loader with untrusted input
with open('config.yaml', encoding='utf-8') as f:
    data = yaml.load(f, Loader=yaml.UnsafeLoader)

# SAFE: use this for any YAML from external sources
with open('config.yaml', encoding='utf-8') as f:
    data = yaml.safe_load(f)

# The attack: a YAML file could contain this line, which an unsafe loader
# turns into a call to os.system
# !!python/object/apply:os.system ["rm -rf /important-dir"]

Security note: yaml.safe_load() only supports standard YAML types: strings, numbers, booleans, null, lists, and dicts. It will raise a ConstructorError if the YAML contains Python-specific tags like !!python/object. This is exactly the behaviour you want. yaml.full_load() is more restrictive than an unsafe loader but still less restrictive than safe_load(), and FullLoader had arbitrary-code-execution bugs of its own in PyYAML releases before 5.4. Start with safe_load() and only move to a less restrictive loader if you genuinely need it.
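To see that rejection in practice, here's a minimal sketch that feeds a string carrying a Python-specific tag to yaml.safe_load() and catches the error. The payload and the variable names are illustrative assumptions; the embedded command is a harmless echo, and safe_load() never runs it.

```python
import yaml

# Hypothetical payload: the tag asks the loader to call os.system.
payload = '!!python/object/apply:os.system ["echo pwned"]'

try:
    yaml.safe_load(payload)
    rejected = False
except yaml.constructor.ConstructorError as exc:
    # safe_load has no constructor registered for python/* tags,
    # so it refuses the document instead of calling anything.
    rejected = True
    print(f"Rejected: {exc.problem}")
```

ConstructorError is a subclass of yaml.YAMLError, so a generic YAML error handler will also catch it.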

Loading a YAML Config File

Here's a realistic config loading pattern for a web application. We load a YAML config file and use a small recursive merge helper to fill in defaults for anything not specified:

yaml

# config.yaml
database:
  host: postgres.internal
  port: 5432
  name: myapp_prod
  pool_size: 10

redis:
  host: redis.internal
  port: 6379

logging:
  level: INFO
  format: json

python

import copy
import yaml
from pathlib import Path
from typing import Any

DEFAULT_CONFIG = {
    'database': {
        'host': 'localhost',
        'port': 5432,
        'name': 'myapp',
        'pool_size': 5,
        'ssl': False,
    },
    'redis': {
        'host': 'localhost',
        'port': 6379,
        'db': 0,
    },
    'logging': {
        'level': 'DEBUG',
        'format': 'text',
    }
}

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into base, returning a new dict."""
    # Deep copy so the result never shares nested dicts with base
    result = copy.deepcopy(base)
    for key, value in override.items():
        if key in result and isinstance(result[key], dict) and isinstance(value, dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

def load_config(config_path: str | Path) -> dict[str, Any]:
    path = Path(config_path)
    if not path.exists():
        raise FileNotFoundError(f"Config file not found: {path}")

    with path.open('r', encoding='utf-8') as f:
        raw = yaml.safe_load(f)

    if raw is None:
        # Empty file: return an independent copy so callers can't mutate the defaults
        return copy.deepcopy(DEFAULT_CONFIG)

    return deep_merge(DEFAULT_CONFIG, raw)


config = load_config('config.yaml')
print(config['database']['host'])       # postgres.internal
print(config['database']['ssl'])        # False  (from defaults)
print(config['redis']['db'])            # 0  (from defaults)
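Why a recursive merge rather than a plain dict union? The sketch below compares the two on a tiny nested config. The helper name merge_nested and the sample values are assumptions made for illustration, not part of PyYAML.

```python
defaults = {"database": {"host": "localhost", "port": 5432, "ssl": False}}
overrides = {"database": {"host": "postgres.internal"}}

# Shallow merge: the whole nested "database" dict is replaced,
# so the default port and ssl settings are lost.
shallow = {**defaults, **overrides}

def merge_nested(base: dict, override: dict) -> dict:
    """Illustrative recursive merge (hypothetical helper name)."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = merge_nested(merged[key], value)
        else:
            merged[key] = value
    return merged

deep = merge_nested(defaults, overrides)

print(shallow["database"])  # {'host': 'postgres.internal'}
print(deep["database"])     # {'host': 'postgres.internal', 'port': 5432, 'ssl': False}
```

With the shallow version, a config file that overrides a single key silently drops every other default in that section.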

Dumping Python Objects to YAML

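yaml.dump() serialises Python dicts, lists, strings, numbers, booleans, and None to YAML. Since PyYAML 5.1 the default output is the readable block style; older versions used flow style (inline braces and brackets) for leaf collections, so pass default_flow_style=False explicitly if you need the same output across versions. Keys are sorted alphabetically unless you pass sort_keys=False: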

python

import yaml
from dataclasses import dataclass, asdict

@dataclass
class ServiceConfig:
    name: str
    replicas: int
    image: str
    port: int
    tags: list[str]

service = ServiceConfig(
    name='payment-api',
    replicas=3,
    image='payment-api:2.4.1',
    port=8080,
    tags=['payments', 'backend', 'critical']
)

# Convert dataclass to dict first, then dump to YAML
output = yaml.dump(
    asdict(service),
    default_flow_style=False,
    sort_keys=False,            # preserve insertion order
    allow_unicode=True
)
print(output)
# name: payment-api
# replicas: 3
# image: payment-api:2.4.1
# port: 8080
# tags:
# - payments
# - backend
# - critical

# Write to file
with open('service-config.yaml', 'w', encoding='utf-8') as f:
    yaml.dump(asdict(service), f, default_flow_style=False, sort_keys=False)
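Converting to a plain dict first also lets you use yaml.safe_dump(), the output-side counterpart of safe_load(), which only emits standard YAML types. The sketch below assumes a hypothetical Endpoint dataclass and shows safe_dump() refusing the raw object but accepting its dict form.

```python
import yaml
from dataclasses import dataclass, asdict

@dataclass
class Endpoint:  # hypothetical example class
    host: str
    port: int

endpoint = Endpoint(host="api.internal", port=8443)

try:
    yaml.safe_dump(endpoint)
    refused = False
except yaml.representer.RepresenterError:
    # safe_dump only knows plain types, so it refuses the raw dataclass
    refused = True

# The dict form contains only strings and ints, so it dumps cleanly
text = yaml.safe_dump(asdict(endpoint), sort_keys=False)
print(text)
```

Output produced this way contains no Python-specific tags, so it can always be read back with yaml.safe_load().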

Multi-Document Streams with safe_load_all

YAML supports multiple documents in a single file, separated by ---. This is common in Kubernetes manifests where a single file might contain a Deployment, a Service, and a ConfigMap. Use yaml.safe_load_all() to iterate over all documents:

python

import yaml

# manifests.yaml contains multiple Kubernetes resources separated by ---
with open('manifests.yaml', 'r') as f:
    # safe_load_all returns a generator
    documents = list(yaml.safe_load_all(f))

for doc in documents:
    if doc is None:
        continue
    kind = doc.get('kind', 'Unknown')
    name = doc.get('metadata', {}).get('name', 'unnamed')
    print(f"{kind}: {name}")

# Deployment: payment-api
# Service: payment-api-svc
# ConfigMap: payment-api-config
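Because safe_load_all() returns a generator, you can also filter documents as they stream past instead of building the full list. A minimal sketch, using an inline string in place of manifests.yaml (the manifest contents here are assumed for illustration):

```python
import yaml

stream = """\
kind: Deployment
metadata:
  name: payment-api
---
kind: Service
metadata:
  name: payment-api-svc
---
kind: ConfigMap
metadata:
  name: payment-api-config
"""

# Keep only the names of Service resources; empty documents load as None
services = [
    doc["metadata"]["name"]
    for doc in yaml.safe_load_all(stream)
    if doc and doc.get("kind") == "Service"
]
print(services)  # ['payment-api-svc']
```

safe_load_all() accepts a string as well as an open file, which is handy in tests.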

You can also write multiple documents to a stream with yaml.dump_all():

python

import yaml

documents = [
    {'kind': 'Deployment', 'metadata': {'name': 'api'}, 'spec': {'replicas': 2}},
    {'kind': 'Service', 'metadata': {'name': 'api-svc'}, 'spec': {'port': 80}},
]

output = yaml.dump_all(documents, default_flow_style=False)
print(output)
# kind: Deployment
# metadata:
#   name: api
# spec:
#   replicas: 2
# ---
# kind: Service
# metadata:
#   name: api-svc
# spec:
#   port: 80
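If you want every document, including the first, to start with an explicit --- marker, dump_all() and safe_dump_all() take an explicit_start flag. Here's a small round-trip sketch; the sample documents are assumptions for illustration.

```python
import yaml

documents = [
    {"kind": "Deployment", "metadata": {"name": "api"}},
    {"kind": "Service", "metadata": {"name": "api-svc"}},
]

# explicit_start=True writes '---' before every document, including the first
text = yaml.safe_dump_all(documents, explicit_start=True, sort_keys=False)

# Reading the stream back yields the same documents in the same order
restored = list(yaml.safe_load_all(text))
print(restored == documents)  # True
```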

ruamel.yaml — When You Need to Preserve Comments

PyYAML has one significant limitation: it strips comments when loading. If you load a YAML file, modify it, and write it back, all comments are gone. For config files that humans maintain, losing comments is a deal-breaker.

ruamel.yaml implements a round-trip parser that preserves comments, key order, and formatting — it targets the YAML 1.2 spec by default. It's the right choice whenever you're programmatically editing YAML that humans will read afterwards:

python

from ruamel.yaml import YAML

yaml = YAML()
yaml.preserve_quotes = True

# This config.yaml has important comments we need to keep:
# database:
#   host: localhost  # change this for production
#   port: 5432       # default PostgreSQL port
#   pool_size: 5     # increase under heavy load

with open('config.yaml', 'r') as f:
    config = yaml.load(f)

# Modify a value
config['database']['host'] = 'postgres.prod.internal'
config['database']['pool_size'] = 20

# Write back — comments and formatting are preserved!
with open('config.yaml', 'w') as f:
    yaml.dump(config, f)

# Result (comments are kept; their column alignment may differ slightly):
# database:
#   host: postgres.prod.internal  # change this for production
#   port: 5432                    # default PostgreSQL port
#   pool_size: 20                 # increase under heavy load

Use PyYAML when you're reading YAML for consumption — parsing config into your app, loading test fixtures, processing Kubernetes manifests programmatically.
Use ruamel.yaml when you're editing YAML that humans maintain — updating config files in place, tooling that modifies CI configs, anything where losing comments would be bad.
ruamel.yaml is also YAML 1.2 compliant by default, which means the Norway Problem (NO → false) doesn't affect it. PyYAML implements YAML 1.1, where unquoted values such as NO, yes, on, and off load as booleans.
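The Norway Problem is easy to reproduce with PyYAML. This short sketch (the country key is just an illustrative assumption) shows the unquoted value turning into a boolean and the quoted one staying a string:

```python
import yaml

# YAML 1.1 treats the unquoted scalar NO as a boolean
unquoted = yaml.safe_load("country: NO")

# Quoting the value forces it to load as a string
quoted = yaml.safe_load('country: "NO"')

print(unquoted)  # {'country': False}
print(quoted)    # {'country': 'NO'}
```

If you stay on PyYAML, quoting ambiguous values in the file is the simplest workaround.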

Error Handling

YAML parse errors raise subclasses of yaml.YAMLError, the base class for all PyYAML exceptions. Always catch it when loading YAML from untrusted or user-provided sources:

python

import yaml

def load_user_config(path: str) -> dict:
    try:
        with open(path, 'r', encoding='utf-8') as f:
            data = yaml.safe_load(f)
    except FileNotFoundError as e:
        raise FileNotFoundError(f"Config file not found: {path}") from e
    except yaml.scanner.ScannerError as e:
        # Includes line/column info in the error message
        raise ValueError(f"YAML syntax error in {path}:\n{e}") from e
    except yaml.YAMLError as e:
        raise ValueError(f"Invalid YAML in {path}: {e}") from e

    if data is None:
        return {}
    if not isinstance(data, dict):
        raise TypeError(f"Expected a YAML mapping at top level, got {type(data).__name__}")

    return data
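If you'd rather build your own message than embed PyYAML's, the position is available on the exception itself. A minimal sketch, assuming a deliberately broken input string: syntax errors derive from yaml.MarkedYAMLError, whose problem_mark holds zero-based line and column numbers.

```python
import yaml

# The over-indented second line makes "port: 8080" invalid here
broken = "name: api\n  port: 8080\n"

location = None
try:
    yaml.safe_load(broken)
except yaml.MarkedYAMLError as exc:
    mark = exc.problem_mark
    if mark is not None:
        # Marks are zero-based; add 1 for editor-style positions
        location = (mark.line + 1, mark.column + 1)
        print(f"{exc.problem} at line {location[0]}, column {location[1]}")
```

This is useful when you want to point a user at the exact spot in their file from a CLI or web form.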

Validate structure after loading. yaml.safe_load() only guarantees valid YAML syntax — it doesn't validate the shape of the data. A config file with database: at the top is fine; a config file with a list at the top is also valid YAML. Add type and structure checks after loading, or use a schema validation library like Pydantic to parse the loaded dict into a typed model.
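As one possible shape for those checks, here's a standard-library sketch that turns the loaded dict into a typed object. The DatabaseSettings class and parse_database function are hypothetical names for illustration; a library like Pydantic would replace the hand-written checks.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DatabaseSettings:  # hypothetical typed model
    host: str
    port: int

def parse_database(raw: dict) -> DatabaseSettings:
    """Validate the 'database' section of an already-loaded config dict."""
    section = raw.get("database")
    if not isinstance(section, dict):
        raise ValueError("'database' must be a mapping")
    host, port = section.get("host"), section.get("port")
    if not isinstance(host, str):
        raise ValueError("'database.host' must be a string")
    # bool is a subclass of int, so exclude it explicitly
    if not isinstance(port, int) or isinstance(port, bool):
        raise ValueError("'database.port' must be an integer")
    return DatabaseSettings(host=host, port=port)

settings = parse_database({"database": {"host": "postgres.internal", "port": 5432}})
print(settings)  # DatabaseSettings(host='postgres.internal', port=5432)
```

Failing early with a message that names the offending key is usually friendlier than a KeyError deep inside the application.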

Wrapping Up

PyYAML covers the vast majority of YAML work in Python: always use yaml.safe_load() (not yaml.load()), use yaml.safe_load_all() for multi-document streams, and use yaml.dump() with default_flow_style=False and sort_keys=False for readable output. When you need to preserve comments or get YAML 1.2 semantics, switch to ruamel.yaml. Its API is different (you create a YAML() instance and call its load() and dump() methods), but the workflow carries over. For syntax errors before your code even runs, the YAML Validator will tell you exactly which line and column is broken.
