How We Handle Errors in 137Foundry Data Projects

Error handling is one of those things that looks like plumbing -- boring and invisible until it breaks. At 137Foundry, we have a consistent approach to exception handling that we apply to every Python data pipeline and web application we build. This post describes what that approach looks like and why we settled on it.

We Start With a Custom Exception Hierarchy

Every project that has a service layer gets a custom exception hierarchy in the first commit. Not at refactor time, not after the first incident -- in the initial setup.

The hierarchy follows a predictable shape:

    class ProjectError(Exception):
        """Base for all application-layer failures in this project."""
        pass


    class RetryableError(ProjectError):
        """The operation failed but may succeed on retry."""
        pass


    class PermanentError(ProjectError):
        """The operation will not succeed on retry. Escalate."""
        pass


    class ValidationError(PermanentError):
        """Input data failed validation rules."""

        def __init__(self, message, field=None, value=None):
            super().__init__(message)
            self.field = field
            self.value = value

The motivation for starting here is that it forces a conversation early about which failure modes exist and which require different responses. A RetryableError and a ValidationError need completely different handling. If you do not make the distinction in the type system, you make it in scattered if/else logic that is harder to maintain and easier to get wrong.
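As a minimal sketch of what "making the distinction in the type system" buys, the block below lets a caller pick its response from the exception type alone. The `respond` function, its return strings, and the two failing operations are hypothetical and exist only for illustration; the hierarchy is repeated so the block runs on its own.

```python
class ProjectError(Exception):
    """Base for all application-layer failures in this project."""


class RetryableError(ProjectError):
    """The operation failed but may succeed on retry."""


class PermanentError(ProjectError):
    """The operation will not succeed on retry. Escalate."""


class ValidationError(PermanentError):
    """Input data failed validation rules."""

    def __init__(self, message, field=None, value=None):
        super().__init__(message)
        self.field = field
        self.value = value


def respond(operation):
    """Hypothetical caller: chooses a response from the exception type."""
    try:
        return operation()
    except RetryableError:
        return "retry"
    except ValidationError as err:
        # Listed before PermanentError so the more specific handler wins.
        return f"reject field {err.field!r}"
    except PermanentError:
        return "escalate"


def bad_amount():
    raise ValidationError("amount must be numeric", field="amount", value="not_a_number")


def db_down():
    raise RetryableError("Database temporarily unavailable")


print(respond(bad_amount))  # reject field 'amount'
print(respond(db_down))     # retry
```

No string matching on error messages is needed: the handler order and the class relationships carry the decision.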

We Use log.exception() at Service Boundaries

Every service boundary in our projects -- API handlers, background job entry points, CLI commands -- has a consistent logging pattern for caught exceptions:

    try:
        result = service.process(request_data)
    except ProjectError:
        log.exception("Service operation failed for request %s", request.id)
        raise

log.exception() logs at ERROR level and attaches the full traceback of the exception currently being handled, so it only makes sense inside an except block. We do not use log.error("Failed: %s", err) at boundaries because the error message alone rarely contains enough information to diagnose the problem. The traceback tells us the exact call chain and the exact line. Without it, an incident investigation starts from scratch.

We re-raise after logging so that callers at higher levels get the exception and can respond to it with their own logic -- returning an HTTP 500, scheduling a retry, or escalating to an alert.
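The difference between the two logging calls is easy to check with the standard library alone. This is a small sketch, not our production setup: the `capture` helper and the logger name are made up for the demonstration, and a failing `int()` call stands in for a service error.

```python
import io
import logging


def capture(log_call):
    """Run log_call inside an except block and return the text it logged."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    log = logging.getLogger("boundary_demo")
    log.addHandler(handler)
    log.propagate = False  # keep the demo output out of the root logger
    try:
        try:
            int("not_a_number")
        except ValueError as err:
            log_call(log, err)
    finally:
        log.removeHandler(handler)
    return stream.getvalue()


with_error = capture(lambda log, err: log.error("Failed: %s", err))
with_exception = capture(lambda log, err: log.exception("Failed: %s", err))

print("Traceback" in with_error)      # False: message only
print("Traceback" in with_exception)  # True: message plus full traceback
```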

We Preserve Tracebacks With Chaining

When we catch a low-level exception and raise a domain exception, we always use raise X from Y:

    try:
        raw = db.query(sql)
    except DatabaseConnectionError as err:
        raise RetryableError("Database temporarily unavailable") from err

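The from err clause stores the original DatabaseConnectionError on the new exception as __cause__, and the traceback prints both, joined by the line "The above exception was the direct cause of the following exception". When a developer reads the log, they see both exceptions and can trace the failure back to its source. Raising inside an except block without from still records the original implicitly as __context__, but the traceback then reads "During handling of the above exception, another exception occurred", which looks like a bug in the handler rather than a deliberate translation. The original disappears from the traceback only if the new exception is raised with from None or after the except block has exited, and in that case the RetryableError alone is much less useful for diagnosis.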
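What from err records can be inspected directly on the raised exception. The sketch below uses the built-in ConnectionError as a stand-in for DatabaseConnectionError and a local RetryableError; it compares an explicit from err with a bare raise inside the handler, which Python links only implicitly through __context__.

```python
import traceback


class RetryableError(Exception):
    """Stand-in for the project's RetryableError."""


def explicit_chain():
    try:
        raise ConnectionError("connection refused")
    except ConnectionError as err:
        raise RetryableError("Database temporarily unavailable") from err


def implicit_chain():
    try:
        raise ConnectionError("connection refused")
    except ConnectionError:
        raise RetryableError("Database temporarily unavailable")


def caught(func):
    """Call func and return the RetryableError it raises."""
    try:
        func()
    except RetryableError as err:
        return err


def render(exc):
    """Format an exception the way the logging module would print it."""
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))


explicit = caught(explicit_chain)
implicit = caught(implicit_chain)

print(type(explicit.__cause__).__name__)    # ConnectionError
print(implicit.__cause__)                   # None
print(type(implicit.__context__).__name__)  # ConnectionError
print("direct cause" in render(explicit))     # True
print("During handling" in render(implicit))  # True
```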

The Python documentation at docs.python.org covers exception chaining, and PEP 3134 on peps.python.org explains the design intent behind __cause__ and __context__.

We Write Tests for Failure Paths First

In data projects, failure path tests are often the ones that prevent the most expensive bugs. A test that verifies the service raises RetryableError on a database timeout ensures that the retry logic in the background job processor actually fires on timeout. Without that test, a change that accidentally swallows the timeout and returns None instead would not be caught until a timeout caused silent data loss in production.

Our standard pattern for failure path tests uses pytest and the mocker fixture from the pytest-mock plugin:

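    import pytest

    # Import paths are illustrative; adjust them to the project's layout.
    from project import service
    from project.errors import RetryableError, ValidationError


    def test_raises_retryable_on_db_timeout(mocker):
        batch_id = 42  # any batch id works; the query is mocked
        mocker.patch("project.db.query", side_effect=TimeoutError)
        with pytest.raises(RetryableError):
            service.fetch_records(batch_id)


    def test_validation_error_carries_field_name():
        with pytest.raises(ValidationError) as exc_info:
            service.process_record({"amount": "not_a_number"})
        assert exc_info.value.field == "amount"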

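The pytest documentation at docs.pytest.org covers pytest.raises and its match parameter for asserting on exception messages. mocker.patch comes from the pytest-mock plugin, which wraps the standard library's unittest.mock.patch, so it is documented with that plugin rather than in the pytest docs.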
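For readers who want to try the idea without installing anything, the same failure-path check can be sketched with only the standard library. This swaps mocker.patch for unittest.mock.patch.object and pytest.raises for a plain try/except; FakeDb, fetch_records, and the local RetryableError are hypothetical stand-ins, not our service code.

```python
from unittest import mock


class RetryableError(Exception):
    """Stand-in for the project's RetryableError."""


class FakeDb:
    """Hypothetical database client."""

    def query(self, sql, params=()):
        return [("row",)]


db = FakeDb()


def fetch_records(batch_id):
    """Hypothetical service function: translates timeouts for the caller."""
    try:
        return db.query("SELECT * FROM records WHERE batch_id = %s", (batch_id,))
    except TimeoutError as err:
        raise RetryableError("Database timed out") from err


def test_raises_retryable_on_db_timeout():
    # Force every query to time out for the duration of the block.
    with mock.patch.object(db, "query", side_effect=TimeoutError):
        try:
            fetch_records(42)
        except RetryableError as err:
            assert isinstance(err.__cause__, TimeoutError)
        else:
            raise AssertionError("timeout was swallowed instead of raised")


test_raises_retryable_on_db_timeout()
print("failure path covered")
```

If someone later changes fetch_records to catch the timeout and return None, the else branch fails the test immediately, which is exactly the regression the section describes.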

How We Handle Failures in Background Jobs

Background job processors have different error handling requirements than API endpoints. A web request must respond within a timeout; a job can fail and be retried. But the failure must be recorded and the retry must actually happen -- silent failures are just as dangerous in a job queue as in an API.

Our background job processors follow a consistent structure:

The job function catches RetryableError and marks the job for retry with appropriate backoff.

The job function catches PermanentError and marks the job as permanently failed, triggering an alert.

Any unexpected exception propagates to the job runner's outer handler, which marks it as failed and sends an immediate alert.

    def process_job(job):
        try:
            result = service.process(job.data)
            job.mark_complete(result)
        except RetryableError as err:
            delay = min(2 ** job.attempt_count, 3600)
            job.schedule_retry(delay=delay)
            log.warning("Retryable failure on job %s, retry in %ds: %s", job.id, delay, err)
        except PermanentError:
            log.exception("Permanent failure on job %s", job.id)
            job.mark_failed()
            alerts.send(f"Job {job.id} permanently failed")

The except PermanentError handler does not catch RetryableError because neither class derives from the other; it does catch ValidationError, which subclasses PermanentError. Adding a new failure mode under either branch of the hierarchy does not require changing these handlers unless the new type needs a distinct response.
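To see all three paths run, including the outer handler that the job function itself does not show, here is a self-contained sketch. FakeJob, run_job, the alerts_sent list, and the three failing work functions are hypothetical; a real queue would persist job state and send real alerts. The backoff is the same min(2 ** attempt_count, 3600) rule, which reaches the one-hour cap at attempt 12.

```python
import logging

log = logging.getLogger("jobs_demo")
log.addHandler(logging.NullHandler())  # keep the demo quiet

alerts_sent = []  # stand-in for a real alerting client


class ProjectError(Exception):
    pass


class RetryableError(ProjectError):
    pass


class PermanentError(ProjectError):
    pass


class FakeJob:
    """Hypothetical in-memory job."""

    def __init__(self, job_id, work, attempt_count=0):
        self.id = job_id
        self.work = work
        self.attempt_count = attempt_count
        self.status = "pending"
        self.retry_delay = None

    def mark_complete(self, result):
        self.status = "complete"

    def schedule_retry(self, delay):
        self.status = "retry"
        self.retry_delay = delay

    def mark_failed(self):
        self.status = "failed"


def process_job(job):
    try:
        job.mark_complete(job.work())
    except RetryableError:
        delay = min(2 ** job.attempt_count, 3600)  # exponential, capped at 1 hour
        job.schedule_retry(delay=delay)
    except PermanentError:
        log.exception("Permanent failure on job %s", job.id)
        job.mark_failed()
        alerts_sent.append(f"Job {job.id} permanently failed")


def run_job(job):
    """Hypothetical outer handler in the job runner."""
    try:
        process_job(job)
    except Exception:
        log.exception("Unexpected failure on job %s", job.id)
        job.mark_failed()
        alerts_sent.append(f"Job {job.id} crashed unexpectedly")


def flaky():
    raise RetryableError("Database temporarily unavailable")


def broken():
    raise PermanentError("Schema mismatch")


def buggy():
    return 1 / 0  # not a ProjectError, so process_job does not catch it


early = FakeJob("early", flaky, attempt_count=3)
late = FakeJob("late", flaky, attempt_count=12)
permanent = FakeJob("permanent", broken)
crashed = FakeJob("crashed", buggy)
for job in (early, late, permanent, crashed):
    run_job(job)

print(early.status, early.retry_delay)  # retry 8
print(late.status, late.retry_delay)    # retry 3600
print(permanent.status, crashed.status) # failed failed
print(len(alerts_sent))                 # 2
```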

The Value of Consistent Patterns Across a Codebase

Exception handling inconsistency compounds as a codebase grows. When some services raise ValueError, others raise Exception("message"), some return None and log an error, and some return (None, error_message) tuples, adding a new feature requires understanding the error contract for every function you call, because none of them are the same.

The consistent approach described here is not complex. Each piece is a few lines. The value accumulates because every developer working on the codebase -- including developers joining the project later -- can read any service and immediately understand its failure modes. The exception type tells them whether to retry, escalate, or discard. The logging tells them where to look. The test tells them the retry logic actually fires.

We have found this consistency to be particularly valuable during handoffs. When we finish a project and transfer it to a client's internal team, the new developers spend far less time understanding the error handling because it follows the same pattern everywhere. That is the return on the investment in establishing the convention at project start.

We Apply the Same Pattern to All Project Types

The same exception hierarchy and logging pattern applies whether we are building a REST API, a batch data pipeline, or a scheduled automation script. The boundary changes -- the API handler is the boundary in a web app, the job entry function is the boundary in a pipeline -- but the design is the same.
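One way to keep the boundary behaviour identical across entry points is to wrap it once and reuse it. The decorator below is a sketch under that assumption, not a description of our internal tooling: service_boundary, ListHandler, and the two decorated functions are hypothetical names.

```python
import functools
import logging


class ProjectError(Exception):
    """Stand-in for the project's base exception."""


class ListHandler(logging.Handler):
    """Collects log records in memory so the demo can inspect them."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


handler = ListHandler()
log = logging.getLogger("shared_boundary_demo")
log.addHandler(handler)
log.propagate = False


def service_boundary(name):
    """Hypothetical decorator: log with traceback, then re-raise."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except ProjectError:
                log.exception("%s failed", name)
                raise

        return wrapper

    return decorator


@service_boundary("api.create_order")
def create_order(payload):
    raise ProjectError("order rejected")


@service_boundary("job.nightly_sync")
def nightly_sync():
    return "ok"


try:
    create_order({})
except ProjectError:
    print("re-raised to the caller")

print(handler.records[0].getMessage())  # api.create_order failed
print(nightly_sync())                   # ok
```

The API handler and the job entry point share the same logging and re-raise behaviour; only the name passed to the decorator differs.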

This is what makes the error handling predictable throughout a project, whatever its type. Every exception type communicates a specific failure mode. Every boundary logs with log.exception(). Every exception that crosses from one service to another is chained with raise X from Y.

The full reference collection of Python error handling code snippets -- covering try-except-else-finally, exception chaining, contextlib.suppress, and testing patterns -- is on the 137Foundry blog. It is the reference we send to developers starting on a new project.

One thing we have consistently observed: the conversations that exception handling forces -- which failure modes exist, which require retries, which require escalation -- are valuable design conversations in their own right. Starting with the exception hierarchy is often the clearest way to make implicit failure assumptions explicit before writing the rest of the service logic. The exception types become documentation of what the service can fail with, and that documentation lives in the code where it remains accurate as the code changes.

#python #programming #software-development #data-engineering #best-practices


