How We Handle Errors in 137Foundry Data Projects
Error handling is one of those things that looks like plumbing -- boring and invisible until it breaks. At 137Foundry, we have a consistent approach to exception handling that we apply to every Python data pipeline and web application we build. This post describes what that approach looks like and why we settled on it.
We Start With a Custom Exception Hierarchy
Every project that has a service layer gets a custom exception hierarchy in the first commit. Not at refactor time, not after the first incident -- in the initial setup.
The hierarchy follows a predictable shape:
    class ProjectError(Exception):
        """Base for all application-layer failures in this project."""


    class RetryableError(ProjectError):
        """The operation failed but may succeed on retry."""


    class PermanentError(ProjectError):
        """The operation will not succeed on retry. Escalate."""


    class ValidationError(PermanentError):
        """Input data failed validation rules."""

        def __init__(self, message, field=None, value=None):
            super().__init__(message)
            self.field = field
            self.value = value
The motivation for starting here is that it forces a conversation early about which failure modes exist and which require different responses. A RetryableError and a ValidationError need completely different handling. If you do not make the distinction in the type system, you make it in scattered if/else logic that is harder to maintain and easier to get wrong.
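To make the point about the type system concrete, here is a minimal sketch of a caller that picks its response purely from the exception type. The hierarchy is the one above; `respond_to`, `failing_with`, and the action strings are hypothetical names used only for illustration.

```python
class ProjectError(Exception):
    """Base for all application-layer failures in this project."""


class RetryableError(ProjectError):
    """The operation failed but may succeed on retry."""


class PermanentError(ProjectError):
    """The operation will not succeed on retry. Escalate."""


class ValidationError(PermanentError):
    """Input data failed validation rules."""

    def __init__(self, message, field=None, value=None):
        super().__init__(message)
        self.field = field
        self.value = value


def respond_to(operation):
    """Run operation() and return the action a caller would take.

    The action strings are hypothetical labels for this sketch.
    """
    try:
        operation()
    except RetryableError:
        return "retry"
    except ValidationError as err:
        # Listed before PermanentError: the subclass clause must come first,
        # otherwise the parent clause below would catch it.
        return f"discard (bad field: {err.field})"
    except PermanentError:
        return "escalate"
    return "ok"


def failing_with(exc):
    """Build a zero-argument callable that raises exc (illustration helper)."""
    def operation():
        raise exc
    return operation


print(respond_to(failing_with(RetryableError("db down"))))
print(respond_to(failing_with(ValidationError("bad amount", field="amount"))))
```

The decision lives in the except clauses, so no caller has to inspect error messages or status flags to work out what went wrong.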
We Use log.exception() at Service Boundaries
Every service boundary in our projects -- API handlers, background job entry points, CLI commands -- has a consistent logging pattern for caught exceptions:
    try:
        result = service.process(request_data)
    except ProjectError:
        log.exception("Service operation failed for request %s", request.id)
        raise
log.exception() logs at ERROR level and, when called from inside an except block, attaches the full traceback automatically. We do not use log.error("Failed: %s", err) at boundaries because the error message alone rarely contains enough information to diagnose the problem. The traceback tells us the exact call chain and the exact line. Without it, an incident investigation starts from scratch.
We re-raise after logging so that callers at higher levels get the exception and can respond to it with their own logic -- returning an HTTP 500, scheduling a retry, or escalating to an alert.
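The difference between the two logging calls is easy to see side by side. The sketch below uses only the standard library and captures log output in memory; `parse_amount`, `capture`, and the request id `"req-1"` are hypothetical stand-ins, not part of any real project.

```python
import io
import logging


def parse_amount(raw):
    """Stand-in for service code that fails on bad input."""
    return int(raw)


def capture(log_call):
    """Trigger a failure, log it with log_call(log, err), return the log text."""
    stream = io.StringIO()
    handler = logging.StreamHandler(stream)
    log = logging.getLogger(f"demo.{log_call.__name__}")
    log.propagate = False  # keep the demo output out of the root logger
    log.setLevel(logging.DEBUG)
    log.addHandler(handler)
    try:
        parse_amount("not_a_number")
    except ValueError as err:
        log_call(log, err)
    finally:
        log.removeHandler(handler)
    return stream.getvalue()


def with_error(log, err):
    # Message only: the call chain is not recorded.
    log.error("Failed: %s", err)


def with_exception(log, err):
    # Same ERROR level, plus the active traceback.
    log.exception("Service operation failed for request %s", "req-1")


error_output = capture(with_error)
exception_output = capture(with_exception)
print(error_output)
print(exception_output)
```

The first output is a single line carrying the exception message. The second also includes the traceback, with the frame for `parse_amount` where the failure started.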
We Preserve Tracebacks With Chaining
When we catch a low-level exception and raise a domain exception, we always use raise X from Y:
    try:
        raw = db.query(sql)
    except DatabaseConnectionError as err:
        raise RetryableError("Database temporarily unavailable") from err
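The from err clause stores the original DatabaseConnectionError on the new exception as __cause__, so the traceback prints both, joined by "The above exception was the direct cause of the following exception". A developer reading the log can trace the failure back to its source. Without from err, Python 3 still attaches the original as implicit context (__context__), but the traceback then reads "During handling of the above exception, another exception occurred", which looks like a bug inside the handler rather than a deliberate translation. The original drops out of the traceback only when the new exception is raised with from None or after the except block has ended.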
The Python documentation at docs.python.org covers exception chaining, and PEP 3134 on peps.python.org explains the design intent behind __cause__ and __context__.
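A short sketch of what chaining changes on the exception object and in the formatted traceback. `DatabaseConnectionError` here is a local stand-in for whatever the database driver raises, and `query`, `fetch_chained`, `fetch_unchained`, and `capture` are hypothetical helpers.

```python
import traceback


class RetryableError(Exception):
    """Stand-in for the project's RetryableError."""


class DatabaseConnectionError(Exception):
    """Stand-in for a driver-level connection failure."""


def query():
    raise DatabaseConnectionError("connection refused")


def fetch_chained():
    try:
        query()
    except DatabaseConnectionError as err:
        # Explicit chaining: sets __cause__ on the new exception.
        raise RetryableError("Database temporarily unavailable") from err


def fetch_unchained():
    try:
        query()
    except DatabaseConnectionError:
        # Implicit chaining: only __context__ is set.
        raise RetryableError("Database temporarily unavailable")


def capture(func):
    """Call func and return the raised exception with its formatted traceback."""
    try:
        func()
    except RetryableError as exc:
        text = "".join(
            traceback.format_exception(type(exc), exc, exc.__traceback__)
        )
        return exc, text


chained, chained_text = capture(fetch_chained)
unchained, unchained_text = capture(fetch_unchained)
print(chained_text)
print(unchained_text)
```

Both tracebacks include the DatabaseConnectionError. Only the chained one labels it as the direct cause, and only the chained exception has a non-None `__cause__` that code can inspect.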
We Write Tests for Failure Paths First
In data projects, failure path tests are often the ones that prevent the most expensive bugs. A test that verifies the service raises RetryableError on database timeout ensures that the retry logic in the background job processor actually fires on timeout. Without that test, a change that accidentally swallows the timeout and returns None instead would go unnoticed until a timeout caused silent data loss in production.
Our standard pattern for failure path tests uses pytest and the mocker fixture from the pytest-mock plugin:
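    import pytest

    # Assumed import paths; adjust them to the project's actual layout.
    from project import service
    from project.errors import RetryableError, ValidationError


    def test_raises_retryable_on_db_timeout(mocker):
        # Simulate the database call timing out.
        mocker.patch("project.db.query", side_effect=TimeoutError)
        batch_id = 42  # illustrative value; any batch id works here
        with pytest.raises(RetryableError):
            service.fetch_records(batch_id)


    def test_validation_error_carries_field_name():
        with pytest.raises(ValidationError) as exc_info:
            service.process_record({"amount": "not_a_number"})
        # The exception should name the field that failed validation.
        assert exc_info.value.field == "amount"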
The pytest documentation at docs.pytest.org covers pytest.raises and its match parameter for asserting on exception messages. mocker.patch comes from the pytest-mock plugin, which has its own documentation and wraps unittest.mock.patch from the standard library.
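To show the regression such a test guards against, here is a sketch that runs without any plugins. It swaps in the standard library's unittest.mock for pytest-mock's mocker, and a plain try/except for pytest.raises. `fetch_records`, `fetch_records_swallowing`, and `raises_retryable_on_timeout` are hypothetical, as is the SQL string.

```python
from unittest import mock


class RetryableError(Exception):
    """Stand-in for the project's RetryableError."""


QUERY = "SELECT * FROM records WHERE batch_id = ?"  # illustrative SQL


def fetch_records(db, batch_id):
    """Hypothetical service function: translates a timeout into RetryableError."""
    try:
        return db.query(QUERY, (batch_id,))
    except TimeoutError as err:
        raise RetryableError(f"Timed out fetching batch {batch_id}") from err


def fetch_records_swallowing(db, batch_id):
    """The regression: the timeout is swallowed and None is returned."""
    try:
        return db.query(QUERY, (batch_id,))
    except TimeoutError:
        return None


def raises_retryable_on_timeout(fetch):
    """Return True if fetch turns a database timeout into RetryableError."""
    db = mock.Mock()
    # Every call to db.query now raises, simulating a timed-out database.
    db.query.side_effect = TimeoutError("query exceeded its time limit")
    try:
        fetch(db, batch_id=7)
    except RetryableError:
        return True
    return False


print(raises_retryable_on_timeout(fetch_records))
print(raises_retryable_on_timeout(fetch_records_swallowing))
```

The check passes for the correct version and fails for the swallowing one, which is the signal a failure path test exists to give. In a pytest suite the try/except collapses to `with pytest.raises(RetryableError):`.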
How We Handle Failures in Background Jobs
Background job processors have different error handling requirements than API endpoints. A web request must respond within a timeout; a job can fail and be retried. But the failure must be recorded and the retry must actually happen -- silent failures are just as dangerous in a job queue as in an API.
Our background job processors follow a consistent structure:
The job function catches RetryableError and marks the job for retry with appropriate backoff.
The job function catches PermanentError and marks the job as permanently failed, triggering an alert.
Any unexpected exception propagates to the job runner's outer handler, which marks it as failed and sends an immediate alert.
    def process_job(job):
        try:
            result = service.process(job.data)
            job.mark_complete(result)
        except RetryableError as err:
            # Exponential backoff in seconds, capped at one hour.
            delay = min(2 ** job.attempt_count, 3600)
            job.schedule_retry(delay=delay)
            log.warning("Retryable failure on job %s, retry in %ds: %s", job.id, delay, err)
        except PermanentError:
            log.exception("Permanent failure on job %s", job.id)
            job.mark_failed()
            alerts.send(f"Job {job.id} permanently failed")
The except PermanentError handler does not catch RetryableError because neither class is a subclass of the other; they are siblings under ProjectError. A new failure mode added as a subclass of either one is caught by the existing clause for its parent, so this handler changes only when the new type needs a distinct response.
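Two parts of this structure are worth sketching separately: the delay schedule the backoff expression produces, and the job runner's outer handler for unexpected exceptions. The sketch below assumes delays are in seconds. `backoff_delay`, `FakeJob`, `run_job`, and `buggy_process` are hypothetical names, and the alert is a plain callable rather than a real alerting client.

```python
import logging

log = logging.getLogger("jobs")
log.addHandler(logging.NullHandler())  # keep the sketch quiet when run directly


def backoff_delay(attempt_count, cap=3600):
    """Seconds before the next retry: doubles per attempt, capped at one hour."""
    return min(2 ** attempt_count, cap)


class FakeJob:
    """Minimal stand-in for a queue job."""

    def __init__(self, job_id):
        self.id = job_id
        self.status = "pending"

    def mark_failed(self):
        self.status = "failed"


def run_job(job, process_job, send_alert):
    """Outer handler: anything process_job did not classify is failed and alerted."""
    try:
        process_job(job)
    except Exception:
        log.exception("Unexpected failure on job %s", job.id)
        job.mark_failed()
        send_alert(f"Job {job.id} failed unexpectedly")


def buggy_process(job):
    # A programming error, not a ProjectError, so no inner handler catches it.
    raise KeyError("amount")


sent_alerts = []
job = FakeJob("job-1")
run_job(job, buggy_process, sent_alerts.append)

schedule = [backoff_delay(n) for n in range(14)]
print(schedule)
```

With a cap of 3600, the delay doubles from 1 second up to 2048 seconds at attempt 11 and stays at the cap from attempt 12 onward, since 2 ** 12 is 4096.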
The Value of Consistent Patterns Across a Codebase
Exception handling inconsistency compounds as a codebase grows. Suppose some services raise ValueError, others raise Exception("message"), some return None and log an error, and some return (None, error_message) tuples. Adding a new feature then requires learning the error contract of every function you call, because no two are the same.
The consistent approach described here is not complex. Each piece is a few lines. The value accumulates because every developer working on the codebase -- including developers joining the project later -- can read any service and immediately understand its failure modes. The exception type tells them whether to retry, escalate, or discard. The logging tells them where to look. The test tells them the retry logic actually fires.
We have found this consistency to be particularly valuable during handoffs. When we finish a project and transfer it to a client's internal team, the new developers spend far less time understanding the error handling because it follows the same pattern everywhere. That is the return on the investment in establishing the convention at project start.
We Apply the Same Pattern to All Project Types
The same exception hierarchy and logging pattern applies whether we are building a REST API, a batch data pipeline, or a scheduled automation script. The boundary changes -- the API handler is the boundary in a web app, the job entry function is the boundary in a pipeline -- but the design is the same.
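As an illustration of one design behind two boundaries, the sketch below puts an API-style handler and a CLI-style entry point in front of the same service function. It assumes these are the outermost boundaries, so they turn the exception into a response instead of re-raising. The status codes, exit codes, and the names `process`, `api_handler`, and `cli_main` are assumptions made for this example.

```python
import logging

log = logging.getLogger("boundary")
log.addHandler(logging.NullHandler())  # keep the sketch quiet when run directly


class ProjectError(Exception):
    """Base for all application-layer failures."""


class RetryableError(ProjectError):
    """May succeed on retry."""


class PermanentError(ProjectError):
    """Will not succeed on retry."""


def process(data):
    """Hypothetical service function shared by both boundaries."""
    if data.get("db_down"):
        raise RetryableError("Database temporarily unavailable")
    if "amount" not in data:
        raise PermanentError("Record has no amount")
    return {"amount": data["amount"]}


def api_handler(request_id, data):
    """Web boundary: returns (status, body). Status codes are an assumption."""
    try:
        return 200, process(data)
    except RetryableError:
        log.exception("Service operation failed for request %s", request_id)
        return 503, {"error": "temporarily unavailable"}
    except ProjectError:
        log.exception("Service operation failed for request %s", request_id)
        return 500, {"error": "internal error"}


def cli_main(data):
    """CLI boundary: returns a process exit code (0 success, 1 failure)."""
    try:
        process(data)
    except ProjectError:
        log.exception("Command failed")
        return 1
    return 0


print(api_handler("r1", {"amount": 5}))
print(api_handler("r2", {"db_down": True}))
print(cli_main({}))
```

Only the translation at the edge differs: an HTTP status in one case, an exit code in the other. The service function and the exception types are untouched.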
This consistency has a practical benefit at handoff: the error handling is predictable throughout. Every exception type communicates a specific failure mode. Every boundary logs with log.exception(). Every service-to-service exception is chained.
The full reference collection of Python error handling code snippets -- covering try-except-else-finally, exception chaining, contextlib.suppress, and testing patterns -- is on the 137Foundry blog. It is the reference we send to developers starting on a new project.
One thing we have consistently observed: the conversations that exception handling forces -- which failure modes exist, which require retries, which require escalation -- are valuable design conversations in their own right. Starting with the exception hierarchy is often the clearest way to make implicit failure assumptions explicit before writing the rest of the service logic. The exception types become documentation of what the service can fail with, and that documentation lives in the code where it remains accurate as the code changes.














