Python Error Recovery Patterns for Long-Running Scripts
Long-running scripts fail differently than web requests. A web request fails in milliseconds and the next request starts fresh. A script that processes 50,000 records fails hours in, and the question is not just what went wrong but how much work was lost and how much needs to be redone.
At 137Foundry, we work on data pipelines and automation scripts that process large batches of records. The patterns here come from building those systems and watching what breaks in production.
The Problem With Failing Fast in Long-Running Jobs
For short operations, fail fast is the correct default. If an API call fails at the start of a web request, raise immediately and let the framework return an error response. Clean and predictable.
For long-running batch jobs, failing fast means losing all progress on a partially completed batch. If you are processing 10,000 records and record 8,743 causes an unhandled exception, you have lost the results for records 1 through 8,742 unless they were persisted along the way. You also learn nothing about the records after 8,743: you have no idea which of them would have failed and which would have succeeded.
The alternative is record-level error isolation: catch exceptions at the per-record level, log them with enough context to identify the problem record, and continue processing the remaining records.
    results = {"success": [], "failed": []}

    for record in batch:
        try:
            output = transform(record)
            results["success"].append(output)
        except Exception:
            # log.exception records the traceback along with the record ID
            log.exception("Transform failed on record %s", record.id)
            results["failed"].append(record.id)

    log.info(
        "Batch complete: %d succeeded, %d failed",
        len(results["success"]),
        len(results["failed"]),
    )
This is a deliberate trade-off. You are choosing to continue processing over failing loudly. The correctness condition is that each record's failure or success is fully captured before moving to the next one.
Distinguishing Retryable from Fatal Failures
Not all failures are the same. A database timeout at record 4,000 might succeed on retry. A record with a malformed date field will fail every time. Treating both identically -- always retrying or always skipping -- is wrong: retrying a malformed record wastes time, and skipping a timed-out record drops work that would have succeeded.
A custom exception hierarchy makes the distinction explicit:
    class RetryableError(Exception):
        """The operation failed but may succeed on retry."""


    class FatalError(Exception):
        """The operation will not succeed on retry. Escalate."""
The service layer raises the appropriate type. The batch processor catches them differently:
    import time

    for record in batch:
        for attempt in range(MAX_RETRIES):
            try:
                output = process(record)
                results["success"].append(output)
                break
            except RetryableError as err:
                if attempt == MAX_RETRIES - 1:
                    # Out of attempts: record the failure and move on
                    log.error("Max retries reached for %s: %s", record.id, err)
                    results["failed"].append(record.id)
                else:
                    time.sleep(2 ** attempt)  # Exponential backoff
            except FatalError:
                # Retrying cannot help, so skip straight to the next record
                log.exception("Fatal failure on record %s, skipping", record.id)
                results["failed"].append(record.id)
                break
The Python documentation covers the exception hierarchy and how custom exceptions should be defined. The Python Enhancement Proposals on peps.python.org -- specifically PEP 3134 -- describe exception chaining, which is how you preserve the original cause when wrapping exceptions in your service layer.
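As a minimal sketch of exception chaining at a service boundary, the example below wraps low-level errors in the two exception types defined above using `raise ... from`. The `fetch_record` function, the `flaky_backend` stand-in, and the choice of `TimeoutError` and `ValueError` as the underlying errors are hypothetical and chosen only for illustration.

```python
class RetryableError(Exception):
    """The operation failed but may succeed on retry."""


class FatalError(Exception):
    """The operation will not succeed on retry."""


def fetch_record(record_id, backend):
    """Hypothetical service-layer call that translates low-level errors."""
    try:
        return backend(record_id)
    except TimeoutError as err:
        # Transient problem: keep the original exception as __cause__.
        raise RetryableError(f"timeout fetching {record_id}") from err
    except ValueError as err:
        # Bad data will not fix itself on retry.
        raise FatalError(f"invalid data for {record_id}") from err


def flaky_backend(record_id):
    # Stand-in for a real data source; it always times out.
    raise TimeoutError("connection timed out")


try:
    fetch_record(42, flaky_backend)
except RetryableError as err:
    original = err.__cause__  # the TimeoutError raised by the backend
    print(type(original).__name__, "->", type(err).__name__)
```

Because the cause is attached, `log.exception` in the batch loop would print both tracebacks, so the low-level reason is not lost when the service layer changes the exception type.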
Checkpointing to Avoid Reprocessing Completed Work
For very long jobs, error isolation is not enough. If the entire process crashes after 8 hours of work, you still lose all progress. Checkpointing lets you resume from the last successful point rather than starting over.
The simplest checkpoint is a file or database table that tracks which records have already been processed:
    import json

    CHECKPOINT_FILE = "progress.json"


    def load_checkpoint():
        try:
            with open(CHECKPOINT_FILE) as f:
                return set(json.load(f)["completed"])
        except FileNotFoundError:
            # First run: nothing has been processed yet
            return set()


    def save_checkpoint(completed_ids):
        with open(CHECKPOINT_FILE, "w") as f:
            json.dump({"completed": list(completed_ids)}, f)


    completed = load_checkpoint()

    for record in batch:
        if record.id in completed:
            continue  # Already processed on a previous run
        try:
            output = process(record)
            completed.add(record.id)
            save_checkpoint(completed)
        except Exception:
            log.exception("Failed on record %s", record.id)
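The checkpoint write frequency is a trade-off: write after every record for maximum resume granularity, or write less often for better throughput. Writing every N records means a crash can force up to N records to be processed again on resume, so processing a record twice must be safe. For most batch jobs, writing every 100 records is a reasonable balance.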
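The sketch below shows one way to write the checkpoint less often, assuming records are identified by plain JSON-serializable IDs. The names `save_checkpoint_atomic`, `run`, and `CHECKPOINT_EVERY` are hypothetical. It also writes through a temporary file and renames it into place, an addition not in the original code, so that a crash in the middle of a write does not leave a truncated checkpoint. Per-record error handling is left out to keep the example short.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 100  # assumed interval, taken from the text above


def load_checkpoint(path):
    """Return the set of completed IDs, or an empty set on the first run."""
    try:
        with open(path) as f:
            return set(json.load(f)["completed"])
    except FileNotFoundError:
        return set()


def save_checkpoint_atomic(path, completed_ids):
    """Write to a temp file, then rename it over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"completed": list(completed_ids)}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)  # swaps the new file in as one step
    except BaseException:
        os.unlink(tmp_path)  # do not leave a stray temp file behind
        raise


def run(record_ids, process, path, every=CHECKPOINT_EVERY):
    """Process IDs not yet completed, checkpointing every `every` successes."""
    completed = load_checkpoint(path)
    unsaved = 0
    try:
        for record_id in record_ids:
            if record_id in completed:
                continue  # already processed on a previous run
            process(record_id)
            completed.add(record_id)
            unsaved += 1
            if unsaved >= every:
                save_checkpoint_atomic(path, completed)
                unsaved = 0
    finally:
        # Flush on normal exit and on an exception, so finished work is kept.
        if unsaved:
            save_checkpoint_atomic(path, completed)
    return completed
```

The `finally` block covers ordinary exceptions, but not a hard kill or power loss. In that case up to `every` records are processed again on the next run.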
Structured Failure Reports
After a batch completes, the failure records are only useful if they are in a format that makes remediation possible. Logging exception messages is a minimum. Persisting the failed record IDs to a file or table is better, because it enables re-running just the failed subset without modifying the main script.
    if results["failed"]:
        with open("failed_records.txt", "w") as f:
            for record_id in results["failed"]:
                f.write(f"{record_id}\n")  # one ID per line
        log.warning("%d records failed. See failed_records.txt", len(results["failed"]))
For re-runs, the batch loading code accepts an optional filter:
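    def load_batch(rerun_ids=None):
        # None means "no filter"; an empty collection means "nothing to re-run"
        if rerun_ids is None:
            return all_records
        wanted = set(rerun_ids)  # set membership checks are O(1)
        return [r for r in all_records if r.id in wanted]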
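To close the loop between the failure report and the re-run filter, the IDs have to be read back from the file. The helper below is a minimal sketch under the assumption that the report holds one ID per line, as written above. The name `read_failed_ids` and its `convert` parameter are hypothetical.

```python
from pathlib import Path


def read_failed_ids(path, convert=str):
    """Return the set of IDs listed one per line in a failure report.

    `convert` turns each line back into the ID type the records use,
    for example `int` when record IDs are integers.
    """
    report = Path(path)
    if not report.exists():
        return set()
    return {
        convert(line.strip())
        for line in report.read_text().splitlines()
        if line.strip()  # ignore blank lines
    }
```

A re-run would then look like `load_batch(rerun_ids=read_failed_ids("failed_records.txt", convert=int))`. IDs come back from the file as strings, so the conversion matters whenever `record.id` is not a string. An empty result means there is nothing to re-run, which is worth checking before starting the job.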
This pattern -- isolate failures, record them, enable targeted reruns -- is one of the consistent elements in the data pipeline work we do at 137Foundry. The same structure applies whether the batch is 1,000 records or 10 million.
When Record-Level Isolation Is Not Appropriate
Record-level isolation is not always the right pattern. It is specifically appropriate when:
Each record can be processed independently, with no side effects between records.
A failure on one record leaves the system in a clean state for the next record.
Partial completion is acceptable and resumable.
It is not appropriate when records have dependencies. If record B references the output of record A, and record A failed, record B may appear to succeed but produce corrupt output based on a missing or stale dependency. In these cases, the dependency graph needs explicit representation in the batch design, and failures need to propagate to dependent records.
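One way to propagate failures to dependents is sketched below. It assumes the IDs are already ordered so that dependencies come first, and that dependencies are available as a mapping from an ID to the IDs it needs. Both the `depends_on` structure and the function name are hypothetical.

```python
import logging

log = logging.getLogger(__name__)


def run_with_dependencies(record_ids, depends_on, process):
    """Process IDs in order, skipping any whose dependencies did not succeed.

    `record_ids` must already be ordered so that dependencies come first.
    `depends_on` maps an ID to the IDs it needs (hypothetical structure).
    """
    succeeded, failed, skipped = set(), set(), set()
    for record_id in record_ids:
        blockers = [d for d in depends_on.get(record_id, ()) if d not in succeeded]
        if blockers:
            # Do not run on top of a missing or stale upstream result.
            log.warning("Skipping %s: blocked by %s", record_id, blockers)
            skipped.add(record_id)
            continue
        try:
            process(record_id)
            succeeded.add(record_id)
        except Exception:
            log.exception("Failed on record %s", record_id)
            failed.add(record_id)
    return succeeded, failed, skipped


def process(record_id):
    # Stand-in for real work: record "A" always fails.
    if record_id == "A":
        raise ValueError("bad input")


deps = {"B": ["A"], "C": ["B"], "D": []}
ok, failed, skipped = run_with_dependencies(["A", "B", "C", "D"], deps, process)
print(sorted(ok), sorted(failed), sorted(skipped))
# ['D'] ['A'] ['B', 'C']
```

Keeping skipped records separate from failed ones helps with remediation: fixing record A and re-running A, B, and C is a different task from debugging three independent failures.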
Similarly, record-level isolation does not protect against failures in setup or teardown code outside the loop. If the database connection drops mid-batch and you are catching DatabaseConnectionError inside the loop, every remaining record will fail with the same error. The isolation masks what is actually a systemic failure. Two safeguards keep the isolation from hiding a systemic outage: a connection health check before the loop, and a circuit breaker that aborts the batch when the failure rate spikes.
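A circuit breaker can be as simple as a counter. The sketch below uses consecutive failures as a stand-in for a failure-rate spike, which is a simplification of the idea described above. The `CircuitOpen` exception, the `process_batch` function, and the default threshold of 5 are hypothetical.

```python
class CircuitOpen(Exception):
    """Raised when too many records fail in a row (hypothetical name)."""

    def __init__(self, message, results):
        super().__init__(message)
        self.results = results  # partial results up to the abort


def process_batch(records, process, max_consecutive_failures=5):
    """Isolate per-record failures, but abort on a run of failures."""
    results = {"success": [], "failed": []}
    consecutive = 0
    for record in records:
        try:
            results["success"].append(process(record))
            consecutive = 0  # any success resets the breaker
        except Exception as err:
            results["failed"].append(record)
            consecutive += 1
            if consecutive >= max_consecutive_failures:
                raise CircuitOpen(
                    f"{consecutive} consecutive failures, aborting batch",
                    results,
                ) from err
    return results


def outage_after_three(record):
    # Stand-in for a dependency that goes down partway through the batch.
    if record >= 3:
        raise ConnectionError("database unreachable")
    return record * 10


try:
    process_batch(range(10), outage_after_three, max_consecutive_failures=3)
except CircuitOpen as err:
    partial = err.results
    print(partial["success"], partial["failed"])
```

A sliding-window failure rate is a closer match to the text, but it needs more bookkeeping. The consecutive count is usually enough to catch a dropped connection, where every record fails from that point on.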
Production Monitoring Considerations
Error isolation means individual failures do not crash the job, but it also means failures can accumulate silently. A batch that completes with 200 failed records looks like a success from the job runner's perspective if you are only checking the exit code.
Three additions address this:
Exit with a non-zero code if failure rate exceeds a threshold. A 2% failure rate might be acceptable; a 40% failure rate indicates a systemic problem.
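    import sys

    total = len(batch)
    # An empty batch has no failures; this also avoids ZeroDivisionError
    failure_rate = len(results["failed"]) / total if total else 0.0
    if failure_rate > 0.05:
        log.error("Failure rate %.1f%% exceeds threshold", failure_rate * 100)
        sys.exit(1)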
Send an alert when failure counts exceed a threshold. The Sentry error monitoring platform integrates with Python logging to aggregate error counts and alert when error rates spike above configured thresholds.
Log a structured summary at job end. A final log entry with total records, success count, failure count, and elapsed time makes it trivial to build dashboards and detect regressions across job runs without parsing individual record-level entries.
    log.info(
        "Batch %s complete: total=%d success=%d failed=%d elapsed=%.1fs",
        batch.id,
        len(batch),
        len(results["success"]),
        len(results["failed"]),
        time.time() - start_time,
    )
The full reference for Python exception handling patterns -- including exception chaining, service boundary design, and testing failure paths with pytest -- is in the Python Error Handling article on the 137Foundry blog.
The patterns described here -- record-level isolation, retryable vs. fatal error distinction, checkpointing, and structured failure reporting -- are composable. A long-running job that is not important enough to checkpoint may still benefit from record-level isolation and a structured failure report. Choose the patterns that match the job's requirements for recovery, auditability, and rerunnability, rather than applying all of them uniformly.