Python Error Recovery Patterns for Long-Running Scripts

Long-running scripts fail differently than web requests. A web request fails in milliseconds and the next request starts fresh. A script that processes 50,000 records fails hours in, and the question is not just what went wrong but how much work was lost and how much needs to be redone.

At 137Foundry, we work on data pipelines and automation scripts that process large batches of records. The patterns here come from building those systems and watching what breaks in production.

The Problem With Failing Fast in Long-Running Jobs

For short operations, fail fast is the correct default. If an API call fails at the start of a web request, raise immediately and let the framework return an error response. Clean and predictable.

For long-running batch jobs, failing fast means losing all progress on a partially completed batch. If you are processing 10,000 records and record 8,743 raises an unhandled exception, any results for records 1 through 8,742 that were held only in memory are gone. You also learn nothing about the remaining 1,257 records: the job stops at the first failure, so you cannot tell which of them would have succeeded.

The alternative is record-level error isolation: catch exceptions at the per-record level, log them with enough context to identify the problem record, and continue processing the remaining records.

results = {"success": [], "failed": []}

for record in batch:
    try:
        output = transform(record)
        results["success"].append(output)
    except Exception:
        # log.exception records the traceback alongside the record ID.
        log.exception("Transform failed on record %s", record.id)
        results["failed"].append(record.id)

log.info(
    "Batch complete: %d succeeded, %d failed",
    len(results["success"]),
    len(results["failed"]),
)

This is a deliberate trade-off. You are choosing to continue processing over failing loudly. The correctness condition is that each record's failure or success is fully captured before moving to the next one.

Distinguishing Retryable from Fatal Failures

Not all failures are the same. A database timeout at record 4,000 might succeed on retry. A record with a malformed date field will fail every time. Treating both identically -- either always retry or always skip -- is wrong.

A custom exception hierarchy makes the distinction explicit:

class RetryableError(Exception):
    """The operation failed but may succeed on retry."""


class FatalError(Exception):
    """The operation will not succeed on retry. Escalate."""

The service layer raises the appropriate type. The batch processor catches them differently:

import time

for record in batch:
    for attempt in range(MAX_RETRIES):
        try:
            output = process(record)
            results["success"].append(output)
            break
        except RetryableError as err:
            if attempt == MAX_RETRIES - 1:
                # Out of attempts: record the failure and move on.
                log.error("Max retries reached for %s: %s", record.id, err)
                results["failed"].append(record.id)
            else:
                time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s, ...
        except FatalError:
            # Retrying will not help, so skip this record immediately.
            log.exception("Fatal failure on record %s, skipping", record.id)
            results["failed"].append(record.id)
            break

The Python documentation covers the exception hierarchy and how custom exceptions should be defined. The Python Enhancement Proposals on peps.python.org -- specifically PEP 3134 -- describe exception chaining, which is how you preserve the original cause when wrapping exceptions in your service layer.
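As a minimal sketch of that chaining, a service-layer function can translate low-level exceptions into the two types above with `raise ... from`, which stores the original exception on `__cause__`. The function name `fetch_record`, the `client` callable, and the choice of `TimeoutError` and `ValueError` as the low-level failures are assumptions made for illustration; they are not part of the original code.

```python
class RetryableError(Exception):
    """The operation failed but may succeed on retry."""


class FatalError(Exception):
    """The operation will not succeed on retry."""


def fetch_record(client, record_id):
    """Hypothetical service-layer call that classifies low-level failures."""
    try:
        return client(record_id)
    except TimeoutError as err:
        # Transient: a later attempt may succeed. `from err` keeps the cause.
        raise RetryableError(f"timeout fetching {record_id}") from err
    except ValueError as err:
        # Bad data: retrying the same input will not help.
        raise FatalError(f"invalid data for {record_id}") from err
```

Because the cause is preserved, `log.exception` in the batch loop prints both the wrapped error and the original traceback.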

Checkpointing to Avoid Reprocessing Completed Work

For very long jobs, error isolation is not enough. If the entire process crashes after 8 hours of work, you still lose all progress. Checkpointing lets you resume from the last successful point rather than starting over.

The simplest checkpoint is a file or database table that records which records have been processed:

import json

CHECKPOINT_FILE = "progress.json"


def load_checkpoint():
    try:
        with open(CHECKPOINT_FILE) as f:
            return set(json.load(f)["completed"])
    except FileNotFoundError:
        # First run: nothing has been processed yet.
        return set()


def save_checkpoint(completed_ids):
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump({"completed": list(completed_ids)}, f)


completed = load_checkpoint()

for record in batch:
    if record.id in completed:
        continue  # Already processed on a previous run
    try:
        output = process(record)
        completed.add(record.id)
        save_checkpoint(completed)
    except Exception:
        log.exception("Failed on record %s", record.id)

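The checkpoint write frequency is a trade-off: write after every record for maximum resume granularity, write less often for better throughput. For most batch jobs, writing every 100 records is a reasonable balance. The cost of writing less often is that a crash can repeat up to that many records on the next run, so processing a record twice must be safe.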
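One way to write less often is sketched below. It saves every `CHECKPOINT_EVERY` successes and once more when the loop exits, and it writes through a temporary file so a crash mid-write cannot leave a half-written checkpoint. The names `save_checkpoint_atomic`, `run`, and `CHECKPOINT_EVERY` are hypothetical, and the batch is assumed to be a sequence of record IDs to keep the example self-contained.

```python
import json
import os
import tempfile

CHECKPOINT_EVERY = 100  # assumed interval; tune per job


def save_checkpoint_atomic(completed_ids, path):
    """Write the checkpoint to a temp file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump({"completed": list(completed_ids)}, f)
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        os.unlink(tmp_path)  # do not leave the temp file behind
        raise


def run(batch, process, path, completed=None):
    """Process record IDs, checkpointing every CHECKPOINT_EVERY successes."""
    completed = set() if completed is None else set(completed)
    since_last_save = 0
    try:
        for record_id in batch:
            if record_id in completed:
                continue  # finished on a previous run
            process(record_id)
            completed.add(record_id)
            since_last_save += 1
            if since_last_save >= CHECKPOINT_EVERY:
                save_checkpoint_atomic(completed, path)
                since_last_save = 0
    finally:
        # Flush whatever finished, including when an exception escapes.
        save_checkpoint_atomic(completed, path)
    return completed
```

The `finally` block covers exceptions and normal exit, but not a killed process or power loss; in those cases the last periodic write is what survives.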

Structured Failure Reports

After a batch completes, the failure records are only useful if they are in a format that makes remediation possible. Logging exception messages is a minimum. Persisting the failed record IDs to a file or table is better, because it enables re-running just the failed subset without modifying the main script.

if results["failed"]:
    with open("failed_records.txt", "w") as f:
        for record_id in results["failed"]:
            f.write(f"{record_id}\n")
    log.warning("%d records failed. See failed_records.txt", len(results["failed"]))

For re-runs, the batch loading code accepts an optional filter:

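def load_batch(rerun_ids=None):
    # No filter given: process everything.
    if rerun_ids is None:
        return all_records
    # An empty filter now means "nothing to rerun" rather than "everything".
    wanted = set(rerun_ids)  # set membership keeps the filter fast on large batches
    return [r for r in all_records if r.id in wanted]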
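To close the loop between the failure file and the rerun filter, the IDs have to be read back in the same type that `record.id` uses. The helper below is a sketch under that assumption; `read_failed_ids` and its `convert` parameter are hypothetical names, not part of the original script.

```python
def read_failed_ids(path, convert=str):
    """Return the set of record IDs stored one per line in `path`.

    IDs are written as text, so pass convert=int (or similar) when
    record.id is not a string; otherwise the filter will match nothing.
    """
    with open(path) as f:
        return {convert(line.strip()) for line in f if line.strip()}
```

A rerun would then look like `load_batch(rerun_ids=read_failed_ids("failed_records.txt", convert=int))`, assuming integer IDs.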

This pattern -- isolate failures, record them, enable targeted reruns -- is one of the consistent elements in the data pipeline work we do at 137Foundry. The same structure applies whether the batch is 1,000 records or 10 million.

When Record-Level Isolation Is Not Appropriate

Record-level isolation is not always the right pattern. It is specifically appropriate when:

Each record can be processed independently, with no side effects between records.

A failure on one record leaves the system in a clean state for the next record.

Partial completion is acceptable and resumable.

It is not appropriate when records have dependencies. If record B references the output of record A, and record A failed, record B may appear to succeed but produce corrupt output based on a missing or stale dependency. In these cases, the dependency graph needs explicit representation in the batch design, and failures need to propagate to dependent records.
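A minimal sketch of that propagation, assuming the records are already ordered so that dependencies come first and that the dependencies are available as a simple mapping. The function name `process_with_dependencies` and the `depends_on` structure are assumptions for illustration.

```python
def process_with_dependencies(records, depends_on, process):
    """Process record IDs in order, skipping any whose dependency failed.

    `records` must list dependencies before their dependents.
    `depends_on` maps a record ID to the IDs it needs.
    """
    succeeded, failed, skipped = [], [], []
    unusable = set()  # records that failed or were skipped
    for record_id in records:
        blockers = [d for d in depends_on.get(record_id, ()) if d in unusable]
        if blockers:
            # Propagate the failure instead of producing corrupt output.
            skipped.append(record_id)
            unusable.add(record_id)
            continue
        try:
            process(record_id)
            succeeded.append(record_id)
        except Exception:
            failed.append(record_id)
            unusable.add(record_id)
    return succeeded, failed, skipped
```

Keeping skipped records separate from failed ones makes the report clearer: only the failed records need fixing, and the skipped ones can be rerun once their dependencies succeed.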

Similarly, record-level isolation does not protect against failures in setup or teardown code outside the loop. If the database connection drops mid-batch and you are catching DatabaseConnectionError inside the loop, every remaining record will fail with the same error. The isolation masks what is actually a systemic failure. A connection health check before the loop, combined with a circuit breaker that aborts when the failure rate spikes, prevents the isolation from hiding a systemic outage.
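A simple form of circuit breaker is sketched below. It uses consecutive failures as a stand-in for a failure-rate spike, which is easier to reason about but is a simplification of the idea. `SystemicFailure`, `run_with_breaker`, and the default limit of 5 are assumptions for illustration.

```python
class SystemicFailure(Exception):
    """Raised when consecutive failures suggest an outage, not bad records."""

    def __init__(self, message, results):
        super().__init__(message)
        self.results = results  # partial results up to the abort


def run_with_breaker(batch, process, max_consecutive_failures=5):
    """Isolate per-record failures, but abort on a long failure streak."""
    results = {"success": [], "failed": []}
    consecutive = 0
    for record in batch:
        try:
            results["success"].append(process(record))
            consecutive = 0  # any success resets the streak
        except Exception:
            results["failed"].append(record)
            consecutive += 1
            if consecutive >= max_consecutive_failures:
                raise SystemicFailure(
                    f"{consecutive} consecutive failures; aborting batch",
                    results,
                )
    return results
```

Scattered bad records never trip the breaker because each success resets the count, while a dropped connection trips it after a handful of records instead of failing the whole remainder of the batch.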

Production Monitoring Considerations

Error isolation means individual failures do not crash the job, but it also means failures can accumulate silently. A batch that completes with 200 failed records looks like a success from the job runner's perspective if you are only checking the exit code.

Three additions address this:

Exit with a non-zero code if failure rate exceeds a threshold. A 2% failure rate might be acceptable; a 40% failure rate indicates a systemic problem.

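import sys

# Guard against an empty batch so the rate calculation cannot divide by zero.
failure_rate = len(results["failed"]) / len(batch) if batch else 0.0
if failure_rate > 0.05:
    log.error("Failure rate %.1f%% exceeds threshold", failure_rate * 100)
    sys.exit(1)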

Send an alert when failure counts exceed a threshold. The Sentry error monitoring platform integrates with Python logging to aggregate error counts and alert when error rates spike above configured thresholds.

Log a structured summary at job end. A final log entry with total records, success count, failure count, and elapsed time makes it trivial to build dashboards and detect regressions across job runs without parsing individual record-level entries.

log.info(
    "Batch %s complete: total=%d success=%d failed=%d elapsed=%.1fs",
    batch.id,
    len(batch),
    len(results["success"]),
    len(results["failed"]),
    time.time() - start_time,
)
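If the log pipeline prefers machine-readable entries, the same summary can be emitted as one JSON line. This is a sketch, not the article's method: `build_summary` and its field names are hypothetical, and `start_time` is assumed to come from `time.monotonic()` rather than `time.time()`.

```python
import json
import time


def build_summary(batch_id, total, success, failed, start_time, now=None):
    """Return a one-line JSON summary suitable for a final log entry."""
    now = time.monotonic() if now is None else now
    return json.dumps(
        {
            "event": "batch_complete",
            "batch_id": batch_id,
            "total": total,
            "success": success,
            "failed": failed,
            "failure_rate": round(failed / total, 4) if total else 0.0,
            "elapsed_s": round(now - start_time, 1),
        },
        sort_keys=True,
    )
```

The result can be passed straight to the logger, for example `log.info(build_summary(...))`, and parsed downstream without regular expressions.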

The full reference for Python exception handling patterns -- including exception chaining, service boundary design, and testing failure paths with pytest -- is in the Python Error Handling article on the 137Foundry blog.

The patterns described here -- record-level isolation, retryable vs. fatal error distinction, checkpointing, and structured failure reporting -- are composable. A long-running job that is not important enough to checkpoint may still benefit from record-level isolation and a structured failure report. Choose the patterns that match the job's requirements for recovery, auditability, and rerunnability, rather than applying all of them uniformly.

#python #programming #error-handling #scripting #software-development

Building Resilient Applications: Python Error Handling Strategies

Python error handling is crucial for building robust and user-friendly applications. This guide details various techniques, starting from basic try...except blocks to advanced methods like exception chaining and custom exceptions. Proper error handling

From “Oops” to “Oh Yeah!”: Building Resilient, User-Friendly Python Code

Errors are inevitable in any programming language, and Python is no exception. However, mastering how to anticipate, manage, and recover from these errors gracefully is what distinguishes a robust application from one that crashes unexpectedly. In this comprehensive guide, we’ll journey through the levels of error handling…

#code-refactoring #error-handling #exceptions #learn-application-development #micropython #programming #programming-logic #python #software-development #try-except

How-to: Error message when I run sudo: unable to resolve host (none) #dev #fix #development

Error message when I run sudo: unable to resolve host (none)

I have this issue on AWS on some servers. Whenever I run sudo the terminal is stuck doing seemingly nothing, until it finally spits out this error message. My terminal looks like this:

ubuntu@(none):~$ sudo true sudo: unable to resolve host (none)

What can I do to solve it?

Answer [by darent]: Error message when I run sudo: unable to…

#error-handling #sudo

How-to: What is the 'whoopsie' process and how can I remove it? #dev #it #solution

What is the ‘whoopsie’ process and how can I remove it?

On one of my machines I have a process running called “whoopsie”. I’m running 12.04 server and never specifically installed anything with this name.

Google seems to imply that it has something to do with error logs but I’m not finding too much information. The fact that I didn’t manually install it and the 3 other servers I checked did in fact…

#apport #error-handling #whoopsie

Resolved: What's a good way to extend Error in JavaScript? #solution #development #dev

What’s a good way to extend Error in JavaScript?

I want to throw some things in my JS code and I want them to be instanceof Error, but I also want to have them be something else.

In Python, typically, one would subclass Exception.

What’s the appropriate thing to do in JS?

Answer [by Blaine]: What’s a good way to extend Error in JavaScript?

Crescent Fresh’s highly-voted answer is…

#error-handling #exception #javascript

Solution: Best practices for exception management in Java or C# #dev #answer #fix

Best practices for exception management in Java or C#

I’m stuck deciding how to handle exceptions in my application.

Many of my issues with exceptions come from 1) accessing data via a remote service or 2) deserializing a JSON object. Unfortunately I can’t guarantee success for either of these tasks (cut network connection, malformed JSON object that is out of my control).

As a result, if I do…

#c #error-handling #exception #java

How to: Cryptic "Script Error." reported in Javascript in Chrome and Firefox

Cryptic "Script Error." reported in Javascript in Chrome and Firefox

I have a script that detects Javascript errors on my website and sends them to my backend for reporting. It reports the first error encountered, the supposed line number, and the time.

EDIT to include doctype:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

#error-handling #Firefox #Google Chrome #javascript

Fixed: How to get useful error messages in PHP? #dev #it #answer

How to get useful error messages in PHP?

I find programming in PHP quite frustrating. Quite often I will try and run the script and just get a blank screen back. No error message, just empty screen. The cause might have been a simple syntax error (wrong bracket, missing semicolon), or a failed function call, or something else entirely.

It is very difficult to figure out what went wrong. I end up…

#debugging #error-handling #php
