Fail-Safe by Default

An Engineering Philosophy for Living Systems


The valve that closes itself. Not a clever feature: the minimum standard for any system that controls water near living things.

If a microcontroller crashes mid-cycle and the irrigation solenoid stays open, a market garden can flood overnight. If a LoRaWAN connection drops after a valve opens and no close command ever arrives, the valve runs indefinitely. If a relay board loses power and restores its previous state on startup, it may turn on loads that should be off.

These scenarios are not exotic edge cases. They’re the predictable failure modes of any system that controls physical actuators over a network. The question isn’t whether they’ll happen. It’s what the system does when they do.

What Fail-Safe Means

In engineering, a fail-safe system is one that defaults to a safe state when something goes wrong. The specific safe state depends on what you’re controlling.

For an irrigation solenoid valve, fail-safe means closed. A closed valve can’t flood a crop. It can’t over-irrigate a bed that was already saturated. It can’t run a pump dry. It can’t cause the downstream problems that a stuck-open valve can cause in a matter of hours.

For a relay that switches a heating element, fail-safe means off. A heating element that stays on when the controller loses contact with a temperature sensor can cause a fire, or at minimum destroy a crop through heat damage.

For a relay board on startup, fail-safe means all channels off. If the software restoring previous state on power-up is itself the source of the problem (if the board is being restarted precisely because something went wrong), restoring a potentially dangerous previous state isn’t recovery. It’s compounding the failure.

These choices (what happens at failure, what happens on startup, what happens when the network drops) are design decisions. They’re made before a single line of code is written, or they’re not made at all and the system defaults to whatever is easiest to implement, which is often the unsafe option.

The Convenience of Fail-Open

Fail-open is cheaper to build. In many contexts, it’s also the commercially preferred option.

A thermostat that fails to a comfortable temperature is less likely to generate a complaint than one that fails to cold. A valve that stays open when the controller drops its connection might mean a slightly over-watered bed, which is visible but not catastrophic. From a product support perspective, a system that continues to do something when it loses connectivity (even the wrong thing) generates fewer urgent calls than one that stops completely.

This logic makes a certain sense in controlled environments with tolerant systems. It makes no sense in an agricultural setting where the “something” the system continues doing can destroy living organisms.

The systems we build at SEIN are designed to fail safe. This is not difficult. It requires thinking it through before building, not after.

How We Implement It

Relay boards default to off. The 16-channel Modbus relay modules used in our irrigation systems default all channels to the de-energised (off) state at power-up, with the relay contacts open. The relay bridge service (the software managing them) sends explicit off commands to all channels on startup and shutdown. No channel activates without a deliberate command from the automation system.
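
The startup-and-shutdown rule can be sketched in a few lines. This is a minimal illustration, not the SEIN relay bridge itself: FakeRelayBus, write_channel, all_off and relay_bridge are hypothetical names, and the fake bus stands in for a real Modbus link so the example runs on its own.

```python
from contextlib import contextmanager

NUM_CHANNELS = 16  # matches the 16-channel modules described above


class FakeRelayBus:
    """Stand-in for the real Modbus link (hypothetical); records writes."""

    def __init__(self, channels=NUM_CHANNELS):
        # Simulate a board whose channels are in an unwanted state.
        self.state = [True] * channels
        self.log = []

    def write_channel(self, channel, on):
        self.state[channel] = on
        self.log.append((channel, on))


def all_off(bus, channels=NUM_CHANNELS):
    """Send an explicit off command to every channel, one by one."""
    for ch in range(channels):
        bus.write_channel(ch, False)


@contextmanager
def relay_bridge(bus):
    """Force all-off on entry and on exit, even if the body raises."""
    all_off(bus)          # startup: never trust the previous state
    try:
        yield bus
    finally:
        all_off(bus)      # shutdown or exception: leave nothing energised
```

Wrapping the service's main loop in "with relay_bridge(bus):" means an unhandled exception still ends with every channel off. A finally block does not run on power loss or a hard kill, which is why the hardware's own power-up default still matters.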

Safety timers enforce maximum on-duration. Each relay channel in the bridge service has a configurable safety timer: a maximum duration after which the relay cuts power automatically, regardless of whether an off command has arrived. An irrigation zone cannot remain open indefinitely due to a crashed controller, a dropped connection, or a software bug. The safety timer closes it.
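
One way to express a per-channel cap is a deadline that a request can shorten but never extend. The sketch below assumes a hypothetical SafetyTimedRelay class with an injectable clock and a write callback; none of these names come from the bridge service.

```python
import time


class SafetyTimedRelay:
    """One relay channel with a hard cap on how long it may stay on."""

    def __init__(self, write, max_on_seconds, clock=time.monotonic):
        self._write = write              # callable(bool) that drives the relay
        self._max_on = max_on_seconds
        self._clock = clock
        self._deadline = None            # None means the relay is off
        self._write(False)               # start from the safe state

    def turn_on(self, requested_seconds=None):
        # A request can shorten the run, never extend it past the cap.
        if requested_seconds is None:
            run = self._max_on
        else:
            run = min(requested_seconds, self._max_on)
        self._deadline = self._clock() + run
        self._write(True)

    def turn_off(self):
        self._deadline = None
        self._write(False)

    def tick(self):
        """Call periodically; cuts power once the deadline has passed."""
        if self._deadline is not None and self._clock() >= self._deadline:
            self.turn_off()
            return True                  # the safety timer fired
        return False
```

A timer like this only protects against failures upstream of whatever process calls tick(). If that process itself can crash, the same cap needs to exist one layer further down.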

Flash-on Modbus registers prevent stuck-open valves. Our industrial LoRaWAN irrigation controller uses a specific Modbus register type (flash-on registers) for relay control. A flash-on command activates the relay for a specified duration at the hardware level, then deactivates it. Once the command is issued, the controller doesn’t need to receive another command to close the valve. If the LoRaWAN connection is lost, the relay switches off and the valve closes on schedule. If the network server crashes, the same thing happens. The safety isn’t contingent on ongoing connectivity.
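
To show what "the command carries its own off-time" looks like on the wire, here is a rough sketch that builds a Modbus RTU frame with the duration in the payload. The register base address, the 100 ms duration unit and the choice of function code 0x06 are assumptions for illustration only; real devices differ, so check the manual. Only the CRC-16 routine and the frame layout follow the Modbus RTU standard.

```python
import struct

# Hypothetical values for illustration only; check the device manual.
FLASH_ON_BASE = 0x0200      # assumed base address of the flash-on registers
UNITS_PER_SECOND = 10       # assumed duration unit of 100 ms


def modbus_crc16(data: bytes) -> int:
    """Standard Modbus RTU CRC-16 (init 0xFFFF, reflected poly 0xA001)."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def flash_on_frame(slave: int, channel: int, seconds: float) -> bytes:
    """Build a 'write single register' frame that carries its own off-time."""
    ticks = round(seconds * UNITS_PER_SECOND)
    if not 1 <= ticks <= 0xFFFF:
        raise ValueError("duration out of range for a 16-bit register")
    # Function 0x06 = write single register; address and value are big-endian.
    body = struct.pack(">BBHH", slave, 0x06, FLASH_ON_BASE + channel, ticks)
    # The CRC is appended low byte first.
    return body + struct.pack("<H", modbus_crc16(body))
```

Because the duration travels inside the single frame, there is no second message whose loss could leave the valve open. The range check also refuses a zero or oversized duration instead of silently sending something the hardware might treat as "on with no limit".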

Connection status sensors surface silent failures. A relay bridge that has lost contact with the RS485 bus can continue appearing functional to the automation system while doing nothing. The SEIN relay bridge publishes a connection status binary sensor to Home Assistant. If the RS485 bus goes silent, Home Assistant registers the change and triggers an alert, so the operator can investigate before a silent failure becomes a damaged crop.
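
A connection status sensor boils down to "when did the bus last answer?". The sketch below assumes a hypothetical BusConnectionMonitor with a publish callback. How the status reaches Home Assistant, and the "ON"/"OFF" payload strings, are assumptions rather than details from the bridge.

```python
import time


class BusConnectionMonitor:
    """Tracks whether the RS485 bus has answered recently."""

    def __init__(self, publish, timeout_seconds=30.0, clock=time.monotonic):
        self._publish = publish          # callable(str), transport is assumed
        self._timeout = timeout_seconds
        self._clock = clock
        self._last_ok = None             # no successful poll yet
        self._connected = None           # unknown until first evaluation

    def record_success(self):
        """Call after every valid response from the bus."""
        self._last_ok = self._clock()

    def evaluate(self):
        """Call periodically; publishes only when the status changes."""
        alive = (
            self._last_ok is not None
            and self._clock() - self._last_ok <= self._timeout
        )
        if alive != self._connected:
            self._connected = alive
            self._publish("ON" if alive else "OFF")
        return alive
```

The monitor starts out reporting disconnected until the first valid response arrives, so a bridge that never reaches the bus does not look healthy by default.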

Biology and Failure

Living systems (biological ones) have been solving this problem for a long time.

When a plant experiences heat stress, it closes its stomata. This reduces photosynthesis, but it prevents dehydration. It’s a local, conservative response that defaults to survival over productivity when resources are stressed. The cost of closing stomata is a reduction in growth. The cost of not closing them is death. The system defaults to the safe option.

When a mycorrhizal network is disrupted by soil disturbance, the fungi don’t abandon all connected root systems simultaneously. They maintain connections where they can and degrade gracefully where they can’t. The network as a whole degrades slowly rather than catastrophically.

Immune systems, at the cellular level, operate with similar logic: when a cell cannot determine whether a signal is benign or pathogenic, it defaults to the conservative response. False positives (inflammation where it wasn’t needed) are costly. False negatives (no response to a real pathogen) can be catastrophic. The safe default is action, not inaction.

The same logic applies to automated systems that control physical actuators in environments where living things depend on the outcome. When uncertain, default to the state that causes the least harm. Design the system so that the absence of a signal is a meaningful signal: one that triggers the safe response, not the convenient one.
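
Treating silence as a signal can be as simple as refusing to act on missing or stale data. The sketch below applies that to the heating example from earlier; Reading, heater_should_run and the 120-second freshness limit are illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Reading:
    celsius: float
    age_seconds: float   # time since the sensor last reported


def heater_should_run(reading: Optional[Reading], setpoint: float,
                      max_age_seconds: float = 120.0) -> bool:
    """Return True only when there is fresh evidence that heat is needed."""
    if reading is None:
        return False                      # never heard from the sensor
    if reading.age_seconds > max_age_seconds:
        return False                      # stale data counts as no data
    if math.isnan(reading.celsius):
        return False                      # a fault value is not a temperature
    return reading.celsius < setpoint
```

Every uncertain branch returns False, so the heater runs only when a recent, valid reading says it should.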

Who Is Responsible?

There’s an ethical dimension to this that goes beyond engineering practice.

When an automated system fails in a way that damages a living system (floods a bed, overheats a glasshouse, over-irrigates a paddock), who is responsible? The builder of the hardware? The developer of the firmware? The operator who configured the automation? The person who accepted the default settings without understanding what they meant?

In practice, the answer is usually the operator. The system failed; they are the one managing the growing environment; the loss is theirs. But the builder of the system made the choices that determined whether this failure was likely or unlikely in the first place.

Designing a system that fails safe doesn’t eliminate the operator’s responsibility. It shifts the risk calculation. A fail-safe system that goes wrong causes less harm. A fail-open system that goes wrong can cause catastrophic harm. The builder who chooses fail-open because it’s cheaper to implement is making a choice about who bears the cost of that failure. Usually it’s not them.

Designing for Care

Fail-safe by default is an expression of care. It says: this system operates around living things, and when things go wrong (as they will), we want the harm to be minimal.

The Fail-Safe Checklist for DIY Automation

If you’re building your own control system, ask these questions before you wire it up:

What is the safe state? (e.g., valve closed, heater off, fan on.)
Does it go to the safe state on total power loss? Use normally-open (NO) relay contacts for heaters and valves so the load disconnects when the relay loses power.
Does it go to the safe state on microcontroller crash? Use hardware watchdogs or physical safety timers that don’t depend on the code running.
Does it go to the safe state on network loss? Ensure your “off” commands aren’t the only way to stop a process. Use “run-for-X-minutes” commands instead of just “start”.
Is there a manual override? Every automated valve should have a physical bypass or manual handle.
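
The checklist can also be written down as a small audit that runs against a plan for each actuator before anything is wired. This is only a sketch: ActuatorPlan, its field names and the audit function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActuatorPlan:
    name: str
    safe_state: Optional[str]            # e.g. "closed", "off", "on"
    state_on_power_loss: Optional[str]   # what the hardware does unpowered
    has_independent_timer: bool          # watchdog or timer outside the code
    uses_timed_commands: bool            # "run for X minutes", not "start"
    has_manual_override: bool


def audit(plan: ActuatorPlan) -> List[str]:
    """Return the checklist questions this plan does not yet answer."""
    gaps = []
    if not plan.safe_state:
        gaps.append("safe state is not defined")
    elif plan.state_on_power_loss != plan.safe_state:
        gaps.append("power loss does not lead to the safe state")
    if not plan.has_independent_timer:
        gaps.append("a crashed controller can leave the actuator running")
    if not plan.uses_timed_commands:
        gaps.append("network loss can leave the actuator running")
    if not plan.has_manual_override:
        gaps.append("no manual override")
    return gaps
```

An empty list means every question has an answer on paper. It says nothing about whether the wiring matches the plan, so it is worth pulling the power and the network cable on the bench as well.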

This is not a high standard. It is the minimum standard. But building to this standard requires thinking carefully about failure modes before the first component is wired, not discovering them in the field at 3am with a torch and a flooded garden bed.

The tools we build here are designed that way. Not because it’s clever. Because it’s the right approach to building things that operate near life.

Featured image by Aaron Volkening on Flickr — CC BY 2.0.
