Sunday, April 20, 2025

Lean Software Development: Quality as the Foundation of Sustainable Development

Part three of the series on quality in Lean Software Development. After seeing how to detect errors early and learn from them, in this installment we explore how technical, internal quality is key to sustaining external quality, why less is more, and how to create a culture where working with quality is the norm rather than the exception. We will see that this foundational quality is not a luxury, but the essential hygiene of a well-made software product.

Quality as a development accelerator

One of the most widespread beliefs, especially in organizations that have not yet adopted Lean approaches, is that working with quality slows development down. The assumption is that writing tests, automating validations, or refactoring consumes time we could spend "delivering faster." From the perspective of Lean Software Development, however, this view is not only wrong: it perpetuates waste.

In reality, quality, properly understood, is an accelerator. When the system is healthy, with reliable automated tests, simple design, and robust processes, every step we take has less friction. The team's confidence in its ability to change the system grows, feedback gets faster, and the cost of change drops drastically. In other words, we go faster not in spite of quality, but because of it.

This is fully aligned with the Lean principles we have been explaining throughout this series (poka-yoke, jidoka, kaizen).

Moreover, when the team trusts its system, because it knows that errors are detected early, that the design makes evolution easy, and that it can experiment without breaking anything, it dares to innovate, try new ideas, and adapt quickly to what it learns from users. In short, it strengthens the continuous delivery of value.

In my experience, teams that invest in quality from the start and make it part of how they work advance in a much more sustained and faster way, and at a lower emotional cost. They do not have to stop constantly to "fix the system," because they never let it deteriorate. And that is possible because they understand that quality is not inspected at the end: it is built in at every step.


Internal quality as the foundation of external quality

In Lean Software Development, external quality, the quality perceived directly by users or customers, is the priority. However, for that quality to hold up over time, a solid internal quality is essential: a well-designed, understandable system that can be maintained and evolved without fear.

Very often, the defects users see have their origin in problems that are invisible inside the system: coupled code, unreliable tests, technical decisions made without context, or fragile processes. These problems not only generate errors; they slow the team down, make it harder to adapt to change, and raise the cost of delivering value. They are a silent, but very real, form of waste.

Lean invites us to see these structural problems as improvement opportunities (kaizen) and to address them systematically. This is not about "beautifying the code" or following arbitrary rules; it is about building a solid technical foundation that reduces day-to-day friction and lets us move quickly and confidently.

We apply jidoka in this context too: when a flaky test, an opaque dependency, or a hard-to-deploy system blocks us, we flag it as a problem of the system, not an individual weakness. We stop, analyze, and improve the technical infrastructure so it does not happen again. Every small change adds up. For example, if a deployment fails repeatedly, we do not just retry it; we investigate the root cause and automate a solution, such as a script that verifies database availability before deploying, as sketched below.
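
As a minimal sketch of that kind of automated pre-deployment check, here is one possible Python version. The host, port, and retry policy are assumptions for illustration; adapt them to your own stack.

```python
#!/usr/bin/env python3
"""Pre-deployment gate: verify the database is reachable before deploying.

A minimal sketch; host, port, and the retry policy are assumptions.
"""
import socket
import sys
import time

DB_HOST = "db.internal.example.com"  # hypothetical host
DB_PORT = 5432                       # PostgreSQL default port
RETRIES = 5
WAIT_SECONDS = 3


def database_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the database can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main() -> int:
    for attempt in range(1, RETRIES + 1):
        if database_is_reachable(DB_HOST, DB_PORT):
            print("Database reachable; proceeding with deployment.")
            return 0
        print(f"Attempt {attempt}/{RETRIES}: database not reachable, retrying...")
        time.sleep(WAIT_SECONDS)
    print("Database unreachable; stopping the deployment (jidoka).")
    return 1  # a non-zero exit code stops the pipeline


if __name__ == "__main__":
    sys.exit(main())
```

Run as an early pipeline step: if the check fails, the line stops before a broken deployment can happen, instead of failing halfway through.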

Poka-yoke principles also apply to internal quality. Strong typing, simple design patterns, proper encapsulation, and tools that make the work easier without demanding constant effort are all ways to prevent technical errors and ease the system's evolution. The easier it is to do the right thing, the less likely we are to introduce debt or unintentional errors. For example, running a code linter to catch errors automatically before they reach production, or configuring the version control system so that code cannot be pushed without proper review.
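
As an illustration of the "strong typing" flavor of poka-yoke, here is a small, hedged Python sketch (the names are hypothetical) in which the type system makes it hard to mix up two values that would otherwise both be plain strings:

```python
from typing import NewType

# Distinct types for two values that would otherwise both be plain strings.
CustomerId = NewType("CustomerId", str)
OrderId = NewType("OrderId", str)


def cancel_order(customer_id: CustomerId, order_id: OrderId) -> None:
    """Cancel an order on behalf of a customer (body omitted in this sketch)."""
    print(f"Cancelling order {order_id} for customer {customer_id}")


customer = CustomerId("cus-42")
order = OrderId("ord-99")

cancel_order(customer, order)      # OK
# cancel_order(order, customer)    # flagged by a type checker such as mypy
```

A static type checker turns an easy-to-make argument swap into an error caught on the local machine, long before it can become a defect.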

In short, internal quality is not an end in itself but a means to guarantee external quality sustainably. When the system is easy to understand, test, and modify, the team can focus on delivering value, learning faster, and adapting better to what the customer needs. That structural simplicity is what makes it possible to build with quality… and move fast without breaking things.


Diagram showing internal quality as the foundation on which external quality is built.

Quality is not complexity or sophistication

A common confusion in software engineering is to equate quality with technical sophistication. We prize "beautiful" code, elegant solutions, or complex designs that anticipate future needs. From the perspective of Lean Software Development, however, this approach is deeply mistaken. Quality is not a luxury; it is a basic necessity, the fundamental hygiene of a well-made software product.

Lean does not reward unnecessary complexity or overdesign. Quite the opposite: it promotes deliberate simplicity as the way to reduce waste, ease the system's evolution, and guarantee its reliability. Quality, in this context, is not measured by how many patterns we apply or how "intellectually interesting" the design is, but by how well it solves the current problem, with the least possible effort and risk. It is like cleanliness in a house: not an ornament, but the bare minimum needed to live healthily.

Any line of code we do not need right now is a potential source of errors. It not only adds maintenance cost; it also makes the system harder to understand, slows its evolution, and can lead to wrong decisions.

"The best code is the code that does not exist." (Ward Cunningham)

From a Lean perspective, anticipating features that do not yet exist or designing systems beyond current needs is a form of waste. It is also a failure of kaizen, because it prevents iterating and learning step by step. And it breaks poka-yoke by introducing optional paths that are neither validated nor protected by tests. Instead of preventing errors, we are sowing them.

Quality in Lean is built with code that is clear, tested, understandable, and limited to what is strictly necessary. We take care of the design not to make it more complex, but to make it simpler, safer, and easier to evolve. We rely on tests, continuous feedback, evolutionary design, and constant refactoring to keep the system healthy without falling into the trap of planning the future from the present.

So no: technical beauty and overdesign are not quality. Often, they are its enemy. Building with quality in Lean means, above all, having the humility to solve just what is needed, do it well, and be ready to improve based on what we learn tomorrow.


Conclusions: Quality as basic hygiene

In summary, quality in Lean Software Development is not an optional feature or a sophisticated ornament. It is the base, the foundation, the fundamental hygiene of a sustainable and valuable software product. It is not about pursuing complexity or elegance for their own sake, but about building a system that is clear, simple, tested, and easy to maintain.

Quality is like the air we breathe: we do not always notice it, but its absence quickly suffocates us. Software without quality is software destined to fail: full of errors, hard to change, and expensive to maintain. That is why investing in quality from the start, and understanding it as an essential practice rather than a luxury, is the best way to ensure long-term success.

Quality is not an extra; it is the bare minimum. And simplicity is its best ally.

In the fourth part, we will see how collaboration and visibility are also fundamental to maintaining this hygiene and building sustainably high-quality software.


Friday, April 18, 2025

Lean Software Development: Detect errors before they hurt

Second part of the series on how to build with quality from Lean Software Development. If you haven't yet read the first part of this series, where we explain the fundamentals of Lean Software Development, you can find it here: https://www.eferro.net/2025/04/lean-software-development-building-with.html. After understanding why quality is not just the final result, in this installment we focus on how we detect errors as early as possible, stop the flow when they appear, and learn from them to improve the system.

To avoid confusion, we will use "error" to refer to any deviation from the expected result, and "defect" for errors that impact the customer in production.

Detecting errors as early as possible  

In more traditional approaches, defects are often prioritized based on their criticality, which sometimes determines whether they are fixed and within what timeframe. However, in Lean Software Development, which considers quality a fundamental part of the product and focuses on the continuous improvement of processes and systems, it is more common to classify errors (potential defects) based on where and when they were identified within the process.

Having this information allows us to identify the stages where errors are most common, helping us focus our improvement efforts on detecting them as early as possible (shift left), reducing the likelihood of them becoming defects.

In my experience, it is very useful to classify error detection depending on the stage where they are identified. I usually use the following classification:

  1. Local machine (pair or ensemble work)
    1. Development cycle (including the TDD cycle, continuous style verification (linting), type and error checks, etc.)
    2. Pre-commit
    3. Pre-push
  2. CI Pipeline:
    1. Checks
    2. Tests (classified from fastest to slowest)
    3. Deployment (including smoke tests and validations during rollout)
  3. Production environment
    1. Pre-release (deployed but not activated for the client)
    2. Client activation
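
To make stages like pre-commit and pre-push concrete, here is a minimal, hedged sketch of a git pre-push hook that runs checks ordered from fastest to slowest and stops the push at the first failure. The specific commands are assumptions; substitute your own linter, type checker, and test runner.

```python
#!/usr/bin/env python3
"""Git pre-push hook: run checks from fastest to slowest, stop on first failure.

Save as .git/hooks/pre-push and make it executable. The commands below are
assumptions; replace them with your project's own tools and paths.
"""
import subprocess
import sys

# Ordered from fastest to slowest so errors are detected as early as possible.
CHECKS = [
    ["ruff", "check", "."],          # style and static checks (linting)
    ["mypy", "."],                   # type checks
    ["pytest", "-q", "tests/unit"],  # fast unit tests
]

for command in CHECKS:
    result = subprocess.run(command)
    if result.returncode != 0:
        print(f"Check failed: {' '.join(command)}. Push stopped (jidoka).")
        sys.exit(result.returncode)  # a non-zero exit aborts the push

sys.exit(0)
```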

When a feature is already activated for the client, it is also useful to classify errors or defects based on who or what detected them:

  1. Automatic system, before it impacts the client (Monitoring)
  2. Internal user, such as a team member running tests or someone from another department
  3. Final user, reporting the defect to support

Regardless of the stage or who detected it, the main goal is always the same: detect (and fix) the error as early as possible, ideally before it is identified by an end user in production (when it is already considered a defect).

Lean Software Development accepts that we make mistakes continuously and understands that the cost (waste generated) increases the later the error is detected and fixed.

To illustrate how this progressive error detection is structured and visualized, I will show two real examples of pipelines we use. In both cases, the various steps (checks, tests, publishing, deployment, production validations, rollback, etc.) are organized to easily detect any error as soon as possible, stop the process, and fix it. This visualization not only helps structure the workflow better but also ensures that the entire team clearly understands at what stage each type of error can appear.

In this first pipeline, each component (webapp, API, event processor…) has its own checks, unit, integration, and acceptance tests, as well as differentiated publishing and deployment processes for different environments (dev and mgmt). Additionally, end-to-end tests are automated in production before activating changes, and a rollback logic is included if something fails. This structure reinforces the principle of automatically stopping the flow when errors occur and facilitates traceability at each stage.

In this second example, more focused on structural validations and specific testing of certain technologies (argo workflows in this case), additional phases such as static checks, cleanup tasks before publishing the image to ECR, and integration tests with different configurations are highlighted. This type of pipeline shows how even auxiliary tasks like configuration validation or environment cleanup are an integral part of an approach that seeks to detect errors before they hurt.

Stop and fix policy

Jidoka, also known as "autonomation" or "automation with a human touch," is a key principle of Lean Software Development. It’s not just about automating processes but doing so in a way that they automatically stop when a problem is detected, allowing teams to investigate and fix the root cause before continuing. Applying the Jidoka concept, teams working with Lean Software Development design development processes that make it very easy to detect errors—either automatically in most cases, or manually thanks to a development process that facilitates identifying those errors.

For continuous quality improvement to work, we not only need to detect those errors but it is crucial to stop immediately (See https://en.wikipedia.org/wiki/Andon_(manufacturing)) and have a policy that forces us to prioritize their immediate resolution. This way of working may seem too radical at first and might give the impression that it slows the team down. However, my experience is quite the opposite. If you adopt a working approach where, upon detecting an error, you analyze it, learn from it, and fix it at the root—by, for example, adding a test to prevent it from happening again—you soon achieve a process that resolves errors as early as possible. This eliminates a lot of rework and potential problems for the end customer while generating great confidence within the team to move quickly, take risks, and experiment.
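
As a minimal illustration of that "fix it at the root by adding a test" habit, here is a hedged pytest-style sketch; the pricing function and the bug it pins down are hypothetical:

```python
# test_pricing.py - regression test pinning a fixed bug (hypothetical example).


def apply_discount(base_price: float, discount_rate: float) -> float:
    """Fixed version: the discount is applied exactly once."""
    return base_price * (1 - discount_rate)


def test_discount_is_not_applied_twice():
    """This test failed on the buggy code (which discounted twice and
    returned 25.0) and keeps the same error from ever coming back."""
    assert apply_discount(base_price=100.0, discount_rate=0.5) == 50.0
```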

In fact, I believe it’s the best way to move fast sustainably, and the DORA Reports studies and the book Accelerate confirm that the best way to be fast is to build with quality—at least when it comes to product development.

In my case, this application of the Jidoka approach is reflected in:

  • Automatic tests that, upon detecting a failure, temporarily interrupt the development flow to prevent the error from propagating.
  • Git hooks (pre-commit, pre-push, etc.) that interrupt the flow if an attempt is made to push code with errors.
  • Working with trunk-based development, a strategy where all developers integrate their changes into a single main branch. In this setup, we run all validations on the main branch. When a test fails in continuous integration, we stop to fix it immediately. This is crucial in trunk-based development because any failure blocks the ability to integrate new changes, ensuring the main branch is always stable and ready for deployment. This discipline is fundamental to maintaining quality and avoiding error accumulation, allowing us to move fast with confidence.
  • Automatic prioritization in the workflow for resolving bugs detected in production, following an incident management process with postmortems for production incidents. This automatic prioritization is based on the severity of the error and its impact on the customer, determining which bugs are addressed first. (https://www.eferro.net/2024/12/using-blameless-incident-management-to.html).
  • Pair programming or ensemble programming, where multiple people work together on the same task, allowing misunderstandings or potential errors to be detected from the start. This intense collaboration acts as a continuous review that prevents many errors, both in understanding the problem and in implementing the solution.

Dealing with defects and learning from them

Accepting that we are going to make mistakes—and that some will become defects—is a fundamental part of the Lean approach. Instead of denying or hiding it, we embrace it as something natural. We work in complex environments, with multiple dependencies, constant uncertainty, and, moreover, we are human. We are fallible by definition.

That does not mean we don't try to avoid errors or defects. On the contrary, we put a lot of effort into preventing them with techniques like poka-yoke, automated testing, pair programming, evolutionary design, and many other practices that are part of our daily work. Even so, we know they will happen. And since we know it, we prepare to minimize their impact and recover as quickly as possible.

This shift in mindset is key: we move from an obsession with avoiding mistakes at all costs to a more robust and sustainable strategy based on fast recovery (resilience) and learning capability. Because when a defect reaches production, the first objective is to restore service as quickly as possible. And immediately after, to learn.

Over the past years, in several teams I've worked with, we've refined and applied a blameless incident management approach. The idea is simple: when an incident occurs, we don't look for someone to blame. We focus on understanding what happened, how the system contributed to the error, and what we can do to prevent it from happening again or reduce its impact next time.

This type of approach, simple as it may seem, has had a huge impact on team culture. It brings psychological safety, builds trust, promotes transparency, and encourages people to make problems visible without fear. At TheMotion, Nextail, and ClarityAI, we used it not only to manage incidents but also as a major lever to evolve the culture toward one that is more collaborative, learning-oriented, and focused on continuous improvement.

For example, in a recent incident where a service failed, we applied the 5 Whys technique and discovered that the initial problem (a configuration error) had triggered a cascade of events due to the lack of error handling in another service. This led us to add more robust integration tests and improve the resilience of the second service.

Our blameless incident management process relies on several principles:

  • Stay calm. Don’t panic. We even value this ability during interviews as a sign of professional maturity.
  • Assign an Incident Commander to coordinate the response and ensure no one is left alone firefighting.
  • Restore the service as soon as possible. Sometimes this means disabling a feature, communicating with customers, or applying a temporary mitigation. The important thing is to stabilize the system.
  • Analyze what happened in depth without seeking a single “root cause.” We understand that incidents usually stem from a combination of causes and circumstances. We use techniques like the 5 Whys, asking “why?” starting from the first visible symptom. Doing this in a group allows us to uncover the various factors that contributed to the incident. Often, we find flaws in the process, assumptions, communication, or even data interpretation.
  • Define corrective and preventive actions that not only avoid the problem but also reduce future recovery time and increase system resilience.
  • Integrate these actions into the normal workflow so they don't just remain on paper.
  • Use a blameless incident report, public within the company and collaborative, as the basis for collective learning. This continuous analysis and learning process is an example of Kaizen—continuous improvement applied to incident management, where we constantly seek ways to improve our processes and prevent future errors.

These reports include summaries, timelines, causes, actions, and learnings. Sharing them openly reinforces the message: errors are not hidden, they are learned from. And when the whole team internalizes this, the organization improves faster.

In the end, the message is clear: incidents are inevitable, but how we respond to them truly defines our culture. We can hide them, blame, and move on... or we can use them as catalysts for improvement and continuous learning. In Lean Software Development, we choose the latter.

In the next article...  

Once we have established how to detect and respond quickly to errors, it is crucial to build a solid foundation. In the third part, we will delve into internal quality as the basis for sustainable development. We will see why less is more, how simplicity and well-thought-out design accelerate development, and how a good technical foundation allows us to move fast without breaking things.



Monday, April 14, 2025

Lean Software Development: Building with Quality

Part one of a series on how to build with quality based on Lean Software Development principles. In this article, we explore what “quality” really means, how it differs from the traditional approach, and what principles help us prevent errors from becoming defects.

One of the main differences between Lean Software Development and traditional approaches lies in the treatment of quality—both in the resulting software or product and in the work environment (collaboration, tools, communication).

Lean Software Development considers quality as an integral part of the product’s value, essential to deliver value to the customer as efficiently as possible. From this perspective, all quality issues are considered one of the main wastes to eliminate.

In contrast, traditional approaches usually focus primarily on the quality of the final product. Lean, on the other hand, focuses on the continuous improvement of the processes and the system that generates the final product.

Quality control, which traditionally focuses on inspecting the product at the end of the process, becomes in Lean a fundamental part of each stage. In other words, instead of inspecting and validating quality at the end, quality is built in from the beginning and maintained throughout the process. This involves moving from a reactive approach to problems to a more proactive one, making any issues visible with the goal of solving them immediately and addressing root causes (system or processes).

We can summarize these differences as follows:

Focus

  • Traditional: Mainly on the quality of the final product.
  • Lean Software Development: Continuous improvement of processes and systems.

Quality Control

  • Traditional: Inspection at the end or at specific production checkpoints.
  • Lean Software Development: Built-in quality at every step (jidoka).

Problem Solving

  • Traditional: Reactive, addressing problems after they occur.
  • Lean Software Development: Proactive, highlighting and solving problems immediately.

In the end, Lean Software Development considers that all rework (incidents, bug fixing, repeated processes, etc.) caused by problems and defects is one of the major wastes to eliminate. Therefore, it focuses on introducing quality at every step of the process, minimizing waste, and fostering a habit of working with quality and continuously improving.


Quality is a broad concept that can cover various aspects: external quality (how it is perceived by the customer or end user), internal quality (related to the ease of evolution and maintainability of the code or system), and even the quality of our processes and work environment.

In this article series, we will place special emphasis on external quality, as perceived by customers or users. However, we will also address internal quality and the quality of our processes and work environment, since they are fundamental for sustaining high external quality in the long term, and since Lean Software Development places strong emphasis on continuously improving both the processes and the system that ensure such quality.

Basic Concepts

Let’s start with some concepts and definitions that will be useful throughout the rest of the articles:

  • Error: We consider an error to be any deviation from the expected result. For example, a failed test, code that does not behave as functionally expected, or a failure during our application’s execution.
  • Problem: In the context of Lean Software Development, a problem is any obstacle that prevents us from delivering value to the customer in the most efficient and effective way. This includes the gap between the current state and the desired state, but also covers situations such as:
    • Ambiguous requirements leading to incorrect implementation.
    • Communication problems within the team that result in misunderstandings and delays.
    • Technical limitations preventing desired performance.
    • Inefficiencies in the deployment process causing delays.
  • Defect: A defect is an error that prevents the product from fulfilling its function or purpose, directly affecting customers. Defects are one of the main sources of waste identified in Lean Software Development (see “Eliminating Waste in Development”). Therefore, it is a priority to avoid them whenever possible and eliminate them immediately when detected. The linked article explores how Lean Software Development adapts the seven types of waste from Lean Manufacturing to software, focusing on eliminating activities that do not add value to the customer and promoting practices like reducing “Partially Done Work” and avoiding “Extra Features and Code.”

It is important to remember that defects are a subset of errors, and problems can include multiple errors and defects.

Although this classification may seem confusing at first, distinguishing between problem (a general situation that includes errors and defects), error (any deviation from expectations, like a failed test, an incorrect variable name, or a poorly implemented feature), and defect (an error that affects the customer or the system’s purpose and is a fundamental source of waste) helps us better understand the situation and define appropriate processes to address each case.

Lean Approach to Reducing Errors and Defects

  • Jidoka: Also known as "autonomation", this principle seeks to detect and correct defects automatically, stopping production as soon as an error is identified. A classic example in Lean Manufacturing is the use of “Andon Cards,” visual signaling systems that allow any worker to stop the production line upon detecting a problem. In our software development context, this translates to automated testing at different stages, alarms and monitoring, the use of a CI pipeline that stops deployment if tests fail, among other mechanisms.
  • Poka-Yoke: Translated as "mistake-proofing", this approach relies on systems and designs that prevent human errors. In software development, Poka-Yoke applies both to our internal processes and the user's interaction with the product.
    • In the development process, we implement Poka-Yoke through practices such as using strongly typed programming languages, tools with optimal default settings, automated tests, ValueObjects, and preconditions. These techniques help prevent coding and design errors.
    • In software usability, we apply Poka-Yoke through error-prevention systems, effective contextual help messages, and intuitive designs. This is where User Experience (UX) plays a crucial role. Although UX is not a direct Poka-Yoke mechanism in the technical sense (like a physical sensor stopping a machine), it shares the fundamental goal of preventing user errors. Good UX design anticipates possible mistakes, guides the user clearly, and provides immediate feedback, significantly reducing the chances of user error. In this sense, UX complements and reinforces Poka-Yoke principles, ensuring the software is as intuitive and error-free as possible for the end user.
  • Kaizen: This is the principle of continuous improvement applied to our processes and systems. As one of the pillars of Lean, it promotes the constant search for ways to reduce errors and improve quality, both in the product and in the processes and tools we use.
Image: Majo statt Senf, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=38767688
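
As one concrete illustration of these development-process poka-yokes, here is a small, hedged Python sketch of a ValueObject with a precondition (the domain and names are hypothetical): an invalid value simply cannot be constructed, so whole classes of errors disappear downstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # immutable: a ValueObject cannot change after creation
class EmailAddress:
    """ValueObject with a precondition: an invalid email cannot exist."""
    value: str

    def __post_init__(self) -> None:
        # Precondition (poka-yoke): reject malformed values at construction time.
        if "@" not in self.value or self.value.startswith("@"):
            raise ValueError(f"Invalid email address: {self.value!r}")


# Every function receiving an EmailAddress can now trust it is well-formed.
email = EmailAddress("ada@example.com")   # OK
# EmailAddress("not-an-email")            # raises ValueError immediately
```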

Avoiding Errors from Becoming Defects

In all the teams I’ve worked with, we make mistakes continuously. Some are just development process errors, while others are misunderstandings. The latter arise because we’re constantly learning about the problem. This is the reality: we make mistakes constantly, and everyone I know in this profession makes them. Some more, some less, but no one is error-free.

If we acknowledge and accept this reality, the important thing is to understand that the goal is not to completely avoid errors (something impossible), but to ensure that those errors don’t become defects that impact the final user experience. The article “Be humble, no rockstars allowed” advocates for humility and teamwork in software development, emphasizing that continuous learning and effective error management are crucial, rather than relying on individual “rockstars.”

This is Lean Software Development’s approach: acknowledge and accept that we make mistakes, and focus on having a process that allows us to detect them, protect ourselves from them continuously, resolve them, and prevent them from becoming defects.

How do we achieve this?

  1. Incorporating quality in every phase of development, automated whenever possible (Jidoka + Poka-Yoke).
  2. Continuously improving the mechanisms and techniques that ensure quality at each stage (Kaizen).

In the next article, we will explore how to detect errors as early as possible, stop the flow when they appear, and learn from them without pointing fingers. This could mean stopping the CI/CD pipeline if an integration test fails, blocking code commits if unit tests don’t pass, stopping a pairing session if a critical error is discovered, or even temporarily halting new feature development to address a severe performance issue. Because working with quality also means having a strategy so that errors don’t become defects.


Thursday, April 10, 2025

Vibe Coding III: Complexity Creeps—Unless You Don’t Let It

Getting Back to My Picks

Over the past few days, I revisited the development of eferro Picks. The project had been well received, and I’d gotten some interesting feedback, so I decided to give it a proper push. But this time, it wasn’t just another playful exploration. The project had grown in complexity and was starting to demand a more deliberate approach—especially if I wanted it to scale or stay maintainable.

My first step: simplify the code, strip out anything superficial, and wrap everything in tests to make the project more sustainable.

The next step was to process all the feedback I had received about the site and implement some improvements that would make the experience smoother and more intuitive. This included better navigation, clearer tooltips, and a toggle to filter talks that actually had notes. It was a great excuse to turn real user feedback into practical functionality—while continuing the vibe coding experiment in a slightly more purposeful direction.

https://eferro.github.io/eferro-picks-site/

The AI as My Pair Programming Partner (With Superpowers and Flaws)

For this session, I continued my vibe coding experiment, but with a twist. Since my experience with front-end technologies like TypeScript, React, and Tailwind CSS is limited, I focused solely on accepting or rejecting the AI's proposed changes. I didn’t review or modify the code directly—instead, I observed the AI’s behavior and guided its direction from a higher level. It was pure vibe coding: trusting the AI and seeing where it would lead.

I used the Cursor Pro IDE, following the vibe coding rules I outlined in the first post of this series (Vibe coding: building things from curiosity and flow). That meant focusing on flow, intuition, and rapid iteration—even when partnering with an AI. Those original principles shaped how I interacted with it, emphasizing exploration over rigid planning.

In this new phase, the AI stopped being just a tool and started acting more like a pair programming partner. One that—despite chaotic moments—can be surprisingly effective when guided with care and intention. I say “guide” because without boundaries, it tends to suggest overly generic or needlessly complex solutions. Almost like it's channeling the collective ego of every public repo out there.

Left unchecked, the AI naturally gravitates toward generality, abstraction, and unnecessary flexibility—just like many of us do when we’re overthinking or trying to be clever.

This approach helped me refactor with focus, prioritizing simplicity and clarity. But it wasn’t all smooth sailing—and I think it’s important to be transparent about that.

During the process:
  • Two or three times, I had to stop because the AI entered loops it couldn’t escape.
  • Once, it even pushed invalid changes directly to production (I work in trunk).
  • Two or three times, I had to revert committed changes just to get back to a stable state.
  • At least twice, when a test wouldn’t stabilize, the best option was to delete it, move on, and return later with fresh eyes. That worked far better than endless poking.


These moments—while frustrating—reinforced something I already believed: the only way to build sustainably with this much raw power at my fingertips is to work in small, safe steps.


The Power of Small Safe Steps (Now More Than Ever)

One of the clearest takeaways from this session: with this much speed and assistance, working in Small Safe Steps becomes more essential than ever.

It’s valuable to know what I want to achieve, have techniques to move forward in parallel, and break tasks into manageable chunks. In practice, most of these “steps” were completed in sessions of 25 minutes or less. Each one designed to be:
  • Small: a change that takes just a few minutes.
  • Safe: unlikely to break production code or existing tests.
Also, due to my current responsibilities, I can only code in short, scattered bursts. I no longer have the luxury of regular pair or ensemble sessions. That’s why the benefits of small steps are fundamental for me—especially:
  • Interruptibility: I can pause anytime without losing the thread.
  • Safety: Each change is low-risk and easy to roll back.
  • Autonomy: I can keep moving forward, even solo, without creating chaos.


This way of working also offers continuous feedback. Geepaw Hill’s article MMMSS: The Intrinsic Benefit of Steps describes this beautifully. I highly recommend reading the full Many More Much Smaller Steps series—it sticks with you.

Small Improvements, Real Impact

In these sessions, I also tackled some of the feedback I had received.

For example, the note icon next to some talks was unclear. After digging into the data with the AI, we discovered that many of those records didn’t contain actual notes—just whitespace or line breaks. So:
  • I implemented a filter to clean those up.
  • Added a tooltip explaining what the icon meant.
  • And (because experimenting is cheap) added a toggle to filter talks with real notes.
Another piece of feedback was trickier: when users navigated to a talk and returned, they lost their filters and scroll position.

This was tough—mainly due to my lack of frontend experience. The AI kept trying random libraries and approaches, switching strategies mid-flight. It felt like pure trial-and-error with no clear rationale.
Things only improved when I stepped in with a specific suggestion:
“Can you change the approach and just use a specific key for the index (scroll_index) so we can use this key when coming back? This way we avoid the problem of overwriting the key.”
The AI immediately responded:
“Ah, that’s a much simpler approach! Instead of handling different paths and normalizing them, we can:
  • Use a specific key for the index page scroll position
  • Save it when leaving
  • Restore it when returning
  • No need to handle paths or base URLs. Let me update the code.”
This moment captured a recurring pattern: complexity is still my responsibility.
Left unchecked, the AI naturally gravitates toward generality, abstraction, and unnecessary flexibility—just like many of us do when we’re overthinking or trying to be clever.

But if I pause, question, and steer things back toward simplicity, the AI follows—with surprising grace.

I can (and must) be the guardian of simplicity—with or without AI.

A Retrospective with the AI (and Tangible Results)

In the middle of all this, I ran a mini-retrospective. I asked the AI:
“We’ve been working together for a while. I think you already know what concerns me. Can you describe our workflow and what behaviors you've changed based on my instructions?”
The AI replied:
“Our workflow has evolved into a highly collaborative process where:
  • We work in small, focused steps
  • Each change is preceded by clear reasoning
  • You maintain control
  • I provide options and wait for your decisions
  • We verify each change before moving forward
  • Communication is direct and natural
I’ve adapted to:
  • Present one change at a time
  • Show my thought process
  • Ask for confirmation before proceeding
  • Focus on practical, simple solutions
  • Maintain test coverage
  • Keep methods small and focused
  • Would you like me to adjust anything? 😊”
Then I followed up:

“Several times I’ve had to revert your changes because you didn’t run the tests before committing. Sometimes even several commits at once.
  • Why do you think this happened?
  • How could we have avoided it?
  • What have we learned?
  • How should we adapt to keep complexity under control?”
The AI responded:
“To improve our process, I suggest:
New workflow steps:
  • Make one small change
  • Run tests immediately
  • Fix issues
  • Get your review
  • Only then commit
  • Never push without your explicit approval
Quality gates:
  • All tests must pass
  • Changes must be minimal
  • Each commit should be independently testable
  • No pushing without verification
Communication improvements:
  • Always report test status
  • Ask before pushing
  • Be explicit about what’s tested
  • Show test results

Would you like me to implement these changes in our workflow? 😊”
And the best part? It followed through. For the rest of the session, it ran tests before suggesting anything and asked for confirmation before every commit or push.

Closing: Being a Responsible Coder, Even When Alone

Will the AI forget all this once I close the IDE? Probably. But I’ve learned that these kinds of iterations and retrospectives have value. Therapeutic for me. Educational for the AI—at least within the session.

This experience reinforces the critical importance of working in Small Safe Steps, especially when collaborating with powerful AI. It’s not just about speed, but about maintaining control and understanding.

Continuous simplification is key. Left unchecked, complexity always grows. I have to actively fight that—and help the AI do the same.

Ultimately, it’s up to me to be the guardian of simplicity.

And given my current role, I don’t get to code as much as I’d like—certainly not in pair or ensemble settings. So this is the closest I get to coding responsibly. Because anyone who knows me knows:
 Leaving me to code alone is rarely a good idea. :)



Monday, April 07, 2025

The Phrases That Power My Leadership

Over the years, I’ve noticed that certain phrases have become central to how I lead teams. They're not just casual remarks—they're powerful tools that shape our culture and drive our decisions.

My leadership style leans more on intuition than rigid structure. But through reflection, I’ve realized I often return to the same phrases. These serve as mental shortcuts that influence how we work, decide, and collaborate.

"What’s the worst that could happen?" encourages experimentation without fear. It reflects a mindset of calculated risk-taking and a commitment to creating a safe space for innovation. This question only works if we can confidently answer, “Nothing serious—and we can fix it quickly.” That confidence comes from a strong foundation: technical excellence, sound design, automated testing, and continuous deployment. 


When we trust our systems and processes, experimentation becomes second nature. Then, questions like “What’s the next seemingly impossible goal we’ll accomplish?” don’t sound wild—they become grounded ambition. This mindset grows by breaking down complex problems and making thoughtful decisions step by step.

"Can we avoid doing it?" and "Can we achieve the same impact with fewer resources?" drive us toward efficiency. These align with Lean principles: reduce waste, maximize value. They help us cut complexity and focus on what truly matters. "What if we only had half the time?" forces us to prioritize and think about vertical slicing, ensuring we get feedback sooner. When technical quality is high, it’s easier to identify what’s unnecessary, simplify with confidence, and adapt quickly. Even a suggestion like “Let’s remove it and monitor the impact; we can restore it if needed” becomes a safe, low-risk move. In an environment with solid observability, metrics, and rollback capabilities, deletion becomes just another experiment. Another phrase I often use is, “Don't do anything I wouldn't do.” This encourages a sense of shared responsibility and ensures that everyone feels empowered to make decisions within the established boundaries and values of the team. It promotes trust and reinforces that we're all in this together.

These phrases go beyond individual decisions. They spark a positive feedback loop:

  1. Technical excellence builds speed and confidence.
  2. That enables experimentation, learning, and impact.
  3. That impact strengthens the team’s autonomy and trust within the organization.
  4. That trust leads to more investment—and more progress.

It’s a virtuous cycle. It doesn’t happen overnight, but once it begins, it changes everything.

In the end, these phrases work because they reflect a way of working built on ambition, focus, and care. We aim for innovation and execution while supporting each other along the way. And they work because there are teams who don’t just understand them—they live them every day.

This approach is also only possible thanks to the patience and insight of my colleagues. Thank you for joining me in exploring new ideas—and for steering us wisely when needed.

These are just a few of the phrases that guide my leadership. I’d love to hear the ones that guide yours. Feel free to share them in the comments below!


Ultimately, these phrases—and the principles behind them—help build a culture of trust, innovation, and continuous improvement. It’s about empowering teams to do their best work and make a real impact.