SPA 2021 Agile Data Science Retrospective Write Up

I recently ran a virtual retrospective session at the excellent SPA 2021 titled “Agile Delivery and Data Science: A Marriage of Inconvenience?”, about agile deliveries which include some kind of data science component. I decided to submit the session proposal because, over the last few years, I have observed some recurring pain points whilst working on agile data platforms, and from conversations with others outside our organisation these did not seem, anecdotally at least, to be unique to us. I was keen to understand how widespread the issues might be, and to share and learn from the participants which solutions they had tried, and what had succeeded and failed in which contexts. Given the limited set of captured data points, the following is clearly not conclusive. However, it might hopefully still be of some help to others experiencing similar challenges.

The retrospective was run using the excellent retrotool (a recent discovery for me) and was structured in two parts: the first focusing on pain points, and the second on which attempted solutions had worked and which had not (unfortunately the final discussion ended up rather pushed for time, so apologies to the participants for not managing that better). In reviewing the points raised, I then tried to structure the resulting themes in an “outside-in” manner, starting with the customer and working back up the delivery dependency tree to the more technical delivery concerns. The outputs were as follows.

Pain Points

Customer Identity

The Data Mesh architectural pattern has recently gained a lot of popularity as an adaptable and scalable way of designing data platforms. Having worked on knowledge graph systems using domain-driven design for many years now, I’m a pretty strong advocate of the data mesh principles of domain ownership and data products. Data Mesh “solves” the problem of how to work with data scientists by treating them as the customer. That’s a neat trick, but it only really works where that model reflects the actual delivery dependencies – i.e. where there are no inbound dependencies from the outputs of data science back into product delivery. In my experience, that most typically applies in enterprise BI scenarios where organisations are becoming more data-driven and need to aggregate data emitted from services and functions across their value streams. In such situations there is generally a unidirectional dependency from the data services out to the data science teams, and the data mesh approach works nicely.

In other contexts, however, it doesn’t work so well. Firstly, in situations where data science outputs are an integral part of the product offering (e.g. where you are serving predictive or prescriptive analytics to external paying users), data scientists can’t just be treated as “the customer”. They need to be integrated into the cross-functional delivery team just like everybody else, otherwise the dynamics of shipping product features can become severely disrupted. Secondly, in situations where data science outputs are being used to reconfigure and scale existing manual processes, then again the primary customers aren’t the data scientists. Instead they are typically the impacted workers, who most often see their roles changing from task execution to process configuration, quality assurance and training corpus curation.

Product Ownership

The role of agile product owner demands a skillset that is reasonably transferable into the data product owner role in data mesh enterprise BI scenarios. However, in contexts where data science outputs are an integral part of the product offering, the data product owner role generally requires domain expertise orders of magnitude deeper than typical agile product ownership roles. In sectors like healthcare and life sciences, understanding a.) which algorithms could best help solve the problems your customers/users care about most, and b.) the limitations and assumptions on which those algorithms are based, which determine the contextual boundaries of their usefulness, are critical product concerns that can only be addressed with deep domain expertise. Such specialists almost certainly don’t have additional skills in agile product management (or, for that matter, production-quality software development – more on this below).

Stakeholder Expectation Management

Due to a general lack of understanding of machine learning amongst many stakeholders, there are frequently unrealistic expectations about what is achievable, both in terms of algorithm execution speed and precision/recall. This can be compounded by the challenges of delivering trained statistical models iteratively, where the combination of poor quality initial training data and a lack of target hyperparameter ranges often results in early releases with relatively poor ROC curves or F1 scores. Such deliverables, combined with limited stakeholder understanding of the technical constraints, can rapidly undermine trust in the delivery team.
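
When managing these conversations, I have found it helps to walk stakeholders through the arithmetic on a toy confusion matrix, so that a single headline metric does not carry all the weight. A minimal sketch in Python, with entirely invented figures:

    # Toy illustration (invented numbers): how precision, recall and F1
    # are derived from a confusion matrix for an early model release.
    true_positives = 40   # relevant items the model correctly flagged
    false_positives = 25  # items flagged in error
    false_negatives = 35  # relevant items the model missed

    precision = true_positives / (true_positives + false_positives)  # ~0.62
    recall = true_positives / (true_positives + false_negatives)     # ~0.53
    f1 = 2 * precision * recall / (precision + recall)               # ~0.57

    print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")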

Skills

Simon Wardley’s excellent Pioneer/Settler/Town Planner model describes the differing skill sets and methods appropriate to the various stages in the evolution of a technology. It offers a very clear explanation for one of the most common friction points in agile data science programs: they pull Pioneer-style data scientists, most familiar with research work in Jupyter notebooks and unused to TDD or CI/CD practices, together with Town Planner-style data engineers, used to back office data warehouse ETL work and/or data lake ingest jobs, into Settler-style cross-functional agile delivery teams which aspire towards generalising specialist skill sets rather than narrow specialisations. This can manifest both during development, via substantially different dev/test workflows, and during release, via a lack of due consideration for operational concerns such as access control, scaling/performance (a preference for complex, state-of-the-art frameworks over simplicity) and resilient data backup procedures.

Fractured Domain Model

Where data scientists aren’t fully integrated into delivery teams, a common observation is the introduction of domain model fractures or artificial boundaries in the implementation between “data science code” and “software engineering code”. Most typically this occurs when data science teams develop blackbox containerised algorithms which they then hand over the wall to the delivery team for deployment. It is quite common for different algorithms in the same problem space to share first or second order aggregates which are calculated in the early part of the execution sequence. By treating the whole algorithm as a blackbox, these lower level aggregates are needlessly recalculated time and time again, impacting performance, scalability and caching. Also, any variation in the formulae used to calculate these lower order aggregates remains hidden. Whilst the resulting differences in outputs might be well within the error margins of any prediction being generated, where such data inconsistencies become visible to end users it can undermine their trust in the product.
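
A minimal, hypothetical sketch of the alternative (none of the names or formulae below come from a real delivery): the shared first-order aggregate is computed once in a common component, and the individual algorithms consume it rather than each re-deriving it inside their own black box.

    from statistics import mean

    # Hypothetical example: two algorithms in the same problem space share a
    # first-order aggregate (a per-patient mean, say). Computing it once in a
    # shared component keeps the formula consistent and cacheable, instead of
    # each containerised black box re-deriving its own variant.

    def patient_means(observations: dict[str, list[float]]) -> dict[str, float]:
        """Shared first-order aggregate, calculated in one place."""
        return {patient: mean(values) for patient, values in observations.items()}

    def risk_score(means: dict[str, float]) -> dict[str, float]:
        """One 'algorithm' consuming the shared aggregate (illustrative formula)."""
        return {p: min(1.0, m / 100.0) for p, m in means.items()}

    def cohort_ranking(means: dict[str, float]) -> list[str]:
        """A second 'algorithm' consuming the same aggregate."""
        return sorted(means, key=means.get, reverse=True)

    observations = {"p1": [80.0, 90.0], "p2": [120.0, 110.0]}
    shared = patient_means(observations)  # computed once, reusable and cacheable
    print(risk_score(shared), cohort_ranking(shared))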

Technical Debt 

Agile data science programs seem to be particularly prone to incurring payback on existing legacy technical debt in the very early stages of delivery. Often this is due to the presence of data lakes, which historically swapped the upfront data cleansing and normalisation costs of data warehousing for a just-in-time approach driven by use case demand. Whilst reasonable enough from a cost optimisation perspective, this approach unfortunately violates Postel’s Law of being “liberal in what you accept and conservative in what you send” (or at least implies a very substantial lag between data being accepted and then sent onwards). Migrating data from a data lake into a modern collection of microservice data products almost always requires those cleansing and normalisation debts to be paid off again. The other factor is that predictive analytics normally highlights any data quality issues which may previously have lurked beneath your radar, as their impact gets amplified once the data is used to train forecasting models.
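
For illustration only (the record shapes and formats below are invented), this is the sort of cleansing and normalisation debt that typically gets paid off when lifting raw lake records into a data product:

    from datetime import date

    # Hypothetical sketch of cleansing debt being repaid at migration time:
    # inconsistent date formats, stray whitespace and mixed casing all have to
    # be dealt with somewhere, and a just-in-time lake simply defers that cost.

    RAW_LAKE_RECORDS = [
        {"patient_id": " P001 ", "visit_date": "2021-03-01"},
        {"patient_id": "p002", "visit_date": "01/03/2021"},  # different source, different format
    ]

    def normalise(record: dict) -> dict:
        raw_date = record["visit_date"]
        if "/" in raw_date:                   # dd/mm/yyyy from one source
            day, month, year = raw_date.split("/")
            visit_date = date(int(year), int(month), int(day))
        else:                                 # ISO format from another
            visit_date = date.fromisoformat(raw_date)
        return {
            "patient_id": record["patient_id"].strip().upper(),
            "visit_date": visit_date,
        }

    print([normalise(r) for r in RAW_LAKE_RECORDS])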

What Didn’t Work Well

Treating Data Science as One Thing

Thinking about data science from a Wardley Mapping point of view, it quickly becomes clear that the discipline in fact covers a wide range of skills and activities: from true R&D on innovative machine learning architectures (genesis), to statistical software engineering and integration of existing model frameworks (product), through to data pipeline engineering (commodity). Treating all of these as a single activity and skillset is the underlying cause of many delivery problems. 

Data Science as an upstream R&D function

The most common manifestation of “treating data science as one thing” is to move all the data scientists into an upstream R&D function. If what they are working on is in fact “product” phase deliverables, the unavoidable result is a shift into a more waterfall-style delivery model which will fundamentally impact your agile delivery capabilities. Conversely, if they are working on true “genesis” phase innovation work, it raises the question of whether this really makes strategic sense for many organisations. There has been a massive recruiting push for data scientists recently, many of whom have been building machine learning capabilities which now exist as cloud services or else will do very shortly as the technology matures. Unless you work for a big tech company or your organisation has very niche requirements and differentiators, it is unlikely that you will really need a data science R&D capability going forwards.

Additional team between Data Science and Software Engineering

Some organisations try to keep data scientists in ringfenced teams and then add an interface layer between them and software engineering via a team with roles named something like “machine learning engineer” or similar. The fundamental problem with this approach is that it is trying to solve a dependency problem by adding more dependencies, and it is therefore pretty much guaranteed not to work.

What Worked Well

Integrating Data Scientists into Cross-functional Teams

The best way to deal with dependency problems is to restructure your delivery process so that its boundaries better reflect the flow of value. Unless you have outbound-only dependencies into your data science teams, that means integrating “product” phase data science into your cross-functional delivery teams. Pairing data scientists with software engineers, TDDing your algorithm invocation code, and adding data scientists into Three Amigos (Four Amigos?) feature discussion/planning might have some early impact on team velocity whilst unfamiliar skills are shared and learned, but it rapidly pays off in terms of maintaining a high quality agile product delivery capability.
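
As a minimal sketch of what “TDDing your algorithm invocation code” might look like (the names are hypothetical and the model is deliberately stubbed out): the tests pin down the contract around the model call rather than the statistical internals of the model itself.

    # Hypothetical pytest-style example: the delivery team owns and tests the
    # thin invocation wrapper, while the model behind it remains swappable.

    def classify_documents(documents: list[str], model) -> list[dict]:
        """Thin invocation wrapper around whatever model is plugged in."""
        if not documents:
            return []
        labels = model.predict(documents)
        return [{"text": doc, "label": label} for doc, label in zip(documents, labels)]

    class StubModel:
        """Stand-in for the data scientists' model, so the contract is testable."""
        def predict(self, documents):
            return ["intervention"] * len(documents)

    def test_empty_input_returns_empty_list():
        assert classify_documents([], StubModel()) == []

    def test_each_document_gets_a_label():
        results = classify_documents(["doc a", "doc b"], StubModel())
        assert [r["label"] for r in results] == ["intervention", "intervention"]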

Domain Expert as Product Owner

For complex domains which require deep subject matter expertise, it is generally better to pair a domain expert with an agile coach to serve as product owner rather than trying to cross-train an experienced agile PO. A detailed understanding of your customers’ problems should always trump knowledge of delivery methodologies, otherwise you simply risk the common failing of building the wrong thing in the right way. 

Use of Demonstrator Apps/Test Harnesses to Show Progress

Interestingly, participants found that using demonstrator apps, test harnesses and tools such as AWS Quicksight or Google Data Studio was a lot more effective for demonstrating early progress to stakeholders than early software product releases. One reason for this is that poor algorithm performance viewed in the context of the end product tended to generate panic and blind stakeholders to the progress being made, whereas they were a lot more open to explanations of current constraints when viewing an algorithm being executed in isolation.
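
A minimal sketch of the kind of harness participants described (everything below is hypothetical): the algorithm is exercised in isolation against a small labelled sample, and its known constraints are reported right alongside the numbers rather than being discovered inside the end product.

    # Hypothetical demonstrator harness: run the model over a labelled sample
    # and report accuracy together with the currently known constraints.

    KNOWN_CONSTRAINTS = [
        "Trained on only one data source so far",
        "Hyperparameter search not yet complete",
    ]

    def run_harness(model, labelled_sample: list[tuple[str, str]]) -> None:
        correct = sum(1 for text, label in labelled_sample
                      if model.predict(text) == label)
        accuracy = correct / len(labelled_sample)
        print(f"Sample size: {len(labelled_sample)}  accuracy: {accuracy:.0%}")
        print("Current constraints:")
        for constraint in KNOWN_CONSTRAINTS:
            print(f"  - {constraint}")

    class StubClassifier:
        """Stand-in model so the harness itself can be demonstrated."""
        def predict(self, text: str) -> str:
            return "included" if "trial" in text else "excluded"

    sample = [("randomised trial of drug X", "included"),
              ("opinion piece on policy", "excluded"),
              ("case report", "included")]
    run_harness(StubClassifier(), sample)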

Design for Incremental Automation over Big-Bang

Finally, the agile principles of incremental, adaptive delivery should be applied as much to process automation via machine learning as they are to the product delivery itself. The extent to which it is optimally cost effective to augment human processes with machine assistance is generally not something which can be decided upfront. For this reason, gambling everything on 100% full automation creates a high risk of failure or major cost overruns. Instead it is much safer to iterate from manual to increasingly automated processes, designing your systems around the sources of variability until the point of acceptable performance is achieved. As an example, a machine learning service which classifies research literature may have results which vary by subject area (e.g. it might be better at classifying drug interventions than behavioural interventions) and source (e.g. it might perform better against well structured documents from one source than against documents from another). In such circumstances, designing quality assurance tools that support sample rate checks which can be varied by subject area and data source is much more likely to succeed than a delivery which naively assumes machine learning can solve everything within cost effective limits.
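
As a minimal sketch of that last point (the rates, keys and names are all invented for illustration), the human review sample rate is simply looked up by subject area and data source, so it can be tightened wherever the classifier is known to be weaker:

    import random

    # Hypothetical quality-assurance sampling whose rate varies by subject
    # area and data source, reflecting where the classifier performs worse.

    SAMPLE_RATES = {
        ("drug", "structured_source"): 0.05,         # model performs well: light checks
        ("behavioural", "structured_source"): 0.25,  # weaker subject area: heavier checks
        ("behavioural", "unstructured_source"): 0.50,
    }
    DEFAULT_RATE = 0.20

    def needs_human_review(subject_area: str, source: str) -> bool:
        rate = SAMPLE_RATES.get((subject_area, source), DEFAULT_RATE)
        return random.random() < rate

    # e.g. route a freshly classified record to the QA queue
    if needs_human_review("behavioural", "unstructured_source"):
        print("send to manual review queue")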

Thanks

I would like to thank all the participants who attended the retrospective session, and in particular Paul Shannon and Immo Huneke for their contributions.

Rapid Response Covid-19 Delivery at Cochrane

We’ve posted details about the work we’ve been doing with Cochrane on their Covid-19 Rapid Response. It’s definitely one of the deliveries I’ve been most proud to work on.

    • Info on the business challenge and tech/data strategy is here.
    • Details on the platform and deployment architecture are here.


LSSC11 pt2: Lean as Natural Science

In the previous post we discussed control in complex adaptive systems. We examined the different categories of constraints that shape emergent behaviour, and highlighted the dangers created by failing to recognise the limits of human control. In this post I would now like to examine the extent to which it’s meaningful to think about lean product development as being ‘natural science’. In doing so, I intend to avoid getting drawn into philosophical debate about the scientific method. Rather, taking my cue once more from David Snowden’s LSSC11 keynote, I’d like to start by examining the narrative context of appeals to natural science (i.e. claims that field F ‘is science’). Historically, such appeals have commonly been made for one of two reasons:

  1. Legitimacy
  2. Power

Legitimacy

A long tradition exists of appeals to science in order to assert the reliability and validity of new or poorly established fields of intellectual inquiry. That tradition has been based in academic circles, and can be traced back through management and education science; political science, economics and anthropology; through to linguistics and famously even history.

Whether this is something lean product development needs to concern itself with, I would question. As a discipline based on practical application rather than theoretical speculation, we can rely on natural selection to take care of that validation for us: if the methods we use aren’t effective then we will simply go out of business. Economic recession makes this process all the more reliable. Earlier this year I came across a really great risk management debate between Philippe Jorion and Nassim Taleb from 1997, where Taleb makes the following point:

We are trained to look into reality’s garbage can, not into the elegant world of models. [We] are rational researchers who deal with the unobstructed Truth for a living and get (but only in the long term) [our] paycheck from the Truth without the judgment or agency of the more human and fallible scientific committee.

For me this summarises lean practice. Yes, we are performing validated learning – but simply because doing so is rational behaviour, and more effective than taking a blind punt. Beyond that, whether the techniques employed constitute science seems to me an unnecessary diversion.

That would be of arguably little consequence were it not for one issue: namely, risk. Reminding ourselves that risk is a function of value rather than an entity in its own right, the value that science strives for is truth (however that might be defined). However the value that complex systems risk management strives for is survival. This difference creates diametrically opposed attitudes to outlier data points. Risk management in science is concerned with protecting the truth, and so all outlier data points are by default excluded unless they are repeatable and independently verified. On the other hand, risk management in complex systems is all about managing the cost of failure. It doesn’t matter if you are wrong most of the time, as long as in all those cases the cost of failure is marginal. What counts is being effective when the cost of failure is highest, as that creates the greatest chance of survival. As a result, outlier data points are by default included and are only ever discounted once we are highly confident they are not risk indicators.

The financial crisis has demonstrated the consequences of confusing these two different perspectives: of risk management techniques which are right most of the time but least effective when the cost of failure is greatest.

Power

The other common motive for appeals to science has been power. Going back to Descartes and the birth of western science, its narrative has always been one of mastery over and possession of nature – in short, the language of dominion and control. This takes us back to the themes of the previous post.

Such a perspective has become so deeply embedded in our cultural consciousness that it now influences our comprehension of the world in all sorts of ways. A recent example has been the debate in software circles about avoiding local optimisation. As a technique for improving flow through systems, the principle is sound and highly valuable. However it is entirely dependent on the descriptive coverage of the system in question. Some systems, such as manufacturing or software delivery pipelines, are amenable to such complete mapping. Many others however, such as economic markets, are not. A technique commonly used to describe such systems in evolutionary biology is the fitness landscape:

The sectional view through such an environment might be as follows, leading the unwary to highlight the importance of avoiding local optimisations at points A and C and always evolving towards point B.

The problem here is that for landscapes such as economic markets, the above diagram represents the omniscient/God view. For mere mortals, we only have knowledge of where we have been and so the above diagram looks simply like this:

Whilst it is a valuable insight in its own right simply to understand that our product is a replicator competing in a fitness landscape, as much as we might like to avoid local optimisations, doing so is impossible because we never know where they are (even at maxima, we can never tell whether they are global or local).

It is for these reasons that I think it is unhelpful to think of lean as science. The narrative context of lean should be not one of arrogance but humility. Rather than succumbing to the illusions of mastery and control in the interests of appeasing our desire for certainty, we are performing validated learning simply because it is the most effective course of action once we recognise the actual limits of both our control and understanding.

LSSC11 pt1: Constraints and Control

It’s been a couple of weeks since my return from LSSC11 in Long Beach, California. My congratulations to David Anderson and the team for putting on a great conference. I was particularly impressed by the diversity of content and choice of keynote speakers. I’m sure the continued adoption of such an outward-looking perspective will keep the lean community a fertile breeding ground for new ideas. For me, personal highlights of the week included an enthusiastic and illuminating conversation about risk management with Robert Charette, meeting Masa Maeda who is obviously both very highly intelligent and a top bloke (thanks a lot for the blues recommendation btw!), a discussion about risk with Don Reinertsen after his excellent Cost of Delay talk (more on this in another post) and catching up with a number of people I haven’t seen for too long.

I think the talk I found most thought-provoking was given by David Snowden. I’d read a couple of articles about the Cynefin framework following a pointer from Steve Freeman and Suren Samarchyan a couple of years ago, but I’d never heard him speak before, and the prospect of complexity science and evolutionary theory making it into the keynote of a major IT conference had me excited to say the least. Overall he presented lots of great content, but by the end – and to my surprise – I was left with a slight niggling ‘code smell’ type feeling, something I’d also experienced in a couple of the Systems Engineering talks on the previous day. Reflecting on this during his presentation, I realised the cause was essentially two concerns:

1.) The lack of acknowledgment that often we have little or no control over the constraints in a complex adaptive system
2.) The extent to which it’s meaningful to think about lean product development as being ‘natural science’

The first of these will be the subject of the rest of this post. The second I will discuss in my next post.

Constraints in Complex Adaptive Systems

For any agent acting within a complex system – for example, an organisation developing software products and competing in a marketplace – the constraints of that system can be divided into the following core categories:

a.) Constraints they can control. These include software design, development and release practices, financial reward structures, staff development policies, amongst others.
b.) Constraints they can influence. These include how a customer perceives your product compared to the products of your competitors, and the trends and fashions which drive product consumption.
c.) Constraints they can do nothing about and must accept. These include competitor activities, legal constraints, economic climate, and more typical market risks such as exchange rates, interest rates and commodity prices.

Each type of constraint then requires a specific management approach, as follows.

a.) ‘Control’ constraints: this is the domain of organisational and capability maturity models.
b.) ‘Influence’ constraints: this is the domain of marketing and lobbying (both internal/political and external/advertising): for example, one of the most effective growth strategies is to promote ideas which cast the market-differentiating characteristics of your competitors in a negative rather than positive light. However, such techniques are only reliable to the extent that no-one else is using them, and they carry risk because in those circumstances they foster an illusion of control. Once competitors adopt similar techniques that illusion is revealed, and influence is lost until more sophisticated strategies are devised.
c.) ‘Accept’ constraints: this is the domain of risk management and resilient program management practices.

If an organisation mis-categorises the constraints within which it operates, the consequences can be terminal. Almost always this happens because of the illusion of, or desire for, control, where constraints are mistakenly placed in the first category. The illusion of control is commonly created when complex adaptive systems are in temporary equilibrium states, and so behave predictably for limited (and unpredictable) periods of time. Applying control management techniques in such situations is worse than doing nothing at all, as it creates the illusion of due diligence and hides the real levels of risk exposure. Thinking about the recent financial crisis in these terms, it can be seen as the misapplication of the mathematics of natural science (e.g. Brownian motion in pricing models) in an attempt to manage capital market systems constraints that were actually outside the domain of full human control.

A key to this is something I have blogged about previously: failure handling. Snowden discussed a strategy employed by a number of large organisations whereby they preferentially select delivery teams from previous failed projects, as the lessons those teams bring to the new project increase its chance of success. He likened this approach to treating exception handling as an integral part of software design. This idea was a recurring theme across the conference, to the point where a number of people were tweeting suggestions for a ‘Failure’ track at the next LSSC. However I don’t buy this argument, and in his closing remarks I was pleased to hear David Anderson announce that it won’t be happening. The reason I don’t buy it goes back to the software design metaphor. If your application was brought down by an exception, then clearly failure handling was not treated as a first order concern in the design process. Similarly, if resilient failure handling had been designed into your program management process, then rather than your project failing you should have conducted the necessary risk mitigation or performed a product pivot.

In terms of the categories described above, failure related to the first type of constraint generally indicates an immature delivery capability: it is failure that was avoidable. On the other hand, failure due to the third type of constraint indicates a lack of understanding of resilience and the principles that Eric Ries and others have been promoting for a number of years now: it is failure that was absolutely inevitable. Neither of these are characteristics I would want to bring to a new project. Failures due to the second type of constraint are arguably more interesting, but for me they quickly descend into subject areas that I find ethically questionable. Finally, and of most interest, are failures due to constraints unique to the particular business context. The uniqueness of the constraint means that the failure is necessary, and not just an attempt to find a solution to an already solved problem. In this case, the failure becomes part of the organisation’s learning cycle and a source of highly valuable insight going forwards. However, even here it could be argued that such lessons should be obtainable via effective risk management processes rather than requiring full-blown project failure.

I had a brief chat with David Snowden after his talk, and regarding the extent to which systems constraints can be controlled he gave the slightly disappointing answer that ‘C-level people get what I’m talking about, it’s only middle managers who don’t understand it.’ Whilst that may or may not be true, afterwards it put me in mind of Benjamin Mitchell’s excellent presentation about Kanban and management fads. I think a large part of the history of management fads reduces down to the exploitation of CxO denial regarding a.) the actual limitations of their control and b.) the difficulty and cost of high quality product development. I think a key task this community can perform to help prevent Kanban going the same way is to stay faithful to ‘brutal reality’, to borrow Chet Richards’ fantastic phrase, by remaining transparent and honest about the true limits of achievable control.

Optimal Exercise Point

Although I have been making use of them over the last 18 months in various presentations and the real options tutorial, I recently realised I’d omitted to publish graphs illustrating the optimal exercise point for real options on this blog. As a result, here they are:

The dotted blue line represents risk entailed by the agilist’s Cone of Uncertainty, and covers technical, user experience, operational, market and project risk – in short anything that could jeopardise your return on investment. The way this curve is negotiated is detailed below, by driving out mitigation spikes to address each risk.

The dotted brown line represents risk from late delivery. At a certain point this will start rising at a rate greater than the rate at which your mitigation spikes are reducing risk, creating a minimum in the Cumulative Risk curve denoted in red. Remembering that

Feature Value = ((1 - Risk) x Generated Value) - Cost

this minimum identifies the Optimal Exercise Point.
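
As a toy numeric sketch of how that minimum falls out (every figure below is invented): tabulate the falling Cone of Uncertainty risk and the rising delay risk week by week, sum them to get the cumulative risk, and the optimal exercise point is simply the week where that sum bottoms out.

    # Toy illustration (all numbers invented): uncertainty risk falls as
    # mitigation spikes are driven out, delay risk rises over time, and the
    # optimal exercise point is the week where their sum is at its minimum.

    uncertainty_risk = [0.60, 0.45, 0.32, 0.22, 0.15, 0.10, 0.07]  # falling
    delay_risk       = [0.02, 0.04, 0.07, 0.12, 0.20, 0.33, 0.50]  # rising

    cumulative_risk = [u + d for u, d in zip(uncertainty_risk, delay_risk)]
    optimal_week = min(range(len(cumulative_risk)), key=lambda w: cumulative_risk[w])

    generated_value, cost = 100_000, 30_000
    feature_value = (1 - cumulative_risk[optimal_week]) * generated_value - cost

    print(f"Optimal exercise point: week {optimal_week}")        # week 3 here
    print(f"Feature value at that point: {feature_value:,.0f}")  # (1 - 0.34) * 100k - 30k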

One point worth exploring further is why Delayed Delivery is represented as risk rather than cost. The reason is that Cost of Delay is harder to model. For example, let’s say we are developing a new market-differentiating feature for a product. Let’s also say that there are X potential new customers for that feature in our target market, and that it would take Y weeks of marketing to convert 80% of those sales. Provided that, whenever we do launch, there remain Y weeks until a competitor launches a similar feature, the cost of delay may be marginal. On the other hand, if we delay too long and a competitor launches their feature before us then there will be a massive spike in cost due to the loss of first mover advantage. However the timing of that cost spike will be unknown (unless we are spying on the competition), and therefore very hard to capture. What we can model, though, is the increasing risk of that spike biting us the longer we delay.

I have found it very helpful to think about this using the insightful analysis David Anderson presented during his excellent risk management presentation at Agile 2009 last year. He divides features into four categories:

  • Differentiator: drive customer choice and profits
  • Spoiler: spoil a competitor’s differentiators
  • Cost Reducer: reduce cost to produce, maintain or service and increase margin
  • Table Stakes: “must have” commodities

Differentiators (whether on feature or cost) are what drive revenue generation. Spoilers are the features we need to implement to prevent loss of existing customers to a competitor, and are therefore more about revenue protection. And Table Stakes are the commodities we need to have an acceptable product at all. We can see how this maps clearly onto the example above. The cost of delay spike is incurred at the point when the feature we intended to be a differentiator becomes in fact only a spoiler.

This also has a nice symmetry with the meme lifecycle in product S-curve terms.

We can see how features start out as differentiators, become spoilers, then table stakes and are finally irrelevant – which ties closely to the Diversity, Dominance and Zombie phases. There are consequences here in terms of market maturity. If you are launching a product into a new market then everything you do is differentiating (as no-one else is doing anything). Over time, competitors join you, your differentiators become their spoilers (as they play catch-up), and then finally they end up as the table stakes features for anyone wishing to launch a rival product. In other words, the value of table stakes features in mature markets is that they represent a barrier to entry.

More recently however, I have come to realise that these categories are as much about your customers as your product. They are a function of how a particular segment of your target market views a given feature. Google Docs is just one example that demonstrates how one person’s table stakes can be another person’s bloatware. Similarly, despite the vast profusion of text editors these days, I still used PFE until fairly recently because it did all the things I needed and did them really well. Its differentiators were the way it implemented my functional table stakes. The same is true of any product that excels primarily in terms of its user experience, most obviously the iPod or iPhone. Marketing clearly also plays a large part in convincing/coercing a market into believing what constitutes the table stakes for any new product, as witnessed in the era of Office software prior to Google Docs. The variation in the 20% of features used by 80% of users actually turned out not to be so great after all.

So what does this mean for anyone looking to launch a product into a mature market? Firstly, segment your target audience as accurately as possible. Then select the segment which requires the smallest number of table stakes features. Add the differentiator they most value, get the thing out the door as quickly as possible, and bootstrap from there.

Diving Into Large Scale Agile

The phrase “Agile in the Large” is one I’ve heard used a number of times over the last year in discussions about scaling up agile delivery. I have to say that I’m not a fan, primarily because it entails some pretty significant ambiguity. That ambiguity arises from the implied question: Agile What in the Large? So far I have encountered two flavours of answer:

1.) Agile Practices in the Large
This is the common flavour. It involves the deployment of some kind of overarching programme container, e.g. RUP, which is basically used to facilitate the concurrent rollout of standard (or more often advanced) agile development practices.

2.) Agile Principles in the Large
This is the less common, but I believe much more valuable, flavour. It involves taking the principles for managing complexity that have been proven over the last ten years within the domain of software delivery and re-applying them to manage complexity in wider domains, in particular the generation of return from technology investment. That means:

  • No more Big Upfront Design: putting an end to fixed five year plans and big-spend technology programmes, and instead adopting an incremental approach to both budgeting and investment (or even better, inspirationally recognising that budgeting is a form of waste and doing without it altogether – thanks to Dan North for the pointer)
  • Incremental Delivery: in order to ensure investment liability (i.e. code that has yet to ship) is continually minimised
  • Frequent, Rapid Feedback: treating analytics integration, A/B testing capabilities, instrumentation and alerting as a first order design concern (a brief sketch of what this might look like follows this list)
  • Retrospectives and Adaptation: a test-and-learn product management culture aligned with an iterative, evolutionary approach to commercial and technical strategy
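
By way of illustration of the third point above (the names are hypothetical, and this is a sketch rather than a recommendation of any particular tooling): instrumentation and experiment assignment are designed into the feature call path from the outset, rather than being retrofitted once the product has shipped.

    import hashlib
    import logging
    import time
    from functools import wraps

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("product_metrics")

    def assign_variant(user_id: str, experiment: str) -> str:
        """Deterministic A/B assignment so a user always sees the same variant."""
        digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
        return "B" if digest[0] % 2 else "A"

    def instrumented(feature_name: str):
        """Decorator treating timing, logging and variant tagging as first order concerns."""
        def decorator(func):
            @wraps(func)
            def wrapper(user_id: str, *args, **kwargs):
                variant = assign_variant(user_id, feature_name)
                start = time.perf_counter()
                result = func(user_id, *args, variant=variant, **kwargs)
                elapsed_ms = (time.perf_counter() - start) * 1000
                log.info("feature=%s user=%s variant=%s duration_ms=%.1f",
                         feature_name, user_id, variant, elapsed_ms)
                return result
            return wrapper
        return decorator

    @instrumented("search_ranking")
    def search(user_id: str, query: str, variant: str = "A") -> list[str]:
        return [f"result for '{query}' ({variant} ranking)"]

    print(search("user-42", "lean product development"))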

When it comes down to it, it seems to me that deploying cutting-edge agile development practices without addressing the associated complexities of the wider business context is really just showing off. It makes me think back to being ten years old and that kid at the swimming pool who was always getting told off by his parents: “Yes Johnny, I know you can do a double piked backflip, but forget all that for now – all I need you to do is enter the water without belly-flopping and emptying half the pool”.