Don't fall for the naive 'always'
Avoiding waste with data and AI by knowing what isn't always true
Many interesting things happen in the offline world, i.e. the world outside the cloud and the web browser. And many of these ‘majority offline’ industries are now stepping up to the starting blocks of data and AI value creation, as digitisation of key workflows occurs and/or new market regulation requires quantification and optimisation of core processes or auxiliary outputs.
So what can those of us from areas that have invested in data and AI for many years share that will be really useful to those who are just beginning?
I’ve been thinking about this a fair bit recently and so, true to my intent with this blog, I sat down to see if I could chase that thinking onto a page and into a format useful for myself and others. After some pondering, a first key theme has emerged: beware of assuming an ‘always’.
Heavily digitised industries with decades of experience using data to create value (and hence decades of mistakes, wrong turnings and sunk costs!) know what doesn’t work but predominantly talk and publish about the times when things go right. From the outside looking in, it’s easy to absorb only the broad brush strokes of paths to value and miss the ‘except whens’.
Data is not always a valuable asset
While data might be a liability worth holding onto for some time against future opportunities, it’s a mistake to think it has inherent value and is hence something to be hoarded across the board. Why a liability? Because at a minimum any data you hold needs to be stored, to be backed up, to have auditable access control and to be kept safe from cyber attacks.
Watching from the periphery of data-intensive industries over the past decade, it would have been easy to have heard the message ‘storage is cheap now’ but to have missed the later trend of ‘so we will generate and store WAY WAY more data than we did before, to the point that the bills mount up again’.
How do you figure out whether data ‘X’ is valuable enough to retain? First you need an estimate of the cost to retain it. I would start by drafting a data retention framework that covers, at a minimum:
categorisation of data into sensitivity levels with enough articulation of those levels so folk can map both data that exists and data that is touted for collection to the appropriate level
retention durations for each of those sensitivity levels (that are compliant with regulation if regulated and compliant with clearly and publicly articulated periods if not)
storage options for both operational and analytically focussed data with links to current costing for each option
(and of course, a timetable and process for updating said framework)
That gives you a baseline for at least a guesstimate of the raw storage costs of holding on to whatever data you might be considering stockpiling. Not a total cost of ownership, of course, but multipliers for personnel and overhead costs can get you to one at a guesstimate level.
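To make that concrete, here is a minimal sketch of the guesstimate in Python. The sensitivity levels, prices and multiplier are all hypothetical placeholders, not real cloud pricing.

```python
# A minimal sketch of a retention-cost guesstimate. All figures and names
# here are hypothetical whiteboard inputs, not real cloud pricing.

# Hypothetical framework: sensitivity level -> retention period and unit cost
RETENTION_FRAMEWORK = {
    "public":       {"retention_years": 1, "usd_per_gb_month": 0.023},
    "internal":     {"retention_years": 3, "usd_per_gb_month": 0.023},
    "confidential": {"retention_years": 7, "usd_per_gb_month": 0.045},
}

# Crude multiplier for personnel, backup and overhead on top of raw storage.
TCO_MULTIPLIER = 3.0

def retention_cost_usd(size_gb: float, sensitivity: str) -> float:
    """Guesstimate the total cost of holding a dataset to end of retention."""
    level = RETENTION_FRAMEWORK[sensitivity]
    months = level["retention_years"] * 12
    raw_storage = size_gb * level["usd_per_gb_month"] * months
    return raw_storage * TCO_MULTIPLIER

# e.g. 50 TB of confidential data held for its full retention period
print(f"${retention_cost_usd(50_000, 'confidential'):,.0f}")  # $567,000
```

Crude, certainly, but even a three-line function like this turns ‘storage is cheap’ into a number someone has to defend.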
Now the liability label is more self-evident: that data is something you must pay money, month after month, just to hang on to.
How substantial are the upside benefits of the data you are thinking about storing, be that through lower support costs, cross-sales, targeted marketing, lower churn via stickier products, data sharing partnerships or actual new revenue streams via value-added and hence higher-priced products?
Sure it’s all guesstimates and whiteboard estimates but it’s better than storing everything, sleeping through your next 36 months of inexorably increasing cloud storage bills and making the front page of the newspaper for all the wrong reasons when the hackers look your way. All before the value you were contemplating makes its way off the backlog.
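Here’s a whiteboard-level sketch of that backlog trap: the bills start now, the value only flows once the work ships. All the numbers and function names are hypothetical inputs, not a real model.

```python
# A minimal sketch of the backlog trap: storage costs accrue from month
# zero, but the contemplated value only arrives once the work ships.
# Every number here is a hypothetical guesstimate.

def net_value_usd(monthly_holding_cost: float,
                  monthly_upside: float,
                  months_until_shipped: int,
                  horizon_months: int = 36) -> float:
    """Net position over the horizon if value only flows after shipping."""
    cost = monthly_holding_cost * horizon_months
    value_months = max(0, horizon_months - months_until_shipped)
    return monthly_upside * value_months - cost

# Ships in 6 months: comfortably positive...
print(net_value_usd(5_000, 20_000, months_until_shipped=6))   # 420000.0
# ...stuck on the backlog for 30 months: underwater.
print(net_value_usd(5_000, 20_000, months_until_shipped=30))  # -60000.0
```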
More data will not always be better than less
While I’m on record as currently thinking that value from Generative AI will mostly accrue to the major players over the next 5 years, that’s not because they hold more data per se. It’s because they hold a lot of workflow and because they have the economic power to run immature and expensive technologies at a loss for many years through cross-subsidisation.
More than once in my career to date, I’ve expected to find that a data/AI based product available from a (much) larger player would be superior to something built in house by a highly skilled team with access to orders of magnitude less data. And more than once, in a genuine bakeoff, been proven wrong.
The uniqueness of the data that a smaller player has access to can be a hugely important factor, more than offsetting a much smaller quantity of data. This is particularly true when the data includes high quality, human-generated labels. The new areas of possibility opened up by Gen AI in the last 18 months notwithstanding, today the majority of economic value from AI is derived from supervised learning, and supervised learning needs labels. Yes, you can generate them, but the simplest and highest quality labels will generally come from user actions.
So pay careful attention to the data you may have, or may be able to generate, that is not widely available, particularly if you have access to the actions and/or decisions of humans at substantial scale. Commonly used but still super evocative examples are the worker sorting fruit, the farmer spraying weeds or the school teacher correcting homework.
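As a minimal sketch of how those actions become labels, consider the weed sprayer. The event log format below is hypothetical; the point is that every human decision (spray / don’t spray) arrives as a free, high quality label.

```python
# A minimal sketch of harvesting supervised labels from user actions.
# The event format is hypothetical; each logged human decision becomes
# a labelled training row at zero extra labelling cost.

from dataclasses import dataclass

@dataclass
class SprayEvent:
    image_path: str   # frame captured by the sprayer's camera
    sprayed: bool     # the operator's actual decision in the field

def events_to_training_rows(events: list[SprayEvent]) -> list[tuple[str, int]]:
    """Map each logged human decision to a (features, label) training row."""
    return [(e.image_path, int(e.sprayed)) for e in events]

log = [
    SprayEvent("frames/0001.jpg", sprayed=True),   # weed present
    SprayEvent("frames/0002.jpg", sprayed=False),  # crop only
]
print(events_to_training_rows(log))
# [('frames/0001.jpg', 1), ('frames/0002.jpg', 0)]
```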
Standardising data is not always worth it
Don’t get me wrong: understandable data in a well thought through schema, designed to accommodate many usage patterns and shared between players and across workflows, is a thing of beauty which could be the basis of substantial economic value. But that’s definitely a could, not a will, and even the could does not come cheap.
I’m most of the way towards believing that if the data is deemed not worth standardising on the front end (i.e. at the point of generation), it may well not be worth putting in the work after the fact to massage it into conformity. Usefully operational single customer views remain, in my experience, about as rare as unicorns, and data engineering team bloat is driven in large part by the endless hamster wheel of fixing data feeds broken by ‘non-compliant’ upstream changes, flagged late or not at all.
Software engineers (i.e. those who tend to work at the generation end of much interesting data) are a lot more data literate than they were 10 years ago, and for industries that are just beginning their data journey, I’d be taking a really hard look at ‘fixing’ the data at the point of generation to be ‘standard’. Or agreeing as a business that the economic benefit really isn’t worth the work and we’ll make do with business silos and multiple definitions of metrics. Nothing wrong with that if done mindfully.
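As a minimal sketch of what ‘fixing at the point of generation’ can mean in practice: validate events against an agreed schema before they ever hit storage, so a non-compliant upstream change fails loudly at the source rather than weeks later in a pipeline. The schema and field names here are hypothetical.

```python
# A minimal sketch of standardising at the point of generation: check
# events against an agreed schema before they ever hit storage, instead
# of massaging them into conformity later. Field names are hypothetical.

EVENT_SCHEMA = {
    "customer_id": str,
    "event_type": str,
    "amount_cents": int,   # standardise on integer cents, not float dollars
}

def validate_event(event: dict) -> dict:
    """Fail fast at generation time rather than in a downstream pipeline."""
    for field, expected_type in EVENT_SCHEMA.items():
        if field not in event:
            raise ValueError(f"missing required field: {field}")
        if not isinstance(event[field], expected_type):
            raise TypeError(f"{field} must be {expected_type.__name__}")
    return event

# A compliant event passes straight through...
validate_event({"customer_id": "c42", "event_type": "purchase", "amount_cents": 1999})
# ...while a non-compliant upstream change fails here, not weeks later:
# validate_event({"customer_id": "c42", "event_type": "purchase", "amount_cents": 19.99})
```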
I’d be particularly thoughtful about this if you operate in an organisation that acquires new businesses. Integration backlogs can quickly stretch into decades for an acquisition that is still an experiment. Sometimes coping with the data and metrics chaos and being a portfolio company, not a wannabe platform company, is the right call.
Till next time
For those in the exciting position of working with data in an industry that is just waking up to the possibilities, I hope the above is helpful food for thought. For those who are old hands with many mistakes under their belts, what other ‘except whens’ would you add?
I’m off to cut grass and generally tidy up the home paddock ahead of actually welcoming a guest to the farm on Sunday. What fun. Take care and I hope you get to be somewhere relaxing at least for a little while this week.