The New Art of AI Engineering

Why the real impact of AI is yet to come and how to make that happen

Wouter Huygen
May 27, 2020

There is a famous story about how Ford and Mazda applied new information technology entirely differently — one of them successfully. The story takes place in the early nineties, when most corporations had heavily invested in automation technology like SAP. Ford and Mazda had both implemented ERP systems to improve their procurement processes. Ford owned a minority stake in Mazda and therefore had insights into their operations. At some point, Ford noticed that Mazda employed only about 5 people to run procurement whereas Ford had over 500. How could both companies run the same process, for the same type of business, with the same new technology, but only one of them was reaping the benefits? And what does this have to do with AI?

There is a common set of challenges to technology adoption that is repeated throughout history. Understanding how these challenges pan out for a specific new technology helps to devise successful change approaches. The ICT revolution that peaked throughout the 80s and 90s led to a wave of new business practices, such as Six Sigma and Business Process Re-engineering. These were complementary skills and activities needed to embed new automation systems and make them productive.

The same is true for AI. Organizations will need to develop a new set of engineering capabilities to adopt and diffuse AI at scale. These engineering capabilities break down into three key areas. The first is to re-engineer your business model and business processes to become “AI-first”. The second is to develop end-to-end AI solutions and integrate them into an existing systems landscape. The third is to engineer AI platform capabilities that make these solutions sustainable and able to run at scale. Together, we aptly call this the New Art of AI engineering. This article explains why and how.

AI is in vogue and numerous executive surveys about the progress of AI paint a very consistent picture. Firstly, there is widespread belief in the opportunities presented by AI. Second, there is a rising fear about the disruptive risks — someone else might get there first. And thirdly, the gap to impact remains: despite ample investments, many companies struggle with AI adoption and don’t (yet) see the value from their initiatives.

Indeed, deploying AI at scale is hard. It would be cliché to say there is no silver bullet. However, there is a set of principles that we can draw from successful leaders in AI. We will explore them through the following key questions:

  • Why is deploying AI so difficult? What are the specific challenges? Where do organizations typically get stuck?
  • What can we learn from the few front runners that have managed to lead with AI? What are their keys to success?
  • How can we bottle the magic? How can we replicate proven success factors through a structured, repeatable formula?
  • And once we get this formula right, what does success look like? What are the new forms of value it can bring for established companies, and how to bring this about?

Exploring these questions will give you a frame of reference to think about the AI transformation of your organization. This will help you apply AI more successfully, but also inspire you to think about the real potential AI brings — which might be different from your current focus for AI.

1. Adopting AI at scale is a transformation challenge

The AI Productivity Paradox

When Ford introduced their ERP, they did not change their processes accordingly. Procurement orders were still manually checked against delivery notes, which subsequently were checked against invoices before payments would be made. A very laborious process, prone to errors. After re-engineering their processes with BPR, they started using the ERP as it was intended. A central database now contained all orders, which could be updated with new information by different departments. When goods were received, the delivering company would be paid automatically. No more invoices and no more back-and-forth messages between departments. Once they figured out how their ERP system could support a new way of working and organizing, Ford realized a 75% reduction in headcount.

Similarly, the advent of the internet did not turn existing companies into full-fledged e-commerce players overnight. There are many examples of brick-and-mortar retailers that failed to make the switch online. E-commerce is not the same as online retailing and requires new capabilities to be a success. Such as providing a seamless customer experience with superior service. And using the benefits of digital shelf and advertising space to test different offerings (A/B testing). Booking.com runs thousands of such experiments per day to continuously optimize their website. And so on.

The moral of these stories is that organizational rewiring, re-engineering or even re-imagination is needed before new technologies can bear fruit. This kind of transformation takes time. For the ICT age, this became known as the IT productivity paradox, named after the economist Robert Solow, who famously noted that “you can see the computer age everywhere but in the productivity statistics”. The paradox refers to the discrepancy between measures of investment in information technology and measures of output at the national level. Erik Brynjolfsson, professor at MIT Sloan, has re-popularized this idea and extended it to AI. Although different factors are cited for the observed discrepancies between investments and productivity gains, Brynjolfsson points to implementation lag as the dominant factor for why AI is not (yet) driving observable economic growth at large. Most organizations are on a learning curve about how and where to apply AI best.

“Any new technology tends to go through a 25-year adoption cycle”, Marc Andreessen

That being said, the leaders in AI are showing the art of the possible. Amazon and Google have AI ingrained in almost every core process. Google has an internal mantra that has gone from “mobile first” (referring to the focus on the internet user experience of their services on smartphones) to “machine learning first”. Focus on AI-powered automation is part of their DNA and core to everything they do. These are the obvious examples of the digital natives. In most industry sectors there are leading incumbents and start-ups following suit. But even though 90% of executives indicate they are investing in AI, only about 40% have seen results to date [1].

“The future is already here — it is just not evenly distributed”, William Gibson

To examine the barriers to impact, let us look at the different ways in which most companies approach AI. Based on our experience, there are two major dimensions to AI transformation in which companies can get stuck. Let’s look at both in turn.

Dimension 1: developing and implementing AI solutions

Clearly AI is not plug-and-play technology. Rather, it both drives and requires a new level of technology-enabled business innovation. Organizations that have embarked on AI initiatives encounter a common set of barriers to be addressed as they progress with AI solutions. We distinguish five different levels of maturity in the development and implementation of AI solutions. Companies that take AI development seriously typically use a funnel with similar stage gates that initiatives go through. Lasting impact is achieved only by the initiatives that go all the way. However, the success rate of these initiatives is not merely dependent on their inherent potential and feasibility. Rather, it depends on the organization’s ability to do this — a new organizational capability.

  • Level 1: Vision & strategy defined. Most companies have defined some sort of vision and strategy for AI. Much has been said about this topic, such as the imperative to integrate AI into your overall strategy and not treat it as separate. Set your AI priorities strategy-backwards instead of developing your strategy AI-forward. It is crucial to have a narrative for how AI will empower your business, and articulate a vision of what it will mean for your organization in the long run. This vision is subsequently translated into priority areas for AI application development. Very few companies get stuck here.
  • Level 2: Proof of concept developed. The next step is to prioritize a few opportunity areas for PoC development, commonly based on feasibility versus potential value creation. Get ideas flowing, gather data, and develop a prediction model for the opportunity at hand. New insights are created and people are getting excited. Perhaps the results are being piloted through batch initiatives by the business. This is a relatively comfortable place to be and also deceptively easy to do. The risk is that it stops here. There is a huge gap between having a model developed in a sandbox as PoC, and having it running in a production environment ready to drive a business process.
  • Level 3: Production ready. Getting to this level is a crucial step for sustainable impact. It means having your algorithm running in a fully automated pipeline, either in batch or real time. Data flows are automated and predictions are generated with the right intervals. Model outputs feed into a front-end application or dashboard to aid human decision making.
  • Level 4: Embedded in work flows. Operationalized solutions are embedded in key processes to augment or fully automate decisions. In case of full automation, humans are taken out of the loop and the predictions directly feed into other systems. These machine learning pipelines essentially manifest as product features. Examples can be product recommendations in e-commerce, ETA prediction in Uber, or content recommendations by Netflix.
    In the case of augmentation, the solution typically provides machine learning based recommendations through a user interface. Users can accept the results, or make adjustments when deemed necessary. In particular this latter form of augmented intelligence suffers from huge adoption challenges. While full automation presents a binary change in processes — either you do it or you don’t — augmentation is prone to user bypasses and retreat to old ways of working.
  • Level 5: Scaled and continuously optimized. The greatest fallacy in AI solution development is to think that at some point you are done. AI systems are never done. They have to be scaled and optimized to reach maximum impact. They have to be maintained to keep their level of performance. They have to be improved and expanded. That is not to say that all or some of these level-5 imperatives cannot be automated. But in particular for core processes, the algorithms require ownership that takes care of them. Sure, there are examples of out-of-the box applications that you can run with. But those are table-stakes. The AI solutions that will drive your business model and provide a competitive edge will have to be nurtured, grown and innovated continuously.

Regardless of innate potential, viable AI initiatives commonly get stuck before they reach their summit (level 5). In summary, the biggest challenge to AI solution development and implementation is the last mile problem of AI. Analogous to the logistics challenge of making home deliveries profitable, the last mile in AI is getting a solution operational and really embedded in processes. The last mile can be half of the total effort.

Dimension 2: building enterprise AI capabilities

The second dimension of AI transformation deals with the technology infrastructure that allows AI solutions to be developed at scale across an organization. Here again, we observe different maturity levels typically encountered.

  • Level 1: Legacy and fragmentation. This will be the starting point for most incumbent players. Data sits in different silos, on different data integration platforms. Collecting and integrating new data can be challenging, either due to cumbersome governance and integration practices or incompatible data definitions at source systems. Models are mostly developed in sandbox-like environments that are suitable for development but lack flexible and scalable production pipelines. End-points in the form of APIs or events to a real-time platform are considered exotic or non-existent. Models are typically outdated and true out-of-sample performance is unknown (i.e. only theoretical performance is looked at, not performance in actual operation).
  • Level 2: Leapfrog innovation. In an effort to accelerate and escape the difficulties inherent to level 1, some organizations choose to leapfrog, in a well-managed and thoughtful way, to modern infrastructure and architecture — mostly cloud based (AWS, MS Azure, Google Cloud Platform). This can be the result of an innovation or experimentation effort on behalf of IT, or driven by the business to develop a high-value use case. Done well, this can be a prelude to the next level.
  • Level 3: Platform standards. Companies that aim to really diffuse and scale AI across their enterprise will eventually need a coherent platform of uniform AI capabilities. Such a platform contains many best-of-breed components and tools that give users enough flexibility to build solutions for different requirements, while also maintaining certain standards that can be managed centrally. More on this in section 2.

One might argue there is a level 4, in which an ideal state is reached where all legacy architecture is migrated to a perfect model. Let’s call this the Utopian AI Enterprise Architecture, and leave it at that.

The crux: the two dimensions are usually not in sync

Both dimensions are usually driven by different parts of the organization and have partially different objectives, or at least different perspectives. For organizations starting out on their AI journey, that typically means one of the two dimensions is overweighted, depending on where the AI initiative originated: IT, business, or some other function (see figure 1). Or worse, they both run their own course independently. It’s like a rowboat with each oar operated by a separate individual; the two are not even aware of each other, yet both are trying to row the boat to the best of their abilities. It will either send the boat turning in circles, or get it moving without clear direction. They may even perceive themselves to be working in opposite directions, not realizing that synchronized movement would accelerate them both.

Figure 1. AI transformation paths.
  • When IT is leading the charge, you can end up with strongly governed cloud platforms but few examples of high impact AI solutions actually under development and in production (3b in figure 1). Although this is not a bad situation to be in as it provides a future-proof starting point, it does require business-led teams with multi-disciplinary skills in data science, data and software engineering to design and develop working AI solutions.
  • On the other extreme, organizations can start out developing AI solutions from a legacy infrastructure starting point (1a). Strong business leaders that believe in the value of AI mobilize the right resources and mandate to start an initiative. While the pressure is on, such a rush to the front can create a lot of excitement and momentum for change. In most cases however, the aspired impact is not sustainable for long because production-grade infrastructure is lacking. That in turn makes it hard to bring solutions to a level of maturity that is required for convincing business adoption. In theory, if all required tooling is in place and can be stitched together, a single use case could travel the last mile (1e). However, in our experience this is rare and getting to production level (1c) is already an achievement.

Next, let’s examine what the optimal path of AI transformation looks like and how to get the rowboat moving to the upper right corner of figure 1.

2. The new art of AI engineering

“Specifically, the most impressive capabilities of AI — those based on machine learning — have not yet diffused widely. More importantly, like other general purpose technologies (GPT), their full effects won’t be realized until waves of complementary innovations are developed and implemented.” E. Brynjolfsson et al

AI as GPT requires new engineering discipline

This section will introduce a new paradigm for AI transformation. As mentioned earlier, the information technology wave of the previous century required complementary practices like BPR and Six Sigma to weave automation technology (ERPs) into organizational fabrics. Diffusing AI requires a similar approach to organizational change to make it truly “general purpose” throughout an entire company. It requires a holistic, highly integrated, multidisciplinary engineering approach which we call AI engineering. The practice of AI engineering consists of 3 key pillars of capabilities that have to work in sync:

  • AI Business Engineering. Re-imagining your business processes, business model, and changing your way of working.
  • AI Solution Engineering. Developing and implementing AI solutions to drive those new processes.
  • AI Platform Engineering. Building enterprise AI platform capabilities to sustainably build AI solutions at scale.
Figure 2. The New Art of AI Engineering

AI Business Engineering

When it comes to the impact of AI on business and society, Solow’s paradox springs back to mind. On the one hand, we are already seeing AI being widely adopted. Plentiful use case examples have been written about and we experience the benefits of AI-powered personalization every day (e.g. Netflix, Spotify, etc.). On the other hand, we hear claims that we are only at the beginning of the AI revolution. Sundar Pichai, CEO of Google and Alphabet, argued at the World Economic Forum in early 2020 that AI will ultimately have a greater impact than electricity.

So which is it? Is the future already here, but just unevenly distributed? Or will the real impact of AI look unimaginably different from today?

My prophetic abilities have yet to prove themselves, so I will refrain from definite answers. Moreover, a thorough view on the future of AI — or even of all the different flavors in which AI is already powering business — goes beyond the topic of AI transformation. For the purpose of this article however, I would like to call out 3 horizons of progressive impact. If I have to make a prediction: the real excitement from AI in the near term will not come from “more of the same ML in business”, but from the reconfiguration of value chains it will start to drive — and the resulting disruptive potential.

The first horizon is operational excellence. Across sectors, machine learning applications can be used to optimize and automate operational decision making. In telecom, AI is used to optimize network investment decisions, apply predictive maintenance or detect fraud. In insurance, AI is used to automate claims handling and underwriting. In retail, AI is used to predict stock-outs and optimize supply chains. Oil majors like Shell use machine learning on sensor data to predict equipment failure and apply predictive maintenance — allegedly for potentially hundreds of thousands of pumps across the globe. Optimizing marketing & sales through predictive campaigning and pricing optimization are arguably also examples of “doing the same things, but much more effectively and efficiently”. The upside of applying AI for operational excellence is therefore worthwhile but ultimately limited.

The second horizon of impact is to leverage AI for improved or new products & services. Personalization is a prime example. Stitch Fix uses AI to learn your style and customize your wardrobe. Some use cases which drive operational excellence also have a positive impact on customer experience. For example, straight-through processing of mortgage or other loan applications saves costs, but also dramatically improves the customer experience. Energy companies can develop “energy-as-a-service”, offering predictive, care-free services on your in-home heating system.

Both of these horizons offer lots of growth potential. Many industry experts argue that the application of AI for revenue growth (horizon 2) will gain traction over operational excellence in the near term.

I would like to add a third horizon which I believe will drive most of the value in the mid term: disruptive business model innovation. The key word being disruptive, because some of the well-known examples above also entail new business models. What I’m referring to here is a sort of platformization of AI capabilities. Predictive pricing is such a capability — e.g. pricing of assets such as cars, pricing of risk for insurance or loans, or pricing of car damage. Let’s look at an example from the automotive industry.

CarNext.com is a spin-off of global car leasing company LeasePlan, which uses AI to predict the optimal retail price of second-hand cars. A capability that initially helped to boost the bottom line of remarketing leased cars to consumers now morphs into a full-fledged marketing & sales platform on which other sellers can sell their fleet. CarNext.com is able to maximize value capture by optimizing sell price versus cost-of-stock and leverages this capability to build a platform. An entirely new business, actually.
Platform capabilities can also emerge outside-in, in the form of start-ups that develop AI capabilities which end up being platform-worthy. Let’s stay with the automotive example for a minute. Fixico is a European start-up on a mission to uberize damage repair for car owners. Consumers can upload their repair needs and will receive offers from different repair shops. Basically, they aggregate supply (available repair capacity in the market), bidding down the price to consumers. Since they gather image material from car damage, I expect they will soon be able to predict the damage value based on pictures. They can offer this capability as a service (just an API!) to fleet operators (such as leasing companies) and insurance companies, automating away much of the overhead currently involved in handling car damage. Needless to say, the repair of the damage itself will be business on top.

Just like previous technology waves, AI will not only optimize existing businesses and create entirely new ones, but also lead to a reconfiguration of supply chains. Another well-known example is the banking sector. ING’s CEO Ralph Hamers has repeatedly stated that, given the rise of FinTech, banks have a couple of existential strategic choices to make. One would be to stick to the core of banking: lending/investing money long term, and financing that short term at lower rates. The spread between the two is what makes up the business model. Another choice would be to focus on the customer and create a platform for superior customer service. The back-end (the financing) could even be spun off. The platform uses AI to create seamless personalized service and experience. Third-party providers can tap into such a platform and offer their financial services to the bank’s customer base — all under the safeguarding of the bank’s mother AI-ship. I’m not sure where incumbents will end up, and whether Google might get there first, but it serves to illustrate how AI fundamentally shifts value chains.

This presents disruptive opportunities, but also disruptive risks for those falling behind. The waves of complementary innovation, as cited by Erik Brynjolfsson, in part consist of business innovation. It involves the creativity and imagination to develop new types of business models enabled by AI. Something entirely different from running the current business models slightly better.

Therefore, it pays huge dividends to figure out which AI use case areas are mission critical for your business to develop, versus the table stakes that will merely keep you afloat in the long run. Here are a few pointers that can help guide prioritization:

  • Create sustained focus. Creating real impact from AI is hard enough and requires sustained effort. Too much fragmentation of scarce resources and management attention will dilute progress. When I hear managers boast “We have 50 AI projects under way”, my alarm bells go off. It’s probably true, but chances are none of them will have any impact on transformational scale.
  • Aim for clusters. In many instances, use cases come in clusters around a common data foundation or machine learning problem. Predicting car retail prices for instance is a stepping stone to forecasting their residual value — a key financial metric for any car leasing company.
  • Smartly balance feasibility with impact. A celebrated method of prioritizing any business case is to score impact versus feasibility. Of course, the sweet spot of use cases that score high on both is easy to prioritize. Other than that, it can make sense to have a small portfolio of highly feasible front-runners on horizon 1 to show the organization what is possible, and to create momentum to go after the real big-ticket opportunity areas on horizons 2 and 3.

AI Solution Engineering

Once you have figured out where and how AI can power your business, it is time to develop the accompanying solutions and change. To overcome the barriers presented in section 1, a proven change approach is needed. Many companies working on AI initiatives develop a step-wise approach of some sort, which is honed by experience over time. So have we; the key principles are listed below. It would be overkill to fully lay out all the detailed work steps here. Moreover, the best music does not come from playing all the notes exactly right, but from the artist’s own creative interpretation and execution of the underlying essence. So it is for practitioners leading AI initiatives.

  • Apply an integrated design & build approach. Building the algorithm is typically 10–20% of the effort. The remainder consists of building the solution as a scalable system in robust infrastructure, and redesigning and implementing different work processes. To address the last mile problem, it is crucial to work on all components of a solution in parallel, in an iterative fashion. That is to say, work towards an MVP as soon as possible and improve from there. Develop a template with different development phases and train the organization to adopt one design and development philosophy. For instance start with clearly defining the opportunity objective and value creation hypothesis. Then create a first high-level design of the MVP. Subsequently explore and analyse the data to test hypotheses and build a prototype model. Iterate on the MVP design, validate it, and then build it (data pipelines, models in production, application front-end, process workflow redesign). Once the MVP is operational, it can be further improved, optimized and scaled.
  • Start with the business process. A natural tendency, especially for analytical team members, is to dive right in: gather some data, start doing analysis, develop a first model. Although an experimentation mindset is super valuable, when the task is to build a mission critical AI capability you have to start with the end in mind: what process are we supporting? What decisions are supported by the AI? Will it run fully automated, or will it be augmented decision intelligence where human and machine work together? What feedback do we expect from the results, and how will it be used to train the system? What is the performance threshold the algorithm requires to create value? What edge cases do we expect — situations where algorithms perform poorly — and how do we deal with them? These are all questions to address upfront and incorporate in the MVP design.
  • Adopt agile and cross-functional collaboration. This may be trivial nowadays, but it does constitute a critical success factor. It is paramount to have all stakeholders on the bus from day 1. The business needs to be part of the entire design and implementation journey, to maximize chance of adoption but also to bring essential domain knowledge to the team. Engineers need to be brought in before the engineering work actually starts, to ensure they are involved in all the design choices and prevent unpleasant surprises down the road. An iterative approach creates a lot of flexibility, but there is always a dose of path-dependency down the line from choices made earlier in the process.
  • Invest in superior integrator talent. AI solutions are very multidimensional, requiring involvement and contributions from many different disciplines. Many organizations come a long way having at least some of the required skills, with some subset of people. Successful solution development however requires integrators — super generalists sufficiently knowledgeable across all disciplines (business, modeling, data engineering, process design) combined with strong project and stakeholder management skills. This skill set is more rare than machine learning or data engineering.

Most companies are not shaped for the iterative innovation processes needed for AI. This creates 2 sorts of risks. The first is that the development gets stuck before the last mile is reached (as discussed in section 1). The second is that they discontinue their efforts before they have had time to mature. A push for fast results and quick wins limits the space for data-driven innovation. In many instances, an MVP represents only 20% of the opportunity. Further optimization and scaling is required. This often requires a role shift by the business: the AI takes over part of their previous work steps, for which they need to become the “custodian”. Their task is not only to use the system, but to maintain and improve it continuously. A long term view is needed to create the right conditions for AI to flourish.

AI Platform Engineering

Most organizations starting out on their AI transformation lack the platform infrastructure to really scale AI. The crux is to leverage the requirements of new AI solutions to build capability components step-by-step. This way, both dimensions of change from the previous section are addressed in a mutually reinforcing way. New components and platform services are built with a sense of urgency, because they are immediately required. The platform is expanded through the lens of the primary users: AI developers. New innovations can be proven “locally” and on-boarded to the platform once proven. On the other hand, AI solutions are built with generic infrastructure components. This in turn ensures robustness and maintainability. We have observed organizations take this approach along 2 different paths, as shown in figure 3:

  • Set up a dedicated AI platform team and have them work closely together with multidisciplinary solution teams. The platform is enriched jointly as part of a development effort, building new components along the way. This is route 3b to 3e in figure 3.
  • When such a platform is not yet in the making and you want to accelerate a certain AI solution, leapfrogging is a good alternative approach. Set up a cloud environment, preferably using infrastructure-as-code to be provider-agnostic, and develop your first high-impact AI solution on this environment (2d). Later on, once the first solution is already up and running and creating business impact, this platform can be used as a blueprint for enterprise standards (2d to 3e). The benefit of this approach is minimal time-to-market. The only downside is that you might cut some corners that have to be reworked to fully comply with your future standards (3e).
Figure 3. Solution-driven capability development.

What capabilities are needed?

An AI platform comprises a set of capabilities that form the toolbox to build end-to-end AI solutions uniformly across the organization. The key challenge for many organizations is that they still rely to a great extent on classical BI-oriented infrastructures, which fall short on multiple dimensions in supporting AI at scale.

1. Scalable data architecture for AI production pipelines

Compared to traditional BI, the development and deployment of machine learning models creates new requirements for data architecture along the lifecycle of ML systems:

  • Discovery. Rapid access to new raw data sources for exploring and designing new use cases, e.g. in a sandbox environment on a data lake. Here, speed and flexibility are most important to enable innovation.
  • Development. Generic data sets that are re-used often across different use cases, combined with use-case specific data.
  • Production. Building solution-specific data pipelines. The productionization of algorithms requires at least 2 different data pipelines: one for (automated) model retraining, and one for model scoring (inference).
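
To make the distinction between these two pipelines concrete, below is a minimal sketch in Python with pandas and scikit-learn. The churn example, file paths and column names are hypothetical placeholders; in practice both pipelines would run fully automated under an orchestrator rather than being called by hand.

    # Sketch of the two production pipelines behind a single AI solution.
    # Assumes pandas and scikit-learn; data locations and columns are hypothetical.
    import joblib
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier

    MODEL_PATH = "models/churn_model.joblib"  # hypothetical model artifact location

    def retraining_pipeline(training_data_path: str) -> None:
        """Pipeline 1: retrain the model on fresh, labelled data (periodic or trigger-based)."""
        df = pd.read_csv(training_data_path)
        X, y = df.drop(columns=["churned"]), df["churned"]
        model = GradientBoostingClassifier()
        model.fit(X, y)
        joblib.dump(model, MODEL_PATH)  # publish the new model version

    def scoring_pipeline(scoring_data_path: str) -> pd.DataFrame:
        """Pipeline 2: load the latest model and score new, unlabelled records (inference)."""
        df = pd.read_csv(scoring_data_path)  # assumed to contain the same feature columns used in training
        model = joblib.load(MODEL_PATH)
        df["churn_probability"] = model.predict_proba(df)[:, 1]
        return df  # downstream systems or dashboards consume these predictions

    # In production an orchestrator would run each on its own cadence, for example:
    # retraining_pipeline("data/curated/churn_training.csv")
    # scoring_pipeline("data/curated/customers_to_score.csv")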

In particular the production route creates new challenges for many organizations still operating from a legacy BI architecture, because the default for production is to run data through a data warehouse. However, AI production pipelines have additional requirements which are often not met by classical BI data architecture:

  • Scalability & speed. AI requires easy access to a wide range of data for discovery and development, combined with fast development of production pipelines for solution-specific data. Classical data warehouses satisfy neither of these requirements. Data warehouses are built as generic, re-usable building blocks. That is great for some very common data objects, but not scalable enough for supporting the ever growing amount of solution-specific pipeline data. To enable scalable development, a more flexible data architecture is needed that combines the best of data warehouse structures with other routes.
  • Bi-directional. Traditional BI only consumes data. The output is a dashboard with insights and metrics. AI systems however require 2-way interaction with other systems. Model outputs are fed to other systems for operational decisions. The feedback and results of those processes are in turn fed back to the algorithm to evaluate and improve performance. This implies that the classical separation of operational and analytical data architectures starts to break down and merge. An AI system is not only a data consumer (like BI) but also a data producer. The data architecture (and governance) needs to support that.
  • Real-time. Depending on the specific use case, machine learning predictions may be required in real time, or may make use of real-time data. For instance, e-commerce sites provide live product recommendations based on your real-time browsing behavior.
Figure 4. Key differences in requirements for data architecture between BI and AI

So new architectural patterns are needed to support these requirements. The overall trend is that the classical separation of analytical and operational domains vanishes. Machine learning systems become increasingly integrated within the overall IT architecture.

  • Abandoning the one-size-fits-all approach to data integration. Leading players have realized that the classical set-up of one single data integration platform — a data warehouse — no longer serves the requirements of flexibility and speed needed in a world of big data and AI.
    A data warehouse is ideal for structured data, for which the purpose is known in advance and which has a high degree of re-usability. The cleaning, data model definition and ETL development require heavy upfront effort and investment. The upside is a high-integrity data repository, with easy retrieval and usability. The rigidity and batch orientation of a DWH make it less suitable for AI.
    By its very nature, AI development requires more flexibility in use of data. Adding a new parameter to a model should not require a full DWH remodeling for just one feature. That is not to say that a DWH is not part of an AI data architecture. The scope of the DWH is just limited to core data objects, such as customer, product and financial information.
    A data lake on the other hand offers much more flexibility. It can handle any data type and allows for both batch and streaming data. All data is acquired and stored. The downside is that the data needs to be cleansed and structured upon use (schema on read). The notion that a data lake is typically of lower quality is unwarranted. It is entirely possible to distinguish different zones within the data lake for different purposes. E.g. raw data for discovery, and a curated data zone which feeds production pipelines for specific AI solutions. The latter contains more upfront data cleansing and structuring to yield high quality.
Figure 5. Data Lake versus Data Warehouse
  • Event-driven architecture & microservices. A second key trend is the use of event-driven architectures (EDA) and microservices. EDA is an architectural design pattern for data exchange, application development and integration. An EDA consists of a platform which exchanges data in the form of messages between event producers and event consumers. A key feature is that consumers and producers are decoupled: they are not aware of each other. This is different from more traditional request-response architectures, where one application sends a request and waits for a response. This decoupling has many benefits, such as great horizontal scaling. Any consumer can subscribe to any event without the need for specific system integration. While EDA is a very general design pattern, it has specific benefits for the deployment of AI.
    In particular, EDA is great for real-time AI applications that make predictions or trigger business actions based on real-time analysis of streaming data. This enables businesses to build situational awareness. For example, factories can use an event platform to connect large numbers of sensors that monitor the state of a production process. Machine learning models can analyse all this data in real time and detect anomalies or predict failure. This in turn allows maintenance teams to act and perform repairs before the actual failure occurs.
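
As a rough illustration of this pattern, the sketch below consumes sensor events from an event platform, scores each reading with a trained model, and publishes an alert event when an anomaly is predicted. It assumes a Kafka cluster and the kafka-python client; the topic names, feature fields and the convention that the model returns 1 for anomalies are all hypothetical.

    # Sketch of an event-driven ML consumer: score each sensor event as it arrives
    # and publish an alert event when an anomaly is predicted.
    # Assumes a running Kafka broker and the kafka-python client; topics are hypothetical.
    import json
    import joblib
    from kafka import KafkaConsumer, KafkaProducer

    model = joblib.load("models/anomaly_detector.joblib")  # pre-trained anomaly model

    consumer = KafkaConsumer(
        "sensor-readings",                                  # hypothetical input topic
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    for event in consumer:                                  # consumers and producers stay decoupled
        reading = event.value                               # e.g. {"pump_id": "P-17", "vibration": 0.93, "temperature": 71.2}
        features = [[reading["vibration"], reading["temperature"]]]
        if model.predict(features)[0] == 1:                 # assumed convention: 1 means anomaly
            producer.send("maintenance-alerts", {"pump_id": reading["pump_id"], "reading": reading})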

2. Model Engine as an enterprise capability

In 2015, a group of machine learning researchers at Google published a paper titled “Hidden Technical Debt in Machine Learning Systems”. They made the point that the machine learning code is only a tiny fraction of the end-to-end task of building robust machine learning systems. More so than “standard software”, machine learning systems carry high risks of introducing technical debt: complexity that has to be addressed later by cleaning up code and untangling system components.

Figure 6. ML code (black box) is only a fraction of the complexity of the entire ML system (source: D. Sculley et al, NIPS 2015)

Imagine what happens when companies start to build hundreds of ML systems, for which new and partially unknown levels of complexity will be introduced.

To run ML models at scale, organizations need a new type of platform capability: a model engine, or model factory. A model engine consists of a standardized capability to train, deploy and run machine learning models at unlimited scale, fully automated. There are at least 3 key reasons for this new requirement.

  • Enable (high speed) innovation. Provide solution teams a prepackaged way to operationalize ML models, through automating the entire data pipeline and required infrastructure. No need to reinvent the wheel.
  • Ensure manageability & governance. Having a central repository of, and view on, all deployed ML models enables proper management of both the models and the infrastructure (i.e. the model engine) they run in.
  • Minimize technical debt. The decoupling of components helps to minimize the creation of technical debt. The components needed to operationalize the model are part of one harmonized capability, rather than being part of every single AI solution.

Technically, a model engine typically produces a model in the form of a (micro)service that can be launched as a container and called through an API. The model engine manages the entire workflow and lifecycle of a prediction model. This means automatic retraining (periodic or trigger-based), testing and deployment.
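
To give a feel for what such a generated service looks like, here is a minimal, hand-written sketch that exposes a trained model behind an HTTP endpoint using Flask. In a real model engine this wrapper would be generated, containerized and managed automatically; the model artifact and input fields below are hypothetical.

    # Sketch of a model exposed as a microservice with a prediction API.
    # A model engine would generate and manage many of these; this stand-alone
    # Flask version only illustrates the shape of the end product.
    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("models/car_price_model.joblib")   # hypothetical model artifact

    @app.route("/predict", methods=["POST"])
    def predict():
        payload = request.get_json()                       # e.g. {"mileage": 80000, "age_months": 36}
        features = [[payload["mileage"], payload["age_months"]]]
        prediction = model.predict(features)[0]
        return jsonify({"predicted_price": float(prediction)})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)                 # in a container this would sit behind a proper WSGI server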

3. Model Management

Having a standardized Model Engine enables another critical capability: Model Management. Unlike traditional software, AI systems do not possess fixed performance levels. Prediction models are prone to change over time, as the environment, context and training data change. The practice of Model Management consists of the following elements:

  • Performance Monitoring. Just like a chemical factory has a control room to monitor temperatures, pressures and other process performance metrics, machine learning systems also need performance monitoring. A key concept in machine learning is generalization: how well does a model perform beyond its training data? During development, data is typically split into a training and a test set. That is the first step to test generalization. But the real proof of the pudding is in the eating. This out-of-sample model performance determines how well predictions perform in real operation. It is common for prediction models to deteriorate over time, and require retraining or even redevelopment. Automated performance monitoring provides the baseline metric to take performance management actions.
  • Performance Management. To ensure prediction performance, various actions can be taken. First, a model can be retrained on a more recent data set. This action can be taken automatically, based on preset performance thresholds or periodically. Second, a different model can be put in production. A model engine can run multiple models in production in parallel as shadow models — only one of them is really used for the operational process. Based on preset business rules, the best-performing model can be promoted to production automatically. Third, a data scientist can update the model. This can involve adding new parameters (features) based on new data sources.
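
A minimal sketch of how monitoring and the first management action (automatic retraining) could be wired together is shown below. The AUC metric, the threshold value and the retraining hook are assumptions for illustration; any out-of-sample metric that reflects business value can take their place.

    # Sketch of automated performance monitoring with a retraining trigger.
    # Metric, threshold and retraining hook are illustrative assumptions.
    from sklearn.metrics import roc_auc_score

    PERFORMANCE_THRESHOLD = 0.75   # hypothetical minimum acceptable out-of-sample AUC

    def monitor_and_manage(model, recent_features, recent_outcomes, retrain_fn):
        """Compare live (out-of-sample) performance against a threshold and
        trigger retraining when the model has deteriorated."""
        live_auc = roc_auc_score(recent_outcomes, model.predict_proba(recent_features)[:, 1])
        print(f"live out-of-sample AUC: {live_auc:.3f}")   # in practice: push to a monitoring dashboard
        if live_auc < PERFORMANCE_THRESHOLD:
            retrain_fn()                                    # e.g. kick off the retraining pipeline
        return live_auc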

In summary, figure 7 below depicts the main components required for an AI-ready data & technology platform.

Figure 7. High-level components of AI platform architecture

3. How to get started

I believe it was Andrew Ng, a well-known entrepreneur and researcher in the deep learning arena, who first compared AI to electricity. At first I liked the analogy, but I was skeptical about the reality of that statement. Not anymore. I believe we are rapidly entering an acceleration period in which the deployment of machine learning models for just about any decision process will become the norm in business. If AI is truly general purpose like electricity, it will flow through the veins and arteries of the enterprise, powering every decision cell out there. I believe this to be true, because all the prerequisites exist. Cloud providers are offering pre-trained (transfer learning) models behind a single API. The same goes for speech, text and image recognition: these are almost commodity capabilities (as long as you don’t seek edge performance). Machine learning libraries are standard material for any python developer. Platforms are emerging that allow solution building on company-wide scale.
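
A small example of how commoditized this has become: with an open-source library such as Hugging Face's transformers, a pre-trained language model is a few lines of Python away. The first call downloads a default model, so the model choice and output here are illustrative rather than production advice.

    # Using a pre-trained model as a near-commodity capability.
    # Assumes the transformers library is installed; the default model it downloads is illustrative.
    from transformers import pipeline

    classifier = pipeline("sentiment-analysis")            # downloads a pre-trained model on first use
    print(classifier("The new claims process is fast and painless."))
    # -> e.g. [{'label': 'POSITIVE', 'score': 0.99}]       (exact output depends on the model)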

We are entering times in which we can start playing Corporate AI Lego — letting our creativity and imagination go free flow. Dream up and build entirely new products, services and business models using a myriad of technological AI building blocks available in the cloud or from (other) open sources. The opportunities are endless — but it’s no easy feat to pull off. So what can you do tomorrow to get on the bandwagon, or make it go faster if you are already on it?

Stay tuned!

Wouter Huygen
Managing Partner at MIcompany, a European AI services firm.
