By Yann Albou.
Platform Engineering is a major challenge for companies and organizations wishing to develop powerful, scalable, secure digital products that meet the challenges of the business. What exactly is Platform Engineering or Internal Developer Platform (IDP)?
The notion of a platform is not new, but how does it differ from classic SaaS?
How can it help companies create efficient and scalable products and services? What are the best practices and essential tools to succeed in this field?
In this article, we will explore the world of Platform Engineering and the Internal Developer Platform in detail and answer all these questions to help you understand the foundations of this strategic evolution.
On this same theme, and for French speakers, Sokube organized a round table. The replay is available here:
Originally, the DevOps approach emerged in order to improve the speed, efficiency and quality of software development by automating processes, fostering transparent communication and encouraging shared responsibility.
By breaking down the infamous DevOps wall, and thus bringing together the different players in the supply chain (in particular Dev, Security, Ops), DevSecOps emphasizes the creation of a continuous development cycle and continuous deployment, integrating Agile practices, source code management, test and deployment automation, security and quality.
The goal is to solve communication problems, which is already a huge improvement over the traditional approach.
Concretely, this translates into a DevOps team in charge of an application with the responsibility of managing containerization, CI (Continuous Integration), CD (Continuous Delivery / Deployment), Kubernetes, infrastructure (IaC with Terraform or Ansible for example), Cloud account management, observability, security, upgrades, patching, … all on the different environments!
With all of these responsibilities comes the cognitive load (I will discuss this point in detail later) of managing all of these tools and processes in addition to the rest (in particular the Business application).
Having this approach for each application or group of applications does not scale well.
For each team that will manage applications, the technical challenges and needs are very similar. Factorization and centralization therefore become key elements to maximize efficiency and avoid an effect of inconsistency at the company level.
However, the autonomy, innovation and freedom of the teams remain necessary, especially when applying the SRE principle "You build it, you run it". But that can't work if the team also has to manage the platform, operate it and master it.
If we also add the need for expertise on each of these subjects, it then becomes almost impossible for a business team in charge of an application to build an efficient, secure, compliant and reliable platform.
In short, as you will have understood, knowing how to correctly configure Kubernetes for an application, how to set up image scanning on the different environments, how to automate deployment without service interruption, or how to set up an observability solution is not necessarily the best use of development engineers, who already carry many responsibilities!
The CNCF Cloud Native Maturity Model maps the DevOps transformation journey in several stages (Build -> Operate -> Scale -> Improve -> Optimize) by analyzing several criteria: People, Process, Policy, Technology, Business outcomes.
This maturity model shows that as the company matures and evolves, it better understands the impact of its tools and improves and refines its DevOps practices. This same model makes it possible to highlight the problems of inefficiency and the need for consolidation with a shared platform.
Before explaining these concepts in detail, let’s start with a simple definition:
Platform Engineering is an approach in which organizations develop a shared platform (considered as a product) to improve developer experience and productivity across the organization by providing self-service capabilities with automated infrastructure operations, using DevOps techniques.
When we introduce new terms, it always gives the impression of innovative and disruptive concepts and that what was done before was not good!
DevOps and agile approaches are still totally valid but it’s more about how to bring them and how they can work better when scaling.
The aim here is to have a more optimal, better organized and scalable approach compared to the classic use of DevOps.
The approach will focus on the needs of users with a product approach in self-service mode but applied to the platform.
A new term with different approaches often implies a new job title. This is the case here with the "Platform Engineer" who could have several roles: Solutions Architect, DevOps Engineer, Site Reliability Engineer, System Engineer, Cloud Engineer, Platform Architect, Software Engineer, Infrastructure Engineer…
We could say that ultimately the main role would be to remove the barriers between dev and production! And so to talk about “developer enablers” or “enablement teams.”
None of the points in the following sections is particularly innovative, but taken as a whole, they constitute a clearly differentiating way of building an efficient, secure and controlled IT platform, in particular from an organizational and a communication-flow point of view.
The delivery chain, the implementation of security, observability, the management of versions and configurations, the construction of infrastructures and any other step necessary to put software into production are becoming increasingly substantial and complex in order to meet all the necessary market standards.
Teams need to refocus on their value and productivity and there are several reasons for this trend.
The cognitive load of a team corresponds to the storage capacity of information in the working memory. In other words, it corresponds to the amount of information that a team needs to carry out its work.
If this capacity is exceeded, then the team will find it difficult to solve problems, to learn and to have perspective on the effectiveness of their work.
Organizing teams specifically, by clearly delimiting their boundaries and making the modes of interaction between them explicit, makes it possible to avoid this overload and to promote the productivity of development and delivery.
The teams in charge of developing and delivering a service or product to the end customer, called "Stream-aligned Teams", therefore rely on other teams such as "Platform Teams" to simplify, accelerate and standardize the entire delivery flow and to take load off them.
Reducing the cognitive load of teams is a key point for improvement in organizations.
To learn more about these notions, I suggest the excellent book “Team Topologies” by Matthew Skelton and Manuel Pais.
Business changes, the needs for new features, innovations, the impact of the outside world on our IT require increasingly rapid development and delivery flows across all of the company’s teams.
These teams must focus on changes and business needs and therefore avoid reinventing the wheel in each team to deliver the software.
A platform must enable these flows to be accelerated in a standardized way.
Accelerating only to deliver issues or bugs faster is useless. The capabilities of the platform must provide a high level of resilience and reliability with a "design for failure" approach in order to provide very high continuity of service (both from a production point of view and for the developer).
This applies to all platforms, not only the production runtime, but also the software construction chain, the delivery flow, the observability platform, the different environments, the middlewares, the security…
Reliability is the prerequisite for generating feedback!
The notion of platform security must be designed in from the start and include the principles of "Zero trust", "Shift left" and "Least privilege" throughout the chain.
The sharing of information is key in the different environments and on the different platforms by bringing a consolidated and clear view of the application system with governance and this also applies to security.
For the platform or for its customers, it is important to adopt an approach of openness towards security acting in advisory and support mode rather than by constraints.
The notion of risk does not only concern technical aspects. For example, it is common for procedures, processes and knowledge to be poorly documented, if at all, and to live entirely in the heads of certain key people in the organization. This makes communication complicated when the organization grows, and it contributes to the "Bus factor" of your teams:
« The bus factor is a measurement of the risk resulting from information and capabilities not being shared among team members, derived from the phrase "in case they get hit by a bus" »
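To make the definition above concrete, here is a minimal sketch of how a bus factor could be estimated from a simple knowledge map. The data structure (topic to set of people) and the names in it are hypothetical and chosen only for illustration; a real assessment would draw on richer sources such as documentation coverage or commit history.

```python
# Hypothetical knowledge map: which people are able to operate each topic.
KNOWLEDGE = {
    "kubernetes upgrades": {"alice"},
    "ci pipelines": {"alice", "bob"},
    "terraform modules": {"bob", "carol", "dave"},
}


def bus_factor(knowledge: dict[str, set[str]]) -> int:
    """Smallest number of people whose absence leaves some topic with no owner.

    A topic is orphaned once everyone who knows it is gone, so the bus factor
    is the size of the smallest group of knowers across all topics.
    """
    if not knowledge:
        raise ValueError("knowledge map is empty")
    return min(len(people) for people in knowledge.values())


def single_points_of_failure(knowledge: dict[str, set[str]]) -> list[str]:
    """Topics known by at most one person, sorted by name."""
    return sorted(t for t, people in knowledge.items() if len(people) <= 1)


if __name__ == "__main__":
    print("bus factor:", bus_factor(KNOWLEDGE))
    print("topics at risk:", single_points_of_failure(KNOWLEDGE))
```

In this made-up example, one person leaving is enough to orphan a topic, which is the kind of signal a shared, documented platform is meant to remove.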
Building a platform also makes it possible to pool and centralize but also to optimize costs. Cost management is not just a question of optimization but also a question of visibility, responsibility and efficiency by bringing techniques of:
The centralization of a platform makes it possible to bring this FinOps culture which reuses the same approaches as DevOps and Platform Engineering:
FinOps expertise is built by level of maturity ("Crawl, Walk, Run"), with a change that takes place over time and requires a global approach to reap all the benefits, in particular in the application and platform design phase.
Platform Engineering is ideal for implementing these FinOps principles.
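As an illustration of the visibility and responsibility aspects of cost management, here is a minimal showback sketch that aggregates cost lines per team. The record format (`team` and `cost` keys) and the sample figures are assumptions made for this example, not the export format of any particular cloud provider.

```python
from collections import defaultdict

# Hypothetical cost lines, e.g. exported from a billing report and labelled by team.
COST_LINES = [
    {"resource": "cluster-a/node-pool-1", "team": "payments", "cost": 120.0},
    {"resource": "cluster-a/node-pool-2", "team": "payments", "cost": 80.0},
    {"resource": "bucket-logs", "team": "search", "cost": 50.0},
    {"resource": "orphan-disk", "team": None, "cost": 30.0},
]


def showback(cost_lines: list[dict]) -> dict[str, float]:
    """Sum costs per team; lines without a team label go to 'unallocated'."""
    totals: dict[str, float] = defaultdict(float)
    for line in cost_lines:
        team = line.get("team") or "unallocated"
        totals[team] += line["cost"]
    return dict(totals)


if __name__ == "__main__":
    for team, total in sorted(showback(COST_LINES).items()):
        print(f"{team}: {total:.2f}")
```

The "unallocated" bucket is a deliberate design choice in this sketch: unlabelled resources stay visible instead of silently disappearing from the report.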
For now, we are not yet talking much about GreenOps approaches to optimizing the carbon emissions of our applications and infrastructures, but it is highly likely that this will also integrate into the roles of the platform.
This is where Platform Engineering comes in.
This team is in charge of managing all the tools and products of the platform and bringing a layer of standardization across all the teams, for example the choice of base images by technology, the way of using Kubernetes, the choice of CI/CD solutions…
In summary, the goal is to take care of the non-functional needs of the applications: VCS, CI/CD, runtime, provisioning system, logging, monitoring, metrics, security, AD, networking…
And obviously each of these non-functional needs corresponds to a set of tools or products. The CNCF landscape illustrates the volume and the combination of solutions well:
The "Platform Engineers" select, standardize, factor, configure and administer the use of these tools and products to offer a set of services to other teams.
For this to work, the "Platform Engineers" have total responsibility for the tools, i.e. they are administrators of the whole and give "User" roles to external teams. It is an "Administrator" role with all that entails: implementation (with best practices), deployment, patching, configuration, backup, security, monitoring, access management, operational management…
This corresponds well to the two different categories of skills that were previously found within a single DevOps team.
With this approach, the workload and the cognitive load of the dev teams are greatly reduced in order to leave more capacity to generate business value.
Is it sufficient? No, because otherwise we risk ending up with siloed teams and a platform that is not adapted to the needs of its customers: the developers.
Platform teams create an abstraction layer on top of the platform with a simple, understandable, and easy-to-access API and User Interface (UI).
This is called an IDP: Internal Developer Platform which will allow you to interact with the platform and thus provision all the necessary elements.
This is therefore a self-service approach, with a portal that provides a PaaS-like service (Platform as a Service) for internal developers and addresses a targeted business need.
The goal here is to abstract and simplify as much as possible the underlying complexity of such a platform while providing flexibility, security and speed.
Like the API contract definition for application services, the IDP helps make cross-team relationships explicit by clearly defining what services are available and how to use them.
One would think that the standardization of tools and products by a centralized team would impose its choices (ArgoCD and not Flux, GitLab CI and not CircleCI, Artifactory and not Nexus, PostgreSQL and not MySQL…), which would reduce the flexibility and creativity of the teams.
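To illustrate what such an explicit contract could look like, here is a minimal sketch of a self-service request that a platform might accept and validate. Every name in it (`ServiceRequest`, the allowed environments and databases, the replica limits) is a hypothetical assumption made for this example, not the API of an existing IDP product.

```python
from dataclasses import dataclass

# Hypothetical catalog of what the platform offers to development teams.
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}
ALLOWED_DATABASES = {"none", "postgresql", "mysql"}


@dataclass(frozen=True)
class ServiceRequest:
    """What a development team asks the platform to provision."""
    team: str
    service_name: str
    environment: str
    database: str = "none"
    replicas: int = 2


def validate(request: ServiceRequest) -> list[str]:
    """Return the list of contract violations (empty if the request is valid)."""
    errors = []
    if request.environment not in ALLOWED_ENVIRONMENTS:
        errors.append(f"unknown environment: {request.environment}")
    if request.database not in ALLOWED_DATABASES:
        errors.append(f"unsupported database: {request.database}")
    if not 1 <= request.replicas <= 10:
        errors.append("replicas must be between 1 and 10")
    if request.environment == "prod" and request.replicas < 2:
        errors.append("prod requires at least 2 replicas")
    return errors


if __name__ == "__main__":
    request = ServiceRequest("payments", "checkout-api", "prod", database="postgresql")
    print(validate(request) or "request accepted")
```

The point of the sketch is that the available services and their limits are written down in one place, so a team knows what it can ask for before it asks.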
Above all, the Platform Engineering team does not want to become a blocking element and generate frustrations with the other teams.
On the contrary, they must be at the service of the development teams and work with them to understand the need, the use cases and integrate these new tools into the platform, in order to then offer this service to the other teams.
The only constraint imposed is to go through the platform, leaving flexibility in the choice of tools but also in the way of using them.
This platform is therefore at the service of development teams with an emphasis on the "Developer eXperience (DevX)".
There are situations where it does not make sense to integrate a new tool because it is too specific and too custom to a team. In this case, this team can take responsibility for the tool in agreement with the Platform Engineering team.
It is important to respect the principles mentioned here to draw the full power of this approach; otherwise, it risks having the opposite effect.
In particular, teams must be able to use the platform autonomously, and its implementation must include limits and safeguards.
For example, the use of templates makes it possible to introduce "best practices" in terms of use and configuration.
Whether for Infra-as-Code (Terraform, Ansible, Pulumi, …), for CI and its pipelines (GitLab, GitHub, CircleCI, Jenkins, …), or for CD (Helm, ArgoCD, Flux, …), each time there is a template system to factorize, standardize and set the rules of architecture, compliance and security while providing a good level of flexibility.
The goal is clearly to bring consistency of uses through the organization and to remove the load from the other teams, in particular the dev teams known as Stream-aligned Teams.
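The template idea described above can be sketched in a few lines. The example below uses Python's standard `string.Template` to render a pipeline definition from a shared template while enforcing two guardrails. The pipeline format, the `scan-image` and `deploy` commands, the registry URL and the function name are all hypothetical; this is not the syntax of GitLab CI or any other real tool.

```python
from string import Template

# Hypothetical pipeline format owned by the platform team: the scan stage
# is always present and cannot be removed by the teams using the template.
PIPELINE_TEMPLATE = Template("""\
stages: [build, scan, deploy]
build:
  image: $base_image
  script: $build_command
scan:
  script: scan-image $service_name
deploy:
  script: deploy $service_name --replicas $replicas
""")

# Hypothetical list of base images approved by the platform team.
APPROVED_BASE_IMAGES = {
    "python": "registry.example.internal/base/python:3.12",
    "java": "registry.example.internal/base/java:21",
}


def render_pipeline(service_name: str, technology: str,
                    build_command: str, replicas: int = 2) -> str:
    """Render a pipeline; teams choose parameters, the template sets the rules."""
    if technology not in APPROVED_BASE_IMAGES:
        raise ValueError(f"no approved base image for {technology!r}")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return PIPELINE_TEMPLATE.substitute(
        base_image=APPROVED_BASE_IMAGES[technology],
        build_command=build_command,
        service_name=service_name,
        replicas=replicas,
    )


if __name__ == "__main__":
    print(render_pipeline("checkout-api", "python", "pip install ."))
```

Teams stay free to choose their build command and sizing, while the mandatory stages and approved images remain under the control of the template.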
When you start this process, you must avoid making a hyper generic master plan that includes all the use cases.
You have to start with an agile approach by determining an MVP (Minimum Viable Product) and iterate in small increments, the goal being to have quick feedback to adjust the need, priorities and direction.
It is necessary to identify the ideal target in relation to the context of the teams and the company and then move in this direction gradually, both for those who implement it and for those who will have to migrate to it.
An essential point for the implementation of this IDP is to treat it as a product and not as a classic project that ends and goes into maintenance mode.
This is a Platform-as-a-Service approach that requires continuous development and improvement. This development is ensured by the Platform Engineering teams, who are therefore responsible for bringing change (new functionalities, bug fixes, improvements, updates of services and products) for the company's internal developers. Hence the name IDP: Internal Developer Platform.
And since we are talking about product, it also means having an associated versioning, release management, visibility (Release, Roadmap, …), documentation, a clear API.
This is done according to the needs of the internal teams to understand their priorities, what blocks them, the challenges they need to overcome.
The observability of your product is essential to measure the contribution of each version and to validate that what is put in place is well aligned with the intended use.
To build this product approach of the platform and make it evolve correctly just as we would do with a classic application product, it is necessary:
Typically, a platform team will work closely with a development team to improve the platform on a topic that the development team needs.
If we consider that this need makes sense for the teams, then at the end of the collaboration, the platform team switches to an "As-a-Service" mode of the functionality by making it available to all development teams.
Through this approach, we quickly understand that topologies and team structures are essential to the success of this type of platform.
The objective is to deliver value for customers (internal and external) and not to focus on the technical aspects.
This means that different organizations may need different team structures for collaboration between different actors to be effective.
The site devopstopologies defines patterns and anti-patterns (or "anti-types") oriented towards DevOps.
Remember: "There is no ‘right’ team topology, but several ‘bad’ topologies for any one organization."
These types of platforms have a measurable impact on the productivity and performance of the organization. It is therefore important to measure it for the team, the customers and the company.
You can’t improve what you don’t measure!
To measure it, we will classically use DevOps indicators (see the DORA study), such as:
But if we stopped at this level, we would miss the vision of the users!
Thus an "NPS" (Net Promoter Score) indicator makes it possible to evaluate customer satisfaction and thus to highlight the "Developer eXperience" with respect to the platform.
Just like a classic product, it is all about maximizing the satisfaction of its users, the developers.
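As a reminder of how the NPS mentioned above is usually computed, here is a minimal sketch. It follows the standard definition (ratings from 0 to 10, promoters at 9 or 10, detractors at 6 or below); the sample ratings are invented for illustration.

```python
def net_promoter_score(ratings: list[int]) -> float:
    """NPS = percentage of promoters (9-10) minus percentage of detractors (0-6)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(not 0 <= r <= 10 for r in ratings):
        raise ValueError("ratings must be between 0 and 10")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)


if __name__ == "__main__":
    # Hypothetical answers from developers to
    # "How likely are you to recommend the platform to another team?"
    survey = [10, 9, 9, 8, 7, 6, 3, 10]
    print(f"Platform NPS: {net_promoter_score(survey):.1f}")
```

The score ranges from -100 to 100, and ratings of 7 or 8 count in the total without moving the score in either direction.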
Finally, with the criteria stated previously (product approach, DevX, visibility, API, releases, documentation, clear contracts…), we could speak of an open-source platform for the company.
For example, the CNCF makes the code, architecture, organization, governance… of its projects completely transparent:
This Open-source approach is healthy, makes the overall operation very explicit, simplifies communications, facilitates improvement and allows the creation of a long-lasting platform.
In summary, there is a shared responsibility: the platform team is responsible at the operational level while the application teams are responsible for the use of the tools and their integration into the application lifecycle.
Business development teams therefore still need to know how to deploy their applications in a Kubernetes cluster, for example, or how to use Terraform modules to shape their infrastructure, or how to use GitLab CI templates and customize them according to their needs.
The need for non-functional requirements still remains valid even if the scope is greatly reduced by what the IDP brings. And so the need for a DevOps engineer within the teams remains quite relevant but based on the platform.
Note that it is entirely possible for DevOps to contribute to the platform, as one would in an open-source product approach.
What becomes even more interesting, since we treat the platform as a product, is to apply the same approaches ("Eat your own dog food") and therefore to have a DevOps approach on the platform itself, which requires having DevOps Engineers in the application teams who are therefore called "Platform Engineers".
To build such a platform, the tools and products used are obviously the same as for DevOps.
Typically and in a non-exhaustive way, we find this type of "DevOps" tools that will allow the platform to be built:
But also newer tools oriented toward a centralized service catalog, which provide a framework for building a developer portal:
You could decide to make your own framework and your portal, but before embarking on this type of approach, it is important to evaluate the desired functionalities (thinking in a medium or long term time horizon) and to determine if existing products (open source or not) on the market make it easy to do this.
Conversely, the use of a tool or a product must be done without creating “over-customization”. It is important to use a product for what it can do and not pay the price, a few years later, of excessive customization.
Platform Engineering is not at all a replacement for DevOps, but rather fits into a continuity where the collaboration of DevOps Engineers with Platform Engineers is essential to create a centralized platform in self-service mode that meets the needs of development.
The capacities of this type of platform are of several kinds and provide several services and functionalities which could be summarized by the following diagram of the CNCF:
Keep in mind that since platform engineering is a relatively new term, there is still some wiggle room when it comes to its definition. The industry is still determining how this will actually evolve. One thing is certain: the role that automation and efficiency play in the software development process will only grow in importance.
As shown in this Gartner article, Top Strategic Technology Trends for 2023: Platform Engineering, as distributed cloud systems become more commonly used and as architectural patterns continue to evolve, the demand for platform engineers and internal developer portals is expected to increase.
There are many different use cases and possible organizations depending on the context to create a platform. This article was intended to give enough elements to implement a Platform-as-a-Product vision that will allow you to scale your IT.