Platform Engineering – Part 5 – Data Platforms

Data Platforms – more than just large data pools

Data platforms are not a new phenomenon, but the need to build re-usable assets around collected data seems to be growing at a rate we have not seen before. IoT (Internet of Things) use cases in particular, based on data collected from field devices, are a driving factor. Once data is collected from devices and numerous other systems, it should be used in the most effective way, so that new business features and even completely new business models can prosper. The data collections quickly grow to hundreds of Gigabytes or even Petabytes and, with the right tools and proper research, this data contains a massive amount of valuable information. Data Lake is an often-used term in this context, describing a place where all this mass data, structured and unstructured, is stored in a cost-effective fashion, waiting for more and more use cases that generate valuable insights from it.

We will not discuss Big Data or Data Lake concepts here, since this is a problem space of its own. However, assuming the data is not just sitting in one place being used randomly by whoever finds it, data platforms tend to face the same challenges as traditional platform development, with one additional complexity: mass data access.

Why is that the case? Small and large data pools that are to be used by a set of users slip into a growing space of functional requirements. Storing and curating data is not free of cost, so a demand arises to control data access or even bill users for it. There are also recurring problems whenever products / solutions want to make use of the data. Here is a list of functional building blocks that can be found in many data-centric platforms:

  1. Data access management, defining who is allowed to read / write which data. This can be on every possible level, e.g. data collection, object, file, row, column, … or a combination of these. With ever-growing data and more and more users who should use it, managing access can become a cumbersome and error-prone operational task; depending on the restrictions and security requirements around the data, it is a substantial problem domain. The solution often depends directly on the underlying data storage technologies, since not all technologies support all desired authentication & authorization concepts. Automation plays a significant role here, specifically when a diverse set of technologies is used to store mass data, e.g. a mixture of files, databases and data warehouses. A minimal sketch of such an access check follows this list.
  2. Data access control, enforcing the access rights on the data storage technology level. This includes access control logging; in some environments it might be necessary to prove who accessed which data (e.g. GDPR-relevant data).
  3. Data-at-rest and data-in-transit security, e.g. encryption
  4. Data loss protection, e.g. disaster recovery implementations
  5. Meta data management and data catalogs. If using the data in multiple use cases is a goal, then it is often required to make relevant data easy to find and to be able to navigate the data pool. Meta data management deals with the semantic content and the structure of the data in the data collection, storing this information in a central place and allowing users to browse it. Which relations exist in my data, where do I find data for devices XYZ, how does the data reflect device hierarchies, do “temp” and “temperature” mean the same thing in these two data sets? Such questions, and many more, may arise. This is specifically important when new use cases are in an exploration phase. Several products exist that specifically try to address this problem with standard data models and highly automated meta data extraction and catalog features. A small catalog sketch follows this list.
  6. Data ingest, transformation, cleansing and loading. Many organizations want to solve the problem of ingesting and integrating data from multiple sources in a single place; not every product should have to cope with this problem on its own. Data should be provided in the central data pool in a way that shows high quality, with well-managed meta data around it. Here too, we see a long list of companies that provide software products for this specific problem.
  7. Cost-effective mass storage. One very common feature is that data which is only rarely used should be stored in cheaper (and less performant) data stores to reduce operational cost. This concept is often called “tiered storage”, with multiple storage tiers that provide different combinations of availability, performance and cost. “Temperature models” are also often considered here, differentiating data between hot, warm and cold states that reflect how often the data is queried in a given time frame; a small sketch of such a model follows this list.
  8. Use case optimized storage. In contrast to cost optimization, data platforms also want to provide storage that is optimized for query performance. So, depending on the data query behavior of clients, different storage technologies may be used and data may exist multiple times in multiple optimized forms. To achieve this, data pipelines need to be maintained that transform data from its raw format into one or more target formats and store the consumable data in a dedicated storage technology (e.g. an object store, database or other). The variability of data query behavior across all clients of the platform is a central complexity and cost driver. Optimizing for performance often also consumes considerable resources at the infrastructure level (e.g. CPU power, network bandwidth, drive IO and RAM). In addition, the different storage technologies may require using their specific performance optimization features, which calls for highly specialized people.
  9. Data analytics and machine learning tools and frameworks. In order to make data usable for a concrete use case, it is first required to analyze the data and find ways to generate actionable insights from it. Machine learning is also often utilized here, and custom model training is used to derive valuable inferences from data sets and streams. Setting up such environments can be complex and requires specialized knowledge, so embedding such features into the platform can dramatically shorten the time required to build new products and solutions. The AWS platform provides a service called Amazon SageMaker that does exactly this: it encapsulates machine learning environments into an easy-to-use service, so building, training and deploying machine learning models becomes quite easy.
  10. Data visualization tools and frameworks. When every one of your products wants to explore and present data graphically to end users, it is worth considering embedding a generalized visualization solution into the platform.

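Relating to points 1 and 2 above, here is a minimal, purely illustrative Python sketch of column-level access enforcement with audit logging. All names (roles, datasets, policy structure) are hypothetical assumptions, not a reference to any concrete product; real platforms would push such checks down into the storage technology wherever possible.

```python
# Sketch of column-level access enforcement (all names hypothetical).
# A policy maps (role, dataset) to the columns that role may read; query
# results are filtered to the permitted columns, and every access is logged.

import logging
from typing import Dict, List, Set

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("data-access-audit")

# role -> dataset -> allowed columns
ACCESS_POLICY: Dict[str, Dict[str, Set[str]]] = {
    "field-engineer": {"device_telemetry": {"device_id", "timestamp", "temperature"}},
    "data-scientist": {"device_telemetry": {"device_id", "timestamp", "temperature", "location"}},
}

def read_rows(user: str, role: str, dataset: str, rows: List[dict]) -> List[dict]:
    """Return only the columns the role is allowed to see; log the access."""
    allowed = ACCESS_POLICY.get(role, {}).get(dataset)
    if allowed is None:
        audit_log.warning("DENIED user=%s role=%s dataset=%s", user, role, dataset)
        raise PermissionError(f"{role} may not read {dataset}")
    audit_log.info("GRANTED user=%s role=%s dataset=%s columns=%s",
                   user, role, dataset, sorted(allowed))
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]

if __name__ == "__main__":
    rows = [{"device_id": "d1", "timestamp": "2021-01-01T00:00:00Z",
             "temperature": 21.5, "location": "plant-7", "serial_key": "secret"}]
    print(read_rows("alice", "field-engineer", "device_telemetry", rows))
```
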
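For point 5, the following sketch shows how a very small meta data catalog with synonym resolution could look, so that users discover that “temp” in one data set and “temperature” in another refer to the same concept. The data structures and example entries are made up for illustration only.

```python
# Sketch of a meta data catalog entry with synonym resolution (hypothetical).

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DatasetMetadata:
    name: str
    owner: str
    columns: Dict[str, str]          # column name -> semantic concept
    tags: List[str] = field(default_factory=list)

# canonical concept -> known synonyms found in the raw data sets
SYNONYMS = {"temperature": {"temp", "temperature", "t_celsius"}}

CATALOG = [
    DatasetMetadata("device_telemetry_raw", "iot-team",
                    {"temp": "temperature", "dev": "device_id"}, ["raw", "hot"]),
    DatasetMetadata("device_telemetry_curated", "data-platform",
                    {"temperature": "temperature", "device_id": "device_id"}, ["curated"]),
]

def find_datasets_by_concept(concept: str) -> List[str]:
    """Return all catalog entries exposing a column mapped to the concept."""
    names = SYNONYMS.get(concept, {concept})
    return [ds.name for ds in CATALOG
            if any(col in names or sem == concept for col, sem in ds.columns.items())]

print(find_datasets_by_concept("temperature"))
# -> ['device_telemetry_raw', 'device_telemetry_curated']
```
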
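For point 7, here is a small sketch of a “temperature model” for tiered storage. The thresholds, tier names and backends are invented for illustration; a real platform would derive them from actual query statistics and the available storage technologies.

```python
# Sketch of a hot/warm/cold temperature model for tiered storage (values made up).

from datetime import datetime, timedelta
from typing import Optional

TIERS = [
    ("hot",  timedelta(days=7),    "ssd-backed object store"),
    ("warm", timedelta(days=90),   "standard object store"),
    ("cold", timedelta(days=3650), "archive storage"),
]

def pick_tier(last_accessed: datetime, now: Optional[datetime] = None) -> str:
    """Classify data by the time since its last access."""
    now = now or datetime.utcnow()
    age = now - last_accessed
    for name, max_age, _backend in TIERS:
        if age <= max_age:
            return name
    return "cold"

print(pick_tier(datetime.utcnow() - timedelta(days=2)))    # hot
print(pick_tier(datetime.utcnow() - timedelta(days=30)))   # warm
print(pick_tier(datetime.utcnow() - timedelta(days=400)))  # cold
```
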
Of course there is more to mention. All this functionality is designed to shorten the time to commercial use of the data, manage the data and its security, reduce the cost of storing it, and provide the data in the most suitable way to its users, the solution / product builders. So, the fundamental challenges are the same as in classic platform development, because you also need to consider which of all this functionality should be part of the platform and which should not. Where should I build something myself vs. use a platform feature vs. use a 3rd-party product or service? How do I deal with variability, specifically in the area of operational requirements?

Cost driver: Mass Data Interfaces

Data platforms, however, have one more complexity driver: the mass data interface. While traditional platforms focus on services that expose functionality along with some smaller amounts of data (ok, that’s a rather blurry definition, agreed), data platforms also need to provide access to large amounts of data. Thinking of API calls and the amount of data that you can return in a single response – maybe some Megabytes – you need to define ways around that problem if you want to make Gigabytes and Petabytes available. So, data platforms typically provide two classes of interfaces: 1) functional service interfaces and 2) mass data interfaces.

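To make the contrast concrete, here is a small Python sketch of the two interface classes. The function names and data are hypothetical; the point is only that the mass data interface returns its result in bounded chunks (or, in practice, via pagination, streaming or bulk export) rather than in one huge response.

```python
# Sketch: functional service interface vs. chunked mass data interface (hypothetical).

from typing import Dict, Iterator, List

def get_device_status(device_id: str) -> Dict[str, str]:
    """Functional interface: small payload that fits easily into one response."""
    return {"device_id": device_id, "status": "online", "firmware": "1.4.2"}

def stream_telemetry(rows: List[dict], chunk_size: int = 10_000) -> Iterator[List[dict]]:
    """Mass data interface: yields the result set in chunks so that no single
    response has to carry gigabytes; a real implementation would page through
    the underlying store instead of an in-memory list."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]

rows = [{"device_id": "d1", "temperature": 20.0 + i * 0.01} for i in range(25_000)]
print(get_device_status("d1"))
for chunk in stream_telemetry(rows):
    print(f"received chunk with {len(chunk)} rows")
```
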
Dealing with large amounts of data is a complex problem to solve, and it needs careful consideration of how data volume, data velocity and data variety are managed. The more different use cases you want to support on top of your data, the more complex the data query optimization and mass data access requirements you will encounter. Here are just some keywords as examples of different data query requirements (a small sketch of one of them, predefined vs. custom queries, follows the list):

  • Query volume and velocity
  • Use data behind a service vs. raw data access
  • Predefined vs. custom queries
  • Structured vs. unstructured data
  • Type and volume of required data calculations at query time
  • Type of traversable data relations
  • Compression, Aggregation
  • Explorative vs. static analysis
  • Batch vs. real-time data access
  • Required data formats and types
  • On-demand data integration
  • Protocol support, e.g. available and supported drivers in a given technology

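As an illustration of the “predefined vs. custom queries” item above, the following sketch contrasts the two. The table, column names and queries are invented for the example; the design point is that predefined, parameterized queries keep load predictable and easy to optimize, while custom queries give clients flexibility at the price of unpredictable resource consumption.

```python
# Sketch: predefined (parameterized) vs. custom queries (all names hypothetical).

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE telemetry (device_id TEXT, ts TEXT, temperature REAL)")
conn.execute("INSERT INTO telemetry VALUES ('d1', '2021-01-01T00:00:00Z', 21.5)")

# Predefined query: the platform controls its shape, so indexes and caching
# can be tailored to it.
def average_temperature(device_id: str) -> float:
    row = conn.execute(
        "SELECT AVG(temperature) FROM telemetry WHERE device_id = ?", (device_id,)
    ).fetchone()
    return row[0]

# Custom query: the client ships arbitrary SQL; powerful, but unpredictable load.
def run_custom_query(sql: str):
    return conn.execute(sql).fetchall()

print(average_temperature("d1"))
print(run_custom_query("SELECT device_id, temperature FROM telemetry"))
```
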
The architecture of such data platforms needs to consider the above requirements when the mass data interfaces are designed. An often underestimated factor here is that transporting mass data (e.g. Gigabytes and more) has an associated cost and consumes considerable bandwidth on the network.

This leads to the question of cost-effective mass data APIs and transport mechanisms, where new features may be required, such as compression algorithms, filtering and others.

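The following sketch shows, with made-up data and field names, how server-side filtering and compression reduce the volume that actually has to cross the network. It is only an illustration of the idea, not a statement about which compression algorithm or filter pushdown mechanism a platform should use.

```python
# Sketch: server-side filtering and compression before mass data transport
# (data and field names are made up for illustration).

import gzip
import json

rows = [{"device_id": f"d{i % 100}", "temperature": 20.0 + (i % 50) * 0.1}
        for i in range(100_000)]

def export(rows, device_id=None, compress=True) -> bytes:
    """Optionally filter at the source and compress before sending."""
    if device_id is not None:
        rows = [r for r in rows if r["device_id"] == device_id]
    payload = json.dumps(rows).encode("utf-8")
    return gzip.compress(payload) if compress else payload

print(len(export(rows, compress=False)))                  # full, uncompressed
print(len(export(rows, compress=True)))                   # full, compressed
print(len(export(rows, device_id="d42", compress=True)))  # filtered + compressed
```
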
As a conclusion: building re-usable data platforms around large amounts of data, like building other software platforms, is not cheap. Doing it right will again cost you up to 5 times the effort and time compared to directly using the data in a plain software product.
