The Data Lifecycle

Nick Anthony
6 min read · Dec 7, 2020
[Image: the stages of the data lifecycle. Source: Employee Cycle]

When developing a data-driven application there are a few major concerns, and all of them revolve around the data itself (since it is the center of the application/project) and how you utilize it. The image above shows the stages we will go over today: raw data collection, pre-processing, storage, post-processing, and application. The final item we will discuss, once we have a full understanding of the entire data lifecycle, is the Universal Data Model (UDM, as I prefer to call it), which is perhaps the most important of all for ensuring long-term application success and stability.

Data Collection

This stage is the easiest to understand and perhaps the simplest overall. When collecting data the most important thing is THAT IT IS COLLECTED, by any means necessary. Spreadsheets, interviews, handwritten documents/forms, etc. are all viable forms of data collection, and all serve the purpose of, well, collecting data. If you are developing a more comprehensive application with a dedicated data collection tool, then it is imperative that this tool has client-side validation. Data entry forms should provide instant(ish) feedback: if you are expecting one data type (a number) and the user inputs another (a string), you shouldn't have to wait until data submission to get that error.
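
To make that concrete, here is a minimal sketch of the kind of per-field check a form would run before submission. The 'age' field and its rules are hypothetical examples, not something from a particular tool:

```python
# A sketch of per-field validation, the kind of check a data entry
# form would run on every keystroke or blur event rather than waiting
# for submission. The 'age' field and its rules are hypothetical.

def validate_age(raw: str) -> tuple[bool, str]:
    """Return (is_valid, message) so the form can show feedback immediately."""
    if not raw.strip():
        return False, "Age is required."
    if not raw.strip().isdigit():
        return False, "Age must be a number."  # caught before submission
    return True, ""

print(validate_age("29"))      # (True, '')
print(validate_age("twenty"))  # (False, 'Age must be a number.')
```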

Pre-Processing

The next stage is pre-processing: this is where we take our raw data and transform it into an acceptable format. This is where we convert the user-entered ‘Nick ‘ (note the white-space) into ‘Nick’. In this stage, we enforce our UDM ruleset. This is one of the simpler stages to explain, but it often takes the longest to develop.
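
As a quick illustration, here is a minimal sketch of a single pre-processing rule, assuming a hypothetical UDM rule that names are stripped of surrounding whitespace and stored in title case:

```python
# A sketch of one pre-processing rule. Real rulesets are far larger;
# the title-case rule here is an assumed example.

def clean_name(raw: str) -> str:
    return raw.strip().title()

assert clean_name("Nick ") == "Nick"   # the white-space example above
assert clean_name("  nick") == "Nick"
```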

Storage

The storage stage is very simple… this is when you store the data! What is more complicated is how you store your data. Which cloud provider do you choose (do you even need one?)? Which database type (NoSQL/SQL: relational vs. document vs. graph) do you utilize? These questions cannot be fully addressed without a thorough understanding of the UDM for your application. We will discuss this more below; once you have established your UDM, your database type/choice should become more obvious and come down to other functionality such as exportability/accessibility. Your data storage engine should enforce good database standards (3NF) and minimize downtime. Another factor to consider here is the flexibility of your storage engine, in case you need a more flexible UDM as time progresses and your application grows.

Finally, a hybrid approach may work here, and it has become a favorite of mine. For example, I have found many data forms are quite compatible with a document-style database, but I prefer a relational SQL-style database for managing users. I often combine the two and have one database for user-related data (accounts/profiles/etc.) and another for application-related data.
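
Here is a minimal sketch of that hybrid idea. Both halves live in SQLite purely for illustration; in a real deployment the two sides might be, say, a relational database and a document store. All table and field names are hypothetical:

```python
# A sketch of the hybrid approach: relational storage for users,
# document-style storage for application data. Both use SQLite here
# only so the example is self-contained and runnable.
import json
import sqlite3

conn = sqlite3.connect(":memory:")

# Relational side: users have a fixed, well-understood schema.
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute("INSERT INTO users (email) VALUES (?)", ("nick@example.com",))

# Document side: form submissions vary in shape, so store them as JSON.
conn.execute("CREATE TABLE form_documents (id INTEGER PRIMARY KEY, body TEXT)")
doc = {"form": "intake", "answers": {"first_name": "Nick", "last_name": "Anthony"}}
conn.execute("INSERT INTO form_documents (body) VALUES (?)", (json.dumps(doc),))

print(conn.execute("SELECT email FROM users").fetchall())
print(json.loads(conn.execute("SELECT body FROM form_documents").fetchone()[0]))
```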

Post-Processing

This is the stage where you retrieve data from storage and prepare to use it in your actual application, be it as a visual (chart/network/graphic/etc.) or something else. The most effective way to accomplish this is to build an API to interact with your database. The API’s job is to retrieve data from the database (quickly; speed is imperative here) and provide easy access to it (REST APIs utilize URLs) in a standardized form (again, enforcing the UDM ruleset). GraphQL does a great job here by allowing diverse and evolving APIs while limiting response data to only what is requested. This also allows the addition of derived data. For example, say you have a field ‘first_name’ and a field ‘last_name’, and in your application you frequently use the two together as ‘full_name’. This can be made into an API field!
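
Here is one way that derived field might look, a sketch using graphene (a Python GraphQL library). The ‘first_name’/‘last_name’/‘full_name’ fields come from the example above; the rest of the schema is an assumed stand-in:

```python
# A sketch of a derived 'full_name' API field in graphene. The field
# names are from the example above; everything else is hypothetical.
import graphene

class Person(graphene.ObjectType):
    first_name = graphene.String()
    last_name = graphene.String()
    full_name = graphene.String()  # derived, never stored in the database

    def resolve_full_name(self, info):
        return f"{self.first_name} {self.last_name}"

class Query(graphene.ObjectType):
    person = graphene.Field(Person)

    def resolve_person(self, info):
        return Person(first_name="Nick", last_name="Anthony")

schema = graphene.Schema(query=Query)
result = schema.execute("{ person { fullName } }")
print(result.data)  # {'person': {'fullName': 'Nick Anthony'}}
```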

One common error in this stage is performing data aggregation at the database level instead of here, at the post-processing level. For example, you should NOT have a ‘full_name’ field in your database, as we just discussed; you should create it in your post-processing model. While most people do not make that particular mistake, they DO perform calculations (addition, subtraction, multiplication, and ESPECIALLY counts) in their database! This is wrong! These are all post-processing models that should be computed in this stage to limit data storage costs and keep the database as simple as possible. The one exception to this is long-running tasks (although even now ML models are available via APIs). The results of longer tasks or calculations can be stored in the database (assuming they change infrequently, if at all) to improve responsiveness.
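
A small sketch of the difference: the count below is derived on demand in the post-processing layer, so there is no stored counter column to keep in sync. Table and field names are hypothetical:

```python
# A sketch of computing a count at the post-processing layer on
# request, rather than storing a 'submission_count' column.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE submissions (id INTEGER PRIMARY KEY, user TEXT)")
conn.executemany("INSERT INTO submissions (user) VALUES (?)",
                 [("nick",), ("nick",), ("alex",)])

def submission_count(user: str) -> int:
    # Derived on demand; nothing in the database can drift out of sync.
    row = conn.execute("SELECT COUNT(*) FROM submissions WHERE user = ?",
                       (user,)).fetchone()
    return row[0]

print(submission_count("nick"))  # 2
```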

The Application

Our final stage is very straightforward: this is where you use the data! Make a map, or a chart, or a cool graphic! Or open-source your API! Share your data with the world in the best way you can! This is the step we all love and live for, but it is not possible without all the previous work!

The UDM

Finally, we have to talk about the UDM (Universal Data Model). This step is the most time-consuming (perhaps next to pre-processing). It involves spending some serious time working with and thinking about your data. What structure does it have? What fields are there for each entity? What types are those fields?

IT IS CRUCIAL IN THIS STEP, WHEN ASKING YOURSELF THESE QUESTIONS, THAT YOU THINK NOT ABOUT HOW THE DATA CURRENTLY IS, BUT HOW IT IDEALLY SHOULD BE!

I believe there are 4 primary aspects to any UDM (your mileage may vary):

  1. Structure
  • Is your data nested? What are the top-level entities?
  • Can your data be broken down into separate, smaller, micro-service-style UDMs?
  • What relationships exist? Is the data/analysis heavy on relationships? (hint: graphs)
  2. Fields
  • Limit your UDM to only the fields you need NOW; allow room to grow later, but extra fields now only create extra work.
  • What data types should these fields be standardized as? Can we simplify those types? (i.e. numbers/strings into booleans)
  • Ensure high-quality naming conventions are developed here and utilized consistently across the entire project.
  3. Usage
  • When you think about your application itself, how will you be utilizing the data? Is there a way to incorporate your use cases into the UDM itself without compromises elsewhere in the lifecycle?
  • Think about the post-processing step: are there repetitive tasks done there that could be enforced in pre-processing?
  • Ensure high-quality returns… this means if you expect an integer between 0–100 (perhaps to represent a percentage, which could also range from 0–1), return an integer from 0–100; document these restrictions and follow them (see the sketch after this list).
  4. Sharing
  • All of these steps combine to form a UDM that is consistent across the entire data lifecycle for your entire application stack. When developing this, collaborate with others, ask for opinions, and write down your decisions (and the reasoning behind them). This forms the basis of the documentation for your entire application and, should you choose to share your lovely data with the rest of the world, becomes an invaluable asset to those you share with.
  • A note here about the UDM Document: this should be a living, breathing document. It not only can but SHOULD change; it should evolve over time as your application evolves. The important thing is documenting these changes, the reasons for them, and sharing this information with your users.
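
To tie these aspects together, here is a minimal sketch of what one UDM entity definition might look like in code, assuming a hypothetical ‘Participant’ entity. The point is that field names, types, and documented ranges are written down once and enforced everywhere in the lifecycle:

```python
# A sketch of a UDM entity definition. The 'Participant' entity and
# its fields are hypothetical; the 0-100 integer range echoes the
# "high-quality returns" point above.
from dataclasses import dataclass

@dataclass
class Participant:
    first_name: str          # pre-processing strips surrounding whitespace
    last_name: str
    completion_pct: int      # documented range: integer 0-100, NOT 0-1

    def __post_init__(self):
        if not 0 <= self.completion_pct <= 100:
            raise ValueError("completion_pct must be an integer from 0 to 100")

p = Participant(first_name="Nick", last_name="Anthony", completion_pct=87)
print(p)
```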

We talked about a lot today, but I hope most of it sunk in. I am always here to talk; I love this part of the job, and when I see companies/organizations on top of their UDM I am an immediate fan. Unfortunately this is not in everyone’s ‘wheelhouse’, so to speak, but for me it’s the lifeblood of why I started in tech in the first place.

Anyways, have a great day! Stay safe, stay healthy!

-Nick

Originally published at http://github.com.
