
Data as an Asset: the Critical Role of Data Engineers in Supply Chain Analytics

We are often told that the key to success in supply chain management is the deft use of data. But what does that actually mean? In the world of supply chain, data is being generated everywhere, from the most sophisticated apps and IoT devices to manually kept spreadsheets. Nevertheless, not all data is useful, and not all useful data is ready to be used. As anyone who has ever worked with data knows, there is a huge gap between raw data and valuable data. The path from one to the other is the key to gaining a competitive edge in this industry and is the purview of a discipline known as Data Engineering.


What is a Data Engineer?

In short, a Data Engineer is responsible for collecting data from source systems and transforming it until it is optimized for downstream consumption. But Data Engineers are more than mere technicians. They work in close alignment with Executives, Product Owners, and Project Managers to ensure that the resulting data meets business needs and obeys business logic. The process, from collection (files, ERPs, CRMs, IoT, APIs, etc.) through use in analytics and machine learning, is captured in the Data Engineering Lifecycle defined by Joe Reis and Matt Housley in their book Fundamentals of Data Engineering (O'Reilly, 2022).


Figure 1: Data Engineering Lifecycle from Fundamentals of Data Engineering, Joe Reis and Matt Housley, 2022.

While the whole Lifecycle spans from data generation through analytics, a Data Engineer is mainly focused on what occurs in the central section (ingestion, transformation, and serving). As the diagram shows, storage underlies the whole process. In addition to managing the flow and transformation of data, Data Engineers are also typically responsible for designing and maintaining data storage. A Data Engineer works alongside the Data Architect to choose the most appropriate storage and computing tools for a given project. Then, Data Engineers build the pipeline.
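To make those three central stages concrete, here is a minimal sketch of an ingest-transform-serve pipeline in Python with pandas. The file names and column names (order_id, order_date) are illustrative assumptions, not taken from any particular system.

```python
import pandas as pd

def ingest(path: str) -> pd.DataFrame:
    """Pull raw records from a source system (here, a CSV export)."""
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply business logic: drop rows missing the key field, enforce types."""
    cleaned = raw.dropna(subset=["order_id"]).copy()
    cleaned["order_date"] = pd.to_datetime(cleaned["order_date"])
    return cleaned

def serve(curated: pd.DataFrame, path: str) -> None:
    """Write the curated table where Analysts and BI tools can read it."""
    curated.to_parquet(path, index=False)

if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    serve(transform(ingest("raw_orders.csv")), "curated_orders.parquet")
```

Real pipelines add scheduling, logging, and error handling around these steps, but the shape (ingest, transform, serve, with storage underneath) stays the same.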


Key Responsibilities of a Data Engineer


Separating Facts from Data

As mentioned, raw data isn't immediately useful. In fact, not every data point even represents a fact! A digital thermometer could malfunction and report false temperatures. Accidentally duplicated rows in a sales table will produce inflated revenue reports. In these examples, the collected data is misleading rather than factual. Without a Data Engineering pipeline, those data points could be included in analysis, skewing results and leading to detrimental business decisions. A key responsibility of the Data Engineer is to ensure that the tables or datasets that Analysts rely on are a source of truth. A typical exercise in data validation is enforcing primary key constraints. The primary key is what designates a unique record, so it should appear only once in the relevant source table. For example, if you receive two purchase records with the same supposedly unique order ID, you must investigate whether they describe the same purchase order and correct the duplication (by deleting or updating one of the records), as in the sketch below. While trust builds slowly over long periods of reliably valid data, a single misleading row can compromise a Data Engineer's reputation.
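As a minimal sketch of that primary key check, the following snippet flags duplicate order IDs in a small, hypothetical purchase-order extract (the column names and values are invented for illustration):

```python
import pandas as pd

# Hypothetical purchase-order extract; IDs and values are illustrative only.
orders = pd.DataFrame({
    "order_id": ["PO-1001", "PO-1002", "PO-1002", "PO-1003"],
    "supplier": ["Acme", "Globex", "Globex", "Initech"],
    "amount":   [1200.00, 450.50, 450.50, 980.00],
})

# Primary-key check: every order_id should appear exactly once.
violations = orders[orders.duplicated(subset="order_id", keep=False)]
if not violations.empty:
    print("Primary key violation detected:")
    print(violations)
    # One possible correction after investigation: keep a single row per order_id.
    orders = orders.drop_duplicates(subset="order_id", keep="first")
```

In production this check usually runs automatically on every load, so a violation is caught before it ever reaches an Analyst's report.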


Maintaining Security and Privacy

Data Engineers deal with sensitive data daily. This includes, but is not limited to, information about employees' and customers' private lives and companies' confidential information. A data breach could plunge your company into ruin, causing financial loss, disclosure of competitive information, or reputational damage. Leaked data usually appears without context and is often reported in the most negative light. While many regulations and standards address data protection (including FERPA, HIPAA, GDPR, SOC 2, and ISO 27001), the most important defense is maintaining good security habits. Security is more than simply fulfilling an obligation and checking a list once or twice a year. A good Data Engineer will understand the ultimate goal of security practices and embody those principles in their daily work.


An example of these practices is the Principle of Least Privilege, where a person or system is given only the access and data required to fulfill the task at hand, and nothing more. An efficient way to implement this principle is to automatically hide or mask data through role-based controls such as Row-Level Security (RLS). Because individuals are often the weakest link when it comes to security, this simple practice can make the difference in keeping an organization secure.
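As a minimal sketch of least-privilege serving, the snippet below filters a dataset down to the columns a role is entitled to see. The roles, column names, and permission mapping are illustrative assumptions, not any particular product's access model.

```python
import pandas as pd

# Illustrative mapping of roles to the columns they are allowed to see.
ROLE_PERMISSIONS = {
    "analyst": {"order_id", "region", "quantity"},
    "finance": {"order_id", "region", "quantity", "unit_cost"},
    "admin":   {"order_id", "region", "quantity", "unit_cost", "customer_email"},
}

def serve_for_role(df: pd.DataFrame, role: str) -> pd.DataFrame:
    """Return only the columns the role may see; unknown roles get nothing by default."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    return df[[col for col in df.columns if col in allowed]]

# Example: an analyst never receives unit_cost or customer_email.
# orders_for_analyst = serve_for_role(orders, "analyst")
```

In practice this kind of filtering is usually enforced in the warehouse or BI layer rather than in application code, but the principle is the same: access is denied unless it has been explicitly granted.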


Managing Data Storage, and a Word on Cloud Resources

Because the data pipeline relies so heavily on well-designed data storage for efficiency, Data Engineers work closely with Data Architects to design and implement storage solutions. These solutions will differ across industries, individual businesses, and sometimes even specific projects depending on the business needs. While there is no single “correct” storage solution, many companies are now turning to cloud resources to manage their large-scale data needs.

One of the advantages that cloud computing brought to the industry is the ability to manage computing and storage resources separately. In any data solution, both are necessary, but in the cloud they are only loosely coupled. Cloud resources are also more reliable than many on-premises servers (owned exclusively by a single company). The cloud provider looks after server maintenance so you can focus exclusively on use. While this works well for many businesses, it requires careful monitoring of resource utilization. Providers will happily sell as many computing resources as requested. Consequently, cloud resources can become unnecessarily expensive if data processing is not implemented in a logical and efficient manner. Data Engineers are responsible for optimizing pipelines to keep costs down without throttling progress.

A digital pipeline. Image generated by Copilot.

Ensuring Pipeline Reliability

Creating a data pipeline is not a short-term solution. It is not enough to provide a batch of data on Monday morning for Analysts to connect to a Power BI report and build visualizations. Those same Analysts expect to wake up on Tuesday, grab their coffee, and find the report populated with updated data. Therefore, pipelines must be reliable.


An optimized pipeline flows data consistently without exceeding allocated resources. Data Engineers implement code with the lowest computational complexity necessary to achieve business needs. This simplicity keeps resource consumption as low as possible without risking exceeding those resources and "crashing" the pipeline. Data Engineers also regularly validate data flow and accuracy. Part of my daily work as a Data Engineer at Ventagium is to carefully review, orchestrate, and test our code and tools so that our clients wake up on Tuesday (and every other day) and find their reports updated and ready to guide great business decisions.
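One simple reliability check a pipeline might run before a report refreshes is a freshness test: confirm that the latest load is recent enough to trust. The column name loaded_at and the 24-hour threshold below are assumptions for the sketch, not part of any specific tool.

```python
from datetime import datetime, timedelta, timezone
import pandas as pd

MAX_STALENESS = timedelta(hours=24)  # illustrative threshold for a daily refresh

def check_freshness(df: pd.DataFrame, timestamp_col: str = "loaded_at") -> None:
    """Raise if the most recent load is older than the allowed staleness window."""
    latest = pd.to_datetime(df[timestamp_col], utc=True).max()
    age = datetime.now(timezone.utc) - latest
    if age > MAX_STALENESS:
        raise RuntimeError(
            f"Data is stale: last load was {age} ago (limit {MAX_STALENESS})."
        )
```

Wired into an orchestrator's schedule, a check like this fails loudly before a stale table ever reaches a dashboard, rather than after a client has already made a decision on old numbers.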


Data as an Asset


In 2006, mathematician Clive Humby stated that "data is the new oil." There is perhaps no clearer illustration of the analogy than the pipeline itself, whether it carries oil or data. Oil is only valuable once it has been extracted, refined, and transformed into useful components; before refining, it is functionally useless. Likewise, data must be transformed before it can be considered valuable to a business. A data pipeline is the orchestration of programs that extract, transform, and serve data to end users. It is Data Engineers who design, implement, and oversee the process that refines this raw resource, with all its potential value, into a true business asset.


