How to take care of business data so that it works with machine learning models?

The key to being data-driven is data. It is thanks to data that retailers and manufacturers can use artificial intelligence and machine learning algorithms that open up entirely new opportunities for business growth and increased profits. How should data be collected and organized to take full advantage of its potential?

Today’s businesses are undergoing a massive digital transformation involving digitization, automation, the implementation of artificial intelligence, and more. According to the Siemens report “DIGI INDEX 2021: The level of digitization of manufacturing in Poland”, companies have increased spending on digitization, from 6.5 percent in 2020 to more than 9 percent in 2021. The average percentage of profits that companies allocate to digitization has also increased, from 6.48 percent to 9.12 percent per year. To successfully implement innovative solutions, an enterprise needs data. After all, data is what drives the entire digital infrastructure, and for the retail and manufacturing industries it is the fuel for effective and accurate demand and sales forecasts in warehouses and stores.

What kind of data management strategy should a company implement? How should data be collected, and how long should it be stored? What does “valuable data” mean? Here are four key tips for preparing business data for effective use in machine learning models.

Collect data consistently

The greatest business potential lies in data that is complete and consistently collected. In the case of sales data, the optimal horizon for historical data is at least 2-3 years. A sufficiently long sales history, which machine learning (ML) models use to search for statistical regularities, allows for the development of a better solution and improves the quality of forecasts generated by artificial intelligence (AI) algorithms. However, it is worth emphasizing that a shorter history, which is common among emerging retailers, does not preclude the use of advanced ML and AI algorithms. A shorter history affects the modeling process and the results obtained, but still brings companies closer to improving processes and achieving better business results.
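As a rough illustration of the history guideline above, the following sketch measures how much sales history each product has and flags the ones that fall short. The record layout, the SKU names and the two-year threshold are assumptions made for the example, not requirements of any particular platform.

```python
from datetime import date

# Hypothetical sales records: (sku, sale_date, units_sold).
sales = [
    ("SKU-A", date(2020, 1, 6), 12),
    ("SKU-A", date(2023, 3, 14), 9),
    ("SKU-B", date(2022, 11, 2), 4),
    ("SKU-B", date(2023, 3, 14), 7),
]

# Assumed threshold, based on the "at least 2-3 years" guideline.
MIN_HISTORY_DAYS = 2 * 365


def history_span_days(records):
    """Return, per SKU, the number of days between its first and last sale."""
    first, last = {}, {}
    for sku, day, _units in records:
        first[sku] = min(first.get(sku, day), day)
        last[sku] = max(last.get(sku, day), day)
    return {sku: (last[sku] - first[sku]).days for sku in first}


spans = history_span_days(sales)
# SKUs whose history is shorter than the assumed threshold.
short_history = sorted(sku for sku, days in spans.items() if days < MIN_HISTORY_DAYS)
print(short_history)
```

A short history does not exclude a product from modeling. A list like this simply shows where forecasts may need more cautious interpretation.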

Data completeness is also of key importance. The more complete the data, the better the quality of the forecasts. What does this mean in practice? For example, if the supplier of a demand and sales forecasting platform requires pricing information for individual products, each SKU should be entered into the system with its price. This also applies to promotional pricing, return markdowns and bulk sales. Thus, if a retailer runs promotional campaigns or loyalty programs, or offers discretionary discounts to customers, every such activity should be recorded in the data. This is crucial because, if the price on the receipt differs from the one on the official price list, the relationships captured by the model may be erroneous, which disrupts production and distribution planning. Completeness also extends to the future: the same variables must be available for the forecast period. For example, when forecasting demand for a product one month in advance, we should compile historical data (prices, promotions, the point in the season, special holidays) and provide exactly the same variables for the month ahead.
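The two completeness problems described above, products without a price and receipt prices that differ from the price list without a recorded reason, can be checked with very little code. The sketch below is a minimal example. The data structures, field names and tolerance are hypothetical.

```python
# Hypothetical inputs: an official price list, recorded promotions, and receipt lines.
price_list = {"SKU-A": 4.99, "SKU-B": 12.50, "SKU-C": None}  # None = price never entered
promotions = {("SKU-A", "2023-03-14")}  # (sku, day) pairs with a recorded promotion
receipts = [
    {"sku": "SKU-A", "day": "2023-03-14", "unit_price": 3.99},
    {"sku": "SKU-B", "day": "2023-03-14", "unit_price": 11.00},
    {"sku": "SKU-B", "day": "2023-03-15", "unit_price": 12.50},
]


def missing_prices(prices):
    """SKUs that were entered into the system without a price."""
    return sorted(sku for sku, price in prices.items() if price is None)


def unexplained_discounts(lines, prices, promos, tolerance=0.005):
    """Receipt lines whose price differs from the list price with no promotion on record."""
    flagged = []
    for line in lines:
        listed = prices.get(line["sku"])
        if listed is None:
            continue  # already reported by missing_prices
        differs = abs(line["unit_price"] - listed) > tolerance
        if differs and (line["sku"], line["day"]) not in promos:
            flagged.append((line["sku"], line["day"]))
    return flagged


print(missing_prices(price_list))
print(unexplained_discounts(receipts, price_list, promotions))
```

In this example the discounted sale of SKU-A is explained by a recorded promotion, while the discounted sale of SKU-B is not, so only the latter is flagged for review.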

It is worth remembering that even with consistently collected, complete data, ML models cannot predict everything. They do not work like a crystal ball. Models are trained on existing data and on the patterns and regularities learned from it, so they will not predict in advance structural changes in the modeled phenomena, changes in the internal or external environment, or ordinary random events. For example, if a town has so far had one grocery store and two other stores have recently opened, the model will not predict the impact of that competition on sales as of the date of their launch. At the same time, because the models are continuously fed with data, they can, when required, produce analyses that take the new business conditions into account within a day. They are therefore able to respond to changes quickly; they just need high-quality data.

Remember consistency and history

To get started with artificial intelligence and machine learning models and use them in demand and sales forecasting in the retail industry, you need to have valuable data, namely data that is regularly collected, complete and consistent. Ordered and methodically structured data is to ML models like oil to a machine – it makes them work quickly and efficiently.

What data to start with? You will need receipt or sales data, price lists, information on promotions, and so on. You will also need product dictionaries covering both products currently on offer and historical ones. For example, how price will affect sales of a given product in the coming period is determined from how it affected sales in previous months and years, taking into account factors such as the point in the season, consumer trends, weather, price changes and marketing activities. All these variables significantly affect interest in a product, or the lack of it. Importantly, historical data should include not only receipt data (actual sales), but also inventory and store information.

Completing data and adhering to consistent data collection rules can be difficult for retailers, especially when outlets are numerous and spread across several countries. Information, for example on stock levels, is often noted down by employees on pieces of paper and entered into the system with a delay, whenever time allows. This causes many discrepancies that adversely affect forecasting systems based on AI and ML algorithms. It is therefore worthwhile to define a clear and precise data collection policy in order to receive high-quality forecasts as soon as possible.
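Delayed or skipped entries of the kind described above usually show up as gaps in what should be a daily series. A minimal sketch of such a check, assuming one stock snapshot per store per day (the dates and values are invented for the example):

```python
from datetime import date, timedelta

# Hypothetical stock snapshots for one store: day -> units on hand.
stock_levels = {
    date(2023, 3, 1): 40,
    date(2023, 3, 2): 35,
    date(2023, 3, 5): 28,
}


def missing_days(snapshots):
    """Return the days between the first and last snapshot that have no entry."""
    if not snapshots:
        return []
    days = sorted(snapshots)
    total = (days[-1] - days[0]).days + 1
    calendar = (days[0] + timedelta(days=i) for i in range(total))
    return [day for day in calendar if day not in snapshots]


gaps = missing_days(stock_levels)
print(gaps)
```

Whether a gap means "no data was entered" or "the store was closed" is a business question. The check only points to the days that need an answer.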

Don’t turn data into junk

In any business, regardless of industry or size, the quality of the input data determines the quality of the demand and sales forecasts you get as output. No artificial intelligence algorithm or machine learning model, however sophisticated the technology used to create and train it, will find solutions to business challenges if the data is fragmented, incomplete, inconsistent or fudged. In IT terminology, this relationship is summed up by the phrase “garbage in, garbage out.”

Incomplete data stored in different formats or systems is described as “junk.” If data was fed to the systems irregularly and in a non-uniform way, the machine learning model will still work with what it received and try to find relationships between sales volume and the factors influencing it. As a result, the model learns false historical relationships between sales, prices and promotions, and ultimately produces unreliable and inaccurate forecasts. It is therefore worth thinking of data as a garden that requires regular care. Without this care, it becomes overgrown with weeds or invasive plant species and loses its charm. If you take the time to look after it, however, you will be able to enjoy its beauty and take full advantage of the potential it offers.
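One small part of this regular care is bringing values that were recorded in different formats into a single representation. The sketch below normalizes dates, assuming (purely for illustration) that three formats occur in the source systems. Values that match none of them are returned as None so that they can be reviewed rather than silently guessed.

```python
from datetime import datetime

# Assumed formats found in the source systems; extend as needed.
KNOWN_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def normalize_date(raw):
    """Return the date as an ISO 8601 string, or None if no known format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


for value in ("2023-03-14", "14.03.2023", "14/03/2023", "March 14"):
    print(value, "->", normalize_date(value))
```

Note that a value such as 03/04/2023 is read day-first here. If some systems write the month first, the format has to be decided per source system and cannot be inferred from the value itself.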

It is thus crucial for a potential customer of a demand and sales forecasting platform to equip themselves with the right data and define their expectations. They should ask why and for what purpose a particular solution is to be implemented, and what benefits it is expected to bring to the business. Knowing the data and its quality, and defining the business needs, is necessary and also sufficient to start working with mathematical modeling.

Generalize (with our help)

From a technical point of view, a machine learning model is a program that generates predictions about the future based on relationships and rules established from historical data. It is optimized to automatically calculate values that have been carefully selected beforehand to address the most important business challenges.

A key role in creating and training models is played by the data science team. While larger retail chains increasingly employ suitably qualified staff, small and medium-sized retailers often lack personnel with such competencies. This, however, does not rule out their chance to benefit from the potential of AI and ML. By choosing a sales and demand forecasting platform such as Occubee, they receive the know-how and support of the system provider, both during implementation and throughout the rest of the lifecycle, while remaining under the care of experienced data scientists.

The data scientist is responsible for feeding all of the client’s data into the model and selecting the information that predicts the future with the best results. After all, not every piece of information will be useful from the point of view of the ML model, although every data point and variable should be carefully collected. A process of selection and… generalization of information is necessary.

Paradoxically, information that is too detailed is not useful at all. Only a certain level of generality has a significant impact on predictions. If the model is fed with overly nuanced data, it can become overfitted (“overlearned”), which adversely affects the output. Training an ML model can be compared to studying for an exam. When revising for a math exam, we do not memorize each problem word for word together with the values given in it. If we do, we will fail as soon as the exam contains different values. Instead, we should learn the general relationships that hold under all conditions. The correct learning process involves generalizing the incoming information and ignoring what is random or insignificant, which is exactly the process we want to replicate in ML models.
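The exam analogy can be reproduced on a toy scale. The sketch below compares a "model" that memorizes every training example with one that learns a single general relationship (a straight line fitted by ordinary least squares). The prices and sales figures are invented, and the fallback used by the memorizing model for unseen prices is an assumption made for the example.

```python
# Hypothetical training data: (price, units sold). Values are illustrative only.
train = [(2.0, 81), (2.5, 74), (3.0, 71), (3.5, 64), (4.0, 61)]
# Held-out observations at prices that do not appear in the training data.
held_out = [(2.2, 78), (3.8, 62)]

mean_units = sum(units for _, units in train) / len(train)

# "Memorizing" model: stores every training answer verbatim.
# Assumption: for an unseen price it can only fall back to the training mean.
memory = dict(train)


def predict_memorized(price):
    return memory.get(price, mean_units)


# "Generalizing" model: one straight line fitted by ordinary least squares.
mean_price = sum(price for price, _ in train) / len(train)
slope = (
    sum((p - mean_price) * (u - mean_units) for p, u in train)
    / sum((p - mean_price) ** 2 for p, _ in train)
)
intercept = mean_units - slope * mean_price


def predict_line(price):
    return intercept + slope * price


def mean_abs_error(predict, data):
    """Average absolute difference between predicted and actual units."""
    return sum(abs(predict(p) - u) for p, u in data) / len(data)


for name, model in (("memorized", predict_memorized), ("line", predict_line)):
    print(name, mean_abs_error(model, train), mean_abs_error(model, held_out))
```

On this toy data the memorizing model is perfect on the examples it has seen and clearly worse on the held-out prices, while the fitted line makes small errors in both cases. Real forecasting models are far more complex, but the trade-off is the same.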

Another example: if we decide to take out a bank loan, the bank needs our data. The decision to grant the loan is made on the basis of our age or occupation, among other things. If the bank made the credit decision dependent e.g. only on gender – this would be too general a category. If, on the other hand, it made the decision dependent on a specific name – the level of data would be too specific. 

That is why model training involves applying analytical processes to prices, which are sometimes rounded to a chosen precision (e.g., to one decimal place) in order to balance generalization against detail. The level of detail of price and promotion information has to be tested during training, because it may turn out that a minimal change in price does not affect the forecast at all, while keeping a very high level of detail makes the results markedly worse. It is also worth noting that in every ML model training process there comes a point beyond which forecasting results remain unchanged, even after many months of adjusting the settings. However, it is difficult to estimate in advance when that limit will be reached.
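To make the idea of testing price granularity more concrete, the sketch below rounds a handful of observed prices to different step sizes and counts how many distinct price levels remain. The prices and the candidate steps are assumptions chosen for illustration. In practice the right step would be selected by comparing forecast quality, not by counting levels alone.

```python
from collections import Counter

# Hypothetical observed shelf prices for one product.
observed_prices = [4.99, 4.97, 5.01, 5.49, 5.52, 5.99, 6.02, 4.96]


def round_to_step(price, step):
    """Round a price to the nearest multiple of `step` (e.g. 0.10 or 1.00)."""
    # Work in integer cents to limit floating point drift.
    cents = round(price * 100)
    step_cents = round(step * 100)
    # Note: Python's round() resolves exact ties to the nearest even integer.
    return round(cents / step_cents) * step_cents / 100


def price_levels(prices, step):
    """Count how many observations fall on each rounded price level."""
    return Counter(round_to_step(price, step) for price in prices)


for step in (0.01, 0.10, 1.00):
    levels = price_levels(observed_prices, step)
    print(step, len(levels), dict(levels))
```

With the finest step every observation is its own price level, so the model sees each price only once. With coarser steps several observations share a level, which gives the model more examples per level at the cost of detail.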

Some businesses operate on rich data sources and environments, from which it takes longer to extract information about product features, special offers, events, holidays or historical sales. When the information at hand consists mainly of historical sales, model development takes less time. While it is difficult to determine how long it will take to obtain satisfactory demand and sales forecasts, we can identify the element that determines their quality. This element is, of course, the data. The data is what starts the entire process.

Machine learning and statistics work best when many independent, minor factors come together and no individual decision significantly affects the final number of sales transactions at the end of the day. Historical data that is consistent, valuable, regularly collected and rich in sales details will therefore have a major impact on the accuracy of forecasts, and will determine the power of artificial intelligence and machine learning to drive business growth.
