DATA IS KING

Data is king in the world of machine learning (ML). Many know the phrase: good data in, good data out. So what happens when data becomes fragmented across the layers and applications of machine learning? Handling and processing data in inconsistent ways leads to discontinuous and obsolete data sets. This counteracts the benefits of aggregating and storing data, because the processing quickly yields skewed results. According to Data Scientist Joe Doliner, ML configurations need six qualities: limitless, versioned, lineal, accessible, parallel, and driving.
- Limitless: Data storage must be effectively indefinite in scale. If it isn’t, you wind up hitting its limit and fragmenting large and small data sets.
- Versioned: Data changes constantly, and if your storage system can’t efficiently store changes, you wind up with temporal fragmentation.
- Lineal: As data is transformed into new models and data, this relationship is itself data, and the system must track it quickly and automatically. It should not be something your team has to think about; it should just happen.
- Accessible: If the system doesn’t expose the data through standard data interfaces, you wind up copying the data to access it in different ways.
- Parallel: If the system doesn’t process the data in parallel and merge results, you wind up with another runtime storage layer.
- Driving: If the data doesn’t drive the processing, another system does, and that system can easily get out of sync with the data itself.
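Two of these qualities, versioned and lineal, are easier to picture with a small example. What follows is only a minimal sketch, assuming a toy in-memory store: the names `VersionedStore`, `commit`, `lineage`, and `derive` are hypothetical and are not taken from Doliner’s article or from any real product. Each commit is immutable and identified by a hash of its contents, and each derived data set records the parent and the transformation that produced it, so nobody on the team has to track that by hand.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class VersionedStore:
    """Hypothetical toy store: commits are immutable and content-addressed."""

    commits: dict = field(default_factory=dict)

    def commit(self, records, parents=(), transform=None):
        # Hash the data together with its provenance, so a change to either
        # one produces a new version id instead of overwriting an old one.
        entry = {
            "records": list(records),
            "parents": list(parents),
            "transform": transform,
        }
        payload = json.dumps(entry, sort_keys=True).encode("utf-8")
        version_id = hashlib.sha256(payload).hexdigest()[:12]
        self.commits[version_id] = entry
        return version_id

    def lineage(self, version_id):
        """Return version_id followed by its ancestors, nearest first."""
        chain = []
        queue = [version_id]
        while queue:
            current = queue.pop(0)
            if current in chain:
                continue
            chain.append(current)
            queue.extend(self.commits[current]["parents"])
        return chain


def derive(store, parent_id, fn):
    """Apply fn to every record of a version and commit the result.

    The parent id and the function name are stored automatically,
    which is the 'lineal' quality in miniature.
    """
    records = [fn(r) for r in store.commits[parent_id]["records"]]
    return store.commit(records, parents=[parent_id], transform=fn.__name__)


def double(x):
    return x * 2


store = VersionedStore()
raw_v1 = store.commit([1, 2, 3])        # first version of the raw data
raw_v2 = store.commit([1, 2, 3, 4])     # new data arrives; v1 is kept
doubled = derive(store, raw_v2, double)  # derived data remembers its source

print(store.lineage(doubled))
```

Because the old version stays in place when new data arrives, and the derived set points back to exactly the version it came from, a skewed result can be traced to its inputs rather than guessed at.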
The management of data is key to its practical application. Doliner’s qualities outline rules to ensure that when data is collected, stored, and used, the processing does not skew the outcome. We might alter the phrase accordingly: good data management, good data out.
Reference:
“The Fragmentation of Machine Learning,” Joe Doliner, 2021.