Concepts#
Machine learning pipelines, or even complex data pipelines, are made up of several components. For instance:
Keeping track of data flow in and out of these components can be tedious, especially if multiple people are collaborating on the same end-to-end pipeline.This is because in ML pipelines, different artifacts are produced (inputs and outputs) when the same component is run more than once.
Knowing data flow is a precursor to debugging issues in data pipelines. mltrace
also determines whether components of pipelines are stale.
Data model#
The two prominent client-facing abstractions are the Component
and ComponentRun
abstractions.
Test
#
The Test
abstraction represents some reusable computation to perform on component inputs and outputs. Defining a Test
is similar to writing a unit test:
from mltrace import Test
class OutliersTest(Test):
def __init__(self):
super().__init__(name='outliers')
def testSomething(self; df: pd.DataFrame):
....
def testSomethingElse(self; df: pd.DataFrame):
....
Tests can be defined and passed to components as arguments, as described in the section below.
mltrace.Component
#
The Component
abstraction represents a stage in a pipeline and its static metadata, such as:
name
description
owner
tags (optional list of string values to reference the component by)
tests
Tags are generally useful when you have multiple components in a higher-level stage. For example, ETL computation could consist of different components such as “cleaning” or “feature generation.” You could create the “cleaning” and “feature generation” components with the tag etl
and then easily query component runs with the etl
tag in the UI.
Components have a life-cycle:
c = Component(...)
: construction of the component objectc.beforeTests
: a list ofTests
to run before the component is runc.run
: a decorator for a user-defined function that represents the component’s computationc.afterTests
: a list ofTests
to run after the component is run
Putting it all together, we can define our own component:
from mltrace import Component
class Featuregen(Component):
def __init__(self, beforeTests=[], afterTests=[OutliersTest]):
super().__init__(
name="featuregen",
owner="spark-gymnast",
description="Generates features for high tip prediction problem",
tags=["nyc-taxicab"],
beforeTests=beforeTests,
afterTests=afterTests,
)
And in our main application code, we can decorate any feature generation function:
@Featuregen().run
def generateFeatures(df: pd.DataFrame):
# Generate features
df = ...
return df
See the next page for a more in-depth tutorial on instrumenting a pipeline.
mltrace.ComponentRun
#
The ComponentRun
abstraction represents an instance of a Component
being run. Think of a ComponentRun
instance as an object storing dynamic metadata for a Component
, such as:
start timestamp
end timestamp
inputs
outputs
git hash
source code
dependencies (you do not need to manually declare)
If you dig into the codebase, you will find another abstraction, the IOPointer
. Inputs and outputs to a ComponentRun
are stored as IOPointer
objects. You do not need to explicitly create an IOPointer
– the abstraction exists so that mltrace
can easily find and store dependencies between ComponentRun
objects.
You will not need to explicitly define all of these variables, nor do you have to create instances of a ComponentRun
yourself. See the next section for logging functions and an example.
Staleness#
We define a component run as “stale” if it may need to be rerun. Currently, mltrace
detects two types of staleness in component runs:
A significant number of days (default 30) have passed between when a component run’s inputs were generated and the component is run
At the time a component is run, its dependencies have fresher runs that began before the component run started
We are working on “data drift” as another measure of staleness.
Reviewing erroneous outputs#
Oftentimes there is a bug or error in some output of a pipeline that surfaces after the output has been produced. ML and data bugs are extra elusive because it can take a nontrivial number of mispredicted or buggy outputs to indicate that there is actually an issue with the pipeline. Given a set of erroneous outputs, it can be challenging to know where to begin debugging! Fortunately, mltrace
can help with this.
The idea here is to identify the common ComponentRun
s used in producing the erroneous outputs, as these might provide a good suggestion for what component to debug first or artifacts (inputs and outputs) to dive into. See steps on how to use the reviewer tool in the Querying section.