Machine learning pipelines, and even complex data pipelines, are made up of several components, such as data cleaning, feature generation, model training, and inference.
Keeping track of the data flowing in and out of these components can be tedious, especially when multiple people collaborate on the same end-to-end pipeline. This is because ML pipelines produce different artifacts (inputs and outputs) each time the same component is run. Knowing the data flow is a precursor to debugging issues in data pipelines.
mltrace also determines whether components of pipelines are stale.
The prominent client-facing abstractions are the Test, Component, and ComponentRun abstractions. The Test abstraction represents some reusable computation to perform on component inputs and outputs. Defining a Test is similar to writing a unit test:
import pandas as pd
from mltrace import Test

class MyTest(Test):  # hypothetical name; tests are written as methods on a Test subclass
    def testSomething(self, df: pd.DataFrame):
        ...

    def testSomethingElse(self, df: pd.DataFrame):
        ...
Tests can be defined and passed to components as arguments, as described in the section below.
The Component abstraction represents a stage in a pipeline and its static metadata, such as:
tags (optional list of string values to reference the component by)
Tags are generally useful when you have multiple components in a higher-level stage. For example, ETL computation could consist of different components such as “cleaning” or “feature generation.” You could create the “cleaning” and “feature generation” components with the tag etl and then easily query component runs with the etl tag in the UI.
Components have a life-cycle:
c = Component(...): construction of the component object
c.beforeTests: a list of Tests to run before the component is run
c.run: a decorator for a user-defined function that represents the component’s computation
c.afterTests: a list of Tests to run after the component is run
Putting it all together, we can define our own component (the class name here is illustrative):

from mltrace import Component

class Featuregen(Component):
    def __init__(self, beforeTests=[], afterTests=[OutliersTest]):
        # OutliersTest is a Test subclass defined elsewhere
        super().__init__(name="featuregen",
                         description="Generates features for high tip prediction problem",
                         beforeTests=beforeTests, afterTests=afterTests)
And in our main application code, we can decorate any feature generation function with the component’s run method:

@Featuregen().run
def generateFeatures(df: pd.DataFrame) -> pd.DataFrame:
    # Generate features
    df = ...
    return df
See the next page for a more in-depth tutorial on instrumenting a pipeline.
The ComponentRun abstraction represents an instance of a Component being run. Think of a ComponentRun instance as an object storing dynamic metadata for a Component, such as:
dependencies (you do not need to declare these manually)
If you dig into the codebase, you will find another abstraction, the IOPointer. Inputs and outputs to a ComponentRun are stored as IOPointer objects. You do not need to explicitly create an IOPointer; the abstraction exists so that mltrace can easily find and store dependencies between component runs.
You will not need to explicitly define all of these variables, nor do you have to create instances of a ComponentRun yourself. See the next section for logging functions and an example.
We define a component run as “stale” if it may need to be rerun. Currently, mltrace detects two types of staleness in component runs:
A significant number of days (default 30) have passed between when a component run’s inputs were generated and when the component was run
At the time a component is run, its dependencies have fresher runs that began before the component run started
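The two checks above can be sketched in plain Python. This is illustrative logic only, not the mltrace implementation; the function signature, parameter names, and the pairing of “run we used” vs. “latest run” per dependency are all assumptions for the sake of the example:

```python
from datetime import datetime, timedelta

STALENESS_WINDOW = timedelta(days=30)  # default threshold mentioned above

def is_stale(run_start: datetime,
             inputs_generated: datetime,
             dep_runs_used: list[datetime],
             dep_runs_latest: list[datetime]) -> bool:
    """Illustrative staleness check mirroring the two rules above.

    dep_runs_used[i] is the start time of the dependency run this component
    actually consumed; dep_runs_latest[i] is the start time of the freshest
    run of that same dependency.
    """
    # Rule 1: inputs were generated too long before the component ran.
    if run_start - inputs_generated > STALENESS_WINDOW:
        return True
    # Rule 2: some dependency has a fresher run that began before this
    # component run started, i.e. we consumed an out-of-date artifact.
    for used, latest in zip(dep_runs_used, dep_runs_latest):
        if used < latest < run_start:
            return True
    return False
```

For example, a run whose inputs are a day old and whose dependencies have no fresher runs is not stale, while a run consuming 40-day-old inputs trips the first rule.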
We are working on “data drift” as another measure of staleness.
Reviewing erroneous outputs
Oftentimes, a bug or error in some output of a pipeline surfaces only after the output has been produced. ML and data bugs are especially elusive because it can take a nontrivial number of mispredicted or buggy outputs to indicate that there is actually an issue with the pipeline. Given a set of erroneous outputs, it can be challenging to know where to begin debugging. Fortunately, mltrace can help with this.
The idea here is to identify the common ComponentRuns used in producing the erroneous outputs, as these might suggest which component to debug first or which artifacts (inputs and outputs) to dive into. See the Querying section for steps on how to use the reviewer tool.
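As a sketch of this idea (our own illustration, not the mltrace reviewer implementation; the run IDs are made up): given the component runs that produced each erroneous output, counting how many erroneous outputs each run touched surfaces the likeliest common culprit:

```python
from collections import Counter

def rank_common_runs(runs_per_output: list[list[str]]) -> list[tuple[str, int]]:
    """Rank component run IDs by how many erroneous outputs they touched."""
    counts = Counter()
    for runs in runs_per_output:
        counts.update(set(runs))  # count each run at most once per output
    return counts.most_common()

# Three buggy outputs and the (hypothetical) component runs behind each:
ranked = rank_common_runs([
    ["cleaning_v2", "featuregen_v5", "inference_v9"],
    ["cleaning_v2", "featuregen_v5", "inference_v10"],
    ["cleaning_v2", "featuregen_v6", "inference_v11"],
])
# "cleaning_v2" is common to all three outputs, so it is the first place to look
```

The run shared by the most erroneous outputs comes first, which is exactly the “common ComponentRuns” signal described above.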