.. _concepts:

Concepts
========

Machine learning pipelines, or even complex data pipelines, are made up of several *components.* For instance:

.. image:: images/toy-ml-pipeline-diagram.svg

Keeping track of data flow in and out of these components can be tedious, especially if multiple people are collaborating on the same end-to-end pipeline. This is because in ML pipelines, *different* artifacts are produced (inputs and outputs) when the *same* component is run more than once. Knowing how data flows is a precursor to debugging issues in data pipelines. ``mltrace`` also determines whether components of a pipeline are stale.

Data model
^^^^^^^^^^

The two prominent client-facing abstractions are :py:class:`~mltrace.Component` and :py:class:`~mltrace.ComponentRun`.

:py:class:`~mltrace.Test`
"""""""""""""""""""""""""""

The ``Test`` abstraction represents some reusable computation to perform on component inputs and outputs. Defining a ``Test`` is similar to writing a unit test:

.. code-block:: python

    import pandas as pd

    from mltrace import Test


    class OutliersTest(Test):
        def __init__(self):
            super().__init__(name='outliers')

        def testSomething(self, df: pd.DataFrame):
            ...

        def testSomethingElse(self, df: pd.DataFrame):
            ...

Tests can be defined and passed to components as arguments, as described in the section below.

:py:class:`~mltrace.Component`
""""""""""""""""""""""""""""""""

The ``Component`` abstraction represents a stage in a pipeline and its static metadata, such as:

* name
* description
* owner
* tags (an optional list of string values to reference the component by)
* tests

Tags are generally useful when you have multiple components in a higher-level stage. For example, ETL computation could consist of different components such as "cleaning" or "feature generation." You could create the "cleaning" and "feature generation" components with the tag ``etl`` and then easily query component runs with the ``etl`` tag in the UI.

Components have a lifecycle:

* ``c = Component(...)``: construction of the component object
* ``c.beforeTests``: a list of ``Tests`` to run before the component is run
* ``c.run``: a decorator for a user-defined function that represents the component's computation
* ``c.afterTests``: a list of ``Tests`` to run after the component is run

Putting it all together, we can define our own component:

.. code-block:: python

    from mltrace import Component


    class Featuregen(Component):
        def __init__(self, beforeTests=[], afterTests=[OutliersTest]):
            super().__init__(
                name="featuregen",
                owner="spark-gymnast",
                description="Generates features for high tip prediction problem",
                tags=["nyc-taxicab"],
                beforeTests=beforeTests,
                afterTests=afterTests,
            )

And in our main application code, we can decorate any feature generation function:

.. code-block:: python

    @Featuregen().run
    def generateFeatures(df: pd.DataFrame):
        # Generate features
        df = ...
        return df

See the next page for a more in-depth tutorial on instrumenting a pipeline.

:py:class:`~mltrace.ComponentRun`
"""""""""""""""""""""""""""""""""""

The ``ComponentRun`` abstraction represents an instance of a ``Component`` being run. Think of a ``ComponentRun`` instance as an object storing *dynamic* metadata for a ``Component``, such as:

* start timestamp
* end timestamp
* inputs
* outputs
* git hash
* source code
* dependencies (which you do not need to declare manually)

If you dig into the codebase, you will find another abstraction, the :py:class:`~mltrace.IOPointer`. Inputs and outputs to a ``ComponentRun`` are stored as ``IOPointer`` objects. You do not need to explicitly create an ``IOPointer`` -- the abstraction exists so that ``mltrace`` can easily find and store dependencies between ``ComponentRun`` objects.

You will not need to explicitly define all of these variables, nor do you need to create instances of a ``ComponentRun`` yourself. See the next section for logging functions and an example.

.. _Staleness Overview:

Staleness
^^^^^^^^^

We define a component run as "stale" if it may need to be rerun. Currently, ``mltrace`` detects two types of staleness in component runs:

1. A significant number of days (30 by default) have passed between when a component run's inputs were generated and when the component was run (see the sketch below)
2. At the time a component is run, its dependencies have fresher runs that began before the component run started

We are working on "data drift" as another measure of staleness.
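To make the first criterion concrete, here is a minimal, purely illustrative sketch of an age-based check. The ``is_stale_by_age`` helper and its timestamp arguments are hypothetical and not part of the ``mltrace`` API:

.. code-block:: python

    from datetime import datetime, timedelta
    from typing import List


    def is_stale_by_age(
        input_created_at: List[datetime],
        run_started_at: datetime,
        threshold_days: int = 30,
    ) -> bool:
        """Flag a run as stale if any input is older than the threshold."""
        threshold = timedelta(days=threshold_days)
        return any(run_started_at - created > threshold for created in input_created_at)


    # Example: an input generated 45 days before the run triggers the flag.
    old_input = datetime(2021, 1, 1)
    run_start = datetime(2021, 2, 15)
    assert is_stale_by_age([old_input], run_start)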
.. _Reviewing Overview:

Reviewing erroneous outputs
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Oftentimes a bug or error in some output of a pipeline surfaces only after the output has been produced. ML and data bugs are especially elusive because it can take a nontrivial number of mispredicted or buggy outputs to indicate that there is actually an issue with the pipeline. Given a set of erroneous outputs, it can be challenging to know where to begin debugging! Fortunately, ``mltrace`` can help with this.

The idea here is to identify the ``ComponentRun`` objects common to producing the erroneous outputs, as these might provide a good suggestion for which component to debug first or which artifacts (inputs and outputs) to dive into. See steps on how to use the reviewer tool in the :ref:`querying` section.
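As a rough illustration of this idea (not of the reviewer tool itself), the sketch below counts how often each component run appears across the traces of a set of flagged outputs and ranks the most common ones. The ``traces`` mapping and run identifiers are hypothetical; in practice this information would come from ``mltrace``:

.. code-block:: python

    from collections import Counter
    from typing import Dict, List, Tuple


    def rank_common_component_runs(
        traces: Dict[str, List[str]], top_k: int = 3
    ) -> List[Tuple[str, int]]:
        """Rank component run IDs by how many erroneous outputs they touch."""
        counts = Counter(run_id for runs in traces.values() for run_id in set(runs))
        return counts.most_common(top_k)


    # Hypothetical traces for three erroneous outputs.
    traces = {
        "output_a": ["etl_run_7", "featuregen_run_3", "train_run_2"],
        "output_b": ["etl_run_7", "featuregen_run_4", "train_run_2"],
        "output_c": ["etl_run_7", "featuregen_run_4", "inference_run_9"],
    }
    print(rank_common_component_runs(traces))  # etl_run_7 touches all three outputs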