Logging#

mltrace functions can be added to existing Python files to log component and run information to the server. Logging can be done via a decorator or an explicit Python API. All logging functions are defined in the mltrace module, which you can install via pip:

pip install mltrace

For this example, we will add logging functions to a hypothetical cleaning.py that loads raw data and cleans it. In your Python file, before you call any logging functions, you will need to make sure you are connected to your server. You can easily do so by setting the environment variable DB_SERVER to your server’s IP address:

export DB_SERVER=SERVER_IP_ADDRESS

where SERVER_IP_ADDRESS is your server’s IP address or “localhost” if you are running locally. You can also call mltrace.set_address(SERVER_IP_ADDRESS) in your Python script instead if you do not want to set the environment variable.
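
For example, a minimal script-level setup using set_address (here pointing at a local server) might look like the following:

import mltrace

# Point the mltrace client at the tracking server ("localhost" if running locally)
mltrace.set_address("localhost")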

If you plan to use the auto-logging functionality for component run inputs and outputs (off by default), you will need to set the environment variable SAVE_DIR to the directory where you want versions of your inputs and outputs to be saved. The default is a .mltrace folder in the user directory.
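
For example (the path below is just an illustrative placeholder):

export SAVE_DIR=/tmp/mltrace_artifacts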

Component creation#

For runs of components to be logged, you must first create the components themselves using mltrace.Component. You can subclass the main Component class if you want to make a custom Component, for example:

from mltrace import Component

class Cleaning(Component):
    def __init__(self, name, owner, tags=[], beforeTests=[], afterTests=[]):

        super().__init__(
            name="cleaning_" + name,
            owner=owner,
            description="Basic component to clean raw data",
            tags=tags,
            beforeTests=beforeTests,
            afterTests=afterTests,
        )

Components are intended to be defined once and reused throughout your application. You can define them in a separate file or folder and import them into your main Python application. If you do not want a custom component, you can also just use the default Component class, as shown below.
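
If you do reuse a custom component like Cleaning above, the import pattern might look like the following sketch; the module name components and the argument values are hypothetical:

# components.py is a hypothetical module that holds your Component subclasses
from components import Cleaning

# Instantiate the reusable component once for this part of the pipeline;
# with the constructor above, the registered name becomes "cleaning_taxi_data"
c = Cleaning(name="taxi_data", owner="plumber")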

Logging runs#

Decorator approach#

Suppose we have a function clean in our cleaning.py file:

import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Do some cleaning
    clean_df = ...
    return clean_df

We can include the run() decorator such that every time this function is run, dynamic information is logged:

from mltrace import Component
import pandas as pd

c = Component(
    name="cleaning",
    owner="plumber",
    description="Cleans raw NYC taxicab data",
)

@c.run(auto_log=True)
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Do some cleaning
    clean_df = ...
    return clean_df

We will refer to clean_data as the decorated component run function. The auto_log parameter is set to False by default, but you can set it to True to automatically log inputs and outputs. If auto_log is True, mltrace will save and log paths to any dataframes, variables with “data” or “model” in their names, and any other variables greater than 1MB. mltrace saves to the directory defined by the environment variable SAVE_DIR; if SAVE_DIR is not set, mltrace saves to a .mltrace folder in the user directory.

If you do not set auto_log to True, you will need to manually specify your input and output variables via the input_vars and output_vars parameters of the run() decorator. These names correspond to variables in the decorated function, and their values at the time of return are logged. The start and end times, git hash, and source code snapshot are automatically captured, and dependencies are automatically inferred from the values of the input variables.
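
As a sketch of the manual approach (the exact keyword usage here is inferred from the input_vars and output_vars parameters described above, and the variable names are illustrative):

from mltrace import Component
import pandas as pd

c = Component(
    name="cleaning",
    owner="plumber",
    description="Cleans raw NYC taxicab data",
)

# Log the values of df and clean_df (at return time) as this run's
# input and output, instead of relying on auto_log.
@c.run(input_vars=["df"], output_vars=["clean_df"])
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Do some cleaning
    clean_df = ...
    return clean_df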

Python approach#

You can also create a ComponentRun instance yourself and log it using mltrace.log_component_run() for greater flexibility, for example:

from datetime import datetime
from mltrace import ComponentRun
from mltrace import get_git_hash, log_component_run
import pandas as pd

def clean_data(filename: str) -> str:
    # Create ComponentRun object
    cr = ComponentRun("cleaning")
    cr.set_start_timestamp()
    cr.add_input(filename)
    cr.git_hash = get_git_hash() # Sets git hash, not source code snapshot!

    df = pd.read_csv(filename)
    # Do some cleaning
    ...
    # Save cleaned dataframe
    clean_version = filename[:-4] + f'_clean_{datetime.utcnow().strftime("%m%d%Y%H%M%S")}.csv'
    df.to_csv(clean_version)

    # Finish logging
    cr.set_end_timestamp()
    cr.add_output(clean_version)
    log_component_run(cr)

    return clean_version

Note that in log_component_run(), set_dependencies_from_inputs is True by default. You can set it to False if you want to manually specify the components that this component run depends on. To do so, call set_upstream() with the upstream component name or list of component names before you call log_component_run().
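
For instance, assuming an upstream component registered under the hypothetical name "ingestion", manually specifying the dependency inside clean_data might look like this sketch:

    # Declare that this run depends on the "ingestion" component (hypothetical name)
    cr.set_upstream("ingestion")

    # Disable automatic dependency inference since the dependency is set manually
    log_component_run(cr, set_dependencies_from_inputs=False)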

Testing#

You can define Tests, or reusable blocks of computation, to run before and after components are run. To define a test, you need to subclass the Test class. Defining a test is similar to defining a Python unittest, for example:

from mltrace import Test
import pandas as pd

class OutliersTest(Test):
    def __init__(self):
        super().__init__(name='outliers')

    def testComputeStats(self, df: pd.DataFrame):
        # Get numerical columns
        num_df = df.select_dtypes(include=["number"])

        # Compute stats
        stats = num_df.describe()
        print("Dataframe statistics:")
        print(stats)

    def testZScore(
        self,
        df: pd.DataFrame,
        stdev_cutoff: float = 5.0,
        threshold: float = 0.05,
    ):
        """
        Checks to make sure there are no outliers using z score cutoff.
        """
        # Get numerical columns
        num_df = df.select_dtypes(include=["number"])

        z_scores = (
            (num_df - num_df.mean(axis=0, skipna=True))
            / num_df.std(axis=0, skipna=True)
        ).abs()

        if (z_scores > stdev_cutoff).to_numpy().sum() > threshold * len(df):
            print(
                f"Number of outliers: {(z_scores > stdev_cutoff).to_numpy().sum()}"
            )
            print(f"Outlier threshold: {threshold * len(df)}")
            raise Exception("There are outlier values!")

Any function you want to execute as a test must have a name prefixed with test in lowercase, like testSomething. Arguments to test functions run in beforeTests must appear in the decorated component run function's signature; arguments to test functions run in afterTests must be defined as variables somewhere in the decorated component run function. You can integrate tests into components in the constructor:

from mltrace import Component
import pandas as pd

c = Component(
    name="cleaning",
    owner="plumber",
    description="Cleans raw NYC taxicab data",
    beforeTests=[OutliersTest],
)

@c.run(auto_log=True)
def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Do some cleaning
    clean_df = ...
    return clean_df

At runtime, the OutliersTest test functions will run before the clean_data function. Note that all arguments to the test functions executed in beforeTests must be arguments to clean_data. All arguments to the test functions executed in afterTests must be variables defined somewhere in clean_data.
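
To illustrate the afterTests case, here is a sketch of a hypothetical test whose argument name matches a variable defined inside clean_data; SchemaTest and testNoMissingValues are illustrative names, not part of mltrace:

from mltrace import Component, Test
import pandas as pd

class SchemaTest(Test):
    def __init__(self):
        super().__init__(name="schema")

    # clean_df must be a variable defined inside clean_data,
    # since this test runs after the component run function.
    def testNoMissingValues(self, clean_df: pd.DataFrame):
        if clean_df.isnull().values.any():
            raise Exception("Cleaned dataframe contains missing values!")

# OutliersTest is the test class defined earlier in this section
c = Component(
    name="cleaning",
    owner="plumber",
    description="Cleans raw NYC taxicab data",
    beforeTests=[OutliersTest],
    afterTests=[SchemaTest],
)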

End-to-end example#

To see an example of mltrace integrated into a Python pipeline, check out this tutorial. The full pipeline with mltrace integrations is defined in solutions/main.py.