modelstore¶
modelstore
is a machine learning model registry Python library.
By saving your models using modelstore
, you can:
Version your models;
Upload model artefacts to your choice of storage;
Collect meta data about the models your uploading;
Control models’ states;
Load models straight from storage back into memory
Installing the modelstore library¶
This library can be installed via pip:
pip install modelstore
You can find the latest version here: modelstore on Pypi.
You can also build the library directly from source by checking it out from Github.
Quick Start¶
Install using pip, import in your code¶
The model store library is available via Pypi:
pip install modelstore
In your code, import ModelStore
with:
from modelstore import ModelStore
Create a model store instance and point it to your storage¶
The model store library supports storing models to blob storage across different cloud providers:
A file system;
Google Cloud storage buckets
AWS s3 buckets
Azure blob storage containers
MinIO object storage
Create a model store instance by using one of the following factory methods.
File System Storage
model_store = ModelStore.from_file_system(root_directory="/path/to/directory")
Google Cloud Storage Bucket
model_store = ModelStore.from_gcloud(
project_name="my-project",
bucket_name="my-bucket",
)
AWS s3 Bucket
model_store = ModelStore.from_aws_s3(
bucket_name="my-bucket",
)
Azure Blob Storage
model_store = ModelStore.from_azure(container_name="my-container-name")
Upload a model to the model store¶
Model store has an upload()
function that will create an archive containing your model and upload it to your storage.
Whenever you upload a model, you need to specify which domain it belongs to. A “domain” is a string that model store uses to group several models that are for the same end-usage together.
For example, let’s say you’ve trained a scikit-learn model (which is stored in a variable called clf
) that is going to be used in a spam classifier domain.
To store the model, use:
meta_data = model_store.upload("spam-detection", model=clf)
The upload()
function returns a dictionary containing meta data about your model - including the id that has been assigned to it, which is in meta_data["model"]["model_id"]
.
Load a model from the model store¶
Once a model has been stored, you can load it straight from storage back into memory using model store’s load()
function.
clf = model_store.load("spam-detection", model_id="abcd-abcd-abdc")
Uploading a scikit-learn model¶
This example is based on the GradientBoostingRegressor tutorial from the scikit-learn website:
import json
import os
from sklearn.datasets import load_diabetes
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from modelstore import ModelStore
def train():
diabetes = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
diabetes.data, diabetes.target, test_size=0.1, random_state=13
)
params = {
"n_estimators": 500,
"max_depth": 4,
"min_samples_split": 5,
"learning_rate": 0.01,
"loss": "ls",
}
reg = GradientBoostingRegressor(**params)
reg.fit(X_train, y_train)
# Skipped for brevity (but important!) evaluate the model
return reg
if __name__ == "__main__":
# In this demo, we train a GradientBoostingRegressor
# using the same approach described on the scikit-learn website.
# Replace this with the code to train your own model
model = train()
# The modelstore library currently assumes you have already created
# a Cloud Storage bucket and will raise an exception if it doesn't exist
# This example assumes that you have the GCP project name and bucket id
# saved as environment variables - replace the os.environ below with
# your values
model_store = ModelStore.from_gcloud(
project_name=os.environ["GCP_PROJECT_ID"],
bucket_name=os.environ["GCP_BUCKET_NAME"],
)
# Upload the model
meta_data = model_store.upload(
"sklearn-diabetes-boosting-demo",
model=model
)
# The upload returns meta-data about the model that was uploaded
# This meta-data has also been sync'ed into the cloud storage
# bucket
print("✅ Finished uploading model!")
print(json.dumps(meta_data, indent=4))
# Download the model back!
target = f"downloaded-{model_type}-model"
os.makedirs(target, exist_ok=True)
model_path = model_store.download(
local_path=target,
domain=model_domain,
model_id=meta["model"]["model_id"],
)
print(f"⤵️ Downloaded the model back to {model_path}")
Key Concepts¶
The modelstore
library is built around a few key concepts.
Model Archive¶
When you upload a model, an artifacts.tar.gz
file is created and then uploaded to your storage. This archive contains:
Files that are dumps from your model,
A
"python-info.json"
file that enumerates the version of the Python library of the model you are exporting.
Model Meta-data¶
The upload()
function returns a dictionary containing meta-data about the model, which includes an id for your model.
The meta-data includes:
A unique id for your model;
Details about where the model is being uploaded to (the bucket and prefix);
The Python runtime that was used (e.g., “python:3.7.0”)
The user who ran the upload.
Versions for the Python library and key dependencies.
Domains¶
A domain is how modelstore
denotes a group of models, that are all intended for the same end-usage. When you upload a model to the store, you will add it to a domain.
The model store library then allows you to list the models that are in a domain and retrieve specific models (e.g., the latest one).
Under the hood, a domain is just a string, so it is up to you how you would like to use it.
Model State¶
A model state is a tag that you can use to control the lifecycle of a model in a given domain.
For example, you may want to have some models tagged as being in state “production” or state “shadow.” You can achieve this by creating a state and then setting a model’s state.
Under the hood, a model state name is just a string, so it is up to you how you would like to use it.
Storage¶
When you pick a backend that stores data in files (e.g., Cloud Storage Buckets), the files are stored with a pre-defined structure.
The top-level, root prefix that this library hard-codes is operatorai-model-store
.
When you create and upload a model archive, this library will upload three files to different places in the bucket.
The artifacts archive will be uploaded to:
root/<domain>/<datetime>/archive.tar.gz
, where the datetime has the form"%Y/%m/%d/%H:%M:%S"
- denoting the time when the model was uploaded.The library creates a dictionary of meta-data about your model. This will be uploaded to
root/<domain>/versions/<model-id>.json
.This same meta-data is also stored in
root/<domain>/latest.json
, which tracks the _last_ model that was uploaded to the model store.
Supported Machine Learning Libraries¶
This library currently supports several different machine learning libraries. To save models trained with them, you should use the upload function:
model_store.upload("domain", <kwargs>)
Library |
Required kwargs |
Example code |
---|---|---|
model |
||
model, pool (for classification) |
||
learner |
||
model |
||
model, optimizer |
||
model |
||
model, epoch |
||
model |
||
model |
||
model |
||
model, optimizer |
||
model, trainer |
||
explainer |
||
model |
||
model |
||
model |
||
config, model, tokenizer |
||
model |
What to do if a library is not supported¶
If you are using a machine learning library that is not listed above, you can still use model store to upload and version your models by uploading a file. You will not be able to use load()
but you will be able to download()
them back.
model_path = save_model()
model_store.upload("my-domain", model=model_path)
You can also:
Let us know by raising an issue
Add support for the library by following this guide.
Supported Storage Types¶
This library currently supports several places where you can save your models. You specify the storage type when you create a ModelStore instance:
Storage |
Requires |
Example code |
---|---|---|
The name of an existing s3 bucket |
||
The name of an existing bucket and access credentials |
||
The name of an existing container |
||
The name of an existing bucket |
||
File system |
A path |
File system storage¶
The file system model storage assumes that (a) the root directory exists, and (b) the library user has permission to write to it.
If you want to create the root directory if it does not exist, pass along the create_directory=True argument.
model_store = ModelStore.from_file_system(
root_directory="/path/to/directory",
create_directory=True,
)
Additional upload functionality¶
Uploading more than one model file¶
This library supports uploading multiple models, as long as their keyword arguments do not overlap.
For example, you might want to upload a classifier and a shap explainer together:
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
explainer = shap.TreeExplainer(model)
model_store.upload("my-domain", model=model, explainer=explainer)
When you load these models, model store returns a dictionary with both models:
models = modelstore.load(model_domain, model_id)
clf = models["sklearn"]
explainer = models["shap"]
Uploading extra files with the model¶
This library supports uploading a model with one or more extra files.
For example, you might want to upload a classifier and the predictions it made on the test set.
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
file_path = "predictions.csv"
numpy.savetxt(file_path, predictions, delimiter=",")
modelstore.upload("my-domain", model=model, extras=file_path)
When you load these models, the extra files are not loaded into memory:
clf = modelstore.load(model_domain, model_id)
Additional download functionality¶
Providing read-only access¶
The AWS s3, Google GCS, Azure Containers storage types assume that (a) the bucket/container exists, and (b) the library user has both read and write permissions.
As of 0.0.74, modelstore also supports read-only access to public Google Cloud Storage buckets.
Download a model from the model store¶
If you would rather download the model, and not load it into memory, you can use model store’s download()
function.
file_path = model_store.download(
local_path=".", # Where to download the model to
domain="example-model", # The model's domain
model_id="model-id" # Optional; the ID of the specific model
)
Retrieving model and domain information¶
This library enables you to query your model registry programmatically.
The examples below assume you have created a model store instance already:
from modelstore.model_store import ModelStore
model_store = ModelStore.from_aws_s3(bucket_name)
Model domains¶
Models are uploaded into domains: a domain is created when you upload your first model to it. You can list all of the existing domains and get information about a specific domain with:
model_domains = model_store.list_domains()
meta_data = model_store.get_domain("my-domain")
Model states¶
Model states are tags that can be used to control the lifecycle of models in a domain. To see the list of model states that have been created, use:
model_states = model_store.list_model_states()
Note: there are some reserved states that modelstore uses to, for example, keep track of model IDs that have been deleted.
Model versions¶
Models are uploaded into domains: a domain is created when you upload your first model to it. You can list all of the existing domains and get information about a specific domain with:
# List all models
model_ids = model_store.list_versions("my-domain")
# List models with a given state
prod_model_ids = model_store.list_versions("my-domain", state_name="production")
Models¶
The main thing you can do with a model is download or load it back. You can also retrieve information about a specific model, and delete models from the registry.
# Get information about a specific model
meta_data = model_store.get_model_info("my-domain", "my-model")
Deleting Models¶
Deleting a model removes the files from the registry. If you query for a model that has been deleted, a ModelDeletedException
is raised.
# Delete a model
model_store.delete_model("my-domain", "my-model", skip_prompt=True)
# Will raise a ModelDeletedException
meta_data = model_store.get_model_info("my-domain", "my-model")
Controlling model states¶
This library enables you to control models by setting their state. For example, you may want to set a model to have state “production.” You can then query the model store for models by state, and change model states.
The examples below assume you have created a model store instance already:
from modelstore.model_store import ModelStore
model_store = ModelStore.from_aws_s3(bucket_name)
Create a state¶
Before doing anything with a model state, you need to create it. This is a one-time operation.
production_state = "production"
model_store.create_model_state(production_state)
Set and unset a model’s state¶
Once a state has been created, you can add a model to a state. You can add a model to more than one state, and you can add more than one model to a state.
model_domain = "my-domain"
model_id = "my-model-id"
production_state = "production"
model_store.set_model_state(model_domain, model_id, state_name)
To unset a model’s state, you can use:
model_store.remove_model_state(model_domain, model_id, state_name)
Find models by state¶
After setting the state of one or more models, you can find them by adding the state name to the list versions function:
model_ids = modelstore.list_versions(
model_domain,
state_name=production_state
)
Modelstore CLI commands¶
You can use modelstore (version > 0.0.71) from the command line to upload and download models:
# To upload a model
python -m modelstore upload <domain> </path/to/file>
# To download a model
python -m modelstore download <domain> <model-id>
Modelstore figures out how to read from your storage by looking for specific environment variables.
Your environment needs to define (1) a value for MODEL_STORE_STORAGE which tells modelstore what type of storage you are using, and (2) values that depend on the specific type of storage that you are using.
All of these are summarised in the table below:
Storage |
MODEL_STORE_STORAGE |
Other environment variables |
---|---|---|
aws-s3 |
MODEL_STORE_AWS_BUCKET
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
|
|
azure-container |
MODEL_STORE_AZURE_CONTAINER
AZURE_ACCOUNT_NAME
AZURE_ACCESS_KEY
AZURE_STORAGE_CONNECTION_STRING
|
|
google-cloud-storage |
MODEL_STORE_GCP_PROJECT
MODEL_STORE_GCP_BUCKET
|
|
File system |
filesystem |
MODEL_STORE_ROOT |
Troubleshooting¶
Common errors when setting up s3 storage¶
This page describes the steps you need to take to store models in s3.
Before you start, you will need to create the s3 bucket you want to use. The modelstore library does not create s3 buckets and assumes they exist already. To do this, you can follow the creating a bucket AWS documentation.
Next, install modelstore and boto3 in your Python environment:
pip install modelstore boto3
If you have not done this before, you will need to set up the AWS authentication credentials by following the boto3 configuration guide.
And you can then create a model store instance and point it to your bucket:
from modelstore import ModelStore
model_store = ModelStore.from_aws_s3("my-bucket")
The remainder of this page describes some common errors you may run into. If you need further support, please create an issue on Github.
ModuleNotFoundError: boto3 is not installed¶
The model store library works with several different types of storage, and therefore does not install all of their libraries. If you see a ModuleNotFoundError
, then you need to install boto3.
pip install boto3
The current version of modelstore requires boto3>=1.6.16,<1.8
.
botocore.exceptions.NoCredentialsError: Unable to locate credentials¶
You will need to set up the AWS authentication credentials. As this documentation page describes, boto3 “looks at various configuration locations until it finds configuration values.”
To start, follow the AWS documentation to get your access key and secret access key values. There are then two approaches you can use here.
Option 1: run aws configure
by following the boto3 configuration guide.
❯ aws configure
AWS Access Key ID [None]: my-access-key
AWS Secret Access Key [None]: my-secret-access-key
Option 2: set environment variables.
export AWS_ACCESS_KEY_ID="my-access-key"
export AWS_SECRET_ACCESS_KEY="my-secret-access-key"
botocore.exceptions.ParamValidationError: Parameter validation failed¶
You’ll see this error if you are passing a bucket_name
that boto3 cannot parse. Note: you do not need to include the “s3://” in the bucket name.
>>> model_store = ModelStore.from_aws_s3("s3://my-bucket-name")
[...]
botocore.exceptions.ParamValidationError: Parameter validation failed:
Invalid bucket name "s3://my-bucket-name": Bucket name must match the regex "^[a-zA-Z0-9.\-_]{1,255}$" or be an ARN matching the regex "^arn:(aws).*:(s3|s3-object-lambda):[a-z\-0-9]*:[0-9]{12}:accesspoint[/:][a-zA-Z0-9\-.]{1,63}$|^arn:(aws).*:s3-outposts:[a-z\-0-9]+:[0-9]{12}:outpost[/:][a-zA-Z0-9\-]{1,63}[/:]accesspoint[/:][a-zA-Z0-9\-]{1,63}$"
Exception: Failed to set up the AWSStorage storage¶
This exception is raised if modelstore can’t read from the bucket you are pointing it to. With logging enabled, you will see this line when you try to create a model store instance:
>>> model_store = ModelStore.from_aws_s3("my-bucket-name")
Unable to access bucket: <bucket-name>
[...]
Exception: Failed to set up the AWSStorage storage
To resolve this, you can check:
Does the bucket exist? If not, you can follow the creating a bucket AWS documentation.
Is there a typo in the
bucket_name
variable?
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL¶
This exception is raised if modelstore can’t connect to the s3 bucket. One way this happens is if you specify a region that is not a known value. The full list of regions is available on this AWS documentation page.
For example, if you use a region name, you’ll see an error:
>>> model_store = ModelStore.from_aws_s3(bucket_name=os.environ["AWS_BUCKET_NAME"], region="Frankfurt")
>>> model_store.list_domains()
[...]
raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://operator-ai-modelstore-direct.s3.Frankfurt.amazonaws.com/?list-type=2&prefix=operatorai-model-store%2Fdomains&encoding-type=url"
But if you use the region code, it should not error:
>>> model_store = ModelStore.from_aws_s3(bucket_name=os.environ["AWS_BUCKET_NAME"], region="eu-central-1")
>>> model_store.list_domains()
['diabetes-boosting-demo']
Seeing another exception?¶
If you need further support, please create an issue on Github.
This documentation is open source. If you would like to add anything to it, please open a pull request on Github.
This documentation is open source. If you would like to add anything to it, please open a pull request on Github.
License¶
Copyright 2022 Neal Lathia
Licensed under the Apache License, Version 2.0 (the “License”); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Contact¶
If you have any questions or feedback, feel free to open an issue on Github or email me: neal.lathia@gmail.com
or