Let’s use Snowpark ML, dbt Python models, and some feature engineering to perform training and inference, and see how this new toolkit helps us scale with Snowflake, first with 50k rows and later with 50M rows, while the dbt Python models handle all the boilerplate.
Intro
A new set of tools called Snowpark ML is available in Snowflake for creating and deploying machine learning models. The best aspect is that you get all the power, security, and scalability of Snowflake along with well-known ML components (Scikit-Learn, XGBoost, LightGBM, etc.).
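To give a quick taste of what that looks like, here is a minimal sketch of training an XGBoost classifier with Snowpark ML’s scikit-learn-style API. The table and column names are hypothetical, and `session` is assumed to be an existing Snowpark session; the point is that fit and predict are pushed down and executed inside Snowflake rather than on your laptop:

```python
# Minimal sketch (hypothetical table and column names).
# `session` is assumed to be an existing Snowpark session.
from snowflake.ml.modeling.xgboost import XGBClassifier

# Load training data as a Snowpark DataFrame; the data stays in Snowflake.
train_df = session.table("MY_DB.MY_SCHEMA.TRAINING_DATA")

# The estimator mirrors the scikit-learn API, but the computation
# runs inside Snowflake.
clf = XGBClassifier(
    input_cols=["FEATURE_1", "FEATURE_2", "FEATURE_3"],
    label_cols=["LABEL"],
    output_cols=["PREDICTION"],
)
clf.fit(train_df)

# Inference also runs in Snowflake and returns a Snowpark DataFrame.
predictions_df = clf.predict(train_df)
```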
Why I’m writing this
In my previous piece, I discussed how Snowflake and dbt work incredibly well together to power the new dbt Python models.
The dbt Python models docs provide a detailed use case showing how to prepare and clean data, then train ML models and run predictions using Snowpark and dbt Python.
However, those docs don’t make use of the new Snowpark ML Toolkit, which would have improved the performance of those same tasks.
In the meantime, the Snowpark ML Toolkit Quickstart provides comprehensive instructions on how to use this new ML Toolkit.
But the first step of that quickstart asks us to download and install Python libraries in our own environments. That step bothers me. I would rather use a cloud platform (like dbt Cloud) and have every library managed “magically” for me.
So let’s do that: a thorough exploration of the Snowpark ML Toolkit using dbt Python models on dbt Cloud. No packages to install, just scalable machine learning fun.
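Concretely, all we have to do is declare the package in the model’s config and let Snowflake resolve it for us. A minimal sketch of such a dbt Python model, with a hypothetical upstream ref name:

```python
# A dbt Python model: nothing is installed locally.
# "my_training_data" is a hypothetical upstream dbt model.
def model(dbt, session):
    dbt.config(
        materialized="table",
        packages=["snowflake-ml-python"],  # resolved inside Snowflake, not on our machine
    )

    # dbt.ref() returns a Snowpark DataFrame.
    df = dbt.ref("my_training_data")

    # ... feature engineering, training, and inference go here ...

    return df
```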
What’s interesting about dbt in this process:
dbt Cloud knows nothing about the Snowpark ML libraries, and it doesn’t need to.
dbt’s only job here is to wrap our function in a boilerplate Snowpark stored procedure and hand the rest over to Snowflake (see the sketch after this list).
I just appreciate how dbt Cloud keeps everything running in the cloud while we work on this code. You could do the same with the open-source dbt-core on your desktop.
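To make that “boilerplate” point concrete, here is a heavily simplified sketch of the idea, not dbt’s actual generated code: our model() function gets called from a stored-procedure entry point that Snowflake executes, and the returned Snowpark DataFrame is persisted as the model’s table. The helper name and table name below are hypothetical.

```python
# Heavily simplified sketch of the idea (not dbt's actual generated code).

def model(dbt, session):
    # Our code, exactly as we wrote it in the .py model file.
    return dbt.ref("some_upstream_model")  # hypothetical upstream model

def main(session):
    dbt = make_dbt_helper(session)  # hypothetical: dbt builds its ref()/config() helper here
    df = model(dbt, session)
    # Boilerplate: materialize the result as the model's table.
    df.write.mode("overwrite").save_as_table("MY_DB.MY_SCHEMA.MY_PYTHON_MODEL")
    return "OK"
```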
Details are available in my earlier post: