Note: The Pipeline CI/CD module mentioned in this post is outdated and no longer available. You can integrate Pipeline into your CI/CD solution using the Pipeline API. Contact us for details.
Apache Spark on Kubernetes series:
- Introduction to Spark on Kubernetes
- Scaling Spark made simple on Kubernetes
- The anatomy of Spark applications on Kubernetes
- Monitoring Apache Spark with Prometheus
- Apache Spark CI/CD workflow howto
- Spark History Server on Kubernetes
- Spark scheduling on Kubernetes demystified
- Spark Streaming Checkpointing on Kubernetes
- Deep dive into monitoring Spark and Zeppelin with Prometheus
- Apache Spark application resilience on Kubernetes

Apache Zeppelin on Kubernetes series:
- Running Zeppelin Spark notebooks on Kubernetes
- Running Zeppelin Spark notebooks on Kubernetes - deep dive
- CI/CD flow for Zeppelin notebooks

Apache Kafka on Kubernetes series:
- Kafka on Kubernetes - using etcd
CI/CD series:
- CI/CD flow for Zeppelin notebooks
- CI/CD for Kubernetes, through a Spring Boot example
- Deploy Node.js applications to Kubernetes

We've already published a few posts about how we deploy and use Apache Spark and Zeppelin on Kubernetes. This time, we'll describe how to set up Pipeline's CI/CD workflow for a Zeppelin Notebook project. This use case may seem a bit unusual (see the note below), but it has the benefit of removing the burden of managing infrastructure (e.g. you can add or remove nodes based on the workload generated by a Notebook), allowing data scientists to focus on a Notebook's logic; all the heavy lifting necessary to provision and tear down environments is done by Pipeline.
A regular Zeppelin Notebook development cycle contains the following steps:

Note: the Notebook is developed and tested in a web browser (in the Zeppelin UI) against a persistent cluster. This is in contrast to projects (Java, Go) that are usually stored in various code repositories and have more elaborate build and deployment lifecycles. Check out our other examples to see how plain Java / Scala projects can be set up for the Banzai Cloud Pipeline CI/CD flow.
Note: Please read the following howto for a detailed description of the prerequisites that are necessary for the flow to work!

1. Fork the repository into your GitHub account. You'll find a couple of Banzai Cloud Pipeline CI/CD flow descriptor templates for previously released cloud providers (Amazon, Azure, etc.).
2. Make a copy of the template that corresponds to your chosen cloud provider and name it .pipeline.yml. This is the Banzai Cloud Pipeline CI/CD flow descriptor, which is one of the spotguides associated with the project.
3. Enable the build for your fork on the Drone UI on the Banzai Cloud control plane.
4. In the project's build details section, add the necessary secrets (Pipeline endpoint, credentials).
5. Check the descriptor for any placeholders and substitute them with your corresponding values.
Note: there is a video of the Spark CI/CD example available here that walks through the use of the CI/CD UI.

With that, your project should be configured for the Banzai Cloud Pipeline CI/CD flow! The flow will be triggered whenever a new change is pushed to the repository (configurable on the UI).
The CI/CD flow consists of the following steps:

- create_cluster - creates or reuses a (managed) Kubernetes cluster supported by Pipeline (e.g. on EC2, AKS, GKE)
- install_monitoring - installs cluster monitoring (Prometheus)
- install_spark_history_server - installs the Spark History Server
- install_zeppelin - installs Zeppelin

Note: steps related to the infrastructure are only executed once, and are reused after the first run if the cluster is not deleted as a final step.

- remote_checkout - checks out the code from the git repository
- run - runs the Notebook

The Spark event log location used by the Spark History Server and by the Notebook's Spark jobs is configured through the install_spark_history_server.logDirectory and install_zeppelin.deployment_values.zeppelin.sparkSubmitOptions.eventLogDirectory properties in .pipeline.yml (see the descriptor sketch below).
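To make the structure of the descriptor more tangible, here is a minimal, hypothetical sketch of a .pipeline.yml assembled from the steps above. Only the step names and the two event log properties come from this post; the plugin images, cluster options, and the s3a:// log location are placeholders, so take the real values from the template shipped with the repository.

```yaml
# Hypothetical .pipeline.yml sketch. Step names follow the post; every image
# name and value below is a placeholder -- copy the real ones from the
# cloud-provider template in the repository.
pipeline:
  create_cluster:
    image: <pipeline-client-plugin>          # placeholder plugin image
    cluster_name: "zeppelin-cicd-demo"       # placeholder; provider-specific node options go here

  install_monitoring:
    image: <pipeline-client-plugin>          # placeholder plugin image

  install_spark_history_server:
    image: <pipeline-client-plugin>          # placeholder plugin image
    logDirectory: "s3a://my-spark-event-logs/"   # where the History Server reads event logs

  install_zeppelin:
    image: <pipeline-client-plugin>          # placeholder plugin image
    deployment_values:
      zeppelin:
        sparkSubmitOptions:
          eventLogDirectory: "s3a://my-spark-event-logs/"  # must match logDirectory above

  remote_checkout:
    image: <git-clone-plugin>                # placeholder; checks out the Notebook repository

  run:
    image: <zeppelin-client-plugin>          # placeholder; runs the Notebook
```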
Once the cluster is up, the public endpoints of the services deployed on it can be listed through the Pipeline API:

```bash
curl --request GET --url 'http://[control-plane]/pipeline/api/v1/clusters/{{cluster_id}}/endpoints'
```
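A slightly more complete variant of the same call is sketched below. The Pipeline API on the control plane is authenticated, so in practice you will also pass a token; the control plane address, cluster id, and the bearer-token header are illustrative assumptions, so check your own control plane's settings.

```bash
# Illustrative only: substitute your control plane address, cluster id and token.
# The bearer-token header is an assumption about how your control plane authenticates API calls.
CONTROL_PLANE=control-plane.example.com   # hypothetical address
CLUSTER_ID=42                             # hypothetical cluster id

curl --silent --request GET \
  --url "http://${CONTROL_PLANE}/pipeline/api/v1/clusters/${CLUSTER_ID}/endpoints" \
  --header "Authorization: Bearer ${PIPELINE_TOKEN}" \
  | jq '.'   # pretty-print the returned endpoint list (requires jq)
```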
Warning! Be aware that clusters created with the flow on a cloud provider will cost you money. It's advised that you destroy your environment when development is finished (or at the end of the day). If you are running on AWS, you might consider using spot instances together with Hollowtrees, our watchguard for safely running spot-based clusters in production.
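If you prefer to script the teardown, the cluster can also be removed through the Pipeline API. This is a minimal sketch, assuming the delete operation is a DELETE on the same /clusters/{id} resource used above; verify the exact route against your Pipeline version.

```bash
# Assumption: cluster deletion is exposed as DELETE on the /clusters/{id} resource;
# double-check the route for your Pipeline version before relying on this.
curl --silent --request DELETE \
  --url "http://${CONTROL_PLANE}/pipeline/api/v1/clusters/${CLUSTER_ID}" \
  --header "Authorization: Bearer ${PIPELINE_TOKEN}"
```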