Trigger a Spark job on Databricks using the REST API
Many of us already have experience triggering Spark jobs on Databricks, either on a schedule or from a Data Factory pipeline. Today we will cover how to trigger a Spark job using a REST API call from Postman. As you can imagine, this opens up a lot of new possibilities, like triggering a Spark job from a web app and displaying the results as soon as the job is completed.
For our demo, we will follow these steps:
- On Databricks, create a cluster, a notebook and a job which will execute our notebook on the cluster and generate an output.
- Trigger the Databricks job using a POST REST API call from Postman. We will pass some data in the API call, and our Databricks job will then manipulate the data and generate an output.
- Make another call to check the status and output of our job.
Pictorially, this is what it looks like:
If you are looking for a UML diagram, you can refer to the one below.
The Databricks REST API is documented here, but we will only be using the Jobs APIs, which are detailed here.
Step 1: Create a cluster, a notebook and a job.
Log in to your Databricks workspace and click “Create”. Select “Cluster”.
You can give your cluster a custom name and use the defaults as I have shown below. This is just for a demo; in your use case your cluster configuration may be different, and it is completely fine to customize it to your requirements.
Click on create cluster.
While the cluster is coming up, click on “Create” and then click on “Notebook”.
Once again, you can use any name. I’m creating my notebook in Python, but you can use Scala, SQL or R too. Make sure you select the cluster you just created. Click on “Create”.
You can use the Python notebook code below.
# Create a text widget called "myinput"; this is how the job passes a parameter into the notebook
dbutils.widgets.text("myinput", "")

# Read the value of the widget
var_a = dbutils.widgets.get("myinput")
print(var_a)

# Build the output message and return it as the notebook's exit value
output = "The message which you had sent is : " + var_a
print(output)
dbutils.notebook.exit(output)
This creates an input widget called myinput, stores its value in var_a, appends it to another string, and then exits the notebook with that string as the output.
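So if, for example, you trigger the job with myinput set to “Hello Anupam”, the notebook will exit with “The message which you had sent is : Hello Anupam”, and that is the output we will retrieve later through the API.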
Next we will create a job to execute our notebook. Go to “Create” and click on “Job”.
Give your job/task a fancy name. Select the notebook you just created and the cluster you created earlier. Add a parameter with a key of “myinput”, because that is the input your notebook is expecting.
Your screen should look like the one below. Note that the maximum concurrent runs defaults to 1, which means only one run of your job can execute at a time. If you expect some concurrency and would like to control it, you can set this to a higher value. I have left it at 1 for my demo.
Also note the job ID on the top right-hand side. Every job within your workspace is assigned a unique ID, and you will need it to specify which job to execute when you make your REST API call.
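For reference, if you would rather define the job from code than through the UI, a rough sketch against the Jobs 2.1 create endpoint could look like the one below. It assumes a personal access token (we will generate one in Step 2), and the workspace URL, job name, cluster ID and notebook path are placeholders you would swap for your own.

import requests

# Placeholders: workspace URL, token, cluster ID and notebook path are illustrative only
DATABRICKS_HOST = "https://adb-xxxxxxxxxxxxxxxxx.azuredatabricks.net"
TOKEN = "<your-personal-access-token>"

job_settings = {
    "name": "my-demo-job",
    "max_concurrent_runs": 1,  # same setting as "Maximum concurrent runs" in the UI
    "tasks": [
        {
            "task_key": "run-demo-notebook",
            "existing_cluster_id": "<your-cluster-id>",
            "notebook_task": {
                "notebook_path": "/Users/<your-user>/<your-notebook>",
                "base_parameters": {"myinput": ""},  # the key the notebook widget expects
            },
        }
    ],
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_settings,
)
resp.raise_for_status()
print("Created job with id:", resp.json()["job_id"])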
Step 2: Trigger a run-now API call to execute the Databricks job.
The first step is to get a token to authenticate against your Databricks workspace. You can do this by selecting “Settings” in Databricks and then clicking on “User Settings”.
Click on “Generate Token” and make a note of the token. You can also authenticate using Azure Active Directory, which is the safer way, but we will not be covering that in this demo. If you need more info, please refer to my new article.
Next you will need to construct the URL. You can get the workspace URL from the Azure portal for your Databricks workspace, as shown below.
Append “/api/2.1/jobs/run-now” to it and you will have your URL, which will look something like https://adb-xxxxxxxxxxxxxxxxx.azuredatabricks.net/api/2.1/jobs/run-now.
In Postman, set the authorization type to “Bearer Token” and paste your Databricks token in the space provided.
The body of the request should be JSON, and the minimum required fields are shown below:
{
  "job_id": 97,
  "notebook_params": {
    "myinput": "Hello Anupam"
  }
}
Note that the job_id has to match the job ID that was assigned when you created your job.
Let’s execute this and see what happens.
So we got a 200 response with the run_id included in the body. Our job has been triggered in the background; now we need to check its status.
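Postman is handy for testing, but if you want to trigger the job from code, a minimal Python sketch using the requests library could look like the following. The workspace URL, token and job ID are placeholders for your own values.

import requests

# Placeholders: substitute your own workspace URL, personal access token and job ID
DATABRICKS_HOST = "https://adb-xxxxxxxxxxxxxxxxx.azuredatabricks.net"
TOKEN = "<your-personal-access-token>"

response = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"job_id": 97, "notebook_params": {"myinput": "Hello Anupam"}},
)
response.raise_for_status()
run_id = response.json()["run_id"]  # keep this; we need it to fetch the output
print("Triggered run:", run_id)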
Step 3: Trigger a get-output API to fetch the status and output of the Databricks job execution.
Let’s form the URL for our get-output call. Just append “/api/2.1/jobs/runs/get-output?run_id=477” to the Databricks URL you got earlier. It will look something like this.
https://adb-xxxxxxxxxxxxxxxxx.azuredatabricks.net/api/2.1/jobs/runs/get-output?run_id=477
Note that the run_id has to correspond to the one you received in the response from the run-now API call.
Make sure you add the bearer token like you did for Step 2. You can reuse the same token as long as it is still valid.
When you initially trigger this call, you may get something like this.
This tells you that your job is pending because your cluster is still starting up. If you wait a few minutes and trigger the call again, you may get a completion message like the one below.
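Concretely, while the cluster is starting the run’s state shows a life_cycle_state of PENDING; once the notebook is executing it moves to RUNNING, and when it finishes it becomes TERMINATED with a result_state of SUCCESS (or FAILED if something went wrong).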
If you scroll to the bottom of the response body, you should be able to see the output that your Databricks notebook generated.
If you are using this in an application, it can either poll for the result until it sees a success or failure, or you can ask the user to check back via a specific button. Both approaches will work.
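If you go down the polling route, a rough sketch of what that loop might look like in Python is below, reusing the same placeholder workspace URL and token as before, and the run_id returned by run-now.

import time
import requests

# Placeholders: reuse the workspace URL and token from earlier; run_id comes from the run-now response
DATABRICKS_HOST = "https://adb-xxxxxxxxxxxxxxxxx.azuredatabricks.net"
TOKEN = "<your-personal-access-token>"
run_id = 477

while True:
    resp = requests.get(
        f"{DATABRICKS_HOST}/api/2.1/jobs/runs/get-output",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"run_id": run_id},
    )
    resp.raise_for_status()
    body = resp.json()
    state = body["metadata"]["state"]
    # TERMINATED, SKIPPED and INTERNAL_ERROR are the terminal life cycle states
    if state.get("life_cycle_state") in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print("Result state:", state.get("result_state"))
        # For a notebook task, the value passed to dbutils.notebook.exit() comes back here
        print("Output:", body.get("notebook_output", {}).get("result"))
        break
    print("Still waiting, current state:", state.get("life_cycle_state"))
    time.sleep(30)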
You can use the same method to call the other Databricks APIs.
If you have any questions, feel free to drop me a comment and I will attempt to respond to you as soon as I can.
Cheers.