How to load a file in the bucket into Bigquery in an event-driven way?
Cloud function is an event-driven tools that we could use when we don’t need to fix computer resource or network resources but just based on the resource used when we use. But what’s is event, when we put a file into a bucket, file added, that’s an event; when we delete a file, file changed, that’s an event; when we modify a file, file changed, that’s an event. We just need to pay for the resource we used.
When we are in GCP, we could use the Cloud function to implement this event-driven logic. The base is we will put a file in the bucket, then we will trigger a Pubsub message, then Cloud function will monitor this Pubsub topic, if we get that info, the Cloud function logic will be triggered.
We will use the bellow components in GCP:
1. Cloud Storage
2. Cloud functions
This is based on your own GCP project, just follow the below steps to make things happen.
I highly recommend to use **Notebook** to implement the logic as we could get interactive actions with code, we could use our **Local jupyter notebook server** or we could use **cloud-hosted notebook** like [Colab research] by Google, it’s free!
Let’s start our journey :)
1. Install dependencies
We will use `google-cloud-storage` and `google-cloud-bigquery` to do the data loading logic, also this is also based on that you have already installed `GCP SDK` in your local machine if you use `jupyter notebook`, if with `colab`, the `SDK` is already installed.
This is to ensure that we should have `google-cloud-storage` and `google-cloud-bigquery` to be installed first.
! pip install google-cloud-bigquery google-cloud-storage — quiet
2. Get Service account key
We need to get a `Service account` to get interaction with the `GCP` project, so the first step is that we need to apply a `SA` first. You could just follow the `SA` creation step.
2.1 Create SA
In the `IAM` tab, click with the `service account` tab, then just click `create`, give a name, then grant the SA account with the proper power you want, here for simplicity, I just provide with
2.2 Get SA key
After the account is created, then we need to get the `SA` account key with a `JSON` file, we will use this file to get interaction with `GCP` in fact, so please keep in mind to keep this file secretly.
After we have already got the key, then we should first insert project information into our running time environment, that’s easy, just add an environment variable with `GOOGLE_APPLICATION_CREDENTIALS` with the file, that’s it.
import osfrom google.cloud import bigquery# Get SA file namesa_file_name = [x for x in os.listdir('.') if x.endswith('.json')]os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = sa_file_name
3. Create a bucket and push notification for the bucket
As when we change the bucket file like `upload`, `change`, `delete` or other modifications in the bucket file, we know that there are some changes in the bucket, but how to notice **Cloud functions**? That’s **Pubsub** comes in.
When there is a modification, then we could create a message that contains some information that: oh, you have added a file called `data.csv` at what time by who, like this information.
Then we will use **Cloud functions** to get the published message from that topic.
After we have already created the bucket, then we should create a `notification` from the bucket into a `topic`, so we could just use this command:
gsutil notification create -f json -t topic_name gs://bucket_name
! gsutil notification create -f json -t gcs_to_bq gs://gcs_to_bq_cloud_func
4. Create datasets, table, and load data into BQ with cloud function
As we need to load the file into the bucket, so we could create the `dataset` and `table` that we need, we could both use the `portal` or `API`, here I just use `Python API` to do this.
After we have created the table, then we will use the cloud function to do the data loading logic.
Before we do anything, please do keep in mind that we should provide with a `requirements.txt` file that contains packages that you use for could functions, as cloud functions are event-driven, we don’t hold environment, so we need to tell the platform that we need these packages, please help me to install them.
The last step is that we should implement the logic with Python code that we would like to use, I have uploaded my code into my Github, you could find the whole code from my Github, also you could find many other tutorials about GCP with hands-on python code.
5. Deploy cloud function
The last step is to deploy our code into the GCP project, here we could use `gcloud` command to deploy, you could reference the official web [Pubsub Trigger]
This is the command that we will use to deploy the function that we have written.
gcloud functions deploy load_gcs_file_into_bq_new \--runtime python37 \--trigger-resource gcs_to_bq_cloud_func_new \--trigger-event google.storage.object.finalize \--allow-unauthenticated
6. Load file from bucket into BQ
The last step that we should take is to provide the file that we need to upload it into the bucket we created, we could use Portal or use
gsutil command to upload the file into that bucket directly, but we do need to ensure that SDK is installed in the computer(SDK install) or in
Colab it’s installed already.
! gsutil cp data.csv gs://gcs_to_bq_cloud_func_new/
Please do keep in mind that we should provide the right bucket name and the source file name in that bucket when write python code.
7. Check logs and dataset result
Alright, we have already deployed the code and moved a file into that bucket, but how about BigQuery? We could get the running log in GCP portal with Cloud functions, click the function we created and in LOGS, we could get the running status and result.
In the logs that we found that the function executed finished without error. But how about the BigQuery table result? If would like to write less code, then just in the GCP portal BigQuery component, just query that table.
Final words, this tutorial is based on the cloud function to load files in bucket into Bigquery, as this is a less expensive and event-driven process, so this is just a demonstration, for real project we just need to focus on the logic that we need to implement.
If you think this has helped you a bit, please clap for this story! Thanks!!!
If you would love to learn more about GCP hands-on in Python, please check my Github repo. Hope Python coding.