By putting our declarations in Git repositories and using CI/CD, we now have a powerful way of automating and controlling our cloud, while also keeping track of different versions. We can easily roll back, (dis)approve, and monitor changes we make in our cloud environment.
But what are we going to provision in our cloud?
Networking and compute services
For our foundation, we looked at the basic networking and compute services. For networking, we created a Virtual Private Cloud (VPC). This is an isolated network to which you can attach resources that need an internal IP address. A VPC in Google Cloud is global by nature, meaning that machines in Europe can reach machines in Asia or the US with near-zero configuration. This is a big difference compared to Azure, where you would have to go through multiple steps to set this up. Within our VPC, we created several subnetworks, which are tied to regions.
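As a rough illustration of the network setup described above, here is a minimal sketch of a Deployment Manager template written in Python. It is not our actual template: the deployment name, the regions, and the CIDR ranges are made-up values, and the `subnets` property is a name chosen for this example. The sketch assumes the `compute.v1.network` and `compute.v1.subnetwork` resource types and the `GenerateConfig(context)` entry point that Deployment Manager uses for Python templates.

```python
from types import SimpleNamespace

# Hypothetical layout: one subnetwork per region, with non-overlapping ranges.
SUBNETS = {
    "europe-west1": "10.0.0.0/20",
    "us-central1": "10.0.16.0/20",
    "asia-east1": "10.0.32.0/20",
}


def GenerateConfig(context):
    """Template entry point: one global VPC plus one subnetwork per region."""
    network_name = "{}-vpc".format(context.env["deployment"])
    resources = [{
        "name": network_name,
        "type": "compute.v1.network",
        # Custom mode: we declare the subnetworks ourselves.
        "properties": {"autoCreateSubnetworks": False},
    }]
    for region, cidr in sorted(context.properties["subnets"].items()):
        resources.append({
            "name": "{}-{}".format(network_name, region),
            "type": "compute.v1.subnetwork",
            "properties": {
                # Reference to the network resource declared above.
                "network": "$(ref.{}.selfLink)".format(network_name),
                "region": region,
                "ipCidrRange": cidr,
            },
        })
    return {"resources": resources}


# Local stand-in for the context object that Deployment Manager passes in,
# so the template can be inspected without deploying anything.
example_context = SimpleNamespace(
    env={"deployment": "foundation"},
    properties={"subnets": SUBNETS},
)
config = GenerateConfig(example_context)
for resource in config["resources"]:
    print(resource["type"], resource["name"])
```

The network is declared once and is global, while every subnetwork carries its own `region`. That mirrors the split described above: one VPC, several regional subnetworks.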
As for compute services, there is a wide variety with different pricing models. Most interesting here is the Kubernetes offering. Google Kubernetes Engine enables you to manage your applications and clusters, while Google Cloud takes care of your master nodes and node pools. This allows you to easily create highly available services while taking away some of the headaches of hosting your own Kubernetes.
It is very cool that we can easily automate the setup of networking and compute resources with Deployment Manager. Now we must find a way to monitor these resources.
Monitoring and alerting
Stackdriver is the native service used for monitoring your GCP environment. This service visualizes your infrastructure metrics and logs, and alerts you when something significant happens. What really sets Stackdriver apart from its competition is the built-in incident manager to manage these alerts. Using Deployment Manager, we can automate the rules for when to alert. For example, we implemented a rule in Stackdriver that alerts when someone’s access rights are changed, or when our infrastructure load spikes.
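To give an idea of what such a rule looks like, the sketch below builds an alert policy for a load spike as a plain Python dictionary. The field names follow the alert policy resource of the Monitoring API as we understand it. The threshold, the duration, and the display names are illustrative assumptions, not the values we used, and wiring the policy into a Deployment Manager deployment is left out.

```python
import json


def cpu_spike_alert_policy(threshold=0.8, duration_seconds=300):
    """Build an alert policy body that fires on sustained high CPU load.

    threshold is a utilization fraction (0.8 means 80 percent); both
    defaults are illustrative assumptions.
    """
    if not 0 < threshold <= 1:
        raise ValueError("threshold must be a fraction between 0 and 1")
    return {
        "displayName": "Infrastructure load spike",
        "combiner": "OR",
        "enabled": True,
        "conditions": [{
            "displayName": "CPU utilization above threshold",
            "conditionThreshold": {
                # CPU utilization metric of Compute Engine instances.
                "filter": (
                    'metric.type="compute.googleapis.com/instance/cpu/utilization" '
                    'AND resource.type="gce_instance"'
                ),
                "comparison": "COMPARISON_GT",
                "thresholdValue": threshold,
                # The condition must hold this long before an alert opens.
                "duration": "{}s".format(duration_seconds),
                "aggregations": [{
                    "alignmentPeriod": "60s",
                    "perSeriesAligner": "ALIGN_MEAN",
                }],
            },
        }],
    }


policy = cpu_spike_alert_policy()
print(json.dumps(policy, indent=2))
```

Keeping the rule as data makes it easy to store in Git next to the rest of the declarations and to review changes to thresholds like any other change.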
Google is well known for its data and machine learning solutions, so obviously we wanted to put those to the test. To make a clear distinction between the data services, we created three personas (the data engineer, the data scientist, and the data analyst) and explored which services are of most use to each of them.
The Data engineer
Data engineers work with big data, and solve problems using ETL and distributed processing. For data engineers, we have found the services DataProc, DataFlow, BigQuery, and Composer to be most useful.
DataProc is used for migrating Spark and/or Hadoop jobs to Google Cloud. It allows you to spin up and tear down clusters, and it provides some automation to create complete workflows. DataProc gives you a lot of control, since you can choose the underlying infrastructure that runs the service. This means you can specify how big or fast you want your machines to be. The downside is that you need to specify more of the configuration yourself to manage this service.
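The control over the underlying machines can be sketched as a Deployment Manager resource built in Python. This is an illustration under assumptions: the helper `dataproc_cluster` is hypothetical, the cluster names, region, and machine types are example values, and we assume the `dataproc.v1.cluster` resource type with properties that mirror the cluster configuration of the DataProc API.

```python
def dataproc_cluster(name, region, master_type, worker_type, num_workers):
    """Return a cluster resource with explicit machine sizes (hypothetical helper)."""
    if num_workers < 2:
        raise ValueError("this sketch assumes at least two worker nodes")
    return {
        "name": name,
        "type": "dataproc.v1.cluster",
        "properties": {
            "region": region,
            "clusterName": name,
            "config": {
                # This is where you decide how big or fast the machines are.
                "masterConfig": {
                    "numInstances": 1,
                    "machineTypeUri": master_type,
                },
                "workerConfig": {
                    "numInstances": num_workers,
                    "machineTypeUri": worker_type,
                },
            },
        },
    }


# The same job can run on a small test cluster or on a larger one.
small = dataproc_cluster("etl-test", "europe-west1", "n1-standard-2", "n1-standard-2", 2)
large = dataproc_cluster("etl-prod", "europe-west1", "n1-standard-4", "n1-highmem-8", 10)
print(small["properties"]["config"]["workerConfig"])
print(large["properties"]["config"]["workerConfig"])
```

The flip side mentioned above is visible here as well: every one of these settings is something you have to choose and maintain yourself.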
DataFlow is used when creating data pipelines from scratch. It runs Apache Beam in the background and has some out-of-the-box integration with different streaming and storage services. You do not have any control over the environment this service runs in, but it is very easy to get started.
BigQuery is the data warehousing service with an SQL-like interface. It was pretty easy for us to integrate this with our DataProc clusters. Like DataFlow, it is presented to users as a managed service: developers can easily use it without worrying about the underlying infrastructure.
And lastly, Composer is the managed Apache Airflow service for data pipeline orchestration and scheduling using Directed Acyclic Graphs (DAGs). If you have already decided to use Airflow in some way, then I recommend using Composer, since it takes away some of the pain of setting up and maintaining the environment. If you need data pipeline orchestration, it is a no-brainer to go with this solution.
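For readers new to the idea, the sketch below shows what scheduling with a Directed Acyclic Graph boils down to: every task lists the tasks it depends on, and the scheduler derives a valid execution order. It uses only the Python standard library (`graphlib`, Python 3.9 or later) and is not Airflow's API. The task names are invented for this example.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "load_warehouse": {"clean"},
    "train_model": {"clean"},
    "publish_report": {"load_warehouse", "train_model"},
}


def execution_order(dag):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)

# A cycle makes the graph unschedulable, which is why the "acyclic" part matters.
try:
    execution_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except CycleError:
    cycle_detected = True
print("cycle detected:", cycle_detected)
```

An orchestrator such as Airflow adds scheduling, retries, and monitoring on top of this basic ordering, which is the part Composer hosts for you.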
The only thing that is really missing here is a developer tool, like the Databricks offering on Azure. The available services are great for when you have a piece of software that is robust and ready for production. But if you are still developing your code, it becomes difficult to be confident it will work in production.
The Data scientist
Data scientists work with machine learning (ML) solutions. AI Platform is the go-to service in that case. It gives access to all infrastructure optimized for ML computation, such as GPUs, hosted Jupyter notebooks, and deep learning machine images. AI Platform is made to cover the entire machine learning process from development to deployment.
It is very easy to share models with either your coworkers or the entire world. Data scientists can use amazing pre-trained machine learning models, made available by Google and the AI community. A lot of tooling is in place, enabling you to continuously validate and test new versions of your models and deploy the best ones to your production environments.
The Data analyst
Data analysts work with business intelligence and data visualization. Google Cloud offers two tools, DataPrep and DataStudio, that are great for this.
DataPrep is a no-code data pre-processing tool, which allows you to perform simple data cleanup and computation to prepare your data for the next step in the process. It also shows the analyst information about the data that is coming in, such as the distribution of values per column, or the number of missing values. This information can then be used to click together automation to get your data ready for the end user. DataPrep is a nice tool in terms of its functionality, but it is not easy to version control what you build with it. This makes it almost useless for any automation purposes.
DataStudio is a data visualization tool that you can use to build rich, interactive dashboards and reports, making it very comparable to a service such as PowerBI. A great feature of DataStudio is that it has good out-of-the-box integrations with Google’s other data services such as BigQuery and DataPrep, making it much easier to work across disciplines. If you are planning to use it as a standalone service, evaluate what kind of integrations are important for your use case before deciding.
GCP: the verdict
It’s clear Google is on a mission to be number one. Google Cloud has a vast selection of services for different use cases. Spinning up and tearing down resources takes seconds rather than minutes. But GCP is also newer to the game, and that shows. Certain services are still minimal and in need of additional features.
When artificial intelligence and machine learning play a crucial part in your day-to-day work, you cannot go wrong adopting Google Cloud in your stack, especially given the strong community surrounding it. However, if data engineering is more your strong suit, you might need to consider a different platform until GCP has matured a bit more. The main issue is the lack of a good development solution.