Gopay In-house Cloud Cost Dashboard
Motivation
Back then, we were only using AWS for our services. As we wanted to step up our game, we started to build a redundant system, primarily for disaster recovery, and GCP was the other Cloud Provider selected. Everything was going well, but then we started to notice that our cloud costs were skyrocketing—no kidding. So, we needed to keep track of the cost and optimize it. Initially, we were using the built-in cost explorer on each cloud provider for that purpose.
Well, now what’s the problem? It is only two accounts, no? Here’s the thing: our company keeps acquiring other companies, and along with the business, their tech also needs to be integrated. For compliance reasons, we need separate accounts for different companies or businesses. We ended up with a bunch of accounts spread across different Cloud Providers.
Going back and forth between the two Cloud Providers and between different accounts was a real pain. After evaluating a few options, like Vantage.sh and Finout.io, we decided to build our own cloud cost dashboard to save money, gain cost visibility, secure our data, and have more flexibility with the features.
Data
The first hurdle was how different the data structures are between the two Cloud Providers. AWS and GCP each have their own way of presenting billing data, and the terminology between the two clouds also differs. We decided to build an ETL pipeline to transform the data into a common format, which then serves as the primary data source for our dashboard. For the data store, we initially chose BigQuery, as it is simple and able to handle our data volume, although we later migrated to MaxCompute.
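As an illustration, the normalized schema boils down to one record shape per cost line, with a mapper per provider. This is a minimal sketch; the field names are assumptions (the AWS mapper uses CUR-style column names), not our actual internal schema.

```typescript
// Hypothetical unified record; every provider's billing export maps into this.
interface UnifiedCostRecord {
  provider: "aws" | "gcp" | "alicloud" | "tencent";
  accountId: string;              // AWS account ID, GCP project ID, etc.
  service: string;                // normalized service name
  usageDate: string;              // ISO date, e.g. "2024-01-31"
  costUsd: number;                // converted to a single currency
  labels: Record<string, string>; // tags/labels unified into one map
}

// One mapper per source; this one assumes AWS CUR-style columns.
function fromAwsCur(row: Record<string, string>): UnifiedCostRecord {
  return {
    provider: "aws",
    accountId: row["line_item_usage_account_id"],
    service: row["product_servicecode"].toLowerCase(),
    usageDate: row["line_item_usage_start_date"].slice(0, 10),
    costUsd: Number(row["line_item_unblended_cost"]),
    labels: {}, // resource tags would be flattened here
  };
}
```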
Representation Layer
To show the data, we needed to represent it in a way that is easy to understand. We built the representation layer using React and TypeScript with Chart.js. On the backend, a simple query builder translates the filters and payload from the client into queries against our data store. The backend also transforms the results into a format that is easy for the frontend to consume.
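To give an idea of what that query builder does, here is a minimal sketch. The payload shape and table name are assumptions for illustration; a real implementation should also use parameterized queries rather than string interpolation.

```typescript
// Hypothetical filter payload sent by the frontend.
interface CostQuery {
  groupBy: string[];                 // e.g. ["provider", "service"]
  filters: Record<string, string[]>; // e.g. { provider: ["aws"] }
  from: string;                      // inclusive ISO date
  to: string;                        // exclusive ISO date
}

// Translate the payload into SQL against the unified cost table.
function buildSql(q: CostQuery, table = "unified_costs"): string {
  const dims = q.groupBy.join(", ");
  const where = [
    `usage_date >= '${q.from}' AND usage_date < '${q.to}'`,
    ...Object.entries(q.filters).map(
      ([col, vals]) => `${col} IN (${vals.map((v) => `'${v}'`).join(", ")})`
    ),
  ].join(" AND ");
  return `SELECT ${dims}, SUM(cost_usd) AS cost FROM ${table} WHERE ${where} GROUP BY ${dims}`;
}
```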
Architecture
A simplified architecture of our cloud cost dashboard is shown below. The actual architecture is a bit more complex, but this is a good enough representation that I can give while avoiding any legal issues.
```mermaid
flowchart LR
    A@{ shape: processes, label: "AWS Accounts"} -->|Raw Data| D@{ shape: das, label: "ETL Pipeline"}
    B@{ shape: processes, label: "GCP Accounts"} -->|Raw Data| D
    C@{ shape: processes, label: "Alicloud Accounts"} -->|Raw Data| D
    D -->|Standardized Schema| E@{ shape: cyl, label: "Data Store"}
    F@{ shape: process, label: "Backend"} -->|Query| E
    G@{ shape: process, label: "Dashboard"} --> F
```
Features
Currently, we have the following features:
Cloud Provider Agnostic
We have built our own ETL pipeline to transform the data into a common format. This way, we can support multiple cloud providers as well as multiple accounts. Currently, apart from GCP and AWS, we also support Alicloud and Tencent.
Cost Explorer
This feature allows users to drill down into the cost data and get a better understanding of the cost breakdown. Dimensions are hierarchical, following our organization structure, and users can select multiple dimensions to drill down into. This has become a staple for day-to-day cost analysis and visibility for many users.
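As a rough illustration (the actual payload is internal, so these names are hypothetical), a drill-down request might look like this:

```typescript
// Hypothetical drill-down payload; dimensions are ordered by our
// org hierarchy, so each level narrows the previous one.
const drillDown = {
  dimensions: ["business_unit", "team", "service"],
  filters: { business_unit: ["payments"] }, // hypothetical BU name
  granularity: "daily",
  dateRange: { from: "2024-01-01", to: "2024-02-01" },
};
```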
Cost Forecast
This feature is actually integrated into the Cost Explorer. We use a simple statistical model, EWMA (Exponentially Weighted Moving Average), to predict the cost based on historical data.
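For the curious, EWMA itself is only a few lines. This is a minimal sketch assuming a daily cost history; the smoothing factor here is a placeholder, not the value we actually use.

```typescript
// Naive one-step-ahead EWMA forecast over a daily cost series.
// alpha is the smoothing factor: higher alpha reacts faster to recent days.
function ewmaForecast(dailyCosts: number[], alpha = 0.3): number {
  let smoothed = dailyCosts[0]; // assumes a non-empty series
  for (const cost of dailyCosts.slice(1)) {
    smoothed = alpha * cost + (1 - alpha) * smoothed;
  }
  return smoothed; // forecast for the next day
}

ewmaForecast([100, 120, 110, 130]); // ≈ 114.0
```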
Cost Alerts
This is another feature we built to help engineers be more aware of their costs. It supports alerts based on multiple factors such as budget, trends, or even custom queries. Alerts are configured by the user, and notifications can be sent via Email or Slack.
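To make that concrete, an alert definition could look roughly like the following; the schema is a sketch, not our actual configuration format.

```typescript
// Hypothetical alert configuration shape.
interface CostAlert {
  name: string;
  kind: "budget" | "trend" | "custom_query";
  threshold: number;            // budget in USD, or a % change for trends
  query?: string;               // only used when kind === "custom_query"
  schedule: "hourly" | "daily";
  channels: Array<{ type: "email" | "slack"; target: string }>;
}

const exampleAlert: CostAlert = {
  name: "payments-monthly-budget",
  kind: "budget",
  threshold: 50_000, // hypothetical monthly budget in USD
  schedule: "daily",
  channels: [{ type: "slack", target: "#payments-cost-alerts" }],
};
```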
Cost Report
This feature is primarily dedicated to higher-ups, presenting a helicopter view of costs along with some insights we can gather. There are weekly and monthly reports, and the report supports multiple levels of granularity per business unit.
Custom Dashboard
This feature is actually a continuation of our cost report feature. Some teams and individual users wanted their own report structure, but we couldn’t support that at scale. So instead, we delegated this to the users and let them create their own dashboards based on their needs. The custom dashboard is very much inspired by how Grafana handles its dashboards. Users can create “Widgets” and arrange them in a dashboard. A widget supports multiple visualization types and can be configured based on the user’s needs.
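In spirit, a dashboard is just a list of widget definitions laid out on a grid. The shapes below are illustrative, loosely modeled on how Grafana does it, not our actual schema.

```typescript
// Hypothetical widget and dashboard shapes.
interface Widget {
  title: string;
  visualization: "line" | "bar" | "pie" | "table";
  // the same filter/group-by payload the Cost Explorer uses
  query: { groupBy: string[]; filters: Record<string, string[]> };
  position: { x: number; y: number; w: number; h: number }; // grid cell
}

interface Dashboard {
  name: string;
  owner: string;
  widgets: Widget[];
}
```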
Cost Allocation
Our cloud cost dashboard is able to give some estimates on shared costs, such as those for Kubernetes and Kafka. These estimates are calculated based on each service’s usage of the shared service. The calculation has some slight rounding errors, but it is good enough for our use case.
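The idea is simple proportional allocation: each consumer pays for its share of the shared resource’s usage. A minimal sketch, assuming usage is measured in a single unit such as CPU-hours:

```typescript
// Split a shared bill proportionally to each service's measured usage.
function allocateSharedCost(
  totalCostUsd: number,
  usageByService: Record<string, number>
): Record<string, number> {
  const totalUsage = Object.values(usageByService).reduce((a, b) => a + b, 0);
  const allocation: Record<string, number> = {};
  for (const [service, usage] of Object.entries(usageByService)) {
    allocation[service] = totalCostUsd * (usage / totalUsage);
  }
  return allocation;
}

// e.g. a $1,000 Kubernetes bill with usage { a: 30, b: 70 } (CPU-hours)
// yields { a: 300, b: 700 } in USD.
```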
Cost Optimization and Recommendation
This feature gives us actionable items to reduce costs, reduce waste, or improve utilization. Currently, we support it for VMs as well as for Kubernetes workloads. We use a simple capacity planning algorithm similar to the ones in paid SaaS tools (with some tweaks). The core formula is:
$$ R_{new} = \frac{U_{current}}{T_{util}} $$

Where:
- $R_{new}$ is the New Recommendation (e.g. CPU/Memory request)
- $U_{current}$ is the Current Usage (e.g. p99 usage over a period)
- $T_{util}$ is the Target Utilization Percentage (e.g. 80% or 0.8)
For example, if we have a resource allocated 100 CPU, but the current usage is only 20 CPU, and we target 80% utilization:
$$ R_{new} = \frac{20}{0.8} = 25 \text{ CPU} $$

This allows us to still have enough headroom for when the usage spikes.
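In code, the recommendation is a one-liner; this sketch just mirrors the example above.

```typescript
// New request = current usage / target utilization.
function recommendRequest(p99Usage: number, targetUtilization = 0.8): number {
  return p99Usage / targetUtilization;
}

recommendRequest(20); // => 25 CPU: 80% utilized at p99, with headroom for spikes
```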
Planned Features
There are also some features on the roadmap that we want to implement later, such as:
- Anomaly Detection
- Unit Economics
- Virtual Tagging
- MCP Support
- and more…