I built this because GPU spend is becoming the largest line item on the cloud bill as AI workloads move into production. Standard K8s cost tools often treat a node as a "black box," but when an A100 sits idle because of a misconfigured training job or a stuck inference server, you’re burning hundreds of dollars a day.
The Live Demo: I know how annoying it is to sign up just to see a dashboard. I’ve set up a demo cluster so you can see the ML-specific cost analysis and recommendations immediately:
User: hackernews@podcost.io
Pass: hackernews@podcost.io
What’s inside:
ML Workload Analysis: Tracks cost per training job and per inference request.
GPU Idle Detection: Automatically flags GPUs that are allocated but sitting at low utilization (a rough DIY version of this check is sketched below).
Actionable Recommendations: Suggests specific rightsizing for pods and nodes based on actual historical usage (see the toy example below).
Quick Setup: If you want to test it on your own cluster, it’s a single Helm command.
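To make the GPU idle detection point concrete, here is a rough DIY version of that check. It is only a sketch of the idea, not what the product actually runs, and it assumes your cluster exposes GPU metrics via dcgm-exporter (with pod attribution enabled) and that those metrics are scraped by a Prometheus you can reach; the Prometheus URL and the 10% threshold are placeholder assumptions.

    # Sketch only: flag GPUs that are attached to a pod but averaged under 10%
    # utilization over the last hour. Assumes dcgm-exporter metrics are scraped
    # by the Prometheus at PROM_URL (both the URL and the threshold are assumptions).
    import requests

    PROM_URL = "http://prometheus.monitoring.svc:9090"
    QUERY = 'avg_over_time(DCGM_FI_DEV_GPU_UTIL{pod!=""}[1h]) < 10'

    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=10)
    resp.raise_for_status()

    for series in resp.json()["data"]["result"]:
        labels = series["metric"]
        avg_util = float(series["value"][1])
        print(f"idle GPU {labels.get('gpu')} on {labels.get('Hostname')} "
              f"(pod {labels.get('pod')}): {avg_util:.1f}% avg utilization")

Swap in your own threshold and window; the point is just to show what "allocated but idle" means in practice.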
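On the recommendations side, here is a toy version of the kind of percentile-plus-headroom calculation most rightsizing tools start from, with made-up numbers. Treat it purely as a reference point for what "based on actual historical usage" means, not as the algorithm we ship.

    # Toy percentile-based rightsizing, not the real recommendation engine.
    # utilization_samples: historical average utilization (%) across the GPUs
    # currently allocated to a pod. All numbers below are made up for illustration.
    import math
    import statistics

    def recommend_gpus(utilization_samples, allocated_gpus, headroom=1.2):
        # Convert the 95th-percentile utilization into "GPUs actually busy",
        # add headroom, and round up, but never recommend more than allocated.
        p95_util = statistics.quantiles(utilization_samples, n=20)[18]
        busy_gpus = p95_util / 100 * allocated_gpus
        return min(allocated_gpus, max(1, math.ceil(busy_gpus * headroom)))

    # A pod holding 4 GPUs whose utilization hovers around 20-30%:
    samples = [22, 25, 31, 18, 27, 24, 29, 21, 26, 30, 19, 23]
    print(recommend_gpus(samples, allocated_gpus=4))  # -> 2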
I’m particularly looking for feedback on our GPU recommendation engine. Is this a problem you would pay to solve? And are the metrics shown in the demo cluster detailed enough? I’m not trying to build yet another observability tool; the focus is specifically on AI and GPU waste. Your feedback would mean a lot to me.
I’ll be here to answer any technical questions!