In recent years, there has been a significant paradigm shift in the platform landscape towards containerization and its orchestration using Kubernetes (k8s). Observability in this new distributed architecture has become non-negotiable. But because of the complexity and size of these architectures, vast amounts of observability data are being generated, and all of it needs to be analyzed when issues occur. This means your teams spend more time sifting through data, which increases mean time to resolution, extends downtime and ultimately impacts the business.
K8s was initially open-sourced by Google in 2014, and the first managed offering followed on GCP in 2015. Today, every major cloud vendor offers its own variation of k8s as a service, making cluster creation almost as simple as deploying a virtual machine. However, in my experience, many companies have entered this domain without prior experience in setting up and managing these clusters. Even large enterprises often bring in outside experts to handle cluster setup and configuration, leaving the rest of the organization struggling to acquire the skills needed to support the clusters afterwards.
Observability has become even more important now that most applications are deployed to k8s: being able to see what’s going on at every layer of these large distributed architectures is essential. To standardize the management of these architectures and reduce the burden on developers (ultimately improving time to market), API gateways and service meshes, such as Kong Gateway and Kong Mesh respectively, are being introduced. These additional layers, though, rapidly increase the volume of collected data, as each layer has its own set of metrics to track and analyze. This surge in data presents a new challenge: dealing with an overwhelming amount of information and deciphering it effectively when issues arise. The result can be longer mean times to resolution and extended outages, both of which impact the business.
Robusta, a new startup out of Israel, is helping solve these issues with an outstanding tool they have built. It adds a layer on top of your observability metrics, taming the vast amounts of observability data by giving your teams a single-pane-of-glass view of all your metrics, combined in well-thought-out multi-tenancy dashboards. You can also configure Robusta to enrich your alerts with additional data to aid your issue investigations. Robusta is not trying to replace your observability stack but to enhance it.
They are taking things one step further by integrating an AI assistant into their UI (still in beta, but I have early access to the feature). The integration with large language models (LLMs) such as Azure OpenAI (chosen to ensure your company’s data is kept private) allows you to take all the context of, say, a crashing pod (pod logs, metrics, events, deployment YAML and dependent systems, for example) and pass it to the LLM to help debug the issue. This gives developers a better self-service experience and removes the need to rely heavily on the platform/operations team. As a result, it reduces mean time to resolution and improves resilience.
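To make the idea of "passing context to the LLM" a little more concrete, here is a minimal sketch of how the pieces of context for a crashing pod might be assembled into a single prompt. This is not Robusta's implementation: `PodContext`, `build_debug_prompt` and the `max_log_lines` parameter are hypothetical names invented for illustration, and the sketch deliberately stops short of calling any LLM service.

```python
from dataclasses import dataclass, field


@dataclass
class PodContext:
    """Hypothetical container for the data gathered about one failing pod."""
    name: str
    namespace: str
    logs: str = ""
    events: list[str] = field(default_factory=list)
    metrics: dict[str, float] = field(default_factory=dict)
    deployment_yaml: str = ""


def build_debug_prompt(ctx: PodContext, max_log_lines: int = 50) -> str:
    """Combine the collected context into one prompt string (illustrative only).

    Only the last `max_log_lines` log lines are kept so the prompt stays small.
    """
    lines = ctx.logs.splitlines()
    # Guard against max_log_lines <= 0, where lines[-0:] would return everything.
    log_tail = "\n".join(lines[-max_log_lines:]) if max_log_lines > 0 else ""
    events = "\n".join(f"- {e}" for e in ctx.events) or "(none)"
    metrics = "\n".join(
        f"- {key}: {value}" for key, value in sorted(ctx.metrics.items())
    ) or "(none)"
    return (
        f"Pod {ctx.namespace}/{ctx.name} is crashing. "
        "Suggest likely causes and next steps.\n\n"
        f"## Recent events\n{events}\n\n"
        f"## Metrics\n{metrics}\n\n"
        f"## Deployment YAML\n{ctx.deployment_yaml or '(not provided)'}\n\n"
        f"## Last log lines\n{log_tail or '(no logs)'}\n"
    )
```

In a real setup the resulting string would be sent to whichever LLM endpoint your organization has approved. Trimming the logs to a tail is one simple way to keep the prompt within a model's context limit.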
They have also started producing ChatGPT plugins that enable the LLM to scan your clusters and identify issues, whether misconfiguration or performance problems. There are already tools that do this, but they produce static outputs that still need to be analyzed and then actioned. Robusta’s ChatGPT plugins instead scan and analyze your cluster, then suggest what needs to change and how to make the change. I am pretty sure they are not far off being able to instruct the LLM to offer a solution in the form of a pull request too. Incredible!
AI, and LLMs in particular, will not necessarily replace engineers’ jobs in its current state. Instead, it will serve as a valuable complementary tool alongside your current observability stack. This combination helps engineers analyze the vast amount of observability data effectively, reducing the mean time to resolution. Tools like Robusta are here to help organizations that are new to the containerization landscape, as well as skilled platform teams that have become bottlenecks while trying to support large development teams. Empowering engineers to investigate and address issues independently fosters resilience in their applications while they simultaneously gain proficiency in the underlying infrastructure. As a result, organizations can recover from outages faster, limiting the business impact. It’s a win-win situation.