Tags: efficient observability
✨ TL;DR Bedrock Data secures petabytes of enterprise data across AWS, Azure, and GCP. Their observability was fragmented: a major log analytics vendor, CloudWatch, and custom scripts, each used by different people for different things.
Challenges:
– Three disconnected tools, no common workflow for troubleshooting
– No correlation between Lambda logs and metrics
– Query-heavy interfaces discouraged broad adoption
After moving to Oodle:
– One platform for logs, metrics, and alerts
– Automatic log-to-metric correlation for Lambda functions
– 50% reduction in incident resolution time
– End-to-end onboarding in under five hours
– Strongest tool adoption to date across all users
"Oodle is a fully baked, from-the-ground-up observability platform. An observability platform needs to look and feel as if it's all one integrated process. And that's what Oodle does."
– Olaf Stein, Head of Customer Success, Bedrock Data
Half of your most critical telemetry comes from infrastructure you don't own. That's the reality when your product deploys thousands of Lambda workers directly into customer AWS, Azure, and GCP accounts, and you need to know what they're doing at 11 PM on a Tuesday.
Bedrock Data builds data security posture management (DSPM): discovering, classifying, and protecting sensitive data across every major cloud and SaaS platform at a scale of many petabytes. Their deployment model makes observability unusually hard: "outposts" (Lambda-based workers) run inside customer environments, scaling to thousands per customer. On Bedrock's own side, another fleet of Lambda functions handles backend processing. Both sides ship logs to the same S3 bucket.
Like most fast-growing teams, Bedrock's observability stack had evolved alongside the product. Different tools came in at different stages, and depending on who you asked, you'd get a different answer about which one the team relied on.
Their log analytics vendor worked well enough for ingesting unstructured logs from S3, but the team was outgrowing what it offered them.
Some engineers preferred CloudWatch for its proximity to Lambda metrics. Others had built custom scripts to pull logs from S3 for ad hoc analysis. The result was three different tools, three different workflows, and no shared language for debugging.
When something went wrong, the troubleshooting played out the same way: one person would dig into their log tool, another would check CloudWatch, a third would run a custom script against S3. Then they'd jump on Zoom and try to piece findings together across different tools, different query languages, different mental models.
"If I got a report from a customer, I'd go in and look at logs in our log tool. Engineers were doing other things. If we're all using different tools and different approaches, it's really hard to collaborate asynchronously."
– Olaf Stein
The cost wasn't just time. It was money: multiple tool subscriptions, engineers pulled from feature work, and the frustration of debugging that should have been simpler.
"If you have three engineers and me on a Slack channel or a Zoom call, any hour it takes longer costs actual money. It costs money, it costs productivity, and it causes frustration."
– Olaf Stein
The bigger challenge was correlation. As Bedrock matured, they needed Lambda metrics and application logs in the same workflow: monitoring to know something is wrong, logs to figure out what. Logs were well covered, but the team needed more than logs.
The fragmentation was unsustainable. The Bedrock team knew what the next tool needed to look like, and had strong opinions about what would actually work.
Their requirements boiled down to four things:
The team evaluated a mix of established vendors and smaller players.
"I know enough about these tools and our unique requirements that in most cases, just reading the documentation tells me if something is viable or painful. Oodle checked all the boxes."
– Olaf Stein
One detail clinched it early: Oodle's S3 integration pulls logs directly from the customer's bucket, no Lambda functions deployed in Bedrock's account to do the fetching. "That's one of the main reasons I wanted to do this with you guys," Olaf said. "It's a direct pull integration. Many other vendors make you deploy Lambda functions. I would prefer not to have to do that."
During the first PoC call, the team connected log ingestion, set up enrichment rules to extract structured fields from unstructured log data, and pulled in logs, Lambda metrics, and Auth0 logs.
"It took an hour to onboard the data. Another couple of hours for log transformations in the ingest pipeline. Getting us up and running probably took four or five hours end-to-end."
– Olaf Stein
One detail that mattered was custom lookup tables that map internal customer UUIDs to human-readable names, so engineers could filter by customer without memorizing GUIDs. With their previous vendor, maintaining these mappings meant manual query upkeep. In Oodle, it's built in.
Olaf's definition of a good onboarding? "I don't have to ask a single engineer for anything." He didn't have to.
The change was immediate. Within the first week, Bedrock was resolving real incidents, not planning a months-long rollout.
Everyone in the same tool. Post-sales, sales engineers, and the engineering team now share a single platform. No more "which tool did you use to find that?" conversations.
Auto-correlation between logs and metrics. For Bedrock's Lambda-heavy backend, Oodle automatically stitches together function logs from S3 with CloudWatch metrics for the same function. Click a log line, see the corresponding invocation duration, error rate, and throttles. No manual joins, no switching tabs.
"The auto-correlation between our backend metrics and our backend logs is a great improvement to what we had before. That we can automatically stitch together different types of telemetry from the same service, that's something that in our previous world was much more cumbersome."
– Olaf Stein
No query language required. Oodle's log transforms extract the fields Bedrock's team cares about (customer ID, function name, log level) as clickable filters. Engineers search by clicking, not by writing queries. That drives faster adoption across the team. Engineers can focus on their actual work instead of learning a query language.
Faster collaborative debugging. The difference was clear during a late-night incident when a customer reported unusual outpost behavior. Three engineers jumped into a Slack channel. All three were using Oodle. They exchanged links to Oodle searches in Slack, each approaching the problem from a different angle.
Within 20 minutes they confirmed the issue. Within 35 minutes they'd identified the root cause, mitigated the problem, and notified the customer.
"We were all in the same tool. Before we did a Zoom call, we were on a Slack channel exchanging links to Oodle searches to show each other what we were each looking at. It was one of our faster conclusions."
– Olaf Stein
Performance that earns trust. Olaf runs longer time-range queries than most, looking at trends over weeks, not just the last hour. Search performance held up, even at those ranges, and at a price point that didn't force the team to choose between query speed and budget.
AI-native debugging. For an AI-native company like Bedrock, Oodle's AI capabilities are a natural fit, not a novelty. AI-driven root-cause analysis surfaces likely causes faster during incidents, and dashboards can be built in plain English. The barrier to investigation drops: anyone on the team can dig in, not just the engineers who know the query syntax.
What set Oodle apart wasn't any single feature. It was the integration.
Olaf had evaluated Grafana, and used it extensively in previous roles. His take was blunt: "Everything is very disjointed. If I had to pick between Datadog and Grafana, I would pick Datadog all day long. But of course nobody picks Datadog because it costs too much money."
Oodle hit a different spot, offering the integrated experience of Datadog at a price point a Series A startup could sustain.
"I would say Oodle is a fully baked, from-the-ground-up observability platform. I would describe it as: looks and feels a lot like Datadog, but a lot cheaper."
– Olaf Stein
The team at Oodle made the difference too. Features Olaf asked for showed up fast. Questions got answers quickly. For a growing company evaluating a newer vendor, that responsiveness was the trust signal.
Oodle's AWS Lambda monitoring
Bedrock's observability story is just getting started. The immediate priority is standardizing on OpenTelemetry by standing up an OTel collector to control what telemetry goes where. From there, the roadmap includes adding custom metrics from application code and eventually producing traces.
"OpenTelemetry is native and front and center to what you guys do. It's not bolted on, like with many of the older vendors who existed before OpenTelemetry was a thing. I want a data pipeline that is completely vendor-agnostic, and I believe with Oodle we can get there."
– Olaf Stein
There's also interest in pushing Oodle's AI capabilities deeper, not just for runtime troubleshooting but into the development cycle itself. For an AI-native company, observability that shifts left is a natural next step.
For now, though, the foundation is set. One tool. One workflow. Everyone on the same page.
"People can do the same things they were doing before, plus things we couldn't do before. The adoption curve is pointing much more upwards than with any other tool."
– Olaf Stein
Three clouds. A hundred deployments. And every time something broke, someone opened four browser tabs and hoped for the best.
Wisdom AI builds agentic analytics — natural-language insights across structured, unstructured, and MCP-connected data sources. Co-founded by Sharva (formerly founding engineer at Glean and Rubrik), the company runs a hybrid model: most customers use Wisdom as SaaS, while others deploy the full stack in their own cloud for data sovereignty. That model is great for customers. It's brutal for observability.
When Wisdom was AWS-only, CloudWatch was the default. It worked, barely. Filtering on specific labels or customer IDs was possible in theory but painful in practice. Engineers didn't want to log into CloudWatch to debug — they resorted to writing CLI scripts to pull logs and grep through them locally. As Wisdom expanded to GCP and Azure, things got worse. Logs were now in three places: CloudWatch, Cloud Logging, and Azure Monitor. Each had its own query syntax, its own filter semantics, its own alerting model. Debugging a single customer issue meant opening several cloud consoles, mentally correlating timestamps, and hoping you didn't miss the signal buried in a different account.
"Pretty much everyone I talked to was like, we need to buy a product. This doesn't look like a product."
— Sharva, CTO at Wisdom AI
The team tried self-hosted Grafana for metrics and alerting. It didn't hold up. Grafana didn't offer a unified logging story — logs and alerting lived in separate systems, so debugging still meant jumping between tools. Operationally, the self-hosted deployment was brittle: OOMs and reboots were common, the UI was sluggish under load, and the team didn't have time to tune it properly. Alert deduplication and grouping — e.g., one alert per customer instead of a flood of near-duplicates — were hard to configure. And metrics meant PromQL, which most engineers didn't want to learn just to set up a dashboard.
"learning and using PromQL is a pain, but Oodle's assistant acts as a copilot on the page that we are on and can take actions on our behalf. I just prompt it to create a dashboard containing these panels and it just works. It is magical."
— Tanish, Founding Engineer at Wisdom AI
Wisdom needed centralized observability that was cost-effective, reliable, and modern — and that could scale with 100+ deployments without requiring a dedicated platform team to babysit it.
Wisdom evaluated the usual options. Datadog was the first call, but the team's rough estimate made it non-viable from the start: with 250+ nodes and 5,000+ workloads running across AWS, GCP, and Azure, Datadog would have been 5x costlier than Oodle. The team looked at GCP's Stackdriver (now Cloud Operations) for consolidated logging, but the UX didn't meet the bar. Self-hosted Grafana had already shown its limits.
The decision criteria were clear:
Oodle fit those requirements at a fraction of the cost of Datadog.
"It's the best way to do observability. I'm surprised you get such good observability at a cheaper cost than native AWS and GCP capabilities. Usually there's a trade-off between cost and features. Here you're getting both."
— Anuj, Founding Engineer at Wisdom AI
Anuj Mittal, founding engineer at Wisdom, owns platform and infrastructure. He ran a focused proof-of-concept: one dev deployment, then scale-out. Setup was straightforward: install a Helm chart, and all telemetry signals were flowing to Oodle. The Oodle team was responsive; issues got addressed in hours, not weeks. There was no long bake-off. Oodle addressed centralized observability and cost in one move.
Migration proceeded in phases. Logs and traces moved to Oodle first; metrics and dashboards followed. All existing Grafana dashboards and alerts were automatically migrated using Oodle's Grafana migration tool. In-product links within the Wisdom product that previously pointed to CloudWatch or GCP Logs Explorer now point to Oodle, so engineers start and finish debugging in one place.
"it was very easy to just set it up. We started with one dev deployment, set it up there, saw that it was helping"
— Anuj, Founding Engineer at Wisdom AI
Today, almost every engineering team at Wisdom uses Oodle for logs and traces. Adoption extends beyond engineering — support and other roles use Oodle when they need to debug customer issues.
PromQL was the biggest friction point on the metrics side. Most engineers at Wisdom are application developers — they don't want to learn a query language just to build a dashboard or set up an alert. The Oodle AI assistant changed that. Engineers describe what they want in plain English or paste metric definitions from code, and the assistant suggests PromQL queries and explains the nuances — delta vs. increase, rate windows, label selectors.
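To make the rate-window and increase-vs-rate nuances concrete, here is an illustrative pair of Prometheus recording rules of the kind the assistant might suggest; the metric and rule names are hypothetical, not Wisdom's actual metrics:

```yaml
groups:
  - name: example-request-rules        # illustrative group name
    rules:
      # Per-second request rate, averaged over a 5-minute window
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Absolute number of requests added over the last hour
      - record: job:http_requests:increase1h
        expr: sum by (job) (increase(http_requests_total[1h]))
```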
Anuj described a moment during the transition: he needed to recreate a Grafana dashboard in Oodle. Instead of looking up PromQL syntax, he used the AI assistant and had it done in minutes. No docs. No Stack Overflow.
Wisdom also uses Oodle's cost and usage attribution to see which metrics drive cardinality and cost. That visibility let them trim low-value sources — broad Kubernetes and Envoy metrics they didn't actually need — reducing cost without losing signal.
The first thing engineers noticed after moving to Oodle was how little friction there was in the logs experience. Filters, labels, and the fields you actually care about surface in a few clicks — no hunting through menus or writing query syntax to slice by customer, cloud, or service.
The command palette (Cmd+K) takes it further. Need to show or hide a field across every log line, or add a filter mid-investigation? Instead of clicking through menus, hit Cmd+K, type what you want, and it just works.
"If I'm looking at logs and I need to show or hide a field, or add a filter — I don't need to do that manually. I just press Cmd+K and type it there and everything just works. It's so seamless."
— Tanish, Founding Engineer at Wisdom AI
Oodle's Logs UI provides much simpler ways to find relevant logs, work with structured JSON logs, and filter down to specific customers. It automatically surfaces error patterns and insights from logs, making it easier to quickly notice new errors after releases.
"My experience with Oodle has been pretty good so far. I didn't think changing the logging software would affect my life so much!"
— Varun, Founding AI Engineer at Wisdom AI
The result: developer happiness went up measurably. Engineers stopped writing custom scraping scripts. They stopped context-switching between cloud consoles. They started and finished debugging in one place.
Integrations tightened the feedback loop in ways the team didn't expect.
"That MCP server that we have with Oodle — it's a game changer for us."
— Anuj, Founding Engineer at Wisdom AI
Alerts fire into Slack. An integration triggers the Oodle Slack Bot to fetch the logs relevant to that specific alert and post them into the same Slack thread. Previously, on-call engineers had to infer which job or service failed, then hunt for logs across one or more cloud consoles — often 10–15 minutes of toil before they even understood the problem. Now the logs appear in Slack before they open the alert.
Engineers also use the Oodle MCP server in Cursor: when debugging a failing job or unexpected behavior, they ask the AI in Cursor to pull logs or telemetry from Oodle directly in their IDE.
"I personally use Oodle's MCP server a lot. It is a game changer. I don't need to manually add filters and look at logs, understand them, and then root cause the issue - I let Cursor do it for me. The tools in the MCP server are very intuitive as well, so any agent can easily use it without needing to add much context."
— Tanish, Founding Engineer at Wisdom AI
Before Oodle: Services across AWS, GCP, and Azure each sent logs and metrics to their own native tool. A self-hosted Grafana instance tried to aggregate metrics but had reliability issues. Engineers switched between three or four UIs and wrote ad-hoc scripts for cross-customer debugging.
After Oodle: All three clouds send logs, metrics, and traces into Oodle via vmagent and OTLP. A single workspace provides dashboards, alerts, log search, and DB monitoring. The AI assistant, MCP integration (Cursor), and Slack alerting sit on top of one unified data plane.
Cost. The largest savings were on traces. CloudWatch charges per trace; Oodle's volume-based model gave Wisdom roughly 4–5x lower trace costs. Logs were also significantly cheaper. Without Oodle, Wisdom would have been forced to cut trace volume or reduce sampling to control spend — at a stage where granular tracing matters most. Instead, they kept full fidelity and stopped worrying about cost.
"if we had not switched to Oodle at this point of time, we would have been thinking about cutting down on those traces, reducing and figuring out like, is it really needed and all of that. And given that we are at the growth stage, we don't really want to focus on those things at this point of time. We want to just focus on building new things rather than figuring out - can I drop this to reduce costs?"
— Anuj, Founding Engineer at Wisdom AI
Developer experience. The UI stays responsive even under load, unlike the old Grafana setup. The Kubernetes view — clusters, pods, health across 20+ clusters — gives a single pane for infrastructure visibility.
"Going from the old setup to Oodle felt like going from a rudimentary product to a proper product."
— Sharva, CTO at Wisdom AI
AI as a force multiplier. The AI assistant and MCP server removed the expertise barrier from observability. Engineers who never learned PromQL now build dashboards and alerts in minutes. On-call engineers who used to spend 10–15 minutes hunting for logs now get them pushed to Slack automatically — or pull them into Cursor mid-debug without leaving their editor. The net effect: observability went from a platform team's burden to something every engineer uses daily, without training.
"AI analysis on alerts is powerful for identifying problematic nodes and patterns across service logs"
— Varun, Founding AI Engineer at Wisdom AI
Cross-customer and cross-cloud debugging. Queries that used to require custom scripts or were simply impossible — "which customers are seeing delayed responses?" or "is this tied to a specific LLM provider?" — are now standard. Run a query in Oodle with the right filters and dimensions. No more ad-hoc scripts. No more cross-account log scraping.
"We've run customer instances on AWS, Azure, and GCP, and it is painful to go to individual cloud consoles. Oodle helps with the multi cloud use case by offering a single pane of glass"
— Guilherme, Co-Founder at Wisdom AI
Wisdom plans to deepen their use of DB monitoring and rely more on Oodle's AI features. They're also exploring LLM observability as their agentic product evolves — and we're building alongside them.
Today, we’re excited to announce a new integration between Bindplane and Oodle.ai — combining an AI-driven, OpenTelemetry-native telemetry pipeline with an AI-native observability platform built for extreme scale.
With Bindplane acting as the control plane for telemetry and Oodle.ai providing AI-powered analysis across logs, metrics, and traces, you get a single, intelligent, vendor-neutral pipeline from raw telemetry to actionable insight.
The biggest issue customers face is not exporting and storing telemetry data. It’s the complexity of managing pipelines, keeping costs and data volume in check, and ultimately making sense of the data once it reaches a destination.
Together we solve different sides of the same problem.
Bindplane’s Pipeline Intelligence learns normal data flow patterns, detects anomalies like drops or spikes, and recommends or automates routing and processing changes, reducing cost and latency without sacrificing control.
Oodle.ai’s AI Assistant then navigates across logs, metrics, traces, and events to spotlight root causes, generate dashboards, and accelerate investigations using plain English. It’s AI embedded into the telemetry lifecycle, truly end to end.
Oodle.ai is an enterprise-grade, AI-native observability platform delivering fantastic developer experience at open-source economics.
Unlike legacy tools or open-source wrappers, Oodle runs on a custom-built serverless architecture powered by object storage, built to handle individual loads of 20+ TB of telemetry daily.
Why Oodle.ai stands out:
Built by the team behind Rubrik, Amazon S3, DynamoDB, and Snowflake, Oodle.ai rethinks observability for the AI era.
Bindplane is the AI-driven telemetry pipeline for modern security, observability, DevOps, and SRE teams.
Built entirely on OpenTelemetry, Bindplane gives you full ownership and control over how telemetry is collected, processed, secured, and routed, across any environment, source, or destination.
With Bindplane, you get:
You own the pipeline. Always.
Bindplane gives you control. It is the control plane for telemetry flow across your environment. Bindplane ensures data is collected, processed, and routed consistently and at scale, reducing noise and lowering cost before data ever leaves your environment.
With Bindplane, you can optimize telemetry costs in a few ways:
Oodle.ai gives you clarity. It applies AI-native debugging across logs, metrics, and traces, at a fraction of the cost. It’s cheap enough to ingest telemetry you’d usually send to cold storage, but still lets you query and debug in real time.
With Oodle.ai, you get even more cost saving benefits:
Together, Bindplane and Oodle.ai deliver lower cost, faster debugging, and long-term telemetry that remains accessible.
With the new Oodle.ai destination, Bindplane can collect, transform, and route OpenTelemetry data directly into Oodle. You still keep the flexibility to route the same telemetry to multiple observability or SIEM platforms in parallel.
This integration enables:
Bindplane becomes the single control plane for telemetry flowing into Oodle.ai, and anywhere else your data needs to go.
Follow these steps to get started:
Configure the Oodle destination with your instance and API key headers, X-OODLE-INSTANCE and X-API-KEY. Once connected, Bindplane lets you visually design and control your telemetry pipeline, from any source to Oodle.ai, with full governance and safety built in (a configuration sketch follows the steps below).
1. Go to Configurations → Create Configuration
2. Give it a name, select the Agent Type, and Platform
3. Add a telemetry generator source to simulate traffic
4. Add the Oodle Destination
5. Add an Agent to the configuration
6. Start a Rollout to validate the configuration
7. Add processors for filtering, sampling, masking, enrichment, batching, etc.
As soon as your configuration is rolled out, telemetry begins flowing into Oodle.ai, already filtered, enriched, and managed by Bindplane.
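Bindplane manages the underlying OpenTelemetry Collector configuration for you, but purely as an illustration of where the two headers above fit, an OTLP exporter sketch could look like the following; the exporter name, endpoint placeholder, and pipeline layout are assumptions, not Bindplane's actual generated config:

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

exporters:
  otlphttp/oodle:                             # hypothetical exporter name
    endpoint: https://<your-oodle-endpoint>   # placeholder; supplied by Oodle
    headers:
      X-OODLE-INSTANCE: <your-instance-name>
      X-API-KEY: ${env:OODLE_API_KEY}         # keep the key out of the file

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlphttp/oodle]
```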
From there, you can:
The result is faster investigations, clearer signals, and less time spent chasing noise.
Bindplane and Oodle.ai are both designed for enterprise environments where scale, governance, and compliance are non-negotiable.
Bindplane supports:
Together, Bindplane and Oodle.ai deliver a single, intelligent telemetry pipeline for observability, security, and AI-driven operations.
We’re continuing to expand the Bindplane integration ecosystem to help teams build scalable, vendor-neutral telemetry pipelines. Want to see a specific integration added? Let us know in the Bindplane Slack Community!
👉 Try the Bindplane + Oodle.ai integration today.
👉 For more guidance on configuring the Oodle destination in Bindplane, you can read the documentation here.
👉 To learn how to configure Bindplane for Oodle, read more in the Oodle documentation.
You can watch Oodle's unified AI-assisted debugging experience on YouTube.
Production is down! You open your laptop and start the familiar dance:
An API timeout?
Check metrics for the spike, switch to logs for the error, open traces for the slow dependency — three tools, three contexts, one incident.
And after all this, pray 🙏
This isn't a workflow. It's whack-a-mole with browser tabs.
The truth is, alert debugging today is broken. It's fragmented across tools, dashboards, and tribal knowledge. Even when you can afford premium observability tools, debugging an incident means:
One incident. Three tools. Three contexts. One exhausted engineer burning through error budgets while fighting the tooling instead of the problem.
Here's the irony: we've accepted that AI can build entire features or applications. Yet when production breaks, we're still manually correlating metrics, grepping through logs, and squinting at trace waterfalls like it's 2015.
You can't ask your observability system a simple question - "Why is checkout latency spiking?" - and have it reason through the data for you.
Instead, only the most experienced engineers can connect the dots. Junior engineers are left staring at charts, guessing, losing hours trying to find the needle in the haystack. Why can't AI debug? Or at least suggest the possible root cause?
The problem isn't the data - you're collecting more telemetry than ever. The problem is that your observability tools weren't built for the AI era.
Oodle reimagines observability from first principles - both in architecture and experience.
At its foundation is a serverless architecture that separates storage and compute, giving you enterprise-grade observability at the cost of open-source solutions. On top of this foundation sits a unified, AI-native interface that brings together metrics, logs, and traces in one place.
You can ask Oodle questions in plain English - "Which service caused this alert?" Oodle's AI assistant analyzes metrics, logs, and traces from relevant services, stitches together all signals, and filters them down to the relevant context for your alert. It combines the best of Grafana and OpenSearch across all signals, made possible by our unique architecture.
In this blog post, we'll use the OpenTelemetry Demo application to show you how easy it is to debug incidents with Oodle across a distributed system - and how the AI does the work just like an experienced SRE would.
The OpenTelemetry (OTel) demo consists of several interconnected microservices written in multiple languages that simulate an online shopping experience. It can be run in your Kubernetes cluster (or in minikube) or as a Docker Compose application.
Note:
We have installed it in our playground cluster, so you can explore it without installing it yourself. If so, you can skip directly to the Your AI Alert Command Center section.
You can install it in your environment by following the steps below:
Once data is being ingested, you can explore it through:
As part of the above setup, we also provision a few alerts to help you explore the unified debugging experience. Let's look at that next.
Let's see Oodle in action with a real alert from the OTel demo running on our playground: we'll debug the OTel Demo - API Errors alert.
Instead of the usual tab-hopping marathon, your alert debugging journey starts with a single click on the Insights button. The Insights view launches an agentic AI assistant that automatically troubleshoots the alert — we'll see how that works later in this blog. First, let's explore how Oodle's unified interface makes it easy to manually explore metrics, logs, and traces for relevant services.
The OTel demo has 21 services (your real production environment would likely have many more), so the first challenge is figuring out which services are even involved in the incident. Oodle's Service Graph makes this instant - you can see the involved services, how they interact, and their golden signals at a glance.
✨ Here, Oodle has identified the frontend, recommendation, checkout, and payment services as relevant to the alert. You can also see that the high error rate in frontend is correlated with the high error rate in the checkout service.
But Oodle goes further. It performs automatic correlation analysis across the dimensions (labels) in your alert metric. This means you can quickly identify whether the issue is isolated to specific regions, customers, pods, or any other dimension — without manually filtering and comparing charts across multiple tools.
✨ The specific pod which is facing the errors is frontend-66cbd69c8c-kczp5.
Infrastructure health is often the first data point to examine when debugging an alert. Are services restarting? CPU-starved? Slowly leaking memory?
Instead of opening Grafana, finding the right dashboard, adding service filters, and adjusting time ranges - it's all already filtered for you. One click shows you infrastructure health for the relevant services in the context of your alert.
✨ For this alert, CPU and memory seem to be perfectly fine; thus, we can rule out an infrastructure issue.
If it's not an infrastructure issue, application logs usually hold the clues. But searching through logs across multiple services and timestamps is tedious.
Oodle automatically filters logs for the relevant services and time range of your alert. You can spot error patterns and anomalies instantly, without manually crafting complex queries or copy-pasting timestamps.
✨ We can see the error Error: 13 INTERNAL: Error: Product Catalog Fail Feature Flag Enabled in the frontend container logs around the time of the alert.
Is this a one-off alert, or is there a broader impact due to increased errors or latencies?
Traces filtered to your alert context make this trivial to answer. You can quickly see load, error, latency trends for the relevant services and time range of the alert.
✨ Here, Error and Latency have spiked around the alert time. We can also see a bunch of red spans in the span distribution.
We can drill down further by selecting a rectangle around the impacted time range of the alert in the Span Distribution chart. Oodle performs a correlation analysis across the span attributes in the traces to find what's different in the selected region versus the rest of the traces.
✨ You can quickly see an internal error bubbled up in the insights: Product Catalog Fail Feature Flag Enabled.
Now, let's see the magic we mentioned earlier. Oodle's agentic AI assistant automatically troubleshoots the alert by analyzing metrics, logs, and traces from relevant services in the context of the alert.
Here's where it gets interesting: The AI reasons through the problem just like an experienced engineer would. It creates a debugging plan, breaks it down into specific tasks (check error rates, analyze latency patterns, correlate with deployments), and executes them all in parallel. You can click on each task to see its thought process and execution results.
The AI queries your telemetry data intelligently — fetching data for specific services, filtering by errors and latency based on the alert definition, and correlating findings across all signals. At the end, it scores the findings from all tasks and surfaces the most relevant insights.
The result? What would take an engineer 30 minutes of tab-hopping and manual correlation happens in seconds. And you can ask follow-up questions in plain English: "Which deployment caused this?" or "Are other regions affected?"
✨ The AI assistant identified that Product Catalog Fail Feature Flag Enabled is leading to errors in the checkout API, with a specific product being impacted: failed to get product #"OLJCESPC7Z". This ties well with our manual investigation through the unified experience.
Note:
The AI assistant is also available as a sidebar on all UI pages in Oodle to assist you with common tasks, such as summarizing logs, creating alerts, and updating Grafana dashboards.
Debugging production incidents shouldn't feel like archaeological work across three different tools. With Oodle's unified, AI-native experience, you get:
Ready to experience the future of observability? Explore our full feature set at our playground, or sign up for free to try Oodle with your own data.
4.11.2025 15:43 Meet Oodle: Unified and AI-Native Observability
This post explores what effective CDN observability looks like. It explains how Varnish, the content-delivery software trusted by Tesla, Walgreens, Emirates, Sky, and others, exposes telemetry through OpenTelemetry and why it built its standard dashboards on Oodle AI (see Varnish's integration post).
It highlights key metrics, shows how they appear in dashboards, and describes how this integration simplifies observability for distributed CDN and edge environments.
CDN observability isn't just about tracking requests per second or cache hit ratios. At the edge, engineers deal with high concurrency, global latency variance, backend failures, and dynamic cache churn, often all at once.
A small configuration change can shift terabytes of traffic or trigger cache invalidations across multiple points of presence. To manage that safely, you need visibility into every layer: client requests, cache behavior, backend fetches, queue lengths, and resource saturation.
Varnish’s architecture exposes detailed, structured data about its internal state. Its observability isn’t bolted on; it’s built in.
Here's why engineers value Varnish's Observability:
Sustained high-throughput performance: Handles hundreds of TBs of traffic daily with stable tail latency, exposing queue depth, thread utilization, and cache efficiency so engineers can tune for consistent speed at scale.
Programmable edge control: Fully programmable caching and routing via VCL, with instant feedback through built-in metrics like backend latency, hit ratio, and invalidation frequency.
Bandwidth efficiency through visibility: Makes origin offload measurable. The detailed cache metrics let teams verify bandwidth savings and detect regressions as workloads change.
Hardware-level insight: Surfaces CPU and NIC counters that power 100 Gbps-class throughput, giving rare visibility into how hardware performance affects edge latency.
Hybrid and private-CDN readiness: Often used as an origin shield or private CDN tier; telemetry spans edge and origin for unified visibility across multi-CDN or hybrid setups.
Together, these characteristics make Varnish observability-ready without additional instrumentation.
Varnish's built-in tools (varnishstat, varnishlog, and extensions like vmod_accounting) provide deep insight into cache behavior. With the varnish-otel exporter, this data becomes portable and standardized. By emitting metrics, logs, and traces over OpenTelemetry, Varnish gives operators the freedom to choose their observability destination without lock-in or custom agents.
Oodle AI complements that perfectly by natively supporting OpenTelemetry ingestion. This makes observability easier to use, eliminates vendor lock-in, and lets engineers focus on improving their systems rather than managing tools. That flexibility is what makes the Oodle integration so straightforward: you can point varnish-otel at Oodle and start visualizing data immediately - as detailed in Brian Stewart's blog post.
Oodle AI is an AI-native observability platform designed for modern, high-cardinality telemetry, the exact kind of data Varnish generates at scale. It provides enterprise-grade performance at open-source cost, unifying metrics, logs, traces, and anomaly detection in one place.
Here’s why it fits CDN workloads particularly well:
5× cost efficiency: Makes high-cardinality metrics and logs affordable and sustainable. (See pricing calculator)
The Varnish team created a standard dashboard organized around the five pillars most teams rely on. These sections help SREs and platform engineers focus on the metrics that actually move the needle, balancing visibility with clarity.
For more details, check out this blogpost by Varnish.
Dashboards show you what’s happening; alerts make sure you know when it matters. Using Oodle’s native PromQL-based alerting, teams can define precise thresholds or anomaly rules for backend latency, cache efficiency, and error rates scoped by domain, POP, or backend label.
Example: Trigger an alert when avg_over_time(varnish_backend_req_time_seconds[5m]) > 0.5
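As a rough sketch, the same threshold could be expressed as a Prometheus-style alerting rule; the group, alert, and label names are illustrative, while the metric comes from the example above:

```yaml
groups:
  - name: varnish-backend-latency            # illustrative group name
    rules:
      - alert: VarnishBackendLatencyHigh     # illustrative alert name
        expr: avg_over_time(varnish_backend_req_time_seconds[5m]) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Average backend request time above 500ms for 5 minutes"
```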
Because Oodle is AI-native, related anomalies are automatically surfaced, so when a latency spike occurs, Oodle can highlight linked metrics such as rising queue length or degraded cache hit ratio. This transforms alerting from simple thresholding into actionable diagnosis. Pro-tip: you can configure alerts through Terraform [docs].
Oodle is now one of the trusted destinations for Varnish telemetry, alongside Datadog, Grafana, Splunk, and New Relic, giving CDN and edge operators flexibility without complexity. Our experience with Varnish's data model and high-cardinality metrics ensures smooth ingestion, clean dashboards, and meaningful alerting from day one.
And for Oodle users exploring caching or edge acceleration, Varnish itself is worth a closer look: it's one of the rare systems engineered with observability as a core principle.
Varnish users can explore a preconfigured dashboard and example alerts on Oodle’s playground, or follow the integration guide in Oodle’s documentation to connect varnish-otel.
You can reach us at support@oodle.ai.
Varnish Software is a global leader in high-performance caching and content delivery. Trusted by leading streaming services and enterprises worldwide, Varnish optimizes the delivery of digital content, APIs, and applications. Its customizable platform empowers businesses to scale efficiently, reduce costs, and deliver outstanding user experiences across video, web, and cloud environments.
3.11.2025 15:45 CDN Observability with OpenTelemetry - Varnish + Oodle AI
Fello AI has a bold mission: flip the traditional real estate model on its head. Instead of pushing agents to buy expensive leads, they help more than 20,000 real estate and mortgage professionals turn their own databases into 24/7 profit engines.
The numbers tell the story of explosive success - from startup to major player faster than most companies dream of.
But behind every scaling success story lies an engineering challenge.
The breaking point came as customer data volumes exploded. What worked at startup scale started failing as the company grew.
Observability had become the bottleneck preventing them from focusing on what mattered: building the future of real estate technology.
Something had to change.
Like most scaling startups, Fello started by evaluating the obvious choices. But each solution they tested revealed new problems—sometimes worse than the original.
Suman realized he wasn't just shopping for cheaper, more comprehensive observability. He wanted an innovation partner constantly pushing the boundary and obsessed with the problem they solve.
When they discovered Oodle, multiple advantages became clear:
No maintenance overhead, no scaling headaches
50% cheaper, S3 backend architecture, affordable 90+ day retention
Easily find what you need to debug in one place
Anomaly detection that competitors couldn't match
Onboarding completed within hours
PagerDuty integration "pulled off within a couple of days"
The final validation came when Fello's engineers talked to Oodle's team. Engineer to engineer, same language, shared obsession with solving hard problems.
"Our engineers love to talk to your engineers" Suman explains.
Fello began by running Oodle alongside CloudWatch to validate the product.
"Within hours after deciding on the pilot, we went flying." says Suman.
Hours, not weeks. No enterprise migration complexity, no custom-tooling, no months of planning. Just simple integration via open source standards.
But would engineers actually adopt the new system? The team had muscle memory with CloudWatch. The principal engineers made a strategic decision: observe for validation, then execute a hard switchover.
The results exceeded expectations. Rather than friction and learning curves, engineers found Oodle intuitive. No training required—it was "plug and play." The UI made all the difference. Where CloudWatch demanded query language expertise, Oodle offered intuitive filters that any engineer could master immediately.
"Someone who doesn't have knowledge of query language can also query logs by just adding filters in the UI," explains Vinay. "They are able to figure out everything by themselves."
What happened next was a fundamental shift in how Fello's engineering team operates. Oodle made comprehensive observability natural - unified dashboards pulling from different sources, seamless alerting pipelines, and the flexibility to debug from anywhere.
"It has been very seamless, we don't need to worry about observability anymore. We focus primarily on our business now," says Ashish.
The transformation touched every aspect of their engineering operation:
Fello's engineering team didn't just switch observability tools - they empowered new and senior engineers, eliminated knowledge silos, and built the foundation for sustainable scaling.
"As we scale, the costs are in control. At the same time, we have a better product." - Suman
The real differentiator though was the relationship itself.
"The best thing about Oodle is the team. Anything I have issues with, they're always there to help me out" says Ashish.
This isn't typical vendor-speak. When Fello needed AWS Marketplace integration for easier procurement, Oodle's team built it within days. When feedback was provided on features, changes happened in real-time.
For a team achieving 185% Y-o-Y growth, this isn't just vendor selection—it's choosing a growth partner.
"We see Oodle as an extension to us right now. That's a great partnership to be in."
Ready for observability partnership that scales with your fast-growing business? Schedule a demo
10.9.2025 15:54 How Fello achieved 100% observability adoption with unified UX at half the cost
Say you are a modern software engineer, used to having relevant information at your fingertips, and you are grappling with questions about your Observability tools:
“Why am I still relying on static dashboards in an era where everything else is dynamic and easy to use?”
“Why is my work day just filled with creating and updating metric dashboards for triaging incidents?”
“Are pre-built dashboards just relics of the past, limiting real-time decision making?”
“Can Metric dashboards be ephemeral - created, used and discarded dynamically - just like logs and traces?”
You have started thinking about these because you have just finished your on-call rotation, where the same story has unfolded for the umpteenth time. It looks something like this:
The above itself looks quite daunting; in reality, however, there are a few more hurdles you may face depending on the state of Observability, deployment systems, and organization size and practices at your company:
OnCall burnout is a real problem, and it grows as your organization grows. The Observability journey at any company starts modestly: when the scale is low, it is sufficient to monitor the entire system via a handful of static dashboards. Subsequently, you start creating relevant alerts to notify you when things go wrong. As the company grows, the number of deployments/services grows and your observability dashboards start to follow Conway's Law - each team maintains its own set of dashboards to monitor their frequently visited metrics. As the company grows further, a central Platform/SRE team gets established and introduces much-needed hygiene around standardisation of common metrics, dashboards, and alerts. Subsequently, a tight-rope balancing act starts in multiple dimensions - ease of debuggability v/s managing the costs, availability & reliability v/s alert noise, feature development v/s on-call bandwidth, and so on.
We ask a simple question - does every company need to go through a similar Observability journey? What happens when you don't take a dashboard-first, alert-first approach, and instead expect your Observability tools not just to provide building blocks for monitoring your system's availability and reliability, but to monitor them for you and help you troubleshoot when they are impacted? The ultimate value of Observability data is knowing when your system's availability or reliability is impacted and being able to troubleshoot and mitigate faster.
Let us try to answer these questions by listing the core components which will be needed to break free from the current way of using Observability tools.
“Wouldn’t it be great to have relevant telemetry data collected automatically and always?”
Collecting observability data from your systems should not be a day-1 problem; adding instrumentation to your service should not be a release-readiness task to be completed after the entire feature is built. OpenTelemetry's auto-instrumentation uses eBPF to automatically collect Observability data from your applications. eBPF-based auto-instrumentation removes the reliance on developers to instrument applications, which helps standardise observability data and makes its collection comprehensive. https://ebpf.io/applications/ provides a rich set of applications built on top of eBPF - some of them related to Observability.
Dynamic instrumentation also helps ensure that telemetry data is consistent across signals - metrics, logs, traces, etc. Removing reliance on developers to instrument applications means we can apply standardised semantic conventions at the source of data generation.
“Why can’t my Observability tool tell me when my systems are unhealthy?”
In today's Observability systems, alert definitions are static. As an engineer, you define what you want to get alerted on, what thresholds to set, and which channels to notify. This worked well when you had monolithic applications with their associated Ops teams, as it was all under the purview of the Ops team - they were both the creators and the recipients of alerts. With modern cloud-native microservice architectures, though, static alert definitions don't suffice. In the example alert investigation shared at the beginning of this post, wouldn't it save so much time and effort if an alert were raised on the marketing-service that introduced excessive load on the promotions-service? As the complexity of modern applications increases, you need a bird's-eye view of what's unhealthy right now. Netflix's Telltale comes to mind as an example of what an intelligent alerting system could look like.
“Is the future of Observability dashboard-less? What does it look like?”
Modern applications have cloud-native architectures where the infrastructure itself is dynamic - your pods scale up and down based on incoming load, and you can spin up clusters in a different cloud or region quickly. In such a world, why are we still creating and maintaining static dashboards? With the amount of Observability data increasing, static dashboards only show the tip of the iceberg; most of your observability data is not visible in them and remains hidden, because engineers find it too much friction to create or update these dashboards. Wouldn't it be great to have access to all your observability data at your fingertips? Recent advancements in AI and LLMs show promise to change the way we consume Observability data - now you can just type in what you want to see, and Observability tools should be able to understand your intent and show you relevant data within seconds.
“Can AI-driven, context-aware debugging approaches replace traditional means of debugging altogether?”
Our current way of debugging an alert is primarily dependent on manually looking at various dashboards, logs, traces and connecting the dots. The runbooks associated with alerts are also static - most of the time they could be outdated or don’t cover the exact scenario which is happening for the current alert incident.
A recent report from PagerDuty shows that a large part of incident response tasks are still manual and extremely time consuming.
With a rich set of semantically correlated Observability data being available, one can apply ML/AI models to connect the dots and generate dynamic dashboards specifically catered to the incident being investigated. You should be able to look at the inter-connected web of telemetry data (metrics, logs, traces, service dependencies, upgrade history) in the context of the alert being triaged.
At the end, I would like to leave you with a question:
“Is the future of Observability dashboard-less? Will it be replaced by a dynamic, context-aware debugging experience?"
24.3.2025 06:38 Dashboards are Dead!
Prometheus is an open-source monitoring solution that provides a streamlined way to store metrics data, query metrics using PromQL, and set up alerting. It has become the de-facto standard for monitoring infrastructure and applications.
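For reference, a minimal Prometheus configuration is only a few lines; the job name and target address below are illustrative:

```yaml
global:
  scrape_interval: 15s          # how often Prometheus pulls metrics from targets

scrape_configs:
  - job_name: "api"             # illustrative job name
    static_configs:
      - targets: ["localhost:8080"]   # endpoint exposing /metrics
```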
Getting started is simple: run the prometheus binary with a configuration like the one above and you're ready to ingest and query metrics. It uses a single-node setup rather than a clustered architecture, making it simple to deploy but introducing scalability limitations that we'll discuss later. Prometheus pulls metrics from configured targets at each scrape_interval.
Prometheus is a great starting point if you are just getting started with Observability. However, the reasons which make it easy to set up and operate at low scale are also the reasons why it can become challenging to operate Prometheus at scale. To understand the limitations of Prometheus, we need to build a good mental model of how it stores data and what affects its scalability.
Prometheus fundamentally stores all data as time series: streams of timestamped values belonging to the same metric and the same set of labeled dimensions.
For example, a time series with the metric name api_http_requests_total and the labels method="POST" and handler="/messages" could be written like this:
api_http_requests_total{method="POST", handler="/messages"}
For each time series, Prometheus stores a (timestamp, value) pair for the associated metric and set of labels at each scrape_interval. By default, Prometheus keeps all time series in memory for two hours and flushes them to local on-disk storage at the end of each two-hour interval.
Understanding Prometheus's data model helps us see that its memory usage directly correlates with the number of active time series it scrapes. This leads us to the cardinality explosion problem. In the above example, if method can contain four possible values - GET, PUT, POST, and DELETE, and handler can contain ten possible values, the total cardinality of api_http_requests_total becomes 40 (4*10). If you then add a status label with two possible values (success or failure), the total cardinality increases to 80 (4*10*2). High cardinality metrics can easily occur when you have dynamic labels or when the cross-product of multiple label cardinalities becomes large. Since the number of time series is determined by unique label combinations across all metrics, label cardinality directly impacts the Prometheus server's memory usage.
This highlights a key limitation of Prometheus: The memory required by a Prometheus server directly correlates with both the number of time series it scrapes and their associated scrape intervals. As the number of time series grows, vertical scaling of the Prometheus server becomes necessary.
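One common way teams cap series growth is to drop known high-cardinality metrics (or labels) at scrape time; here is a minimal sketch using Prometheus's metric_relabel_configs, with an illustrative metric name:

```yaml
scrape_configs:
  - job_name: "api"
    static_configs:
      - targets: ["api:8080"]
    metric_relabel_configs:
      # Drop an illustrative high-cardinality metric before it is ingested
      - source_labels: [__name__]
        regex: http_request_duration_seconds_bucket
        action: drop
```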
As mentioned earlier, Prometheus moves data older than 2 hours (by default) to local on-disk storage. Prometheus further runs compaction on the data sitting on disk to merge it into larger blocks. The reliance on local on-disk storage impacts the reliability and durability of a Prometheus setup, as noted in Prometheus's Storage documentation:
Note that a limitation of local storage is that it is not clustered or replicated. Thus, it is not arbitrarily scalable or durable in the face of drive or node outages and should be managed like any other single node database.
In addition to the scalability limitations of on-disk storage, it also adds operational overhead:
PromQL allows slicing and dicing of data across various time ranges and labels. It allows complex queries which may require scanning through a lot of data. Given the single-process architecture of Prometheus, heavy queries may require excessive CPU or memory, causing OOMs on the Prometheus server or making it slow for other queries.
Having examined the key limitations of single-node Prometheus setups, let's explore various solutions for scaling Prometheus.
To address Prometheus's limitations in handling large numbers of time series, one approach is to shard the time series data across multiple Prometheus servers. This can be achieved by dividing the data based on teams, clusters, service groups, or other logical boundaries. While this allows running multiple Prometheus servers with each handling its own subset of time series data, it introduces new challenges.
The main drawback is the loss of centralized query capability across all metrics. Users need to know which metrics are stored on which servers and query them accordingly. Additionally, this approach prevents joining data from different metrics at query time if they're stored on different servers.
Prometheus's Federation feature enables one Prometheus server to scrape selected time series from another. This partially addresses the central visibility challenge by allowing a central Prometheus server to pull aggregated metrics from functionally sharded servers, providing an aggregate global view.
For example, you might set up multiple per-datacenter/region Prometheus servers that collect detailed data (instance-level drill-down), alongside global Prometheus servers that collect and store only aggregated data (job-level drill-down) from those local servers. This architecture provides both aggregate global views and detailed local views.
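A minimal sketch of the federation scrape job described above, adapted from the pattern in the Prometheus documentation; the server names and the match[] selector are illustrative:

```yaml
scrape_configs:
  - job_name: "federate"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        # Pull only pre-aggregated, job-level recording rules from local servers
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - "prometheus-dc1:9090"   # illustrative per-datacenter Prometheus servers
          - "prometheus-dc2:9090"
```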
However, this approach still doesn't enable querying across all raw data.
Since neither Functional Sharding nor Federation fully solves central visibility or addresses data durability concerns, let's explore solutions that leverage remote storage.
Thanos is an open-source project that extends Prometheus with long-term storage capabilities. Built on top of Prometheus, Thanos enables object storage as a long-term storage solution for Prometheus data. It offers three key benefits:
Thanos maintains the same data format on object storage as Prometheus uses for its on-disk storage. By enabling unlimited storage, Thanos allows for unlimited data retention. While it solves the central visibility problem by allowing unified querying across multiple Prometheus servers and object storage, Thanos still depends on individual Prometheus servers for data collection and recent data serving. This means you'll still need to handle capacity planning and functional sharding of Prometheus servers, along with their associated operational overhead.
Thanos includes a query frontend that can cache results and split long time range queries into smaller ones. While this improves query performance against object storage, heavy queries (especially those spanning long time ranges) may still experience slower performance.
Source: https://thanos.io/tip/thanos/quick-tutorial.md/
Cortex provides a horizontally scalable, multi-tenant, long-term storage solution for Prometheus. It accepts metrics data via the Prometheus remote-write protocol and stores them in object storage (similar to Thanos). Unlike Thanos, Cortex eliminates the need for Prometheus servers to serve recent data since all data is ingested directly into Cortex. It also adds multi-tenancy features and allows for configuring various quotas and limits per tenant. However, like Thanos, query performance for long-range queries may be slower due to the need to fetch data from object storage.
Source: https://cortexmetrics.io/docs/architecture/
Grafana Mimir has an architecture similar to Cortex's and started out as a fork of Cortex due to licensing issues. Grafana uses the Mimir architecture in its Grafana Cloud offering.
Victoria Metrics offers a high-performance, open-source time series database and monitoring solution that can serve as Prometheus remote storage. Unlike Thanos and Cortex, it uses its own storage engine on local disks rather than object storage, and its streamlined architecture makes it simpler to set up and operate than Cortex.
We've explored how Prometheus excels as an easy-to-deploy metrics monitoring system at small scale, while examining its limitations at larger scales. We've also reviewed several open-source alternatives that address these limitations. Here's a summary of the solutions discussed:
| Dimension | Prometheus Federation | Thanos | Cortex / Mimir | Victoria Metrics |
|---|---|---|---|---|
| Functional Sharding | ✅ | ✅ | Not Needed | Not Needed |
| Global Query View | 🟠 | ✅ | ✅ | ✅ |
| Data Durability | 🔴 | 🟠 | ✅ | 🟠 |
| Query performance | 🟠 | 🟠 | 🟠 | ✅ |
| Unlimited storage | 🔴 | ✅ | ✅ | 🔴 |
| Operational Overhead | 🟠 | 🟠 | 🟠 | 🟠 |
| High Cardinality | 🔴 | 🔴 | ✅ | ✅ |
In the next part of this blog series, we will look at how Oodle solves the scalability issues of Prometheus in detail. Stay tuned!
10.3.2025 04:12 Scaling Prometheus: From Single Node to Enterprise-Grade Observability
The beginning of our journey to build a team for our startup was a pivotal chapter filled with excitement and vision. Fueled by the passion to create something extraordinary, we set out to find like-minded individuals who share our ambition and values. Each new conversation and connection was a step closer to assembling a team that would not only bring diverse expertise but also a collective effort to navigate challenges and celebrate triumphs. Together, we became more than just a team – we became a crew of dreamers ready to take on any storm and toast to every victory, no matter how small.
In the first quarter of 2024 (and the last quarter of 2023), we built an awesome founding team with people who've built and launched world-class products like Rubrik, Amazon S3, DynamoDB, SignalFx, and Snowflake.
Oodle is located in San Jose, California. Setting up a new office space can be challenging...
The early stages of product development involved a ton of experimentation, resulting in exciting and optimistic revelations!
After numerous successful tests, we had our architectural design for a custom datastore: Serverless + S3, truly cloud native.
By separating storage and compute, and leveraging serverless architecture for queries, we found that we could reduce costs by 10x! And using caching methods, we're able to deliver fast performance at high scale. In Q3 and Q4, we developed Oodle Insights. On top of the cost efficiency and performance, we knew we needed to push further and utilize AI-powered insights to assist in debugging.
We spent countless hours preparing for Oodle's launch, doing our best to make sure our core product was solid and building an open playground to showcase our proof of concept.
When we launched on Hacker News, thousands of people tried out our playground demo and we gained really valuable feedback!
At the very start, we decided that Oodle would be fundamentally simple and easy to use, ensuring a great user experience in today's era of complicated observability tools. Another core belief is full open source compatibility with no lock-in for our customers.
Our very first offering is a drop-in replacement for Prometheus: fully managed and highly scalable, with zero management overhead. It takes three simple steps to get going, and then all your existing dashboards and alerts just continue to work.
We redesigned Alerts to make them more intuitive...
We've launched simple and transparent pricing: based on a single dimension.
We believe simplicity is the cornerstone of great product design. By stripping away unnecessary complexity, we're ensuring a product that is intuitive and enjoyable to use. Simplicity empowers users to focus on what matters most, creating a worry-free experience.
PromCon EU 2024 was a premier event for the Prometheus community, bringing together developers and industry experts to share insights and innovations in open-source observability. We had the opportunity to speak on Oodle's innovative cost-efficient architecture and how we've leveraged serverless architecture to achieve speed and cost-efficiency.
DevOps Days is a global series of community-driven conferences focusing on software development, SRE, IT infrastructure operations, and the intersection between them. In 2024, both Nashville and Boston brought together professionals to share new ideas.
KubeCon + CloudNativeCon North America 2024, the flagship conference of the Cloud Native Computing Foundation (CNCF), took place from November 12 to 15, 2024, at the Salt Palace Convention Center in Salt Lake City, Utah. KubeCon gathers adopters and technologists from leading open-source and cloud-native communities for four days of education and advancement in cloud-native computing.
KubeCon + CloudNativeCon India 2024, the inaugural Indian edition of the Cloud Native Computing Foundation's (CNCF) flagship conference, featured 56 sessions, including keynotes, lightning talks, and breakout sessions, with contributions from 12 CNCF project maintainers. We were exhausted by the end of the event with many of us barely able to walk, but excited by the reception of the demos from the booth visitors.
At PromCon EU, we had the exciting opportunity to share our innovative architecture design focused on achieving unparalleled cost-efficiency in observability. We detailed how Oodle's unique approach leverages advanced compression techniques, storage optimizations, and seamless scalability to reduce costs by over 10x compared to traditional observability solutions.
We also had the opportunity to share our groundbreaking approach to implementing a serverless Prometheus. In the talk, we detailed how our method addresses the inherent limitations of traditional Prometheus setups, on scalability, operational overhead, and cost management.
Our serverless architecture decouples metric ingestion, storage, and querying, leveraging cloud-native technologies such as serverless functions, object storage, and stateless compute resources. This design eliminates the need for dedicated infrastructure to manage Prometheus, drastically reducing management complexity and operational costs.
Our article on Go Profiling was featured in SRE Weekly, where we covered techniques for leveraging Go's built-in profiling tools and how to optimize goroutine performance to reduce response times. A few weeks later we had a second article featured in SRE Weekly, about GoLang as your Swiss Army knife, where we talked about how we leverage Go all the way from application development to infrastructure management, and how that helped a lot with consistency and feature velocity. Finally, our article on Go Faster was featured in CNCF KubeWeekly, where we talked about how to design GoLang code for performance-sensitive applications.
At SREDay, we were thrilled to present our cutting-edge approach to metrics observability for the cloud era, focusing on how to cut costs and complexity by leveraging serverless architecture and S3-based storage. This session resonated with attendees looking for modern solutions to address the escalating challenges of observability in dynamic cloud environments.
Our design partnership with Cure.fit has been instrumental in shaping our platform to handle unicorn scale customers with 4000+ employees. As a fitness and health-tech giant with millions of active users and complex data streams, Cure.fit provided a unique environment for us to test and refine our observability platform.
Working with Fello gave us valuable insights into the critical importance of simplifying observability and the need for centralized monitoring and alerting in fast-growing organizations. As a platform managing real estate transactions, listings, and customer interactions, Fello required a solution that could provide clear visibility into their operations while reducing complexity for their engineering teams.
Our mission for 2025 is to radically simplify the troubleshooting experience with AI, empowering engineering teams to resolve issues faster and with greater confidence. Traditional observability tools often involve complex configurations, manual data analysis, and time-consuming debugging processes, which can delay incident resolution and impact productivity. We aim to change that by leveraging AI to transform how teams approach debugging.
We're aiming for greatly improved Insights functionality in our dashboards and alerts...
Oodle Insights will have advanced interaction, providing tighter AI assistance in the debugging process...
We're launching logs and traces and combining these perspectives into Insights as well...
We're also improving our integrations experience as well as expanding it...
And we're designing new ways to help you cut through alert noise and get to the most important issues first...
As we reflect on 2024, it’s been an exciting and enlightening journey for Oodle—a year filled with innovation, collaboration, and growth. From forging impactful partnerships to pushing the boundaries of observability, we’ve gained invaluable insights and made significant strides in transforming how teams monitor and debug their systems.
As we embark on 2025, we’re filled with optimism and ambition, eager to build on this momentum, tackle new challenges, and continue empowering organizations to resolve incidents radically faster with incredible cost-efficiency. Here’s to another year of growth and groundbreaking achievements!
6.2.2025 21:54 Oodle.ai - 2024 Year in Review
At Oodle's inception, we faced a common dilemma: choosing the right technology stack to get started. With a small team proficient in Go and a big vision, we needed a language that could handle everything from application development to infrastructure management. After careful consideration, we chose Go, and it has proven to be our Swiss Army knife for modern development. Here's why.
Picture juggling multiple languages and frameworks across your stack. Many teams live this reality: Python for scripting, JavaScript or TypeScript for frontend, HCL (Hashicorp Configuration Language) for Terraform, YAML for Kubernetes configs, Bash for automation, and traditional languages like Java, C#, or Ruby for backend. Each language comes with its own quirks, dependencies, and learning curves, and every addition to the stack adds to that burden.
While some of this complexity is unavoidable, we can minimize the challenges.
We simplified most of these pain points by choosing Go as our primary language across infrastructure, tooling, and application code.
Our project structure now looks something like this:
|- infrastructure
|- pulumi
|- aws
configuration.go
deployments.go
|- kubernetes
|- helm
|- charts
cluster.go
|- src
|- app
|- project
|- collector
|- compactor
|- query
|- util
|- containers
set.go
|- tools
ds_build.go
ds_deploy.go
YAML and JSON are ubiquitous formats for defining infrastructure and application configurations. While they're human-readable and widely supported, they come with a few drawbacks. These formats lack type safety, which makes it easy to introduce subtle errors, such as using strings instead of numbers or mismatching units (e.g., "1000m" vs "1"). JSON doesn't support comments, and neither format supports code reuse or validation at write time. As configurations grow larger, maintaining consistency becomes challenging, and simple typos in indentation or key names can lead to hard-to-debug issues. Furthermore, these formats don't provide any built-in way to handle environment-specific variations or inheritance, often resulting in significant duplication across environments. Go is an excellent choice for defining configurations, except in cases where configurations need to be modified without going through a code build and deploy cycle.
Let's examine how a traditional YAML configuration looks and how we can improve it using Go.
Here's an example in YAML:
deployments:
prod-01:
region: us-east-1
properties:
timeout: 30s
retries: 3
replicas: 3
memory: 2Gi
cpu: 1000m
prod-02:
region: us-west-2
properties:
timeout: 45s
retries: 5
replicas: 5
memory: 4Gi
cpu: 2000m
prod-03:
region: eu-west-1
properties:
timeout: 60s
retries: 4
replicas: 4
memory: 3Gi
cpu: 1500m
And here is how we can improve it with Go:
// Region is a typed wrapper so deployment regions are checked at compile time.
type Region string

const (
	RegionUSEast1 Region = "us-east-1"
	RegionUSWest2 Region = "us-west-2"
	RegionEUWest1 Region = "eu-west-1"
)

type DeploymentConfig struct {
	Name     string
	Region   Region
	Timeout  time.Duration
	Retries  int
	Replicas int
	Memory   string
	CPU      string
}

// Default resource requirements
const (
	defaultMemory = "2Gi"
	defaultCPU    = "1000m"
)
var ProdConfigs = map[string]DeploymentConfig{
"prod-01": {
Name: "prod-01",
Region: RegionUSEast1,
Timeout: 30 * time.Second,
Retries: 3,
Replicas: 3,
Memory: defaultMemory,
CPU: defaultCPU,
},
"prod-02": {
Name: "prod-02",
Region: RegionUSWest2,
Timeout: 45 * time.Second,
Retries: 5,
Replicas: 5,
Memory: defaultMemory,
CPU: defaultCPU,
},
"prod-03": {
Name: "prod-03",
Region: RegionEUWest1,
Timeout: 60 * time.Second,
Retries: 4,
Replicas: 4,
Memory: defaultMemory,
CPU: defaultCPU,
},
}
Now we can easily use ProdConfigs across our infrastructure, Kubernetes, and application stack, ensuring configuration consistency and type safety from development through deployment.
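As a rough sketch of what that reuse looks like (the helper struct and field mapping here are illustrative, not our actual deployment code, and assume the definitions above live in the same package), the same ProdConfigs entry can feed both deployment tooling and application settings:

```go
package main

import (
	"fmt"
	"time"
)

// appSettings is an illustrative application-side view of the same config.
type appSettings struct {
	RequestTimeout time.Duration
	MaxRetries     int
}

func main() {
	// Infrastructure code can size the deployment...
	cfg := ProdConfigs["prod-02"]
	fmt.Printf("deploy %s to %s: %d replicas, %s CPU, %s memory\n",
		cfg.Name, cfg.Region, cfg.Replicas, cfg.CPU, cfg.Memory)

	// ...while application code consumes the same values with full type safety.
	settings := appSettings{RequestTimeout: cfg.Timeout, MaxRetries: cfg.Retries}
	fmt.Printf("app settings: %+v\n", settings)
}
```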
The debate between static and dynamic typing has no clear winner. While both approaches have their merits, we are in the camp of "Static Typing Where Possible, Dynamic Typing When Needed," as argued in the paper Static Typing Where Possible, Dynamic Typing When Needed: The End of the Cold War Between Programming Languages. Static typing isn't just about catching errors; it's about building maintainable systems. It helps us catch issues at compile time rather than in production and makes our codebase more navigable and readable through better tooling support and explicit contracts in modern IDEs.
Consider an example utility that retries an operation a few times before giving up.
package utils

import (
	"fmt"
	"time"
)

// WithRetry runs operation up to maxAttempts times, backing off a little
// longer after each failure, and returns the last error if all attempts fail.
func WithRetry[T any](operation func() (T, error), maxAttempts int) (T, error) {
	var result T
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		result, err = operation()
		if err == nil {
			return result, nil
		}
		if attempt < maxAttempts {
			time.Sleep(time.Second * time.Duration(attempt))
		}
	}
	return result, fmt.Errorf("failed after %d attempts: %w", maxAttempts, err)
}
This implementation is now re-used across our infrastructure code, application code, testing utilities, and deployment tools. Otherwise we would have to write this utility in each of the languages used for that respective stack - not to mention managing the dependencies and testing for each language implementation.
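For instance, here's a minimal sketch of calling it from application code. The endpoint is purely illustrative, and the import path simply follows the illustrative go.mod shown later in this post:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"

	"github.com/company/project/utils" // assumed module path
)

func main() {
	// Retry a flaky HTTP GET up to 3 times; the URL is a placeholder.
	body, err := utils.WithRetry(func() ([]byte, error) {
		resp, err := http.Get("https://example.com/healthz")
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return nil, fmt.Errorf("unexpected status: %s", resp.Status)
		}
		return io.ReadAll(resp.Body)
	}, 3)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("fetched %d bytes\n", len(body))
}
```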
While using Go for tooling, we invariably need to execute CLI tools like kubectl, helm, and docker. We use Go's native os/exec package to run external commands.
import (
	"context"
	"os"
	"os/exec"
)

// RunHelmChart runs `helm upgrade --install RELEASE CHART` for the given chart.
func RunHelmChart(ctx context.Context, chartName, releaseName, namespace, valuesFile string) error {
	cmd := exec.CommandContext(ctx, "helm", "upgrade", "--install", releaseName, chartName, "--namespace", namespace, "--values", valuesFile)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

// Usage
err := RunHelmChart(ctx, "my-chart", "my-release", "default", "values.yaml")
We've built similar deployment tooling with Go that handles building Docker images, pushing to ECR, and managing Kubernetes deployments.
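A sketch of what that tooling can look like, using the same os/exec pattern (the package name, image reference, and function are illustrative; registry authentication, e.g. `aws ecr get-login-password | docker login`, is assumed to have happened already):

```go
package tools

import (
	"context"
	"fmt"
	"os"
	"os/exec"
)

// buildAndPush builds a Docker image from dir and pushes it to a registry,
// shelling out to the docker CLI the same way the helm wrapper above does.
func buildAndPush(ctx context.Context, dir, imageRef string) error {
	steps := [][]string{
		{"docker", "build", "-t", imageRef, dir},
		{"docker", "push", imageRef},
	}
	for _, args := range steps {
		cmd := exec.CommandContext(ctx, args[0], args[1:]...)
		cmd.Stdout = os.Stdout
		cmd.Stderr = os.Stderr
		if err := cmd.Run(); err != nil {
			return fmt.Errorf("%v failed: %w", args, err)
		}
	}
	return nil
}
```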
Managing database schemas becomes straightforward with Goose, a database migration tool that lets you manage schema changes through SQL or Go functions.
package migrations

import (
	"context"
	"database/sql"
	"log"

	"github.com/pressly/goose/v3"
)

func init() {
	// Register the Go migration with goose so goose.Up can run it.
	goose.AddMigrationContext(upCreateTableUser, downCreateTableUser)
}

func upCreateTableUser(ctx context.Context, tx *sql.Tx) error {
	// "user" is a reserved word in Postgres, so the table is named "users".
	_, err := tx.Exec(`
		CREATE TABLE users (
			id SERIAL PRIMARY KEY,
			name VARCHAR(255) NOT NULL,
			email VARCHAR(255) UNIQUE NOT NULL,
			created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
		);`)
	if err != nil {
		log.Printf("error creating users table: %v", err)
		return err
	}
	return nil
}

func downCreateTableUser(ctx context.Context, tx *sql.Tx) error {
	_, err := tx.Exec("DROP TABLE IF EXISTS users")
	if err != nil {
		log.Printf("error dropping users table: %v", err)
		return err
	}
	return nil
}
And here's how we manage migrations in Go:
package main

import (
	"database/sql"

	"github.com/pressly/goose/v3"
	_ "github.com/lib/pq"
)

func migrateDB() error {
	// Connection string shown for illustration; use your real DSN and secrets.
	db, err := sql.Open("postgres", "postgres://user:pass@localhost:5432/mydb")
	if err != nil {
		return err
	}
	defer db.Close()

	if err := goose.SetDialect("postgres"); err != nil {
		return err
	}
	// Apply all pending migrations from the ./migrations directory.
	return goose.Up(db, "migrations")
}
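If you'd rather not ship a separate migrations directory alongside the binary, goose can also read migrations from an embedded filesystem via go:embed. A minimal sketch using goose v3's SetBaseFS, with an illustrative directory layout:

```go
package main

import (
	"database/sql"
	"embed"

	"github.com/pressly/goose/v3"
	_ "github.com/lib/pq"
)

//go:embed migrations/*.sql
var embeddedMigrations embed.FS

func migrateEmbedded(db *sql.DB) error {
	// Resolve migration files from the embedded FS instead of local disk.
	goose.SetBaseFS(embeddedMigrations)
	if err := goose.SetDialect("postgres"); err != nil {
		return err
	}
	return goose.Up(db, "migrations")
}
```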
Goose keeps our schema changes versioned alongside the rest of our Go code, with migrations written in either SQL or Go functions.
Gone are the days of node_modules hell or Python's virtual environment confusion. Go modules provide clear dependency management:
```go
module github.com/company/project
go 1.21
require (
github.com/aws/aws-sdk-go-v2 v1.24.0
github.com/spf13/cobra v1.8.0
)
```
In addition, Go's dependency vendoring provides extra reliability by letting you store third-party packages directly in your project's repository. Committing dependencies alongside your source code in a vendor directory (via `go mod vendor`) keeps your project buildable even if external package repositories become unavailable due to deletion, renaming, or other issues, and it lets you fully reproduce builds and audit the exact code that went into any past release, even when the original repository is no longer available.
GoLand provides powerful features that make development in Go a breeze.
You can read more about how we use Go to write performant, maintainable and scalable applications in Go faster! and Go Profiling in Production.
Using Go across our entire stack has been transformative. It's not just about using a single language—it's about choosing the right language that's powerful enough to handle everything we throw at it. From infrastructure to application code, Go has proven to be our reliable Swiss Army knife, making our development process more efficient and enjoyable. The consistency in our tooling, the ability to share code across different parts of our stack, and the simplicity of maintaining a single-language ecosystem have allowed our small team to move fast and build with confidence. While some of these benefits may seem trivial on their own, they add up significantly when building and maintaining a large codebase. When you find a language this good, sometimes the best strategy is to "Go all in."
Are you passionate about building high-performance systems and solving complex problems? We're looking for talented engineers to join our team. Check out our open positions and apply today!
25.12.2024 11:49 Go All the Way: Why Golang is Your Swiss Army Knife for Modern Development