Case Study

30% to 80%: The Testing Overhaul That Changed How Galileo Ships

Eversynced & Galileo

May 19, 2026

This is the story of how an embedded team reshaped how an entire engineering organization thinks about quality — from 17 tests running in 30 minutes to 300+ tests running in 24.

Galileo is an AI evaluation and observability platform for GenAI applications. It gives teams the tooling to measure, monitor, and improve LLM-powered systems in production — covering evaluation metrics, hallucination detection, and runtime monitoring.

When the Eversynced team joined Galileo's engineering organization, test coverage sat at 30%. On paper, that number looked reasonable. In practice, it masked something more serious: virtually all of that coverage lived in one layer — UI end-to-end tests — leaving API and SDK entirely untouched. The suite was slow, flaky enough that it had become difficult to rely on, and offered almost no signal about where failures actually originated.

The starting point: 30%, but not where it counted

The 30% figure looked tolerable at a distance. Up close, it told a different story. Coverage was concentrated entirely in UI end-to-end tests, a layer designed to catch visual and interaction regressions, not contract bugs or SDK integration failures. The suite was slow: 17 tests took 30 minutes to run. Failures were ambiguous: when something went red, there was no fast answer for whether it was a real bug, a flaky test, or an environment issue.

There was a testing culture in the sense that developers were already writing tests. This wasn't a case of building from nothing. But without the lower layers in place, all that effort was funneling into brittle suites. The work wasn't to create a culture — it was to redirect one.

The judgment call: go in the opposite direction

The obvious response to a failing E2E suite is to fix the failing tests. The team made a different call: stop chasing E2E volume, and invest instead in separating the layers entirely.

The reasoning was straightforward. E2E failures are mysteries: when one turns red, you can't tell whether the problem is in the UI, the backend, the SDK, or the test itself. API failures are not mysteries — they fail fast, they fail precisely, and they tell you exactly where the problem is. The same is true of SDK tests. Expanding into those layers first was the structural investment that would make quality measurable.
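To illustrate why a lower-layer failure is easier to read, here is a minimal sketch of an API-level contract check. The payload shape and the field names (`run_id`, `metric`, `score`) are hypothetical and are not taken from Galileo's API. The point is only that the failure message names the exact field that broke the contract.

```python
# Hypothetical contract: field name -> expected Python type.
# These fields are illustrative assumptions, not Galileo's real schema.
EXPECTED_FIELDS = {"run_id": str, "metric": str, "score": float}


def contract_violations(payload):
    """Return a list of human-readable contract violations (empty if none)."""
    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            actual = type(payload[field]).__name__
            problems.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return problems


# A payload that honors the contract and one that breaks it in two places.
good = {"run_id": "r-1", "metric": "hallucination", "score": 0.12}
bad = {"run_id": "r-2", "score": "0.12"}

print(contract_violations(good))
print(contract_violations(bad))
```

A check like this needs no browser and no full stack, so a failure points at one field in one response rather than at the whole application.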

Pushing for more E2E coverage would have been the easier path. Investing in the lower layers did not pay off immediately, but it was the single decision that made everything else possible.

Getting developers to engage: the tooling had to earn its trust

The resistance wasn't cultural. The blockers were practical: without fast, reliable tests, it was difficult for developers to justify investing time in the suite — a 30-minute pipeline that fails for mysterious reasons doesn't earn that investment. It trains developers to ignore it.

The levers that actually moved the needle were concrete and visible:

A flakiness dashboard showing unstable tests broken out by component, so every team could see exactly which tests were unreliable and where.

A coverage dashboard broken down by layer (API, SDK, UI), so gaps became visible to the whole team, not just QA.

Pipeline execution time reduced from 30 minutes to 24 minutes.

Expanded coverage in API and SDK, where failures are fast and precise.
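The case study does not say how flakiness was measured. One common definition, assumed here, is that a test is flaky if it both passed and failed against the same commit. The sketch below groups such tests by component. The record shape and all names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One execution of one test (hypothetical record shape)."""
    test_id: str
    component: str
    commit: str
    passed: bool


def flaky_tests_by_component(records):
    """Map each component to the tests that both passed and failed on one commit."""
    outcomes = defaultdict(set)  # (component, test_id, commit) -> set of outcomes seen
    for r in records:
        outcomes[(r.component, r.test_id, r.commit)].add(r.passed)

    flaky = defaultdict(set)
    for (component, test_id, _commit), seen in outcomes.items():
        if seen == {True, False}:  # mixed results on identical code
            flaky[component].add(test_id)
    return {component: sorted(tests) for component, tests in flaky.items()}


records = [
    RunRecord("t1", "metrics", "abc", True),
    RunRecord("t1", "metrics", "abc", False),  # same commit, different outcome
    RunRecord("t2", "metrics", "abc", True),
    RunRecord("t3", "sdk", "abc", False),
    RunRecord("t3", "sdk", "abc", False),      # consistently failing, not flaky
]
print(flaky_tests_by_component(records))
```

Under this definition a consistently failing test is reported as a failure rather than as flaky, which keeps the two signals separate.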

The positioning mattered as much as the tooling. The goal was to be a consultant, not a gatekeeper. Developers usually need support writing tests, and that means time in PRs, pairing sessions that walk through why a test was flaky and how to make it reliable, and code reviews.

Having internal advocates who'd already experienced the before-and-after was worth more than any top-down mandate.

The pipeline architecture: per-component, parallel, targeted

The architectural shift that made everything else scalable was a move from a single monolithic suite to a per-component pipeline. Each component is its own job. Within each job, four tests run in parallel. Instead of one slow suite hammering the full stack, you have targeted jobs that fail independently and report independently — so when something breaks, you see immediately which component failed and on which layer.
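The shape of one such job can be sketched as follows, assuming each test is a plain callable that raises on failure. The function name `run_component_job` and the report format are hypothetical. The actual pipeline presumably runs inside a CI system rather than a single Python process.

```python
from concurrent.futures import ThreadPoolExecutor


def run_component_job(component, tests, max_workers=4):
    """Run one component's tests, at most `max_workers` at a time.

    `tests` maps a test name to a zero-argument callable that raises on failure.
    """
    def run_one(item):
        name, test_fn = item
        try:
            test_fn()
            return name, "passed"
        except Exception as exc:  # record the failure instead of aborting the job
            return name, f"failed: {exc}"

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = dict(pool.map(run_one, tests.items()))
    return {"component": component, "results": results}


def check_ok():
    pass


def check_broken():
    raise ValueError("boom")


report = run_component_job("metrics", {"test_ok": check_ok, "test_broken": check_broken})
print(report)
```

Because each job returns its own report, one component's failure does not hide or delay the results of another.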

The pipeline runs against the staging environment every six hours rather than on each PR. That design was intentional: scheduled runs against the integrated state of the app catch real integration issues rather than hypothetical ones, and they avoid blocking PRs on flakiness.

Aegis: reporting as a release gate

Test results live in a custom internal dashboard called Aegis. For every run it surfaces tests passed, failed, and flaky; total duration; volume; failures broken out by layer; and the responsible team and component for each failure. That last detail is what turns a metrics dashboard into an accountability tool.
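A per-run summary with those fields could be assembled roughly as below. This is a sketch of the aggregation only, not Aegis itself. The `Outcome` record and the key names are assumptions, and the duration here is summed test time, which differs from wall-clock time when tests run in parallel.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Result of one test in one run (hypothetical record shape)."""
    name: str
    layer: str      # "api", "sdk", or "ui"
    component: str
    team: str
    status: str     # "passed", "failed", or "flaky"
    seconds: float


def summarize_run(outcomes):
    """Aggregate one run into the fields a dashboard row would show."""
    status_counts = Counter(o.status for o in outcomes)
    failures = [o for o in outcomes if o.status == "failed"]
    return {
        "passed": status_counts["passed"],
        "failed": status_counts["failed"],
        "flaky": status_counts["flaky"],
        "volume": len(outcomes),
        "total_test_seconds": sum(o.seconds for o in outcomes),
        "failures_by_layer": dict(Counter(o.layer for o in failures)),
        # The owner of each failure is what makes the report actionable.
        "failure_owners": [(o.name, o.team, o.component) for o in failures],
    }


run = [
    Outcome("login_flow", "ui", "auth", "platform", "flaky", 0.5),
    Outcome("score_schema", "api", "metrics", "evals", "passed", 1.5),
    Outcome("client_retry", "sdk", "client", "sdk-team", "failed", 2.0),
    Outcome("client_auth", "sdk", "client", "sdk-team", "failed", 1.0),
]
summary = summarize_run(run)
print(summary)
```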

Teams use Aegis as a release gate. The decision rule is simple: don't release until the bugs reported by automation are fixed. That shifted the pre-release conversation from "is QA done?" — a question with a subjective answer — to "what does Aegis say?" — an objective signal everyone reads the same way.
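The decision rule is simple enough to state as a function. The sketch below assumes each automation-reported bug carries a `fixed` flag. The `AutomationBug` type and its fields are hypothetical and do not describe how Aegis stores its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AutomationBug:
    """A bug reported by the automated suite (hypothetical record shape)."""
    bug_id: str
    component: str
    fixed: bool


def release_gate(bugs):
    """Return (allowed, blockers): release only when no reported bug is open."""
    blockers = [b.bug_id for b in bugs if not b.fixed]
    return (not blockers, blockers)


bugs = [
    AutomationBug("BUG-1", "metrics", fixed=True),
    AutomationBug("BUG-2", "client", fixed=False),
]
allowed, blockers = release_gate(bugs)
print(allowed, blockers)
```

Returning the list of blockers alongside the verdict keeps the answer objective: anyone reading it sees the same open items.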

What 80% actually means

When the engagement began, there were 17 tests in total and they took 30 minutes to run. Today there are 300+ tests — and they run in 24 minutes. The coverage number isn't the point. What 80% represents is that the team now covers the right layers in a smarter way — no longer dependent on UI E2E to catch every kind of regression. When something breaks, the feedback comes from the layer that actually needs the fix. An API contract issue fails in API tests. An SDK integration issue fails in SDK tests. UI only owns what UI should own.

In practice, that means faster and clearer feedback to developers. They know where the problem is, not just that there is one. Responsibility by layer is well-defined, so ownership is unambiguous and more problems are caught before they reach production.

When you rely only on E2E, a failure is a mystery. With layered coverage, failures are localized the moment they happen.

Full-stack ownership of the quality layer

The testing work required ownership across more than just test files. It spanned architecture, tooling, documentation, reporting strategy, and direct bug fixes across the Metrics domain: 75+ combined QA tickets and bug fixes.

That breadth was what made the pace possible. When a gap emerged in how flakiness was being tracked, or when pipeline performance needed attention, it could be addressed immediately, within the same team, without a handoff or a ticket queue.

What this engagement demonstrates

The testing work at Galileo demonstrates what it looks like when quality is treated as an engineering problem, not a process problem. The shift from a brittle, UI-only suite to a layered, per-component architecture didn't just move a coverage number. It changed how developers experienced testing: from something that slowed them down and gave them an unreliable signal to something that told them exactly where to look and got out of their way.

Earning that kind of trust requires getting into the work — into PRs, into the pipeline, into the tooling — and making testing demonstrably worth the effort.

That's what Eversynced brought to Galileo's engineering organization — and what we bring to every engagement.

Eversynced embeds engineers into product teams as full members, not vendors. If you're evaluating team augmentation, we'd like to talk.
