Capital One Tech - Medium

DataAgents: How we turned 9 months of analysis into 10 days

Capital One Tech — Tue, 09 Jun 2026 22:42:18 GMT

An engineering deep dive into the pattern that changed how we approach large-scale classification problems.

Every engineering team has that project sitting in the backlog. The one where someone says, “We really should analyze all of these,” and the room goes quiet. Everyone knows what “all of these” means — hundreds of entities, complex rules, no clear starting point.

For us, it was cloud resource dormancy detection.

We had around 350 distinct cloud resource types spread across AWS, Azure and Google Cloud Platform (GCP). Each type has different behavior patterns. An idle EC2 instance looks nothing like a dormant Amazon S3 bucket or an unattached Elastic IP. Detecting dormancy required understanding what “active” means for each specific resource, then writing detection logic that wouldn’t flood operations teams with false positives.

Traditional estimate: 6-9 months of expert analysis.

Actual time: 10 days.

Here’s how we did it, and more importantly, here’s the reusable pattern behind it.

The problem with large-scale analysis

Before we get to the solution, it’s worth naming the pattern that makes these projects so painful. It shows up everywhere:

Cloud resources - Which of our 350 resource types are dormant?
Data governance - Which of our 800 tables have quality issues we should monitor?
Security - Which of our access entitlements violate least-privilege principles?
Compliance - Which of our 500 policy controls need remediation?

In every case, the structure is similar — a large catalog of heterogeneous entities, entity-specific rules that don’t generalize, unknown priorities and a high cost for getting it wrong.

The traditional approach is not just slow. It’s structurally limited. You get coverage of the “obvious” cases, inconsistent logic across analysts, and tribal knowledge that evaporates when people leave. What you need is something that can assess each entity, apply consistent criteria, prioritize by confidence and document its reasoning. The DataAgents pattern gets you there.

The data foundation: Why quality data comes first

The quality of your analysis is bounded by the quality of your data.

An AI agent can reason only over what it’s given. If your data is incomplete, inconsistently structured or untrustworthy, the agent’s outputs will be too—just faster and at greater scale.

For our cloud analysis, we had a genuine advantage — an authoritative data product called cloud-asset-data-product. It contains a daily snapshot of every cloud resource across all providers, with a standardized schema. What made this data product valuable was data quality and rich metadata: formal field definitions, PK-FK relationships for cross-entity reasoning and lineage tracking for provenance and freshness. Quality data isn’t optional — it’s what makes AI analysis trustworthy.

resource_id - Unique composite identifier
resource_type - EC2 instance, S3 bucket, EIP, etc.
service_id - Service grouping
business_application_name - Ownership
data_structured_tag - JSON: configuration, tags, state flags
resource_updated_utc_timestamp - Last change timestamp

The principle: the richer and more standardized your data product, the more an AI agent can do with it. Without quality data, AI analysis is unreliable. This is not a caveat. It’s the prerequisite.
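To make the field list concrete, here is a minimal sketch of one snapshot row as a typed Python record. The field names come from the list above. The types, the ISO-8601 timestamp format, the sample values and the `state` key inside `data_structured_tag` are assumptions for illustration, not the actual schema of the data product.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ResourceRecord:
    resource_id: str                 # unique composite identifier
    resource_type: str               # e.g. EC2 instance, S3 bucket, EIP
    service_id: str                  # service grouping
    business_application_name: str   # ownership
    data_structured_tag: dict        # parsed JSON: configuration, tags, state flags
    resource_updated_utc_timestamp: datetime  # last change timestamp


def parse_record(row: dict) -> ResourceRecord:
    """Build a typed record from one raw row (all values arrive as strings)."""
    return ResourceRecord(
        resource_id=row["resource_id"],
        resource_type=row["resource_type"],
        service_id=row["service_id"],
        business_application_name=row["business_application_name"],
        data_structured_tag=json.loads(row["data_structured_tag"]),
        # Assumption: the timestamp is an ISO-8601 string expressed in UTC.
        resource_updated_utc_timestamp=datetime.fromisoformat(
            row["resource_updated_utc_timestamp"]
        ).replace(tzinfo=timezone.utc),
    )


# Made-up sample row; none of these values come from the real data product.
sample_row = {
    "resource_id": "aws:123456789012:us-east-1:i-0abc",
    "resource_type": "ec2_instance",
    "service_id": "compute",
    "business_application_name": "example-app",
    "data_structured_tag": '{"state": "stopped", "tags": {"env": "dev"}}',
    "resource_updated_utc_timestamp": "2026-01-15T08:30:00",
}
record = parse_record(sample_row)
```

Parsing the JSON column once, up front, means every later check can read state flags as plain dictionary lookups.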

The DataAgents pattern

Once you have the data foundation, the pattern is straightforward:

Authoritative data product + AI agent = DataAgent

A DataAgent is not just “AI doing analysis.” It’s a structured combination of:

An authoritative data source—the single source of truth for your domain
An AI agent that understands domain behavior, applies entity-specific rules and generates confidence-rated outputs with documented reasoning
A human-AI validation loop that catches errors and guides refinement

The output is not a spreadsheet of results. It’s a self-documenting artifact—detection logic, confidence classifications, false-positive risk assessments and plain-English reasoning for every entity.

The three-phase process

Phase 1: Broad assessment

Input the full entity catalog. Ask the agent to categorize each resource type by analysis feasibility:

Config-detectable - Dormancy is detectable from configuration data alone
Needs telemetry - Reliable detection requires additional usage signals

Phase 2: Classification and logic generation

For each entity, the agent analyzes three questions:

What indicates the target state (dormancy)? → Detection logic
How reliable is this signal? → Confidence level
What’s the false-positive risk? → Risk assessment

Output: Spark SQL detection queries and documented reasoning for every resource type.
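As a rough illustration of what one Phase 2 output could look like, the sketch below pairs a generated Spark SQL string with its confidence, risk and reasoning. The table name `cloud_asset_data_product`, the JSON path `$.state`, the resource type label and the 90-day default are all assumptions made for this example. The query text was not produced by the agent described here and has not been run against a Spark cluster.

```python
from dataclasses import dataclass


@dataclass
class EntityAssessment:
    resource_type: str
    detection_sql: str        # answers "what indicates dormancy?"
    confidence: str           # answers "how reliable is this signal?"
    false_positive_risk: str  # answers "what's the false-positive risk?"
    reasoning: str            # plain-English explanation


def build_state_based_query(resource_type: str, dormant_state: str,
                            min_age_days: int = 90) -> str:
    """Return a Spark SQL query for resources in an explicit dormant state.

    Inputs are assumed to come from a trusted internal catalog, so plain
    string formatting is used here; untrusted input would need escaping.
    """
    return f"""
SELECT resource_id, business_application_name
FROM cloud_asset_data_product  -- hypothetical table name
WHERE resource_type = '{resource_type}'
  AND get_json_object(data_structured_tag, '$.state') = '{dormant_state}'  -- assumed JSON path
  AND resource_updated_utc_timestamp < date_sub(current_date(), {min_age_days})
""".strip()


assessment = EntityAssessment(
    resource_type="ec2_instance",
    detection_sql=build_state_based_query("ec2_instance", "stopped"),
    confidence="HIGH",
    false_positive_risk="low",
    reasoning="Explicit stopped state with no updates for 90+ days.",
)
```

Keeping the query and its reasoning in one record is what lets the output double as documentation later.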

Phase 3: Deep validation (the game-changer)

This is where the approach separates itself from “run the AI and ship the output.”

Human: “Double-check all MEDIUM confidence entities one by one.”

The agent reviewed every MEDIUM classification individually, not as a batch. Some were upgraded to HIGH. Some were moved to Phase 2 because they require telemetry. The LOW confidence entities then got the same treatment. Every entity was reviewed and every decision documented.

Without Phase 3, you’re stuck at, “Let’s try some and see what happens.” With it, you get systematic validation of every entity in your catalog.

What the agent discovered that humans would miss

The most valuable outputs weren’t the easy HIGH confidence cases, which are obvious. The value was in what systematic analysis uncovered across 350 types.

The false-positive trap. Some resource types look dormant by age but are actively used without any detectable configuration changes. An S3 bucket with no config updates for 90 days might be accessed millions of times per day, because access patterns don’t touch the config. Age-based detection on these types produces false-positive rates of 40-60%. The agent identified these and moved them to Phase 2.

State-based vs. age-based detection. The agent surfaced a clean framework from the analysis: high-confidence detection uses explicit state indicators, meaning a binary state (stopped, unattached, disabled) that definitively signals dormancy. Low-confidence detection relies only on timestamps.
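The state-based versus age-based split can be sketched as a small rule over what signals a resource type exposes. The three state names are the ones listed above. The function name, its inputs and the `UNDETECTABLE` label are hypothetical, and the sketch covers only the two ends of the scale described here, not the MEDIUM tier.

```python
EXPLICIT_DORMANT_STATES = {"stopped", "unattached", "disabled"}


def confidence_for_type(state_values: set, has_timestamp: bool) -> str:
    """Rate a resource type by the kind of dormancy signal it exposes.

    state_values: the state values this resource type can report.
    has_timestamp: whether a last-updated timestamp is available.
    """
    if state_values & EXPLICIT_DORMANT_STATES:
        # An explicit binary state definitively signals dormancy.
        return "HIGH"
    if has_timestamp:
        # Only age is available, so active resources may be flagged too.
        return "LOW"
    return "UNDETECTABLE"  # hypothetical label: no usable config signal


ec2_like = confidence_for_type({"running", "stopped"}, has_timestamp=True)
bucket_like = confidence_for_type(set(), has_timestamp=True)
```

The point of the rule is that a timestamp alone never earns more than LOW, however old it is.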

Disproportionate value concentration. Just 12 HIGH confidence resource types account for 30-40% of total dormancy savings, and they come from 3.6% of resource types. Without systematic analysis of all 350, you’d find some of these, but you’d miss others.

The audit trail: Every decision documented

One of the underappreciated outputs of this approach is the reasoning column. Every entity gets a plain-English explanation of why it received its classification:

▎ HIGH Confidence: “Resource is in an explicit dormant state (stopped) and has not been updated for 90+ days. State-based detection with <5% false-positive rate. Automation-ready.”

▎ Phase 2 Reclassification: “This resource type can be actively used without configuration changes, resulting in 40-60% false-positive rate with age-based detection. Requires usage telemetry for reliable detection.”

▎ LOW Confidence: “No state indicators available. Age-based detection only, so active entities may qualify. Manual owner review required before action.”

This is not a nice-to-have. Months later, when someone asks, “Why does this logic work this way,” the answer is in the output. When a developer is assigned to maintain the detection queries, the context is self-documenting. When stakeholders ask why a resource type is flagged, the reasoning is already written.

Without AI-generated documentation, these explanations either don’t exist or live in someone’s head.

Of course, this reasoning exists only because humans prompt the AI to write it into an output field.

Human-AI partnership: The part that actually makes it work

AI made errors. We should be clear about this.

Field name casing was wrong on several generated queries.
Column names in some detection logic referenced fields that didn’t exist in the schema.
A handful of confidence levels were over-optimistic before the deep-dive review.

Every one of these errors was caught by human validation before production.
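Two of these error classes, wrong casing and nonexistent columns, are mechanical enough to check automatically before a human looks at a query. The sketch below assumes the agent lists the columns each query references, which is a hypothetical convention; the schema set reuses the field names shown earlier in this post.

```python
from typing import Dict, List, Set

SCHEMA_COLUMNS = {
    "resource_id",
    "resource_type",
    "service_id",
    "business_application_name",
    "data_structured_tag",
    "resource_updated_utc_timestamp",
}


def check_columns(referenced: List[str], schema: Set[str]) -> Dict[str, str]:
    """Map each problematic column name to a description of the problem."""
    by_lower = {name.lower(): name for name in schema}
    problems = {}
    for name in referenced:
        if name in schema:
            continue  # exact, case-sensitive match
        if name.lower() in by_lower:
            problems[name] = f"wrong casing, expected {by_lower[name.lower()]}"
        else:
            problems[name] = "not in schema"
    return problems


# Hypothetical column list from one generated query.
issues = check_columns(
    ["resource_id", "Resource_Type", "last_used_date"], SCHEMA_COLUMNS
)
```

A check like this does not replace the human review. It only removes the cheapest errors so reviewers can spend time on the confidence levels.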

AI contributes speed, pattern recognition and consistency at scale. Humans contribute strategic direction, quality control and domain validation. The 18-27x speed improvement is the net result after human validation, not in spite of it.
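The 18-27x range is consistent with the 6-9 month estimate and the 10-day actual, if a month is counted as 30 days. The month length is an assumption made here to reproduce the figure.

```python
DAYS_PER_MONTH = 30  # assumption used to reproduce the stated range


def speedup(estimated_months: float, actual_days: float = 10) -> float:
    """Ratio of the traditional estimate to the actual elapsed time."""
    return estimated_months * DAYS_PER_MONTH / actual_days


low_end = speedup(6)   # 6 months against 10 days
high_end = speedup(9)  # 9 months against 10 days
```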

Where this pattern applies

The DataAgents pattern works for any domain with these properties:

A large catalog of heterogeneous entities
Entity-specific rules and thresholds (one-size-fits-all doesn’t work)
An authoritative data product/source with a standardized schema
A need for confidence-based prioritization and documented reasoning

Beyond cloud resources, the same pattern applies to:

Data and governance - Generate data quality rules across hundreds of tables. Classify PII and sensitive data at scale. Detect schema drift with documented rationale.
Risk and compliance - Detect policy violations across entity catalogs. Review access entitlements. Map regulatory controls to technical implementations.
Security - Assess security posture across resource configurations. Identify misconfigured services with ranked confidence.

The pattern is reusable. The data product changes. The agent changes. The output structure adapts. But the three-phase process (broad assessment, classification with logic generation, deep validation) applies every time.

What you need to try this

Before you start:

Identify your highest-value analysis problem
Confirm you have an authoritative data product/source with standardized schema for that domain
Define what “target state” looks like for your entities (what does dormant/at-risk/non-compliant mean?)

The minimum viable setup:

A data product/source you trust
An AI agent with enough context about your domain (schema, sample data, domain documentation)
A human validator who knows the domain and can challenge the outputs

The investment: 10 days of focused work for something that would otherwise take 9 months. Most of that time is in Phase 3: the deep validation loop. Don’t skip it.

The bottom line

AI doesn’t replace human expertise. It amplifies it.

What changed with DataAgents is that humans can now spend their time on judgment calls and validation instead of manually analyzing entity 47 through entity 350. The AI capability is available. What’s often missing is the structured data foundation that makes it reliable.

If you have that foundation, you have more analytical capability available to you right now than you probably realize.

Originally published at https://www.capitalone.com.

This blog was authored by Ram Manohar Bheemana, Senior Lead Data Engineer, Cloud Radar. Ram Bheemana is a Senior Data Engineer at Capital One, where he builds cloud-scale data products that give engineers and business teams real-time visibility into cloud resources across AWS, Azure and GCP. His recent work includes developing the DataAgents pattern — an approach that pairs authoritative data products with AI agents to automate large-scale analysis tasks that once took months, and pioneering Spark Streaming architectures that have delivered over a million dollars in annual cost savings. Ram is passionate about the intersection of data engineering and AI — not just using AI as a tool, but rethinking how data pipelines and intelligent agents can work together to solve problems at enterprise scale. Outside of work, he enjoys mentoring engineers, contributing to community initiatives and staying curious about what’s next in the data and AI space.

DISCLOSURE STATEMENT: © 2026 Capital One. Opinions are those of the individual author and are not necessarily those of Capital One. Unless noted otherwise, Capital One is not affiliated with, nor endorsed by, any third parties mentioned and is not responsible for the content or privacy policies of any linked third-party sites. Any trademarks and other intellectual property used or displayed are property of their respective owners.

DataAgents: How we turned 9 months of analysis into 10 days was originally published in Capital One Tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

UVA School of Engineering: Capital One Fellows 2026–2027

Capital One Tech — Fri, 05 Jun 2026 13:08:47 GMT

Capital One and the University of Virginia celebrate the 2026–2027 engineering fellowship awardees.

Capital One is advancing AI innovation through long-term partnerships with top research institutions

In October 2025, we announced our deepening partnership with the University of Virginia (UVA) via a new $4.5 million initiative with the UVA School of Engineering and Applied Science. This includes a $2 million investment in the Capital One AI Research Neighborhood, matched by UVA for a total of $4 million, plus $500,000 for a new Ph.D. Fellowship Awards program. Through these investments in AI research infrastructure and individual scholarship, we are focused on advancing specialized scientific knowledge, fostering industry-academia collaboration and developing impactful research co-publications, ultimately allowing us to meaningfully advance science.

We recently concluded the selection process for our fellowship awards. We are excited to share our Ph.D. Fellows for the 2026–2027 academic year:

Zhepei Wei

Zhepei Wei is a Ph.D. candidate in the Computer Science department at the University of Virginia, advised by Professor Yu Meng. Zhepei has held research positions at Microsoft Research, Meta and Amazon, working on large language models (LLMs). His first-authored research papers have been published in top-tier venues in the fields of machine learning and artificial intelligence (e.g., NeurIPS, ICML, ICLR) with over 1,400 citations on Google Scholar. He is also a recipient of the UVA Copenhaver Charitable Trust Bicentennial Fellowship and the John A. Stankovic Outstanding Graduate Research Award.

Zhepei’s core research interest lies in the learning foundations of LLMs and their applications to practical problems (e.g., question answering, logical reasoning, agentic workflows), with a focus on efficiency, trustworthiness and generalizability. His work enables fast, verifiable and autonomous LLM-powered systems that can be deployed at financial scale, including efficient fraud scoring and credit analysis, grounded compliance review with traceable citations and robust automated workflows for financial operations.

Zhenyu Lei

Zhenyu Lei is a Ph.D. student in electrical and computer engineering at the University of Virginia, advised by Professor Jundong Li. Before joining UVA, he earned his B.S. in physics (Honors Graduate) from Xi’an Jiaotong University.

His research centers on making LLM reasoning more efficient and reliable. Specifically, he works on reasoning distillation, which compresses powerful models into smaller ones that retain strong reasoning ability, and on reasoning editing, which corrects failures in model behavior and enables LLMs to reason reliably over their lifetime. Building on this foundation, he is excited to extend his work to the financial domain, where efficient and trustworthy reasoning is critical for real-world decision-making. He has published over twenty papers at leading venues including ICLR, AAAI, and ACL, with three oral presentations.

Congratulations to our award recipients!

Originally published at https://www.capitalone.com.

Multi-Tenancy: The Non-Negotiable Foundation for Scaling Platforms

Capital One Tech — Thu, 28 May 2026 15:39:39 GMT

Seamlessly manage shared infrastructure while ensuring strict logical isolation for data, security and performance.

In today’s software ecosystem, we no longer build one-off applications. We build platforms: tools meant to scale and to support multiple teams, customers and use cases. And for any platform to scale successfully, multi-tenancy is not just an optimization; it’s a foundational requirement.

Whether you’re building a back-end service for internal teams or launching the next big SaaS product, multi-tenancy should be a foundational design principle, not an afterthought. Yet many engineering teams overlook this at the beginning, which often leads to expensive rewrites, fragile extensions and deeply tangled architectures.

In this post, I’ll walk through what multi-tenancy means, why it matters and the importance of robust tenant isolation, and I’ll explore various implementation strategies.

What is multi-tenancy?

At its core, multi-tenancy is a software architecture where a single instance of the software system serves multiple tenants. A tenant could be defined as a team, a customer, a department or even an individual user. A good multi-tenant system ensures that the data, compute, configurations and workflows of the tenants are logically (or physically) isolated.

In single-tenant models, each tenant gets a completely separate instance of the platform. This is often easier to implement at the beginning but harder to scale.
In multi-tenant models, tenants share infrastructure but remain isolated in terms of data, security and usage.
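A minimal sketch of the multi-tenant idea at the data layer: one shared store instance, with every read and write scoped by a tenant identifier. The class and method names are hypothetical, and an in-memory dictionary stands in for real shared storage.

```python
class TenantScopedStore:
    """One shared store; every operation must name the tenant it acts for."""

    def __init__(self):
        self._data = {}  # tenant_id -> {key: value}

    def put(self, tenant_id: str, key: str, value) -> None:
        self._data.setdefault(tenant_id, {})[key] = value

    def get(self, tenant_id: str, key: str):
        # Lookups never leave the caller's own partition, so a key written
        # by one tenant is invisible to every other tenant.
        return self._data.get(tenant_id, {}).get(key)


store = TenantScopedStore()
store.put("team-a", "config", {"retention_days": 30})
store.put("team-b", "config", {"retention_days": 7})
```

Because there is no method that reads without a `tenant_id`, forgetting the tenant scope is a type of bug the interface simply does not allow.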

Why is multi-tenancy important?

Economies of scale: A multi-tenant architecture lets you onboard new tenants into a shared resource pool instead of provisioning dedicated infrastructure per tenant. Multi-tenancy enables platform growth efficiently without linearly scaling cost.
Speed: If your platform is multi-tenant from day one, onboarding new teams or customers is frictionless, drastically reducing time-to-value. As the shared platform is already running, new users get instant access without waiting for dedicated environments to be spun up and validated.
Strict isolation: Strict logical boundaries are essential in a shared environment to completely prevent cross-tenant data leakage. This rigorous separation guarantees that one tenant can never access, view, or interact with another tenant’s sensitive information.
Cost efficiency: Shared infrastructure (compute, storage) maximizes resource utilization.
Operational simplicity: Multi-tenancy removes the maintenance burden from your customers, freeing them from managing infrastructure updates, patches or vulnerabilities. As everyone shares a unified platform, tenants seamlessly and automatically benefit from new features, bug fixes, and improvements the moment they are released.
Flexibility for internal and external users: Whether you’re packaging your platform for internal consumers or offering the platform as a service to customers, multi-tenancy enables agility and growth.

What is tenant isolation?

While multi-tenancy enables us to serve multiple tenants efficiently on shared infrastructure, it introduces a key challenge: tenant isolation.

Tenant isolation is the architectural enforcement of strict boundaries, both logical and physical, between tenants sharing the same underlying infrastructure. It guarantees that the identity, data, network traffic, and execution state of one tenant are completely invisible and inaccessible to any other tenant.

While multi-tenancy drives efficiency, implementing this rigid separation is the key to ensuring the platform remains safe, scalable, and trustworthy. Without strong isolation boundaries, a resource-heavy workload from one customer can create a “noisy neighbor” effect, draining shared resources and degrading the experience for everyone else. Furthermore, weak isolation significantly increases the risk of critical cross-tenant security breaches.
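One common guard against the noisy-neighbor effect is a per-tenant usage budget. The sketch below uses a simple fixed-window counter; the class name, the limit and the manual window reset are assumptions for illustration, and production systems often use token buckets or scheduler-level quotas instead.

```python
from collections import defaultdict


class TenantQuota:
    """Fixed per-tenant request budget for one time window."""

    def __init__(self, limit_per_window: int):
        self.limit = limit_per_window
        self._used = defaultdict(int)  # tenant_id -> requests used this window

    def try_acquire(self, tenant_id: str) -> bool:
        """Admit the request only if this tenant still has budget left."""
        if self._used[tenant_id] >= self.limit:
            return False
        self._used[tenant_id] += 1
        return True

    def reset_window(self) -> None:
        """Call at each window boundary (e.g. from a timer) to refill budgets."""
        self._used.clear()


quota = TenantQuota(limit_per_window=2)
noisy = [quota.try_acquire("noisy-tenant") for _ in range(3)]
quiet = quota.try_acquire("quiet-tenant")
```

The noisy tenant exhausts only its own budget, so the quiet tenant's request is still admitted.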

That’s why, once we understand what multi-tenancy is and why it matters, it becomes just as important to explore how we design robust isolation across all layers of the stack.

Some examples of implementing tenant isolation across the layers:

By designing for tenant isolation at every layer, you don’t just improve architecture quality. You also unlock critical benefits like regulatory compliance, reduced blast radius for security incidents, better operational resilience and clearer auditability.

This layered approach ensures that even if one control fails, others remain in place, creating a defense-in-depth model that builds trust and meets enterprise expectations.

Dimensions of tenant isolation

Tenant isolation is a multidimensional concept. It’s not just about data. Here are the key dimensions to think about when designing a multi-tenant, tenant-isolated platform:

The more mature your platform, the more dimensions you’ll likely need to support.

Should we worry about cost?

Growing cost is a valid concern. Implementing multi-tenancy can add upfront complexity. But the cost of not building multi-tenant platforms from the start is often higher.

That said, not all multi-tenancy models are created equal. You can choose from models based on tradeoffs:

Fully shared: Resources are pooled across all customers (low cost, high complexity isolation).
Partially isolated: A hybrid approach balancing needs (e.g., shared compute, but isolated storage).
Fully isolated: Infrastructure is separated per tenant to meet strict regulatory requirements, but the codebase, operations, and updates remain centrally managed (high cost, lowest noisy-neighbor risk).

Use logical multi-tenancy (e.g., namespaces, resource tags, policy scopes) where possible. This offers a good tradeoff between cost and control.

The day-one advantage

If you’re building a platform and don’t see multiple tenants today, think again. Tomorrow, someone will want to reuse your platform. Then another team. Then your product team may want to externalize that platform.

It’s tempting to skip multi-tenancy early in a platform’s life cycle. After all, why bother when only one internal team uses the platform, or when the platform is internal-only?

But that’s precisely the moment when it’s easiest and cheapest to bake multi-tenant capabilities into it. If your architecture is inherently single-tenant, adding multi-tenancy later involves:

Refactoring APIs, UIs and other interfaces
Migrating data
Changing auth models
Rebuilding dashboards, logs and alerts
Creating chargeback models
Worst of all, downtime and rework

It’s easier and cheaper to bake multi-tenancy in from the start.

Final takeaways

If you’re building a software platform today, internal or external, make multi-tenancy a nonnegotiable principle.

Start early: Designing for multi-tenancy later is costly and brittle.
Think holistically: It’s more than just data isolation; consider all dimensions.
Balance cost and complexity: Choose the right model for your context.
Bake multi-tenant platforms into your culture: Multi-tenancy should be a core part of platform thinking.

Multi-tenancy isn’t just for hyperscalers or SaaS giants. It’s for every team building software that aims to scale, evolve and serve more than one use case.

Originally published at https://www.capitalone.com.

This blog was authored by Prabodh Mhalgi, Distinguished Data Engineer, Enterprise Data Technology. Prabodh Mhalgi is a Distinguished Data Engineer who designs and scales enterprise data platforms that power intelligent decisions. From building next-generation, cloud-native low-latency data pipelines to developing diverse consumption patterns that serve both analytical and real-time needs, he focuses on making data accessible and impactful. With passion for technology and innovation, he is driven to push boundaries and shape the future of enterprise data platforms.

Seeing the unseen: The role of anomaly detection in IAM

Capital One Tech — Wed, 27 May 2026 15:39:02 GMT

Uncover hidden signals in data to protect customers, build trust and safeguard identity at scale.

How does seeing the unseen signals in data protect customers and build trust?

In today’s digital world, identity is the foundation of trust between consumers and financial institutions. As fraudsters become more sophisticated, traditional rule-based defenses are no longer enough. This is where anomaly detection steps in, but the focus is not on simple real-time alerts on service-level failures. Instead, deep, analytics-driven insights help detect what isn’t triggering, what’s silently failing or what’s subtly shifting under the surface.

This post is for tech professionals looking to solve security problems with data analytics. Advanced anomaly detection can safeguard identity at scale. This is achieved not just by blocking fraud, but by preserving trust, optimizing user journeys and ensuring systems perform as intended.

From reactive to proactive: The new role of anomaly detection

Most fraud systems were built to be reactive. They respond to what’s happening right now, such as suspicious transactions or failed logins. But what about when something stops happening? This is not just noise; it’s a signal. Anomaly detection helps teams see what’s changed, even if it isn’t triggering a traditional alert.

Consider these signals:

A fraud rule that suddenly stops executing
A once-popular verification method that sees a steep drop in usage
A surge in identity verification success rates that seems too good to be true
An unexpected increase in abandonment rates during onboarding flows
A spike in selection rates for a low-friction path that’s suddenly favored by users

The data behind the detection

At its core, anomaly detection is about comparing the present to the past and asking, “What’s different?” This requires looking beyond single events and into trends across systems. These patterns often point to underlying issues like integration gaps, misrouted data, product misconfigurations or new attack vectors.

Key patterns to watch for include:

Verification volumes: Are multifactor authentication or document checks dropping unexpectedly?
Success rate surges: Is an ID check now passing 99% of the time even though the baseline is 82%?
Abandonment rate spikes: Are users dropping off after being challenged to complete second-factor authentication, or is a user experience issue the reason for abandonment?
Dormant rules: Have critical risk controls stopped firing?
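A simple way to compare the present to the past is to measure how far the current value sits from a historical baseline, in standard deviations. The sketch below applies that to two of the patterns above. The threshold of 3 and all sample numbers are illustrative assumptions, and a plain z-score is a stand-in for whatever method a production system would use.

```python
from statistics import mean, stdev


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` when it is `threshold` or more standard deviations
    away from the mean of `history`, in either direction."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs 2+ points
    if sigma == 0:
        return current != mu  # flat baseline: any change is a deviation
    return abs(current - mu) / sigma >= threshold


# Success rate surge: baseline hovers around 82%, today reads 99%.
success_history = [0.81, 0.83, 0.82, 0.80, 0.84, 0.82, 0.82]
surge_flagged = is_anomalous(success_history, 0.99)

# Dormant rule: daily fire counts drop to zero.
rule_fire_history = [120, 135, 110, 128, 140]
dormant_flagged = is_anomalous(rule_fire_history, 0)
```

Using the absolute deviation matters here: a rule that goes silent is a drop, not a spike, and a one-sided check would miss it.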

Machine learning adds context and confidence

A big challenge with anomaly detection is separating real risk from random fluctuations. That’s where machine learning (ML) shines. ML helps build behavioral baselines for users, devices and even entire systems. It doesn’t just detect that something changed; it understands how unusual that change is.

ML models bring in contextual intelligence like the time of day, user segment, geography and device type to reduce noise and focus on meaningful change. For example:

A spike in success rates may be normal for returning users but not for new ones.
Increased abandonment during identity verification might be more concerning if it aligns with a mobile OS update.
A fraud rule that hasn’t fired for three days could be an integration issue or a sign that attackers have learned to bypass it.
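The first example can be illustrated by keeping a separate baseline per user segment, so the same observed value is judged against the right context. This is a plain statistical sketch, not an ML model, and the segment names and numbers are made up.

```python
from statistics import mean, stdev


def segment_z_scores(baselines: dict, current: dict) -> dict:
    """Score each segment's current value against that segment's own history."""
    scores = {}
    for segment, history in baselines.items():
        sigma = stdev(history)
        # A flat history gives no scale to measure against; report 0.0 here.
        scores[segment] = (current[segment] - mean(history)) / sigma if sigma else 0.0
    return scores


baselines = {
    "returning_users": [0.95, 0.96, 0.97, 0.96, 0.95],
    "new_users": [0.80, 0.82, 0.81, 0.83, 0.79],
}
# Both segments report the same success rate today.
today = {"returning_users": 0.97, "new_users": 0.97}
scores = segment_z_scores(baselines, today)
```

A 97% success rate is unremarkable for returning users but far outside the history for new users, which a single global baseline would blur together.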

Change point detection: Spotting what’s slipping

Not all fraud happens overnight. Some of the most damaging attacks start with subtle, gradual shifts in behavior, a pattern known as “drift.”

That’s where change point detection comes in. It helps identify when user behavior or system performance starts to evolve in small but consistent ways. Think of it as trend analysis for security. Catching small changes early can prevent more serious downstream issues.
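One classic technique for catching gradual drift is CUSUM, which accumulates small deviations from a target until they add up to something significant. The post does not say which method is used in practice, so the sketch below is only one possible instance; the target, slack and threshold values are illustrative assumptions.

```python
def cusum_change_point(values, target, slack, threshold):
    """Return the index where accumulated drift first exceeds `threshold`,
    or None if it never does.

    target: the expected level of the metric.
    slack: deviation per point that is tolerated as noise.
    """
    upward = downward = 0.0
    for i, value in enumerate(values):
        # Accumulate only the part of each deviation that exceeds the slack,
        # and never let the running sums go negative.
        upward = max(0.0, upward + (value - target - slack))
        downward = max(0.0, downward + (target - value - slack))
        if upward > threshold or downward > threshold:
            return i
    return None


# A success rate that creeps up by about a point per day from index 5 on.
drifting = [0.82, 0.82, 0.83, 0.81, 0.82, 0.84, 0.85, 0.86, 0.87, 0.88]
change_index = cusum_change_point(drifting, target=0.82, slack=0.01, threshold=0.05)
```

No single step in this series is large, yet the accumulated drift crosses the threshold a few points after the creep begins.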

Validating intent meets execution (I=E) at scale

Beyond just detecting security anomalies, these techniques are crucial for validating that system intent aligns with actual execution. In complex identity ecosystems, it’s easy for subtle discrepancies to emerge between how a system is designed to behave and how it truly operates in production. Anomaly detection provides the continuous, real-time audit needed for I=E validation at scale.

For instance:

Expected user journeys: If the intent is for 90% of users to complete onboarding within five steps, anomaly detection can flag sudden drops in completion rates or unexpected diversions from the intended path. This highlights potential friction points, broken integrations or even new attack vectors exploiting workflow gaps.
Policy enforcement: If a new policy intends to block transactions from specific regions, anomaly detection can confirm that no such transactions are slipping through. A sudden, unexpected success rate for transactions from a restricted region would be a clear I=E failure.
System health and performance: Anomaly detection on metrics like API response times, database query failures or even CPU utilization can reveal when system execution deviates from intended performance baselines. This ensures that the underlying infrastructure is reliably supporting the intended identity processes.
Rule coverage and efficacy: The silent fraud rule discussed earlier is a prime example of an I=E failure. The intent is for the rule to catch a specific type of fraud; if it stops firing, the execution is no longer matching the intent, indicating a potential bypass or system misconfiguration.
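Two of these I=E checks are simple enough to sketch directly: the stated intent becomes a parameter, and execution data is tested against it. The record layout, status values and region codes below are hypothetical.

```python
def policy_violations(transactions, blocked_regions):
    """Return approved transactions that the blocking policy should have stopped."""
    return [
        t for t in transactions
        if t["region"] in blocked_regions and t["status"] == "approved"
    ]


def completion_meets_intent(completed, started, intended_rate=0.90):
    """Check whether the observed completion rate reaches the intended rate."""
    return started > 0 and completed / started >= intended_rate


# Made-up execution data.
transactions = [
    {"id": 1, "region": "AA", "status": "approved"},
    {"id": 2, "region": "ZZ", "status": "blocked"},
    {"id": 3, "region": "ZZ", "status": "approved"},  # slipped through
]
violations = policy_violations(transactions, blocked_regions={"ZZ"})
onboarding_ok = completion_meets_intent(completed=850, started=1000)
```

Any non-empty `violations` list, or a `False` from the completion check, is an I=E failure worth investigating whether the cause is an attack or a misconfiguration.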

By continuously monitoring these signals and detecting deviations, anomaly detection transforms into a powerful tool for ensuring that identity security systems are not just preventing fraud; they are also behaving precisely as intended, thereby validating I=E across the entire identity life cycle.

The hidden power of abandonment rates

Abandonment rates are often seen as a product or user experience concern, but they’re just as valuable for security teams. When users drop off, it’s a signal that can indicate friction, fear or even fraud deterrence. With anomaly detection, abandonment becomes a first-class signal, not an afterthought.

User abandonment can provide valuable insights into system performance and potential security issues. Consider the following scenarios:

New verification method: A sudden increase in abandonment rates following the introduction of a new verification method may indicate usability issues or technical glitches.
Bot activity: A recurring pattern of bots failing at a challenge step and then disappearing suggests automated attacks or attempts to bypass security measures.
Regional login issues: Users frequently “bouncing” at the login stage from a specific region could point to connection throttling, device ID rejections or other localized access problems.
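To treat abandonment as a first-class signal, it first has to be measured per step. A minimal sketch, assuming an ordered funnel of step names and the number of users who entered each step (all made up here):

```python
def abandonment_by_step(funnel):
    """Compute, for each step, the share of users who did not reach the next one.

    funnel: ordered list of (step_name, users_entered) pairs.
    """
    rates = {}
    for (step, entered), (_, reached_next) in zip(funnel, funnel[1:]):
        rates[step] = 0.0 if entered == 0 else (entered - reached_next) / entered
    return rates


funnel = [
    ("start", 1000),
    ("verify_id", 800),
    ("second_factor", 700),
    ("done", 350),
]
rates = abandonment_by_step(funnel)
worst_step = max(rates, key=rates.get)
```

Tracked over time and by segment, each of these per-step rates can be fed into the same baseline comparison used for success rates and rule fire counts.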

Making it all work: The data foundation

Anomalies cannot be detected without the proper data to back them up. Think of it as building the nervous system for identity infrastructure.

This foundation requires:

Unified identity telemetry: Consolidate logs, metadata and behavioral data into a single observability layer.
Metric observability: Monitor success rates, rule fire counts and abandonment rates over time and segmented by risk.
Automated governance: Employ ML to audit rule coverage, detect blind spots and highlight erratic signals.

Trust is the real output

Ultimately, anomaly detection isn’t just about catching fraud. It’s about building systems that earn and maintain customer trust. When customers encounter seamless and secure identity experiences that adapt and improve, confidence grows. Catching fraud early and fixing user experience issues preemptively is trust in action.

The future of this field is predictive. Instead of just spotting issues after they occur, systems would use artificial intelligence (AI) and historical patterns to predict potential threats, and as these threats materialize, AI agents would adjust policies in real time to optimize friction and security. These self-improving systems would provide appropriate defense against the emerging threats from threat actors that use GenAI to commit fraud.

Final thoughts

Anomaly detection is a critical capability for modern identity security. In a world where threats are quiet and systems are complex, the organizations that can see the unseen will be the ones that lead with both security and trust.

Originally published at https://www.capitalone.com.

This blog was co-authored by Ranjith Goud Karvanga, Sr. Manager, Data Analysis and COF Tech & EPX, and Arpan Srivastava, Director, Data Analytics and COF Tech & EPX.

Ranjith is a distinguished expert in data analytics, machine learning and cloud-based technologies, boasting over 10 years of experience spearheading innovation within the banking and financial services sector. He has overseen impactful initiatives spanning credit card systems, risk events and customer identity, delivering innovative solutions in fraud detection, customer verification and AI-driven decision systems. This has been achieved through his proficiency in generative AI and scalable analytics. He has consistently enhanced enterprise capabilities in risk mitigation and strategic data utilization, translating intricate data into actionable, business-critical outcomes.

Arpan is a high-impact data leader with 20 years of experience transforming data into a strategic asset for growth and product innovation. With deep expertise in data technologies, analytics and identity and access management, he architects and leads the high-performing teams that build scalable, data-driven decision systems that self-optimize. Arpan’s background, spanning application development to sophisticated analytical systems, provides a rare ability to bridge technical execution with strategic vision. He specializes in untangling complex data environments to create secure and reliable assets that directly fuel product innovation and deliver measurable business value.

Seeing the unseen: The role of anomaly detection in IAM was originally published in Capital One Tech on Medium.

Spark tuning: executor optimization for performance

Capital One Tech — Thu, 30 Apr 2026 16:21:31 GMT

Learn how Spark executor tuning improves performance with fat, thin and optimal executors for efficient applications.

Introduction to Spark executor tuning

Working with Spark is intimidating for new users. There are distributed computing concepts to understand, parameters to tune and environments to configure, all on top of writing the application code itself! It’s easy to use surface-level Spark knowledge to create an application that works, but developing efficient applications and achieving real performance improvements requires a deeper dive into the details of Spark. One such case is Spark executor tuning.

Spark drivers, workers and executors

Spark is an open source unified analytics engine for distributed processing of large volumes of data. Computations are split across the nodes of a cluster, enabling parallel processing and in turn speeding up execution. Each Spark application runs on a driver node and a series of worker nodes.

The role of the driver in Spark applications

The driver is the “brain,” running the main method of the application. The driver is responsible for building execution plans and orchestrating the computational tasks performed. It analyzes, schedules and sends tasks to worker nodes.

The role of worker nodes in Spark clusters

In contrast, the workers are the “brawn.” Each worker carries out computation tasks and returns the results to the driver.

The role of executors in Spark jobs

Spark jobs are broken down into a series of stages, each composed of tasks that run in parallel on worker node executors. A task is the smallest unit of work in a Spark application. A Spark executor is a process that runs on a worker node in a cluster and executes tasks assigned to it by the driver. Executors perform the actual data computations of a Spark application.

Each executor is allocated a certain amount of memory and CPU cores for storing data and performing computation tasks. By default (at least in standalone mode), Spark creates a single executor for each worker node in a cluster, but users can change the number of executors and the memory and CPU allocated to each executor. This can often lead to improved performance depending on the application.

Details for executors can be configured by using the following Spark parameters when setting up an application: --num-executors, --executor-cores, --executor-memory.
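
As a small illustration, the sketch below assembles a spark-submit invocation that sets all three parameters. The helper name, the application file my_app.py and the sizing values are hypothetical placeholders, not recommendations.

```python
def spark_submit_command(app_path: str, num_executors: int,
                         executor_cores: int, executor_memory: str) -> list[str]:
    """Build a spark-submit argument list with explicit executor sizing."""
    return [
        "spark-submit",
        "--num-executors", str(num_executors),   # how many executors to launch
        "--executor-cores", str(executor_cores), # CPU cores per executor
        "--executor-memory", executor_memory,    # heap per executor, e.g. "8g"
        app_path,
    ]


# Hypothetical application and sizes, for illustration only.
cmd = spark_submit_command("my_app.py", 4, 4, "8g")
print(" ".join(cmd))
```

The same settings can also be supplied as the configuration properties spark.executor.instances, spark.executor.cores and spark.executor.memory. Note that --num-executors applies when running on a resource manager such as YARN or Kubernetes.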

Fat vs. thin vs. optimal executors in Spark

Executors in Spark can be configured in different ways (fat, thin or optimally sized), each with trade-offs in performance, cost and fault tolerance.

Fat executors in Spark

Since Spark creates one executor for each worker by default, these executors are “fat.” They contain all the CPU cores and memory available to the worker node. Fat executors can be beneficial for certain use cases, such as when an application is processing a large amount of data or when managing several executors becomes a concern.

Since a fat executor has more cores, more tasks can be run in parallel (typically 1 core = 1 task), which can improve application performance.
An additional benefit of fat executors is enhanced data locality, as there is a greater chance of data being processed on a node where it is already stored. This reduces the amount of network traffic (data sent between workers), speeding up the application.

While there are benefits to using fat executors, there are potential downsides:

As all the memory and CPU for a worker sit on one executor, there is potential for resources being underutilized if some cores or portions of memory remain unused.
Fat executors have a lower fault tolerance in the event of an error, as all the resources for a worker are contained on a single executor. If that one executor goes down, the whole worker goes down.

Thin executors in Spark

In direct contrast to fat executors are thin executors. Thin executors are minimally sized, oftentimes containing a single CPU core (or a small number of cores) and a fraction of the memory available to a worker node.

Similar to fat executors, thin executors also increase parallelism, in this case because there are more executors available.
There is better fault tolerance due to the number of executors available in the cluster. If an executor goes down, it’s not the end of the world, and the amount of data being processed on each executor is small due to the limited memory on thin executors, so recomputation is easier.

Thin executors are not perfect either, and there are negatives:

A high amount of network traffic occurs between the driver and executors on each worker node. Since thin executors have less memory and fewer CPU cores, there will be more data sent across more executors as the driver assigns tasks and receives the results from workers.
For similar reasons, there is reduced data locality when using thin executors. More data is spread over more executors, and each executor has a smaller amount of memory, preventing it from storing a large quantity of data partitions locally.

Optimally sized executors in Spark

As the name implies, optimally sized executors are configured to contain the ideal amount of memory and CPU cores for each executor on a worker node. Optimal executors are the Goldilocks solution: not too big, not too small, just right. This can lead to improved application performance by potentially reducing the run time of the application while better utilizing the resources configured. Optimal executors are determined by following the rules below.

Rules for sizing optimal Spark executors

Sizing Spark executors correctly requires following a few best-practice rules:

Leave out 1 CPU core and 1 GB of RAM for the operating system per worker node.
Remove 1 executor (or 1 core and 1 GB of RAM) at the cluster level to account for resource management.
When calculating executor memory, leave out a certain amount to account for the memory overhead of internal system processes. The amount to leave out is max(384 MB, 10% of executor memory).
Ideally, have 3–5 CPU cores on each executor.
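
Rule 3 is the one that most often trips people up, so here is a minimal sketch of it in isolation. It applies the rule exactly as stated above: the overhead is taken as the larger of 384 MB and 10% of the memory budgeted per executor, and the remainder is what would be requested as executor memory. The function names and sample budgets are hypothetical.

```python
MIN_OVERHEAD_MB = 384     # floor from rule 3
OVERHEAD_FRACTION = 0.10  # 10% from rule 3


def memory_overhead_mb(budget_mb: float) -> float:
    """Overhead to set aside: max(384 MB, 10% of the per-executor budget)."""
    return max(MIN_OVERHEAD_MB, OVERHEAD_FRACTION * budget_mb)


def requestable_memory_mb(budget_mb: float) -> float:
    """Memory left to request for the executor after the overhead is removed."""
    return budget_mb - memory_overhead_mb(budget_mb)


# Small executors hit the 384 MB floor; larger ones pay the 10%.
for budget in (2048, 8192):
    print(budget, memory_overhead_mb(budget), requestable_memory_mb(budget))
```

The crossover sits at 3,840 MB: below that budget the fixed 384 MB floor dominates, which is one reason very thin executors waste a larger share of their memory.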

Example Spark configuration across 5 worker nodes

Given what we now know about executors, let’s walk through an example. We’ll use a sample Spark configuration with 5 worker nodes, each containing 12 CPU cores and 48 GB of RAM. Our base cluster looks like this:

After following rule 1, we are left with 11 cores and 47 GB of RAM on each worker node.

Across the cluster (all 5 nodes), we have 55 cores and 235 GB of RAM (5 × 11 cores and 5 × 47 GB).

Now, we follow rule 2. We could remove one full executor (which would likely be better for a thin executor case where we have many executors), but for this example, we will remove 1 GB of RAM and 1 core at the cluster level, leaving 54 cores and 234 GB of RAM.

We want to use optimally sized executors across these 5 nodes and need to take into account rule 4, leveraging 3–5 cores per executor. We’ll choose 5 cores per executor, which gives 10 executors (54 cores divided by 5, rounded down).

The Spark configurations set for this example would be as follows:
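
The original figures are not reproduced here, so the sketch below re-derives the values by applying the four rules to the example cluster. It assumes rule 4 is applied as 5 cores per executor and that the per-executor memory is rounded down to a whole gigabyte; both choices, and the variable names, are assumptions for illustration.

```python
import math

NODES = 5
CORES_PER_NODE = 12
RAM_GB_PER_NODE = 48
CORES_PER_EXECUTOR = 5  # assumption: upper end of the 3-5 range in rule 4

# Rule 1: reserve 1 core and 1 GB per node for the operating system.
usable_cores_per_node = CORES_PER_NODE - 1   # 11
usable_ram_per_node = RAM_GB_PER_NODE - 1    # 47

# Rule 2: reserve 1 core and 1 GB at the cluster level for resource management.
cluster_cores = NODES * usable_cores_per_node - 1  # 54
cluster_ram_gb = NODES * usable_ram_per_node - 1   # 234

# Rule 4: fix cores per executor, then see how many executors fit.
num_executors = cluster_cores // CORES_PER_EXECUTOR  # 10

# Rule 3: split the memory evenly, then set aside max(384 MB, 10%) as overhead.
ram_per_executor_gb = cluster_ram_gb / num_executors
overhead_gb = max(384 / 1024, 0.10 * ram_per_executor_gb)
executor_memory_gb = math.floor(ram_per_executor_gb - overhead_gb)

config = {
    "num_executors": num_executors,
    "executor_cores": CORES_PER_EXECUTOR,
    "executor_memory": f"{executor_memory_gb}g",
}
print(config)
```

Under these assumptions the cluster ends up with two 5-core executors per node. Treat the numbers as a starting point for experimentation rather than a final answer.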

The following is a visualization of what the executors would look like in relation to the whole cluster, with each executor in purple:

Conclusion: Spark executor tuning matters

Spark executor tuning matters because performance, cost efficiency and reliability all hinge on how executors are configured. Ideally, this article provides a deeper understanding of what Spark executors are and how tuning them can lead to better performance and potential cost savings. As with many concepts in Spark, executor tuning is not an exact science and will likely require trial and error. This article should serve as an example that you can use to crunch the numbers and find the optimal configuration for your application. Enjoy tuning!

Originally published at https://www.capitalone.com.

This blog was co-authored by Rudra Sinha, Senior Manager; Andrew Baak, Principal Associate; and Tatum Bair, Principal Associate. Rudra Sinha, Andrew Baak and Tatum Bair work in the data engineering space and boast extensive experience in data engineering and technical leadership. As a group of data engineering leaders, their passion is to explain their deep data engineering knowledge, acquired through extensive research, concept proofing and countless implementations.

Spark tuning: executor optimization for performance was originally published in Capital One Tech on Medium.

Insights from the inaugural Capital One AI Symposium

Capital One Tech — Fri, 24 Apr 2026 15:52:24 GMT

Advancing the state of the art through multi-sector partnerships.

At Capital One, we are deeply committed to advancing the frontier of artificial intelligence (AI) to solve meaningful industrywide challenges.

A key part of this mission is fostering long-term multi-sector ecosystems and partnerships to advance the state of the art. Recently, we hosted our first-ever two-day AI Symposium in McLean, Virginia. The inaugural event was designed to deepen connections and foster insight-sharing between the scientific AI community, leading-edge startups and our own technology, data science and AI/machine learning leaders and partners.

A multi-sector approach

Across the two days, the symposium highlighted a critical shift in how we approach researching and building AI. Rather than siloing talent and insights within the confines of the university environment or the private sector, we must build sustainable, reciprocal models that strengthen both for broad societal benefit.

The AI Symposium showcased how academic research can play a defining role in enterprise without losing its scientific rigor.

Day 1: The Academic Summit

The first day focused on the latest breakthroughs in AI and related areas from top academics, applied scientists and engineering leaders. Key themes explored in these academic talks and discussions included:

Human-AI Interaction: Professors Lydia Chilton of Columbia University and Mohit Bansal, director of the University of North Carolina at Chapel Hill MURGe Lab, discussed challenges and opportunities related to trust in AI systems, including the confidence obstacles impeding productivity and how addressing said obstacles can improve real-world applications in medicine, education and beyond.

Creative Intelligence: Professors Heng Ji of the University of Illinois at Urbana-Champaign and Ellie Pavlick of Brown University explored how models can move beyond simple pattern-matching into complex reasoning to generate entirely novel solutions, alongside deep dives into how AI systems actually process information internally.

Responsible Decision-Making and Quantum: Multi-sector panels, featuring researchers such as Dr. Shih-Fu Chang, dean of Columbia Engineering, and Professor Gaurav Sukhatme, director of Advanced Computing at the University of Southern California, alongside experts from scientific federal agencies, discussed the responsible implementation of AI in finance, the need for multi-sector innovation and considerations for the intersection of AI and quantum computing.

The Academic Summit was enriched by the participation of academic leaders from Capital One’s formal partnerships, including the Capital One University of Southern California Center for Responsible AI and Decision Making in Finance; the Capital One Illinois Center for Generative AI Safety, Knowledge Systems, and Cybersecurity; the Columbia Center for AI and Responsible Financial Innovation; the University of Maryland Center for Machine Learning; the University of North Carolina at Chapel Hill MURGe Lab; the University of Virginia School of Data Science; the University of Virginia School of Engineering; and partners from the National Science Foundation and the Partnership on AI.

Day 2: The Frontier Forum

The second day focused on the enterprise, exploring how scientific breakthroughs can scale to create real business impact. Highlights included:

Economics of AI: Insights from Professor Jason Furman of the Harvard Kennedy School of Government focused on the emerging macroeconomic impacts of advanced AI.

Innovations in Infrastructure: Bryan Catanzaro of NVIDIA offered a deep dive into open foundation models and accelerated infrastructure, and Professor Randall Balestriero of Brown University, formerly of Meta AI Research, offered a deep dive into the applicability of world models to reliable AI in complex real-world parameters.

Agentic Coding: Boris Cherny and Laurens van der Maaten from Anthropic explored the future of software engineering and research through the lens of Claude Code.

Startup Showcases: Demos from Capital One Ventures startup partners highlighted the cutting edge of AI monitoring, data infrastructure and conversational applications.

Dialogues and connections facilitated by the Capital One AI Symposium are critically important drivers as we look to the next frontier of enterprise AI. By bringing together the brightest minds across academia, government, big tech, startups and our own enterprise, we are collaboratively building a more intelligent, autonomous and well-managed future.

We look forward to continuing these vital conversations and further fostering a collaborative ecosystem that attracts world-class talent to the frontier of AI and advances U.S. AI leadership.

Interested in joining the team that’s building the future of AI in finance? Explore our AI Careers page.

Originally published at https://www.capitalone.com.

Insights from the inaugural Capital One AI Symposium was originally published in Capital One Tech on Medium.

NLP research foundations at ICLR 2026

Capital One Tech — Tue, 21 Apr 2026 16:55:24 GMT

Explore our latest research in LLM alignment, uncertainty quantification and privacy-preserving synthetic data in Rio de Janeiro.

Capital One technologists are excited to participate in the 14th International Conference on Learning Representations (ICLR) taking place in Rio de Janeiro, Brazil, April 23–27, 2026. As a premier venue for deep-learning research, ICLR provides a vital forum for addressing the complexities of natural language and representation learning.

Capital One is participating as a gold sponsor and research contributor to discuss advancements in large language model (LLM) alignment, uncertainty quantification and the development of responsible, agentic systems. This work provides the foundational technical solutions necessary for the next generation of financial services.

Main conference research: AI safety and model reasoning

The following research, accepted to the ICLR Main Conference, examines the limits of how models reason, adhere to safety policies and quantify uncertainty. This section includes work from Capital One researchers, papers first-authored by 2025 Applied Research Interns and collaborative research with academic partners.

Alignment-Weighted DPO: A novel way to improve alignment in LLMs via reasoning
Capital One Authors: Mengxuan Hu (ARIP 2025), Vivek Datla, Anoop Kumar, Alfy Samuel, Daben Liu

Despite advances in alignment techniques like Direct Preference Optimization (DPO), LLMs remain vulnerable to jailbreak attacks. Our research, grounded in causal intervention, reveals that this vulnerability stems from “shallow” alignment, a lack of deep reasoning when rejecting harmful prompts. To bridge this gap, we introduce Alignment-Weighted DPO, a reasoning-aware post-training technique that identifies problematic reasoning segments during response generation. By targeting these specific vulnerabilities, our method improves robustness against diverse jailbreak strategies while maintaining overall model utility.

Uncertainty as Feature Gaps: Epistemic Uncertainty Quantification of LLMs in Contextual Question-Answering
Capital One Authors: Yavuz Bakman (ARIP 2025), Zhiqi Huang, Chenyang Zhu, Anoop Kumar, Alfy Samuel, Daben Liu

Quantifying epistemic uncertainty is critical for real-world contextual question answering. This study proposes a theoretically grounded approach that interprets uncertainty as “feature gaps” in hidden representations relative to an ideal model. By extracting features like context reliance and honesty, we form a robust uncertainty score. Experiments show a 13-point prediction-rejection ratio improvement over state-of-the-art methods with negligible inference overhead.

DynaGuard: A Dynamic Guardian Model With User-Defined Policies
Capital One Authors: Melissa Kazemi Rad, Bayan Bruss

While standard guardian models are limited to predefined harm categories, this collaboration with the University of Maryland introduces DynaGuard, a suite of dynamic models that evaluate conversational agent responses in multiturn settings based on user-defined policies, as well as DynaBench, a synthetically generated dataset with dynamic rules covering various industries and multiturn user-agent interactions. DynaGuard provides rapid detection of custom violations and a chain-of-thought option to justify outputs. It surpasses state-of-the-art guardrail models in accuracy and is competitive with frontier reasoning models on free-form policy violations.

mR3: Multilingual Rubric-Agnostic Reward Reasoning Models
Capital One Author: Genta Winata

Evaluation using LLM judges often fails to generalize to non-English settings. This work introduces mR3, a multilingual reward reasoning model trained on 72 languages and developed in partnership with Stanford University. It achieves state-of-the-art performance on multilingual benchmarks while remaining significantly smaller than competing models, demonstrating an effective strategy for building high-quality multilingual reward models.

BioTamperNet: Affinity-Guided State-Space Model Detecting Tampered Biomedical Images
Capital One Author: Premkumar Natarajan

Subtle manipulations in biomedical images can compromise experimental validity. In partnership with the University of Southern California (USC), this research introduces BioTamperNet, which uses affinity-guided attention inspired by state space model approximations to detect duplicated regions. By integrating lightweight linear attention mechanisms, it identifies tampered regions and their source counterparts more accurately than competitive forensic baselines. This methodology may be applicable to financial services in the context of fraud detection and image verification.

μLO: Compute-Efficient Meta-Generalization of Learned Optimizers
Capital One Author: Charles-Etienne Joseph

Learned optimizers (LOs) often struggle to optimize unseen tasks. In collaboration with Mila, Samsung AI Lab, Concordia University, Sorbonne University and Université de Montréal, this research derives the maximal update parametrization (μP) for two LO architectures and proposes a meta-training recipe for μ-parametrized LOs (μLOs). This method substantially improves meta-generalization to wider and deeper networks and to longer training horizons compared to standard parametrization.

Workshop tracks: Privacy, synthetic data and forecasting

Our workshop participation addresses the intersection of privacy, synthetic data and time-series forecasting. This includes work from our Applied Research Internship Program and our funded university-based Academic Centers of Excellence.

Evaluating LLM Simulators as Differentially Private Data Generators (The 2nd Workshop on Advances in Financial AI)
Capital One Authors: Nassima Bouzid, Dehao Yuan, Nam Nguyen, Mayana Wanderley Pereira

LLM-based simulators offer a path for generating complex synthetic data, but their ability to reproduce statistical distributions from DP-protected inputs remains a question. This study finds that while these simulators achieve promising utility, they exhibit significant distribution drift due to systematic LLM biases. Addressing these failure modes is essential before LLM-based methods can handle the rich user representations required for financial simulations.

Decoupling Identity From Utility: Privacy-by-Design Frameworks for Financial Ecosystems (The 2nd Workshop on Advances in Financial AI)
Capital One Authors: Ifayoyinsola Ibikunle, Tyler Farnan, Senthil Kumar, Mayana Wanderley Pereira

This paper positions differentially private (DP) synthetic data as a robust framework for building responsible agentic systems in finance. We examine direct tabular synthesis and DP-seeded agent-based modeling, arguing that the latter is essential for autonomous finance. It provides a “safe gym” for training agents, enabling fairness auditing and robustness testing while adhering to rigorous formal privacy guarantees.

Zero-Shot Multivariate Time Series Forecasting Using Tabular Prior Fitted Networks (Time Series in the Age of Large Models [TSALM] Workshop)
Capital One Authors: Mayuka Jayawardhana (ARIP 2025), Doron Bergman, Nihal Sharma, Nam Nguyen, Mohammadkazem Meidani

This research recasts the multivariate time series forecasting problem as a series of scalar regression problems. Doing so provides a twofold benefit: First, we can leverage tabular foundation models (which have been remarkably successful in tabular regression tasks) in a zero-shot fashion to serve as forecasters. Second, our reformulation allows for interchannel interaction, going beyond the current standard of decomposing the multivariate forecasting problem into independent univariate subproblems.
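
To picture what recasting the problem as scalar regression can look like, here is a minimal sketch of one possible tabular layout. It is a hypothetical construction for illustration and not necessarily the one used in the paper: each row holds lagged values from every channel plus an index for the target channel, and the label is that channel's next value. The function name, the lag count and the toy series are assumptions.

```python
def to_tabular_rows(series: dict[str, list[float]], n_lags: int):
    """Flatten a multivariate series into (features, scalar target) rows.

    Every row sees the lags of all channels, so a regressor fit on these
    rows can use cross-channel information when predicting one channel.
    """
    channels = sorted(series)
    length = len(series[channels[0]])
    features, targets = [], []
    for t in range(n_lags, length):
        # Lags of every channel, most recent first.
        lags = [series[c][t - k] for c in channels for k in range(1, n_lags + 1)]
        for idx, channel in enumerate(channels):
            features.append(lags + [idx])  # channel index marks the target
            targets.append(series[channel][t])
    return features, targets


# Toy two-channel series, for illustration only.
toy = {"a": [1, 2, 3, 4], "b": [10, 20, 30, 40]}
X, y = to_tabular_rows(toy, n_lags=2)
print(X[0], y[0])
```

A tabular regressor, such as a tabular foundation model used zero-shot, could then treat rows like these as context and predict the scalar target for a new row.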

EPSVec: Efficient and Private Synthetic Data Generation via Dataset Vectors (Workshop on Navigating and Addressing Data Problems for Foundation Models [DATA-FM])
Capital One Authors: Erin Babinsky, Alfy Samuel, Anoop Kumar

Developed through our USC-Capital One Center for Responsible AI and Decision Making in Finance (CREDIF) program, EPSVec is a lightweight DP synthetic data generation method. It steers LLM generation using dataset vectors — directions in activation space that capture the distributional gap between private data and public priors. EPSVec extracts and sanitizes steering vectors just once and then performs standard decoding. This method decouples the privacy budget from generation, enabling high-fidelity synthetic samples even in low-data regimes with reduced computational overhead.

Your Model Diversity, Not Method, Determines Reasoning Strategy (Workshop on Logical Reasoning of Large Language Models)
Capital One Authors: Moulik Choraria (ARIP 2025), Anirban Das, Supriyo Chakraborty, Berkcan Kapusuzoglu, Chiahsuan Lee, Kartik Balasubramaniam, Shixiong Zhang, Sambit Sahu

This study investigates how the allocation of compute budget between exploration (breadth) and refinement (depth) impacts LLM reasoning. We argue that the optimal allocation strategy depends on a model’s “diversity profile” (the spread of probability mass across solution approaches), which must be characterized before any exploration strategy is adopted. We formalize this by decomposing the reasoning uncertainty and deriving conditions under which tree-style refinement outperforms parallel sampling.

Connect with Capital One at ICLR 2026

If you’re attending the conference in Rio de Janeiro, we invite you to visit our booth to engage with our researchers and authors.

Visit our booth: 408
Explore our research: Dive deep into our latest advancements in AI and machine learning.
Discover career opportunities: Learn about exciting applied research career paths at Capital One for researchers and engineers passionate about AI and join our world-class team.
Learn about our student and grad internships: Put your knowledge and skills to work in our 10-week to two-year graduate programs innovating new products and creatively solving the problems that impact our customers and our business.
Engage with our team: Meet our researchers and AI experts, explore how we’re shaping financial services with patented AI and discuss what’s next for AI in finance.

Originally published at https://www.capitalone.com.

NLP research foundations at ICLR 2026 was originally published in Capital One Tech on Medium.

Zero trust revolution: Why legacy network security fails

Capital One Tech — Fri, 10 Apr 2026 17:02:30 GMT

The shifting landscape of digital threats demands a fundamental change in network security.

In 2005, long before my time at Capital One, I was a Marine corporal in Iraq. I supported network communications for my squadron, and because of my role, I had to bunker down when attacks occurred. Not fight, just be on high alert and ready to fight if necessary.

When a mortar round or rocket crossed the perimeter and hit the base, I felt it. Often, I felt like a sitting duck. Yet, despite what was happening around me, I had to be ready to support comms for my unit. I had to ensure it was available and secured, or zeroized if all else failed. In those days, though, I only worried about the outside threat: the physical one pounding on my door.

Why now? The evolution of cyber threats

We’ve come a long way since the early days of Operation Iraqi Freedom. Do those types of threats still exist in the world? You bet they do. However, we now have sophisticated cyber threats to deal with too. We have ransomware, DDoS attacks, social engineering, phishing attacks, deepfakes and the list goes on for a solid mile. Depending on the attack, these can come from the outside or the inside.

There are numerous reasons why these threats exist, but many stem from rapid advancements in technology and a growing internet presence. DataReportal.com indicates that 5.35 billion people were using the internet in 2024. This is great for many industries but scary when considering the blast radius of one infiltrated app that millions of people could use.

Recent data breaches, like the MOVEit data breach in 2023, show how alarmingly simple and far-reaching an attack can become if executed from the right angle in today’s world. In this attack, hackers obtained access to the MOVEit file transfer software. Then, they leveraged a vulnerability to retrieve sensitive data from tens of millions of people across a multitude of companies. That data was further utilized to collect ransom money from execs at big companies. The impact of this breach is still being felt today.

The traditional security struggle bus

I’m a network security engineer, but I’ll say it: Traditional network security struggles against modern threats. Conventional models rely heavily on a solid perimeter but allow for ease of movement for inside personnel. This is often due to cost or to improve performance.

Meanwhile, the number of attacks and the number of directions they can come from are staggering. It’s not enough to build an “outside” firewall, a DMZ and an “internal” firewall your users sit behind. Today, we need a more adaptive and resilient security model.

Say hello to zero trust network security

I’m sure you’ve heard of it by now, but zero trust network security (ZTNS) is a different way of thinking. Instead of assuming trust within the network, ZTNS is founded on this principle: “Never trust, always verify.” Every access request, whether it originates from inside or outside the network, is subject to strict verification.

The “Never trust, always verify” approach

In ZTNS, trust is never implicit. Each user and device must continually prove their legitimacy through stringent checks. This is akin to the rigorous identification checks at a military checkpoint: no one gets through without proper clearance. Want in? Badge, business and clearance level, please!

Key components of zero trust

Identity-Based Access Controls: Only authenticated and authorized users can access resources.
Microsegmentation: Slicing the network into smaller segments to contain potential breaches, much like setting up multiple secure zones within a base.
Continuous Monitoring: Constant vigilance over network activity to detect and respond to threats in real time.
Least Privilege Access: Users have the minimal access required, limiting potential damage if an account is compromised.
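
To show how these components fit together, here is a minimal sketch of a deny-by-default access decision. It is not any vendor's API. The request fields, the policy tables and the names alice, payroll-db and finance-zone are hypothetical stand-ins for what an identity provider and policy engine would supply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    user: str
    mfa_verified: bool      # identity-based access control
    device_compliant: bool  # device must also prove itself
    source_segment: str     # where on the network the request originates
    resource: str
    action: str


# Hypothetical policy data, hard-coded only for illustration.
GRANTS = {("alice", "payroll-db"): {"read"}}          # least privilege
ALLOWED_SEGMENTS = {"payroll-db": {"finance-zone"}}   # microsegmentation

audit_log = []  # every decision is recorded for continuous monitoring


def authorize(req: AccessRequest) -> tuple[bool, str]:
    """Evaluate every request from scratch; anything not explicitly allowed is denied."""
    if not req.mfa_verified:
        decision = (False, "identity not verified")
    elif not req.device_compliant:
        decision = (False, "device not compliant")
    elif req.source_segment not in ALLOWED_SEGMENTS.get(req.resource, set()):
        decision = (False, "segment not allowed")
    elif req.action not in GRANTS.get((req.user, req.resource), set()):
        decision = (False, "action not granted")
    else:
        decision = (True, "allowed")
    audit_log.append((req, decision))
    return decision


ok = AccessRequest("alice", True, True, "finance-zone", "payroll-db", "read")
print(authorize(ok))
```

The key design choice is that being "inside" grants nothing: a request from the right user on the wrong segment, or for an action beyond the user's grant, is refused just like a request from an outsider.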

Benefits of zero trust

Enhanced Security Posture: With zero trust, each access attempt faces scrutiny, making it harder for attackers to roam once inside.
Protection Against Insider Threats: Zero trust limits even trusted insiders’ actions. Essentially, you have to be verified to do things.
Improved Visibility and Control: Zero trust allows organizations to monitor network traffic closely, which helps detect anomalies.

Implementing zero trust

Zero trust is not something you magically implement overnight. It demands a thorough review of your current infrastructure. It demands vulnerabilities be identified. Furthermore, it requires a network audit to pinpoint where current security mechanisms fall short.

Best practices for implementation

Identity Management: Use strict identity and access management protocols.
Network Segmentation: Create secure zones within the network to contain potential threats.
Encryption: You need this for your sensitive data, both in transit and at rest.
Multifactor Authentication (MFA): Require multiple forms of verification for critical resources.
Continuous Monitoring: Use advanced monitoring tools to identify and respond to unusual activities in real time.

SASE and zero trust

Secure access service edge (SASE) combines networking and security to support zero trust principles. It uses five key capabilities:

Software-defined wide area network (SD-WAN)
Secure web gateway (SWG)
Cloud access security broker (CASB)
Firewall as a service (FWaaS)
Zero trust network access (ZTNA)

There are many providers and solutions for these capabilities. Still, if we home in on ZTNA, you may recognize common solutions like GlobalProtect by Palo Alto Networks, Zscaler Private Access (ZPA) and Cisco Secure Client (formerly AnyConnect). Do any of those sound familiar?

Real-world examples

In his book “Zero Trust Security Demystified,” L.D. Knowings points out several organizations that have successfully embraced zero trust. He mentions big companies like Accenture, Cloudflare and Akamai, a well-known CDN with a dynamic environment. Even Cisco is mentioned, which should be familiar to many folks in the network space. They use their own SASE architecture, in fact.

Companies that have adopted zero trust report better threat detection and faster incident response times, as every access request undergoes verification. Many of their cloud infrastructures are secured, their attack surface is reduced, and there is limited ability for malware or bad actors to move laterally on the inside. Not to mention, their data is protected, and they can respond faster to new threats.

Challenges of zero trust

Zero trust can be tricky to implement, not to mention resource-intensive: costs, complexity and user experience all play a role in that. To overcome these types of challenges, organizations should:

Plan Thoroughly: Hash out a comprehensive plan that involves all necessary stakeholders, business requirements, target state and so on.
Invest Wisely: Draft an RFP for solutions that meet your requirements and target state. Invest in the tools and technology that best meet your organization’s needs.
Educate Users: Ensure that everyone (senior leaders, internal engineers, app developers and definitely your end users) understands your new tech and protocols.

Balancing security with user experience is vital for successful adoption. There are many vendors that offer SASE solutions, some all-in-one and some only pieces of the bigger zero trust puzzle. Over time, there will be more, but here are some well-known ones: Palo Alto Networks, Fortinet, Netskope and Cisco.

The future of zero trust

As cyber threats evolve, zero trust will remain crucial in cybersecurity. Dynamic cloud environments, like the one we have at Capital One, make this especially important. However, with tech like AI and machine learning, we’ll be able to enhance zero trust by enabling better threat detection and response. With these, we can sift through data faster, find suspicious patterns or anomalies and be more proactive in our security stance.

Let’s not forget about the role of IoT either. Tim Cook, the CEO of Apple, said: “The Internet of Things is creating a new world where everything is connected and can be controlled remotely.” This is exciting, yet scary when you consider how fast things can hop on the internet with minimal security provisions. Of course, zero trust can help secure IoT devices by requiring each device to be authenticated and monitored, minimizing the risk of IoT-based attacks.

Conclusion

If I learned anything from my beloved time in the Marines, it would stem from this mantra: “Improvise, adapt and overcome.” That “adapt” part is crucial for today’s digital environment. By moving beyond outdated models and embracing zero trust, organizations can adapt and ensure more robust defenses against cyber adversaries.

In the evolving landscape of cybersecurity, zero trust is the new standard, and it’s the path that companies in the modern world must begin to take. This is the path toward resilient, perhaps even antifragile, networks. Adapting and staying vigilant are essential, as complacency is not an option.

Zero trust. Over and out.

Originally published at https://www.capitalone.com.

Authored by Jerry Bair, Lead Platform Engineer, Network Core Security. Jerry Bair is a network security engineer supporting customer enablement for Capital One Network Observability. He ventured into the field after getting out of the U.S. Marine Corps in 2006 and found his way to Capital One in 2013. When he’s not tech-ing out at Capital One, he challenges himself with Spartan races and DEKA events.

DISCLOSURE STATEMENT: © 2026 Capital One. Opinions are those of the individual author and are not necessarily those of Capital One. Unless noted otherwise, Capital One is not affiliated with, nor endorsed by, any third parties mentioned and is not responsible for the content or privacy policies of any linked third-party sites. Any trademarks and other intellectual property used or displayed are property of their respective owners.

Zero trust revolution: Why legacy network security fails was originally published in Capital One Tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

Highlights from NVIDIA GTC AI Conference 2026

Capital One Tech — Thu, 09 Apr 2026 13:55:53 GMT

Showcasing the transformative power of agentic AI in real-time customer workflows.

Here’s a recap of Capital One’s presence at NVIDIA GTC 2026, which took place March 16–19 in San Jose, California. As always, GTC was a landmark event for the future of technology. As generative AI moves from experimental prototypes to enterprise-scale production, Capital One was excited to take part in the conversation.

With two featured sessions, a high-traffic booth presence, and strategic discussions with industry peers, our team shared how we are leveraging NVIDIA’s accelerated computing power to build a more enlightened banking experience.

This year, our leaders took the stage to discuss the shift from simple LLM chat interfaces to complex, multi-agentic systems capable of handling sophisticated conversational workflows.

Featured sessions

Building Proprietary Multi-Agentic AI Workflows for Consumer Banking

Explore how Capital One is leveraging its proprietary multi-agentic AI framework to improve the speed, effectiveness and documentation of customer service calls concerning fraud. This solution shows promise in enhancing both the human associate experience as well as the customer experience by advancing quality reviews and agent productivity while meeting rigorous business standards for accuracy, reliability and governance.

Featuring:

Milind Naphade, SVP, Head of AI Foundations, Capital One
Ritesh Soni, MVP, Data Science, Retail Bank, Capital One

Key Takeaways:

Multi-Agent Orchestration: We demonstrated how combining customized foundation models with multi-agent orchestration reduces summarization latency while preserving critical details.
Customization and Scalability: The solution’s underlying multi-agentic framework was originally built to support a car-buying solution and later applied to supporting customers with fraud resolution. This “build once, reuse extensively” mindset encourages associates from any corner of the bank to identify manual processes and pull from our centralized “tech stack” to build solutions.

Accelerating Enterprise AI: From Infrastructure to Agentic Systems

Our second session delved into the “five-layer cake” of AI: energy, chips, infrastructure, models and applications. Learn firsthand from Capital One engineering leaders how they’re building distributed data pipelines to curate high-quality datasets that enable laser-focused foundation models for financial applications.

Featuring:

Brian Nguyen, Sr. Engineering Manager, Capital One
Nick Resnick, Lead AI Engineer, Capital One

Conversations on the floor and in the press

Beyond the session halls, our booth received over 1,200 visitors and served as a hub for deep dives into MLOps, data privacy and the evolving landscape of open-source AI. Following the conference, the Wall Street Journal’s CIO Journal featured Capital One in a discussion on why leading companies are leaning into open-source models despite the complexities.

Looking ahead

GTC 2026 made one thing clear: The era of agentic AI has arrived. For Capital One, this means moving beyond simple automation to create systems that can reason, collaborate and execute complex tasks on behalf of our customers and associates.

By combining NVIDIA’s cutting-edge hardware with our modern tech stack and data ecosystem as well as our deep domain expertise, we are continuing to define what’s possible in the next generation of financial services.

Interested in joining the team building the future of AI in finance? Explore our AI Careers page.

Originally published at https://www.capitalone.com.

Highlights from NVIDIA GTC AI Conference 2026 was originally published in Capital One Tech on Medium, where people are continuing the conversation by highlighting and responding to this story.

Secure random number generation on virtual hardware

Capital One Tech — Mon, 06 Apr 2026 14:14:34 GMT

Explore how entropy sources, virtual hardware and cryptographic modules shape secure random number generation.

The specter of generating a secure random number can unnerve a software engineer of any experience level. The apparitional subtleties of the myriad available approaches can shake even the practiced hand of a seasoned cybersecurity professional. And as with the cryptographic domain’s other spindly fingers, secure random number generation demands precisely correct implementation; otherwise, the security of the entire system in question might collapse. But against the severe implications of a poor choice, a rudimentary understanding of where random numbers come from suffices to make an informed decision and effectively mitigate the security risk associated with an inadequate source of random numbers.

For engineers, cybersecurity professionals and anyone who generates encryption keys, builds a process for generating them or sets governance around that process, this article will demystify the selection of a random number generator by explaining the varying levels of security that different solutions can offer in different contexts.

Understanding entropy and randomness

Briefly, a few key terms will aid in understanding the nuances of random number generation.

Random number generators

The output of a function is called random (and the function itself can be called a true random number generator, or RNG) if, given all inputs to the function, there is no effective computational method to forecast its output. In other words, the output of the function does not follow any predetermined algorithm.

Pseudorandom number generators

A pseudorandom number generator (PRNG), on the other hand, is a function that, given a seed value, produces a sequence that passes statistical tests for randomness. PRNGs are different from RNGs because, given an identical seed and context, they will produce an identical sequence of numbers. However, they can suffice in many cases, provided an attacker has no means by which to discover the seed.
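
The determinism described above is easy to see with Python's built-in `random` module, which is a Mersenne Twister PRNG and is not intended for security use. This is a small sketch; the `sample` helper and the seed values are made up for illustration.

```python
from __future__ import annotations

import random


def sample(seed: int, count: int = 5) -> list[int]:
    """Draw `count` integers from a PRNG seeded with `seed`."""
    prng = random.Random(seed)  # independent generator; global state untouched
    return [prng.randrange(1_000_000) for _ in range(count)]


first = sample(seed=2026)
second = sample(seed=2026)

print(first == second)  # True: identical seed, identical sequence
```

Anyone who learns the seed can reproduce the whole sequence, which is why keeping the seed secret matters so much.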

Random bit generators and entropy

The National Institute of Standards and Technology (NIST) defines both of these concepts in computational terms. According to SP 800-90, a random bit generator (RBG) is defined as “a device or algorithm that outputs a sequence of binary bits that appears to be statistically independent and unbiased.” As such, RBGs can be either RNGs or PRNGs. To distinguish between the two, NIST defines non-deterministic random bit generators and deterministic random bit generators (DRBGs) as their respective computational equivalents.

Entropy is a related concept that refers to the level of uncertainty in the output of a random function. NIST defines entropy as “a measure of the disorder, randomness or variability in a closed system.” Entropy is measured in bits: a fair coin flip has exactly one bit of entropy; an unfair coin flip has a little less entropy, since there is less uncertainty in the outcome; and a roll of a fair six-sided die has about 2.58 bits of entropy (log2 of 6), since there is more uncertainty in the outcome. As the name implies, it is analogous to entropy in the context of thermodynamics. A source of entropy is an RBG acting as a source of uncertainty in the same way a coin or die might.
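
The coin and die figures can be checked with a few lines of Python. This sketch uses Shannon entropy, which is an assumption on my part about the measure intended here; NIST's SP 800-90B works with min-entropy, and the two agree for uniform cases such as a fair coin or a fair die. The 90/10 coin is an arbitrary example of an unfair one.

```python
import math


def shannon_entropy(probabilities):
    """Return the Shannon entropy, in bits, of a discrete distribution."""
    if not math.isclose(sum(probabilities), 1.0):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


fair_coin = shannon_entropy([0.5, 0.5])    # 1.0 bit
biased_coin = shannon_entropy([0.9, 0.1])  # about 0.47 bits
fair_die = shannon_entropy([1 / 6] * 6)    # about 2.58 bits, i.e. log2(6)
```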

Cryptographically secure language built-ins

How standard libraries generate random bits

Desktop or server computers generate random bits by feeding output from a source of entropy through a DRBG. For example, computers might accumulate a pool of entropy consisting of things like mouse movements and temperature fluctuations from the central processing unit (CPU) or specialty entropy-generation chips like Intel’s digital random number generator. Upon user request, a DRBG siphons a few bits from the entropy pool and transforms them into a usable sequence of random bits.
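
As a rough picture of that pipeline, the sketch below seeds a toy generator from the operating system and expands the seed into output. It is deliberately simplified and is not one of the DRBG constructions specified by NIST; the `ToyDrbg` class is a hypothetical name, and `os.urandom` merely stands in for the entropy pool. Do not use it for real keys.

```python
import hashlib
import hmac
import os


class ToyDrbg:
    """Illustrative counter-mode generator; NOT a NIST-approved DRBG."""

    def __init__(self, seed: bytes):
        if len(seed) < 32:
            raise ValueError("seed should carry at least 256 bits")
        self._key = hashlib.sha256(seed).digest()
        self._counter = 0

    def generate(self, num_bytes: int) -> bytes:
        """Expand the seed into num_bytes, one HMAC-SHA256 block at a time."""
        out = bytearray()
        while len(out) < num_bytes:
            message = self._counter.to_bytes(16, "big")
            out.extend(hmac.new(self._key, message, hashlib.sha256).digest())
            self._counter += 1
        return bytes(out[:num_bytes])


# os.urandom stands in for the operating system's entropy pool.
drbg = ToyDrbg(os.urandom(32))
key_material = drbg.generate(48)
```

The point to notice is that everything after the seed is deterministic, so the output is only as unpredictable as the entropy that went in.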

Why virtual environments add risk

The cryptographically secure RBG from any language’s standard library uses this type of random bit generation: entropy collection from the hardware running it. Standard library RBGs are ubiquitously recommended by online forums, enterprise policies and government agencies, and they are secure in a wide range of cases. But when working in a virtualized setting, blindly following broadly scoped guidelines and collecting entropy from local hardware can leave your application vulnerable to attack.
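
For reference, this is what the standard-library route looks like in Python, whose `secrets` module draws on the operating system's randomness source. The variable names and sizes are illustrative only, and the caveat above still applies: on virtual hardware these bytes are only as good as the host's local entropy.

```python
import secrets

# 32 bytes = 256 bits, for example for an AES-256 key.
aes_key = secrets.token_bytes(32)

# URL-safe text token built from 32 random bytes, for example a session ID.
session_token = secrets.token_urlsafe(32)
```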

Problems with local entropy in virtualized infrastructure

Virtualized infrastructure includes almost anything deployed using a cloud provider, but it could mean a fleet of dynamically provisioned virtual machines in a local data center or even on a single server. Of course, architects must examine each application individually to determine the level of risk each may pose, but two problems bedevil secure local entropy generation on most deployments of virtual hardware.

Entropy starvation at boot time

The first issue is a lack of entropy. Large software deployments often involve dynamic scaling, which means frequent and automated provisioning of new servers or virtual machines. Soon after boot, an operating system may not have had enough time to collect sufficient entropy to generate secure random bits. Operating systems work around this problem in a number of different ways, which means that a generation call may not actually generate a random bit sequence but instead a predictable vestige of a starved entropy source. For example, in some older, diskless Linux distributions, the kernel’s RBG relies purely on reasonably predictable system events, making the initial entropy state highly predictable.
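
One way to guard against boot-time starvation on Linux is to ask the kernel, without blocking, whether its RBG has been seeded before handing out any bytes. The sketch below does that with Python's `os.getrandom`; the helper names are hypothetical, and the fallbacks for platforms or sandboxes without the call are assumptions rather than guarantees.

```python
import os


def entropy_pool_ready() -> bool:
    """Return True if the kernel RBG reports that it has been initialized.

    os.getrandom with GRND_NONBLOCK raises BlockingIOError instead of
    waiting when the pool is not yet seeded. The call is Linux-only.
    """
    getrandom = getattr(os, "getrandom", None)
    if getrandom is None:
        return True  # assumption: no non-blocking probe on this platform
    try:
        getrandom(1, os.GRND_NONBLOCK)
    except BlockingIOError:
        return False  # pool exists but has not been seeded yet
    except OSError:
        return True  # assumption: syscall unavailable (old kernel, sandbox)
    return True


def boot_safe_random_bytes(num_bytes: int) -> bytes:
    """Refuse to hand out bytes while the pool is still unseeded."""
    if not entropy_pool_ready():
        raise RuntimeError("kernel entropy pool not initialized yet")
    return os.urandom(num_bytes)
```

A check like this only tells you the pool has been seeded; it says nothing about who else could observe or influence that seed.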

Entropy spying and manipulation on shared hardware

The second issue pesters systems even more persistently and resists simple solutions. On virtualized hardware, any other user of a shared physical server can read or influence the entropy collected by that particular server. It has been demonstrated in laboratory environments that entropy spying and manipulation permit an attacker to effectively predict RBG output on virtual machine- and container-based deployments across cloud platforms and operating systems. In another case, as a common workaround for boot-time entropy starvation, many Linux distributions save their final entropy state to the disk on shutdown as a seed to be loaded at reboot. But after the operating system has been shut down, anyone with access to a shared physical disk might be able to read the seed data.

Entropy spying is an extreme danger for software on virtualized infrastructure. Any key generated on a device compromised this way is itself compromised; thus, generating secure random bits on virtual hardware requires a completely novel solution.

A practical solution: Using hardware security modules

What HSMs are and why they’re secure

Hardware-based RBGs are the best source of entropy available on a regular computer, but specialized hardware exists. In particular, for secure random bits when local entropy doesn’t suffice, choose a hardware security module (HSM).

HSMs are special-purpose servers that are primarily used for key management in security-sensitive sectors like government, health care and finance. Capital One uses HSMs to store our most secure keys. HSMs ship with purpose-built entropy generation hardware that is significantly more sophisticated than the entropy sources supplied with general-purpose consumer CPUs. These purpose-built chips are protected from any spying or meddling, generate enormous amounts of entropy, and are nearly impossible to compromise.

Unfortunately, HSMs are often expensive to buy and maintain, and access to HSMs both physically and over the network must be tightly restricted to keep the attack surface on these sensitive machines as small as possible. But as a workaround for the apparently prohibitive cost and security requirements, major cloud platforms provide a way to get secure random bits from an HSM on demand without actually requiring that you purchase and maintain one yourself.

Cloud KMS alternatives from AWS, Azure and GCP

To generate high-quality random bits, AWS Key Management Service (KMS) provides a simple GenerateRandom application programming interface (API) call, Azure Key Vault has Get Random Bytes and Google Cloud Platform (GCP) KMS offers GenerateRandomBytes. None of these API calls requires the creation or management of a dedicated HSM resource, and all three are billed at just $0.03 per 10,000 requests. Random bits generated using these functions are immune to the cloud vulnerabilities that haunt system RBGs.
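
As an example of the AWS option, the sketch below wraps the KMS GenerateRandom call as exposed by boto3's `generate_random` method. The helper names, the default region and the 32-byte default are assumptions for illustration; running it for real requires boto3, AWS credentials and permission to call `kms:GenerateRandom`.

```python
def hsm_random_bytes(kms_client, num_bytes: int = 32) -> bytes:
    """Fetch random bytes through the AWS KMS GenerateRandom API."""
    if not 1 <= num_bytes <= 1024:
        raise ValueError("GenerateRandom accepts 1 to 1024 bytes per call")
    response = kms_client.generate_random(NumberOfBytes=num_bytes)
    return response["Plaintext"]


def make_kms_client(region_name: str = "us-east-1"):
    """Build a real KMS client; region and credentials are assumed."""
    import boto3  # imported lazily so hsm_random_bytes can be tested offline

    return boto3.client("kms", region_name=region_name)
```

Typical usage would be `key = hsm_random_bytes(make_kms_client())`. Passing the client in as an argument keeps the helper easy to test with a stub and easy to point at a different region or account.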

A perfect solution: Quantum random number generation

For enterprises living on the cutting edge of security that wish to use in-house HSMs directly, some newer HSMs have begun incorporating exciting new quantum random number generator (QRNG) chips. QRNG chips source their entropy from measurements of systems governed by the rules of quantum mechanics. Quantum random number generation is protected by the universe itself: measurements of quantum systems are unpredictable as a law of physics. No attacker, even given perfect knowledge of a QRNG chip and all its inputs, can predict its output.

QRNG chips have limited availability today. Integration and cost challenges will likely prohibit on-prem adoption for a while, though cloud-based solutions through the providers of other key generation APIs may be available sooner. And because QRNGs are new, they remain under scrutiny for the side-channel and implementation-specific attacks that imperil all real-world cryptographic systems. But don’t despair if you can’t get access to one yet; modern DRBGs are good enough, so good in fact that QRNGs don’t provide any practical advantages in terms of security (at least not any that are known to the public).

Each application is unique: Matching entropy sources to risk

Depending on the context, securing an application may only require a standard hardware-based entropy source or it may require a bleeding-edge QRNG chip. Since attacks on hardware-based entropy generation are possible, we must assume nation-state threat actors are executing them. So for the most security-sensitive random numbers, like AES or RSA keys protecting sensitive data, using random bits generated by a cloud provider’s HSMs is prudent. But for other applications, such a high level of caution may not be necessary.

Ultimately, each application needs an individual security review, and the source of any required random bits should be examined and evaluated as part of a comprehensive threat model. It is easy to find oneself caught in the Lethean maelstrom of online forums that immerse the topic of random number generation in verbiage that implies a mystery too deep to unravel. But, as with any other security decision, it is a set of practical and understandable criteria that determines the appropriate choice.

Originally published at https://www.capitalone.com.

Authored by William Cabell, Lead Software Engineer, Cyber Intelligence Engineering. I’m a lead software engineer and cybersecurity professional who finds fulfillment in solving hard, open-ended problems. I’ve built resilient engineering cultures and led the development of large-scale software platforms in the cryptography and cyber intelligence domains.

Secure random number generation on virtual hardware was originally published in Capital One Tech on Medium, where people are continuing the conversation by highlighting and responding to this story.