Intuit’s Custom LLM Leaderboard: Optimizing Model Selection for Financial Use Cases

5 min read · Oct 24, 2024

This blog is co-authored by Sumanth Venkatasubbaiah, Senior Manager, and Udi Menkes, Principal Product Manager, for Intuit AI, Data and Analytics.

In the fast-changing world of generative AI (GenAI), staying ahead means selecting the right tools for the job. At Intuit, we’re committed to leveraging cutting-edge technology for innovative solutions in accounting, tax, personal finance, and small business and solopreneur marketing. To achieve that goal, we’re constantly adding to and refining the suite of tools that make up GenOS, our proprietary operating system for building GenAI-powered solutions at scale for consumers and business owners.

Choosing, fine-tuning or building the right LLM is one of the primary keys to success when developing GenAI-powered applications. Intuit takes an LLM-agnostic approach to this challenge so that our engineers can be as flexible as possible in a rapidly evolving environment. Today, we’re proud to share a transformative tool designed to make the process of choosing the right LLM easier than ever — the Intuit LLM Leaderboard. Our custom LLM Leaderboard enables GenAI developers at Intuit to quickly identify the most suitable language models for their specific use cases. This leaderboard focuses on evaluating models on benchmarks tailored to Intuit’s domains.

Why a leaderboard is essential for mature LLM ops

A leaderboard provides a structured, transparent way to compare LLMs. This capability aids in faster, more informed decision-making and helps ensure the highest quality results possible. It also accelerates time to production and fosters a culture of innovation and collaboration. It’s a vital component of an LLM ops stack, ensuring that the best tools are always at hand for developing cutting-edge solutions.

The need for a custom leaderboard

The explosive growth of language model options presents a significant challenge for developers. To confidently select the right model for experimentation, it is crucial to compare them using a robust and structured process. Our proprietary LLM Leaderboard addresses this by providing a transparent and comparative analysis of various model capabilities and performance, specifically tailored to our financial domains.

Tailored Evaluations for Intuit’s Unique Needs

Our custom LLM Leaderboard provides developers with a one-stop shop for evaluating models based on Intuit-specific criteria, aligned to our unique requirements. For instance, it emphasizes tax and accounting knowledge over generic training data. Public benchmarks, while useful, often fail to address these specific business needs.

Structured and Relevant Model Comparisons

A custom leaderboard allows for a relevant comparison of all models. This is important because publicly available leaderboards that include a blend of commercial and open-source models have proprietary metrics to measure quality that are not necessarily clear or relevant to Intuit.

Enhanced Visibility and Faster Decision-Making

The LLM Leaderboard provides visibility across Intuit’s broader tech community, facilitating faster and more informed decision-making. By highlighting the most suitable models for various use cases, it enables teams to make quicker, data-driven selections, accelerating the development of high-quality GenAI-driven solutions.

Key benefits

The LLM Leaderboard isn’t just a comparison tool; it’s a game changer for GenAI development at Intuit:

Simplified Decision-Making: Developers can quickly identify the best models for specific tasks, reducing the time spent on trial and error.
Enhanced Product Quality: Developers can select the most suitable models from the start, ensuring efficient and reliable financial products.
Faster Development Cycles: Using the leaderboard accelerates model selection and deployment, keeping our offerings competitive and responsive.
Knowledge Sharing and Collaboration: Giving all developers access to a centralized repository of comprehensive model evaluations encourages discussions, knowledge-sharing and collaborative problem-solving as teams learn from each other’s experiments and experiences.

Intuit-specific benchmark design

Intuit’s benchmarks were carefully curated through collaboration between industry partners and internal domain experts. This ensures the benchmarks accurately reflect the capabilities required for Intuit’s products and services. The methodology allows for standardized comparison between commercial and open-source models.

Use cases

Here’s how Intuit’s teams can leverage the LLM Leaderboard:

Continuous Language Model Selection: Quickly identify the best models for new projects in domains like accounting, personal finance, and tax, ensuring ongoing alignment with project needs.
Assessing Value of Model Updates: Continuously assess the value of updating an existing LLM with a newly released one, providing quick insights on improvements and relevancy.
Batch Language Model Evaluation: Evaluate multiple models against custom benchmarks to ensure they meet Intuit’s specific requirements.
Batch Benchmark Evaluation on Custom Models: Regularly test fine-tuned models to identify their strengths and areas for improvement, ensuring ongoing optimal performance.
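
As an illustration of the model selection and model update use cases above, here is a minimal Python sketch. It is not Intuit's implementation: the `BenchmarkResult` type, the function names, the model and benchmark names, and all scores are hypothetical placeholders. It ranks models by their mean score across benchmarks and reports the per-benchmark gain of a candidate model over the one currently in use.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class BenchmarkResult:
    """One model's score on one benchmark (hypothetical record shape)."""
    model: str
    benchmark: str
    score: float  # assumed to be an accuracy-like value in [0, 1]


def mean_scores(results):
    """Average each model's scores across all benchmarks it was run on."""
    by_model = {}
    for r in results:
        by_model.setdefault(r.model, []).append(r.score)
    return {model: mean(scores) for model, scores in by_model.items()}


def rank_models(results):
    """Return model names ordered from highest to lowest mean score."""
    averages = mean_scores(results)
    return sorted(averages, key=averages.get, reverse=True)


def update_gain(results, incumbent, candidate):
    """Per-benchmark score difference (candidate minus incumbent).

    Only benchmarks that both models were evaluated on are compared.
    """
    inc = {r.benchmark: r.score for r in results if r.model == incumbent}
    cand = {r.benchmark: r.score for r in results if r.model == candidate}
    return {b: round(cand[b] - inc[b], 4) for b in inc if b in cand}


# Placeholder data for illustration only; these are not real results.
RESULTS = [
    BenchmarkResult("model-a", "tax", 0.70),
    BenchmarkResult("model-a", "accounting", 0.80),
    BenchmarkResult("model-b", "tax", 0.75),
    BenchmarkResult("model-b", "accounting", 0.78),
]

ranking = rank_models(RESULTS)
gain = update_gain(RESULTS, incumbent="model-a", candidate="model-b")
print(ranking)
print(gain)
```

A per-benchmark view matters here because a candidate can improve the average while regressing on one domain, which is the kind of signal a team would want before swapping models.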

Accessing the leaderboard

Intuit developers can access the leaderboard through Intuit’s GenOS AI Workbench, which provides a single-pane-of-glass user experience for AI development. Users can apply filters to narrow down models based on attributes like domain, size, and capability. The performance scores provide initial signals to guide further experimentation.
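
The filtering described above could look something like the following Python sketch. The actual Workbench interface is not shown in this post, so the `LeaderboardEntry` fields, the `filter_entries` function, and the sample entries are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LeaderboardEntry:
    """A hypothetical leaderboard row; field names are assumptions."""
    model: str
    domain: str
    params_billions: float
    capabilities: frozenset
    score: float


def filter_entries(entries, domain=None, max_params_billions=None, capability=None):
    """Keep entries matching every filter that is set, best score first.

    A filter left as None is ignored.
    """
    selected = []
    for entry in entries:
        if domain is not None and entry.domain != domain:
            continue
        if max_params_billions is not None and entry.params_billions > max_params_billions:
            continue
        if capability is not None and capability not in entry.capabilities:
            continue
        selected.append(entry)
    return sorted(selected, key=lambda e: e.score, reverse=True)


# Placeholder entries for illustration only; these are not real models or scores.
ENTRIES = [
    LeaderboardEntry("model-a", "tax", 7.0, frozenset({"qa", "summarization"}), 0.71),
    LeaderboardEntry("model-b", "tax", 70.0, frozenset({"qa"}), 0.83),
    LeaderboardEntry("model-c", "accounting", 13.0, frozenset({"qa"}), 0.78),
]

tax_models = filter_entries(ENTRIES, domain="tax")
small_tax_models = filter_entries(ENTRIES, domain="tax", max_params_billions=10)
print([e.model for e in tax_models])
print([e.model for e in small_tax_models])
```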

Solution architecture

The architecture behind the LLM Leaderboard is designed for maximum functionality:

LLM Evaluation Framework: Enhanced to interact with a model registry.
User Interface: A user-friendly interface for easy interaction and comparison.
Benchmark Management: Structured validation process for new benchmarks.
Model Registry: Centralized storage and management of models.
Model Cards Integration: Detailed information on each model.
Latency & Cost Metrics: An upcoming feature will present latency and cost metrics.
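
To make the relationship between these components more concrete, here is a minimal Python sketch of a model registry, a benchmark with a validation step, and an evaluation function that scores every registered model. Everything in it is an assumption for illustration: the class and function names are hypothetical, the "models" are stub functions rather than real LLM calls, and exact-match accuracy is a stand-in for whatever metrics the actual framework uses.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class ModelCard:
    """Descriptive information plus a callable that produces an answer."""
    name: str
    description: str
    generate: Callable[[str], str]


class ModelRegistry:
    """Central store of models, keyed by name."""

    def __init__(self) -> None:
        self._cards: Dict[str, ModelCard] = {}

    def register(self, card: ModelCard) -> None:
        if card.name in self._cards:
            raise ValueError(f"model already registered: {card.name}")
        self._cards[card.name] = card

    def names(self) -> List[str]:
        return sorted(self._cards)

    def get(self, name: str) -> ModelCard:
        return self._cards[name]


@dataclass
class Benchmark:
    """A named list of (prompt, expected answer) pairs."""
    name: str
    examples: List[Tuple[str, str]]

    def validate(self) -> None:
        """Reject empty benchmarks and blank prompts or answers."""
        if not self.examples:
            raise ValueError(f"benchmark {self.name!r} has no examples")
        for prompt, expected in self.examples:
            if not prompt.strip() or not expected.strip():
                raise ValueError(f"benchmark {self.name!r} has a blank field")


def evaluate(registry: ModelRegistry, benchmark: Benchmark) -> Dict[str, float]:
    """Return exact-match accuracy per registered model on one benchmark."""
    benchmark.validate()
    scores = {}
    for name in registry.names():
        card = registry.get(name)
        correct = sum(
            card.generate(prompt).strip().lower() == expected.strip().lower()
            for prompt, expected in benchmark.examples
        )
        scores[name] = correct / len(benchmark.examples)
    return scores


# Toy benchmark and stub models, for illustration only.
TOY = Benchmark("toy-percentages", [
    ("What is 10% of 200?", "20"),
    ("What is 5% of 100?", "5"),
])
ANSWERS = dict(TOY.examples)

registry = ModelRegistry()
registry.register(ModelCard("always-twenty", "Stub that always answers 20", lambda p: "20"))
registry.register(ModelCard("lookup", "Stub that looks up the answer", lambda p: ANSWERS.get(p, "")))

scores = evaluate(registry, TOY)
print(scores)
```

Keeping the registry separate from the evaluation function means new models or new benchmarks can be added without changing the scoring code, which fits the batch evaluation use cases described earlier.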

Future objectives

Intuit’s LLM Leaderboard is set to evolve:

Complete View: Enhance benchmarks to cover latency, cost, security and safety aspects, in addition to accuracy.
Testing Own Models: Evaluate fine-tuned models against benchmarks.
Contributing New Benchmarks: Allow users to propose and add new benchmarks aligned to evolving business needs.
Curating Intuit Expertise: Create a hub of knowledge for model evaluations and fine-tuning datasets.

Key takeaways

With the accelerating pace of language model development, having a robust selection tool is becoming imperative. The Intuit LLM Leaderboard equips our developers to stay ahead of the curve. By benchmarking on our specialized domains, it empowers them to spend less time experimenting and more time innovating high-impact solutions.

This is just the start of a continuous journey of learning and advancement. It’s also another chapter in our quest to democratize AI by enabling product developers to build, deploy and monitor highly performant models efficiently at scale. The LLM Leaderboard fosters a collaborative ecosystem where our experts share insights and drive Intuit’s AI capabilities forward together. It lays the foundation for us to lead the way in AI-enabled financial services. We are excited to keep enhancing this resource to unlock new possibilities for our customers.

A heartfelt thank you to the team

This accomplishment would not have been possible without the dedication and hard work of our incredible team. We extend our deepest gratitude specifically to Antonio Martinez, Rohan Tangadpalliwar, Crystal Zheng, Udi Menkes, Noa Haas, Ido Mintz, Linoy Cohen, Dmitry Burshtein, Kobi Lemberg, Tom Klein, Eduard Zlotnik, Rami Cohen, Kfir Aharon, Osnat Haj Yahia, Mohith D, Preetesh Sharma, Mrinalini Upadhya, Sooji Son, Eshita Gupta and all the data scientists, AI developers, product managers, and domain experts who contributed to the creation and launch of the LLM Leaderboard. Your unwavering commitment to excellence has been instrumental in bringing this vision to life.

#AI #GenAI #LLMOps #MachineLearning #IntuitEngineering

Intuit Engineering