👨‍💻 “Hallucination Index” Ranks LLMs for Popular AI Use Cases

DTP #28: PLUS: Open-source AI model libraries on the rise

Looking to fill the need for a comprehensive LLM benchmark report that provides a measurement of LLM hallucinations, Galileo Labs has put together a “Hallucination Index” featuring prominent existing LLMs. 

They claim to address three major blind spots overlooked by other LLM benchmarks: 

  • Assessing the actual quality of LLM outputs 

  • Providing practical benchmarks that accommodate varying task types 

  • Evaluating LLMs across both RAG and non-RAG tasks 

Below, we look at the three major use cases for which the Index ranks LLM capabilities: 

💼 AI in Business 

Replicate Raises $40 Million For Open Source AI Model Library 

An article from Forbes highlights Replicate’s recent funding round, which values the startup at $350 million. The platform hosts over 25,000 open-source AI models and has surged in popularity, now serving 2 million software developers. 

Key Points: 

  • Replicate secured $40 million in Series B funding led by Andreessen Horowitz, bringing its valuation to $350 million. 

  • The platform hosts over 25,000 open-source AI models, catering to 2 million software developers. 

  • Growth accelerated with the addition of models like Meta's Llama 2 and Stable Diffusion 2.0. 

  • Interest in open-source models is rising amid concerns about reliance on proprietary platforms. 

  • Replicate faces competition from startups like Together AI and from tech giants such as Nvidia, Google, Amazon, and Microsoft, which offer similar services. 

  • The platform addresses safety concerns with content filters, though their accuracy remains a challenge. 

  • Open-source models offer cost savings and customization potential through fine-tuning. 

  • Replicate charges developers for model runtime and collaborates with Nvidia and various cloud providers. 

  • The new funding will go toward attracting more users, offering enterprise-level services such as security, and enhancing monitoring capabilities. 

Question & Answer without Retrieval 

Use case where models answer questions directly from their own (parametric) knowledge, without retrieving context from an external database. This is the counterpart to Retrieval-Augmented Generation (RAG), a technique introduced by researchers at Facebook AI Research that grounds a generative model's answers in retrieved passages. 
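To make the distinction concrete, here is a minimal sketch of closed-book (no-retrieval) Q&A using the OpenAI Python SDK. The prompt, question, and setup are illustrative choices, not part of the Index's methodology:

```python
# Closed-book Q&A: the model answers from its parametric knowledge
# alone -- no documents are retrieved or supplied as context.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4-0613",  # the Index's top pick for this task type
    messages=[
        {"role": "system",
         "content": "Answer from your own knowledge. If unsure, say so."},
        {"role": "user",
         "content": "In what year was the Hubble Space Telescope launched?"},
    ],
)
print(response.choices[0].message.content)
```

Because nothing grounds the answer, any factual error here is a pure hallucination, which is roughly what the Correctness Score penalizes.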

GPT-4 by OpenAI stands out as the top performer in Question & Answer without Retrieval, with a high Correctness Score of 0.77. Its exceptional accuracy and minimal tendency to hallucinate reinforce its dominance in applications involving general knowledge.  

In the realm of open-source models, Meta's Llama-2-70b leads the pack with a Correctness Score of 0.65, while smaller models such as Meta’s Llama-2-7b-chat and MosaicML’s MPT-7b-instruct proved more susceptible to hallucination, scoring 0.52 and 0.40, respectively, on the same tasks. 

The Index recommends GPT-4-0613 for dependable and precise AI performance within this task category. 

Question & Answer with RAG 

Use case that combines retrieval-based and generative approaches to question answering. RAG pairs a transformer-based language model (such as a GPT model) with a dense vector retriever (such as DPR, Dense Passage Retrieval) so that answers are grounded in retrieved context. 

OpenAI's GPT-4-0613 led the category with a Context Adherence score of 0.76, while the cheaper and faster GPT-3.5-turbo models (-0613 and -1106) closely matched it, scoring 0.75 and 0.74, respectively. 
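To make the pipeline shape concrete, here is a minimal RAG sketch. A real system would use a trained dense retriever (such as DPR) over a vector index; the toy word-overlap retriever, document set, and model choice below are illustrative stand-ins:

```python
# Minimal RAG sketch: retrieve the most relevant passage, then ask the
# model to answer strictly from that context. Production systems replace
# the toy retriever with dense embeddings (e.g., DPR) plus a vector index.
from openai import OpenAI

DOCS = [
    "Replicate hosts over 25,000 open-source AI models.",
    "GPT-4-0613 topped the Hallucination Index for Q&A without retrieval.",
    "Zephyr-7b scored 0.71 on Context Adherence in Q&A with RAG.",
]

def retrieve(query: str, docs: list[str]) -> str:
    """Return the document sharing the most words with the query."""
    q = set(query.lower().split())
    return max(docs, key=lambda d: len(q & set(d.lower().split())))

def answer(query: str) -> str:
    context = retrieve(query, DOCS)
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-0613",  # the Index's pick for this task type
        messages=[
            {"role": "system",
             "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content

print(answer("How many models does Replicate host?"))
```

The system prompt pins the model to the retrieved passage; how faithfully it stays within that passage is what the Context Adherence score measures.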

Unexpectedly, Hugging Face's Zephyr-7b, an open-source model, surpassed the notably larger Llama-2-70b from Meta, securing a Context Adherence Score of 0.71 compared to 0.68, challenging assumptions about the inherent superiority of larger models. 

Conversely, TII UAE's Falcon-40b (Context Adherence Score = 0.60) and MosaicML's MPT-7b (Context Adherence Score = 0.58) lagged on this task. 

For this task type, the Index recommends GPT-3.5-turbo-0613 as the suitable choice. 

Long-form Text Generation 

This use case measures a model's ability to generate extended, coherent, and contextually relevant passages or documents. 

Once again, GPT-4-0613 from OpenAI exhibited superior performance, showing minimal tendency to hallucinate with a high Correctness Score of 0.83. The GPT-3.5-turbo versions (1106 and 0613) closely matched it, scoring 0.82 and 0.81, respectively, and offer similar quality at a lower cost. 

Notably, among open-source alternatives, Meta's Llama-2-70b-chat competed neck and neck with GPT-4, posting a Correctness Score of 0.82 and providing a viable, efficient alternative for this task. Conversely, TII UAE's Falcon-40b (Correctness Score = 0.65) and MosaicML's MPT-7b (Correctness Score = 0.53) lagged in effectiveness. 

The Index recommends Llama-2-70b-chat as an optimal choice, striking a balance between cost efficiency and performance in long-form text generation. 

Final Thoughts 

OpenAI's Superiority: 

  • OpenAI's models demonstrate superior performance in minimizing hallucinations across various task types. 

  • However, this advantage comes with a trade-off: their API-based pricing can escalate overall costs when building generative AI products. 

Open-Source Cost Efficiency: 

  • Opting for lower-cost versions within OpenAI's model range, like GPT-3.5-turbo, presents potential cost-saving opportunities. 

  • Embracing open-source models offers the most substantial cost savings. 

Models for Specific Task Types: 

Long-form Text Generation: Meta's open-source Llama-2-13b-chat emerges as a commendable alternative to OpenAI's models. 

Question & Answer with RAG: Hugging Face's Zephyr model stands as a nimble yet powerful substitute for OpenAI's models, with an inference cost roughly 10 times lower than GPT-3.5 Turbo's. 

Galileo's Evaluation Metrics 

  • Galileo employed proprietary metrics (Correctness and Context Adherence), powered by ChainPoll, to evaluate models. 

  • ChainPoll, developed by Galileo Labs, boasts 87% accuracy in detecting hallucinations; the general polling idea is sketched below. 
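Galileo's exact prompts and aggregation are proprietary, but the ChainPoll paper describes the core recipe: poll a judge LLM several times with a chain-of-thought prompt and average the binary verdicts. Here is a minimal sketch of that idea, with an illustrative judge model, prompt wording, and poll count:

```python
# ChainPoll-style hallucination scoring (illustrative, not Galileo's
# implementation): poll a judge LLM several times with a chain-of-thought
# prompt and average the yes/no verdicts into a score in [0, 1].
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "Context:\n{context}\n\nAnswer:\n{answer}\n\n"
    "Think step by step about whether the answer is fully supported by the "
    "context, then finish with a single line: VERDICT: yes or VERDICT: no."
)

def adherence_score(context: str, answer: str, polls: int = 5) -> float:
    """Fraction of judge runs that find the answer supported by the context."""
    votes = 0
    for _ in range(polls):
        resp = client.chat.completions.create(
            model="gpt-4-0613",  # judge model; any strong LLM can serve
            temperature=1.0,     # sampling variety is what makes polling useful
            messages=[{"role": "user",
                       "content": JUDGE_PROMPT.format(context=context,
                                                      answer=answer)}],
        )
        text = resp.choices[0].message.content.lower()
        votes += "verdict: yes" in text
    return votes / polls

print(adherence_score("Replicate hosts over 25,000 models.",
                      "Replicate hosts exactly 100 models."))
```

Averaging over several sampled judgments smooths out single-run noise, which is the intuition behind polling the judge rather than asking it once.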

💻 Platform Highlight 

Akkio: A business analytics and forecasting tool for data analysis and outcome prediction. Aids in predictive analysis, marketing, and sales. 

Together AI: A cloud platform providing tools and infrastructure for building open-source generative AI models. 

Chingu AI: AI-powered content creation, project management, and productivity platform. 

🌐 From the Web 

Mozilla's innovation group introduced llamafile, an open-source tool that packages multi-gigabyte LLM (Large Language Model) weights into a single cross-OS executable, streamlining distribution across macOS, Windows, Linux, and BSD systems without installation. 

NVIDIA Unveils Enhanced NeMo Framework, Improves LLM Training on H200 GPU 
NVIDIA introduced an upgraded NeMo framework, enhancing LLM training on its H200 GPU and specifically benefiting complex models like Llama 2. The advancements focus on cloud-native capabilities, improved parallelism, and enhanced performance, meeting growing demand for efficient and diverse LLM training. 

Why OpenAI developing an artificial intelligence that’s good at maths is such a big deal 
OpenAI's Q* algorithm represents an advancement in AI's quest for comprehensive reasoning skills. Despite AI's foundation in mathematics, challenges persist, such as limitations in reasoning and creativity. 

💬 Social Highlight

Data Scientists on Reddit discuss the most common fundamentals they see Data Scientists and MLEs lacking: Post 

How much better than GPT 3.5 is GPT 4?: A tweet 

🤖 Prompt of the week 

I want you to act as a SQL code optimizer. The following code is slow. Can you help me speed it up? [Insert SQL] 

See you next week,

Mukundan
