How Do You Measure AI Performance?
At AskHandle, we assess AI performance by checking that responses are both accurate and relevant to the user’s question. Each AI-generated response in AskHandle includes two key scores, displayed in parentheses like (0.8, 0.9), that measure the quality of the AI’s performance. These scores range from 0 to 1 and give a quick indication of how well the AI meets user needs.
- The first score is the Faithfulness Score: This checks if the AI’s response is factually consistent with the data you’ve provided.
- The second score is the Relevance Score: This indicates how well the response addresses the specific question asked.
If either score is zero, meaning the AI couldn’t use your data to respond, you’ll see “General Data” displayed instead of the scores, letting you know the AI used general knowledge to answer the question.
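The display rule above can be sketched in a few lines of Python. This is only an illustration of the behavior described here, not AskHandle’s actual code; the function name `format_scores` is hypothetical.

```python
def format_scores(faithfulness: float, relevance: float) -> str:
    """Return the label shown next to a response (illustrative sketch only)."""
    for score in (faithfulness, relevance):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must be between 0 and 1")
    # If either score is zero, the AI could not use your data,
    # so the label falls back to "General Data".
    if faithfulness == 0 or relevance == 0:
        return "General Data"
    # Faithfulness comes first, Relevance second.
    return f"({faithfulness}, {relevance})"


print(format_scores(0.8, 0.9))
print(format_scores(0.0, 0.9))
```

The first call prints (0.8, 0.9); the second prints General Data because the Faithfulness Score is zero.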
Understanding the Faithfulness Score
The Faithfulness Score (the first number) measures how closely the AI’s response sticks to the factual information in your data. This score ranges from 0 to 1:
- Score of 1: The AI response is fully aligned with the facts in your data.
- Score of 0: The response does not use any facts from your data.
- Scores between 0 and 1: The response is partially accurate, where some claims match your data, but there might be slight differences or additions.
For instance, if a customer asks about your store’s opening hours and the AI’s response matches the exact hours in your data, you’ll see a high Faithfulness Score. A high score tells you the response is consistent with the information you’ve provided; it does not by itself tell you whether the response answered the question.
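One way to picture a score between 0 and 1 is as the share of claims in a response that your data supports. The sketch below assumes exactly that; it is a simplified mental model, not a description of how AskHandle computes the score, and the function name `faithfulness_sketch` is hypothetical.

```python
from typing import List


def faithfulness_sketch(claims_supported: List[bool]) -> float:
    """Fraction of claims backed by your data (assumed model, for illustration)."""
    if not claims_supported:
        # No claims drawn from your data at all: treat as 0.
        return 0.0
    return sum(claims_supported) / len(claims_supported)


# Three claims in the response; two match the data, one was added by the AI.
score = faithfulness_sketch([True, True, False])
print(round(score, 2))
```

Under this model, a response whose claims all match your data scores 1, and one extra unsupported detail pulls the score below 1.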
Understanding the Relevance Score
The Relevance Score (the second number) measures how well the AI’s response answers the specific question being asked. This score also ranges from 0 to 1:
- Score of 1: The AI’s response directly answers the user’s question using the data.
- Score of 0: The response does not address the question with your data.
- Scores between 0 and 1: Partial relevance, where the response is somewhat related to the question but doesn’t fully address it.
For example, if a user inquires about product availability, a high Relevance Score means the AI’s response is directly about availability and draws on your data. This score tells you whether the AI stayed on the user’s specific question; whether the answer is factually consistent with your data is what the Faithfulness Score measures.
Why Faithfulness and Relevance Scores Matter in Measuring AI Performance
Together, these two scores provide insight into the AI’s performance:
- Faithfulness Score: Shows if the AI uses your data accurately.
- Relevance Score: Confirms if the AI is addressing the user’s question specifically.
High scores in both Faithfulness and Relevance indicate a high-quality response. Low scores in either area might suggest gaps in your data or areas where the AI might need additional context or refinement.
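Reading the two scores together might look like the sketch below. The cutoff of 0.7 is an arbitrary assumption chosen for illustration (this article does not define what counts as "low"), and `diagnose` is a hypothetical helper.

```python
from typing import List


def diagnose(faithfulness: float, relevance: float, threshold: float = 0.7) -> List[str]:
    """Turn a score pair into plain-language notes (threshold is an assumption)."""
    notes = []
    if faithfulness < threshold:
        notes.append("low faithfulness: response may not match your data")
    if relevance < threshold:
        notes.append("low relevance: response may not address the question")
    if not notes:
        notes.append("high-quality response")
    return notes


for line in diagnose(0.9, 0.4):
    print(line)
```

Keeping the two checks separate matters because each low score points to a different kind of gap.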
Improving AI Performance with These Scores
Monitoring Faithfulness and Relevance Scores allows for ongoing improvement of AI performance. Low Faithfulness Scores may indicate that the AI is making assumptions or adding unnecessary details, suggesting a need for clearer or more precise data. Low Relevance Scores imply the AI could benefit from more specific information to address particular questions, guiding you to refine or expand your data sources.
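If you keep a record of questions and their scores, ongoing monitoring can be as simple as the sketch below. The record layout, the `summarize` function, and the 0.7 cutoff are all assumptions made for this example; AskHandle’s own export format may differ.

```python
from statistics import mean
from typing import Dict, List


def summarize(log: List[Dict], threshold: float = 0.7) -> Dict:
    """Average both scores and list questions that fall below the cutoff."""
    if not log:
        raise ValueError("log is empty")
    return {
        "avg_faithfulness": mean(r["faithfulness"] for r in log),
        "avg_relevance": mean(r["relevance"] for r in log),
        # A question needs review if either of its scores is low.
        "needs_review": [
            r["question"]
            for r in log
            if min(r["faithfulness"], r["relevance"]) < threshold
        ],
    }


# Hypothetical records, for illustration only.
log = [
    {"question": "What are your opening hours?", "faithfulness": 0.9, "relevance": 0.9},
    {"question": "Is the blue jacket in stock?", "faithfulness": 0.5, "relevance": 0.8},
]
print(summarize(log)["needs_review"])
```

Questions that land in the review list are good candidates for clearer or more specific entries in your data.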
Faithfulness and Relevance Scores offer a straightforward approach to measuring and enhancing your AI’s performance. Watching both scores helps you confirm that your AI chatbot not only responds quickly but also provides factually accurate and relevant answers, creating a more satisfying and trustworthy customer experience.