
Evaluation Bundesbot
Proof of concept for an automatic evaluation of an ITZBund Chatbot using a language model
Description
This project provides ideas for an automatic evaluation of an instance of the ITZBund chatbot (aka Bundesbot, see e.g. https://www.itzbund.de/DE/itloesungen/standardloesungen/chatbots/chatbots.html). It should be seen as a proof of concept and as a source of ideas for others who are tasked with a similar problem. In this project, we use a large language model (LLM) to generate questions that might be sent to the chatbot. The newly generated questions include variations of the pre-defined questions with or without typos, translations, and new questions on a given topic. The answers are retrieved from the chatbot via API calls and checked against the pre-defined set of answers. Additionally, the LLM rates each answer on a scale from 1 to 3, judging whether it is a valid response to the question.
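The evaluation loop described above can be sketched roughly as follows. This is only an illustrative sketch: the `query_chatbot` and `ask_llm` callables stand in for the actual Bundesbot API and the LLM backend, whose concrete interfaces are not specified here.

```python
def rate_answer(question: str, answer: str, ask_llm) -> int:
    """Ask the LLM to rate an answer on a scale from 1 (invalid) to 3 (valid).

    `ask_llm` is a placeholder callable wrapping whatever LLM backend is
    used; its interface is an assumption of this sketch.
    """
    prompt = (
        "Rate on a scale from 1 to 3 whether the answer is a valid "
        f"response to the question.\nQuestion: {question}\nAnswer: {answer}\n"
        "Reply with a single digit."
    )
    reply = ask_llm(prompt)
    digits = [c for c in reply if c.isdigit()]
    score = int(digits[0]) if digits else 1  # default to lowest score
    return min(max(score, 1), 3)  # clamp to the 1..3 range


def evaluate(questions, query_chatbot, ask_llm):
    """Send each question to the chatbot and collect the LLM's rating."""
    results = []
    for q in questions:
        answer = query_chatbot(q)  # placeholder for the Bundesbot API call
        results.append({"question": q, "answer": answer,
                        "score": rate_answer(q, answer, ask_llm)})
    return results
```

Parsing the LLM's reply defensively (extracting the first digit and clamping it) keeps the score usable even when the model does not answer with a single digit as instructed.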
Features
- parsing the JSON export of the Bundesbot
- generating new questions using a large language model
- evaluating the chatbot response
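For the first feature, a minimal sketch of parsing the export could look like this. The actual export format of the Bundesbot is not documented here, so the JSON structure below (an `intents` list with `question`/`answer` fields) is purely a hypothetical example.

```python
import json

# Hypothetical excerpt of a Bundesbot JSON export; the real export
# format may differ, so this structure is only an assumption.
EXPORT = """
{
  "intents": [
    {"question": "Wie beantrage ich einen Account?",
     "answer": "Bitte nutzen Sie das Antragsformular."}
  ]
}
"""

def parse_export(raw: str):
    """Return the pre-defined (question, answer) pairs from a JSON export."""
    data = json.loads(raw)
    return [(item["question"], item["answer"]) for item in data["intents"]]
```

The resulting pairs provide both the seed questions for generating variations and the reference answers against which the chatbot's responses are checked.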