sinazargaran 2 days ago

We've had a lot of problems continuously testing every iteration of our conversational agents. Our agents are more entertainment-focused, so it's harder to evaluate them with a framework, and I haven't come across any benchmarks for our use case. Is that something you're also considering?

  • mesius 2 days ago

    It all comes down to making your own dataset. Have you looked at LangSmith or Langfuse? They have a UI for building datasets out of production traces. But we're taking it one step further and letting you define mock databases, mock APIs, etc.
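
    To make the idea concrete, here's a minimal sketch of the pattern: a hand-built eval dataset plus a mocked API so the agent can be scored deterministically on every iteration. All the names (`fake_agent`, `list_shows`, `book_tickets`) are hypothetical stand-ins, not anything from LangSmith or Langfuse.

    ```python
    # Sketch: score an agent against a curated dataset with a mocked backend.

    # Tiny eval dataset, e.g. curated from production traces.
    dataset = [
        {"input": "What's playing tonight?", "expected_tool": "list_shows"},
        {"input": "Book two tickets for Hamlet", "expected_tool": "book_tickets"},
    ]

    # Mock API: canned responses instead of hitting the real service.
    mock_api = {
        "list_shows": lambda: ["Hamlet", "Macbeth"],
        "book_tickets": lambda show, n: {"show": show, "seats": n, "status": "ok"},
    }

    def fake_agent(user_input):
        # Stand-in for a real agent: picks a tool by keyword.
        return "book_tickets" if "book" in user_input.lower() else "list_shows"

    def run_eval(agent, dataset):
        # Fraction of cases where the agent chose the expected tool.
        hits = sum(agent(case["input"]) == case["expected_tool"] for case in dataset)
        return hits / len(dataset)

    print(run_eval(fake_agent, dataset))  # prints 1.0 for this toy dataset
    ```

    For entertainment agents where there's no single "right answer", the `expected_tool` check would typically be swapped for a rubric or an LLM-as-judge score, but the dataset-plus-mocks loop stays the same.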