
Swarm Evals

Evaluations comparing multi-agent collaboration to individual agents.

Install

$ pip3 install swarms-eval
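
For orientation, here is a minimal sketch of the kind of comparison these evals target: score one agent and a small multi-agent swarm on the same tasks. All names below (run_eval, single_agent, swarm_of_agents) are hypothetical placeholders, not the swarms-evals API.

# Hypothetical sketch of the comparison this repo targets: score a single
# agent and a multi-agent "swarm" on the same tasks. None of the names
# below come from the swarms-evals package.
from typing import Callable, Iterable


def run_eval(solver: Callable[[str], str], tasks: Iterable[tuple[str, str]]) -> float:
    """Accuracy of `solver` over (prompt, expected_answer) pairs."""
    tasks = list(tasks)
    correct = sum(
        1 for prompt, expected in tasks if solver(prompt).strip() == expected.strip()
    )
    return correct / len(tasks)


def single_agent(prompt: str) -> str:
    # Placeholder: call one LLM-backed agent here.
    return "42"


def swarm_of_agents(prompt: str) -> str:
    # Placeholder: fan the prompt out to several agents and take a majority vote.
    candidates = [single_agent(prompt) for _ in range(3)]
    return max(set(candidates), key=candidates.count)


tasks = [("What is 6 * 7?", "42")]
print("single agent accuracy:", run_eval(single_agent, tasks))
print("swarm accuracy:       ", run_eval(swarm_of_agents, tasks))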

License

MIT

Citation

Please cite Swarms in your paper or project if you found it beneficial in any way. We appreciate it!

@misc{swarms,
  author = {Gomez, Kye},
  title = {{Swarms: The Multi-Agent Collaboration Framework}},
  howpublished = {\url{https://github.com/kyegomez/swarms}},
  year = {2023},
  note = {Accessed: Date}
}

Evals

  • ARC (AI2 Reasoning Challenge)
  • HellaSwag
  • Multiverse Math
  • MMLU
  • GLUE
  • SuperGLUE
  • HumanEval
  • MBPP
  • SWE-BENCH
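
Most of the benchmarks above are available through the Hugging Face datasets library; as a hedged illustration, the snippet below loads ARC-Challenge, though this repo may fetch its eval data differently.

# Sketch of pulling one of the listed benchmarks with the Hugging Face
# `datasets` library. This is an assumption about data loading, not
# necessarily how this repo obtains its eval data.
from datasets import load_dataset

# ARC (AI2 Reasoning Challenge), challenge subset, validation split
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="validation")

for example in arc.select(range(3)):
    question = example["question"]
    labels = example["choices"]["label"]   # e.g. ["A", "B", "C", "D"]
    texts = example["choices"]["text"]     # candidate answers
    gold = example["answerKey"]            # gold label
    print(question, list(zip(labels, texts)), "->", gold)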

Evals to implement

Math

[ ] MATH

Code

[ ] HumanEval

[ ] MBPP

[ ] Natural2Code

[ ] MBPP (early)

[ ] SWE-bench

Commonsense reasoning

[ ] ARC (AI2 Reasoning Challenge)

[ ] HellaSwag

[ ] Winogrande

[ ] PIQA

[ ] SIQA

[ ] OpenbookQA

[ ] CommonsenseQA

Question Answering - world knowledge

[ ] WebQuestions

[ ] NaturalQuestions

[ ] TriviaQA

[ ] ComplexWebQuestions

[ ] WebQuestionsSP

[ ] SearchQA

[ ] HotpotQA

[ ] DROP

[ ] WikiHop

[ ] QAngaroo

[ ] Multi

[ ] GLUE (early)

[ ] SuperGLUE

Reading comprehension

[ ] BoolQ

[ ] QuAC

[ ] DROP

Aggregated

[ ] MMLU

[ ] HELM

[ ] BBH

[ ] AGI Eval

Multi-agent

[ ] ChatBot Arena

[ ] MT Bench

Metrics

[ ] Task accomplished (y/n)

[ ] Task quality (1-5)

[ ] Was a tool used? (y/n)

[ ] Which tool was used?

[ ] How was it used?

[ ] Were tools called in the right order?

[ ] Were the tools used correctly?

[ ] Did the environment change as expected?

[ ] Did the agent output match the expected reference output?

[ ] Compute load

[ ] Cost ($): "the cost per 1,000 function calls" [2]

[ ] Average latency (s): "we measure the latency by timing each request to the endpoint, ignoring the function document preprocessing time" [2]

[ ] Model size needed (small, medium, large)

[ ] Model type needed (LM, VLM, predictive, point cloud, NeRF/splat)

[ ] Number of trials to acceptable performance

[ ] Number of trials to best performance
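
To make the checklist concrete, here is a minimal sketch of a per-task result record covering these fields; the field names are illustrative and not part of this repo.

# Illustrative record for the metrics checklist above; field names are
# placeholders, not part of swarms-evals.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskResult:
    task_id: str
    accomplished: bool                       # task accomplished (y/n)
    quality: int                             # task quality, 1-5
    tool_used: bool = False
    tools_called: list[str] = field(default_factory=list)
    tools_in_right_order: Optional[bool] = None
    environment_changed: Optional[bool] = None
    output_matches_reference: Optional[bool] = None
    compute_load: Optional[float] = None     # e.g. GPU-seconds
    cost_usd: Optional[float] = None         # cost of the run in dollars
    latency_s: Optional[float] = None        # average request latency
    model_size: str = "medium"               # small / medium / large
    model_type: str = "LM"                   # LM, VLM, predictive, ...
    trials_to_acceptable: Optional[int] = None
    trials_to_best: Optional[int] = None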

References

[1] Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap: https://arxiv.org/html/2402.19450v1 (code: https://github.com/consequentai/fneval/)

[2] Explanation of the cost and latency metrics (Berkeley Function Calling Leaderboard): https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html

[3] Tool-use benchmarks: https://langchain-ai.github.io/langchain-benchmarks/notebooks/tool_usage/intro.html?ref=blog.langchain.dev and https://blog.langchain.dev/benchmarking-agent-tool-use/

[4] Blog post on evaluating LLM apps: https://humanloop.com/blog/evaluating-llm-apps

About

Implementation of all the evals for the Swarms paper.
