
Swarm Evals

Evaluations comparing multi-agent collaboration to individual agents.

Install

$ pip3 install swarms-eval
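
For orientation, here is a minimal sketch of the kind of comparison these evals target: score one agent and a small multi-agent swarm on the same tasks. All names below (run_eval, single_agent, swarm_of_agents) are hypothetical placeholders, not the swarms-evals API.

# Hypothetical sketch of the comparison this repo targets: score a single
# agent and a multi-agent "swarm" on the same tasks. None of the names
# below come from the swarms-evals package.
from typing import Callable, Iterable


def run_eval(solver: Callable[[str], str], tasks: Iterable[tuple[str, str]]) -> float:
    """Accuracy of `solver` over (prompt, expected_answer) pairs."""
    tasks = list(tasks)
    correct = sum(
        1 for prompt, expected in tasks if solver(prompt).strip() == expected.strip()
    )
    return correct / len(tasks)


def single_agent(prompt: str) -> str:
    # Placeholder: call one LLM-backed agent here.
    return "42"


def swarm_of_agents(prompt: str) -> str:
    # Placeholder: fan the prompt out to several agents and take a majority vote.
    candidates = [single_agent(prompt) for _ in range(3)]
    return max(set(candidates), key=candidates.count)


tasks = [("What is 6 * 7?", "42")]
print("single agent accuracy:", run_eval(single_agent, tasks))
print("swarm accuracy:       ", run_eval(swarm_of_agents, tasks))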

License

MIT

Citation

Please cite Swarms in your paper or project if you found it beneficial in any way. We appreciate it!

@misc{swarms,
  author = {Gomez, Kye},
  title = {{Swarms: The Multi-Agent Collaboration Framework}},
  howpublished = {\url{https://github.com/kyegomez/swarms}},
  year = {2023},
  note = {Accessed: Date}
}

Evals

  • ARC (AI2 Reasoning Challenge)
  • HellaSwag
  • Multiverse Math
  • MMLU
  • GLUE
  • SuperGLUE
  • HumanEval
  • MBPP
  • SWE-BENCH
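
Most of the benchmarks above are available through the Hugging Face datasets library; as a hedged illustration, the snippet below loads ARC-Challenge, though this repo may fetch its eval data differently.

# Sketch of pulling one of the listed benchmarks with the Hugging Face
# `datasets` library. This is an assumption about data loading, not
# necessarily how this repo obtains its eval data.
from datasets import load_dataset

# ARC (AI2 Reasoning Challenge), challenge subset, validation split
arc = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="validation")

for example in arc.select(range(3)):
    question = example["question"]
    labels = example["choices"]["label"]   # e.g. ["A", "B", "C", "D"]
    texts = example["choices"]["text"]     # candidate answers
    gold = example["answerKey"]            # gold label
    print(question, list(zip(labels, texts)), "->", gold)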

Evals to implement

Math

[ ] MATH

Code

[ ] HumanEval

[ ] MBPP

[ ] Natural2Code

[ ] MBPP (early)

[ ] SWE-bench

Commonsense reasoning

[ ] ARC (AI2 Reasoning Challenge)

[ ] HellaSwag

[ ] Winogrande

[ ] PIQA

[ ] SIQA

[ ] OpenbookQA

[ ] CommonsenseQA

Question Answering - world knowledge

[ ] WebQuestions

[ ] NaturalQuestions

[ ] TriviaQA

[ ] ComplexWebQuestions

[ ] WebQuestionsSP

[ ] SearchQA

[ ] HotpotQA

[ ] DROP

[ ] WikiHop

[ ] QAngaroo

[ ] Multi

[ ] GLUE (early)

[ ] SuperGLUE

Reading comprehension

[ ] BoolQ

[ ] QuAC

[ ] DROP

Aggregated

[ ] MMLU

[ ] HELM

[ ] BBH

[ ] AGI Eval

Multi-agent

[ ] ChatBot Arena

[ ] MT Bench

Metrics

[ ] Task accomplished (y/n)

[ ] Task quality (1-5)

[ ] Was a tool used? (y/n)

[ ] Which tool was used?

[ ] How was it used?

[ ] Were tools called in the right order?

[ ] Were the tools used correctly?

[ ] Did the environment change as expected?

[ ] Did the agent output match the expected reference output?

[ ] Compute load

[ ] Cost ($): "the cost per 1,000 function calls" [2]

[ ] Average latency (s): "we measure the latency by timing each request to the endpoint, ignoring the function document preprocessing time" [2]

[ ] Model size needed (small, medium, large)

[ ] Model type needed (LM, VLM, predictive, point cloud, NeRF/splat)

[ ] Number of trials to acceptable performance

[ ] Number of trials to best performance
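
To make the checklist concrete, here is a minimal sketch of a per-task result record covering these fields; the field names are illustrative and not part of this repo.

# Illustrative record for the metrics checklist above; field names are
# placeholders, not part of swarms-evals.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskResult:
    task_id: str
    accomplished: bool                       # task accomplished (y/n)
    quality: int                             # task quality, 1-5
    tool_used: bool = False
    tools_called: list[str] = field(default_factory=list)
    tools_in_right_order: Optional[bool] = None
    environment_changed: Optional[bool] = None
    output_matches_reference: Optional[bool] = None
    compute_load: Optional[float] = None     # e.g. GPU-seconds
    cost_usd: Optional[float] = None         # cost of the run in dollars
    latency_s: Optional[float] = None        # average request latency
    model_size: str = "medium"               # small / medium / large
    model_type: str = "LM"                   # LM, VLM, predictive, ...
    trials_to_acceptable: Optional[int] = None
    trials_to_best: Optional[int] = None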

References

[1] Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap: https://arxiv.org/html/2402.19450v1 (code: https://github.com/consequentai/fneval/)

[2] Explanation of the cost and latency metrics (Berkeley Function Calling Leaderboard): https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html

[3] Tool-use benchmarks: https://langchain-ai.github.io/langchain-benchmarks/notebooks/tool_usage/intro.html?ref=blog.langchain.dev and https://blog.langchain.dev/benchmarking-agent-tool-use/

[4] Blog post on evaluating LLM apps: https://humanloop.com/blog/evaluating-llm-apps

About

Implementation of all the evals for the Swarms paper.
