9 Comments
Onyekachukwu Okonji

Really enjoyed reading this. Currently doing research on hallucinations and I'm glad I came across this.

Victor Dibia, PhD

Glad you found it useful.

Thanks for sharing feedback

Taledy

Great work. Very in-depth coverage.

Victor Dibia, PhD

Thanks @derrick! Glad you found it useful! Thanks for sharing and subscribing! Your support is much appreciated!

Taledy

Absolutely! Don't mention it.

Doug Ross

Great summary, thank you.

Wondering if there are visualizations/UIs that can help capture model performance in various interim states?

Or provenance/bias of training datasets?

Victor Dibia, PhD

Model performance benchmarking is an interesting emerging area.

It is also hard because performance is affected by factors beyond the model's weights (e.g., prompt design).

Some benchmarks:

- HELM: https://crfm.stanford.edu/helm/latest/

- EleutherAI LM Evaluation Harness: https://github.com/EleutherAI/lm-evaluation-harness (see the sketch at the end of this reply)

UIs for benchmarking are typically task-based.

Some tools for comparing model outputs:

- https://scale.com/spellbook

- https://twitter.com/natfriedman/status/1633582489850773504?lang=en
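As a rough illustration of the harness mentioned above, here is a minimal sketch of scoring a model on one benchmark task programmatically. It assumes a recent lm-evaluation-harness release (0.4+) with its Hugging Face backend; the model and task names are only examples, and argument names may differ across versions.

```python
# Minimal sketch: evaluate a small Hugging Face model on one benchmark task
# with EleutherAI's lm-evaluation-harness (pip install lm-eval).
# Assumes lm-eval >= 0.4; older releases name the backend "hf-causal" instead of "hf".
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",                                       # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",   # small model, for a quick run
    tasks=["hellaswag"],                              # any registered task name works here
    num_fewshot=0,                                    # zero-shot evaluation
    limit=50,                                         # subset of examples, for speed
)

# Per-task metrics (e.g., accuracy) are reported under the "results" key.
print(results["results"])
```

Note that prompt format, few-shot count, and example limits all change the reported numbers, which is part of why benchmark results are hard to compare across setups.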

Jon

Great work; curious, how do you generate your post images? They look great.

Victor Dibia, PhD

Thank you!

Designed by hand in Figma.