9 Comments

Really enjoyed reading this. Currently doing research on hallucinations and I'm glad I came across this.

Glad you found it useful.

Thanks for sharing the feedback.

Great work. Very in-depth coverage.

Thanks @derrick! Glad you found it useful! Thanks for sharing and subscribing! Your support is much appreciated!

Absolutely! Don't mention it.

Great summary, thank you.

Wondering if there are visualizations/UIs that can help capture model performance in various interim states?

Or provenance/bias of training datasets?

Model performance benchmarking is an interesting emerging area.

It is also hard because performance is affected by factors beyond the model's weights (e.g., prompt design).

Some benchmarks:

- HELM: https://crfm.stanford.edu/helm/latest/

- EleutherAI LM Evaluation Harness: https://github.com/EleutherAI/lm-evaluation-harness (rough usage sketch below)
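
For the Eleuther harness, here is a minimal sketch of running it from Python. The exact entry point, model identifiers, and task names vary across harness versions, and `gpt2` / `hellaswag` below are just placeholder choices, not recommendations:

```python
# Minimal sketch of running the EleutherAI lm-evaluation-harness from Python.
# Assumes a recent harness version that exposes simple_evaluate at the top level;
# the checkpoint and task names are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                    # HuggingFace-backed model loader
    model_args="pretrained=gpt2",  # placeholder checkpoint
    tasks=["hellaswag"],           # placeholder benchmark task
    num_fewshot=0,                 # zero-shot evaluation
)

# Per-task metrics (accuracy, etc.) are reported under the "results" key.
print(results["results"])
```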

UIs for benchmarking are typically task-based.

Some tools for comparing model outputs:

- https://scale.com/spellbook

- https://twitter.com/natfriedman/status/1633582489850773504?lang=en

Great work! Curious, how do you generate your post images? They look great.

Thank you!

Designed by hand in Figma.
