Apple study exposes deep cracks in LLMs’ “reasoning” capabilities

misk@sopuli.xyz · 2 months ago

zbyte64@awful.systems · 2 months ago

Adding the benchmark back into the training process doesn’t mean you get an LLM that can weed out irrelevant data. What you get is an LLM that can pass the new metric, and you then have to design another metric with different semantic patterns to actually know whether it’s “eliminating red herrings”.