Dan ran his resume through HackerRank's open-source applicant tracking system and got a score of 90. Then he ran the same file again and got 74. Then 88. Then 83. Nothing changed between the runs except that he ran it again.
This is from Dan Kinsky's writeup. He fed his own resume into the tool a hundred times and the scores came back anywhere from 66 to 99. If a company's cutoff is 85, the same resume fails about two-thirds of the time. Yup. Two-thirds! Not because of anything in the resume. Just luck.
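To make the cutoff arithmetic concrete, here is a minimal sketch of how a pass rate falls out of repeated scores. It uses only the four scores quoted above, not the full hundred runs from Dan's writeup. The function name `pass_rate` and the choice to count a score equal to the cutoff as a pass are my assumptions, not anything from the tool.

```python
def pass_rate(scores, cutoff):
    """Fraction of runs whose score meets or beats the cutoff.

    Assumption: a score exactly at the cutoff counts as a pass.
    """
    if not scores:
        raise ValueError("need at least one score")
    passed = sum(1 for s in scores if s >= cutoff)
    return passed / len(scores)


# The four scores quoted above for the same, unchanged resume.
quoted_scores = [90, 74, 88, 83]

rate = pass_rate(quoted_scores, cutoff=85)
print(f"passes a cutoff of 85 in {rate:.0%} of these four runs")
```

Four runs are far too few to reproduce the two-thirds figure, which comes from the full hundred. The point is only that the pass or fail outcome depends on which run you happened to get.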
We need to dive in further to see what happened here. The technical skills score came back as 8 out of 10 in 98 of the 100 runs. Almost perfectly stable. That makes sense, because technical skills are basically a checklist. You either know React or you don't.
But the "projects" score was all over the place. The same projects would "lack architectural complexity" in one run and "demonstrate real-world deployment" in the next.
So the tool was rock-steady on the thing that didn't matter much, and basically guessing on the thing that mattered most, which is whether this person is actually any good.
Three things we usually mix up
Before we go further, I think it helps to separate three things that we usually lump together under “intelligence”:
First, there's a worldview. By interacting with the world, we develop a frame: the assumptions about what counts as a good engineer in the first place.
Second, there's the analysis. The comparing and matching and scoring that happens on top of that frame, once you feed it some information.
And third, there's the actual person out in the world, who is good or not good regardless of what any rubric says.
An LLM lives almost entirely in that middle layer. It's very good at analysis. But it just takes whatever worldview you hand it, from the prompt or from its training, and it can't really step back and ask whether that worldview is any good. If you change the rubric, the whole score changes, and the model stays just as confident as before. The frame was doing all the work and nobody checked it.
As a human, you can update the worldview by interacting with the world. But an LLM is locked inside a box it cannot get out of. It can only reason about the first and the third. It can't reach either one.
There's an old idea from logic that says this more precisely. An argument can be valid and still be wrong. "All cats can fly. Felix is a cat. So Felix can fly." The reasoning is perfect but the conclusion is nonsense, because one of the things you fed in was false. (I sometimes reach for Gödel here, but you don't need anything that heavy.) An LLM is a very good engine for making the conclusion follow from what you gave it. It has no way of checking whether what you gave it is true. And it will sound equally confident either way.
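Here is a toy sketch of that gap between valid and true. It is a few lines of rule-following, not a model of how an LLM works inside, and the function name `forward_chain` and the string encoding of the statements are made up for illustration. The engine derives whatever follows from what it was handed and has no step anywhere that asks whether the premises are true.

```python
def forward_chain(facts, rules):
    """Apply 'if premise then conclusion' rules until nothing new appears.

    This checks only that conclusions follow from the premises (validity).
    It never checks whether the premises are true (soundness).
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


# The false premise "all cats can fly", applied to Felix.
rules = [("Felix is a cat", "Felix can fly")]

derived = forward_chain({"Felix is a cat"}, rules)
print("Felix can fly" in derived)  # True: valid reasoning, nonsense conclusion
```

Swap the rule for a true one and the code runs exactly the same way. Nothing in it can tell the two cases apart.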
Working out whether the input is actually true, whether this candidate is actually good or an injury is actually there, isn't analysis at all. It's a different kind of act. It's the part where you have to make contact with the real thing instead of a description of it. And that's the part we keep trying to hand to the machine.
A body on the table
If that sounds like a small complaint about hiring software, here's the same problem with a body involved.
Antoine had been having shoulder pain, so he saw an orthopedist and got an MRI. He came out with a "Grade III (>50%-width) partial-thickness tear" of his subscapularis tendon and a fairly aggressive treatment plan that they started minutes after the scan. It felt rushed to him, so he asked for a copy of everything and ran it through AI.
And here's the thing you have to hold onto before the twist, because it's important: the AI was actually useful. It read a 266 MB DICOM export, set up its own tools, and pointed out two things the clinic hadn't told him. They'd done shockwave therapy even though a clinical guideline says not to for his condition, and they'd injected him with a homeopathic preparation that's registered as having no therapeutic indication. Yes, that's the analysis layer doing exactly what it's good at. Reading, cross-referencing, matching against guidelines, catching what a rushed human missed, and doing all this from the comfort of your home.
Then it read the actual scan. Where the radiologist saw a tear of more than 50%, the AI said the tendon was intact. Antoine then had it compare the two readings, and it took a side, "decisively" in its own words, with moderate-to-high confidence, saying there was no tear at all. So you have two careful readers, the same scan, and completely opposite answers, with a machine confident enough to declare a winner.
He ended up with much more analysis about his shoulder but far less idea of what was actually wrong with it than when he walked in. The word he used for where this left him was "limbo."
This was never really about AI
It's easy to read all this as "AI just isn't good enough yet." I don't think that's the lesson, and there's a Honda Civic with 150,000 miles on it that shows why.
A commenter on the Hacker News thread about Antoine's post described taking that car to three different garages to play the second-opinion game. The idea was to compare what each one said and figure out the truth. He got three completely unrelated recommendations, one of which he knew was wrong. He said he felt worse off than when he started. There's no AI anywhere in that story. Just three human mechanics looking at the same engine and giving three different answers. And his takeaway is really the whole point of this piece:
The solution to uncertain information isn't more information, which the AI can certainly provide, it's better information, and AI cannot currently provide that.
More passes over the problem, whether by a human or a machine, don't create ground truth. If you stack ten opinions on a genuinely uncertain question, you get a fuzzier picture, not a sharper one. The wobble was never really about who was doing the judging. It comes from trying to reach a truth that lives inside the thing itself while you're only ever holding a description of it. A resume instead of the engineer. A scan instead of the shoulder. A customer's account instead of the actual engine.
And humans fall into the exact same hole. There's a story that gets passed around about a Google hiring committee that was handed a stack of anonymized candidate packets and voted to reject all of them, and then found out the packets were their own, from when they themselves had interviewed. Judging the paper instead of the person, even the experts couldn't tell themselves apart from a reject.
So how should you actually build with this
This gives us a more useful rule than "be careful with AI."
For any step in a process, ask yourself one thing. Does finishing this step need you to touch the real thing, or just work with a description of it?
If it's just the description, we can probably automate it. Parsing a document, structuring data, converting a format, running a deduction, checking something against a guideline, summarizing, ticking a box. This is where an LLM is better than a person. Faster, cheaper, more consistent, and it takes the boring work off your plate. Why copy fields from one spreadsheet to another by hand? Why type in the 13 floors of a building manually when an LLM can do it for you? It's a great assistant.
Antoine's AI catching the bad injection was this. The ATS turning a PDF into clean fields was this. These are good. Refusing to use AI here because you're nervous about it is just wasting it.
But if finishing the step needs actual contact with the real thing, keep a human on it. Judging whether a person is good. A diagnosis you're going to act on. Anything built on trust. Any call where the truth is in the thing, not the paperwork about it. When real human connection is the point, an AI can't stand in for it. And it has to be a human who is actually in contact, not just another person reading the same file the machine was reading. Otherwise you've just swapped a software analyst for a human one and kept the problem.
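If it helps to see the rule written down, here is a minimal sketch of it as a routing check. Everything in it is hypothetical: the `Step` type, the `needs_contact` flag, and the `route` function are names I made up, and the example steps are just the ones from this piece.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    needs_contact: bool  # hypothetical flag: must someone touch the real thing?


def route(step):
    """Return who should own a step under the rule described above."""
    return "human in contact" if step.needs_contact else "automate"


steps = [
    Step("parse the resume PDF into fields", needs_contact=False),
    Step("check a treatment against a guideline", needs_contact=False),
    Step("judge whether the candidate is good", needs_contact=True),
    Step("make a diagnosis you will act on", needs_contact=True),
]

for step in steps:
    print(f"{route(step):>16}  {step.name}")
```

Notice that the code can't set `needs_contact` for you. Deciding whether a step needs contact with the real thing is itself a judgment, so a person has to fill in that flag.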
An LLM that says "Experience: 25/25" or "tendon intact, moderate-to-high confidence" sounds like judgment, but it isn't. It's analysis dressed up as judgment.
If you know that p is true and you also know p => q, then an LLM can deduce that q is true, even if the statements aren’t exactly in this form. Generally, LLMs are good at logical inference and at dealing with unstructured data.
But logical inference itself is limited. You still have to find out whether p is true or not. That is the ground truth. How do you find that? The issue is not in logical inference. It’s in determining the value of p, which takes much more than logic.
So the short version is: let the machine do the analysis, and keep the actual judgment with people. By judgment I mean working out what's really true about a real person or thing. And make sure the human you keep is genuinely in contact with that person or thing, not just doing the machine's job by hand.
Some decisions aren't just made more accurate by a human, they're made by one. A diagnosis you'll go through with. A person you'll work next to. A customer who needs to feel heard. An employee who needs to feel that you are with them. Take the human out of those and you have completely missed the point.
Read the room
There's a scene in Good Will Hunting where Robin Williams sits Matt Damon down on a park bench and tells this kid who has read every book that he still can't say what it smells like inside the Sistine Chapel. He's never stood there and looked up. As Jay Acunzo puts it, the kid has all the knowing and none of the living. Reading Oliver Twist isn't the same as being an orphan. Knowing about something isn't the same as having touched it.
That's the line to build around. An LLM has read the whole internet, but it can't read the room. Give it everything that's only logic, the parsing and matching and structuring and the tireless grind it does better than any of us, and it'll hand you back hours of your day.
But for everything that needs you to connect with a human, and touch the real thing, that part is still yours.
Sources: HackerRank's open-source ATS, scored four ways — Dan Kinsky · Using Opus 4.8 to get a second opinion on an MRI — Antoine · Hacker News discussion · When a hiring committee at Google fails to hire themselves — Lucas Bean · The best response to AI slop is from Robin Williams — Jay Acunzo