comet-ml/opik · issue #5523 · related PRs #6624 (merged), #6819 (open)
I came in to claim issue #5523 as a free proof PR. Read the source instead and found that PR #6624 had already landed the task-error fix. Rather than retire the lead, I traced the same silent-swallow shape one layer deeper into calculateScores lines 505-509, where a metric throw is still logged-only with no marker in scoreResults. That is the cold-mail angle: not "I have a PR for you", but "I read your code carefully enough to find the sibling pattern your fix didn't cover".
The cold-outreach plan listed issue #5523 (EvaluationEngine silently swallows task errors) as an unclaimed target. Standard play would have been: pull the repo, write the fix, commit, mail the maintainer with a PR-ready branch. Instead, I opened sdks/typescript/src/opik/evaluation/engine/EvaluationEngine.ts and saw that the code already looked nothing like the snippet in the issue body.
An ItemRunOutcome discriminated union ({ kind: 'success' | 'failure', testResult, error? }), a flat queue of (item, runIndex) pairs scheduled through a concurrency semaphore, testResults populated on every outcome, separate errors[] accumulator, createFailedTestResult() with TASK_ERROR_SCORE_NAME and scoringFailed: true, and EvaluationResultProcessor.processResults(testResults, experiment, elapsedSeconds, errors) surfacing both halves at the end. The original silent-swallow is gone.
That fix landed in PR #6624 (merged). There is also PR #6819 (open) refining the infrastructure-error path on top.
The task-error path is now tracked. The metric-error path, what happens when a scoring metric itself throws while evaluating a successful task output, is still logged-only. Same UX failure mode #5523 was opened against, just at a different layer of the call stack.
calculateScores lines 505-509The catch block exists, the error is logged, and that is the end of it. scoreResults never receives an entry. The downstream consumer sees an evaluation with five metrics declared and three results returned, with no machine-readable signal that the other two threw. The reason is buried in a debug log.
flowchart LR Task["Task runs
(success)"] --> Score[calculateScores called] Score --> Loop{For each metric} Loop --> Eval[metric.score] Eval -- success --> Push[Push scoreResult] Eval -- throw --> Log[logger.error only] Log -. nothing pushed .-> Out["scoreResults
(missing entries)"] Push --> Out Out --> Cons[Caller sees incomplete results
with no marker explaining why] style Log fill:#1e0a0a,stroke:#ef4444,color:#fca5a5 style Cons fill:#1e1408,stroke:#fbbf24,color:#fde68a
The codebase already has the right shape: TASK_ERROR_SCORE_NAME = "__opik_task_error__", scoringFailed: true, a EvaluationScoreResult entry with value: 0 and reason: errorMessage. The metric-error path should follow exactly the same shape.
That is the entire change at the call site. The new constant in types.ts is one line. EvaluationResultProcessor already special-cases scoringFailed (line 26: if (!score || score.scoringFailed || typeof score.value !== "number")), so the surfacing layer needs no changes, the metric error appears in the results identically to a task error.
METRIC_ERROR_SCORE_NAME constant, or reuse TASK_ERROR_SCORE_NAME with a discriminator field? Either is defensible, I'd default to a new constant for clarity, but the maintainer's call.scoreResults array, so yes by default.BaseSuiteEvaluator metrics, the error case routes into assertionResults as status: "failed". Confirm that's the intended behavior, I read it as "failure to score is itself a failed assertion", which feels right.The mail to the Opik TS SDK maintainer leads with "I came in to take #5523, read the code, saw your fix is already in, and stuck around long enough to find the sibling pattern". That is a different sales pitch than the other three repos, it sells a code-review eye rather than a PR-ready commit. For Comet, where the team is sharp and the AI-augmented "Ollie" bot already exists internally, code-review eyes are arguably the more valuable signal anyway.
Read EvaluationEngine.ts, verified #5523 fixed in main, located the metric-error catch at lines 505-509, drafted the fix shape.
Mail leads with the diligence story, ends with the dual CTA: file the issue + open PR on signal, OR 15-minute call to align on naming first.
One new constant, three lines added to the catch block, one vitest test. Estimated half a working day end-to-end.
Realistic odds from this lead alone: ~20-40% conversation, ~5-15% contract. Comet is a real company with a real hiring process, this is more likely to surface a longer-cycle FT opportunity than a same-week retainer.