Opik · Comet · #5523 code review · Francesco

01 · What I expected vs what I found

Diligence beats a stale issue tracker

The cold-outreach plan listed issue #5523 (EvaluationEngine silently swallows task errors) as an unclaimed target. Standard play would have been: pull the repo, write the fix, commit, mail the maintainer with a PR-ready branch. Instead, I opened sdks/typescript/src/opik/evaluation/engine/EvaluationEngine.ts and saw that the code already looked nothing like the snippet in the issue body.

What the issue said the code looked like

for (let i = 0; i < datasetItems.length; i++) {
  try {
    const testResult = await this.executeTask(datasetItem);
    testResults.push(testResult);          // only on success
  } catch (error) {
    logger.error(`Error processing item ${datasetItem.id}`);
    // NO testResult pushed, silently skipped
  }
}
      

What the code actually looks like in main today

An ItemRunOutcome discriminated union ({ kind: 'success' | 'failure', testResult, error? }), a flat queue of (item, runIndex) pairs scheduled through a concurrency semaphore, testResults populated on every outcome, separate errors[] accumulator, createFailedTestResult() with TASK_ERROR_SCORE_NAME and scoringFailed: true, and EvaluationResultProcessor.processResults(testResults, experiment, elapsedSeconds, errors) surfacing both halves at the end. The original silent-swallow is gone.

That fix landed in PR #6624 (merged). There is also PR #6819 (open) refining the infrastructure-error path on top.

The honest read: shipping a third PR on a file that already has one merged and one open would be noise, not signal. The maintainer would close it as duplicate within a day, and the credibility of every future outreach drops a notch. Better to keep the source-reading work and pivot the angle.

02 · The sibling pattern still in the file

Same silent-swallow shape, one layer deeper

The task-error path is now tracked. The metric-error path, what happens when a scoring metric itself throws while evaluating a successful task output, is still logged-only. Same UX failure mode #5523 was opened against, just at a different layer of the call stack.

The exact code that still swallows, in `calculateScores` lines 505-509

for (const metric of effectiveMetrics) {
  logger.debug(`Calculating score for metric ${metric.name}`);
  try {
    validateRequiredArguments(metric, scoringInputs);
    const metricResults = await metric.score(scoringInputs);
    const resultArray = Array.isArray(metricResults) ? metricResults : [metricResults];
    scoreResults.push(...resultArray);
    // ... routes into assertionResults / feedbackScoreResults ...
  } catch (error) {                                    // ← line 505
    const errorMessage = error instanceof Error ? error.message : String(error);
    logger.error(`Metric ${metric.name} failed: ${errorMessage}`);  // ← line 508
  }                                                       // ← line 509  no push to scoreResults
}
      

The catch block exists, the error is logged, and that is the end of it. scoreResults never receives an entry. The downstream consumer sees an evaluation with five metrics declared and three results returned, with no machine-readable signal that the other two threw. The reason is buried in a debug log.

flowchart LR
  Task["Task runs
(success)"] --> Score[calculateScores called]
  Score --> Loop{For each metric}
  Loop --> Eval[metric.score]
  Eval -- success --> Push[Push scoreResult]
  Eval -- throw --> Log[logger.error only]
  Log -. nothing pushed .-> Out["scoreResults
(missing entries)"]
  Push --> Out
  Out --> Cons[Caller sees incomplete results
with no marker explaining why]
  style Log fill:#1e0a0a,stroke:#ef4444,color:#fca5a5
  style Cons fill:#1e1408,stroke:#fbbf24,color:#fde68a

03 · The fix shape

Mirror the pattern Comet's own engineers already established

The codebase already has the right shape: TASK_ERROR_SCORE_NAME = "__opik_task_error__", scoringFailed: true, a EvaluationScoreResult entry with value: 0 and reason: errorMessage. The metric-error path should follow exactly the same shape.

Proposed change, in three lines added to the catch block

} catch (error) {
  const errorMessage = error instanceof Error ? error.message : String(error);
  logger.error(`Metric ${metric.name} failed: ${errorMessage}`);
  scoreResults.push({                              // ← new
    name: METRIC_ERROR_SCORE_NAME,                 // new constant, mirrors TASK_ERROR_SCORE_NAME
    value: 0,
    reason: errorMessage,
    scoringFailed: true,
  });
}
      

That is the entire change at the call site. The new constant in types.ts is one line. EvaluationResultProcessor already special-cases scoringFailed (line 26: if (!score || score.scoringFailed || typeof score.value !== "number")), so the surfacing layer needs no changes, the metric error appears in the results identically to a task error.

Design question for the maintainer

Naming: New METRIC_ERROR_SCORE_NAME constant, or reuse TASK_ERROR_SCORE_NAME with a discriminator field? Either is defensible, I'd default to a new constant for clarity, but the maintainer's call.
Skipped metrics: Should the failed metric still emit a feedback score (with value 0) on the trace? Currently feedback scores are sourced from the same scoreResults array, so yes by default.
Assertion routing: For BaseSuiteEvaluator metrics, the error case routes into assertionResults as status: "failed". Confirm that's the intended behavior, I read it as "failure to score is itself a failed assertion", which feels right.

04 · The cold-mail framing

Diligence as the cold-outreach value, not a finished PR

The mail to the Opik TS SDK maintainer leads with "I came in to take #5523, read the code, saw your fix is already in, and stuck around long enough to find the sibling pattern". That is a different sales pitch than the other three repos, it sells a code-review eye rather than a PR-ready commit. For Comet, where the team is sharp and the AI-augmented "Ollie" bot already exists internally, code-review eyes are arguably the more valuable signal anyway.

The CTA path

1) I open a new issue documenting the metric-error sibling pattern, with the exact line numbers and the proposed fix shape.
2) If you signal interest, I send the PR within 1 day, with a vitest test against a deliberately-throwing metric to lock the behavior in.
3) 15-minute call if you'd rather align on naming first, I'm in CEST, available 15:00-24:00, which is 06:00-15:00 US Pacific.

Honest framing for the buyer: this is the weakest of the four leads in terms of immediate "proof of work", there is no committed branch. But it is also the lead with the highest stack-overlap match (Cloudflare Workers + React 18 + Vite + Tailwind + Playwright is the K-Quest stack, line for line). If the conversation goes anywhere, the fit on opik-frontend work is essentially 1:1.

05 · The outreach

Where this stands

Done

Source-read and sibling pattern identified

Read EvaluationEngine.ts, verified #5523 fixed in main, located the metric-error catch at lines 505-509, drafted the fix shape.

Cold mail to a named Opik TS SDK maintainer

Mail leads with the diligence story, ends with the dual CTA: file the issue + open PR on signal, OR 15-minute call to align on naming first.

If green-light

Issue + PR within 1-2 days

One new constant, three lines added to the catch block, one vitest test. Estimated half a working day end-to-end.

Goal

Conversation within 1 week, contract within 6 weeks

Realistic odds from this lead alone: ~20-40% conversation, ~5-15% contract. Comet is a real company with a real hiring process, this is more likely to surface a longer-cycle FT opportunity than a same-week retainer.

Opik (Comet)

Diligence beats a stale issue tracker

What the issue said the code looked like

What the code actually looks like in main today

Same silent-swallow shape, one layer deeper

The exact code that still swallows, in `calculateScores` lines 505-509

Mirror the pattern Comet's own engineers already established

Proposed change, in three lines added to the catch block

Design question for the maintainer

Diligence as the cold-outreach value, not a finished PR

The CTA path

Where this stands

Source-read and sibling pattern identified

Cold mail to a named Opik TS SDK maintainer

Issue + PR within 1-2 days

Conversation within 1 week, contract within 6 weeks

Read the source yourself

Diligence beats a stale issue tracker

What the issue said the code looked like

What the code actually looks like in main today

Same silent-swallow shape, one layer deeper

The exact code that still swallows, in calculateScores lines 505-509

Mirror the pattern Comet's own engineers already established

Proposed change, in three lines added to the catch block

Design question for the maintainer

Diligence as the cold-outreach value, not a finished PR

The CTA path

Where this stands

Source-read and sibling pattern identified

Cold mail to a named Opik TS SDK maintainer

Issue + PR within 1-2 days

Conversation within 1 week, contract within 6 weeks

Read the source yourself

The exact code that still swallows, in `calculateScores` lines 505-509