AICH hashset save can fail spuriously after hashing because `known2.met` lock wait times out

Summary

Before commit 8a5a33c, main gave CAICHRecoveryHashSet::SaveHashSet() only five seconds to acquire the global known2.met mutex. If the lock was still held when the wait expired, the function returned false and the just-calculated hashset was treated as a save failure.

eMuleAI removes that timeout and waits for the mutex instead. For this specific path that looks like the right stock-preserving fix: the expensive hashing work is already finished, the call is serialized through a single shared file, and a false negative save is worse than a short wait.

Current Mainline Status

Done in main via commit 8a5a33c (BUG-032 remove AICH hashset save timeout).

The landed fix is intentionally narrow: CAICHRecoveryHashSet::SaveHashSet() now waits for the known2.met mutex normally instead of treating a 5-second wait as save failure. The file format, write path, and caller behavior were otherwise left unchanged.

Evidence In Current Tree

  • srchybrid/SHAHashSet.cpp
    • CAICHRecoveryHashSet::SaveHashSet() does:
      • CSingleLock lockKnown2Met(&m_mutKnown2File);
      • if (!lockKnown2Met.Lock(SEC2MS(5))) return false;
  • srchybrid/KnownFile.cpp
    • both CreateFromFile() and CreateAICHHashSetOnly() call SaveHashSet() after building the recovery hashset
  • analysis\emuleai\srchybrid\SHAHashSet.cpp
    • replaces the five-second timeout with an unconditional wait
    • the inline rationale is explicit: timing out here was causing "Failed to save AICH Hashset" after successful hashing

Why This Looks Real

This is a classic false-failure race:

  • hashing already consumed the expensive I/O and CPU work
  • another thread can still hold the known2.met mutex for legitimate reasons
  • SaveHashSet() gives up after five seconds and returns false, so the caller reports a failure even though waiting a little longer would have let the save succeed

The timeout hurts most during busy startup or concurrent shared-file hashing, when contention for the known2.met mutex is highest.
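The race can be reproduced outside eMule with a portable stand-in. The sketch below is not the eMule code: it assumes std::timed_mutex in place of the MFC mutex, and the names g_known2Mutex, SaveHashSetWithTimeout, and SaveWhileContended are hypothetical. It only illustrates how a bounded wait turns a slow lock holder into a reported save failure.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the global known2.met mutex.
std::timed_mutex g_known2Mutex;

// Pre-fix shape: give up when the lock is not acquired within `timeout`.
bool SaveHashSetWithTimeout(std::chrono::milliseconds timeout) {
    std::unique_lock<std::timed_mutex> lock(g_known2Mutex, std::defer_lock);
    if (!lock.try_lock_for(timeout))
        return false;  // surfaces as a save failure in the caller
    // ... the write to the shared file would happen here ...
    return true;
}

// Attempts one save while another thread holds the mutex for `holdTime`.
bool SaveWhileContended(std::chrono::milliseconds holdTime,
                        std::chrono::milliseconds timeout) {
    // The demo only shows the race when the holder outlasts the timeout.
    assert(holdTime > timeout);

    std::promise<void> held;
    std::thread holder([&] {
        std::lock_guard<std::timed_mutex> lock(g_known2Mutex);
        held.set_value();  // signal that the lock is now held
        std::this_thread::sleep_for(holdTime);
    });

    held.get_future().wait();  // do not race the holder for the lock
    const bool saved = SaveHashSetWithTimeout(timeout);
    holder.join();
    return saved;
}
```

Calling SaveWhileContended with a hold time longer than the timeout returns false even though nothing is wrong with the data being saved. The same call made once the holder has released the mutex succeeds.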

Likely Fix Shape

Keep the fix narrow:

  1. remove the five-second timeout in CAICHRecoveryHashSet::SaveHashSet()
  2. wait for the mutex normally
  3. keep the existing file-format, save path, and long-path-safe open logic unchanged

Do not blend this with the broader eMuleAI AICH tree semantics changes.
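The steps above can be sketched with a portable analogue. In MFC, CSingleLock::Lock() defaults its timeout argument to INFINITE, so dropping the SEC2MS(5) argument is enough to get a normal blocking wait. The block below is an illustration only: it uses std::mutex instead of the MFC types, and g_known2Mutex, g_known2Records, SaveHashSetBlocking, and SaveBehindSlowHolder are hypothetical names standing in for the real mutex, file contents, and call sites.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::mutex g_known2Mutex;                  // hypothetical stand-in for the known2.met mutex
std::vector<std::string> g_known2Records;  // hypothetical stand-in for the file contents

// Post-fix shape: wait for the mutex with no timeout, then write.
bool SaveHashSetBlocking(const std::string& record) {
    assert(!record.empty());
    std::lock_guard<std::mutex> lock(g_known2Mutex);  // blocks until the lock is free
    g_known2Records.push_back(record);                // simulated write, path unchanged
    return true;
}

// Saves one record while another thread holds the mutex for `holdTime`.
bool SaveBehindSlowHolder(std::chrono::milliseconds holdTime) {
    std::promise<void> held;
    std::thread holder([&] {
        std::lock_guard<std::mutex> lock(g_known2Mutex);
        held.set_value();  // signal that the lock is now held
        std::this_thread::sleep_for(holdTime);
    });

    held.get_future().wait();
    const bool saved = SaveHashSetBlocking("hashset-1");  // waits out the holder
    holder.join();
    return saved;
}
```

The trade-off is that the save now takes as long as the slowest lock holder, which is why the shutdown check in the validation list still matters.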

Validation Target

  • run concurrent shared-file hashing / AICH generation so SaveHashSet() calls contend
  • verify the hashset no longer fails spuriously on timeout
  • verify shutdown or close behavior still completes cleanly when hashing is in flight
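A contention check along the lines of the first two items could look like the harness below. It is a sketch under stated assumptions, not a test from the eMule tree: CountFailedSaves, BlockingSave, g_saveMutex, and g_savedCount are hypothetical, and the 5 ms sleep merely simulates the time spent writing. It does not cover the shutdown case.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Runs `workers` threads that each call `save` once and returns how many
// of those calls reported failure.
int CountFailedSaves(int workers, const std::function<bool()>& save) {
    assert(workers > 0);
    std::atomic<int> failures{0};
    std::vector<std::thread> pool;
    for (int i = 0; i < workers; ++i)
        pool.emplace_back([&] { if (!save()) ++failures; });
    for (auto& t : pool)
        t.join();
    return failures.load();
}

std::mutex g_saveMutex;  // hypothetical stand-in for the known2.met mutex
int g_savedCount = 0;    // number of completed simulated saves

// Hypothetical save with the post-fix behavior: block, then write.
bool BlockingSave() {
    std::lock_guard<std::mutex> lock(g_saveMutex);
    std::this_thread::sleep_for(std::chrono::milliseconds(5));  // simulated write
    ++g_savedCount;  // guarded by g_saveMutex
    return true;
}
```

With a blocking save, CountFailedSaves(8, BlockingSave) should report zero failures and g_savedCount should reach 8. Passing a timed variant with a timeout shorter than the simulated write would be expected to report failures under the same load.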