Download checker — duplicate and near-duplicate intake guard
Workflow status is tracked in GitHub: https://github.com/emulebb/emulebb/issues/16. This local document is retained as an engineering spec/evidence record.
Summary
Add an intake-time checker that evaluates a new download against existing downloads, history, and shared files before the item is accepted.
The intent is to catch obvious duplicates, suspicious near-duplicates, and repeated junk before they clutter the queue.
Intended Mainline Shape
- compare new downloads against current downloads, known history, and shared inventory
- detect exact duplicates first and optionally warn or reject
- allow looser near-duplicate heuristics as an advanced mode
- optionally auto-blacklist or auto-hide clearly bad repeat items
- present the result as a user-facing decision rather than silently overriding everything
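As a minimal sketch of how a match could become a user-facing decision rather than a silent override, the mapping below turns a match kind plus a configured policy into an action. All names here (`MatchKind`, `IntakePolicy`, `IntakeAction`, `IntakePolicies`, `decide`) and the default values are hypothetical illustrations, not identifiers from eMuleBB or eMuleAI.

```cpp
#include <cassert>

// Hypothetical types for illustration only; not taken from any existing code.
enum class MatchKind { None, ExactDuplicate, NearDuplicate, Blacklisted };
enum class IntakePolicy { Allow, Warn, Reject };
enum class IntakeAction { Accept, AskUser, Refuse };

// Per-category operator settings. The defaults are assumptions: exact
// duplicates warn, near-duplicate handling is off, blacklist filtering is opt-in.
struct IntakePolicies {
    IntakePolicy exactDuplicate = IntakePolicy::Warn;
    IntakePolicy nearDuplicate = IntakePolicy::Allow;
    IntakePolicy blacklisted = IntakePolicy::Warn;
};

// Map a detected match to an action according to the configured policy.
IntakeAction decide(MatchKind kind, const IntakePolicies& policies) {
    IntakePolicy policy = IntakePolicy::Allow;
    switch (kind) {
        case MatchKind::None:           return IntakeAction::Accept;
        case MatchKind::ExactDuplicate: policy = policies.exactDuplicate; break;
        case MatchKind::NearDuplicate:  policy = policies.nearDuplicate;  break;
        case MatchKind::Blacklisted:    policy = policies.blacklisted;    break;
    }
    switch (policy) {
        case IntakePolicy::Allow:  return IntakeAction::Accept;
        case IntakePolicy::Warn:   return IntakeAction::AskUser;
        case IntakePolicy::Reject: return IntakeAction::Refuse;
    }
    return IntakeAction::Accept;  // not reached; keeps compilers quiet
}
```

Keeping the policy separate from detection means the checker only reports what it found, and the operator setting alone determines whether the user is asked.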
Why Add It
This is a file-handling convenience feature with real operator value on long-running nodes:
- fewer accidental duplicate downloads
- less queue clutter
- less repeated junk from fake/spam-prone searches
eMuleAI Implementation References
Review source: eMuleAI commit 8e34bdec2b7e4fe9e4307df9d80f691804be99ed.
- duplicate map load/update/remove/check paths: DownloadChecker.h, DownloadChecker.cpp
eMuleBB Direction
Start with exact ED2K hash matches against current downloads, completed known files, and shared files. Near-duplicate heuristics should come later and be user-visible, because filename/size similarity can produce false positives.
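One possible shape for the exact-match step is a single hash-keyed index that records which inventories contain a given file. The sketch below assumes the 16-byte MD4-based ED2K file hash as the key. `ExactDuplicateIndex`, `SourceFlag`, and `Ed2kHashHasher` are hypothetical names, and this does not describe how eMuleAI's DownloadChecker stores its map.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>

// The ED2K file hash is MD4-based and 16 bytes long.
using Ed2kHash = std::array<std::uint8_t, 16>;

// Hypothetical bit flags naming the three inventories the spec lists.
enum SourceFlag : unsigned {
    kActiveDownload = 1u,
    kKnownFile      = 2u,
    kSharedFile     = 4u
};

// The key is already a cryptographic digest, so its leading bytes are
// well distributed enough to serve directly as the table hash.
struct Ed2kHashHasher {
    std::size_t operator()(const Ed2kHash& h) const noexcept {
        static_assert(sizeof(std::size_t) <= sizeof(Ed2kHash), "hash too short");
        std::size_t v = 0;
        std::memcpy(&v, h.data(), sizeof(v));
        return v;
    }
};

class ExactDuplicateIndex {
public:
    // Record that `hash` is present in the given inventory.
    void add(const Ed2kHash& hash, SourceFlag source) { sources_[hash] |= source; }

    // Forget one inventory for `hash`; drop the entry once no inventory holds it.
    void remove(const Ed2kHash& hash, SourceFlag source) {
        auto it = sources_.find(hash);
        if (it == sources_.end()) return;
        it->second &= ~static_cast<unsigned>(source);
        if (it->second == 0) sources_.erase(it);
    }

    // Returns a bitmask of SourceFlag values, or 0 when the hash is unknown.
    unsigned lookup(const Ed2kHash& hash) const {
        auto it = sources_.find(hash);
        return it == sources_.end() ? 0u : it->second;
    }

private:
    std::unordered_map<Ed2kHash, unsigned, Ed2kHashHasher> sources_;
};
```

A lookup is one average-constant-time hash probe, which fits the requirement that exact checks stay deterministic and cheap. Returning a bitmask lets the UI say where the duplicate lives instead of only that one exists.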
Scope Constraints
- exact-duplicate checks should be deterministic and cheap
- near-duplicate mode must stay optional because false positives are possible
- this feature should complement, not replace, the KnownFileList correctness fixes under BUG-037
- a local strong-hash cache such as a BLAKE3 sidecar is a valid future implementation aid for exact-duplicate confidence, but it is not required for the first version
- if a BLAKE3 value is ever exposed on the network, it should start as advisory extra metadata for upgraded peers only, not as a replacement for MD4 identity
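To illustrate why near-duplicate mode has to stay optional, here is a deliberately simple heuristic sketch: two items are flagged when their normalized names match and their sizes differ by at most a small fraction. The normalization rule, the 1% default tolerance, and the names `Candidate`, `normalizeName`, and `isNearDuplicate` are all assumptions made for this example. A real rule set would need tuning against false positives.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical description of an item being compared.
struct Candidate {
    std::string name;
    std::uint64_t size;
};

// Lowercase the name and keep only ASCII letters and digits, so separators
// and case differences do not matter. In the default "C" locale, non-ASCII
// bytes are dropped, which is one source of false positives.
std::string normalizeName(const std::string& name) {
    std::string out;
    for (unsigned char c : name) {
        if (std::isalnum(c)) out.push_back(static_cast<char>(std::tolower(c)));
    }
    return out;
}

// Returns true only when the optional mode is enabled, both names normalize
// to the same non-empty string, and the sizes are within `sizeTolerance`
// (a fraction of the larger size).
bool isNearDuplicate(const Candidate& a, const Candidate& b,
                     bool nearDuplicateModeEnabled,
                     double sizeTolerance = 0.01) {
    if (!nearDuplicateModeEnabled) return false;
    const std::string na = normalizeName(a.name);
    if (na.empty() || na != normalizeName(b.name)) return false;
    const std::uint64_t larger = std::max(a.size, b.size);
    const std::uint64_t smaller = std::min(a.size, b.size);
    return static_cast<double>(larger - smaller) <=
           sizeTolerance * static_cast<double>(larger);
}
```

Under this rule, two different releases with the same title and similar size would be flagged even though their ED2K hashes differ, so a result like this should only ever feed a warning, never an automatic rejection.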
Acceptance Criteria
- [ ] exact duplicates can be detected before a new download is added
- [ ] operators can choose warn/reject/allow behavior
- [ ] optional near-duplicate mode can be enabled separately
- [ ] clearly blacklisted repeat items can be filtered automatically when configured
- [ ] no regression for normal add-download flows when the feature is disabled
- [ ] exact-hash duplicate behavior is covered before any fuzzy/near-duplicate rule is enabled