When scraping content, collecting user feedback, or processing transcripts, duplicate and near-duplicate sentences are common. They inflate dataset size, skew frequency statistics, and reduce the quality of training data. Extracting unique sentences and grouping near-duplicates simplifies downstream workflows: it speeds up manual review, reduces storage, and improves model training by avoiding repeated examples.
This tool extracts sentence-like units, normalizes them, detects exact duplicates, and groups similar sentences using a fast shingle-based Jaccard similarity. You can tune the similarity threshold, ignore numbers or stopwords during comparison, and export results as CSV. Everything runs locally in your browser, which makes it suitable for private or sensitive text.
1. Split text into sentences using punctuation and line breaks.
2. Normalize (lowercase unless preserving case, trim, optional number removal).
3. Convert each sentence into token shingles and compute Jaccard similarity.
4. Cluster sentences with similarity ≥ threshold.
The most frequent sentence in a cluster becomes the representative.
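The four steps above can be sketched in a few lines of Python. This is an illustrative sketch, not the tool's actual code: the function names, the default threshold of 0.5, the shingle size of 2, and the greedy strategy of comparing each sentence against the first member of every group are all assumptions made for the example.

```python
import re
from collections import Counter

def split_sentences(text):
    # Step 1: split after ., ! or ? followed by whitespace, and on line breaks.
    parts = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [p.strip() for p in parts if p.strip()]

def normalize(sentence):
    # Step 2: lowercase and trim (number removal omitted here for brevity).
    return sentence.lower().strip()

def shingles(sentence, k=2):
    # Step 3a: overlapping runs of k word tokens.
    tokens = re.findall(r"\w+", sentence)
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    # Step 3b: intersection over union of two shingle sets.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster(text, threshold=0.5, k=2):
    # Exact duplicates collapse into one entry with a count.
    counts = Counter(normalize(s) for s in split_sentences(text))
    groups = []
    for sentence, count in counts.items():
        sh = shingles(sentence, k)
        # Step 4 (assumed strategy): join the first group whose seed is similar enough.
        for group in groups:
            if jaccard(sh, group["seed"]) >= threshold:
                group["members"].append((sentence, count))
                break
        else:
            groups.append({"seed": sh, "members": [(sentence, count)]})
    # The most frequent member of each group becomes its representative.
    return [
        (max(g["members"], key=lambda m: m[1])[0], [m[0] for m in g["members"]])
        for g in groups
    ]
```

Calling cluster() on a short block of feedback returns one (representative, members) pair per group, in the order the groups were first seen.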
Shingle-Jaccard is fast and practical but not perfect; it may miss deep semantic paraphrases. For higher accuracy, consider embedding-based similarity (server-side). Use this tool for quick preprocessing and follow with manual review.
No — all processing is local in your browser.
We use punctuation (.!?), line breaks, and paragraph heuristics. For complex text, consider pre-formatting it first, for example by putting one sentence per line.
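To see why pre-formatting can matter, here is a minimal punctuation-and-line-break splitter in Python. It is a simplified stand-in and leaves out the paragraph heuristics; the function name and the regular expression are assumptions for illustration only.

```python
import re

def split_sentences(text):
    sentences = []
    # Every line break is a boundary; inside a line, split after . ! or ?
    for line in text.splitlines():
        for part in re.split(r"(?<=[.!?])\s+", line):
            if part.strip():
                sentences.append(part.strip())
    return sentences

print(split_sentences("Great tool! Works offline.\nNo signup needed"))
# ['Great tool!', 'Works offline.', 'No signup needed']

print(split_sentences("Dr. Smith replied."))
# ['Dr.', 'Smith replied.']
```

The second call shows the typical failure of a purely punctuation-based rule: an abbreviation is treated as a sentence end. Text with many abbreviations, decimals, or lists tends to split more cleanly once it has been pre-formatted.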
Intersection over union of token-shingle sets — a fast measure for near-duplicate detection.
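A small worked example in Python makes the definition concrete. The shingle size of 2 and the whitespace tokenizer are assumptions chosen for the example, not confirmed settings of the tool.

```python
def token_shingles(sentence, k=2):
    # Overlapping runs of k lowercase word tokens.
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

a = token_shingles("the quick brown fox jumps")
b = token_shingles("the quick brown fox sleeps")

# 3 shared shingles out of 5 distinct shingles overall.
similarity = len(a & b) / len(a | b)
print(similarity)  # 0.6
```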
Yes — adjust the similarity slider to tune grouping strictness.
Yes — export group id, frequency, representative sentence and members.
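For readers who post-process the export, the sketch below writes the same four fields with Python's csv module. The column names, the " | " separator between members, and the choice to count frequency as the total occurrences in a group are assumptions; the actual file layout may differ.

```python
import csv
import io

def groups_to_csv(groups):
    # groups: list of dicts with "representative" (str) and
    # "members" (dict mapping sentence -> occurrence count).
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["group_id", "frequency", "representative", "members"])
    for group_id, group in enumerate(groups, start=1):
        frequency = sum(group["members"].values())
        members = " | ".join(group["members"])
        writer.writerow([group_id, frequency, group["representative"], members])
    return buffer.getvalue()
```

The csv module quotes any field that contains a comma, so sentences with commas survive a round trip through a spreadsheet or csv.reader.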
It detects surface-paraphrases that share many tokens. For deeper meanings use embedding-based methods.
Enable the ignore-numbers option so that sentences differing only in IDs or dates are grouped together.
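The effect of ignoring numbers can be illustrated with a short Python sketch. The function name and the exact normalization rules (lowercase, strip digits, collapse whitespace) are assumptions for the example.

```python
import re

def normalize(sentence, ignore_numbers=False):
    s = sentence.lower().strip()
    if ignore_numbers:
        # Drop every run of digits; punctuation between them is left in place.
        s = re.sub(r"\d+", "", s)
    # Collapse the whitespace left behind by removed digits.
    return re.sub(r"\s+", " ", s).strip()

a = "Order 10452 shipped on 2024-03-01."
b = "Order 10977 shipped on 2024-03-05."

print(normalize(a) == normalize(b))              # False
print(normalize(a, True) == normalize(b, True))  # True
```

With the option on, both sentences reduce to the same string, so they count as exact duplicates instead of landing in separate groups.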
Yes. The UI is mobile-friendly, but large inputs work best on desktop.
Yes — Preview shows representative sentences and sample members.
Yes — free and no signup required.
Groups use incremental IDs; the representative is chosen by frequency and shortness.
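One plausible reading of this rule, sketched in Python: prefer the most frequent member and use the shorter sentence to break ties. The tie-breaking order and the function name are assumptions, since the exact rule is not spelled out here.

```python
def pick_representative(members):
    # members: dict mapping sentence -> occurrence count.
    # Highest count wins; among equal counts, the shortest sentence wins.
    return min(members, key=lambda s: (-members[s], len(s)))

members = {
    "the app crashes on login today": 2,
    "the app crashes on login": 2,
    "app crash at login": 1,
}
print(pick_representative(members))  # the app crashes on login
```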
You can inspect groups in the output.
Output is for review and export; you can copy and paste to edit further.
Depends on threshold and shingle size; adjust to suit your dataset.
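A quick experiment in Python shows how strongly the shingle size changes the score for one pair of sentences. The tokenizer, the helper names, and the threshold of 0.5 used in the comments are assumptions for illustration.

```python
def token_shingles(sentence, k):
    tokens = sentence.lower().split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

s1 = "please reset my password today"
s2 = "please reset my account password today"

# Similarity of the same pair for shingle sizes 1, 2 and 3.
sims = {k: jaccard(token_shingles(s1, k), token_shingles(s2, k)) for k in (1, 2, 3)}
for k, score in sims.items():
    print(k, round(score, 3))
# 1 0.833
# 2 0.5
# 3 0.167
```

With a threshold of 0.5 the pair is grouped for sizes 1 and 2 but not for size 3: larger shingles are stricter about word order, so a single inserted word breaks more of them.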
Use the site's contact form and include example input and the desired behavior.