String Comparisons
This library provides functions for measuring how similar two pieces of text are, using well-established similarity and distance metrics. It currently supports the following algorithms:
- Cosine Similarity
- Jaccard Similarity
- Jaro Similarity
- Jaro-Winkler Similarity
- Damerau-Levenshtein Distance
- Hamming Distance
- Levenshtein Distance
- Smith-Waterman Alignment
- Sørensen-Dice Coefficient
- Jaccard Similarity based on Trigrams
- Szymkiewicz-Simpson Overlap
- N-Gram
- Q-Gram
- Optimal String Alignment
Installation
Assuming you have Node.js and npm/yarn/pnpm installed, install the library using:
# Install the 'string-comparisons' package using npm
npm install string-comparisons
# Alternatively, install the 'string-comparisons' package using yarn
yarn add string-comparisons
# Or, install the 'string-comparisons' package using pnpm
pnpm add string-comparisons
String Similarity Algorithm Comparison
Algorithm | Normalized | Metric | Similarity | Distance | Space Complexity |
---|---|---|---|---|---|
cosine.js | Yes | Vector Space Model | ✓ | | O(n) |
jaro.js | No | Edit Distance | ✓ | | O(min(n, m)) |
jaccard.js | No | Set Theory | ✓ | | O(min(n, m)) |
damerauLevenshtein.js | No | Edit Distance | ✓ | | O(max(n, m)²) |
hammingDistance.js | No | Bitwise Operations | ✓ | | O(1) |
jaroWinkler.js | No | Edit Distance | ✓ | | O(min(n, m)) |
levenshtein.js | No | Edit Distance | ✓ | | O(max(n, m)²) |
smithWaterman.js | No | Dynamic Programming (Local Alignment) | ✓ | | O(n * m) |
sorensenDice.js | No | Set Theory | ✓ | | O(min(n, m)) |
trigram.js | No | N-gram Overlap | ✓ | | O(n²) |
szymkiewiczSimpsonOverlap.js | Yes | Overlap Coefficient | ✓ | | O(min(m, n)) |
nGram.js | Yes | Jaccard Similarity Coefficient | ✓ | | O(m * n) |
qGram.js | Yes | Jaccard Similarity Coefficient | ✓ | | O(n + m) |
optimalStringAlignment.js | No | Edit Distance | ✓ | | O(max(n, m)²) |
Explanation of Columns:
- Normalized: Indicates whether the algorithm produces a score between 0 and 1 (normalized).
- Metric: The underlying mathematical concept used for comparison.
- Similarity: Whether the algorithm outputs a higher score for more similar strings.
- Distance: Whether the algorithm outputs a lower score for more similar strings. Similarity and distance carry the same information in opposite directions: a high similarity corresponds to a low distance.
- Space Complexity: The amount of extra memory the algorithm needs to run the comparison.
Notes:
- ✓ indicates the algorithm applies to that category.
- Some algorithms can be used for both similarity and distance calculations depending on the interpretation of the score.
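To illustrate how a distance can be reinterpreted as a normalized similarity, here is a minimal standalone sketch. It does not use this library's API: `levenshteinDistance` and `normalizedSimilarity` are hypothetical helper names, and dividing by the longer string's length is just one common normalization, assumed here for illustration. The distance function keeps only two rows of the dynamic-programming table at a time, which is one way to reduce memory use; the library's own implementation may differ.

```javascript
// Hypothetical helper, not part of string-comparisons: classic Levenshtein
// distance, keeping only the previous and current rows of the DP table.
function levenshteinDistance(a, b) {
  if (a === b) return 0;
  // prev[j] holds the distance between the empty prefix of a and b.slice(0, j).
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance between a.slice(0, i) and the empty string
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match when cost is 0)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Hypothetical helper: map the distance onto a 0..1 similarity score
// (assumed normalization: divide by the length of the longer string).
function normalizedSimilarity(a, b) {
  const longest = Math.max(a.length, b.length);
  if (longest === 0) return 1; // two empty strings are treated as identical
  return 1 - levenshteinDistance(a, b) / longest;
}

console.log(levenshteinDistance('programming', 'programmer')); // 3
console.log(normalizedSimilarity('programming', 'programmer')); // 1 - 3/11, about 0.727
```

With this convention, identical strings score 1 and strings that share nothing score 0, so the result can be compared directly with the normalized algorithms in the table.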
Example Usage
import StringComparisons from 'string-comparisons';
const { Cosine, Jaccard, Jaro, DamerauLevenshtein, HammingDistance, JaroWrinker, Levenshtein, SmithWaterman, SorensenDice, Trigram } = StringComparisons;
const string1 = 'programming';
const string2 = 'programmer';
console.log('Jaro-Winkler similarity:', JaroWrinker.similarity(string1, string2)); // Output: ~0.9054545454545454
console.log('Levenshtein distance:', Levenshtein.similarity(string1, string2)); // Output: 3
console.log('Smith-Waterman similarity:', SmithWaterman.similarity(string1, string2)); // Output: 16
const set1 = new Set([1, 2, 3]);
const set2 = new Set([2, 3, 4]);
console.log('Sørensen-Dice similarity:', SorensenDice.similarity(set1, set2)); // Output: 0.6666666666666667
const trigram1 = 'hello';
const trigram2 = 'world';
console.log('Trigram Jaccard similarity:', Trigram.similarity(trigram1, trigram2)); // Output: 0 (no shared trigrams)
// ...and so on for the remaining algorithms.
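The set-based results above can be checked by hand. The following is a minimal standalone sketch of the underlying formulas, not the library's implementation: `ngrams`, `intersectionSize`, `jaccard`, and `sorensenDice` are hypothetical helpers, and the library's classes may pad, lowercase, or tokenize input differently.

```javascript
// Hypothetical helpers for illustration only.

// Collect the distinct character n-grams of a string (trigrams by default).
function ngrams(text, n = 3) {
  const grams = new Set();
  for (let i = 0; i + n <= text.length; i++) {
    grams.add(text.slice(i, i + n));
  }
  return grams;
}

// Count the elements that appear in both sets.
function intersectionSize(a, b) {
  let count = 0;
  for (const item of a) {
    if (b.has(item)) count++;
  }
  return count;
}

// Jaccard: |A ∩ B| / |A ∪ B|
function jaccard(a, b) {
  const shared = intersectionSize(a, b);
  const union = a.size + b.size - shared;
  return union === 0 ? 1 : shared / union; // assumption: two empty sets count as identical
}

// Sørensen-Dice: 2 * |A ∩ B| / (|A| + |B|)
function sorensenDice(a, b) {
  const total = a.size + b.size;
  return total === 0 ? 1 : (2 * intersectionSize(a, b)) / total; // same empty-set assumption
}

console.log([...ngrams('hello')].join(', ')); // hel, ell, llo
console.log([...ngrams('world')].join(', ')); // wor, orl, rld
console.log(jaccard(ngrams('hello'), ngrams('world'))); // 0
console.log(sorensenDice(new Set([1, 2, 3]), new Set([2, 3, 4]))); // 2 * 2 / 6, about 0.667
```

Because 'hello' and 'world' share no trigram, the intersection is empty and the Jaccard score is 0. The two number sets share 2 and 3, giving a Sørensen-Dice score of 2 * 2 / (3 + 3).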
Resources
- String Similarity Comparison in JS with Examples
- Cosine similarity between two sentences
- The complete guide to string similarity algorithms
- N-Gram Similarity and Distance
- Approximate string-matching with q-grams and maximal matches
- Research on string similarity algorithm based on Levenshtein Distance
- String similarity search and join: a survey