JavaScript MinHash function to generate a characteristic hash key for a text string


I am currently trying out a MinHash algorithm to test the similarity between two texts (or documents), based on this module: https://github.com/sjhorn/node-minhash/blob/master/lib/minhash.js.

In the function, I am separating my text into tokens (or k-shingles), and then calculating a hash number for each of the tokens/shingles:

crc32.str(token) & 0xffffffff

so I am getting back an array of numbers for my different token parts. Example:

The quick brown fox jumps over the lazy dog.  ==  -3677935418
The quick brown fox jumps over the lazy dog.  ==  -3191969143
The quick brown fox jumps over the lazy dog.  ==  -2264094193
The quick brown fox jumps over the lazy dog.  ==  -4003895775
The quick brown fox jumps over the lazy dog.  ==  -4077650760
The quick brown fox jumps over the lazy dog.  ==  -3917776217
The quick brown fox jumps over the lazy.  ==  -3677935418
The quick brown fox jumps over the lazy.  ==  -3191969143
The quick brown fox jumps over the lazy.  ==  -2264094193
The quick brown fox jumps over the lazy.  ==  -4003895775
The quick brown fox jumps over the lazy.  ==  -4077650760
The quick brown fox jumps over the lazy.  ==  -2302592728
M1 length: 6
M2 length: 6
Minhash1: -3677935418  Minhash2: -3677935418
Minhash1: -3191969143  Minhash2: -3191969143
Minhash1: -2264094193  Minhash2: -2264094193
Minhash1: -4003895775  Minhash2: -4003895775
Minhash1: -4077650760  Minhash2: -4077650760
shared: 5 total: 6
Shared/Total: 0.8333333333333334

Compared with each other, the two texts share most of their hash numbers (5 of 6 in the output above). The number of permutations is 6 in this example.
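For reference, the comparison described above can be sketched roughly as follows. This is a minimal illustration and not the code of the linked module: the function names, the word-based shingling, the default of 64 hash functions, and the seeded FNV-1a hash (used here in place of `crc32.str()` so the snippet has no dependencies) are all assumptions.

```javascript
// Split a text into overlapping word k-shingles (k = 3 is an assumed default).
function shingles(text, k = 3) {
    const words = text.toLowerCase().split(/\s+/).filter(Boolean);
    const result = new Set();
    for (let i = 0; i + k <= words.length; i++) {
        result.add(words.slice(i, i + k).join(" "));
    }
    return result;
}

// Seeded 32-bit FNV-1a hash. A stand-in for crc32.str(), chosen only to
// keep the sketch self-contained; each seed acts as one "permutation".
function fnv1a(str, seed) {
    let h = (0x811c9dc5 ^ seed) >>> 0;
    for (let i = 0; i < str.length; i++) {
        h ^= str.charCodeAt(i);
        h = Math.imul(h, 0x01000193) >>> 0; // keep the value unsigned
    }
    return h;
}

// MinHash signature: for every seed, keep the smallest hash over all shingles.
function minhashSignature(text, numHashes = 64, k = 3) {
    const set = shingles(text, k);
    const signature = [];
    for (let seed = 0; seed < numHashes; seed++) {
        let min = 0xffffffff;
        for (const s of set) {
            min = Math.min(min, fnv1a(s, seed));
        }
        signature.push(min);
    }
    return signature;
}

// Fraction of positions where both signatures agree; this estimates the
// Jaccard similarity of the two shingle sets.
function estimateSimilarity(sigA, sigB) {
    let equal = 0;
    for (let i = 0; i < sigA.length; i++) {
        if (sigA[i] === sigB[i]) equal++;
    }
    return equal / sigA.length;
}

const sig1 = minhashSignature("The quick brown fox jumps over the lazy dog");
const sig2 = minhashSignature("The quick brown fox jumps over the lazy");
console.log(estimateSimilarity(sig1, sig2));
```

The important property is that the signature has a fixed length (one number per seed) regardless of how long the text is, and that signatures are compared position by position.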

Now my question is: how can I create a single characteristic hash string for a text? All the modules I have found only compare texts directly with each other and output a similarity coefficient. For similar texts/documents, the hash strings should also be similar. Example (something like this):

The quick brown fox jumps over the lazy dog

Hash:KV5rsUfZpcZdVojpG8mHLA==

The quick brown fox jumps over the lazy

Hash:KV5rsUfZpcZdVojpG8hTPS==

Is it possible to create an identifying hash string out of the individual token hashes, for example by encoding them into a hex string or similar?

EDIT: I know there is something like the MongoDB ObjectId, which is a unique hex string that is constructed from 3 fields:

a 4-byte value representing the seconds since the Unix epoch,
a 5-byte random value, and
a 3-byte counter, starting with a random value.

https://docs.mongodb.com/manual/reference/method/ObjectId/

Doing something similar with the Token-Array would be nice... but I have no idea how :(
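By analogy with the fixed-width fields of an ObjectId, one option would be to give every hash value in the array the same width before joining them. The sketch below assumes the values are 32-bit hashes and runs on Node.js (it uses `Buffer` for the Base64 step); `signatureToHex` and `hexToBase64` are hypothetical helper names, not part of any module mentioned here.

```javascript
// Encode each 32-bit hash as exactly 8 hex characters and join them.
// `>>> 0` maps negative (signed) values to their unsigned equivalent,
// and padStart keeps every field the same width.
function signatureToHex(signature) {
    return signature
        .map((value) => (value >>> 0).toString(16).padStart(8, "0"))
        .join("");
}

// Optional: a shorter Base64 form of the same bytes (Node.js Buffer).
function hexToBase64(hex) {
    return Buffer.from(hex, "hex").toString("base64");
}

const hex = signatureToHex([0, 255, -1]);
console.log(hex);              // 24 hex characters, 8 per value
console.log(hexToBase64(hex));
```

Because every value occupies a fixed slot, two strings built with the same number of hash functions can be compared slot by slot, whereas a cryptographic-looking string that changes in only a few characters (as in the example above) is only possible if each slot is encoded independently like this.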

EDIT: I converted each numeric token hash into a hex string and concatenated the results:

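function convertToHex(numberArray) {
    // Only arrays are supported; anything else yields null.
    if (!Array.isArray(numberArray)) {
        return null;
    }

    return numberArray.map((number) => {
        // Map a negative (signed 32-bit) value to its unsigned equivalent.
        if (number < 0) {
            number = 0xFFFFFFFF + number + 1;
        }

        // Note: the result is not zero-padded, so tokens can differ in length.
        return number.toString(16).toUpperCase();
        // number = number >>> 0;
        // return pa
    });
}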

This way I get similar hex strings for similar documents, but as the number of permutations grows, the hex string gets longer as well, and comparing the hex strings is no longer practical.

So I tried to break the long hex string down into smaller tokens, but I run into the same problem: the number of matching tokens decreases, because larger tokens with small differences are created, which lowers the overall similarity in the end.
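One possible way around the chunking problem is to cut the string at the boundaries of the original hash values instead of at arbitrary positions, and to compare chunks position by position. The following sketch assumes both strings were built from the same number of hashes and that each hash was zero-padded to 8 hex characters; `splitFixed` and `hexSimilarity` are hypothetical names used only for illustration.

```javascript
// Cut a hex string into fixed-width chunks (8 characters = one 32-bit hash).
function splitFixed(hex, width = 8) {
    const parts = [];
    for (let i = 0; i < hex.length; i += width) {
        parts.push(hex.slice(i, i + width));
    }
    return parts;
}

// Fraction of chunk positions where both strings carry the same value.
function hexSimilarity(hexA, hexB, width = 8) {
    const a = splitFixed(hexA, width);
    const b = splitFixed(hexB, width);
    if (a.length === 0 || a.length !== b.length) {
        throw new Error("hex strings must be non-empty and of equal length");
    }
    let equal = 0;
    for (let i = 0; i < a.length; i++) {
        if (a[i] === b[i]) equal++;
    }
    return equal / a.length;
}

console.log(hexSimilarity("aaaaaaaabbbbbbbb", "aaaaaaaacccccccc")); // 0.5
```

With this layout a small difference in one hash value affects only its own chunk, so the similarity does not drop just because the string was split differently.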

javascript
node.js
hash
sentence-similarity
minhash
asked on Stack Overflow Mar 20, 2019 by user2774480 • edited Mar 20, 2019 by user2774480

0 Answers

Nobody has answered this question yet.


User contributions licensed under CC BY-SA 3.0