Not really my area but…
A binary representation of a sentence is created using a ‘bag of words’ (BOW): the location/index of each word within the BOW dictionary is used to set bits in a long binary feature vector. Depending on the size of your dictionary this can obviously lead to very long vectors, so even a small 10k-word BOW gives a 10k-long string of 1’s & 0’s per sentence… ouch lol.
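A minimal sketch of that BOW encoding, assuming a toy 10-word dictionary (the vocabulary and sentence here are made up for illustration):

```python
# Toy 10-word dictionary; real vocabularies run to tens of thousands of words.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast", "slow", "hat"]
index = {word: i for i, word in enumerate(vocab)}

def bow_vector(sentence):
    vec = [0] * len(vocab)           # one slot per dictionary word
    for word in sentence.lower().split():
        if word in index:
            vec[index[word]] = 1     # set the bit at the word's dictionary position
    return vec

print(bow_vector("the cat sat on the mat"))
# -> [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

Note the vector length is tied to the dictionary size, which is exactly the problem described below.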
A method of data compression is to first pre-process your sentences, extracting features, or regular combinations of words, to create a ‘bag of features’, and then encode the binary feature vector from this.
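One common choice of “regular combinations of words” is word pairs (bigrams); a quick sketch of extracting those as features:

```python
# Sketch: extract word bigrams as the "features" to build a bag of features from.
def bigrams(sentence):
    words = sentence.lower().split()
    # Pair each word with its successor.
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(bigrams("the cat sat on the mat"))
# -> ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
```

These feature strings would then be given dictionary indices and encoded exactly like single words above.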
Both these methods have the disadvantage that as new words are encountered the dictionary length changes, and the whole NN needs re-training on the new feature set.
To avoid this problem…another method that does not require a dictionary is feature hashing.
A hashing algorithm converts data of arbitrary length/type into a unique(-ish) numerical representation within a known range (e.g. MurmurHash3).
So in chatbots for example… long complex input sentences don’t have to be stored in their entirety… they can be hashed and just the unique result/ number stored. The hash algorithm will always convert the same sentence to the same hash output for searching/ matching etc.
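A sketch of that idea in Python, using the standard library's CRC32 as a stand-in for MurmurHash3/shlwapi (any deterministic hash with a known output range works the same way):

```python
import zlib

def sentence_hash(sentence):
    # crc32 stands in for MurmurHash3: deterministic, fixed 32-bit output range.
    return zlib.crc32(sentence.encode("utf-8"))

h1 = sentence_hash("this whole sentence will be converted into a single unique number")
h2 = sentence_hash("this whole sentence will be converted into a single unique number")
assert h1 == h2            # same sentence always hashes to the same number
assert 0 <= h1 < 2**32     # output always falls within the known 32-bit range
```

So instead of storing the full sentence for matching, you store (and compare) just the number.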
Using the inbuilt Windows shlwapi.dll hashing routines gives…
Hash(“this whole sentence will be converted into a single unique number”) = 166579926
So the idea is to transform/compress any given sentence into a unique, repeatable binary sequence/vector of a fixed, known length. Because the hashing algorithm always returns a number within a set range, the length of the resulting binary vector is always fixed.
To do this, each word is passed through a hash algorithm, and a 1 is placed at the binary array index derived from the hash output, NOT the word’s location within the BOW. This gives a unique sparse binary vector ready to send to the NN, but, depending on the output range of the hashing algorithm used, it can still result in some long vectors.
A variation on this is the hashing trick: apply the MOD function to the hash output and place a 1 at the array index given by the MOD’s remainder, so the vector length shrinks to whatever modulus you choose.
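Putting the two steps together, a sketch of the hashing trick with an arbitrary vector length of 32 (again using CRC32 as a stand-in hash):

```python
import zlib

N = 32  # fixed vector length, chosen up front, independent of vocabulary size

def hashed_vector(sentence, n=N):
    vec = [0] * n
    for word in sentence.lower().split():
        h = zlib.crc32(word.encode("utf-8"))  # crc32 standing in for MurmurHash3
        vec[h % n] = 1                        # MOD folds the hash into the array range
    return vec

v = hashed_vector("the cat sat on the mat")
assert len(v) == N                            # length never changes...
assert hashed_vector("brand new unseen words") is not None  # ...even for new words
```

New words just hash to some index within the same fixed-length vector, so the NN’s input size never changes and no dictionary is needed; the trade-off is that different words can collide on the same index.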
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f
https://en.wikipedia.org/wiki/Vowpal_Wabbit