Ai Dreams Forum

Member's Experiments & Projects => AI Programming => Topic started by: JohnnyWaffles on January 09, 2019, 06:48:23 pm

Title: How does Feature Hashing work? (For ANN Input)
Post by: JohnnyWaffles on January 09, 2019, 06:48:23 pm
I've built a feed-forward neural network in Excel using VBA, specifically for classification. I'm using this neural network at my place of work; it's the only programming language I have access to there. I'm trying to prep my data for NN input. I've tried using One Hot Encoding and Binary Encoding to encode my data, but I have over 100 categories and I usually end up with a memory error. Does anybody know how to do Feature Hashing? Apparently it's the next best way to prep categorical data for NN input.

Online examples that I have seen don't explain it well. Can anyone here explain how it works?
Title: Re: How does Feature Hashing work? (For ANN Input)
Post by: Korrelan on January 10, 2019, 11:17:36 am
Not really my area but…

A binary representation of a sentence is created using a ‘bag of words’ (BOW): the location/ index of each word within the BOW dictionary is used to create a long binary feature matrix.  Depending on the size of your dictionary this can obviously lead to very long matrices, so even a small 10k-word BOW gives a 10k-long string of 1’s & 0’s… ouch lol.
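To make the BOW idea concrete, here's a minimal Python sketch (the OP works in VBA, but the mechanics are the same; the five-word dictionary is just a tiny stand-in for a real 10k-word one):

```python
def bow_vector(sentence, dictionary):
    """Return a binary vector with a 1 at the index of each dictionary word present."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in dictionary]

# Tiny illustrative dictionary; a realistic one would have thousands of entries,
# and the output vector would be just as long.
dictionary = ["the", "cat", "dog", "sat", "mat"]
vec = bow_vector("The cat sat", dictionary)  # [1, 1, 0, 1, 0]
```

Note the vector is always as long as the dictionary, no matter how short the sentence, which is exactly the memory problem described above.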

A method of data compression is to first pre-process your sentences, extracting features (regular combinations of words) to create a ‘bag of features’, and then encode the binary feature matrix from this.

Both these methods have the disadvantage that as new words are encountered the dictionary length changes, and the whole NN needs re-training on the new feature set.

To avoid this problem…another method that does not require a dictionary is feature hashing.

A hashing algorithm converts data of arbitrary length/ type into a unique (ish) numerical representation within a known range. (MURMURHASH3)

So in chatbots for example… long complex input sentences don’t have to be stored in their entirety… they can be hashed and just the unique result/ number stored.  The hash algorithm will always convert the same sentence to the same hash output for searching/ matching etc.
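The shlwapi.dll example below shows this on Windows; as a rough illustration of the same deterministic property, here's a Python sketch using CRC32 as a stand-in hash (not the same algorithm, so the numbers will differ):

```python
import zlib

def sentence_hash(sentence):
    """Deterministically map a sentence to a 32-bit integer (CRC32 as a stand-in hash)."""
    return zlib.crc32(sentence.encode("utf-8"))

h1 = sentence_hash("this whole sentence will be converted into a single unique number")
h2 = sentence_hash("this whole sentence will be converted into a single unique number")
# h1 == h2 every time: the same input always yields the same number in [0, 2**32)
```

The key point is repeatability: you can store/compare the number instead of the whole sentence.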

Using the inbuilt windows shlwapi.dll hashing routines gives…

Hash(“this whole sentence will be converted into a single unique number”) = 166579926

So the idea is to transform/ compress any given sentence/ data into a unique/ repeatable binary sequence/ vector of a fixed known length: because the hashing algorithm always returns a number within a set range, the length of the resulting binary vector is always fixed.

To do this each word is passed through a hash algorithm, and a 1 is placed at the binary array element/ index derived from the hash output, NOT the word’s location within the BOW.  This gives a unique sparse binary vector ready to send to the NN, but again, depending on the hashing algorithm used, this can result in some long vectors/ binaries.

A variation on this is the hashing trick: use the MOD function on the hash output and place a 1 at the array index given by the MOD’s remainder.  That way the vector length is whatever modulus you choose, regardless of how big the hash range or vocabulary is.
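A short Python sketch of the hashing trick described above (CRC32 standing in for MurmurHash3, and 16 buckets as an arbitrary illustrative choice; new, unseen words still land in the same fixed-length vector, so no re-training for dictionary growth):

```python
import zlib

def hashing_trick(sentence, n_features=16):
    """Hash each word and set a 1 at (hash MOD n_features), giving a fixed-length vector."""
    vec = [0] * n_features
    for word in sentence.lower().split():
        index = zlib.crc32(word.encode("utf-8")) % n_features
        vec[index] = 1  # binary here; incrementing instead would count collisions
    return vec

v = hashing_trick("the cat sat on the mat")
# len(v) == 16 regardless of vocabulary size or sentence length
```

The trade-off is collisions: two different words can land in the same bucket, so n_features is chosen to balance vector size against collision rate.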