Not really my area but…
A binary representation of a sentence is created using a ‘bag of words’ (BOW): the location/index of each word within the BOW dictionary is used to set bits in a long binary feature vector. Depending on the size of your dictionary this can obviously lead to very long vectors, so even a small 10k-word BOW gives a 10k-long string of 1’s & 0’s per sentence… ouch lol.
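A minimal sketch of that BOW encoding, assuming a toy 10-word dictionary (the vocabulary and sentence here are made up for illustration):

```python
# Toy 10-word dictionary; real vocabularies run to tens of thousands of words.
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran", "fast", "slow", "hat"]
index = {word: i for i, word in enumerate(vocab)}

def bow_vector(sentence):
    vec = [0] * len(vocab)           # one slot per dictionary word
    for word in sentence.lower().split():
        if word in index:
            vec[index[word]] = 1     # set the bit at the word's dictionary position
    return vec

print(bow_vector("the cat sat on the mat"))
# -> [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

Note the vector length is tied to the dictionary size, which is exactly the problem described below.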
A method of data compression is to first pre-process your sentences, extracting features, or regular combinations of words, to create a ‘bag of features’, and then encode the binary feature vector from this.
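One common choice of “regular combinations of words” is word pairs (bigrams); a quick sketch of extracting those as features:

```python
# Sketch: extract word bigrams as the "features" to build a bag of features from.
def bigrams(sentence):
    words = sentence.lower().split()
    # Pair each word with its successor.
    return [" ".join(pair) for pair in zip(words, words[1:])]

print(bigrams("the cat sat on the mat"))
# -> ['the cat', 'cat sat', 'sat on', 'on the', 'the mat']
```

These feature strings would then be given dictionary indices and encoded exactly like single words above.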
Both these methods have the disadvantage that as new words are encountered the dictionary length changes, and the whole NN needs re-training on the new feature set.
To avoid this problem…another method that does not require a dictionary is feature hashing.
A hashing algorithm converts data of arbitrary length/type into a unique(-ish) numerical representation within a known range (e.g. MurmurHash3).
So in chatbots for example… long complex input sentences don’t have to be stored in their entirety… they can be hashed and just the unique result/ number stored. The hash algorithm will always convert the same sentence to the same hash output for searching/ matching etc.
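A sketch of that idea in Python, using the standard library's CRC32 as a stand-in for MurmurHash3/shlwapi (any deterministic hash with a known output range works the same way):

```python
import zlib

def sentence_hash(sentence):
    # crc32 stands in for MurmurHash3: deterministic, fixed 32-bit output range.
    return zlib.crc32(sentence.encode("utf-8"))

h1 = sentence_hash("this whole sentence will be converted into a single unique number")
h2 = sentence_hash("this whole sentence will be converted into a single unique number")
assert h1 == h2            # same sentence always hashes to the same number
assert 0 <= h1 < 2**32     # output always falls within the known 32-bit range
```

So instead of storing the full sentence for matching, you store (and compare) just the number.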
Using the inbuilt Windows shlwapi.dll hashing routines gives…
Hash(“this whole sentence will be converted into a single unique number”) = 166579926
So the idea is to transform/compress any given sentence into a unique, repeatable binary sequence/vector of a fixed, known length. Because the hashing algorithm always returns a number within a set range, the length of the resulting binary vector is always fixed.
To do this, each word is passed through a hash algorithm, and a 1 is placed at the binary array index derived from the hash output, NOT the word’s location within the BOW. This gives a unique sparse binary vector ready to send to the NN, but, depending on the output range of the hashing algorithm used, it can still result in some long vectors.
A variation on this is the hashing trick: apply the MOD function to the hash output and place a 1 at the array index given by the MOD’s remainder, so the vector length shrinks to whatever modulus you choose.
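Putting the two steps together, a sketch of the hashing trick with an arbitrary vector length of 32 (again using CRC32 as a stand-in hash):

```python
import zlib

N = 32  # fixed vector length, chosen up front, independent of vocabulary size

def hashed_vector(sentence, n=N):
    vec = [0] * n
    for word in sentence.lower().split():
        h = zlib.crc32(word.encode("utf-8"))  # crc32 standing in for MurmurHash3
        vec[h % n] = 1                        # MOD folds the hash into the array range
    return vec

v = hashed_vector("the cat sat on the mat")
assert len(v) == N                            # length never changes...
assert hashed_vector("brand new unseen words") is not None  # ...even for new words
```

New words just hash to some index within the same fixed-length vector, so the NN’s input size never changes and no dictionary is needed; the trade-off is that different words can collide on the same index.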
https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing
https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f
https://en.wikipedia.org/wiki/Vowpal_Wabbit