How does Feature Hashing work? (For ANN Input)


JohnnyWaffles

« on: January 09, 2019, 06:48:23 pm »
I've built a feed-forward neural network for classification in Excel using VBA, and I'm using it at my place of work. VBA is the only programming language I have access to there. I'm trying to prep my data for NN input. I've tried One Hot Encoding and Binary Encoding, but I have over 100 categories and I usually end up with a memory error. Does anybody know how to do Feature Hashing? Apparently it's the next best way to prep categorical data for NN input.

Online examples that I have seen don't explain it well. Can anyone here explain how it works?

*

Korrelan

Re: How does Feature Hashing work? (For ANN Input)
« Reply #1 on: January 10, 2019, 11:17:36 am »
Not really my area but…

A binary representation of a sentence is created using a ‘bag of words’ (BOW): the location/index of each word within the BOW dictionary is used to create a long binary feature vector.  Depending on the size of your dictionary this can obviously lead to very long vectors, so a small 10k-word BOW gives a 10k string of 1’s & 0’s… ouch lol.
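
As a rough illustration of the BOW encoding described above, here is a minimal Python sketch. The six-word vocabulary and the `bow_vector` helper are made up for the example and do not come from any library; a real dictionary would be far larger.

```python
def bow_vector(sentence, vocabulary):
    """Return a binary vector with a 1 at the index of each known word."""
    # Map each dictionary word to its position (hypothetical tiny dictionary).
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for word in sentence.lower().split():
        if word in index:          # words outside the dictionary are dropped
            vector[index[word]] = 1
    return vector


vocabulary = ["the", "cat", "sat", "on", "mat", "dog"]
print(bow_vector("The cat sat on the mat", vocabulary))  # [1, 1, 1, 1, 1, 0]
```

The vector is exactly as long as the dictionary, which is why adding a word to the dictionary changes the input size of the network.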

A method of data compression is to first pre-process your sentences, extracting features or regular combinations of words to create a ‘bag of features’, and then encode the binary feature vector from this.

Both these methods have the disadvantage that as new words are encountered the dictionary length changes, and the whole NN needs re-training on the new feature set.

To avoid this problem…another method that does not require a dictionary is feature hashing.

A hashing algorithm converts data of arbitrary length/type into a unique(ish) numerical representation within a known range (MurmurHash3 is one example).

So in chatbots for example… long complex input sentences don’t have to be stored in their entirety… they can be hashed and just the unique result/ number stored.  The hash algorithm will always convert the same sentence to the same hash output for searching/ matching etc.

Using the inbuilt Windows shlwapi.dll hashing routines gives…

Hash(“this whole sentence will be converted into a single unique number”) = 166579926
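
A small Python sketch of the same property, for anyone who wants to try it. Note the assumption: it uses CRC32 from Python's standard `zlib` module as a stand-in hash, not the shlwapi.dll routine, so the number it produces will not match the one above. The `stable_hash` name is just for this example.

```python
import zlib


def stable_hash(text):
    """Map a string to an integer in the range 0 .. 2**32 - 1 using CRC32.

    CRC32 is only a stand-in here; any deterministic hash would do.
    """
    return zlib.crc32(text.encode("utf-8"))


sentence = "this whole sentence will be converted into a single unique number"
first = stable_hash(sentence)
second = stable_hash(sentence)
print(first == second)      # True: same input always gives the same hash
print(0 <= first < 2**32)   # True: the output stays inside a known range
```

Python's built-in `hash()` is deliberately avoided: for strings it is salted per process by default, so it would not give repeatable values between runs.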

So the idea is to transform/compress any given sentence/data into a repeatable binary sequence/vector of a fixed, known length.  Because a hashing algorithm always returns a number within a set range, the length of the resulting binary vector is always fixed.

To do this, each word is passed through a hash algorithm, and a 1 is placed at the binary array element/index derived from the hash output, NOT the word’s location within the BOW.  This gives a sparse binary vector ready to send to the NN, but depending on the hashing algorithm used it can still result in some long vectors/binaries.

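A variation on this is the hashing trick: apply the MOD function to the hash output, using the vector length you want as the divisor, and place a 1 at the array index given by the remainder.  The vector is then as short as you choose, at the cost of more collisions (different words landing on the same index).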
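
The hashing trick described above could be sketched roughly like this in Python. The assumptions are mine: CRC32 from the standard `zlib` module stands in for whatever hash you actually use, and `hashed_vector` and the size of 16 are illustrative names and values only.

```python
import zlib


def hashed_vector(words, size):
    """Build a fixed-length binary vector with the hashing trick."""
    vector = [0] * size
    for word in words:
        # Hash the word to a large integer (CRC32 is just a stand-in hash).
        h = zlib.crc32(word.lower().encode("utf-8"))
        # MOD folds the hash into the range 0 .. size - 1.
        vector[h % size] = 1
    return vector


v = hashed_vector("the cat sat on the mat".split(), 16)
print(len(v))   # 16, regardless of which words were seen
print(v)
```

No dictionary is stored anywhere, so a word never seen before still maps to one of the 16 positions and the network's input size does not change.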

https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/feature-hashing

https://medium.com/value-stream-design/introducing-one-of-the-best-hacks-in-machine-learning-the-hashing-trick-bf6a9c8af18f

https://en.wikipedia.org/wiki/Vowpal_Wabbit

 :)
« Last Edit: January 10, 2019, 01:42:19 pm by Korrelan »

*

JohnnyWaffles

Re: How does Feature Hashing work? (For ANN Input)
« Reply #2 on: March 28, 2019, 03:27:41 pm »
Hey, sorry it’s been a while since I have had a chance to come back to this. Thank you, however, for the informative post.

I do not understand how to do the hashing trick though. If I can provide some examples of the data I’m working with, do you think you could help me understand how to apply the hashing trick?

I’m using data from an engineering system. In this case, I have attributes of piping components.

For example, the material of a component is one attribute I’d consider using. I’m using the following VBA code to produce a hash sequence:
https://stackoverflow.com/questions/7358955/generate-short-hash-string-based-using-vba/7359753#7359753

CATEGORY:    HASH SEQUENCE:
Steel        37152
Plastic      31081
Aluminum     2310
Bronze       9364

So with this small dataset, how would I proceed to use the hashing trick?
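
One possible reading of the hashing trick for the four values above, sketched in Python (the same remainder arithmetic is the `Mod` operator in VBA). The hash numbers are taken from the table as given; the vector length of 8 and the `encode` helper are assumptions chosen for the example.

```python
# Hash values copied from the table in this post.
hashes = {"Steel": 37152, "Plastic": 31081, "Aluminum": 2310, "Bronze": 9364}


def encode(hash_value, size):
    """Return a binary vector of length `size` with a single 1 at hash MOD size."""
    vector = [0] * size
    vector[hash_value % size] = 1
    return vector


SIZE = 8  # assumed vector length; this becomes the number of NN inputs
for name, h in hashes.items():
    print(name, h % SIZE, encode(h, SIZE))
# Steel    0 [1, 0, 0, 0, 0, 0, 0, 0]
# Plastic  1 [0, 1, 0, 0, 0, 0, 0, 0]
# Aluminum 6 [0, 0, 0, 0, 0, 0, 1, 0]
# Bronze   4 [0, 0, 0, 0, 1, 0, 0, 0]
```

With a length of 8 the four materials happen to land on different indexes. With a length of 4, Steel and Bronze both give a remainder of 0 and become indistinguishable, so the length is a trade-off between memory and collisions.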

*

goaty

Re: How does Feature Hashing work? (For ANN Input)
« Reply #3 on: March 29, 2019, 07:06:49 am »
Not really answering your question but...
hashing isn't that good, because hashes can't be near matched. Descriptive factors that you can near match between are better.
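
A quick way to see the near-match point for yourself, as a hedged Python sketch. CRC32 from the standard `zlib` module is an assumed stand-in hash, and the `bucket` helper, the size of 64 and the example strings are all made up for illustration.

```python
import zlib


def bucket(text, size):
    """Fold a CRC32 hash of `text` into the range 0 .. size - 1."""
    return zlib.crc32(text.lower().encode("utf-8")) % size


# Similar-looking inputs: the bucket numbers say nothing about how alike they are.
for a, b in [("Steel", "Steal"), ("Steel", "Stainless Steel")]:
    print(a, bucket(a, 64), "|", b, bucket(b, 64))
```

Two inputs either land on the same index or they don't; there is no notion of "close" indexes for "close" inputs.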
« Last Edit: March 29, 2019, 07:40:53 am by goaty »

 

