General Project Discussion / Re: Pattern based NLP
« Last post by infurl on March 06, 2021, 02:17:16 am »

The original lists are unsorted, so they are hashed and sorted in the program (hashed by summing ASCII codes). There are typically 0-5 duplicate hash ids/collisions, so the correct matches are also checked letter by letter.
...
After: 76ms of preparing. Hashing & sorting spelling and word list.
Pro-tip #2. There is no reason that you have to do preparation such as hashing and sorting at run-time. You could break out the portion of the code that does that preparation into a separate program which you run at build time. That program does all the necessary preparation and then prints out all the data structures in a format that can be included in your final program and compiled in place. That will save you a chunk of time every time you run the actual program.
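As a minimal sketch of that tip (the function and file names here are hypothetical, and the hash is the simple ASCII-sum scheme from the quoted post): a build-time script hashes and sorts the word list once, then emits a source module that the main program can import directly, with no preparation at run-time.

```python
# Hypothetical build-time preparation script.
# Run once at build time; the main program then imports the
# generated module instead of hashing and sorting on every run.

def ascii_hash(word):
    # Same scheme as the quoted post: sum of the ASCII codes.
    return sum(ord(c) for c in word)

def emit_prepared_module(words, out_path="prepared_words.py"):
    # Sort by (hash, word) so hash collisions sit next to each other
    # and can be resolved with a letter-by-letter comparison.
    table = sorted((ascii_hash(w), w) for w in set(words))
    with open(out_path, "w") as f:
        f.write("# Generated at build time -- do not edit by hand.\n")
        f.write("TABLE = %r\n" % (table,))
    return table
```

Anagrams such as "cat" and "act" collide under this hash, which is why the letter-by-letter check after the hash lookup is still needed.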
In my case I am parsing and processing millions of grammar rules, which can take a considerable amount of time just to prepare. Small grammars can be processed from start to finish at run-time, but I have found it much faster to compile the different files that make up the grammar into intermediate, partially processed files. These are then loaded and merged into a final grammar definition, which is saved both as source files that can be compiled and linked directly into my parser software, and as a database format that can be loaded as a binary file at run-time.
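A minimal sketch of that second output path, assuming nothing about infurl's actual file format: the prepared data is serialized to a binary file once, and the run-time program loads it directly instead of re-parsing and merging the grammar sources.

```python
# Hypothetical binary save/load step for prepared data.
# The "prepare once, load many times" split is the point; the
# serialization format here (pickle) is just a stand-in.
import pickle

def save_prepared(data, path):
    # Run at build time, after the expensive preparation.
    with open(path, "wb") as f:
        pickle.dump(data, f)

def load_prepared(path):
    # Run at start-up; a single binary read replaces the preparation.
    with open(path, "rb") as f:
        return pickle.load(f)
```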
That last feature has lots of advantages. The preprocessed source files were so large that it was taking a long time just to compile them, but best of all, by separating the data files from the software I can choose completely different processing options on the command line.