Made the code 124 lines, nearly as fast as before, re-coded one area more neatly, slightly improved the score, and added a generation ability. Had it ready about 2 days ago, but got delayed by life....
--------------------------------------------------------
https://encode.su/threads/3595-Star-Engine-AI-data-compressor
--------------------------------------------------------
Star Engine - AI data compressor
I named my AI after unstable stars and atoms, which pull matter in to "compress it" and then, once too large, throw it back out as radiation to "generate new insights". It's currently in Python (~10x slower than Green, hence ~12 hours for 100MB training), uses lots of RAM, and only outputs a binary string like '01010101' instead of the final packed characters like 'Y', but I just started the implementation and know how to fix all of that.
EVALUATION RESULTS (compare to Hutter Prize and Large Text Compression Benchmark champions):
10,000 bytes in
3,328 bytes out
Shelwien's Green: 3,453
50,000 bytes in
15,174 bytes out
Shelwien's Green: ?
100,000 bytes in
28,028 bytes out
Shelwien's Green: 29,390
1,000,000 bytes in
244,494 bytes out
Shelwien's Green: 256,602
10,000,000 bytes in
[old] 2,288,646 bytes out
Shelwien's Green: 2,349,214
100,000,000 bytes in
I estimate I "can" get ~20,400,000 bytes out
Shelwien's Green: 21,819,822
NEXT LETTER PREDICTION RESULTS (compare to the size of data that would be needed to cheatingly reproduce the subjectively correct following 500 letters for a given prompt):
FOR 10,000 BYTES TRAINED ON:
The girl was sitting on Wikiquot;[http://www.lewrockwell|Ramp>
<contributor>
<text xml:space="preserve">#REDIRECT [[AlECT [[AREDIRECT [[Acce key="8">MediaWiki talk</namespace>
<namespace>
<namespace key="-1"-1">Template talk</namespace>
<namespace key="15">C51ist society might wom and prediawiki.org/xml/export-0.3/" '' moChmlers<potkin|Kropotkin]], PeternSChmler, cht w[s0��xpace>
<namespace key="12"1:/timestamp>2002-02002-02-25T15T15">Wikts. [[Bertrand chietikte Wtrand conal[http://uk.end
</page>
<page>
</revision>
</page>
<namespace key="geri.3<c<page>
FOR 100,000 BYTES TRAINED ON:
The girl was sitting on they can confunce (non-->, with this surelCatd, mak.]
The characteristics set maki.org/ Poccurs in the [[M.
It act Lam, ''unism==
{{main|150px|[[hu:Anarchism]]
[[sl:space="preserve">#REDIRECT [[Fory/fEDIRECT [[Afrom the [[Max Stirner]], but be givities}}
==The [[scienti. The authoritarian ar impain when he overl legration that if regoing (189898952</id>
</contributor>
</contributor>
<username>Ams</username>
<id>15898948</username>Ams</username>Josed. of nexchange example, the first manifests t893>A�xinitially preferentify the many ecles|[[Chich ce 19999|Wizely understand me>
<id>7543</id>
</contributor>
<minor />
<contributor>
<ip>Conversion script</ip>
<namespace key="1">Talk</namespace>
FOR 1,000,000 BYTES TRAINED ON:
The girl was sitting on [[copper]] or [[Zeno "repudiated the omnipotence of 0
| align="right" assumedia.org: The [[bar (lawk=��q.f|Melillage of 14, Andre plays. Par-TV Jaskirport<Plts for its variants from by Shrugged imperiod of Atlas Shrugged|section]] 152.
==San Sebastian Minese: 陳��M.�ju.jpg|thumb|left|Statue of Ayn Rand]]
[[gl:Astrongly replicated by one.
E5*t)#REdoct, rather pervasive death in tre">{|20010
|90 MHz took him for deity asks for in the South Pacific]]'' (glor accumulated "The Book)]], [[Alfreducation system is afa)
* [[PurgBifferency_code=197,�на]]
[[an:Austria]]
[[als:Archeologie]]
[[ru:Арія (крия]]
[[zh-min-nan:Oscar Chióng]]
[[da:Austria (geography of reconstruction:Oscar Christians appeared somethings said to have travel taken from 1
|Colorado]]. The lowere click, On said to have been effective values." | 60 Metallurgy]]) [[twe_oxaxU.S. state]] Science]]s while that Redge talleged|sections]] 121 and 161.
==BC]]]]
{{main|Anarchow has energy university of Povertyle [[Tih
[[Hollywood]] was interesting
Code Use: Place "code".py, input1.txt, input2.txt, and compressed.txt in one folder. Run the code at the desired length of data to eat; at the bottom it tells you how many bytes it compressed to. Then switch the input to input2.txt, lower the run length, ex. from 10000 to 9980, run again, and check decompressed.txt. For generation mode, toggle the word generate in the decode section: with it on, you simply run the code for ex. 100,000 steps, and if the file is 10,000 letters long with your prompt at the bottom, it will see the end of the file and start extending decompressed.txt by 90,000 letters, as sketched below.
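To make generation mode concrete, here is a minimal self-contained sketch of that "run past the end of the file" loop. It is not the author's code.py: the file names and lengths come from the description above, and the predictor is stubbed with a simple order-3 letter-frequency model purely so the sketch runs on its own; the real program drives this loop with the full mixed model described in the next section.

import random
from collections import Counter, defaultdict

def train_stub(text, order=3):
    # stand-in predictor: count which letter follows each 3-letter context
    table = defaultdict(Counter)
    for i in range(order, len(text)):
        table[text[i - order:i]][text[i]] += 1
    return table

def next_letter(table, history, order=3):
    # sample the next letter in proportion to how often it followed this context
    counts = table.get(history[-order:])
    if not counts:
        return " "
    letters, weights = zip(*counts.items())
    return random.choices(letters, weights)[0]

def generate(prompt_path="input1.txt", out_path="decompressed.txt", total_len=100_000):
    # read the prompt file, then keep appending predicted letters until total_len
    text = open(prompt_path, encoding="utf-8", errors="replace").read()
    table = train_stub(text)
    while len(text) < total_len:          # past the end of the file -> keep extending
        text += next_letter(table, text)
    open(out_path, "w", encoding="utf-8").write(text)

if __name__ == "__main__":
    generate()   # a 10,000-letter prompt file gets extended by 90,000 letters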
How it works: A tree stores all 16-, 15-, etc. letter-long strings as the program runs over the dataset; exact matches are stored as counts. Before coding a letter, I search the tree for the last 16, 15, etc. letters seen, take the longest match found, and look at which child letters were seen to follow it and their counts. Each letter's count is divided by the total counts for that layer to get normalized ("softmax") percentages, ex. 0.4 a, 0.6 b, so they add up to 1.0.

Long matches are more accurate but have fewer counts, so a layer only gets part of the weight. If I have only 30 counts in total but only 3 possible next letters were seen to follow, then I am more sure I know the distribution, so that layer gets more weight: I compute lengthOfPredictionSet * 7 as the roof of counts needed to be confident (ex. 3 possible next letters means a roof of 21 counts), then divide the counts seen by that roof to get the percentage of weight this layer gets. If it gets 30%, there is 70% left to hand to the shorter-context matches. I also give some hardcoded static weight, since I must not have cracked the formula yet. The lowest layer is the no-context set of predictions, simply how common each letter is overall.

I apply an exponential function to the layer predictions and to the blended layers, so it pools its thinking: if a prediction is 0.6% it is probably really 7.2%, and if it is 9.9% it is probably 9.4%, and the same toward the other end around 0.1. Energy is used for recency: if I am mixing layer 8 at the moment, I check the last 300 letters for the latest 8 letters and build a temporary set of the predictions that follow them, giving more recent occurrences more count; and if I just saw 'p' 1 or 2 letters ago I predict 'p' less, but a lot more after ~3 letters.

For compression evaluation, I take my final set of predictions and subtract them from a high of 1.0 until I reach the prediction I would need to remake the file; the better my AI predicts the next letter, the less it costs to steer it to the correct one. Once I have subtracted, I also have a low of 0.0, and the space covered by the last subtraction, ex. 0.7 to 0.65, becomes my new high and low. Repeating this gives a very long number, ex. 0.8456346856.... As I build the number I carry away and store the locked digits, ex. high 0.[763]73 and low 0.[763]112, and at the end I store just one number that lies in between. This long number is converted to binary, and is then supposed to be packed into letters, ex. 5Ge8$9&(gf@3Nfy. An extra entry is kept in every prediction set in case an unseen letter needs to be steered to. Decompression uses nearly the same code as compression.
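To make the mixing and the coding cost concrete, here is a minimal self-contained sketch in Python. It is not the author's 124-line program: the depth-16 tree, the lengthOfPredictionSet * 7 confidence roof, the no-context fallback and the extra/escape entry follow the description above, while the static weights, the exponential pooling and the recency "energy" model are left out, and instead of writing an actual bitstream it just sums -log2(p) over the file, which is the size an exact coder would reach.

import math
from collections import Counter, defaultdict

MAX_ORDER = 16   # longest context kept in the tree (16, 15, ... down to no context)
CONF_ROOF = 7    # "lengthOfPredictionSet * 7" counts needed for full layer confidence
ALPHABET  = 256  # leftover weight is spread over all byte values as the extra/escape entry

class StarSketch:
    def __init__(self, max_order=MAX_ORDER):
        self.max_order = max_order
        # one table per context length: context string -> Counter of next letters
        self.tables = [defaultdict(Counter) for _ in range(max_order + 1)]

    def update(self, history, ch):
        # store ch as a seen continuation of every context length 0..max_order
        for order in range(min(self.max_order, len(history)) + 1):
            self.tables[order][history[len(history) - order:]][ch] += 1

    def prob(self, history, ch):
        # blend the per-layer predictions, longest matching context first
        mixed, weight_left = 0.0, 1.0
        for order in range(min(self.max_order, len(history)), -1, -1):
            counts = self.tables[order].get(history[len(history) - order:])
            if not counts:
                continue
            total = sum(counts.values())
            # layer confidence: counts actually seen vs. the "roof" of counts wanted
            confidence = min(1.0, total / (len(counts) * CONF_ROOF))
            layer_weight = weight_left * confidence
            mixed += layer_weight * counts.get(ch, 0) / total
            weight_left -= layer_weight
        # whatever weight is left over becomes the escape probability for unseen letters
        return mixed + max(weight_left, 1e-6) / ALPHABET

def estimated_bytes(text, max_order=MAX_ORDER):
    # code the file online: predict each letter from what came before, then learn it
    model, bits = StarSketch(max_order), 0.0
    for i, ch in enumerate(text):
        bits += -math.log2(model.prob(text[:i], ch))   # cost of steering to the right letter
        model.update(text[:i], ch)
    return int(bits / 8) + 1

if __name__ == "__main__":
    data = open("input1.txt", encoding="utf-8", errors="replace").read()[:10000]
    print("estimated compressed size:", estimated_bytes(data), "bytes")

The high/low narrowing itself looks roughly like this in toy form (plain floats only survive a handful of symbols, which is exactly why the real coder locks in and carries out the leading digits as described above):

def interval_narrow(steps):
    # steps: for each position, (ordered (letter, probability) pairs, actual letter)
    low, high = 0.0, 1.0
    for predictions, actual in steps:
        span, cum = high - low, 0.0
        for letter, p in predictions:
            if letter == actual:                       # shrink to this letter's slice
                low, high = low + span * cum, low + span * (cum + p)
                break
            cum += p
    return (low + high) / 2                            # any number inside the final interval

dist = [("a", 0.6), ("b", 0.4)]
print(interval_narrow([(dist, "a"), (dist, "b")]))     # 0.48, somewhere inside [0.36, 0.6)

The decoder reruns the identical model and at each step picks whichever letter's slice contains the stored number, which is why compression and decompression can share nearly all of their code.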