Releasing full AGI/evolution research

MikeB · « **Reply #240 on:** April 13, 2021, 08:48:22 am »

Quote from: LOCKSUIT on April 12, 2021, 03:23:41 pm

That'd be awesome if you can fix his code to run faster

I couldn't get his file open command to run because it's been deprecated in newer versions... Line: "freopen("enwik8.txt", "r", stdin);" .. So I replaced it with a standard Windows file read into a Fixed Array (a large char array containing the full 100 MB enwik8.txt) and it ran enwik8.txt in 567ms with the text output, and only 21ms without the text output... So I'm not sure the code was right... 'Count2' was also only set to 50 instead of 10,000/100,000.

So I'm just converting the full thing to C++ line by line... it's not too much code in total.

LOCKSUIT · « **Reply #241 on:** April 13, 2021, 04:49:18 pm »

   for (int count2 = 0; count2 < 50; ++count2) {
      int node = 0;
      for (int i = count2; i < count2 + 15; ++i) {

The 50 is how many letters of the file to step through.
The 15 is the branch/window size that does the stepping.
15 is essentially what I use in my AI, but yes, the 50 is not eating much; try 1,000,000 for 1MB.

The output for 50 is (either is fine): see attachment.
Though it can be different if needed, of course.

MikeB · « **Reply #242 on:** April 16, 2021, 12:18:11 pm »

I tested just the small loop (as above), in a few different ways and found some things.....

The baseline times for the other C++ conversion were:
(without data display)
10: 5ms
100: 36ms
1000: 189ms
10,000: 2.2s

In my version the Tree storage is 'fixed size character array/s' instead of String based. Tree storage is split between header & data, not all in one. Custom character search. "Stop" max set to 5.
(without data display.)
10 - 1,000: 1 - 2 ms. (+/- 1 ms)

I don't have perfect output yet, and I'm mostly running into problems with exponential numbers, and stack overflows when going higher than 1000.

Also, all C++ projects only run on 1 thread (1 core) of the CPU (on a 6 core CPU that's 16% CPU), unless you explicitly say to use more threads/cores. So to get max performance you would need to look up the number of CPU cores and divide the work into that many groups; then the kernel/operating system should hopefully divide it per CPU core. Some CPUs may be advanced enough to split it evenly though and reach 100% CPU... that's if the time taken goes into several seconds/minutes...
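A minimal sketch of the "look up the number of cores and divide the work into that many groups" idea, written in Python rather than C++ to match the tree code later in the thread. The names `split_ranges`, `count_chunk` and `count_bytes_parallel`, and the byte-counting workload, are illustrative assumptions and not code from anyone's project. The tree building itself is harder to split this way, because every window updates the same tree.

```python
import os
from collections import Counter
from concurrent.futures import ProcessPoolExecutor


def split_ranges(total, groups):
    """Split range(total) into contiguous (start, end) chunks, one per group."""
    groups = max(1, min(groups, total)) if total else 1
    base, extra = divmod(total, groups)
    ranges, start = [], 0
    for g in range(groups):
        # The first `extra` groups take one additional item each.
        end = start + base + (1 if g < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def count_chunk(chunk):
    """Stand-in workload (hypothetical): count byte values in one chunk."""
    return Counter(chunk)


def count_bytes_parallel(data, workers=None):
    """Count byte values using one worker process per CPU core by default."""
    workers = workers or os.cpu_count() or 1
    chunks = [data[s:e] for s, e in split_ranges(len(data), workers)]
    total = Counter()
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Each chunk is processed in its own process; partial counts are merged here.
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total
```

Processes are used instead of threads because CPU-bound Python threads do not run in parallel under CPython's global interpreter lock. On platforms that start workers by spawning (Windows, for example), call `count_bytes_parallel` from under an `if __name__ == "__main__":` guard.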

There is definitely a lot of extra work for Strings...

I'll play with it over the next few days and try to get perfect output @ 10,000 bytes and see what the time for it is...

Is the header required? This section?

Code

'∩╗┐<mediaw', [1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [3, 18, 33, 48, 63, 78, 93, 108, 123, 138]

LOCKSUIT · « **Reply #243 on:** April 16, 2021, 04:10:36 pm »

The output doesn't have to have both numbers and letters in the same list / array, no.

Mine (my FULL algorithm, note!) does 100,000 bytes in about 14 seconds, and the small C++ code my freelancer made seems to be about 3 or 4 times faster.

OK, so I didn't actually share my tree in Python.... It is much faster on its own, of course. I'm on my slower computer and don't want to test it, but my tree below can do e.g. 1MB in about 10 seconds, though this is a rough guess, mostly from memory.

MikeB · « **Reply #244 on:** April 19, 2021, 01:10:11 pm »

I fixed the output and tested it at 1,000,000 bytes.

This is on one thread @ 2.6 GHz (16% CPU usage)... just testing the small C++ code of your freelancer versus my C++ code..

C++ freelancer:
10: 5ms
100: 36ms
1000: 189ms
10,000: 2.2s
100,000: 24.5s
1,000,000: out of RAM error @ 1.5 GB

C++ mine:
10: 1ms
100: 1ms
1,000: 3ms
10,000: 280ms
100,000: 4s
1,000,000: 92s (200 MB stack size, 265 MB RAM usage total)

The output is the same for both.

The difference is that the freelancer's code uses Managed Strings, which use the Heap data storage type, which is much slower and constantly resizes itself.

My code uses Fixed Arrays and the Stack data storage type, which is faster, but you need to declare how much you use at program start... But even for 1,000,000 bytes analysed it only uses 200-265 MB. It increases by ~4.4 times for each step between 10,000, 100,000, 1,000,000... so the next step is ~850 MB for 10,000,000 bytes.

If your PC does the freelancer's code for 100,000 bytes in 4 seconds.. then your PC is 6 times faster than mine... so it would do my C++ code (100,000 bytes) in about 650ms.
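The estimate above is just a ratio, worked through here as a small sketch. The function name `estimate_time_on_other_pc` is made up for illustration; the three timings are the ones quoted in this thread, and the 4 second figure is the "if" from the sentence above, not a measurement.

```python
def estimate_time_on_other_pc(ref_here_s, ref_there_s, target_here_s):
    """Scale a timing to another PC using a program that was timed on both."""
    speedup = ref_here_s / ref_there_s      # how many times faster the other PC is
    return target_here_s / speedup, speedup


# All figures are for 100,000 bytes:
#   freelancer's code: 24.5 s here, about 4 s on the other PC (assumed)
#   fixed-array code:  4 s here
estimate_s, speedup = estimate_time_on_other_pc(24.5, 4.0, 4.0)
print(f"speedup ~{speedup:.1f}x, estimated time ~{estimate_s * 1000:.0f} ms")
# prints: speedup ~6.1x, estimated time ~653 ms
```

This assumes both programs speed up by the same factor on the faster PC, which may not hold if one of them is limited by memory rather than CPU.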

If you want the output and/or code I can send it.. can't seem to post attachments.

LOCKSUIT · « **Reply #245 on:** April 19, 2021, 02:04:24 pm »

Thanks for testing that, that's cool it can be much faster, I wonder if in Python I can do optimizations like that too though.

Currently my compression scores are below, and I should be able to reach 20.3MB at least if I tweak my parameters, as there is a pattern among my scores and there is still a lot of room to tweak. But I can't on my slow code; it takes 5 hours for 10,000,000 bytes using Cython or pypy. The harder part that would increase my score is the line below. It should shift as the data gets larger, e.g. 9, 8, 7..... then on 10x larger data it should be e.g. 9, 8.6, 7.5........ The first item is the hardcoded weight for layer 1, then layer 2, layer 3..... Obviously layer 1 has up to 256 different letter features.... layer 2 has 256*256 possible.... so the weight goes down sort of like an exponential curve. This is what I really need to tweak, but it's hard to do on big data cuz I must test one item at a time: item 1 adjusted up/down, then item 2.... I could normalize it but that seems hard to understand.

_25ofRoof = (w * [0.9, 0.9, 0.87, 0.83, 0.79, 0.73, 0.57, 0.46, 0.36, 0.31, 0.3, 0.28, 0.33, 0.61, 0.59, 0.53][len(predictions) - 1 - q]) * remaining

10,000 bytes in
3,295 bytes out
Shelwien's Green: 3,453
Byron Knoll's cmix: 2,146 (best compressor on Earth, until 1GB test)

100,000 bytes in
27,899 bytes out
Shelwien's Green: 29,390
Byron Knoll's cmix: 20,054

1,000,000 bytes in
241,348 bytes out
Shelwien's Green: 256,602
Byron Knoll's cmix: 176,388

10,000,000 bytes in
2,219,318 bytes out
Shelwien's Green: 2,349,214
Byron Knoll's cmix: 1,651,421

100,000,000 bytes in
My bests above show I'm consistently ahead of Green (about 1.5MB at this scale); I estimate I should get 20,300,000 bytes out if I change my parameters correctly
Shelwien's Green: 21,819,822
Byron Knoll's cmix: 14,838,332
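To make the pattern in the scores easier to see, here is a small sketch that turns the figures above into ratios against Green. The `extrapolate` helper and its assumption (that the last measured ratio carries over unchanged to 100MB) are illustrative only and are not how the 20.3MB estimate was made.

```python
# (input bytes, my output bytes, Green's output bytes), copied from the scores above
scores = [
    (10_000, 3_295, 3_453),
    (100_000, 27_899, 29_390),
    (1_000_000, 241_348, 256_602),
    (10_000_000, 2_219_318, 2_349_214),
]
GREEN_100MB = 21_819_822


def ratios(rows):
    """Output size relative to Green for each test size (below 1.0 means smaller)."""
    return [mine / green for _, mine, green in rows]


def extrapolate(rows, green_at_target):
    """Hypothetical estimate: assume the last measured ratio holds at the target size."""
    return round(green_at_target * ratios(rows)[-1])


for (size, mine, green), r in zip(scores, ratios(scores)):
    print(f"{size:>10,} bytes in: {mine / size:.2%} of input, {r:.4f} x Green")
print(f"naive 100MB estimate: {extrapolate(scores, GREEN_100MB):,} bytes")
```

With no further tuning this naive estimate lands at roughly 20.6MB, so reaching 20.3MB would depend on the parameter tweaks described above.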

Attached is the current code in case I missed adding it.

LOCKSUIT · « **Reply #246 on:** April 19, 2021, 02:14:42 pm »

Oh, and the other thing I've got to fix one day (it would be great if someone knew how): see those if x else x if else x.... those colorful long lines are supposed to be an exponential threshold.... and it seems it is better if it is a curve with adjustable bumps, but I don't know how to make that in 1 or 2 lines of code. By exp curve I mean e.g. 9.5, 9.4, 9, 6, 3, 2, 1.5, 1.2, 1...... but adjustable, so I can move around multiple bumps and curves to get it to give inputs a more precise weighting.
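One possible way to get a falling curve with movable bumps is an exponential decay plus a few Gaussian bumps. This is only a sketch of that idea: `weight_curve` and all of its parameter names and default values are assumptions chosen for illustration, not taken from the attached code.

```python
import math


def weight_curve(n, start=9.5, floor=1.0, decay=0.5, bumps=()):
    """Exponential decay from `start` toward `floor`, plus optional Gaussian bumps.

    `bumps` is a sequence of (center, height, width) tuples; a negative
    height makes a dip instead of a bump. Widths must be non-zero.
    """
    curve = []
    for i in range(n):
        value = floor + (start - floor) * math.exp(-decay * i)
        for center, height, width in bumps:
            value += height * math.exp(-((i - center) / width) ** 2)
        curve.append(value)
    return curve


plain = weight_curve(9)
bumpy = weight_curve(9, bumps=[(2, 1.5, 1.0), (6, 0.5, 0.8)])
print([round(v, 2) for v in plain])
print([round(v, 2) for v in bumpy])
```

Each bump is three numbers, so moving or reshaping one means changing a tuple rather than editing a chain of if/else branches.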

Just got 2,218,935 for 10MB in 3.5 hours on my 6-year-old computer using Cython; I'll have to see if pypy is faster on the long runs.

MikeB · « **Reply #247 on:** April 20, 2021, 08:56:32 am »

It's really impressive compression... I can't actually understand much of it at all, but I know the function of individual lines...

Not sure if Python can work with the Stack/fixed arrays; even C# (Java hybrid) phased it out because it's risky and can lead to accessing memory outside the program, which has security implications. So all fixed arrays in C# must be inside unsafe { } code blocks and be fully certified to be published. It's the fastest way to do data processing so they really should recognise it more.

There may be some Sine equation you can do for the ascending/descending graph with bumps. I'm not good with maths. There is a sine/cosine instruction for CPUs but I think it's expensive in cycles so it may/may not be faster than multiple if-then-else statements...

I'm coming up with 344ms for 1,000,000 bytes, and 120 MB data usage now. But the data output is questionable, so I'll try to convert the full thing to C++ (over this/next week).

LOCKSUIT · « **Reply #248 on:** April 20, 2021, 01:25:19 pm »

2,218,110 now, code is below

LOCKSUIT · « **Reply #249 on:** April 20, 2021, 06:22:47 pm »

BTW, above I excluded 3 lines which stop RAM from getting too large at the cost of some compression. Below is the tree with adjustable stopping for branch lengths, i.e. if it is 5, 4, 3, 3, 2, then it saves just a 1 after that, instead of adding a whole branch like 3, 1, 1, 1. Having only seen it once, it should not save it all yet, is the idea, since it takes time to store the full string.

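node = 0
stop = 0  # counts new nodes created for this window
for i in window:
    char_index = tree[node].find(i) + 1
    if char_index == 0:
        # letter not seen at this node yet: add it with count 1 and a fresh child node
        tree[node] = tree[node] + i
        tree[node + 1].append(1)
        tree[node + 2].append(len(tree))
        node = len(tree)
        tree.extend(('', [], []))
        if stop == 4: break
        stop += 1
    else:
        # letter already known here: bump its count and follow its child pointer
        tree[node + 1][char_index - 1] += 1
        node = tree[node + 2][char_index - 1]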
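For anyone who wants to run the snippet on its own, here is a self-contained sketch that wraps it in the outer stepping loop from earlier in the thread. The function name `build_tree` and its parameters are assumptions added for illustration; `max_new=5` is meant to match the original `if stop == 4: break` check, which allows five new nodes per window.

```python
def build_tree(text, steps, window_size=15, max_new=5):
    """Build the flat tree. Each node is three consecutive list slots:
    child letters (str), per-child counts (list), per-child node indexes (list)."""
    tree = ['', [], []]                      # root node lives at index 0
    for start in range(steps):
        window = text[start:start + window_size]
        node = 0
        stop = 0                             # new nodes created for this window
        for ch in window:
            char_index = tree[node].find(ch) + 1
            if char_index == 0:
                # Unseen letter: record it with count 1 and create its child node.
                tree[node] = tree[node] + ch
                tree[node + 1].append(1)
                tree[node + 2].append(len(tree))
                node = len(tree)
                tree.extend(('', [], []))
                stop += 1
                if stop == max_new:
                    break
            else:
                # Known letter: bump its count and follow its child pointer.
                tree[node + 1][char_index - 1] += 1
                node = tree[node + 2][char_index - 1]
    return tree


tree = build_tree("abab", steps=4, window_size=2)
print(tree[0], tree[1], tree[2])
```

Lowering `max_new` trades compression for memory, since fewer single-occurrence letters are stored per window.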

MikeB · « **Reply #250 on:** April 21, 2021, 06:50:05 am »

Do you use the Stop lines in processing 10,000,000?

LOCKSUIT · « **Reply #251 on:** April 21, 2021, 03:29:48 pm »

No, I didn't, and it uses something like 16GB or 24GB of RAM. On 100MB you'll need them; setting it to 2 or 1 is probably best. By 100MB it loses little compression and RAM will only go up a little. I'm trying 100MB for the first time and so far it has used 14.25GB of RAM after 13 hours, set at =2. It might pass through.

LOCKSUIT · « **Reply #252 on:** April 21, 2021, 11:42:55 pm »

Oh, one more thing: the better way to do the "stop" method for the storage explosion problem, so that there is no 'less compression', is this: instead of storing only 1 or a few next letters until you see more occurrences, simply point to the full line in RAM. E.g. the way I update my tree branch is below, and the better way is below that:

h 4537, e 37, l 23, l 20
h 4538, e 38, l 24, l 21, o 1

h 4537, e 37, l 23, l 20
h 4537, e 37, l 23, l 20, goto letter 6345632 in enwik8 and use those 16 letters

Cuz storing every 1 letter offset of the file with a window of 16 letters makes the file 16 times larger at least. So the idea is to point to each 16 letters instead.

I'm still unsure how to do this exactly, cuz enwik8 is 100,000,000 bytes long, but maybe this is done for smaller files and in big data it eventually erases most of these 'goto's.
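One way to read the 'goto' idea is a tree whose newest branches hold only an offset into the file, and which get expanded one letter at a time when a second window reaches them. The sketch below works under that assumption and uses dict-based nodes for readability (a swap from the flat list layout in the earlier post); `Node`, `insert` and `count_prefix` are hypothetical names.

```python
class Node:
    __slots__ = ("count", "children", "tail")

    def __init__(self, tail=None):
        self.count = 1          # windows that passed through this node
        self.children = {}      # letter -> Node
        self.tail = tail        # (start, end) offsets into the data, or None


def insert(root, data, start, width=16):
    """Insert data[start:start+width]; unseen continuations are kept as offsets only."""
    end = min(start + width, len(data))
    node, pos = root, start
    while pos < end:
        if node.tail is not None:
            # A second window got here: expand the stored 'goto' by one letter.
            t_start, t_end = node.tail
            node.tail = None
            if t_start < t_end:
                node.children[data[t_start]] = Node((t_start + 1, t_end))
        child = node.children.get(data[pos])
        if child is None:
            # First time this letter follows: store a pointer instead of a branch.
            node.children[data[pos]] = Node((pos + 1, end))
            return
        child.count += 1
        node, pos = child, pos + 1


def count_prefix(root, data, prefix):
    """Number of inserted windows that start with `prefix` (non-empty)."""
    node = root
    for i, ch in enumerate(prefix):
        if node.tail is not None:
            # Only one window continues past here; compare against the file itself.
            t_start, t_end = node.tail
            return 1 if data[t_start:t_end].startswith(prefix[i:]) else 0
        node = node.children.get(ch)
        if node is None:
            return 0
    return node.count


data = "abracadabra abracadabra"
root = Node()
for s in range(len(data)):
    insert(root, data, s, width=5)
print(count_prefix(root, data, "abra"))
```

The counts come out the same as in a fully stored tree, but the file has to stay in memory so the offsets can be followed, and pointers only disappear where a branch is seen again.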

MagnusWootton · « **Reply #253 on:** April 22, 2021, 01:14:43 pm »

For Marcus Hutter's dream, 90% compression is not what u need.
For AGI, you require 99.999999999999999999999999999999999999999%, so close to 100% it's not funny.

Getting the first 9 is not it at all; how long the line of 9's is, that's what u need.

LOCKSUIT · « **Reply #254 on:** April 22, 2021, 04:16:03 pm »

Put some thought into this, bro. The best AIs get the enwik8 dataset down to about 15MB from 100MB, and look at how awesome they are. The Transformer architecture is what makes up DALL-E on openAI.com; this is so close to AGI now and everyone talks like it's nearly AGI. It is not only 5% of the way to AGI, nor 20%; we have got 70% of how the human brain works in "working form"!! It's alien technology! You can't even tell now if it's an AI you are talking to, or who created the video you are watching.

That's why we use 2 evaluations: lossless compression, and checking how wicked the text/image/etc. completions/inventions it makes up are, artistically/intelligently. LC gives you a sanity check that it really is AI, and the subjective evaluation tells you what compression IS AGI, cuz if you got 17.5MB and it generates poorly like LSTMs, and Transformers get e.g. 15.3MB and generate human-like material, this tells you what point is what, and the LC tells you if you got to a better point RELATIVELY. Ruler+where you are. Hotdog+bun. Metric+subjective. True+thinksTrue.
