GPU paralleling question

  • 21 Replies
  • 455 Views
*

unreality

  • Nomad
  • ***
  • 71
  • vividly dreaming inside a matrix
Re: GPU paralleling question
« Reply #15 on: November 19, 2017, 02:46:08 pm »
No, I found out what's happening. GPUs are only good if your application doesn't require compute units to run in complete sync, or if you're dealing with KB of data, not MB, or unless you have some astonishingly unique GPU I'm unaware of. GPUs have register memory, also called private memory, which is fast like CPU registers, but it's tiny. GPUs also have local memory, which isn't as fast and can typically take 10 cycles of latency per read/write, but again this is usually on the order of KB. Global GPU memory is large, but has latency on the order of 400 to 800 clock cycles. Sure, there are faster GPUs, but I haven't found any that are orders of magnitude faster.
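To put those latency figures in perspective, here's a back-of-the-envelope sketch in Python. The cycle counts are the ones quoted above; the 1.2 GHz core clock is an assumption I picked, not any specific card's spec:

```python
# Convert the quoted latencies into wall-clock time at an assumed 1.2 GHz
# GPU core clock (cycle counts from the post above; midpoint for global).
clock_hz = 1.2e9

def cycles_to_ns(cycles):
    return cycles / clock_hz * 1e9

register_ns = cycles_to_ns(1)    # register/private memory: ~1 cycle
local_ns    = cycles_to_ns(10)   # local memory: ~10 cycles
global_ns   = cycles_to_ns(600)  # global memory: 400-800 cycles, midpoint

print(f"register ~{register_ns:.2f} ns, local ~{local_ns:.1f} ns, global ~{global_ns:.0f} ns")

# That global latency only kills you if nothing else can run while the load
# is in flight; GPUs hide it by keeping hundreds of threads resident per CU,
# on the order of one independent operation per cycle of latency.
ops_to_hide_global = 600
print("independent ops needed to cover one global load:", ops_to_hide_global)
```

On this model a serial walk through global memory pays roughly half a microsecond per access, which is consistent with why setting 1 MB from a single GPU core looks so slow.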

The problem with that guy's GPU code was that he was setting 1 MB of memory, which goes well into global memory. It just depends on what kind of code you're running. If you know of a data-mining benchmark that shows 500 times the speed of a typical CPU, then awesome, but out of the thousands of benchmarks posted by people around the world so far, there aren't any.

That's great that some NN applications can take advantage of GPUs.

My AI needs GB, not KB, and it deals with 8- to 64-bit data types, not 2,048-bit. I'm not sure a GPU will increase the performance of my future ASI by more than 10 times.

*

unreality

  • Nomad
  • ***
  • 71
  • vividly dreaming inside a matrix
Re: GPU paralleling question
« Reply #16 on: November 19, 2017, 03:36:08 pm »
Also, there aren't that many CUs / SMs on a GPU. The GTX 980 is a good graphics card, but it has only 16 SMs.
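For scale, here's what those 16 units work out to (the SM and per-SM core counts are the GTX 980's published figures; the 8-block launch is just an example I made up):

```python
# GTX 980 (Maxwell GM204): 16 SMs with 128 CUDA cores each.
sms = 16
cores_per_sm = 128
total_cores = sms * cores_per_sm
print("total CUDA cores:", total_cores)   # 2048

# A kernel launched with fewer thread blocks than SMs can't even keep
# every SM busy, regardless of how many cores the card has:
blocks_launched = 8
busy_sms = min(blocks_launched, sms)
print(f"SMs occupied: {busy_sms}/{sms}")
```

So "2,000 cores" and "16 compute units" describe the same card; the cores only help if the launch supplies enough independent blocks of work.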

*

keghn

  • Trusty Member
  • ********
  • Replicant
  • *
  • 675
Re: GPU paralleling question
« Reply #17 on: November 20, 2017, 02:48:53 pm »
 Anybody know of or have a tutorial on learning to program Nvidia GPUs by example?

*

keghn

  • Trusty Member
  • ********
  • Replicant
  • *
  • 675
Re: GPU paralleling question
« Reply #18 on: November 20, 2017, 06:22:20 pm »
 There is a choice between GPU programming frameworks, like CUDA and OpenCL.

Data Movement in OpenCL (7): 



*

unreality

  • Nomad
  • ***
  • 71
  • vividly dreaming inside a matrix
Re: GPU paralleling question
« Reply #19 on: November 21, 2017, 08:41:44 am »
GPU FAN BOY -> you're reading some poor sources there; my GTX 980 (2,000 cores) clears my quad-core by 500 times.

MORE LENIENT TO CPUS -> there are ways to make a CPU go faster. For example, a CPU can do a box blur as fast as a GPU by sliding a box along and adding to and subtracting from it, and building box keys can also be done on a CPU quite well if you read once and write many. *But* a GPU will just naively chew through it with a double for loop in the threading box, so they are both pretty cool.
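To make the sliding-box idea concrete, here's a minimal 1-D sketch in Python (my own illustration, not his code): the naive version re-sums k neighbours for every output sample, O(n*k), while the sliding version keeps one running sum and just adds the entering sample and drops the leaving one, O(n).

```python
# 1-D box blur two ways; edges are clamped, so each output divides by
# its actual window size.

def box_blur_naive(x, k):
    r = k // 2
    out = []
    for i in range(len(x)):
        lo, hi = max(0, i - r), min(len(x), i + r + 1)
        out.append(sum(x[lo:hi]) / (hi - lo))   # re-sum the whole window
    return out

def box_blur_sliding(x, k):
    n, r = len(x), k // 2
    lo, hi = 0, min(n, r + 1)     # window for i = 0 is [0, hi)
    s = sum(x[:hi])               # running window sum
    out = []
    for i in range(n):
        out.append(s / (hi - lo))
        if i + r + 1 < n:         # sample entering on the right
            s += x[i + r + 1]
            hi += 1
        if i - r >= 0:            # sample leaving on the left
            s -= x[i - r]
            lo += 1
    return out

data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(box_blur_naive(data, 3))    # same result both ways
print(box_blur_sliding(data, 3))
```

That running-sum trick is why a single CPU core can keep up with a naive GPU kernel on this particular filter: the CPU does O(1) work per pixel while the GPU brute-forces O(k).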

My honest opinion is that if GPUs had more RAM I'd never use system code again, because it's too taxing on my mind, and you get your operation up and running quicker with fewer building hassles.

That's... er... after you've finished getting through all the horrid documentation and written a billion lines just to set the basic system up. Hmm, I'm contradicting myself.

That's great news. Maybe you're doing it the right way. Below is one source that gives tons of data-mining GPU and CPU examples using well-known data-mining benchmark apps. The fastest GPU score is 16032, while the fastest CPU score is 3500. The GPU isn't even 5 times faster.

My Surface Pro 3 tablet that I use here to surf the internet, which has an i5, is about 1/7th of that fastest GPU.

There's a YouTube video (haven't found the link yet) where the guy shows GPU code along with how long the GPU takes to clear an array of 1<<20 (~a million) floats. It doesn't get much simpler than that. The loop size was 1<<20 (~a million). When he used just one GPU core, it took a whopping 463 ms! When he used 256 cores it took 2.7 ms. That's great, but what's interesting is that a typical desktop PC should take about 5 to 10 ms to do that! Once again we have that 1/5th figure! Why?? I understand GPUs are amazing at graphics, but I'm more interested in AI, pattern recognition, etc. BTW, that YouTube guy has a lot of GPU teaching videos, so one would expect him to know his stuff. Why would it take one GPU core so long to clear a million floats? Sure, when he used 256 cores, it was about 170 times faster, but that's only about 5 times faster than a typical CPU. What am I missing?

Zillions of gpu & cpu data mining benchmarks:
http://monerobenchmarks.info/

I did the same GPU test on my tablet, which has an i5-4300U CPU. To make it simple I used one core (it has 4 cores). It wrote one MB of RAM in 0.4 ms. The 256 GPU cores took 2.7 ms. If I had used all 4 CPU cores it would have taken about 0.1 ms. That's 27 times faster on an old Windows tablet. Maybe someone can test their GPU to see how long it takes to write to 1<<20 bytes of RAM. I'm guessing that data mining uses more of the GPU's features, and NN training uses even more. It's also possible the YouTube guy uses a cheap GPU, though that's difficult to believe since he runs a popular channel that specializes in GPUs.
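If anyone wants to try the CPU side of that test, here's a rough stdlib-Python sketch. It clears 1<<20 floats (4 MB, matching the video's test rather than my 1 MB variant) with a bulk buffer assignment, which is roughly a memcpy; a pure-Python element-by-element loop would measure interpreter overhead instead, and absolute timings will of course vary by machine:

```python
import time

n = 1 << 20                          # ~a million 32-bit floats
buf = bytearray(b'\xff' * (4 * n))   # 4 MiB buffer, start non-zero
zeros = bytes(4 * n)

start = time.perf_counter()
buf[:] = zeros                       # bulk clear of the whole buffer
elapsed_ms = (time.perf_counter() - start) * 1e3
print(f"cleared {n} floats ({4 * n} bytes) in {elapsed_ms:.3f} ms")
```

A single modern core clearing a few MB this way lands in the fraction-of-a-millisecond range, which is the comparison point for the GPU's 2.7 ms figure above.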

*

ranch vermin

  • Not much time left.
  • Starship Trooper
  • *******
  • 476
  • Its nearly time!
Re: GPU paralleling question
« Reply #20 on: November 21, 2017, 01:11:56 pm »
Quote from: unreality on November 19, 2017, 02:46:08 pm

No, I found out what's happening. GPUs are only good if your application doesn't require compute units to run in complete sync, or if you're dealing with KB of data, not MB, or unless you have some astonishingly unique GPU I'm unaware of. GPUs have register memory, also called private memory, which is fast like CPU registers, but it's tiny. GPUs also have local memory, which isn't as fast and can typically take 10 cycles of latency per read/write, but again this is usually on the order of KB. Global GPU memory is large, but has latency on the order of 400 to 800 clock cycles. Sure, there are faster GPUs, but I haven't found any that are orders of magnitude faster.

The problem with that guy's GPU code was that he was setting 1 MB of memory, which goes well into global memory. It just depends on what kind of code you're running. If you know of a data-mining benchmark that shows 500 times the speed of a typical CPU, then awesome, but out of the thousands of benchmarks posted by people around the world so far, there aren't any.

That's great that some NN applications can take advantage of GPUs.

My AI needs GB, not KB, and it deals with 8- to 64-bit data types, not 2,048-bit. I'm not sure a GPU will increase the performance of my future ASI by more than 10 times.

Make sure you check the date of the paper you're looking at. I just had a look myself and saw a convolution-filter benchmark I disagree with, but it was from the DirectX 9 days. The GPU should totally whip a CPU at it. But I guess if you don't believe me, just wait till you do it for yourself for real.

But it doesn't matter anyway; I could still be wrong. I need to do more testing myself, and my GPU just died. :P So I can't even find out for sure.

*

unreality

  • Nomad
  • ***
  • 71
  • vividly dreaming inside a matrix
Re: GPU paralleling question
« Reply #21 on: November 21, 2017, 07:38:35 pm »
Let me know when you're able to get a good test on a GPU. CPUs have their advantages, and so do GPUs. I thought GPUs were supposed to be faster with memory, and CPUs better at logic and if statements. Although if you get a good CPU like the i9, which has a lot of L1 cache, you can get high memory bandwidth. Well over 2000 GB/s. That's blazing!
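For what it's worth, that aggregate-L1 figure is at least arithmetically plausible. A quick sketch (the core count, loads per cycle, and clock are representative numbers I picked, not the spec of any particular i9):

```python
# Aggregate L1 load bandwidth = cores x bytes per cycle per core x clock.
cores = 10                # assumed core count
bytes_per_cycle = 64      # e.g. two 32-byte loads per core per cycle
clock_ghz = 4.0           # assumed all-core clock, cycles per ns

aggregate_gb_per_s = cores * bytes_per_cycle * clock_ghz
print(aggregate_gb_per_s, "GB/s")   # 2560.0 GB/s
```

The catch is that this bandwidth is only there while each core's working set fits in its ~32 KB L1, which circles back to the KB-versus-GB problem.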

There are endless benchmarks. The only ones I've found that compare CPUs with GPUs are data-mining benchmarks. Like I said, thousands of tests from people around the world using well-known data-mining benchmark programs show GPUs are roughly 5 times faster.

It's difficult to say until my AI code is done, many many years from now, but so far it seems it will run best on CPUs. So far it's very logic intensive, but it also does a lot of rapid lookup-table calls all over the place.