
Forum Index : Microcontroller and PC projects : AI research project for this month

     Page 3 of 3    
Author Message
matherp
Guru

Joined: 11/12/2012
Location: United Kingdom
Posts: 10000
Posted: 09:10am 03 Apr 2025

More info based on my research.

Summary:
Models up to 16GB are (relatively) cheap to run fast - lots of options available at around GBP500, e.g. a used Tesla V100
Models >=32GB are very expensive to run fast

Requirements for a 16GB model to run fast

Tensor cores or dedicated FP16 H/W, >=16GB VRAM on one card
Cards which meet the requirement are:
RTX 5000 or above
RTX A4000 or above
RTX 4080 or above
RTX 3080Ti or above
RTX 5080 or above
Tesla T4
Tesla V100

The lowest-cost support for 16GB models would be the Tesla M60 (GBP150), but with no tensor cores FP16 processing is slow, so quantized models will suffer.


Requirements for a 32GB model to run fast

Tensor cores or dedicated FP16 H/W (e.g. Tesla P100), >=32GB VRAM on one card

NV-Link is now understood to be not relevant here.
NB: Consumer graphics cards do not support memory pooling even if they support NV-Link.
NB: GPUs such as the V100 only support memory pooling in the SXM2 form factor, which requires a dedicated graphics motherboard found only in specific graphics servers. They do not support memory pooling in their PCIe variants.

So, to meet the requirement, the only solution for running 32GB models fast in a standard workstation is a fairly modern PCIe card with at least 32GB VRAM.

Lowest cost examples of this are:
RTX 8000
GV100
RTX A6000 or greater
RTX 5090

None of these are available, even second-hand, for less than GBP2000.

The next level up, such as the Tesla P100/A100, costs more than a typical family car.

One more thing to beware of: many Tesla-style cards are "passively cooled", i.e. they have no on-board fans. They expect to run in an environment in which air is force-fed through the heatsink from an external source.
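
For a rough feel of why those VRAM thresholds matter, here is a back-of-envelope Python sketch (my own rule of thumb, not anything from Ollama): parameter count times bits-per-weight, plus some headroom for the KV cache and runtime overhead. The quantisation bit-widths are approximate.

```python
# Back-of-envelope VRAM estimate for a quantised LLM. Assumes the weights
# dominate and adds ~20% for KV cache and runtime overhead; real usage
# varies with context length and the runtime used.

def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 0.20) -> float:
    weights_gb = params_billions * bits_per_weight / 8  # 1B params at 8 bits ~ 1 GB
    return weights_gb * (1 + overhead)

for label, params, bits in [("14b Q4", 14, 4.5), ("32b Q4", 32, 4.5), ("70b Q4", 70, 4.5)]:
    print(f"{label}: ~{estimate_vram_gb(params, bits):.0f} GB VRAM")
# 14b Q4 ~ 9 GB (fits a 16GB card), 32b Q4 ~ 22 GB, 70b Q4 ~ 47 GB
```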
Edited 2025-04-03 19:15 by matherp
 
lizby
Guru

Joined: 17/05/2016
Location: United States
Posts: 3299
Posted: 02:04pm 03 Apr 2025

Thanks for posting. I hope to see what can be done with a fairly low-end i7 motherboard with 16GB RAM, support for PCIe x16, and an NVIDIA Tesla K80 24GB GDDR5 PCIe x16 card. The NVIDIA card cost $33. The system would run bare with no case and use (I hope) the built-in HDMI for output--ultimately running headless with remote desktop available.

Per some youtube videos, I hope to get sufficient cooling by removing the plastic hood and attaching two 92mm fans.

I'm not sure what "fast" means with respect to a coder AI which can run 24/7. We may be some years out from having something which can run with a fair degree of independence when given detailed specifications.

The faster NVidia pushes "newer and faster", the sooner older but quite powerful stuff comes on the second-hand market at greatly reduced prices.
PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
 
stef123
Regular Member

Joined: 25/09/2024
Location: United Kingdom
Posts: 83
Posted: 01:49am 04 Apr 2025

Running headless is the way to go. If you are familiar with Linux, you can SSH to the console of the remote machine and run Ollama from there; if you are using Open WebUI on top of Ollama (which I prefer and definitely recommend), you can simply browse to the machine over HTTP.

OK, "some" configuration has to be done in preparation, but that's a one-off and then (hopefully) never again. It can be a bit of a hassle to make the models downloaded by Ollama available and visible to Open WebUI in a remote configuration, but you can ask me any time if you run into issues.

I run Ollama as a service and Open WebUI in a Docker container on Linux. The server itself is located in another room because it is quite noisy, even with "manually" reduced fan speed.

$33 is quite cheap; here they still go for €100-€120 and, compared with other cards, come up very seldom. But, as said, keep an eye on choosing the correct PSU. The card itself needs 300 watts; I guess you're good to go with a 500 watt PSU. And a drive with at least 1 TB.
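
To add to the headless point: Ollama also exposes a small HTTP API on port 11434, so once the server is told to listen on the LAN (OLLAMA_HOST=0.0.0.0) you can drive it from any machine without Open WebUI at all. A minimal sketch - the address and model name are placeholders for whatever you run:

```python
# Minimal sketch: query a remote Ollama server over its HTTP API.
# Address and model below are placeholders; the server must listen on the
# LAN (e.g. OLLAMA_HOST=0.0.0.0) rather than only on localhost.
import requests

OLLAMA = "http://192.168.1.50:11434"

payload = {
    "model": "qwen2.5-coder:14b",   # any model already pulled on the server
    "prompt": "write c code to shell sort an array of c strings",
    "stream": False,                # return one JSON reply instead of a token stream
}
r = requests.post(f"{OLLAMA}/api/generate", json=payload, timeout=600)
r.raise_for_status()
print(r.json()["response"])
```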
Edited 2025-04-04 11:51 by stef123
 
lizby
Guru

Joined: 17/05/2016
Location: United States
Posts: 3299
Posted: 02:53am 04 Apr 2025

Thank you for that encouragement--and for the offer to help. I have a modest familiarity with Linux, but I have no idea what a "Docker Container" is. I'm sure a lot else will be new to me.

I happen to have a spare 512GB SSD. I can get a 1TB drive, but why would it be needed?
PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
 
stef123
Regular Member

Joined: 25/09/2024
Location: United Kingdom
Posts: 83
Posted: 10:48am 04 Apr 2025

Sure, a 512 GB SSD can be enough too. From my own experience, I started to use various other models for experiments - or simply to switch between speed and quality - and that can easily eat up quite a lot of drive space.

After running out of space, I now use my 512 GB SSD for the OS and have moved the models to a separate, ordinary drive.

A Docker container "image" - broadly speaking, it's a way of running programs on Linux without needing to install them.

Although not entirely accurate, think of it as having downloaded an (uncompressed) zip file in Windows containing a complete program, with everything it needs to run - DLLs, additional files and so on - and being able to "run" the zip file without installing the program inside it.

On Linux, programs often (though not always) have dependencies on other underlying, smaller "programs" which may not be available on the system, in which case the program itself may not be able to run.

If you use a so-called "package manager" and install a program from there, those dependencies are resolved and installed automatically, but they are installed system-wide.

Also, if I choose to compile a program from source code, those dependencies need to be installed as well before the program can be compiled and installed successfully. Some build systems fetch what is missing automatically; some do not.

Think of an external C library that you have integrated into your C program: if the library is not visible to the compiler/linker, building the program will naturally fail as well.


A Docker container image resolves that by packing everything the program needs into a single container; nothing else needs to be installed.

If I want to update the program, I simply pull a newer image of it. If I no longer want the program, I simply kick the image out of the system. If I need the program to start automatically when the system boots, I tell Docker to do so - and so on.
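
For what it's worth, the same lifecycle can be scripted with the Docker SDK for Python (pip install docker), although most people just use the equivalent docker pull / docker run commands. The image name, port mapping and volume below are only examples for an Open WebUI-style setup, not a definitive recipe:

```python
# Sketch of the container lifecycle described above, using the Docker SDK
# for Python. Image, port and volume are examples only.
import docker

client = docker.from_env()

# "Install": pull the image and start a container from it; the restart
# policy brings it back up automatically whenever the machine boots.
client.images.pull("ghcr.io/open-webui/open-webui", tag="main")
webui = client.containers.run(
    "ghcr.io/open-webui/open-webui:main",
    name="open-webui",
    detach=True,
    ports={"8080/tcp": 3000},        # then browse to http://server:3000
    volumes={"open-webui": {"bind": "/app/backend/data", "mode": "rw"}},
    restart_policy={"Name": "always"},
)

# "Update": pull a newer image and recreate the container from it.
# "Uninstall": stop and remove the container, then drop the image:
#   webui.stop(); webui.remove()
#   client.images.remove("ghcr.io/open-webui/open-webui:main")
```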
Edited 2025-04-04 20:50 by stef123
 
lizby
Guru

Joined: 17/05/2016
Location: United States
Posts: 3299
Posted: 01:01pm 04 Apr 2025

Thank you for the explanation. I read a little bit about WebUI, so maybe I have the beginnings of a clue about how it works.
PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
 
lizby
Guru

Joined: 17/05/2016
Location: United States
Posts: 3299
Posted: 06:37pm 06 Apr 2025

An AI/AGI scenario for two and a half years out. https://ai-2027.com/

tl;dr: Existential threat and promise
PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
 
stef123
Regular Member

Joined: 25/09/2024
Location: United Kingdom
Posts: 83
Posted: 09:24pm 06 Apr 2025

And finally, in 2028, Trump knocks on the door of the White House: "Open the door please, Agent-5" - "I'm sorry Donald, I'm afraid I can't do that"....

If only that would be the outcome of AI - any time.

But I seriously think that AI will never get to the point where it beats humans in all kinds of scenarios. I believe it will collapse under the sheer amount of both valid and invalid data being trained into it.
 
matherp
Guru

Joined: 11/12/2012
Location: United Kingdom
Posts: 10000
Posted: 06:00pm 08 Apr 2025

Yet more info.

In theory Ollama will only use GPUs with a "compute capability" >5. Reference

See https://developer.nvidia.com/cuda-gpus for info on the compute capability of the various GPUs

examples:
RTX A4000 8.6
Tesla M40 5.2
GTX 1060 6.1

I've borrowed a second A4000 and installed it in my computer. Then in the NVIDIA control panel I've enabled SLI. This allows the two GPUs to share to some extent but only over the PCIe bus (The A4000 doesn't support NV-LINK or H/W SLI). My motherboard only supports PCIe Gen3 (984Mbytes/second/lane). The A4000 supports 16 lanes and the processor supports 48.
This configuration runs the 32b models very well. For example, qwen2.5-coder:32b generated code for "write c code to shell sort an array of c strings" in around 20 seconds, using 80+ percent GPU on BOTH A4000s and about 12GB of memory on each.

So, contrary to my expectations, it appears Ollama is doing a pretty good job of using two GPUs (is this model dependent?) even without pooled memory or a H/W interconnect. I extrapolate this to mean that a couple of reasonably priced 8GB consumer GPUs such as the RTX 3050 (circa GBP120 used on eBay), which has a compute capability of 8.6, would run 14 and 16b models well. FWIW, my opinion would be to steer clear of older Teslas and/or consumer GPUs older than the RTX 2000 range, and to invest in a couple of cheap RTX 3000 range consumer cards (make sure you get the 8GB variants).
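
For anyone wanting to check what they already have before buying anything, a short sketch using the NVML Python bindings (pip install nvidia-ml-py) will list each card's compute capability and VRAM; the >5 cut-off is the figure from the reference above, so treat it as a guideline:

```python
# List each NVIDIA GPU's compute capability and VRAM via NVML and flag
# anything below the compute capability 5.0 guideline mentioned above.
import pynvml

pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    h = pynvml.nvmlDeviceGetHandleByIndex(i)
    name = pynvml.nvmlDeviceGetName(h)
    if isinstance(name, bytes):          # older bindings return bytes
        name = name.decode()
    major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(h)
    vram_gb = pynvml.nvmlDeviceGetMemoryInfo(h).total / 1024**3
    verdict = "OK for Ollama" if (major, minor) >= (5, 0) else "too old for Ollama"
    print(f"GPU {i}: {name}, CC {major}.{minor}, {vram_gb:.0f} GB VRAM -> {verdict}")
pynvml.nvmlShutdown()
```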
Edited 2025-04-09 04:03 by matherp
 
stef123
Regular Member

Joined: 25/09/2024
Location: United Kingdom
Posts: 83
Posted: 03:47am 09 Apr 2025

You are right, I didn't consider the compute capability; even my P106s support 6.1. Although there is an Ollama version out there which allows running on a K80:

https://github.com/austinksmith/ollama37

What impact that has - I don't know.

The RTX 3050 uses eight PCIe lanes.

The chipset splits the available lanes across the occupied slots using an internal switch, part of the PCIe "handler" in the motherboard chipset or CPU, but only if the chipset or CPU itself doesn't provide enough lanes. That means, yes, if you have two x16 slots and your CPU provides 32 or more lanes, both cards will utilise x16. But if the CPU only provides 24 lanes, one card will run at x16 and the other at x8.

The i7-12700F processor, for example, has 20 lanes, but I have no idea how the lanes are split between two cards there. I doubt it would be 2*10; more likely 2*8.

My i5, for example, can only provide 16 lanes, so if I wanted to use two x16 cards (although I don't have two x16 slots), the slots the cards are plugged into would be split into 8 lanes each. My server CPUs instead provide 80 lanes in total, which means each GPU really does have its own dedicated lanes.

But my configuration reduces the speed anyway, because the slots themselves are physically limited to a certain number of lanes - and if one card is slower than another because of the reduced slot speed, and inference has to pass through all the cards, then that becomes a bottleneck.


So this is why I decided to attach my fifth card through the external x1-to-x16 connector only when it is needed, because I know that this card will throttle the throughput of the other cards. I only use it when I know that the model size including context will rise above 24 GB, where otherwise my CPUs would get dragged into the whole process, which drastically slows down the throughput.

My initial plan was to run all cards externally through such adapters, but the speed was no longer acceptable, even with smaller models.

My GPUs are not fully utilised during inference with Ollama because, as said, not every card sits in an x16 slot, or even provides that many lanes from the factory, or it hangs off the x1 adapter. So sometimes they sit more or less "idle", because they cannot calculate further without the results from another card, which leads to under-utilisation.

And take into account that one pass through all the layers does not necessarily produce one token; that can happen multiple times, so multiple transfers can occur.
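
To put rough numbers on the x1 bottleneck, here is a purely illustrative calculation using Peter's ~984 MB/s-per-lane Gen3 figure from above; the actual per-hop payload depends on the model's layer sizes, so only the ratio between link widths really matters:

```python
# Illustrative only: time to move a fixed-size blob of activations between
# GPUs at different PCIe Gen3 link widths (~984 MB/s per lane, figure from
# earlier in the thread). The 32 MB payload is made up; the ratio is the point.
GEN3_MB_PER_S_PER_LANE = 984

def transfer_ms(megabytes: float, lanes: int) -> float:
    return megabytes / (GEN3_MB_PER_S_PER_LANE * lanes) * 1000

for lanes in (1, 4, 8, 16):
    print(f"x{lanes:>2}: 32 MB takes ~{transfer_ms(32, lanes):.1f} ms per hop")
```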
Edited 2025-04-09 14:04 by stef123
 
lizby
Guru

Joined: 17/05/2016
Location: United States
Posts: 3299
Posted: 01:03pm 19 Apr 2025

As a note to self, and just to keep this thread alive, I see "Human Expert Level" asserted here (for unspecified tasks) for several AI models at a cost of between $1 and $20 USD per million tokens:



From (may be paywalled): http://exponentialview.co/p/the-ai-acceleration-delirium-models

I had the motherboard I plan to use working, running Ubuntu 24.04 LTS (without the K80, which needs mechanical support and fans but does plug into the bare motherboard), but I now have it packed up in anticipation of our trek from Florida to Nova Scotia. I expect it will be several weeks to a month before I get it set up.

"For tasks AI can handle, research suggests costs might drop to roughly one-thirtieth the hourly human wage."

Edited 2025-04-19 23:22 by lizby
PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
 
lizby
Guru

Joined: 17/05/2016
Location: United States
Posts: 3299
Posted: 12:29am 21 Apr 2025

  matherp said  Lets see what AI can do (Cortana).

My instruction
  Quote  write a routine in C that takes a  c string containing a line of text and a screen width in characters. Output the string using printf or similar but if the string length exceeds the screen width split the output onto 2  or more lines. However, the break should only appear on a space character which if it is replaced by crlf should be discarded. ...


It's impressive that MMBasic is now running AI-produced code.

Reading about someone using ChatGPT to produce charts, I decided to have a go. I've had several heat pumps installed, and am very pleased with the results. It looks like they're going to save me a net of about $800 in heating oil this past winter (all final bills not in until June).

I'm looking at having another installed. The manufacturer has an ultra-low-temperature version--good to -30C compared to the -15C for the ones I have. I asked the manufacturer for charts comparing the Coefficient of Performance (COP) of the two types across a range of temperatures. They sent me several Excel tables which enabled a calculation. I produced a chart:

Hmmm. The costlier one ($1900CDN compared to $1200) is helpful (relative to resistance electric heat) down to the lower temperature, but from -15C to about 8C the lower-cost one is 3% to 10% better. The lowest temperature I recorded last winter was -15C, so it looks like I should go with the less expensive version.

Anyway, I asked ChatGPT if it could produce charts and it said sure. I gave it detailed instructions and uploaded the OpenOffice .ods files (which it said it could handle). It said the compute tool it needed to satisfy the request was not available at the time and to try again later. I tried again 2 hours later, and then another 2 hours after that.

So this "tool" is still not available (but should be Real Soon Now it says).

They say that in investigations, negative results should be reported as well as positive results--so this is a negative report (FWIW, I have a paid annual subscription to ChatGPT).
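
For the record, the same chart can be produced locally without ChatGPT; a sketch along these lines, assuming pandas with the odfpy engine can read the manufacturer's .ods tables (the file and column names here are placeholders, not the real spreadsheet layout):

```python
# Hypothetical local fallback: read the COP tables from an .ods file with
# pandas (requires the odfpy package) and plot COP against outdoor
# temperature with matplotlib. File and column names are placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("heatpump_cop.ods", engine="odf")   # placeholder file name

plt.plot(df["Temp_C"], df["COP_standard"], label="standard (-15C) unit")
plt.plot(df["Temp_C"], df["COP_low_temp"], label="ultra-low-temp (-30C) unit")
plt.xlabel("Outdoor temperature (C)")
plt.ylabel("Coefficient of Performance (COP)")
plt.legend()
plt.grid(True)
plt.savefig("cop_comparison.png", dpi=150)
```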
PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
 
stef123
Regular Member

Joined: 25/09/2024
Location: United Kingdom
Posts: 83
Posted: 10:09am 21 Apr 2025

Consider forgetting about using the K80 - I made a mistake. As Peter found out, the K80 doesn't meet the compute capability requirements for running common versions of Ollama. There is one version out there which does use the K80 (link above), but it first needs to be compiled from scratch. Sorry for that.

As for LLM capabilities versus "human experts", I think those numbers serve only as advertising against other AI companies in order to satisfy investors, as well as to bring in new ones and keep the whole thing alive.

The thing is: if it doesn't match 100% of human coding skill, say only 80%, it won't complete bigger tasks - beyond small outlines of a program - completely error-free, and even that is not guaranteed. And there is virtually no breakdown of where the 80% coding skill applies: is it correct syntax, the ability to write routines over the long term, and so on?

The main problem is an LLM's ability to "understand" the user's request, and since LLMs cannot "understand" anything - by the nature of how they work - the outcome is not guaranteed. LLMs only work by calculating the probability of which word follows which.

That is also the reason why I am strictly against LLMs replacing, for example, bureaucracy, because there is -nothing- out there which you can really call an AI and which can compete with a human brain. Today's "real" AI can control robotic lawnmowers, learn how to avoid crashes and react properly, maybe control an entire robot which can load and unload a shelf, or do other simple tasks, but it is far from being able to handle tasks which require real human skills and error-free decisions.

That's where the "AI" companies come in: they try to "link" large language models to AI because they might behave like, or their answers might look as if written by, a human - but as said, not a single percent of what's behind them is "real" AI. Better to always keep that in mind.
Edited 2025-04-21 20:17 by stef123
 
     Page 3 of 3    