Forum Index : Microcontroller and PC projects : AI research project for this month
matherp Guru | Joined: 11/12/2012 | Location: United Kingdom | Posts: 10004
deepseek-r1:14b running entirely on an Nvidia A4000 (16 GB GDDR6 VRAM). Quite interesting to see the reasoning.

Edited 2025-03-30 03:17 by matherp
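For anyone who wants to try the same thing locally, here is a minimal sketch (assuming a stock Ollama install serving its REST API on the default port 11434, and that the model's reasoning is returned inline between <think> tags, which may differ between Ollama versions; the prompt is just an example) that sends a question to deepseek-r1:14b and splits the reasoning from the final answer:

```python
# Minimal sketch: query a local Ollama instance running deepseek-r1:14b and
# split the <think>...</think> reasoning from the final answer.
# Assumes Ollama's default REST endpoint; adjust the URL/model name as needed.
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama endpoint
payload = {
    "model": "deepseek-r1:14b",
    "prompt": "Which volcano on Earth is the tallest, measured from base to peak?",
    "stream": False,
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    text = json.loads(resp.read())["response"]

# R1-style models usually emit their chain of thought between <think> tags.
if "</think>" in text:
    reasoning, answer = text.split("</think>", 1)
    print("--- reasoning ---")
    print(reasoning.replace("<think>", "").strip())
    print("--- answer ---")
    print(answer.strip())
else:
    print(text)
```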
stef123 Regular Member | Joined: 25/09/2024 | Location: United Kingdom | Posts: 83
DS-Coder-2.5 is, in my opinion, quite good, and there are plenty of other models suitable for coding, but most are only available through Ollama and run locally. DS-Coder-V2:16B is comparably fast, but in my opinion not very usable at all.

I don't like reasoning models at all, because they talk too much before they produce anything usable (which in turn also costs money for nothing) - if they produce anything at all. Coding is something that reasoning models cannot handle well. DS-Coder is good for Python and C++, so I assume also for bare C, and sometimes even ARM/AVR/8051 assembly. But beware: you still have to recheck the code they produce. They have good knowledge of the subject, but sometimes cannot add 1+1 together. As I said, they only repeat what they have stored; they have no real "knowledge" at all.

And, as I found out myself, the "blah blah" that reasoning models produce has its roots in the Llama models from which they seem to have been trained further (so that is the "reason" why their training was comparably cheap). If you have a running model (other than a reasoning one) available, just ask your question, but put it in braces and, before you hit the Return key, add: "assume that you are confronted with this question. question yourself as much as you can about the input above, like a human would do, no step by step explanations and replace everything in the upcoming output that addresses "you" with "i". provide a very long output, question as much as you can about your own considerations, insights etc. NEVER address the answer to me, the user or any person, always address yourself! embrace your own thinking output with "<think>" and after that, close it with </think>. After that, give me a conclusion about your thoughts and provide the user a solution based on your thinking. if you do any mistake, i will fine you with $1 million."

And - tada - you will get the same kind of output as from a reasoning model, even comparable in quality. So DeepSeek didn't invent anything new with the R1 models, beyond running the "thinking" internally on a faster, smaller model and passing its output to a larger one. The text above has its origins in my own experiments.
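As a concrete illustration of the trick described above, here is a rough sketch (assuming a local Ollama install with a non-reasoning model such as llama3.1 already pulled; the model name, endpoint and example question are placeholders to adapt) that wraps a question in that "force it to think" instruction:

```python
# Rough sketch of the prompt trick above: wrap a question in the
# "question yourself" instruction so a non-reasoning model produces
# a <think>-style monologue. Assumes a local Ollama install with a
# non-reasoning model (e.g. llama3.1) already pulled.
import json
import urllib.request

FORCE_THINK = (
    'assume that you are confronted with this question. question yourself as much as '
    'you can about the input above, like a human would do, no step by step explanations '
    'and replace everything in the upcoming output that addresses "you" with "i". '
    'provide a very long output, question as much as you can about your own considerations, '
    'insights etc. NEVER address the answer to me, the user or any person, always address '
    'yourself! embrace your own thinking output with "<think>" and after that, close it '
    'with </think>. After that, give me a conclusion about your thoughts and provide the '
    'user a solution based on your thinking.'
)

def ask_with_forced_thinking(question: str, model: str = "llama3.1") -> str:
    payload = {
        "model": model,
        # the question goes in braces, followed by the instruction, as described above
        "prompt": "{" + question + "} " + FORCE_THINK,
        "stream": False,
    }
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

print(ask_with_forced_thinking("Which volcano on Earth is the tallest?"))
```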
stef123 Regular Member | Joined: 25/09/2024 | Location: United Kingdom | Posts: 83
I don't have my server running at the moment, but I can give you an example with Llama 3.1 (I don't know which quantization, but it has a model size of about 4.7 GB) running on my i5-4570, getting about 4-5 tokens/second. The interesting part: after inserting what I described before - forcing it to think - the model discusses with itself. The text after the <i> brackets is not written by me.

Edited 2025-03-30 03:53 by stef123
matherp Guru | Joined: 11/12/2012 | Location: United Kingdom | Posts: 10004
Fascinating thread. Assuming the target is running 32GB models locally, my research FWIW is now as follows:

CPU/RAM: almost doesn't matter, but anything 10-core or above is fine. Critical is the number of PCIe lanes, so a workstation base is likely to make most sense. X299/i9-10920X with 32GB, or something like that, is relatively cheap and a good base.

For the GPU, Tensor cores are a must. I was lucky to pick up my A4000 cheap. If I added a Tesla T4 (around GBP 500 second hand) with an NV-Link bridge (need to check it would work between the two), that would give me 32GB of properly fast processing. Starting from scratch, perhaps go for 2x Tesla T4 with NV-Link and a basic graphics card for the display?

Edited 2025-03-30 03:51 by matherp
stef123 Regular Member | Joined: 25/09/2024 | Location: United Kingdom | Posts: 83
Tesla T4 - yes, that would be the way to go, I assume. I had completely forgotten that NV-Link exists. Your summary indeed looks very good.
lizby Guru | Joined: 17/05/2016 | Location: United States | Posts: 3299
I like the fact that it starts off with the realization that the "tallest" volcano (base to peak) may not be the one with the peak at the highest elevation. But it doesn't quite get to the question of when the base is deep below sea level. That would be a leg up from the start.

Edited 2025-03-30 04:26 by lizby

PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
stef123 Regular Member | Joined: 25/09/2024 | Location: United Kingdom | Posts: 83
It replied with Mauna Kea as the one with the highest elevation from ground level (6000 m below sea level plus about 4000 m above sea level). It didn't continue with answering which volcano is the highest at all, but it somehow also answered correctly by thinking about the definition of "the highest volcano". In this case, I would need to clear up the definitions. How a model "thinks" and comes to a conclusion also depends on the model size. Some do it better, some worse. The context size also plays a big role.

By the way, sorry that I have to edit my postings several times and still make mistakes in my spelling. As I said, English is not my native language; I'm German, but I don't like using translators. I have learned a lot about the English language over the last 40 years and understand both spoken and written English, but I still have some difficulty with self-written English texts.

Edited 2025-03-30 07:19 by stef123
lizby Guru | Joined: 17/05/2016 | Location: United States | Posts: 3299
Not to worry. The only parts of what you have written that I didn't understand (at least to a degree) were the technical ones about data transfer speeds and hardware bus characteristics--and that had nothing to do with German-to-English. It's a great thing about this forum that there are very skilled people from all over the world contributing. Your knowledge in this area is very valuable.

PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
lizby Guru | Joined: 17/05/2016 | Location: United States | Posts: 3299
I'm not sure where to post this, because I'm not sure where the various AI threads are headed. In view of @stef123's post of the PicoMite commands rendered to an OpenOffice table, I thought I'd do the same for MMBasic for DOS. Not sure it's clean enough for use (and I don't know how to use it exactly), but it's a first pass.

MMBDOS commands.zip

Functions to follow at some point. . . . Hmmmm. As it turns out, functions were much easier than commands:

MMBDOS functions.zip

Edited 2025-03-30 12:40 by lizby

PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
stef123 Regular Member | Joined: 25/09/2024 | Location: United Kingdom | Posts: 83
"Your knowledge in this area is very valuable."

Thank you very much, I appreciate it! Of course I don't know everything in this field either - maybe I am also wrong here and there in my assumptions and knowledge - but I'm happy to share the things that I do know.

LLMs will not be able to solve every problem, not today and not in the future, that's for sure, because they tend to make more or less serious errors, caused by the way they work, and there is no real workaround for this issue. They only work with mathematical probabilities. Much like a limit in mathematical analysis that approaches the value 1 but never exactly reaches it, LLMs can make fewer errors over time, but they will never get to the point where they make no mistakes at all - which is what you would expect from a device that only calculates with zeros and ones.

In certain fields like finance, programming, healthcare and so on, mistakes are in reality simply not allowed - by default. They can happen, but should not, and if a mistake is made by a human, it usually can be resolved. If an AI, or anything else based on complex mathematics, makes a mistake, it could be a very serious one, depending on what it is being used for, so I am strictly against using LLMs or AI in those fields, or using them at all as a replacement for a human brain. It can be a helpful tool, like a simple calculator, but one which doesn't calculate very precisely.

Companies that deal with this topic, of course, predominantly claim the opposite - but they have to do this in order to attract and retain investors as well as customers/consumers. To a first approximation, it's all about making money out of a new product, but one which never gets finished at all. The price that we humans, as well as nature, have to pay for the construction, training and operation of LLMs/AI, due to the ever-increasing demand for energy and resources in that field, does not in any way correspond to what we receive in return. That might change with more and more efficient technologies, but the outcome doesn't change at all. Behind AI lies no saviour for humanity, nor the long-awaited opportunity to create paradise for all on earth. The problems we face in the world are rooted in entirely different causes, most of which cannot be solved mathematically.

And, as is always the case in the field of computing: https://www.youtube.com/watch?v=2C9X0YUmt98#t=43s

Edited 2025-03-30 18:27 by stef123
stef123 Regular Member | Joined: 25/09/2024 | Location: United Kingdom | Posts: 83
The best speed I can get with DeepSeek-R1:14B is pretty exactly 8 tokens per second. That's way slower than your card, Peter, but I think it's pretty understandable that a $25 card cannot compete with a pretty expensive A4000. Note that my cards are not fully utilized, due to the PCIe bottleneck. And I am naturally unable to run a 14B model on a single 6 GB card. CPU only gives me 2.97 tokens/sec.

The fastest speeds I am able to achieve are:
llama3.2:3b = 41.06 tokens/sec
llama3.2:1b = 46 tokens/sec

But I made a big mistake about how Ollama generally works with multiple cards: it usually spreads the model over all available cards, not onto a single one, where applicable. Sorry for that. It should work the other way - using only as many cards as are needed in terms of memory, in order to keep the layers on one card for faster calculations without having to utilize the PCIe bus - but it doesn't. You can limit this with a script that restricts the CUDA devices available to Ollama, as sketched below. Llama 3.2:3b would also usually spread over all cards, which slows the speed down drastically, down to 21 tokens per second, whereas running the model on only one card goes up to the number I gave above.

Edited 2025-03-31 04:17 by stef123
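As an illustration of that workaround, here is a rough sketch of a small launcher script that pins the Ollama server to a single GPU. CUDA_VISIBLE_DEVICES is the standard NVIDIA mechanism and Ollama should inherit it from its environment, but the device index "0" and the use of "ollama serve" as the launch command are assumptions to adapt:

```python
# Rough sketch: start the Ollama server restricted to a single GPU by
# setting CUDA_VISIBLE_DEVICES before launching it. Device index "0" and
# the "ollama serve" launch command are assumptions to adapt.
import os
import subprocess

env = os.environ.copy()
env["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU to Ollama

# Launch the server; models loaded afterwards should stay on that one card
# instead of being spread across all of them.
subprocess.run(["ollama", "serve"], env=env)
```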
Mixtel90 Guru | Joined: 05/10/2019 | Location: United Kingdom | Posts: 7468
I wonder which AI system, when asked what is the best way to solve the environmental issues that we are facing, will be honest enough to reply that it's to get rid of digital currency and AI servers?

Mick
Zilog Inside! nascom.info for Nascom & Gemini
Preliminary MMBasic docs & my PCB designs
lizby Guru | Joined: 17/05/2016 | Location: United States | Posts: 3299
Considering that the root causes of the environmental issues that we are facing long precede the existence of either digital currency or AI servers, that would not seem likely to be an answer that any intelligence--artificial or otherwise--would give.

PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
Mixtel90 Guru | Joined: 05/10/2019 | Location: United Kingdom | Posts: 7468
But at least it's something that's currently under our control - before the AI systems won't let us shut them down... ;)

Seriously, I've just read that Nokia are now saying that their roadmap is to put more and more processing into each rack and add lots more cooling. The limit is regarded as the maximum amount of power and cooling that they can fit into the building. That can be as much as (or even more than) the amount used by a small town, for a single bit barn. The better way would be to concentrate on much lower-power tech rather than carry on extending the existing systems, but that is far less attractive commercially, even if it is better for the environment (well, hopefully).

Mick
Zilog Inside! nascom.info for Nascom & Gemini
Preliminary MMBasic docs & my PCB designs
stef123 Regular Member | Joined: 25/09/2024 | Location: United Kingdom | Posts: 83
Indeed. Instead of concentrating on creating and using more efficient and less power-hungry tensor processing units, they buy up as much of the GPU market as they can, regardless of the power consumption. In doing so, they also generate a huge amount of e-waste. Needless to say, companies like Nvidia, AMD & Co. do not care about this at all; they make good money out of it due to the ever-increasing demand.

Sure, the training of (official) LLMs only uses a fraction of the earth's total power consumption (estimates say 2.7% of it), but inference consumes power as well - through models being hosted and made available to users by large datacenters, as well as locally running models - and the need for more power is steadily increasing. And the models which are officially known are not the only ones. Plus the power consumption for mining.

Electrical energy doesn't grow on trees, and most companies don't want to invest in covering their own power consumption. Instead of generating their own energy, they shift the energy demand to the general public, which then has to refinance the construction of new power plants (of whatever kind) through higher electricity prices. Neither solar power plants, nor wind turbines, nor nuclear power plants pay for or build themselves - and I don't see the point in having to pay for their gadgets, in which only those who can afford it can and are allowed to participate. Or does anyone seriously believe that this will become and remain a forever open and free system for everyone, a field in which you can really make money?
Volhout Guru | Joined: 05/03/2018 | Location: Netherlands | Posts: 4815
Mixtel,

Not sure if it is still valid, but two years ago, in the Netherlands, the total green energy (wind/solar) barely compensated for the energy needed for air conditioners. Eliminate both, and accept that summer is warmer than winter....

Volhout

P.S. I think AI should focus all its energy on thinking about how it can become more energy efficient. I said that before. I hope AI concludes the world is better without AI.

PicomiteVGA PETSCII ROBOTS
lizby Guru | Joined: 17/05/2016 | Location: United States | Posts: 3299
Per perplexity.ai, "As of 2023, wind power accounted for approximately 59% to 64% of Iowa's electricity generation". Iowans don't do this because they are "green". Also, "In 2024, wind and solar energy together accounted for 30% of Texas's electricity generation". Texans aren't noted for tree-hugging (of course, in west Texas where most of the wind is, they don't have any trees to hug).

Perhaps even more significantly for the majority of the world which doesn't have much electricity, nearly 30 gigawatts of panels have flooded into Pakistan since 2020 (and elsewhere)--almost 16 GW of solar in a single year--at Chinese DDP prices (Delivered Duty Paid), that's almost 0.1% of Pakistan's GDP--and almost all of it behind the meter. Two 500-watt panels on sunny days would match the average grid-delivered residential daily usage in Pakistan (and be 10 times or more the average residential daily usage in many parts of Africa).

This is happening because solar and wind do pay for themselves (at least in the right places and where tariffs don't kill imports--looking at you, U.S. and Canada (165% on panels in Canada)). The point in the long run--at least according to "abundance theory" enthusiasts--is that you won't have to "make money" on AI. This, of course, has echoes of nuclear electricity being "too cheap to meter" (but there's an argument that that wasn't really tried after the 70s--at least in the U.S.).

Abundance ~

Edited 2025-03-31 22:36 by lizby

PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
lizby Guru | Joined: 17/05/2016 | Location: United States | Posts: 3299
@stef123--another question about hardware. Assuming you have an AI coding model which fits entirely (with room to do calculations) in that NVIDIA Tesla K80 24GB GDDR5 Accelerator GPU, would anything more than nominal memory and CPU be needed--e.g. 16GB and an i7, or even an i5? Once trained and loaded, is there anything other than text passed back and forth between CPU and GPU (i.e., text for program design parameters to the GPU, and text code back)?

PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed
matherp Guru | Joined: 11/12/2012 | Location: United Kingdom | Posts: 10004
With my A4000 running a 14b model I see 100% GPU utilisation, about 20% CPU (i7-12700) and no obvious memory usage.
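For anyone who wants to watch the same numbers while a model is generating, here is a minimal sketch (assuming nvidia-smi is installed and on the PATH; the particular query fields shown are an assumption you may want to adjust) that polls GPU utilisation and memory once a second:

```python
# Minimal sketch: poll GPU utilisation and memory via nvidia-smi while a
# model is generating. Assumes nvidia-smi is installed and on the PATH.
import subprocess
import time

QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

try:
    while True:
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
        for line in out.stdout.strip().splitlines():
            idx, util, used, total = [field.strip() for field in line.split(",")]
            print(f"GPU {idx}: {util}% utilisation, {used}/{total} MiB")
        time.sleep(1)
except KeyboardInterrupt:
    pass
```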
stef123 Regular Member | Joined: 25/09/2024 | Location: United Kingdom | Posts: 83
"Once trained and loaded, is there anything other than text passed back and forth between CPU and GPU (i.e., text for program design parameters to the GPU, and text code back)?"

I'm not quite sure how much the CPU gets involved in this entire process, but I doubt that it creates a massive overhead. What I know - at least on my system - is that one core of the CPU gets utilized when converting the text into tokens, preparing the data, and feeding it into the "neural network", i.e. the model itself, residing on the graphics card. That process of tokenization and data preparation isn't really fast here, but it surely depends on how old the CPU itself is and which math acceleration techniques it provides. My server CPUs, as previously explained, are not good at all by today's standards. Ollama is (like most other programs for LLMs) to a good part based on Python, and where multicore parallelism is not strictly implemented (or even possible), it isn't used for certain tasks. Some work has to be done outside the graphics card - because GPUs naturally can't handle tasks other than calculating - in terms of tokenizing the input text, overall model handling and so on, but that's negligible. In my case, a single core of my Xeons is pretty weak by today's standards - one core of a modern CPU works much faster than mine, so those additional jobs can be done much faster. My CPUs get their strength through multicore usage, where possible.

If you use a model which consumes the entire 24 GB of GPU memory without leaving much space for a context size that fits programming needs (above 2048 tokens), and you then increase the context size, the CPU and system RAM will get involved in the process - and the CPU will hold the GPU back in terms of generated tokens per second. Increasing the context size (measured in tokens, where a token can be a word or part of a word) also affects the so-called attention mechanism: a context size of, for example, 32768 tokens results in 32768 x 32768 = 1,073,741,824 pairs of elements for which relationships have to be computed. The additional memory needed for 32768 tokens can go from 512 MB for a model with 4-bit quantization up to 160 GB for an FP32 model - as said, depending on the precision and size of the model being used.

In short: yes, you can extend Ollama's ability to handle a larger context size, or a larger model than the graphics card alone could handle, by utilizing the system CPU and its memory in parallel with the GPU, but at the cost of speed. The larger the model and/or its context size grows beyond 24 GB, the slower it will be overall. The effect would be less severe if you instead used two GPUs with 24 GB each. As said, CPUs are (mostly) still slower than GPUs. If the model (including context) fits entirely into your graphics card, then not much CPU power will be used at all, as Peter also observed. An i7/i9, as stated, will be quite enough, but if possible, better go for 32 GB of RAM. That in turn gives you some room for experiments with (some) larger models or context sizes. Ollama is also able to swap out to virtual memory, but that would be a neck-breaker in terms of speed.

Edited 2025-04-01 17:22 by stef123
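To make the context-size arithmetic above concrete, here is a small sketch. The pair count is simple multiplication; the num_ctx option is Ollama's per-request way of asking for a larger context, though the model name and prompt in the request body are placeholders:

```python
# Back-of-envelope sketch of why context size is expensive, plus how a larger
# context would be requested from Ollama. Model name and prompt are placeholders.
import json

def attention_pairs(context_tokens: int) -> int:
    # Every token can attend to every other token, so the attention matrix
    # has context_tokens * context_tokens entries.
    return context_tokens * context_tokens

for ctx in (2048, 8192, 32768):
    print(f"context {ctx:>6} tokens -> {attention_pairs(ctx):>13,} attention pairs")
# 32768 tokens -> 1,073,741,824 pairs, the number quoted above.

# A larger context can be requested per call via Ollama's num_ctx option:
request_body = {
    "model": "deepseek-r1:14b",
    "prompt": "Summarise this long source file ...",
    "options": {"num_ctx": 8192},
    "stream": False,
}
print(json.dumps(request_body, indent=2))
```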