Forum Index : Microcontroller and PC projects : AI research project for this month
Page 1 of 3 |
|||||
Author | Message | ||||
lizby Guru Joined: 17/05/2016 Location: United States Posts: 3299 |
I'm not exactly sure what my use case is, but I'm interested in buying a system which can run DeepSeek locally, the 32B version--"reasoning" or not. I haven't owned a desktop PC for 15 years, so I don't have anything to build on. I'm looking for something 12-core with 32 GB of memory, possibly expandable to 64 GB, no Nvidia, running Ubuntu bare and headless--a system connected by wifi to my network (perhaps its own network), but not connected to the internet. A link to a discussion: DeepSeek locally. That link has links to how to set it up. So far I've found this $359 processor + motherboard and this $82 memory. I've only just started looking into this. Has anyone else here considered a similar journey? ~ Edited 2025-03-27 00:18 by lizby PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
vegipete Guru Joined: 29/01/2013 Location: Canada Posts: 1121 |
I lack knowledge about this, but saw the following in our campus news: https://ubctoday.ubc.ca/news/march-03-2025/restricting-use-deepseek-ubc Likely much less of an issue if run locally, without net access. Visit Vegipete's *Mite Library for cool programs. |
||||
matherp Guru Joined: 11/12/2012 Location: United Kingdom Posts: 10000 |
I've got something very similar: a 12700 with 32 GB RAM plus a Quadro A4000. It works with the 32B DeepSeek but is pretty slow. Note that as soon as the model size exceeds the graphics card's VRAM it will run on the CPU using main memory. Personally, I'm a lover of old workstations, which can be fantastic value. I can't prove it, but IMHO something like this would be better value and work better. Hate to say it, but the other approach is a Mac mini. That has the advantage of unified memory, so the GPU can use as much memory as it wants, and it is better at these massively parallel tasks than the CPUs. Edited 2025-03-27 02:01 by matherp |
||||
lizby Guru Joined: 17/05/2016 Location: United States Posts: 3299 |
Pete--yes, safety concerns are why I want to run locally. Peter--thanks for the link and the indication that I may be on a plausible path. I'm trying to find something similar on U.S. ebay--so many options to try to understand. PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
stef123 Regular Member Joined: 25/09/2024 Location: United Kingdom Posts: 83 |
Well, I myself have an old dual Xeon E5-2697 v2 server running with 128 GB of DDR3 RAM, plus five P106-100 ex-mining cards with 6 GB GDDR5 each, for a total of 30 GB of VRAM. Linpack performance of the two CPUs alone is about 500 GFLOPS at FP32 (which I consider rather slow for this setup); the cards are around 4.4 TFLOPS each at FP32, so we're talking roughly 22 TFLOPS. BTW: when running newer Linpack builds, keep in mind that they calculate in double (64-bit) precision; older CPUs like mine only provide AVX (no AVX2, AVX-512 and so on) and don't handle 64-bit precision very well. Older Linpack benchmarks use FP32. FP32, although not often used by LLMs because of its size, is the figure to keep an eye on.

My setup runs Ollama with various DeepSeek-Coder models at different quantizations, as well as some other LLMs. I don't prefer the reasoning models, because their output, especially for coding, is more or less useless. I don't get what people love about models like R1; it produces mostly unusable answers. If you like the fact that it "thinks": you can force any other LLM to "think" about your request and the output will be much the same as R1's; you simply have to tell it to do so.

You have to consider the following things (skip to "Conclusion" if this doesn't matter for the moment):

One graphics card alone, especially something like the rather cheap Nvidia K80 with 24 GB, won't do it; it has less performance than two P106-100s. If you want speed with 32B models, you have to pay a fairly large amount of money for at least one really good card (above €1000).

The problem is: fitting the model into the graphics card is not everything. The context size, i.e. the "input and output" window where you ask the LLM something and the answer is put in, can go from 2048 tokens (I call them "words" here, although that is not the correct term) up to whatever the LLM can handle (sometimes up to 128,000 tokens). Increasing the context size eats up a lot of available memory and slows down inference. With a rising context size, the model needs more and more of the graphics card's RAM. If the card does not have enough memory, system RAM and the CPU are used in a mixed manner (this doesn't apply to Apple M1 and later because of their unified memory). As soon as the CPU gets involved, LLMs usually get very slow, because CPUs, unlike GPUs, are naturally much slower at the matrix multiplications that LLMs need. Once the CPU is involved, it holds back the GPU(s), no matter how fast they could calculate. Generating one token (one "word") needs billions of matrix multiplications through the entire model.

The core count does not matter much overall. I have 24 cores in my Xeons (with 60 MB of L3 cache in total), but as you can see I still only get 500 GFLOPS (the estimate for both CPUs should be around 900 GFLOPS). A newer CPU (with the same number of cores) may exceed this by a factor of three or four, but that is still nothing for LLMs. Remember, my five cards alone do 22 TFLOPS and are still far too slow.

You also won't be happy with a context window of, say, 2048 tokens; especially for coding purposes that is virtually nothing. ChatGPT, for example, is fixed at 8192 tokens (no matter whether you use the free version or the paid one) and cannot handle more. This means you can usually get 3-4 pages of code and it simply cannot generate more, so you cannot keep longer outputs consistent.
If you then request more code than the LLM can handle within its given context size, it will start to make errors by dropping parts of the previously generated code - or, if the context size is initially too small, it will give incomplete answers.

If you use one or more graphics cards, you have to keep an eye on how many PCI Express lanes each one uses. Some cards only use four, some eight, some all sixteen lanes. This is vital, because there is a lot of traffic between the system and the GPU (or multiple GPUs) while generating answers, i.e. a lot of data going in and out over the PCI Express bus; the fewer lanes in use, the lower the speed. Forget about cheap PCI Express x1-to-x16 adapters and running the GPUs outside the computer with an external power supply (more on that later), and forget cheap mining boards with more than one PCI Express x16 slot, because they usually wire only one data lane to each slot. Mining workloads are very different from what an LLM needs.

In terms of power: be aware that GPUs need a lot of it. My server with all five graphics cards easily draws 500 to 600 watts from the wall socket, and the slower the output of the LLM, the more energy it consumes overall. A 32B model on my setup with a context size of 32768 or more easily eats up 40-50 GB of CPU and GPU RAM, at a speed of 1-2 tokens per second. That means for a large program (with no guarantee that the output is usable at all!) you can count on several hours to generate an answer, with the entire system fully under load. If you use smaller models, say 3B, 7B or 14B, the output speed increases dramatically, but the quality of the answers also drops dramatically.

Even a single Apple M4 can't do wonders. Its CPU only provides about 4.4 TFLOPS and its GPU about 8.5 TFLOPS; that is virtually nothing for an LLM, which at best needs several hundred TFLOPS to run a bigger model (32B, 70B or even 405B) decently.

CONCLUSION: The i7 you have your eye on does 691.2 GFLOPS on the CPU and roughly 700 GFLOPS on its integrated GPU; that is nothing for running models of that size and quantization. Remember: the higher the quantization precision (more bits, hence more memory, and more again when you increase the context size), the more precise the answers, but the tokens per second drop drastically. By precision I mean usable output - LLMs fundamentally only repeat what they have stored. Also: LLMs usually use Nvidia or AMD GPUs for acceleration; they won't use the integrated Intel GPU, so you would be stuck running the LLM on the CPU alone, and you won't want to do that. Nothing is worse than running LLMs on CPU only. There are so-called "tensor" processors from Google for a few dozen bucks that plug into a USB port, but they add up to nothing here; they are only usable for very, very small AI models running on a microcontroller, i.e. models a few megabytes in size.

All I can say is: save your money, it's not worth running larger LLMs locally (especially on CPU only, without a dedicated and very strong graphics card) - unless you are willing to pay a fairly decent amount, in the area of several thousand euros, for a good mainboard and fast GPU accelerators, plus a lot of money for power consumption. Then you may get speeds like online ChatGPT. As long as that is free to use, better use it; it costs nothing.
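To put rough numbers on the memory side of what I described above, here is a back-of-envelope sketch. It is my own simplification: the per-token KV-cache figure is an assumed constant (real values depend heavily on layer count, hidden size and attention scheme), so treat the numbers as illustrative only.

```
# Rough VRAM estimate for running an LLM locally (illustrative numbers only).
def vram_estimate_gb(params_billion, bits_per_weight, context_tokens,
                     kv_bytes_per_token=0.5e6):
    """Very rough estimate: quantized weights + KV cache.
    kv_bytes_per_token is an assumed figure (~0.5 MB per token for a 30B-class
    model); real values vary a lot between architectures."""
    weights_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    kv_cache_gb = context_tokens * kv_bytes_per_token / 1e9
    return weights_gb + kv_cache_gb

# 32B model, 4-bit quantization, 2048-token context -> high teens of GB
print(round(vram_estimate_gb(32, 4, 2048), 1))     # ~17 GB
# Same model with a 32768-token context -> tens of GB, as described above
print(round(vram_estimate_gb(32, 4, 32768), 1))    # ~32 GB
```

The weight term dominates at small context sizes; the context term is what creeps up on you in long coding sessions.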
I am aware of your safety concerns ("what if they use the code I have generated", or other reasons), but believe me, you won't be happy with a local installation in that price range. It's virtually useless. It might do a relatively good job with small 3B models, but you won't get the speed gain you expect from where you are now.

Edit: a short table with some numbers on what you can expect from certain graphics cards. A 32B model roughly needs 28 GB of memory (with a context size of 2048). Shown there, for example, is a 70B FP16 model running on four (!) Nvidia H100 cards, which together provide 6400 TFLOPS at FP16 (about 290 times the calculation speed of my cards), produce roughly ten tokens per second, and consume 2.8 kilowatts. One H100 costs between US$25,000 and 30,000. https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

Edited 2025-03-27 19:44 by stef123 |
||||
lizby Guru Joined: 17/05/2016 Location: United States Posts: 3299 |
Thank you @stef123, for the detailed response. This is far better practical guidance than I have seen anywhere. Ten tokens per second is a target I have seen recommended, but with the suggestion that a local system achieving it could cost considerably less than you have estimated. I accept that with your experience you are closer to the truth. So until I have researched further, I will put a purchase on hold (unless I just want to have a play).

Regarding security, I do not have any concern about code that I might generate using AI being appropriated for use by others. I've been retired for 26 years, and anything that I do I'm happy to put in the public domain. I have seen Together.ai recommended as a relatively inexpensive U.S. host for DeepSeek. Do you have any experience with that or similar sites?

I see on U.S. ebay that Nvidia P106-100 6GB cards are on sale for around $30. If I were to attempt a system using those, what do you think would be a minimum for, say, a 12B model, and what would be a suitable motherboard and processor for them? Do you have a link explaining how one would go about setting up such a system?

(Just musing: if a local AI coding model could generate, say, 30 lines of good code per day at a cost of 500 watts x 24 hours = 12 kWh = $1.44 where I am, then I could perform the "analyst / debugger" part of my former "programmer / analyst / debugger" job description, and let the machine do the coding. It would be an interesting experiment. I have to admit that I have seen a decline in my coding skills in the past year.)

PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
matherp Guru Joined: 11/12/2012 Location: United Kingdom Posts: 10000 |
stef123: Very interesting post - thanks. Would a couple of these added to a standard graphics card be a good solution? |
||||
stef123 Regular Member Joined: 25/09/2024 Location: United Kingdom Posts: 83 |
Thank you - I hope you can understand what I am trying to say; English is not my native language.

I think you will be good to go with three of those cards for a 12B model, and the estimated throughput will be around 6-7 tokens per second, depending on the context size. But honestly, if you can get your hands on a 24 GB Nvidia K80 - they usually go for between $100 and $120 - take it, even if it is technically slower than three of those P106-100s. It will save you the investment in a motherboard, since you can use the existing x16 PCI Express slot (assuming you have no graphics card sitting in there) and be sure that all PCI Express lanes are used. In addition you get another 6 GB on top, which you can use for context.

Sticking three cards of fairly decent height (those cards usually occupy the height of two slots!) into a normal PC can be quite challenging, if not impossible, because of the limited number of PCIe slots. More or less "normal" PCs provide one or two x16 slots, sometimes an x4 and a few x1 slots. You -can- modify x4 and x1 slots to take cards with bigger connectors by cutting one end of the slot open - this has to be done with great care, but from the electrical point of view it is possible. Even I had to modify the riser slots of my server to get all the cards into it.

Things to consider when buying such an Nvidia K80: make sure your existing PC has enough room for the card (in height as well as length!), that the PSU provides an extra 8-pin connector suitable for the graphics card, and that it can handle the extra power. The K80 draws 300 watts under load; I think with a 500-watt PSU you're good to go.

The software I use (on a Linux system; I think the same goes for Windows) is Ollama, which handles the LLMs. Models can easily be downloaded by Ollama itself, for example with "ollama run deepseek-coder-v2:16b"; check out https://ollama.com/library?sort=popular. With it you get, for a first start, a simple shell window where you can interact with the model. Ollama automatically uses the accelerator card as soon as the drivers are correctly installed and available.

Ollama also provides an API, which is handy for programs like "Open WebUI". That can be a bit complex to set up properly, but in return you get a browser interface. Once logged into it (for example at "http://localhost:8080"), you can select the model you want to use (assuming you have previously downloaded it via Ollama) and set the context size and other parameters if needed. Conversations are automatically archived, sorted by days and weeks and so on, and you can easily recall them by clicking on them. It can even search the internet if needed, or you can attach documents and let the LLM analyze them (called RAG), etc. Open WebUI, as well as Ollama, can be started automatically at boot.

This is what I would recommend, because it is the least expensive way to get in touch with models up to 14/16B at a usable quantization. But keep in mind that cards like the K80 are only (CUDA) accelerators; they don't provide any video output. Running them in a normal PC means the CPU itself should provide integrated graphics.

Edited 2025-03-28 01:43 by stef123 |
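Since the API came up: for anyone who prefers to script Ollama rather than use the shell or Open WebUI, a minimal sketch of a request might look like the following. The model name and context size are just examples, and the model is assumed to have been pulled beforehand; the endpoint is Ollama's standard /api/generate on its default port.

```
# Minimal sketch: query a locally running Ollama instance over its REST API.
# Assumes Ollama is listening on its default port (11434) and that the model
# (name here is only an example) has already been pulled, e.g. with
#   ollama pull deepseek-coder-v2:16b
import json
import urllib.request

payload = {
    "model": "deepseek-coder-v2:16b",     # example model name
    "prompt": "Write a C function that reverses a string in place.",
    "stream": False,
    "options": {"num_ctx": 8192},         # context window, as discussed above
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

print(result["response"])
```

The same request is what a front end like Open WebUI issues behind the scenes, just with streaming enabled.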
||||
stef123 Regular Member Joined: 25/09/2024 Location: United Kingdom Posts: 83 |
I think so, but keep in mind that 32B or even 70B models would not give you extraordinary speed; I would estimate you would still only get 2-3 tokens per second with three or four of them on a 70B model. But the question is where to physically put them. A cheap old server (with two PSUs!) like the one I use is an option, but there is no guarantee that cards of this size will fit into it. For example, I have to keep the lid of my server open because the heatsinks already stick out of the case, and I couldn't use P40s because they are too long.

Some of the LLMs use lower-precision quantization in order to fit into memory and gain speed, which in turn reduces accuracy. It is rather complex to explain, so I will let ChatGPT do the job (although a "minimal loss" of accuracy is a bit understated - that may be roughly true for general questions, but not for specific tasks):

In the context of Large Language Models (LLMs), quantization refers to the process of reducing the precision of the model's weights and computations. This is typically done to decrease the memory footprint and computational requirements of the model, allowing it to run more efficiently, especially on hardware with limited resources such as mobile devices or edge devices.

How Quantization Works:
Precision Reduction: In a typical machine learning model, weights are represented using floating-point numbers (like 32-bit floats, FP32). Quantization involves converting these weights to lower precision, such as 16-bit floats (FP16) or even 8-bit integers (INT8).
Model Size: By reducing the precision of the weights, the model size decreases. For example, if you reduce the precision from 32 bits to 8 bits, you can reduce the memory usage by a factor of 4.
Performance: Quantization can also lead to performance improvements by speeding up inference, as computations with lower-precision numbers are typically faster. Some specialized hardware accelerators (like TPUs and some GPUs) are optimized for lower-precision arithmetic, further enhancing the speed of quantized models.
Trade-offs: The main trade-off in quantization is that reducing precision can cause a loss in model accuracy, as the reduced-precision weights may not perfectly represent the original full-precision weights. However, the loss in accuracy can often be minimal, and techniques like fine-tuning the model after quantization can help recover some of the lost performance.

Types of Quantization:
Post-training Quantization: This approach applies quantization after the model has been fully trained. The model is converted to a lower precision (e.g., INT8) and then fine-tuned, if necessary, to maintain accuracy.
Quantization-Aware Training (QAT): In this approach, the model is trained with quantization in mind from the beginning. The training process simulates the effects of quantization during the forward and backward passes. This can lead to better accuracy retention after quantization compared to post-training quantization.

Benefits of Quantization:
Reduced Memory Usage: Smaller models that can fit into memory more easily.
Faster Inference: Especially on hardware that supports lower-precision arithmetic.
Energy Efficiency: Lower-precision computations often consume less power.

Challenges:
Loss of Accuracy: There's a risk of slightly reduced accuracy due to the quantization process, although this can often be mitigated with fine-tuning or QAT.
Hardware Compatibility: Not all hardware supports all types of quantization, so the specific quantization strategy may depend on the hardware you're using.

In summary, quantization in LLMs is a technique that helps make these large models more efficient in terms of memory and computation, making it easier to deploy them on devices with less powerful hardware while trying to maintain accuracy as much as possible.

Edited 2025-03-28 01:59 by stef123 |
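To make the principle concrete, here is a toy sketch of the simplest form of post-training quantization: plain min-max rounding on a handful of made-up weights. Real schemes (such as the ones used in GGUF model files) are considerably more sophisticated, so this only shows the idea and the rounding error it introduces.

```
# Toy illustration of post-training quantization: squeeze FP32 weights onto a
# 4-bit grid (16 levels) and see what comes back after dequantization.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.1, 8).astype(np.float32)   # pretend these are model weights

levels = 2 ** 4                                   # 4-bit -> 16 representable values
scale = (w.max() - w.min()) / (levels - 1)
q = np.round((w - w.min()) / scale).astype(np.uint8)   # what actually gets stored
w_hat = q * scale + w.min()                       # dequantized values used at run time

print(w)                                          # original weights
print(w_hat)                                      # close, but not identical
print("max error:", np.abs(w - w_hat).max())      # the accuracy trade-off
```

Storing 4-bit codes instead of 32-bit floats is where the factor-of-eight memory saving comes from; the residual error is the accuracy loss discussed above.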
||||
matherp Guru Joined: 11/12/2012 Location: United Kingdom Posts: 10000 |
Fascinating. So if a model uses quantization, it is much better if the GPU has good FP16 performance as well as FP32, which means only the more modern GPUs with Tensor Cores are really well suited. |
||||
lizby Guru Joined: 17/05/2016 Location: United States Posts: 3299 |
Would two of them work better? I have no desktop hardware at present--only laptops--so anything which would handle NVidia devices could be suitable. ~ Edited 2025-03-28 03:41 by lizby PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
stef123 Regular Member Joined: 25/09/2024 Location: United Kingdom Posts: 83 |
Indeed. Tensor Cores are not only fast in one specific range, as most graphics cards are (FP32, sometimes also FP16), but in every other number format as well, and they roughly double their speed each time they step down to a lower precision. That doesn't mean a model gets slower on normal graphics cards at lower precision, because less precision usually needs less calculation effort - an 8-bit multiplication takes far less effort than a 32-bit one, even in hardware, because less bandwidth is needed - so they benefit from it too. But they won't process four 8-bit values in parallel the way Tensor Cores do. Ordinary graphics cards are good at FP32 calculations, as that is (usually) the only number format accelerated in hardware on them. But, as said, the lower the precision, the lower the quality of the output. If you go down to, for example, a 2-bit model, you can obtain very good speeds, but the output is more or less (really) unusable. Some quantizations like 4-bit or even 8-bit work quite well, but they are still compromises between quality, speed and memory needs. Some other techniques are also involved, but I won't go too deep into that for now. |
||||
stef123 Regular Member Joined: 25/09/2024 Location: United Kingdom Posts: 83 |
It depends on how "better" is defined. Two or more cards does not automatically mean double the speed. In the case of Ollama, if you use a model that needs less memory than a single card has, only one card is utilized; the other stays idle. As soon as the memory needs grow beyond that (beyond 24 GB here, through a larger model or a larger context size), the model is split up and both cards are used. If you use one card and a model that exceeds the card's memory, the CPU gets involved, and that drastically slows down the whole process for the reasons I explained previously.

You can theoretically and practically force Ollama to use both cards at the same time, but then the PCI Express bus becomes the bottleneck, because the bus itself, even in its newest version, is still slower than the memory on the cards - and the Nvidia K80 only uses PCI Express 3.0 anyway, which is limited to 16 GB per second of transfer rate. The GDDR5 memory on that card, with its 256-bit bus, can theoretically reach 224 GB per second.

If a token (= "word") is calculated across two or more cards, a lot of data has to be shovelled from one card to the other, because the results of the multiplications on one card must be made available to the other. This is a bottleneck, but it cannot be avoided. That is why Ollama works the way it does: it keeps a model local to one card as long as possible, because calculations can be done faster on a single card than on two cards with data shovelled between them over a comparatively slow interconnect.

You have to understand that LLMs consist of so-called layers, where one layer is in effect an x-by-y matrix full of numbers, and there are many layers stacked along the "z axis". So this "virtually" represents a 3D cube full of numbers or, for easier understanding, a stack of paper sheets lying on top of each other. Data - for example (part of) a question - comes in (very simply spoken!) at the first layer, matrix multiplications are performed on that layer, and the results are moved "down" to the next one. On that layer, matrix multiplications are again done with the results of the layer above, then passed on to the next layer, and so on.

Think about your paper stack. Each sheet has numbers written on it, and you have to multiply them all with a numeric representation of your "question" (in reality, a part of it). When you are finished with the calculations on the first sheet, you take the next sheet from the stack, which also has numbers written on it, multiply the results from the first sheet with the new numbers, and so on until you have worked through the entire stack. This goes on and on, if necessary multiple times.

So naturally, if you use more than one card, the results of the last layer on one card have to be transferred to the first layer on the other card, because the layers of the LLM are split across both cards - for example layers 1-24 on one card and layers 25-34 on the other. That in turn involves the slower PCI Express bus, because the results from layer 24 have to be transferred to layer 25.
It can also go the other way around, moving the results of the last layer on the second card back to the first layer of the first card, until certain conditions are met and the "word" can finally be generated. But yes, two cards are still better than one. As said, you wouldn't see a speed gain, but you would prevent the CPU from being used.

There are boards which provide electrically separate PCI Express buses, so that each card can have its own PCI Express lanes instead of sharing them, but on that question I am out of my depth. My server has two CPUs, so it automatically provides at least two full x16 slots - which are, on the other hand, divided into one x16 and two x8 slots per CPU anyway. I asked ChatGPT for boards with independent x16 slots, but I think the price versus the speed gain isn't worth it at all. A "typical" gaming mainboard with at least two (shared) x16 slots - far enough apart that two cards fit - would also do the job, even if the cards have to share the lanes. This isn't a high-end setup anyway.

ChatGPT:

If you're looking for a motherboard with two "real" separate PCI Express x16 slots (i.e., two physical x16 slots that are independent and do not share bandwidth or data lanes), you will typically need a high-end workstation or server motherboard designed for multi-GPU setups or specialized PCIe configurations. These motherboards are typically found in the X299, X399, and high-end AMD Threadripper or Intel Xeon-based platforms, which support multiple GPUs or other high-bandwidth devices. Here are some options to consider:

1. ASUS ROG Zenith II Extreme (for AMD Threadripper)
Socket: sTRX4
Chipset: AMD X399
PCIe Slots: It has multiple x16 slots (four total, two of which run at full x16 speed without sharing bandwidth) because of the Threadripper architecture.
Why it's good: Built to handle high bandwidth for gaming, creative, and workstation needs, with dedicated lanes for each x16 slot.

2. MSI Creator TRX40
Socket: sTRX4
Chipset: AMD TRX40
PCIe Slots: The board has 3 x16 slots, two of which can run independently at x16 speeds with the proper CPU.
Why it's good: Specifically designed for content creators who need separate PCIe lanes for GPUs or other high-performance devices.

3. Gigabyte Z490 AORUS XTREME WATERFORCE
Socket: LGA 1200
Chipset: Intel Z490
PCIe Slots: It offers two x16 slots, with at least one running at full x16 speed when using an Intel i9 or higher.
Why it's good: Premium motherboard for gaming and high-performance tasks with separate PCIe lanes for optimal multi-GPU setups.

4. Supermicro X11SPA-T (for Intel Xeon Scalable CPUs)
Socket: LGA 3647
Chipset: Intel C621
PCIe Slots: Supports up to 7 PCIe slots, with multiple x16 slots operating independently.
Why it's good: A workstation-class motherboard designed for high-performance computing and server environments, supporting multiple GPUs.

5. ASRock TRX40 Taichi (for AMD Threadripper)
Socket: sTRX4
Chipset: AMD TRX40
PCIe Slots: 3 PCIe 4.0 x16 slots, with at least two slots running at full x16 speeds.
Why it's good: Solid value for users who need multiple independent x16 slots, supporting high-performance GPUs or other peripherals.

6. EVGA X299 DARK
Socket: LGA 2066
Chipset: Intel X299
PCIe Slots: This motherboard has two full x16 slots with independent PCIe lanes for high-end setups.
Why it's good: Designed for overclocking and high-bandwidth usage, making it an excellent choice for serious performance builds.
Key Points to Keep in Mind:
CPU Choice Matters: For true independent x16 lanes, ensure the CPU you're using supports sufficient PCIe lanes. For instance, Intel's Xeon and AMD's Threadripper CPUs provide the required lanes for full x16 speeds.
Chipset: The chipset plays a big role in how PCIe lanes are distributed, so make sure the motherboard is designed to handle your specific PCIe requirements.
Motherboards for consumer-grade processors (like Intel Core i7/i9 or AMD Ryzen) may offer x16 slots, but they typically share data lanes between slots (e.g., when two GPUs are installed, one slot may run at x8 to share the bandwidth). For full x16 lanes with no sharing, high-end workstation or server motherboards like those listed above are your best bet.

Edited 2025-03-28 16:56 by stef123 |
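If it helps to picture the "paper stack" and the split across two cards, here is a toy sketch of the flow. It has nothing to do with a real transformer (made-up layer count, sizes and activation); it only shows that the intermediate result must cross the slower PCIe bus once per split point, for every token generated.

```
# Toy sketch of the layer stack split across two cards (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
hidden = 64
layers = [rng.normal(0.0, 0.05, (hidden, hidden)) for _ in range(12)]

card0, card1 = layers[:8], layers[8:]      # e.g. layers 1-8 on card 0, 9-12 on card 1

x = rng.normal(0.0, 1.0, hidden)           # numeric stand-in for (part of) the prompt
for w in card0:                            # stays entirely in card 0's fast VRAM
    x = np.tanh(w @ x)
# --- at this point the activation would have to cross the PCIe bus ---
for w in card1:                            # continues in card 1's VRAM
    x = np.tanh(w @ x)

print(x[:4])                               # final activations, later turned into a token
```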
||||
lizby Guru Joined: 17/05/2016 Location: United States Posts: 3299 |
@stef123 -- Thank you once again for the detailed response. I see that a lot of research is in order. The K80s are quite cheap on ebay in the U.S.--on the order of $40 or less. I found this youtube site of somebody's build of a $2,000 4xK80 system about a year ago. (He says it's not very good but doesn't exactly say for what purpose and why.)

Hardware (prices on ebay are very much down since a year ago):
HP Proliant DL350p Gen 8 server motherboard ≈ $80
2 x Intel Xeon E5-2667 8-core ≈ $40 ($20 each)
256 GB DDR3 ECC ≈ $130
8 x Nvidia P40 24 GB ≈ $1400 ($170 each)
2 x HPE 460 W switching PSU ≈ $50 ($25 each)
2 TB SATA ≈ $100
2300 W mining PSU ≈ $70
2 x HP cooler heatsink ≈ $40
6 x PCIe 8x-to-16x risers ≈ $60 ($10 each)
various other stuff ≈ $100
≈ $2000 for the whole build

PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
stef123 Regular Member Joined: 25/09/2024 Location: United Kingdom Posts: 83 |
Yes, that would be a way to go. I have a "heavily modified" DL380p Gen8, where unfortunately I have to use one of those cheap x1-to-x16 "expanders" for the fifth card (in reality it isn't really an expander; it still provides x1 speed only). I paid US$75 for the whole server - used, of course - including two E5-2670 CPUs, two integrated 460-watt PSUs and 128 GB of RAM, without hard drives. Why so cheap? These are ten-year-old datacenter servers and nobody wants them anymore. On top of that came two E5-2697 v2 CPUs, both for around $30.

As I said, there was "some" need for modification. The PCI Express slots had to be "mechanically aligned" for my purpose, the risers had to be unscrewed from their metal cages to give the cards' heatsinks some space, and I had to take 12 volts directly off the PSU backplane and connect it individually to each graphics card (you could also use an external PSU for that, no problem), keep the lid open, and so on. The good thing is that the server already has eight fans blowing over the CPUs, RAM and also over the cards. That doesn't apply to the fifth card; there I had to mount an external fan. If the fifth card isn't really needed (because its x1 adapter would slow down the other cards as well, as I explained), I simply disconnect its power cable, which also shuts down the "extender", which needs its own power connection. You wouldn't want to see the outcome of this construction, but it works. As said, I won't be able to use a K80, because it is too long and simply would not fit.

Servers are also fairly special. They usually don't provide 5 volts for other (internal) purposes (mine does, after some research), they have tons of temperature sensors (my server has around 30 of them spread through the entire system), and here and there they have their own special configuration needs in order to get the best performance out of them (if you can call it that).

As said, you have to keep an eye on the power consumption. If you get energy cheaply, that's good. At my place I pay 28 to 30 cents per kWh, so this would quickly add up to a fair amount of money if I used the server on a daily basis - which I don't.

But please don't take my words too literally; as a simple user I cannot give a fully comprehensive guide to which combinations of hardware are best and which are not. I can only give you some tips from my own experience; choosing the right hardware and testing everything has to be done by yourself. And even if you are not entirely happy with the results afterwards, consider that you won't get a rocket out of these components. It is in fact the cheapest way to run LLMs locally on dedicated hardware. As always, price is the key.

Edited 2025-03-29 00:54 by stef123 |
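For the record, the running-cost arithmetic, using only the figures from this post (draw and price are the rough values quoted above, so treat the result as an order of magnitude, not a bill):

```
# Rough daily electricity cost if the box ran flat out all day.
draw_kw = 0.55             # 500-600 W under load, from the post above
price_eur_per_kwh = 0.29   # 28-30 cents per kWh where I am

print(round(draw_kw * 24 * price_eur_per_kwh, 2), "EUR per day")   # ~3.8 EUR/day
```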
||||
lizby Guru Joined: 17/05/2016 Location: United States Posts: 3299 |
Thanks again for the details. How significant would you say was the change from two E5-2670 CPUs to two E5-2697 v2 CPUs? With this configuration of Nvidia GPUs, how much difference in tokens per second does the CPU make? Or, for that matter, the amount of memory on the main board?

I'm not looking for a rocket. I'd be happy with something that could chug along producing code at one C statement per hour (having been given sufficient training and a complete specification), running 24 hours a day at 500 watts. Electricity for me is $0.12 per kWh. If it could achieve 2.5 C statements per hour, fully debugged, that would be 60 statements per day, matching my very best day of coding, along about 1982.

Have you seen any metric for DeepSeek Coder which would indicate how tokens per second might translate to lines of C code per hour?

PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
stef123 Regular Member Joined: 25/09/2024 Location: United Kingdom Posts: 83 |
The CPU change did literally nothing. I was expecting more gain too, because I didn't have any experience with LLMs before either, and not much with servers or Xeon processors at all - yes, in my early years I built servers for a company, but that was the Pentium III era.

The thing is: these are CPUs from the Ivy Bridge era running on a Sandy Bridge board, so they have nothing in common with modern CPUs in terms of internal math extensions. They only provide AVX - no AVX2, AVX-512 and so on - which could probably have a good impact on calculation speed. For comparison: as I said, both CPUs together managed only 500 GFLOPS on the 2022 Linpack, which uses FP32 calculations rather than FP64 as the newer versions of Linpack do. That is rather weak - normally that would be the performance of one such CPU alone - but I think it is partly caused by the underlying Sandy Bridge architecture of the system (Ivy Bridge processors are at least technically compatible with Sandy Bridge systems). My old i5-4570 4-core processor, only one generation later (Haswell) but at least with AVX2, does 150 GFLOPS on the newer FP64 Linpack; the server, on the other hand, does 250 GFLOPS on the FP64 Linpack. You can see that the core count (24 in total) doesn't matter much on such old architectures. I didn't run Linpack on the old CPU setup, but I think the numbers were only slightly below this 16-core setup, maybe 350-380-ish on the FP32 Linpack.

That means: if CPU calculation speed matters at all, better go for systems with Xeon v3 processors, because they provide the AVX2 extension of the "newer" Haswell architecture. Technically they would not run in my system, of course - different socket, different chipset. As far as I remember, HP DL Gen9 servers use the Haswell architecture. But I wouldn't bother; it would not noticeably improve anything when running LLMs strictly on the graphics cards, which is in general the way to go. The only thing that improved was the maximum amount of memory: a single E5-2670 can handle "only" 384 GB, whereas the v2 can address up to 768 GB. Times two = about 1.5 TB. But you won't do that with LLMs, and not on DDR3 memory; I'm pretty sure models of that size would generate one token or less per hour.

If I run the DS-Coder-v2.5:32B, I get about 1.14 tokens per second with four cards, leaving the fifth one untouched (as said, because of the adapter that slows everything down). I would expect about 2-2.5 tokens per second from a dual K80 setup. BUT keep in mind that LLMs generally slow down the longer the conversation goes on and the more tokens have been generated in total, because the LLM has to keep track of the previous discussion in order to keep new answers in context. That does not happen immediately, but it hits fairly hard when using larger models or a bigger context size.

Edited 2025-03-29 16:24 by stef123 |
||||
lizby Guru Joined: 17/05/2016 Location: United States Posts: 3299 |
OK, thank you very much again for the details you've gone into. I think I know enough now to realize that I need to investigate further before spending much cash on hardware, and you've given me a good idea of what budget hardware might look like. (Can you provide a photograph of your setup? I don't care how hacked-looking it might be.)

I also think I know enough now to be able to state what my goal is (which may not yet be achievable at an acceptable price point): to have (or have access to) an AI coder which would match my own maximum productivity level from about 40 years ago--50 or so lines of C code per day--at a price under $10 a day, and preferably less than $3 per day.

So I think I should have a play on a U.S. host for DeepSeek, such as Together.ai (or similar), before I buy any hardware, just to see what the scope of the achievable is at this time. The model, I guess, would be Qwen 2.5 Coder 32B at $0.80 per million tokens. (Comments I've read about it range from "it's horrible" to "I tried it and it works".)

Lots to learn to even get started. ~ Edited 2025-03-30 00:52 by lizby

PicoMite, Armmite F4, SensorKits, MMBasic Hardware, Games, etc. on fruitoftheshed |
||||
matherp Guru Joined: 11/12/2012 Location: United Kingdom Posts: 10000 |
stef123: How do you measure tokens per second? I'm using Ollama on W11. Thanks |
||||
stef123 Regular Member Joined: 25/09/2024 Location: United Kingdom Posts: 83 |
Just add --verbose to the command line, for example "ollama run model:xxb --verbose". |
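If you are driving Ollama through its API rather than the interactive shell, the same figure can be computed from the timing fields in the non-streaming response. A minimal sketch (assumptions: default port, an example model name, and the eval_count / eval_duration fields as they appear in current Ollama responses - worth double-checking against your version):

```
# Sketch: measure tokens/second through the Ollama API instead of the shell.
import json
import urllib.request

payload = {"model": "deepseek-coder-v2:16b", "prompt": "Say hello.", "stream": False}
req = urllib.request.Request("http://localhost:11434/api/generate",
                             data=json.dumps(payload).encode(),
                             headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as resp:
    r = json.load(resp)

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
print(r["eval_count"] / (r["eval_duration"] / 1e9), "tokens/s")
```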
||||