How to run this model without GPU?

by luminais - opened

How to run this model without GPU?

Backyard AI gives you a GUI on macOS & Windows to download Hugging Face models and run them locally in a simple chat-app. It's useful for confirming that a model works on your computer, and for trying a range of model sizes to understand their performance and limitations.

There are many other ways, but this is a traditional 'app'.

The most complicated parts are:

  • Install Backyard AI and make sure it's up to date
  • Get a login for Backyard AI (I think this is still optional, but I have one so that my use is linked to the people who make the software, and I can easily tell them if things are broken)
  • You need a Hugging Face login (you already have that, since you were able to ask the question above)
  • You have to select a model that fits into your RAM. E.g. a machine with 8 GB of RAM may really only have 6 or 7 GB free after disk cache and iGPU memory assignment; I find a 7B or 8B parameter model at 4-bit quantization works in 8 GB of RAM on a PC with only an iGPU (see the memory arithmetic and script sketch after this list)
  • You have to copy the URL of the repository
  • You have to paste it into Backyard AI
  • You have to wait for the large file to download (usually about 4 or 5 GB - this is a MASSIVE file, so if you're only on cellular, fixed-wireless (WISP), or satellite data, try to find a wired broadband connection, with WiFi only on the last hop between the router and your computer inside the house, to reduce RF congestion, satellite or WISP contention, cellular network load, etc.)
  • You then create the model. It asks a lot of questions, but you can mostly work through them without needing to change much.
  • You disable as much as you can from starting at boot (all unnecessary apps; turn off browser auto-start, background operation, and speed-boost features), restart, and shut down any other apps you couldn't disable
  • You run Backyard
  • You load the model
  • You chat with it
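
Everything above is point-and-click in Backyard AI. If you'd rather do the same thing from a script, here's a minimal CPU-only sketch using huggingface_hub and llama-cpp-python; the repo and file names are placeholders, so substitute the actual 4-bit GGUF file listed on the model page you want to run:

```python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Rough memory arithmetic: a 7B-parameter model at 4-bit quantization needs
# about 7e9 * 0.5 bytes ≈ 3.5 GB for weights, plus context/KV-cache and
# overhead, so roughly 4-5 GB total - which is why it fits on an 8 GB machine
# that only has 6-7 GB actually free.

# Download the quantized file from Hugging Face (placeholder names).
model_path = hf_hub_download(
    repo_id="some-user/some-model-GGUF",
    filename="some-model.Q4_K_M.gguf",
)

# Load it on CPU only and ask it something.
llm = Llama(model_path=model_path, n_ctx=2048, n_threads=4)
reply = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Can you run without a GPU?"}],
    max_tokens=128,
)
print(reply["choices"][0]["message"]["content"])
```

Either way, generation speed is limited by your CPU and RAM, so expect performance similar to what's described below.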

I get about 1-2 tokens per second (a bit less than a word a second, sometimes a bit more with simple words) on 10- to 12-year-old hardware with 8 GB of RAM.
It's not usable professionally, because the computer is fully occupied while generating an answer, and it takes a very long time to get a few pages out. But more modern computers are completely different. Also, if you can add a TPU or GPU on a high-speed bus such as PCIe, that's a potential way to keep old hardware useful.
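
To put that rate in perspective, a rough back-of-the-envelope estimate (the tokens-per-word and words-per-page figures are assumptions, not measurements):

```python
# How long "a few pages" of output takes at CPU-only speeds.
tokens_per_second = 1.5     # roughly what the old hardware above manages
tokens_per_word = 1.3       # assumption: typical English text
words = 3 * 500             # assumption: ~500 words per page, 3 pages
minutes = words * tokens_per_word / tokens_per_second / 60
print(f"about {minutes:.0f} minutes")   # prints "about 22 minutes"
```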
