1. simonw 12 hours ago
    MLX is worth paying attention to. It's still pretty young (just over a year old) but the amount of activity in that ecosystem is really impressive, and it's quickly becoming the best way to run LLMs (and vision LLMs and increasingly audio models) on a Mac.

    Here's a fun way to start interacting with it (this loads and runs Llama 3.2 3B in a terminal chat UI):

      uv run --isolated --with mlx-lm python -m mlx_lm.chat
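
    You can also point the chat command at any of the quantized models in the mlx-community org on Hugging Face using --model (the repo name below is from memory, so double-check it against https://huggingface.co/mlx-community):

      uv run --isolated --with mlx-lm python -m mlx_lm.chat \
        --model mlx-community/Mistral-7B-Instruct-v0.3-4bit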
    1. masto 8 hours ago
      Ran it and it crapped out with a huge backtrace. I spotted `./build_bundled.sh: line 21: cmake: command not found` in it, so I guessed I needed cmake installed. `brew install cmake` and tried again. Then it crapped out with `Compatibility with CMake < 3.5 has been removed from CMake.`. Then I gave up.

      This is typical of what happens any time I try to run something written in Python. It may be easier than setting up an NVIDIA GPU, but that's a low bar.

      1. H3X_K1TT3N 7 hours ago
        This is absolutely every experience I have with python.
      2. simonw 8 hours ago
        Which Python version was that? Could be that MLX has binary wheels for some versions but not others.
        1. masto 8 hours ago
          Adding `-p 3.12` made it work. Leaving that here in case it helps someone.
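
            For anyone else hitting this, the full command that worked is just the original one plus the Python version pin:

              uv run --isolated -p 3.12 --with mlx-lm python -m mlx_lm.chat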
          1. porridgeraisin 7 hours ago
            Aha, knew you wouldn't give up. Not what our kind do
      3. jack_pp 2 hours ago
        For the record, these problems don't really exist on Linux in my experience.
    2. mathfailure 6 hours ago
      How much disk & RAM does it need?

      What's your tokens/sec rate (and on which device)?

      1. simonw 5 hours ago
        I've been running it on a 64GB M2. My favorite models to run tend to be about 20GB to download (eg Mistral Small 3.1) and use about 20GB of RAM while they are running.

        I don't have a token/second figure to hand but it's fast enough that I'm not frustrated by it.
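
        If you want a number, mlx-lm's generate command reports prompt and generation tokens-per-second when it finishes (at least in recent versions), so something like this gives a quick benchmark - swap in whichever mlx-community build you like:

          uv run --isolated --with mlx-lm python -m mlx_lm.generate \
            --model mlx-community/Llama-3.2-3B-Instruct-4bit \
            --prompt "Write a haiku about autumn"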

    3. _bin_ 6 hours ago
      I wish Apple would spend some more time paying attention to jax-metal :) It still crashes on just a few lines of code, and it seems like an obvious need if Apple wants to be serious about enabling ML work on their new MBPs.

      MLX looks really nice from the demo-level playing around I've done with it, but I usually stick to JAX so, you know, I can actually deploy on a server without trying to find someone who racks Macs.

      1. dkga 6 hours ago
        So, on an M4 I sometimes get faster training with plain vanilla JAX than with the same model in PyTorch or TensorFlow. And jax-metal often breaks :/
        1. _bin_ 4 hours ago
          No kidding? Might switch to CPU then. And yeah, jax-metal is so utterly unreliable. I ran across an issue that turns out to reduce to about a 2-line repro example, and it's been open on GitHub for the better part of a year without updates.
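
            For the switch itself, I believe setting JAX_PLATFORMS is enough to force the CPU backend without uninstalling the plugin:

              # With jax-metal installed, this should still run everything on CPU
              JAX_PLATFORMS=cpu python -c "import jax; print(jax.devices())"   # expect [CpuDevice(id=0)]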
  2. fsiefken 12 hours ago
    That's great. Like the Ryzen AI Max 395, Apple Silicon chips are also more energy efficient for LLMs (or gaming) than Nvidia.

    For 4-bit deepseek-r1-distill-llama-70b on a MacBook Pro M4 Max with the MLX version in LM Studio: 10.2 tok/sec on power and 4.2 tok/sec on battery / low power.

    For 4-bit gemma-3-27b-it-qat I get 26.37 tok/sec on power and 9.7 tok/sec on battery / low power.

    It'd be nice to know all the possible power tweaks to get those numbers higher, and to get additional insight into how LLMs work and interact with the CPU and memory.
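
    One concrete knob, assuming the battery numbers are mostly down to Low Power Mode: on recent macOS you can read and toggle it per power source from the terminal (I believe these are the right pmset flags):

      pmset -g | grep lowpowermode    # 1 means Low Power Mode is currently on
      sudo pmset -b lowpowermode 0    # disable it on battery
      sudo pmset -c lowpowermode 0    # disable it on AC power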

    1. vlovich123 18 minutes ago
      How does mlx compare with the llama.cpp backend for LM Studio?
    2. nico 11 hours ago
      Thank you for the numbers

      What have you used those models for, and how would you rate them in those tasks?

      1. realo 9 hours ago
        RPG prompts work very well with many of the models, but not the reasoning ones, because they end up thinking endlessly about how to be the absolute best game master possible...
        1. nico 9 hours ago
          Great use case. And very funny situation with the reasoning models! :)
    3. bigyabai 5 hours ago
      > Apple Silicon chips are also more energy efficient for LLMs (or gaming) than Nvidia.

      Which benchmarks are you working off of, exactly? Unless your memory is bottlenecked, neither raster nor compute workloads on the M4 are more energy efficient than Nvidia's 50-series silicon: https://browser.geekbench.com/opencl-benchmarks

  3. pj_mukh 13 hours ago
    Super cool, and will definitely check it out.

    But as a measure of what you can achieve with a course like this: does anyone know what the max tok/s vs. iPhone model plot looks like, and how does MLX change that plot?

  4. robbru 11 hours ago
    TinyLLM is very cool to see! I will def tinker with it. I've been using MLX format for local LLMs as of late. Kinda amazing to see these models become cheaper and faster. Check out the MLX community on HuggingFace. https://huggingface.co/mlx-community
    1. nico 11 hours ago
      Great recommendation about the community

      Any other resources like that you could share?

      Also, what kind of models do you run with mlx and what do you use them for?

      Lately I’ve been pretty happy with gemma3:12b for a wide range of things (generating stories, some light coding, image recognition). Sometimes I’ve been surprised by qwen2.5-coder:32b. And I’m really impressed by the speed and versatility, at such a tiny size, of qwen2.5:0.5b (I’m playing with fine-tuning it to see if I can get it to generate some decent conversations roleplaying as a character).
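
      If anyone wants to try the fine-tuning part via MLX, mlx-lm ships a LoRA trainer; something like the below should be the general shape, though the exact repo name and data layout here are my guesses (--data points at a folder containing train.jsonl / valid.jsonl):

        uv run --isolated --with mlx-lm python -m mlx_lm.lora \
          --model mlx-community/Qwen2.5-0.5B-Instruct-4bit \
          --train --data ./data --iters 600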

      1. simonw 11 hours ago
        I've shared a bunch of notes on MLX over the past year, many of them with snippets of code I've used to try out models: https://simonwillison.net/tags/mlx/

        I mainly use MLX for LLMs (with https://github.com/ml-explore/mlx-lm and my own https://github.com/simonw/llm-mlx which wraps that), vision LLMs (via https://github.com/Blaizzy/mlx-vlm) and running Whisper (https://github.com/ml-explore/mlx-examples/tree/main/whisper)

        I haven't tried mlx-audio yet (which can synthesize speech) but it looks interesting too: https://github.com/Blaizzy/mlx-audio

        The two best people to follow for MLX stuff are Apple's Awni Hannun - https://twitter.com/awnihannun and https://github.com/awni - and community member Prince Canuma who's responsible for both mlx-vlm and mlx-audio: https://twitter.com/Prince_Canuma and https://github.com/Blaizzy

        1. nico 10 hours ago
          Amazing. Thank you for the great resources!
  5. gitroom 10 hours ago
    dang, i've been messing with mlx too and it's blowing my mind how quick this stuff is getting on macs. feels like something's changing every time i blink