Mixtral is a state-of-the-art large language model developed by Mistral AI that uses a sparse mixture-of-experts (MoE) architecture.
To get started, follow the instructions at mistral-inference to download the model. Once downloaded, run llama_or_mistral_ckpt.py to convert the checkpoint into a MaxText-compatible format. You can then proceed with decoding, pretraining, and finetuning. A full Mixtral 8x7B example is available in the end_to_end/tpu/mixtral/8x7b test scripts.
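For orientation, a conversion step typically looks like the sketch below. This is a minimal, hedged example: it assumes llama_or_mistral_ckpt.py accepts --base-model-path, --maxtext-model-path, and --model-size arguments, and the paths shown are placeholders; consult the script itself and the end_to_end/tpu/mixtral/8x7b scripts for the exact interface.

```sh
# Sketch only: flag names are assumed and paths are placeholders.
# Convert the downloaded Mixtral weights into a MaxText-compatible checkpoint.
python3 MaxText/llama_or_mistral_ckpt.py \
  --base-model-path /path/to/downloaded/mixtral-8x7b \
  --maxtext-model-path gs://your-bucket/mixtral-8x7b/maxtext-ckpt \
  --model-size mixtral-8x7b
```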
Additionally, Mixtral integrates with MegaBlocks, an efficient dropless MoE strategy, which is activated by setting both the sparse_matmul and megablox flags to True (their default values).
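As an illustration, these flags can be passed as command-line overrides when launching a run. The sparse_matmul and megablox flags come from the description above; the remaining arguments (config file, run_name, output directory, synthetic dataset) are placeholder assumptions for a quick test, not required values.

```sh
# Sketch of a training launch with the dropless MegaBlocks path enabled.
# run_name, output directory, and dataset settings are illustrative placeholders.
python3 MaxText/train.py MaxText/configs/base.yml \
  model_name=mixtral-8x7b \
  sparse_matmul=True \
  megablox=True \
  run_name=mixtral-megablocks-test \
  base_output_directory=gs://your-bucket/output \
  dataset_type=synthetic
```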
Model FLOPs utilization (MFU) for training on v5p TPUs:

Model size | Accelerator type | TFLOP/chip/sec | Model FLOPs utilization (MFU) |
---|---|---|---|
Mixtral 8x7B | v5p-128 | 251.94 | 54.89% |