---
license: apache-2.0
tags:
- ctranslate2
---
# Fast Inference with CTranslate2
Speed up inference 2x-8x using int8 quantization on CTranslate2's C++ runtime.

A quantized version of [google/flan-ul2](https://huggingface.co/google/flan-ul2).
```bash
pip install "hf_hub_ctranslate2>=2.0.6" "ctranslate2>=3.13.0"
```
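
To reproduce the quantization yourself, CTranslate2 ships a converter CLI. A minimal sketch along these lines should work; the exact command used for this repository is an assumption, with `int8_float16` chosen to match the compute types listed below:

```bash
# Assumed conversion command (not necessarily the one used for this repo).
# Converts the original transformers checkpoint to the CTranslate2 format
# with int8_float16 quantization.
ct2-transformers-converter --model google/flan-ul2 \
    --output_dir ct2fast-flan-ul2 --quantization int8_float16
```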

Checkpoint compatible with [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2):
- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`

```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-ul2"
model = TranslatorCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "Translate to german: How are you doing?"],
    min_decoding_length=24,
    max_decoding_length=32,
    max_input_length=512,
    beam_size=5,
)
print(outputs)
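
The checkpoint can also be loaded with plain `ctranslate2` plus the original tokenizer. A minimal sketch, assuming the repository downloads via `huggingface_hub.snapshot_download` and that the `google/flan-ul2` tokenizer matches the converted model; for CPU-only machines, swap in `device="cpu"` and `compute_type="int8"`:

```python
import ctranslate2
import transformers
from huggingface_hub import snapshot_download

# download the converted weights from the Hub (cached locally)
model_path = snapshot_download("michaelfeil/ct2fast-flan-ul2")

translator = ctranslate2.Translator(
    model_path, device="cuda", compute_type="int8_float16"
)
tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-ul2")

# CTranslate2 consumes and produces token strings, not token ids
input_tokens = tokenizer.convert_ids_to_tokens(
    tokenizer.encode("Translate to german: How are you doing?")
)
results = translator.translate_batch(
    [input_tokens], beam_size=5, max_decoding_length=32
)
output_tokens = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens)))
```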

# License and other remarks
This is just a quantized version of the original model. License conditions are intended to be identical to those of the original Hugging Face repository.