Add `text-embeddings-inference` snippet in `README.md`
This PR adds a sample snippet on how to deploy and run inference with Text Embeddings Inference (TEI) via Docker in the `README.md`.
Thanks in advance 🤗
README.md
CHANGED

@@ -93,6 +93,53 @@ embed_document2 = outputs[1].outputs.data

</details>

+<details>
+<summary>via <a href="https://github.com/huggingface/text-embeddings-inference">Text Embeddings Inference</a></summary>
+
+- Via Docker on CPU:
+```bash
+docker run -p 8080:80 \
+    ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
+    --model-id jinaai/jina-embeddings-v5-text-small-classification \
+    --dtype float32 --pooling last-token
+```
+- Via Docker on NVIDIA GPU (Turing, Ampere, Ada Lovelace, Hopper or Blackwell):
+```bash
+docker run --gpus all --shm-size 1g -p 8080:80 \
+    ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 \
+    --model-id jinaai/jina-embeddings-v5-text-small-classification \
+    --dtype float16 --pooling last-token
+```
+
+> Alternatively, you can also run TEI with `cargo`; more information can be found in the [Text Embeddings Inference documentation](https://hf.co/docs/text-embeddings-inference).
+
+Send a request to `/v1/embeddings` to generate embeddings via the [OpenAI Embeddings API](https://platform.openai.com/docs/api-reference/embeddings/create):
+
+```bash
+curl -X POST http://127.0.0.1:8080/v1/embeddings \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "jinaai/jina-embeddings-v5-text-small-classification",
+        "input": [
+            "Query: Overview of climate change impacts on coastal cities",
+            "Document: The impacts of climate change on coastal cities are significant..."
+        ]
+    }'
+```
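As a sanity check of the request and response shapes, the same call can be sketched from Python's standard library. Only the endpoint, model id, and payload shape come from the curl example above; the helper names (`build_payload`, `extract_embeddings`) are illustrative, not part of TEI:

```python
import json
import urllib.request

# Local TEI server started by the Docker snippet above.
TEI_URL = "http://127.0.0.1:8080/v1/embeddings"

def build_payload(query, documents):
    # Same JSON body as the curl example: the query and each document are
    # prefixed with the instruction strings the model expects.
    return {
        "model": "jinaai/jina-embeddings-v5-text-small-classification",
        "input": [f"Query: {query}"] + [f"Document: {d}" for d in documents],
    }

def extract_embeddings(response):
    # OpenAI-style embeddings responses carry one vector per input under
    # "data"; sort by "index" to restore input order.
    return [item["embedding"]
            for item in sorted(response["data"], key=lambda it: it["index"])]

payload = build_payload(
    "Overview of climate change impacts on coastal cities",
    ["The impacts of climate change on coastal cities are significant..."],
)

# Sending the request requires the container above to be running:
# req = urllib.request.Request(TEI_URL, data=json.dumps(payload).encode(),
#                              headers={"Content-Type": "application/json"})
# embeddings = extract_embeddings(json.load(urllib.request.urlopen(req)))
```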
+
+Or via the [Text Embeddings Inference API](https://huggingface.github.io/text-embeddings-inference/) instead, to avoid manually formatting the inputs:
+
+```bash
+curl -X POST http://127.0.0.1:8080/embed \
+    -H "Content-Type: application/json" \
+    -d '{
+        "inputs": "Overview of climate change impacts on coastal cities",
+        "prompt_name": "query"
+    }'
+```
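Whichever endpoint is used, the returned vectors can be compared directly. A minimal sketch of ranking documents against a query by cosine similarity (the helper names are illustrative, not part of TEI):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank(query_vec, doc_vecs):
    # Return document indices sorted from most to least similar to the query.
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```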
+
+</details>
+
<details>
<summary>via <a href="https://github.com/ggml-org/llama.cpp">llama.cpp (GGUF)</a></summary>
After installing <a href="https://github.com/ggml-org/llama.cpp">llama.cpp</a>, one can run llama-server to host the embedding model as an OpenAI API-compatible HTTP server with the respective model version: