Inference Gates

OpenAI-compatible HTTP APIs for AI model inference, billed against your rb credits. No deployment, no infrastructure — create a key, point the OpenAI SDK at razorbridge.eu, and start making requests.

What are Inference Gates?

Inference Gates give you instant access to AI models through a standard API. They work with any tool that supports the OpenAI API format — Python SDKs, curl, LangChain, or your own code.

  • OpenAI-compatible chat completions endpoint
  • Billed per token against your rb credit balance
  • No infrastructure to manage — just an API key
  • Same account and credits as GPU Blades

Create an API key

Generate a key from the CLI:

$ rb gate keys create "Lab notebook"
✓ API key created

  Key: rb-gate-a1b2c3d4e5f6g7h8
  Name: Lab notebook

  ⚠ Store this key securely — it will not be shown again.

API keys are shown only once, at creation. Store the key somewhere secure, such as an environment variable or a secrets manager.

Use with OpenAI SDK

Point the OpenAI Python SDK at the razorBridge endpoint:

from openai import OpenAI

client = OpenAI(
    base_url="https://razorbridge.eu/api/v1/rb/gate",
    api_key="rb-gate-a1b2c3d4e5f6g7h8"
)

response = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "Explain backpropagation in simple terms"}]
)

print(response.choices[0].message.content)

Set your key as an environment variable to avoid hardcoding it:

export RB_GATE_KEY="rb-gate-a1b2c3d4e5f6g7h8"

Then read it from the environment in Python:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://razorbridge.eu/api/v1/rb/gate",
    api_key=os.environ["RB_GATE_KEY"]
)
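
Reading the key with os.environ["RB_GATE_KEY"] raises a bare KeyError when the variable is missing. A small helper can turn that into a clearer message. This is a minimal sketch: the function name load_gate_key is hypothetical and not part of any razorBridge tooling.

```python
import os


def load_gate_key(env_var: str = "RB_GATE_KEY") -> str:
    """Return the API key from the environment, or fail with a readable error.

    `load_gate_key` is an illustrative helper, not an official razorBridge API.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set; export it before running this script"
        )
    return key
```

Pass the result to the client as api_key=load_gate_key() in place of the direct os.environ lookup.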

Use with curl

Test the endpoint directly:

curl https://razorbridge.eu/api/v1/rb/gate/chat/completions \
  -H "Authorization: Bearer rb-gate-a1b2c3d4e5f6g7h8" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
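
The same request can be built without the OpenAI SDK, using only the Python standard library. The sketch below mirrors the curl call above: same URL, same two headers, same JSON body. The function names build_chat_request and send are illustrative, and the response is assumed to follow the OpenAI chat-completions shape, as in the SDK example.

```python
import json
import urllib.request

# Endpoint taken from the curl example above.
GATE_URL = "https://razorbridge.eu/api/v1/rb/gate/chat/completions"


def build_chat_request(api_key: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request equivalent to the curl call."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        GATE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)
```

Calling send(build_chat_request(key, "llama-3.1-8b", "Hello!")) performs the request; under the assumption above, the reply text would be at ["choices"][0]["message"]["content"] in the returned dictionary.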

Available models

List available models from the CLI:

$ rb gate models
MODEL               CONTEXT    PRICING (per 1M tokens)
llama-3.1-8b        128k       €0.18 input / €0.18 output
llama-3.1-70b       128k       €0.88 input / €0.88 output
mistral-large       128k       €2.00 input / €6.00 output
qwen-2.5-72b        128k       €0.90 input / €0.90 output

Model availability and pricing may change. Use rb gate models for the current list. All pricing is in EUR credits.
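
Because pricing is quoted per 1M tokens, the cost of a single request is (input tokens × input price + output tokens × output price) / 1,000,000. The sketch below applies that formula to the prices listed above. The PRICES table is a hand-copied snapshot and will go stale when pricing changes, and estimate_cost is a hypothetical helper, not part of the rb CLI or any SDK.

```python
from decimal import Decimal

# EUR per 1M tokens as (input, output), copied from the `rb gate models`
# output above. A snapshot only: re-check the CLI for current pricing.
PRICES = {
    "llama-3.1-8b": (Decimal("0.18"), Decimal("0.18")),
    "llama-3.1-70b": (Decimal("0.88"), Decimal("0.88")),
    "mistral-large": (Decimal("2.00"), Decimal("6.00")),
    "qwen-2.5-72b": (Decimal("0.90"), Decimal("0.90")),
}

TOKENS_PER_PRICE_UNIT = Decimal(1_000_000)


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the EUR cost of one request from its token counts."""
    input_price, output_price = PRICES[model]
    total = input_price * input_tokens + output_price * output_tokens
    return total / TOKENS_PER_PRICE_UNIT
```

For example, a mistral-large request with 1,000 input tokens and 500 output tokens comes to (1,000 × 2.00 + 500 × 6.00) / 1,000,000 = €0.005. Decimal is used instead of float so that small per-request amounts add up without rounding drift.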

Check usage

See your Inference Gates cost breakdown:

$ rb gate usage
Inference Gates usage (last 30 days):

MODEL             REQUESTS    TOKENS        COST
llama-3.1-8b      142         1.2M          €0.22
llama-3.1-70b     28          340K          €0.30
                                            ──────
Total                                       €0.52

Budget control

Professors and organizers can create API keys with restrictions via the Hub:

  • Model restrictions — limit which models a key can access (e.g., only allow llama-3.1-8b for coursework)
  • Budget caps — set a maximum spend per key (e.g., €5 per student) — coming soon
  • Rate limits — control requests per minute to prevent accidental cost spikes — coming soon

Today, account-level spend caps protect every account from runaway cost, and model restrictions are enforced per key. Per-key budgets and rate tiers are on the roadmap.

Organizers: go to Hub > Inference Gates > Keys to manage API keys for your students.
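
Until per-key budget caps are available, a script can keep its own running total and stop before it crosses a limit. The following is a client-side sketch only: SpendTracker and BudgetExceeded are hypothetical names, nothing here is enforced by razorBridge, and the account-level spend cap remains the actual safeguard.

```python
from decimal import Decimal


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push local spend over the cap."""


class SpendTracker:
    """Track estimated spend locally against a self-imposed cap (in EUR)."""

    def __init__(self, cap_eur: str) -> None:
        self.cap = Decimal(cap_eur)
        self.spent = Decimal("0")

    def record(self, cost_eur: str) -> Decimal:
        """Add a charge and return the remaining budget.

        Raises BudgetExceeded, leaving the total unchanged, if the charge
        would exceed the cap.
        """
        cost = Decimal(cost_eur)
        if self.spent + cost > self.cap:
            raise BudgetExceeded(
                f"charge of €{cost} would exceed the €{self.cap} cap "
                f"(already spent €{self.spent})"
            )
        self.spent += cost
        return self.cap - self.spent
```

A tracker created with SpendTracker("5") mirrors the €5-per-student example: call record() with each request's estimated cost before sending it, and skip the request if BudgetExceeded is raised.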

Roadmap — multi-modal endpoints

Inference Gates today serve OpenAI-compatible chat completions (LLMs). We are extending the gate so you can run audio, image, and video models through the same key, billing, and EU-routed infrastructure — no deployment on your side:

  • Audio — speech-to-text (transcription) and text-to-speech, OpenAI-compatible (/audio/transcriptions, /audio/speech)
  • Images — text-to-image generation (/images/generations)
  • Video — text-to-video and image-to-video generation, billed per second of output
  • Streaming — server-sent-events streaming for chat completions (stream: true)

These are on the roadmap and not yet live. The chat-completions endpoint above is available today. Want early access? Email [email protected].

Billing

Inference Gates costs are deducted from the same credit balance as GPU Blades. One account, one ledger — no separate billing for inference.

  • Costs are calculated per token, at the rate of the model used
  • Charges appear instantly in your transaction history
  • Check your balance anytime with rb credits