---
library_name: peft
tags:
  - contrastive-learning
  - identity-linking
  - cross-cloud
  - security
  - embeddings
  - flan-t5
  - lora
base_model: google/flan-t5-base
license: mit
pipeline_tag: feature-extraction
---

# Stage 4 — Cross-Cloud Identity Embeddings

Part of the **Trinetra** multi-cloud threat detection pipeline (Group 24).
Maps logically equivalent cloud identities from AWS, Azure, and GCP to
nearby points in a shared 128-dimensional embedding space using
contrastive learning.

## Problem Solved

The same person has different account identifiers across cloud providers:
- AWS: `user_alice`
- Azure: `user_alice_az`
- GCP: `user_alice_gcp`

Without identity linking, a cross-cloud pivot attack, in which stolen AWS
credentials are reused on GCP, looks like two unrelated events to the
downstream graph neural network. This model maps all three identifiers to
nearby points so that Stage 5's graph can connect them and Stage 6's RGCN
can detect the pivot.

## Architecture

| Component | Detail |
|-----------|--------|
| Base model | `google/flan-t5-base` (encoder only) |
| Fine-tuning | LoRA rank=16, alpha=32, target=q,k,v,o |
| Projection head | Linear(512→256) → ReLU → LayerNorm(256) → Linear(256→128) |
| Output dim | 128 (z_identity) |
| Training loss | Contrastive with margin=0.3 |
| Positive pairs | Same person, different cloud provider |
| Negative pairs | Different persons, any provider |
| Epochs | 50 (40 steps per epoch) |
| Hardware | Kaggle T4 x1 |
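
The table does not say which contrastive formulation was used. As a minimal sketch, assuming a cosine-based pairwise loss of the same form as `torch.nn.CosineEmbeddingLoss` (positives pay `1 - cos`, negatives pay only while their similarity exceeds the margin), the per-pair objective with margin=0.3 might look like the following. The function names `cosine` and `pair_loss` are illustrative and not taken from the training code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pair_loss(z_a, z_b, is_positive, margin=0.3):
    """Assumed per-pair loss: pull positives together, and push negatives
    apart until their cosine similarity is at or below the margin."""
    sim = cosine(z_a, z_b)
    if is_positive:
        return 1.0 - sim
    return max(0.0, sim - margin)


# Toy 2-D vectors standing in for the 128-dim embeddings.
aligned = pair_loss([1.0, 0.0], [1.0, 0.0], is_positive=True)
orthogonal = pair_loss([1.0, 0.0], [0.0, 1.0], is_positive=False)
too_close = pair_loss([1.0, 0.0], [1.0, 0.0], is_positive=False)
```

Under this assumed form, a negative pair stops contributing to the loss once its similarity falls to 0.3 or below, which would be consistent with the bound on non-equivalent identities stated in the output contract.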

## Files

| File | Description |
|------|-------------|
| `adapter/` | LoRA adapter weights (flan-t5-base encoder) |
| `proj_head.pt` | Projection head weights (512→128) |
| `config.json` | Training configuration |
| `z_identity.parquet` | Pre-computed embeddings for all 33 pipeline identities |

## Quick Start
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from huggingface_hub import hf_hub_download

REPO = "sohomn/stage4-identity-embeddings"
BASE = "google/flan-t5-base"

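tokenizer = AutoTokenizer.from_pretrained(REPO)
base      = AutoModel.from_pretrained(BASE)
# The adapter lives in the repo's adapter/ folder, so pass it via `subfolder`
# rather than appending it to the repo id.
encoder   = PeftModel.from_pretrained(base, REPO, subfolder="adapter")
encoder.eval()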

proj = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(), nn.LayerNorm(256), nn.Linear(256, 128)
)
proj.load_state_dict(torch.load(
    hf_hub_download(REPO, "proj_head.pt"), map_location="cpu"
))
proj.eval()

def embed(identity: str, provider: str) -> list[float]:
    """Return the L2-normalised 128-dim embedding for one cloud identity."""
    text   = f"identity: {identity} provider: {provider}"
    inputs = tokenizer(text, return_tensors="pt", max_length=32,
                       truncation=True, padding=True)
    with torch.no_grad():
        out  = encoder.encoder(**inputs)
        # Mean-pool the token states, ignoring padding positions.
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        emb  = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        z    = proj(emb)
        z    = F.normalize(z, dim=-1)
    return z[0].tolist()

print(embed("user_alice", "AWS"))      # 128-dim vector
print(embed("user_alice_az", "Azure")) # should be close to above
```

## Output Contract

- Shape: `(128,)` per identity, L2-normalised
- Equivalent identities: cosine similarity > 0.8
- Non-equivalent identities: cosine similarity < 0.3
- Consumed by: Stage 5 graph construction (last 128 dims of 514-dim node vector)
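
Because the vectors are L2-normalised, cosine similarity reduces to a plain dot product. The sketch below shows one way a consumer might apply the two thresholds above. The helper `link_decision` and its three-way label are hypothetical, and treating scores between 0.3 and 0.8 as "uncertain" is an assumption, since the contract does not cover that range.

```python
def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def link_decision(z_a, z_b, same_min=0.8, diff_max=0.3):
    """Classify a pair of L2-normalised embeddings using the contract's
    thresholds. The middle band is an assumed fallback, not part of the
    published contract."""
    sim = dot(z_a, z_b)
    if sim > same_min:
        return "same"
    if sim < diff_max:
        return "different"
    return "uncertain"


# Toy unit vectors standing in for 128-dim z_identity rows.
print(link_decision([1.0, 0.0], [1.0, 0.0]))
print(link_decision([1.0, 0.0], [0.0, 1.0]))
print(link_decision([1.0, 0.0], [0.6, 0.8]))
```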

## Training Details

| Setting | Value |
|---------|-------|
| Identity registry | 11 persons × 3 providers = 33 identities |
| Positive pairs | 33 (same person, different cloud) |
| Negative pairs | sampled each batch |
| Effective batch | 32 (16 pos + 16 neg) |
| Best loss | see `config.json` |
| Seed | 42 |
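
The pair counts follow from the registry size: each person has 3 provider accounts, which gives 3 unordered cross-provider pairs per person and 11 × 3 = 33 positive pairs in total. The sketch below reproduces that count and assembles one 16 + 16 batch. The person labels are placeholders, and the negative sampling strategy (uniform rejection sampling) is an assumption, since the card only says negatives are sampled each batch.

```python
import random
from itertools import combinations

PROVIDERS = ["AWS", "Azure", "GCP"]
# Placeholder person labels; the real registry names are not listed here.
PERSONS = [f"person_{i:02d}" for i in range(11)]

# Every (person, provider) combination is one identity.
identities = [(person, cloud) for person in PERSONS for cloud in PROVIDERS]

# Positive pairs: the same person on two different providers.
positives = [
    ((person, c1), (person, c2))
    for person in PERSONS
    for c1, c2 in combinations(PROVIDERS, 2)
]


def sample_negatives(rng, k):
    """Draw k identity pairs that belong to different persons."""
    pairs = []
    while len(pairs) < k:
        a, b = rng.sample(identities, 2)
        if a[0] != b[0]:
            pairs.append((a, b))
    return pairs


rng = random.Random(42)  # same seed value as listed in the table
batch = rng.sample(positives, 16) + sample_negatives(rng, 16)

print(len(identities), len(positives), len(batch))
```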