---
library_name: peft
tags:
- contrastive-learning
- identity-linking
- cross-cloud
- security
- embeddings
- flan-t5
- lora
base_model: google/flan-t5-base
license: mit
pipeline_tag: feature-extraction
---

# Stage 4: Cross-Cloud Identity Embeddings

Part of the **Trinetra** multi-cloud threat detection pipeline (Group 24).

Maps logically equivalent cloud identities from AWS, Azure, and GCP to nearby points in a shared 128-dimensional embedding space using contrastive learning.

## Problem Solved

The same person has a different account identifier on each cloud provider:

- AWS: `user_alice`
- Azure: `user_alice_az`
- GCP: `user_alice_gcp`

Without identity linking, a cross-cloud pivot attack, in which stolen AWS credentials are reused on GCP, appears to the downstream graph neural network as two unrelated events. This model maps all three identifiers to nearby points so that Stage 5's graph can connect them and Stage 6's RGCN can detect the pivot.

## Architecture

| Component | Detail |
|-----------|--------|
| Base model | `google/flan-t5-base` (encoder only) |
| Fine-tuning | LoRA rank=16, alpha=32, target=q,k,v,o |
| Projection head | Linear(512→256→128) + ReLU + LayerNorm |
| Output dim | 128 (z_identity) |
| Training loss | Contrastive with margin=0.3 |
| Positive pairs | Same person, different cloud provider |
| Negative pairs | Different persons, any provider |
| Epochs | 50 × 40 steps |
| Hardware | Kaggle T4 x1 |

## Files

| File | Description |
|------|-------------|
| `adapter/` | LoRA adapter weights (flan-t5-base encoder) |
| `proj_head.pt` | Projection head weights (512→128) |
| `config.json` | Training configuration |
| `z_identity.parquet` | Pre-computed embeddings for all 33 pipeline identities |

## Quick Start

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel
from huggingface_hub import hf_hub_download

REPO = "sohomn/stage4-identity-embeddings"
BASE = "google/flan-t5-base"

tokenizer = AutoTokenizer.from_pretrained(REPO)
base = AutoModel.from_pretrained(BASE)
# The adapter lives in the repo's adapter/ folder, so pass it as a subfolder
# rather than appending it to the repo id.
encoder = PeftModel.from_pretrained(base, REPO, subfolder="adapter")
encoder.eval()

# Projection head; layer sizes must match the weights stored in proj_head.pt.
proj = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.LayerNorm(256),
    nn.Linear(256, 128)
)
proj.load_state_dict(torch.load(
    hf_hub_download(REPO, "proj_head.pt"), map_location="cpu"
))
proj.eval()

def embed(identity: str, provider: str) -> list:
    text = f"identity: {identity} provider: {provider}"
    inputs = tokenizer(text, return_tensors="pt", max_length=32,
                       truncation=True, padding=True)
    with torch.no_grad():
        out = encoder.encoder(**inputs)
        # Mean-pool the encoder states over non-padding tokens.
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        emb = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        z = proj(emb)
        z = F.normalize(z, dim=-1)  # L2-normalise so dot product = cosine
    return z[0].tolist()

print(embed("user_alice", "AWS"))       # 128-dim vector
print(embed("user_alice_az", "Azure"))  # should be close to above
```

## Output Contract

- Shape: `(128,)` per identity, L2-normalised
- Equivalent identities: cosine similarity > 0.8
- Non-equivalent identities: cosine similarity < 0.3
- Consumed by: Stage 5 graph construction (last 128 dims of 514-dim node vector)

## Training Details

| Setting | Value |
|---------|-------|
| Identity registry | 11 persons × 3 providers = 33 identities |
| Positive pairs | 33 (same person, different cloud) |
| Negative pairs | sampled each batch |
| Effective batch | 32 (16 pos + 16 neg) |
| Best loss | see config.json |
| Seed | 42 |