UAV-Self-Positioning-23M-ZCN

Overview of the DenseUAV vision-based UAV self-positioning framework, showing the image feature extraction pipeline used to match UAV-view images with satellite-view references for localization in low-altitude urban environments.

Task: Image feature extraction / UAV self-positioning in low-altitude urban environments
Base model: timm/vit_small_patch16_224.augreg_in1k
Backbone: ViT-S
Library: timm, PyTorch
Dataset: Dmmm997/DenseUAV (DenseUAV)
Training Performance: Weights & Biases
License: Apache-2.0
Internship organization: Institute of Mathematical and Computational Sciences (IMACS), Ho Chi Minh City University of Technology, Vietnam
Supervisor: M.Sc. NGUYEN VAN GIA THINH


Model description

UAV-Self-Positioning-23M-ZCN is an image feature extraction model finetuned from the backbone timm/vit_small_patch16_224.augreg_in1k for vision-based UAV self-positioning in low-altitude urban environments.

This model is trained from the official DenseUAV baseline with a ViT-S backbone, using the code and training pipeline from the repository Dmmm1997/DenseUAV and the DenseUAV dataset (Dmmm997/DenseUAV). All credits for the dataset, baseline architecture, and evaluation protocol belong to the authors of DenseUAV. In this work I:

  • configure baseline/opts.yaml for this experiment,
  • train the model using the original baseline training scripts,
  • and export the checkpoint as a model on Hugging Face.

Training runs are logged with Weights & Biases; see the DenseUAV Non-GPS Training Performance dashboard linked above.

If you use this model in research or applications, please cite the DenseUAV paper and repository (see Citation below).


Origin and citation (important)

This model is fully based on:

  • The original DenseUAV repository: Dmmm1997/DenseUAV
  • The paper: Vision-Based UAV Self-Positioning in Low-Altitude Urban Environments (IEEE Transactions on Image Processing, 2024), arxiv.org/abs/2201.09201

If you use this model, please cite:

@misc{dai2023visionbaseduavselfpositioninglowaltitude,
      title={Vision-Based UAV Self-Positioning in Low-Altitude Urban Environments}, 
      author={Ming Dai and Enhui Zheng and Zhenhua Feng and Jiedong Zhuang and Wankou Yang},
      year={2023},
      eprint={2201.09201},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2201.09201}, 
}

And the repository:

Dmmm1997. DenseUAV: Vision-Based UAV Self-Positioning in Low-Altitude Urban Environments. GitHub, 2023. https://github.com/Dmmm1997/DenseUAV

I am not the author of DenseUAV; I only retrain and publish a checkpoint under the same Apache-2.0 license.


Quick usage

The model is published as an image feature extractor on Hugging Face, similar to the models under the Image Feature Extraction task on huggingface.co/models.

Example usage with timm (PyTorch):

  • The uploaded checkpoint was saved from a wrapper model, so the ViT weights are stored under the key prefix "backbone.backbone." (note the trailing dot).
  • The snippet below strips that prefix and loads the backbone; a successful run prints torch.Size([1, 384]).
import os

import torch
from PIL import Image
from timm import create_model
from torchvision import transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Create ViT backbone
model = create_model(
    "vit_small_patch16_224",
    pretrained=False,
    num_classes=0,  # feature extractor
).to(device)

# 2. Download checkpoint from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/Bancie/UAV-Self-Positioning-23M-ZCN/resolve/main/UAV_SelfPositioning_23M_ZCN.pth",
    map_location=device,
)

# 3. Strip wrapper prefix and load ViT weights
vit_prefix = "backbone.backbone."
vit_state_dict = {k[len(vit_prefix):]: v for k, v in state_dict.items() if k.startswith(vit_prefix)}
model.load_state_dict(vit_state_dict, strict=False)
model.eval()

# 4. Preprocess a local UAV-view image (224x224)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)),
])

image_path = "/path/to/your/drone_image.png"
assert os.path.exists(image_path), f"Image not found: {image_path}"

image = Image.open(image_path).convert("RGB")
x = transform(image).unsqueeze(0).to(device)

# 5. Extract feature vector
with torch.no_grad():
    feat = model(x)  # [1, D]
    feat = torch.nn.functional.normalize(feat, dim=-1)

print(feat.shape)  # torch.Size([1, 384])

In the DenseUAV setting, these features are used to:

  • build embeddings for UAV-view and satellite-view images,
  • compute distances (e.g., cosine or Euclidean),
  • and evaluate Recall and SDM as in the original DenseUAV code.
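The matching step above can be sketched as follows. This is a minimal illustration with random tensors standing in for real UAV-view and satellite-view embeddings; the variable names are illustrative and not part of the DenseUAV codebase:

```python
import torch
import torch.nn.functional as F

# Hypothetical example: 3 UAV-view query embeddings vs. a gallery of
# 5 satellite-view embeddings (dimension 384, as produced by ViT-S).
torch.manual_seed(0)
query_feats = F.normalize(torch.randn(3, 384), dim=-1)
gallery_feats = F.normalize(torch.randn(5, 384), dim=-1)

# On L2-normalized vectors, cosine similarity reduces to a matrix product.
similarity = query_feats @ gallery_feats.T  # shape [3, 5]

# For each query, rank gallery images by descending similarity.
ranks = similarity.argsort(dim=-1, descending=True)
print(ranks[:, 0])  # index of the best-matching satellite image per query
```

The original DenseUAV evaluation scripts perform this ranking over the full query/gallery splits before computing Recall and SDM.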

Training data

  • Dataset: DenseUAV (Dmmm997/DenseUAV)
  • Data type: UAV-view and satellite-view images captured in low-altitude urban environments
  • Split: train / query / gallery as described in the original DenseUAV README
  • Preprocessing: follows the baseline pipeline from the DenseUAV repository (resize, augmentation, normalization)

To fully reproduce training and evaluation, please clone the original repository and follow its instructions.


Training procedure

  • Code: directly based on the baseline implementation under baseline/ in Dmmm1997/DenseUAV.
  • Backbone: vit_small_patch16_224.augreg_in1k from timm.
  • Config: a customized baseline/opts.yaml for this experiment.
  • Scripts: trained using the original baseline scripts (for example train.py / train_test_local.sh) as suggested by the DenseUAV authors.

The architecture and loss functions follow the baseline; only minor hyperparameters and the random seed were adapted to my compute resources.


Intended uses and limitations

Intended uses

  • Research on UAV self-positioning and cross-view geo-localization.
  • Backbone / feature extractor for new methods (e.g., variants or extensions built on top of DenseUAV).
  • Quick experiments on DenseUAV without training from scratch.

Not intended for

  • Out-of-domain UAV positioning (very different cities, weather conditions, altitudes, or sensors).
  • Safety-critical applications without proper validation, calibration, and monitoring.

Limitations

  • Trained on a specific dataset; may not generalize to all UAV scenarios.
  • Processes single images; temporal information from video sequences is not explicitly modeled.

Evaluation

The evaluation protocol follows the original DenseUAV repository (Recall@K, SDM, etc.).
To re-evaluate this checkpoint on DenseUAV:

  1. Clone the DenseUAV repository and prepare the dataset as described there.
  2. Place this checkpoint under checkpoints/<name>/.
  3. Run:
python test.py --name <name> --test_dir <dataset_root>/test
python evaluate_gpu.py
python evaluateDistance.py --root_dir <dataset_root>

Replace <name> with your checkpoint directory name and <dataset_root> with the path to the DenseUAV dataset. For details, please refer to the original DenseUAV README.
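For intuition, Recall@K counts the fraction of queries whose ground-truth gallery match appears among the top-K ranked results. A minimal sketch (a simplified stand-in for the metric computed by evaluate_gpu.py, not the official implementation):

```python
import torch

def recall_at_k(similarity: torch.Tensor, gt_index: torch.Tensor, k: int) -> float:
    """Fraction of queries whose true gallery match appears in the top-k."""
    topk = similarity.topk(k, dim=-1).indices           # [num_queries, k]
    hits = (topk == gt_index.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# Toy example: 4 queries, 6 gallery items, ground-truth gallery indices.
torch.manual_seed(0)
sim = torch.randn(4, 6)
gt = torch.tensor([0, 1, 2, 3])
print(recall_at_k(sim, gt, k=1), recall_at_k(sim, gt, k=5))
```

SDM additionally weights matches by spatial distance on the map; for its exact definition, refer to the DenseUAV paper and evaluateDistance.py.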


License

  • Model checkpoint: released under Apache-2.0, the same license as the original DenseUAV project.
  • Original code and dataset: owned by the DenseUAV authors – see the LICENSE file in Dmmm1997/DenseUAV.

By using this model, you agree to comply with the Apache-2.0 license of the original DenseUAV project.
