SpectraFormer dataset
Description
This is the dataset used to train SpectraFormer, a transformer-based machine learning model for unmixing Raman spectra of the graphene buffer layer on a SiC substrate.
See more: arXiv paper, GitHub repo.
Each data file contains the following coordinates: wave_number (Raman shift values, in $\mathrm{cm}^{-1}$) and the spatial coordinates X_0 and X_1; depth maps also include X_2.
Dataset Structure
The dataset is organized into three sample categories:
- 4H-SiC-Piranha/ - 4H SiC polytype
- 6H_spectra_20250423/ - 6H SiC polytype, subdivided by acquisition parameters (e.g., 10s_1p/, 5s_10p/, 5s_5p/)
- main/ - Primary sample set with standard configurations
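To see which maps are available in each category, the folder layout can be walked with the standard library. This is a minimal sketch: the function name `list_nc_files` and the local root path are hypothetical, and it assumes the dataset was downloaded with its top-level folders intact.

```python
from collections import defaultdict
from pathlib import Path


def list_nc_files(root):
    """Group .nc files by the top-level folder they sit under.

    `root` is a local copy of the dataset (hypothetical path). Nested
    subfolders such as 10s_1p/ are searched recursively.
    """
    root = Path(root)
    groups = defaultdict(list)
    for path in sorted(root.rglob("*.nc")):
        parts = path.relative_to(root).parts
        # Files placed directly in the root have no category folder.
        category = parts[0] if len(parts) > 1 else "."
        groups[category].append(path)
    return dict(groups)
```

Calling `list_nc_files("path/to/dataset")` returns a dictionary whose keys are the category folder names and whose values are sorted lists of file paths.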
File Naming Convention
Each file follows this naming pattern: {system_type}_{spatial_dims}_{original_name}.nc
Filename components:
- spatial_dims - spatial map dimensions (e.g., 15x15 = 15×15 spatial points)
- Acquisition parameters in original name:
  - Xs (e.g., 10s, 5s) - acquisition time in seconds
  - Xp (e.g., 1p, 5p) - laser power percentage (1%, 5%, 10%)
  - Xacc (e.g., 1acc, 2acc) - number of accumulations
  - 100x - integration factor (e.g., 100× objective)
Example: 6H_spectra_20250423_15x15_10s_1p_2.nc = 6H sample, 15×15 spatial points, 10 s acquisition time, 1% laser power, 2nd acquisition file
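The naming convention can be decoded programmatically. The following is a sketch rather than an official parser: the function `parse_filename`, the regular expressions, and the output key names are all assumptions made for illustration. It treats the first `NxM` token as spatial_dims, so it may misread file names that deviate from the pattern.

```python
import re

# {system_type}_{spatial_dims}_{original_name}.nc
# system_type may itself contain underscores, so it is matched lazily up to
# the first token that looks like 15x15 (or 15x15x10 for a depth map).
FILENAME_RE = re.compile(
    r"^(?P<system_type>.+?)_(?P<spatial_dims>\d+(?:x\d+)+)_(?P<original_name>.+)\.nc$"
)

# Acquisition tokens described in the list above (key names are hypothetical).
TOKEN_RES = {
    "acquisition_time_s": re.compile(r"(\d+)s"),
    "laser_power_percent": re.compile(r"(\d+)p"),
    "accumulations": re.compile(r"(\d+)acc"),
}


def parse_filename(name):
    """Split a dataset file name into its documented components."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"file name does not follow the convention: {name!r}")
    info = match.groupdict()
    info["spatial_dims"] = tuple(int(n) for n in info["spatial_dims"].split("x"))
    # Pick out the acquisition tokens that are present; others are left alone.
    for token in info["original_name"].split("_"):
        for key, pattern in TOKEN_RES.items():
            found = pattern.fullmatch(token)
            if found:
                info[key] = int(found.group(1))
    return info


info = parse_filename("6H_spectra_20250423_15x15_10s_1p_2.nc")
print(info["system_type"], info["spatial_dims"], info["original_name"])
```

Tokens that match none of the patterns, such as the trailing file index, stay available in `original_name`.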
Data Format & Dimensions
Files are stored in NetCDF4 format with the following structure:
Coordinates:
- X_0, X_1 - spatial coordinates
- X_2 (optional) - depth coordinate for depth-profiling maps
- wave_number - Raman shift in cm^(-1)
Data variable:
- __xarray_dataarray_variable__ - Raman intensity counts at each spatial and spectral point
Each file contains a spatially-resolved Raman spectrum map, allowing analysis of spectral variations across the sample surface.
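To make this layout concrete, the sketch below builds a synthetic array with the same dimension names and shows three common selections. The values are random, and the wave_number range and the number of spectral points are made up for illustration; only the dimension names come from the description above.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in with the documented layout (not real spectra).
rng = np.random.default_rng(0)
wave_number = np.linspace(1000.0, 3000.0, 200)  # assumed range, illustration only
da = xr.DataArray(
    rng.poisson(100, size=(15, 15, wave_number.size)).astype(float),
    dims=("X_0", "X_1", "wave_number"),
    coords={
        "X_0": np.arange(15),
        "X_1": np.arange(15),
        "wave_number": wave_number,
    },
)

# Spectrum at a single spatial point, selected by integer position.
spectrum = da.isel(X_0=3, X_1=7)

# Spectrum averaged over the whole map.
mean_spectrum = da.mean(dim=("X_0", "X_1"))

# Restrict every spectrum to a Raman shift window, selected by label.
window = da.sel(wave_number=slice(1200.0, 1800.0))

print(spectrum.dims, mean_spectrum.shape, window.sizes["wave_number"])
```

The same calls should apply to a map loaded from one of the files, as long as its dimensions carry these names.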
Data Processing
Raw data from spectroscopy measurements (stored as .txt files with coordinates, wave numbers, and counts) was parsed and converted to NetCDF4 format using a spatial binning approach. This enables efficient multi-dimensional analysis with xarray.
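The conversion script itself is not reproduced here, so the following is only an illustrative sketch of how a long table of measurements can be arranged into a labeled cube. It assumes four numeric columns in the order x, y, wave number, counts, and it places each row on a grid by exact coordinate value. The actual raw file layout and the spatial binning step used for this dataset may differ. The function name `rows_to_dataarray` is hypothetical.

```python
import numpy as np
import xarray as xr


def rows_to_dataarray(rows):
    """Arrange (x, y, wave_number, counts) rows into a 3-D DataArray.

    Assumed column order; grid positions come from the unique coordinate
    values, and cells with no measurement stay NaN.
    """
    rows = np.asarray(rows, dtype=float)
    x_vals, x_idx = np.unique(rows[:, 0], return_inverse=True)
    y_vals, y_idx = np.unique(rows[:, 1], return_inverse=True)
    w_vals, w_idx = np.unique(rows[:, 2], return_inverse=True)

    cube = np.full((x_vals.size, y_vals.size, w_vals.size), np.nan)
    cube[x_idx, y_idx, w_idx] = rows[:, 3]

    return xr.DataArray(
        cube,
        dims=("X_0", "X_1", "wave_number"),
        coords={"X_0": x_vals, "X_1": y_vals, "wave_number": w_vals},
    )
```

The result could then be written with `DataArray.to_netcdf`, which stores an unnamed array under the variable name `__xarray_dataarray_variable__`.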
Usage
Download the folder contents into data/parsed_data_spatial/SiC-high-f to train your model. Load individual files with xarray:
import xarray as xr

# Load one map as an xarray.DataArray
da = xr.load_dataarray('6H_spectra_20250423_15x15_5s_5p_1.nc')

# Inspect dimensions, coordinates, and data
print(da.dims)         # tuple of dimension names, e.g. X_0, X_1, wave_number
print(da.sizes)        # dimension lengths, e.g. X_0: 15, X_1: 15, wave_number: 1800
print(da.wave_number)  # Raman shift values in cm^(-1)
print(da)
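Many training pipelines expect one spectrum per row. One way to get that from a loaded map is to stack the spatial dimensions, as sketched below. The helper `to_spectra_matrix` and the stacked dimension name `pixel` are hypothetical, and the demo uses a small synthetic array in place of a real file. Whether SpectraFormer expects exactly this input shape is not stated here.

```python
import numpy as np
import xarray as xr


def to_spectra_matrix(da):
    """Return a NumPy array of shape (n_spatial_points, n_wave_numbers).

    Every dimension other than wave_number (X_0, X_1 and, for depth maps,
    X_2) is merged into a single stacked dimension.
    """
    spatial_dims = [d for d in da.dims if d != "wave_number"]
    stacked = da.stack(pixel=spatial_dims).transpose("pixel", "wave_number")
    return stacked.values


# Synthetic 3 x 4 map with 5 spectral points, standing in for a loaded file.
demo = xr.DataArray(
    np.arange(60, dtype=float).reshape(3, 4, 5),
    dims=("X_0", "X_1", "wave_number"),
)
matrix = to_spectra_matrix(demo)
print(matrix.shape)
```

Rows follow the order of the stacked dimensions, so with X_0 listed first the row for point (i, j) of the demo map is `i * 4 + j`.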