ArabicNLP 2026 Shared Task

ArA-DF 2026

Arabic speech deepfake detection: build systems that distinguish bonafide human speech from spoofed Arabic audio across dialect and robustness conditions.

Task: Binary audio classification
Metric: Equal Error Rate
Venue: ArabicNLP 2026 @ EMNLP

Participant resources are live.

  • Dataset live
  • Baseline live
  • Development phase
  • Codabench open

Overview

A focused benchmark for Arabic audio authenticity.

ArA-DF 2026 is listed by ArabicNLP 2026 as Shared Task 14. Participating systems assign each Arabic speech sample to one of two labels: bonafide, for genuine human speech, or spoofed, for synthetic speech produced by text-to-speech or voice-conversion systems.

The task covers Modern Standard Arabic and Arabic dialects, with evaluation designed to reward detectors that generalize beyond narrow speakers, synthesis systems, and recording conditions.

Data & Baseline

Live dataset and baseline resources.

The ArA-DF 2026 dataset and baseline are hosted by ArabicSpeech on Hugging Face. Full loading commands and repository instructions remain on the Hugging Face cards.

Dataset on Hugging Face

Labeled train/dev data plus unlabeled development-test and final-test configs.

Baseline on Hugging Face

wav2vec2 XLS-R 300M frontend with an AASIST backend and helper scripts.

Codabench leaderboards

Competition pages for the two released tracks; leaderboards and submission phases are maintained there.

ArabicNLP shared-tasks page

Conference-wide dates, contact listing, and paper-submission updates.

Dataset snapshot

Labels use 0 for bonafide speech and 1 for spoofed speech. Released audio is 16 kHz mono PCM, packaged as lossless FLAC in WebDataset TAR shards, with metadata available through Hugging Face Parquet configs.

Dataset split rows and label availability
Split                     Rows     Labels
Train                     22,500   Yes
Dev                       21,000   Yes
Track 1 development-test  16,023   No
Track 1 final-test        144,210  No
Track 2 development-test  14,193   No
Track 2 final-test        127,746  No
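
As a quick sanity check on the table above, the following minimal sketch totals the labeled rows and computes how much of each track's test pool is development-test data. The dictionary keys are hypothetical names chosen for this example; they are not the Hugging Face config names.

```python
# Row counts copied from the split table above.
# The keys are illustrative names, not official config names.
SPLIT_ROWS = {
    "train": 22_500,
    "dev": 21_000,
    "track1_dev_test": 16_023,
    "track1_final_test": 144_210,
    "track2_dev_test": 14_193,
    "track2_final_test": 127_746,
}

# Only train and dev ship with labels.
labeled_total = SPLIT_ROWS["train"] + SPLIT_ROWS["dev"]


def dev_test_share(track):
    """Fraction of a track's test rows that belong to development-test."""
    dev = SPLIT_ROWS[f"{track}_dev_test"]
    final = SPLIT_ROWS[f"{track}_final_test"]
    return dev / (dev + final)


print("labeled rows:", labeled_total)
for track in ("track1", "track2"):
    print(track, f"dev-test share: {dev_test_share(track):.1%}")
```

For both tracks the development-test set comes out at roughly one tenth of the combined test rows, so final-test scores rest on about nine times as many utterances.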

Baseline snapshot

The baseline repository includes a trained checkpoint, XLS-R frontend weights, configuration, parquet-generation utility, and published result file. Lower EER is better.

Baseline model-card EER results
Baseline split            EER (%)  Utterances
Track 1 development-test  14.67    16,023
Track 1 final-test        14.54    144,210
Track 2 development-test  27.21    14,193
Track 2 final-test        27.11    127,746

Values are baseline model-card results, not participant rankings.

Submission

Phases and submission limits.

Each track has development-test sets for experimentation and blind final-test sets for official evaluation.

Experimentation

Development phase

Build systems with the released train and dev data, then submit predictions on the development-test sets through Codabench.

  • Up to 100 submissions per team.
  • Maximum 10 submissions per day.
  • Best submission appears on the public leaderboard.
  • No final commitment required; this phase is for experimentation.

Official scoring

Evaluation phase

Submit predictions on the blind final-test sets. These submissions determine the official challenge results.

  • Up to 3 system submissions per team.
  • The best (lowest) EER across the team's submissions is the official score.
  • Submit separately for Track 1, Track 2, or both.
  • System description papers remain required for final ranking eligibility.
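
The evaluation-phase rule above can be sketched as follows. This is an illustration only, not the organizers' scoring code: `official_score` is a hypothetical helper, and the EER values in the usage line are made up.

```python
MAX_FINAL_SUBMISSIONS = 3  # per team and per track, as stated above


def official_score(submission_eers):
    """Return the lowest EER among a team's final-test submissions.

    `submission_eers` is a list of EER values (in percent) for one track.
    Lower EER is better, so the best submission is the minimum.
    """
    if not 1 <= len(submission_eers) <= MAX_FINAL_SUBMISSIONS:
        raise ValueError("expected between 1 and 3 final submissions")
    return min(submission_eers)


# Hypothetical EERs for one team on one track.
print(official_score([14.9, 13.2, 13.8]))
```

Because Track 1 and Track 2 are submitted separately, a team entering both would apply this rule once per track.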

Important Dates

Task timeline.

ArA-DF 2026 milestones for data release, submissions, official scoring, and system-description papers.

Resources released

Training, dev, and open test data; evaluation scripts; and the baseline.

Development phase

Experimentation and submissions on the development-test sets.

Evaluation phase

Submissions on the final-test sets and official scoring.

Leaderboard freeze

Final leaderboard state is frozen for official results.

System papers due

System description paper submissions due.

Acceptance notification

Notification of acceptance for system-description papers.

Final results released

Official final results are released to participants.

Camera-ready due

Camera-ready versions of accepted system papers due.

Complete team registration before requesting Codabench access. Paper-submission instructions will be shared with registered teams when available.

Evaluation

Ranked by Equal Error Rate.

Equal Error Rate is the primary metric because it is threshold-independent and widely used in audio anti-spoofing evaluation. Lower EER is better; official scores come from the blind final-test submissions.
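
For reference, EER is the error rate at the operating point where the rate of missed spoofs equals the rate of falsely flagged bonafide samples. The following pure-Python sketch sweeps every candidate threshold and reports the point where the two rates are closest. It assumes that each sample has a real-valued score where a higher value means "more likely spoofed"; the official scoring script may use a different score polarity or interpolate between thresholds, so small numerical differences are possible.

```python
def compute_eer(labels, scores):
    """Approximate Equal Error Rate.

    labels: 0 = bonafide, 1 = spoofed (the dataset's convention).
    scores: higher means more likely spoofed (an assumption of this sketch).
    Returns the EER as a fraction in [0, 1].
    """
    n_spoof = sum(1 for label in labels if label == 1)
    n_bona = len(labels) - n_spoof
    if n_spoof == 0 or n_bona == 0:
        raise ValueError("both classes are required to compute EER")

    pairs = sorted(zip(scores, labels))
    missed = 0   # spoofed samples scoring below the threshold
    passed = 0   # bonafide samples scoring below the threshold
    # Threshold below every score: nothing is missed, every bonafide is flagged.
    best_gap, eer = 1.0, 0.5

    i = 0
    while i < len(pairs):
        # Move the threshold just above the next group of tied scores.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                missed += 1
            else:
                passed += 1
            j += 1
        miss_rate = missed / n_spoof
        false_alarm_rate = (n_bona - passed) / n_bona
        gap = abs(miss_rate - false_alarm_rate)
        if gap < best_gap:
            best_gap = gap
            eer = (miss_rate + false_alarm_rate) / 2
        i = j
    return eer


# Toy example: one bonafide sample scores higher than one spoofed sample.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]
print(f"EER = {compute_eer(labels, scores):.2%}")
```

In the toy example, the threshold that balances the two error types misses one of four spoofed samples and flags one of four bonafide samples, giving an EER of 25%.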

EER: Primary ranking metric
Accuracy: Secondary leaderboard metric
Macro-F1: Secondary leaderboard metric
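
Unlike EER, the two secondary metrics are computed from hard 0/1 decisions. The sketch below shows the standard definitions in pure Python. How the leaderboard derives hard decisions from a submission is not specified here, so the predictions in the example are simply assumed to be given.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 (0 = bonafide, 1 = spoofed)."""
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with one error in each direction.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]
print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro-F1 = {macro_f1(y_true, y_pred):.3f}")
```

Macro-F1 weights both classes equally regardless of how many samples each has, which is why it can diverge from accuracy when bonafide and spoofed samples are imbalanced.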

Contact & Forum

Where to ask questions and report issues.

Use the public Google Group for task questions, discussions, announcements, and clarifications. Route private, platform, and repository issues to the appropriate channel below.

Registration

Team registration form

Register your team first, then request access on the Track 1 and/or Track 2 Codabench pages using the same team name and contact email.

Public forum

Questions, discussions, and announcements

Use the ArA-DF 2026 Google Group for participant-wide questions, official clarifications, and shared announcements.

Private issues

Team-specific or sensitive support

Email the organizers for private issues such as team changes, registration mistakes, or sensitive access problems.

Submission platform

Codabench submissions and leaderboards

Use the corresponding Codabench competition page for track-specific submission, leaderboard, and platform issues.

Dataset and baseline

Hugging Face repository issues

Use the Hugging Face dataset and baseline repositories for data, loading, baseline, and model-card questions.

Organizers

Organizing team.

For public task questions, discussions, and announcements, use the ArA-DF 2026 Google Group: ara-df-2026@googlegroups.com.

Vasista Sai Lodagala

HUMAIN, Saudi Arabia

Yassine El Kheir

DFKI, Germany

Sara Althubaiti

HUMAIN, Saudi Arabia

Pedro Moreno Mengibar

HUMAIN, Saudi Arabia

Ahmed Ali

HUMAIN, Saudi Arabia