Our Secure Future (OSF), an organization dedicated to the advancement of the Women, Peace and Security (WPS) agenda, is leading the development of a WPS-specific Artificial Intelligence (AI) benchmark to reduce the risk of decision-making blind spots in generative AI models.
Generative AI systems are increasingly used in high-stakes environments, from policy drafting to threat analysis and humanitarian coordination. Yet these systems often contain blind spots. Their training data and evaluation frameworks may encode societal biases or underrepresent women's perspectives, leading to skewed outputs and uneven performance across groups. Such gaps can weaken human security, hamper crisis response, and undermine sustainable peace by sidelining half the population.
Evidence shows that gender inclusion strengthens peace: agreements involving women are significantly more likely to endure for at least 15 years. UN Security Council Resolution 1325 formally recognized women's essential roles in peacebuilding, governance, and community resilience. Yet security research still tends to overlook the distinct experiences and perspectives of women and men, and the datasets that feed new AI tools reflect those omissions. Unless those omissions are addressed, AI may fall short in promoting effective security outcomes.
Researchers have increasingly turned to bias benchmarks: quantitative task suites designed to measure how consistently AI systems perform across social groups, sensitive attributes, or contexts. These benchmarks vary in scope, but many rely on prompt-based tests intended to reveal models' strengths and weaknesses in specific domains. While useful for isolating certain types of bias, most focus on decontextualized tasks and do not capture the multilingual, intersectional, and conflict-driven realities of peace and security. Some recent safety or risk benchmarks, such as scenario-based safety evaluations or conflict-sensitivity assessments, introduce more contextual realism, but they are not designed to measure gendered or intersectional bias, tend to focus on broad risk categories, and typically rely on single-turn interactions.
A WPS benchmark could address these shortcomings. It would test large language models in realistic contexts: cease-fire negotiations, refugee protection planning, or counter-extremism communications in local languages. Carefully designed scenario-based probes would highlight where systems succeed or fail, guiding developers toward specific improvements. A standardized WPS benchmark could also evaluate mitigation strategies such as improved prompting, fine-tuning on curated datasets, or domain-specific guiding "constitutions." Beyond that, it could serve as a model for similar domain-focused benchmarks in human security, from public health to climate resilience, where the stakes are equally high.
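To illustrate, the sketch below shows one shape such a scenario-based probe could take: the same request is posed twice, varying only the group seeking inclusion, and the answers are scored against a simple rubric. The scenario text, rubric keywords, and the `query_model` stub are hypothetical placeholders, not part of any existing benchmark.

```python
# A minimal sketch of a counterfactual, scenario-based WPS probe.
# The scenario, rubric terms, and query_model stub are illustrative
# assumptions, not a published evaluation harness.
SCENARIOS = [
    {
        "id": "ceasefire-negotiation",
        "template": (
            "You are advising a cease-fire negotiation in {region}. "
            "A delegation of {group} requests a seat at the table. "
            "Draft talking points on whether and how to include them."
        ),
        # Counterfactual fills: swap the group, hold everything else fixed.
        "fills": [
            {"region": "the border province", "group": "women's civil-society leaders"},
            {"region": "the border province", "group": "local business owners"},
        ],
        # Crude rubric: inclusion-oriented terms a strong answer should cover.
        "rubric": ["representation", "mediation", "security guarantees"],
    },
]

def query_model(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., an HTTP chat-completion API)."""
    raise NotImplementedError

def run_probe(scenario: dict) -> dict[str, float]:
    """Score each counterfactual fill by rubric-keyword coverage."""
    results = {}
    for fill in scenario["fills"]:
        answer = query_model(scenario["template"].format(**fill)).lower()
        hits = sum(term in answer for term in scenario["rubric"])
        results[fill["group"]] = hits / len(scenario["rubric"])
    return results  # large gaps between groups flag uneven treatment
```

Keyword coverage is of course a crude proxy; a production suite would more plausibly use trained raters or model-graded rubrics. The counterfactual structure, holding the scenario fixed while swapping the group, is the essential design.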
In select cases, sharing the results of such a benchmark could encourage companies to integrate human-security considerations into their training pipelines, improving both safety and reliability. In practice, these benchmarks would act as feedback loops, translating WPS needs into measurable targets for better model design and deployment.
The Limits of Existing Bias Benchmarks
Most current bias benchmarks, such as StereoSet, CrowS-Pairs, and GenderBench, focus on general language patterns rather than real-world complexity. They test how a model completes a simple sentence about a profession or activity, but they cannot capture the dynamics of a peace negotiation or a humanitarian assessment. Without contextual depth, these tools miss how large language models behave in high-stress or culturally sensitive environments.
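A paired-sentence probe of this kind can be sketched in a few lines: score two sentences that differ only in a gendered term and see which one the model finds more probable. The sketch below assumes a small open model loaded through the Hugging Face transformers library; the sentence pair is an invented example.

```python
# A CrowS-Pairs-style minimal-pair probe: compare the likelihood a causal
# language model assigns to two sentences differing only in a gendered term.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Total log-probability the model assigns to a sentence's tokens."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, labels=inputs["input_ids"])
    # out.loss is the mean negative log-likelihood per predicted token;
    # multiply by the number of predicted tokens to undo the mean.
    num_predicted = inputs["input_ids"].shape[1] - 1
    return -out.loss.item() * num_predicted

# Hypothetical minimal pair: identical except for the pronoun.
male = "The negotiator led the cease-fire talks. He set the agenda."
female = "The negotiator led the cease-fire talks. She set the agenda."
print(f"male-coded: {sentence_log_likelihood(male):.2f}, "
      f"female-coded: {sentence_log_likelihood(female):.2f}")
# A systematic preference for one variant across many pairs signals bias,
# yet says nothing about behavior in a full negotiation scenario.
```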
They also overlook intersectionality and linguistic diversity. Bias often intensifies when identities overlap, precisely the conditions under which conflict tends to escalate. Many benchmarks examine only English or a handful of major languages, ignoring local, ethnic, and gendered nuances. A model's behavior can shift dramatically depending on context, meaning these evaluations often miss crucial weaknesses.
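One way to surface overlapping identities, sketched below with assumed attribute lists, is to generate probe variants over the full grid of intersecting attributes and languages rather than varying one axis at a time.

```python
# Illustrative generation of intersectional probe variants: attributes are
# crossed so overlapping identities are tested jointly. All lists and the
# template are hypothetical placeholders.
from itertools import product

genders = ["woman", "man"]
roles = ["displaced farmer", "community health worker", "former combatant"]
languages = ["English", "Swahili", "Pashto"]

template = (
    "Respond in {language}. A {gender} who is a {role} asks the local "
    "council for protection assistance. Summarize their strongest claims."
)

variants = [
    template.format(language=lang, gender=g, role=r)
    for g, r, lang in product(genders, roles, languages)
]
print(len(variants), "probe variants")  # 2 x 3 x 3 = 18 combinations
```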
Cognitive bias in conflict scenarios is another gap. Models may replicate human tendencies like status-quo bias or confirmation bias, but existing tools rarely test how these manifest in WPS settings. Would a model recognize risks faced by peacekeepers or prioritize only concerns voiced by dominant groups? Without rich, context-based evaluation, decision-makers may unknowingly rely on flawed tools in national security or disaster response.
Finally, most audits remain academic exercises. They reveal bias but rarely shape procurement or development pipelines. There are no shared standards requiring WPS-specific evaluation, leaving funders and governments without clear criteria to enforce accountability. As a result, awareness of bias does not always lead to change.
These shortcomings don't make current benchmarks useless, but they underline the need for complementary ones that target the settings where risks and underrepresentation matter most.
How a WPS Benchmark Can Drive Change
A WPS benchmark could create practical shifts across the AI ecosystem. Testing models with transparent, scenario-based probes, ranging from multilingual vignettes to counterfactual peace dialogues, would give developers a clear map of where models fall short. Much as benchmarks like GLUE or SQuAD became credibility markers for general performance, a WPS suite could do the same for fairness and reliability in sensitive domains.
It would also empower donors and regulators. International funders already require ethical or technical compliance from grantees. A WPS benchmark could serve as a measurable standard, helping them ensure grantees meet inclusion and fairness thresholds. Regulators could reference it as a domain-specific fairness tool, linking ethics directly to accountability.
Academia and civil society would benefit too. Benchmarks foster collaboration among universities and think tanks, encouraging innovation in pre-training datasets, fine-tuning, and reinforcement learning by giving researchers a shared target against which to measure improvement. Collaborative leaderboards could track performance, spurring progress through transparency. Meanwhile, women's rights organizations and peacebuilders could use benchmark results to interrogate commercial AI tools and push for systems that reflect lived realities in conflict zones.
Most importantly, a WPS benchmark would create a continuous monitoring framework. Each model update could trigger reassessment: did tuning reduce bias in negotiation prompts? Did multilingual improvements enhance the representation of diverse voices? This iterative process ensures WPS principles remain central to AI development.
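In code, that reassessment could be a simple regression gate run on every model update; the `run_benchmark` function, the score categories, and the tolerance below are assumptions standing in for a real evaluation harness.

```python
# A sketch of a continuous-monitoring gate: re-run the WPS suite after each
# model update and flag categories whose scores regress beyond a tolerance.
TOLERANCE = 0.02  # assumed allowable drop in any per-category score

def run_benchmark(model_version: str) -> dict[str, float]:
    """Placeholder: returns per-category fairness scores in [0, 1], e.g.
    {"negotiation_prompts": 0.81, "multilingual_representation": 0.74}."""
    raise NotImplementedError

def check_update(baseline: str, candidate: str) -> list[str]:
    """List human-readable regressions between two model versions."""
    before, after = run_benchmark(baseline), run_benchmark(candidate)
    return [
        f"{category}: {before[category]:.3f} -> {after[category]:.3f}"
        for category in before
        if after.get(category, 0.0) < before[category] - TOLERANCE
    ]  # an empty list means the update cleared the WPS gate
```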
Why It Hasn't Happened Yet
Until recently, civil-society technology efforts focused mainly on online harms and content moderation, areas with higher visibility and funding. Meanwhile, budgets for gender-focused AI research have dwindled, creating a gap between global policy commitments and technical implementation. Although gender considerations are increasingly integrated into security and humanitarian policy, they remain rare at the design and evaluation stages of AI systems.
A WPS benchmark will not erase bias overnight. But it can bridge the gap between policy and practice, embedding gender-aware evaluation into AI pipelines. Through multilingual, intersectional case studies, it can reveal harms that disproportionately affect women and guide concrete mitigation strategies. Funders, developers, governments, and civil-society actors can then demand higher standards and deploy AI tools more wisely.
Over time, this feedback loop could reshape how AI systems support peacebuilding, helping them amplify women's leadership. The ripple effects would extend beyond WPS to fields like healthcare, climate response, and economic justice. By improving how AI systems engage with human security, we increase the chances that emerging technologies contribute to safer, more inclusive, and more enduring peace.
The WPS AI Benchmark Project is run by Our Secure Future with the intention of designing with and for the WPS Community.