Publications | Yuxia Wang

2025

CLEF2025

Overview of PAN 2025: Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection

Janek Bevendorff, Daryna Dementieva, Maik Fröbe, and 8 more authors

In ECIR, 2025
Loki

Loki: An Open-Source Tool for Fact Verification

Haonan Li, Xudong Han, Hao Wang, and 7 more authors

In Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations, Jan 2025

Abs

We introduce Loki, an open-source tool designed to address the growing problem of misinformation. Loki adopts a human-centered approach, striking a balance between the quality of fact-checking and the cost of human involvement. It decomposes the fact-checking task into a five-step pipeline: breaking down long texts into individual claims, assessing their check-worthiness, generating queries, retrieving evidence, and verifying the claims. Instead of fully automating the claim verification process, provides essential information at each step to assist human judgment, especially for general users such as journalists and content moderators. Moreover, it has been optimized for latency, robustness, and cost efficiency at a commercially usable level. Loki is released under an MIT license and is available on GitHub. We also provide a video presenting the system and its capabilities.
Libra-leaderboard

Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability

Haonan Li, Xudong Han, Zenan Zhai, and 32 more authors

In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations), Apr 2025

Abs DOI

As large language models (LLMs) continue to evolve, leaderboards play a significant role in steering their development. Existing leaderboards often prioritize model capabilities while overlooking safety concerns, leaving a significant gap in responsible AI development. To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to achieve a balance rather than excelling in one dimension at the expense of some other ones. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.
FIRE

FIRE: Fact-checking with Iterative Retrieval and Verification

Zhuohan Xie, Rui Xing, Yuxia Wang, and 5 more authors

In Findings of the Association for Computational Linguistics: NAACL 2025, Apr 2025

Abs DOI

Fact-checking long-form text is challenging, and it is therefore common practice to break it down into multiple atomic claims. The typical approach to fact-checking these atomic claims involves retrieving a fixed number of pieces of evidence, followed by a verification step. However, this method is usually not cost-effective, as it underutilizes the verification model’s internal knowledge of the claim and fails to replicate the iterative reasoning process in human search strategies. To address these limitations, we propose FIRE, a novel agent-based framework that integrates evidence retrieval and claim verification in an iterative manner. Specifically, FIRE employs a unified mechanism to decide whether to provide a final answer or generate a subsequent search query, based on its confidence in the current judgment. We compare FIRE with other strong fact-checking frameworks and find that it achieves slightly better performance while reducing large language model (LLM) costs by an average of 7.6 times and search costs by 16.5 times. These results indicate that FIRE holds promise for application in large-scale fact-checking operations.
ADNA

Arabic Dataset for LLM Safeguard Evaluation

Yasser Ashraf, Yuxia Wang, Bin Gu, and 2 more authors

In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Apr 2025

Abs DOI

The growing use of large language models (LLMs) has raised concerns regarding their safety. While many studies have focused on English, the safety of LLMs in Arabic, with its linguistic and cultural complexities, remains under-explored. Here, we aim to bridge this gap. In particular, we present an Arab-region-specific safety evaluation dataset consisting of 5,799 questions, including direct attacks, indirect attacks, and harmless requests with sensitive words, adapted to reflect the socio-cultural context of the Arab world. To uncover the impact of different stances in handling sensitive and controversial topics, we propose a dual-perspective evaluation framework. It assesses the LLM responses from both governmental and opposition viewpoints. Experiments over five leading Arabic-centric and multilingual LLMs reveal substantial disparities in their safety performance. This reinforces the need for culturally specific datasets to ensure the responsible deployment of LLMs.
Jailbreak survey

Against The Achilles’ Heel: A Survey on Red Teaming for Generative Models

Lizhi Lin, Honglin Mu, Zenan Zhai, and 8 more authors

Journal of Artificial Intelligence Research, Apr 2025
GenAI-MGT

GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human

Yuxia Wang, Artem Shelmanov, Jonibek Mansurov, and 23 more authors

In Proceedings of the 1stWorkshop on GenAI Content Detection (GenAIDetect), Jan 2025

Abs

We present the GenAI Content Detection Task 1 – a shared task on binary machine generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 27 teams – to the Multilingual. We provide a comprehensive overview of the data, a summary of the results – including system rankings and performance scores – detailed descriptions of the participating systems, and an in-depth analysis of submissions.
OpenFactCheck

OpenFactCheck: Building, Benchmarking Customized Fact-Checking Systems and Evaluating the Factuality of Claims and LLMs

Yuxia Wang, Minghan Wang, Hasan Iqbal, and 4 more authors

In Proceedings of the 31st International Conference on Computational Linguistics, Jan 2025

Abs

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the fac- tual accuracy of their outputs. Difficulties lie in assessing the factuality of free-form responses in open domains. Also, different pa- pers use disparate evaluation benchmarks and measurements, which renders them hard to compare and hampers future progress. To mitigate these issues, we propose OpenFactCheck, a unified framework for building customized automatic fact-checking systems, benchmarking their accuracy, evaluating factuality of LLMs, and verifying claims in a document. OpenFactCheck consists of three modules: (i) CUSTCHECKER allows users to easily customize an automatic fact-checker and verify the factual correctness of documents and claims, (ii) LLMEVAL, a unified evaluation framework assesses LLM’s factuality ability from various perspectives fairly, and (iii) CHECKEREVAL is an extensible solution for gauging the reliability of automatic fact-checkers’ verification results using human-annotated datasets. Data and code are publicly available at https: //github.com/yuxiaw/openfactcheck.
HumanEval MGT

Is Human-Like Text Liked by Humans? Multilingual Human Detection and Preference Against AI

Yuxia Wang, Rui Xing, Jonibek Mansurov, and 8 more authors

arXiv preprint arXiv:2502.11614, Jan 2025
Kazakh Safety

Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts

Maiya Goloburda^*, Nurkhan Laiyk^*, Diana Turmakhan^*, and 8 more authors

ACL 2025 (Findings), Jan 2025
Kazakh SFT

Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh

Nurkhan Laiyk, Daniil Orel, Rituraj Joshi, and 4 more authors

ACL 2025, Jan 2025
KazMMLU

KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan

Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, and 8 more authors

ACL 2025, Jan 2025
KazLLM

Llama-3.1-Sherkala-8B-Chat: An Open Large Language Model for Kazakh

Fajri Koto, Rituraj Joshi, Nurdaulet Mukhituly, and 8 more authors

arXiv preprint arXiv:2503.01493, Jan 2025
Unlearning survey

A comprehensive survey of machine unlearning techniques for large language models

Jiahui Geng, Qing Li, Herbert Woisetschlaeger, and 6 more authors

arXiv preprint arXiv:2503.01854, Jan 2025
SpeechDialogue

SpeechDialogueFactory: Generating High-Quality Speech Dialogue Data to Accelerate Your Speech-LLM Development

Minghan Wang, Ye Bai, Yuxia Wang, and 3 more authors

Interspeech 2025, Jan 2025
FAID

FAID: Fine-grained AI-generated Text Detection using Multi-task Auxiliary and Multi-level Contrastive Learning

Minh Ngoc Ta, Dong Cao Van, Duc-Anh Hoang, and 6 more authors

arXiv preprint arXiv:2505.14271, Jan 2025
UrduFactCheck

UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking

Sarfraz Ahmad, Hasan Iqbal, Momina Ahsan, and 6 more authors

arXiv preprint arXiv:2505.15063, Jan 2025
VSCBench

VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration

Jiahui Geng, Qing Li, Zongxiong Chen, and 7 more authors

ACL 2025 (Findings), Jan 2025
FinChain

FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

Zhuohan Xie, Dhruv Sahnan, Debopriyo Banerjee, and 14 more authors

Jan 2025
HD-NDEs

HD-NDEs: Neural Differential Equations for Hallucination Detection in LLMs

Qing Li, Jiahui Geng, Zongxiong Chen, and 5 more authors

ACL 2025, Jan 2025

2024

M4

M4: Multi-generator, Multi-domain, and Multi-lingual Black-Box Machine-Generated Text Detection

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, and 12 more authors

In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Mar 2024
M4GT

M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, and 11 more authors

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Aug 2024

DOI
UrduNewsDetection

Detection of Human and Machine-Authored Fake News in Urdu

Muhammad Zain Ali, Yuxia Wang, Bernhard Pfahringer, and 1 more author

ACL 2025, Aug 2024
RethinkSTS

Rethinking STS and NLI in Large Language Models

Yuxia Wang, Minghan Wang, and Preslav Nakov

In Findings of the Association for Computational Linguistics: EACL 2024, Mar 2024
OpenFactCheck

OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs

Hasan Iqbal^*, Yuxia Wang^*, Minghan Wang, and 4 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Nov 2024

DOI
Factcheck-Bench

Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

Yuxia Wang, Revanth Gangi Reddy, Zain Muhammad Mujahid, and 10 more authors

In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024

DOI
ASR

Exploring the Potential of Multimodal LLM with Knowledge-Intensive Multimodal ASR

Minghan Wang, Yuxia Wang, Thuy-Trang Vu, and 2 more authors

In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024

Abs DOI

Recent advancements in multimodal large language models (MLLMs) have made significant progress in integrating information across various modalities, yet real-world applications in educational and scientific domains remain challenging. This paper introduces the Multimodal Scientific ASR (MS-ASR) task, which focuses on transcribing scientific conference videos by leveraging visual information from slides to enhance the accuracy of technical terminologies. Realized that traditional metrics like WER fall short in assessing performance accurately, prompting the proposal of severity-aware WER (SWER) that considers the content type and severity of ASR errors. We propose the Scientific Vision Augmented ASR (SciVASR) framework as a baseline method, enabling MLLMs to improve transcript quality through post-editing. Evaluations of state-of-the-art MLLMs, including GPT-4o, show a 45% improvement over speech-only baselines, highlighting the importance of multimodal information integration.
SFT

Demystifying Instruction Mixing for Fine-tuning Large Language Models

Renxi Wang, Haonan Li, Minghao Wu, and 4 more authors

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), Aug 2024

Abs DOI

Instruction tuning significantly enhances the performance of large language models (LLMs) across various tasks. However, the procedure to optimizing the mixing of instruction datasets for LLM fine-tuning is still poorly understood. This study categorizes instructions into three primary types: NLP downstream tasks, coding, and general chat. We explore the effects of instruction tuning on different combinations of datasets on LLM performance, and find that certain instruction types are more advantageous for specific applications but can negatively impact other areas. This work provides insights into instruction mixtures, laying the foundations for future research.
Factuality survey

Factuality of Large Language Models: A Survey

Yuxia Wang, Minghan Wang, Muhammad Arslan Manzoor, and 4 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

Abs DOI

Large language models (LLMs), especially when instruction-tuned for chat, have become part of our daily lives, freeing people from the process of searching, extracting, and integrating information from multiple sources by offering a straightforward answer to a variety of questions in a single place. Unfortunately, in many cases, LLM responses are factually incorrect, which limits their applicability in real-world scenarios. As a result, research on evaluating and improving the factuality of LLMs has attracted a lot of research attention recently. In this survey, we critically analyze existing work with the aim to identify the major challenges and their associated causes, pointing out to potential solutions for improving the factuality of LLMs, and analyzing the obstacles to automated factuality evaluation for open-ended text generation. We further offer an outlook on where future research should go.
SimulMT

Conversational simulmt: Efficient simultaneous translation with large language models

Minghan Wang, Thuy-Trang Vu, Yuxia Wang, and 2 more authors

arXiv preprint arXiv:2402.10552, Nov 2024
Uncertainty srvey

A Survey of Confidence Estimation and Calibration in Large Language Models

Jiahui Geng, Fengyu Cai, Yuxia Wang, and 3 more authors

In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Jun 2024

Abs DOI

Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks in various domains. Despite their impressive performance, they can be unreliable due to factual errors in their generations. Assessing their confidence and calibrating them across different tasks can help mitigate risks and enable LLMs to produce better generations. There has been a lot of recent research aiming to address this, but there has been no comprehensive overview to organize it and to outline the main lessons learned. The present survey aims to bridge this gap. In particular, we outline the challenges and we summarize recent technical advancements for LLM confidence estimation and calibration. We further discuss their applications and suggest promising directions for future work.
CDNA

A Chinese Dataset for Evaluating the Safeguards in Large Language Models

Yuxia Wang, Zenan Zhai, Haonan Li, and 6 more authors

In Findings of the Association for Computational Linguistics: ACL 2024, Aug 2024

Abs DOI

Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks. Previous studies have proposed comprehensive taxonomies of LLM risks, as well as corresponding prompts that can be used to examine LLM safety. However, the focus has been almost exclusively on English. We aim to broaden LLM safety research by introducing a dataset for the safety evaluation of Chinese LLMs, and extending it to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments over five LLMs show that region-specific risks are the prevalent risk type. Warning: this paper contains example data that may be offensive, harmful, or biased. Our data is available at https://github.com/Libr-AI/do-not-answer.
DNA

Do-Not-Answer: Evaluating Safeguards in LLMs

Yuxia Wang, Haonan Li, Xudong Han, and 2 more authors

In Findings of the Association for Computational Linguistics: EACL 2024, Mar 2024

Abs

With the rapid evolution of large language models (LLMs), new and hard-to-predict harmful capabilities are emerging. This requires developers to identify potential risks through the evaluation of “dangerous capabilities” in order to responsibly deploy LLMs. Here we aim to facilitate this process. In particular, we collect an open-source dataset to evaluate the safeguards in LLMs, to facilitate the deployment of safer open-source LLMs at a low cost. Our dataset is curated and filtered to consist only of instructions that responsible language models should not follow. We assess the responses of six popular LLMs to these instructions, and we find that simple BERT-style classifiers can achieve results that are comparable to GPT-4 on automatic safety evaluation. Our data and code are available at https://github.com/Libr-AI/do-not-answer
SemEval2024MGT

SemEval-2024 Task 8: Multidomain, Multimodel and Multilingual Machine-Generated Text Detection

Yuxia Wang, Jonibek Mansurov, Petar Ivanov, and 7 more authors

In Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024), Jun 2024

Abs DOI

We present the results and the main findings of SemEval-2024 Task 8: Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection. The task featured three subtasks. Subtask A is a binary classification task determining whether a text is written by a human or generated by a machine. This subtask has two tracks: a monolingual track focused solely on English texts and a multilingual track. Subtask B is to detect the exact source of a text, discerning whether it is written by a human or generated by a specific LLM. Subtask C aims to identify the changing point within a text, at which the authorship transitions from human to machine. The task attracted a large number of participants: subtask A monolingual (126), subtask A multilingual (59), subtask B (70), and subtask C (30). In this paper, we present the task, analyze the results, and discuss the system submissions and the methods they used. For all subtasks, the best systems used LLMs.
Empathy

Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Comprehension of LMs

Muhammad Arslan Manzoor, Yuxia Wang, Minghan Wang, and 1 more author

In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024

Abs DOI

Empathy plays a pivotal role in fostering prosocial behavior, often triggered by the sharing of personal experiences through narratives. However, modeling empathy using NLP approaches remains challenging due to its deep interconnection with human interaction dynamics. Previous approaches, which involve fine-tuning language models (LMs) on human-annotated empathic datasets, have had limited success. In our pursuit of improving empathy understanding in LMs, we propose several strategies, including contrastive learning with masked LMs and supervised fine-tuning with large language models. While these methods show improvements over previous methods, the overall results remain unsatisfactory. To better understand this trend, we performed an analysis which reveals a low agreement among annotators. This lack of consensus hinders training and highlights the subjective nature of the task. We also explore the cultural impact on annotations. To study this, we meticulously collected story pairs in Urdu language and find that subjectivity in interpreting empathy among annotators appears to be independent of cultural background. Our systematic exploration of LMs’ understanding of empathy reveals substantial opportunities for further investigation in both task formulation and modeling.
Llm-detectaive

LLM-DetectAIve: a Tool for Fine-Grained Machine-Generated Text Detection

Mervat Abassy, Kareem Elozeiri, Alexander Aziz, and 21 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Nov 2024

Abs DOI

The ease of access to large language models (LLMs) has enabled a widespread of machine-generated texts, and now it is often hard to tell whether a piece of text was human-written or machine-generated. This raises concerns about potential misuse, particularly within educational and academic domains. Thus, it is important to develop practical systems that can automate the process. Here, we present one such system, LLM-DetectAIve, designed for fine-grained detection. Unlike most previous work on machine-generated text detection, which focused on binary classification, LLM-DetectAIve supports four categories: (i) human-written, (ii) machine-generated, (iii) machine-written, then machine-humanized, and (iv) human-written, then machine-polished. Category (iii) aims to detect attempts to obfuscate the fact that a text was machine-generated, while category (iv) looks for cases where the LLM was used to polish a human-written text, which is typically acceptable in academic writing, but not in education. Our experiments show that LLM-DetectAIve can effectively identify the above four categories, which makes it a potentially useful tool in education, academia, and other domains.LLM-DetectAIve is publicly accessible at https://github.com/mbzuai-nlp/LLM-DetectAIve. The video describing our system is available at https://youtu.be/E8eT_bE7k8c.

2023

Collective STS

Collective Human Opinions in Semantic Textual Similarity

Yuxia Wang, Shimin Tao, Ning Xie, and 3 more authors

Transactions of the Association for Computational Linguistics, Nov 2023

DOI
PhD Thesis

Towards Accurate and Reliable Modelling for Semantic Textual Similarity

Yuxia Wang

The University of Melbourne, Nov 2023

2022

NoisyRegulation

Noisy Label Regularisation for Textual Regression

Yuxia Wang, Timothy Baldwin, and Karin Verspoor

In Proceedings of the 29th International Conference on Computational Linguistics, Oct 2022

Abs

Training with noisy labelled data is known to be detrimental to model performance, especially for high-capacity neural network models in low-resource domains. Our experiments suggest that standard regularisation strategies, such as weight decay and dropout, are ineffective in the face of noisy labels. We propose a simple noisy label detection method that prevents error propagation from the input layer. The approach is based on the observation that the projection of noisy labels is learned through memorisation at advanced stages of learning, and that the Pearson correlation is sensitive to outliers. Extensive experiments over real-world human-disagreement annotations as well as randomly-corrupted and data-augmented labels, across various tasks and domains, demonstrate that our method is effective, regularising noisy labels and improving generalisation performance.
NLI

Capture Human Disagreement Distributions by Calibrated Networks for Natural Language Inference

Yuxia Wang, Minghan Wang, Yimeng Chen, and 5 more authors

In Findings of the Association for Computational Linguistics: ACL 2022, May 2022

Abs DOI

Natural Language Inference (NLI) datasets contain examples with highly ambiguous labels due to its subjectivity. Several recent efforts have been made to acknowledge and embrace the existence of ambiguity, and explore how to capture the human disagreement distribution. In contrast with directly learning from gold ambiguity labels, relying on special resource, we argue that the model has naturally captured the human ambiguity distribution as long as it’s calibrated, i.e. the predictive probability can reflect the true correctness likelihood. Our experiments show that when model is well-calibrated, either by label smoothing or temperature scaling, it can obtain competitive performance as prior work, on both divergence scores between predictive probability and the true human opinion distribution, and the accuracy. This reveals the overhead of collecting gold ambiguity labels can be cut, by broadly solving how to calibrate the NLI network.
Diformer

Diformer: Directional Transformer for Neural Machine Translation

Minghan Wang, Jiaxin Guo, Yuxia Wang, and 8 more authors

In Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, Jun 2022

Abs

Autoregressive (AR) and Non-autoregressive (NAR) models have their own superiority on the performance and latency, combining them into one model may take advantage of both. Current combination frameworks focus more on the integration of multiple decoding paradigms with a unified generative model, e.g. Masked Language Model. However, the generalization can be harmful on the performance due to the gap between training objective and inference. In this paper, we aim to close the gap by preserving the original objective of AR and NAR under a unified framework. Specifically, we propose the Directional Transformer (Diformer) by jointly modelling AR and NAR into three generation directions (left-to-right, right-to-left and straight) with a newly introduced direction variable, which works by controlling the prediction of each token to have specific dependencies under that direction. The unification achieved by direction successfully preserves the original dependency assumption used in AR and NAR, retaining both generalization and performance. Experiments on 4 WMT benchmarks demonstrate that Diformer outperforms current united-modelling works with more than 1.5 BLEU points for both AR and NAR decoding, and is also competitive to the state-of-the-art independent AR and NAR models.
MT

The HW-TSC’s Speech to Speech Translation System for IWSLT 2022 Evaluation

Jiaxin Guo, Yinglu Li, Minghan Wang, and 9 more authors

In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), May 2022

Abs DOI

The paper presents the HW-TSC’s pipeline and results of Offline Speech to Speech Translation for IWSLT 2022. We design a cascade system consisted of an ASR model, machine translation model and TTS model to convert the speech from one language into another language(en-de). For the ASR part, we find that better performance can be obtained by ensembling multiple heterogeneous ASR models and performing reranking on beam candidates. And we find that the combination of context-aware reranking strategy and MT model fine-tuned on the in-domain dataset is helpful to improve the performance. Because it can mitigate the problem that the inconsistency in transcripts caused by the lack of context. Finally, we use VITS model provided officially to reproduce audio files from the translation hypothesis.
Uncertainty

Uncertainty Estimation and Reduction of Pre-trained Models for Text Regression

Yuxia Wang, Daniel Beck, Timothy Baldwin, and 1 more author

Transactions of the Association for Computational Linguistics, May 2022

Abs DOI

State-of-the-art classification and regression models are often not well calibrated, and cannot reliably provide uncertainty estimates, limiting their utility in safety-critical applications such as clinical decision-making. While recent work has focused on calibration of classifiers, there is almost no work in NLP on calibration in a regression setting. In this paper, we quantify the calibration of pre- trained language models for text regression, both intrinsically and extrinsically. We further apply uncertainty estimates to augment training data in low-resource domains. Our experiments on three regression tasks in both self-training and active-learning settings show that uncertainty estimation can be used to increase overall performance and enhance model generalization.
Speech

The HW-TSC’s Offline Speech Translation System for IWSLT 2022 Evaluation

Yinglu Li, Minghan Wang, Jiaxin Guo, and 9 more authors

In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), May 2022

Abs DOI

This paper describes the HW-TSC’s designation of the Offline Speech Translation System submitted for IWSLT 2022 Evaluation. We explored both cascade and end-to-end system on three language tracks (en-de, en-zh and en-ja), and we chose the cascade one as our primary submission. For the automatic speech recognition (ASR) model of cascade system, there are three ASR models including Conformer, S2T-Transformer and U2 trained on the mixture of five datasets. During inference, transcripts are generated with the help of domain controlled generation strategy. Context-aware reranking and ensemble based anti-interference strategy are proposed to produce better ASR outputs. For machine translation part, we pretrained three translation models on WMT21 dataset and fine-tuned them on in-domain corpora. Our cascade system shows competitive performance than the known offline systems in the industry and academia.
Speech

The HW-TSC’s Simultaneous Speech Translation System for IWSLT 2022 Evaluation

Minghan Wang, Jiaxin Guo, Yinglu Li, and 9 more authors

In Proceedings of the 19th International Conference on Spoken Language Translation (IWSLT 2022), May 2022

Abs DOI

This paper presents our work in the participation of IWSLT 2022 simultaneous speech translation evaluation. For the track of text-to-text (T2T), we participate in three language pairs and build wait-k based simultaneous MT (SimulMT) model for the task. The model was pretrained on WMT21 news corpora, and was further improved with in-domain fine-tuning and self-training. For the speech-to-text (S2T) track, we designed both cascade and end-to-end form in three language pairs. The cascade system is composed of a chunking-based streaming ASR model and the SimulMT model used in the T2T track. The end-to-end system is a simultaneous speech translation (SimulST) model based on wait-k strategy, which is directly trained on a synthetic corpus produced by translating all texts of ASR corpora into specific target language with an offline MT model. It also contains a heuristic sentence breaking strategy, preventing it from finishing the translation before the the end of the speech. We evaluate our systems on the MUST-C tst-COMMON dataset and show that the end-to-end system is competitive to the cascade one. Meanwhile, we also demonstrate that the SimulMT model can be efficiently optimized by these approaches, resulting in the improvements of 1-2 BLEU points.

2021

HI-CMLM

HI-CMLM: Improve CMLM with Hybrid Decoder Input

Minghan Wang, Guo Jiaxin, Yuxia Wang, and 6 more authors

In Proceedings of the 14th International Conference on Natural Language Generation, Aug 2021

Abs DOI

Mask-predict CMLM (Ghazvininejad et al.,2019) has achieved stunning performance among non-autoregressive NMT models, but we find that the mechanism of predicting all of the target words only depending on the hidden state of [MASK] is not effective and efficient in initial iterations of refinement, resulting in ungrammatical repetitions and slow convergence. In this work, we mitigate this problem by combining copied source with embeddings of [MASK] in decoder. Notably. it’s not a straightforward copying that is shown to be useless, but a novel heuristic hybrid strategy — fence-mask. Experimental results show that it gains consistent boosts on both WMT14 En\ensuremath<-\ensuremath>De and WMT16 En\ensuremath<-\ensuremath>Ro corpus by 0.5 BLEU on average, and 1 BLEU for less-informative short sentences. This reveals that incorporating additional information by proper strategies is beneficial to improve CMLM, particularly translation quality of short texts and speeding up early-stage convergence.
MT

Joint-training on Symbiosis Networks for Deep Nueral Machine Translation models

Zhengzhe Yu, Jiaxin Guo, Minghan Wang, and 8 more authors

arXiv preprint arXiv:2112.11642, Aug 2021
MT

How Length Prediction Influence the Performance of Non-Autoregressive Translation?

Minghan Wang, Guo Jiaxin, Yuxia Wang, and 6 more authors

In Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, Nov 2021

Abs DOI

Length prediction is a special task in a series of NAT models where target length has to be determined before generation. However, the performance of length prediction and its influence on translation quality has seldom been discussed. In this paper, we present comprehensive analyses on length prediction task of NAT, aiming to find the factors that influence performance, as well as how it associates with translation quality. We mainly perform experiments based on Conditional Masked Language Model (CMLM) (Ghazvininejad et al., 2019), a representative NAT model, and evaluate it on two language pairs, En-De and En-Ro. We draw two conclusions: 1) The performance of length prediction is mainly influenced by properties of language pairs such as alignment pattern, word order or intrinsic length ratio, and is also affected by the usage of knowledge distilled data. 2) There is a positive correlation between the performance of the length prediction and the BLEU score.
MT

Self-distillation mixup training for non-autoregressive neural machine translation

Jiaxin Guo, Minghan Wang, Daimeng Wei, and 8 more authors

arXiv preprint arXiv:2112.11640, Nov 2021
MT

HW-TSC’s Participation at WMT 2021 Quality Estimation Shared Task

Yimeng Chen, Chang Su, Yingtao Zhang, and 9 more authors

In Proceedings of the Sixth Conference on Machine Translation, Nov 2021

Abs

This paper presents our work in WMT 2021 Quality Estimation (QE) Shared Task. We participated in all of the three sub-tasks, including Sentence-Level Direct Assessment (DA) task, Word and Sentence-Level Post-editing Effort task and Critical Error Detection task, in all language pairs. Our systems employ the framework of Predictor-Estimator, concretely with a pre-trained XLM-Roberta as Predictor and task-specific classifier or regressor as Estimator. For all tasks, we improve our systems by incorporating post-edit sentence or additional high-quality translation sentence in the way of multitask learning or encoding it with predictors directly. Moreover, in zero-shot setting, our data augmentation strategy based on Monte-Carlo Dropout brings up significant improvement on DA sub-task. Notably, our submissions achieve remarkable results over all tasks.
MT

The HW-TSC’s Offline Speech Translation Systems for IWSLT 2021 Evaluation

Minghan Wang, Yuxia Wang, Chang Su, and 8 more authors

arXiv preprint arXiv:2108.03845, Nov 2021
MT

Incorporating complete syntactical knowledge for spoken language understanding

Shimin Tao, Ying Qin, Yimeng Chen, and 8 more authors

In Knowledge Graph and Semantic Computing: Knowledge Graph Empowers New Infrastructure Construction: 6th China Conference, CCKS 2021, Guangzhou, China, November 4-7, 2021, Proceedings 6, Nov 2021
MT

Incorporating Complete Syntactical Knowledge for Spoken Language Understanding

Weibin Meng, Yanghua Xiao, Jiaxin Guo, and 5 more authors

In Knowledge Graph and Semantic Computing: Knowledge Graph Empowers New Infrastructure Construction: 6th China Conference, CCKS 2021, Guangzhou, China, November 4-7, 2021, Proceedings, Nov 2021

2020

STS

Evaluating the Utility of Model Configurations and Data Augmentation on Clinical Semantic Textual Similarity

Yuxia Wang, Fei Liu, Karin Verspoor, and 1 more author

In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Jul 2020

Abs DOI

In this paper, we apply pre-trained language models to the Semantic Textual Similarity (STS) task, with a specific focus on the clinical domain. In low-resource setting of clinical STS, these large models tend to be impractical and prone to overfitting. Building on BERT, we study the impact of a number of model design choices, namely different fine-tuning and pooling strategies. We observe that the impact of domain-specific fine-tuning on clinical STS is much less than that in the general domain, likely due to the concept richness of the domain. Based on this, we propose two data augmentation techniques. Experimental results on N2C2-STS 1 demonstrate substantial improvements, validating the utility of the proposed methods.
STS

Learning from Unlabelled Data for Clinical Semantic Textual Similarity

Yuxia Wang, Karin Verspoor, and Timothy Baldwin

In Proceedings of the 3rd Clinical Natural Language Processing Workshop, Nov 2020

Abs DOI

Domain pretraining followed by task fine-tuning has become the standard paradigm for NLP tasks, but requires in-domain labelled data for task fine-tuning. To overcome this, we propose to utilise domain unlabelled data by assigning pseudo labels from a general model. We evaluate the approach on two clinical STS datasets, and achieve r= 0.80 on N2C2-STS. Further investigation reveals that if the data distribution of unlabelled sentence pairs is closer to the test data, we can obtain better performance. By leveraging a large general-purpose STS dataset and small-scale in-domain training data, we obtain further improvements to r= 0.90, a new SOTA.
Medical

A multi-pass sieve for clinical concept normalization

Yuxia Wang, Brian Hur, Karin Verspoor, and 1 more author

Traitement Automatique Des Langues, Nov 2020