Abstract
This paper presents a method for detecting whether a text sequence is Vietnamese based on its orthographic and contextual features. Vietnamese is written in a Latin-based script with extensive diacritics, and many of its words are distinguished in meaning only by their accent marks, which can make texts difficult to interpret for those unfamiliar with the language. We analyze how these characteristics influence Transformer-based natural language processing models. We select Transformer-based models because they outperform earlier architectures such as RNNs and LSTMs and are widely used in state-of-the-art NLP systems (GPT, BERT, T5). We examine the specific challenges posed by Vietnamese orthography and word formation, and propose an approach that improves the model's ability to distinguish Vietnamese text. Evaluated on a benchmark dataset, our approach achieves high accuracy and robustness in Vietnamese text detection and outperforms conventional methods. The results indicate that Transformer-based models can effectively learn orthographic and contextual patterns in Vietnamese, contributing to improved language identification and multilingual NLP.
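As an illustration of why accent marks matter for semantic distinction, the following minimal sketch shows several distinct Vietnamese words collapsing to the same string once diacritics are removed. It is not the method proposed in this paper. The helper names `strip_diacritics` and `diacritic_ratio` are hypothetical and introduced only for this example, which relies solely on Python's standard `unicodedata` module.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks from text.

    The letters đ/Đ have no canonical decomposition, so they are
    mapped to d/D explicitly (an assumption made for this sketch).
    """
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.replace("đ", "d").replace("Đ", "D")


def diacritic_ratio(text: str) -> float:
    """Fraction of letters that carry a diacritic or are đ/Đ.

    A crude orthographic signal for illustration only; it does not
    by itself identify Vietnamese.
    """
    letters = [ch for ch in unicodedata.normalize("NFC", text) if ch.isalpha()]
    if not letters:
        return 0.0
    marked = sum(1 for ch in letters if strip_diacritics(ch) != ch)
    return marked / len(letters)


# Six different words that differ only in their accent marks.
words = ["ma", "má", "mà", "mả", "mã", "mạ"]
collapsed = {strip_diacritics(w) for w in words}
print(collapsed)                      # all six words map to one string
print(diacritic_ratio("Tiếng Việt"))  # nonzero for accented text
print(diacritic_ratio("hello"))       # zero for unaccented text
```

Because all six words reduce to the same unaccented form, any preprocessing or tokenization step that discards diacritics also discards the information that separates them, which is the kind of orthographic sensitivity the abstract refers to.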