A corpus approach to orthographic chunking: near-naive word separation in Swiss German text messages

Abstract

Considerable importance is indirectly attributed to the orthographic word: it forms the basis of any task preceded by tokenization and supplies the stimulus material for psycholinguistic experiments. In many writing traditions, however, the orthographic word represents isolated entries in the lexicon and largely ignores the phonological processes of production. This study examines near-naive word separation in Swiss German using a corpus of text messages and reveals distinct patterns of orthographic segmentation driven by phonological processes such as assimilation and epenthesis. Compared to Standard German, Swiss German exhibits fewer orthographic words, suggesting that prosodic dependencies are represented more strongly in writing. When writers deviate from Standard German space insertion conventions, they prioritize phonology over syntax. These findings raise further doubts about how meaningful orthographic representation is for word-based comparative linguistic research, and they highlight the importance of integrating phonological information into natural language processing models.
