Abstract
Considerable importance is indirectly attributed to the orthographic word: it forms the basis of any task preceded by tokenization and provides material for stimuli in psycholinguistic experiments. In many writing traditions, however, the orthographic word represents isolated entries in the lexicon and largely ignores the phonological processes of production. This study examines near-naive word separation in Swiss German using a corpus of text messages, revealing distinct patterns of orthographic segmentation driven by phonological processes such as assimilation and epenthesis. Compared to Standard German, Swiss German exhibits fewer orthographic words, suggesting a heightened representation of prosodic dependencies in writing. When writers deviate from Standard German space insertion conventions, they prioritize phonology over syntax. These findings raise further doubts about the meaningfulness of orthographic representation for word-based comparative linguistic research and highlight the importance of integrating phonological information into natural language processing models.