Deciphering Undersegmented Ancient Scripts Using Phonetic Prior

Most undeciphered lost languages exhibittwo characteristics that pose significant de-cipherment challenges: (1) the scripts arenot fully segmented into words; (2) the clos-est known language is not determined. Wepropose a decipherment model that handlesboth of these challenges by building on richlinguistic constraints reflecting consistentpatterns in historical sound change. We cap-ture the natural phonological geometry bylearning character embeddings based on theInternational Phonetic Alphabet (IPA). Theresulting generative framework jointly mod-els word segmentation and cognate align-ment, informed by phonological constraints.We evaluate the model on both decipheredlanguages (Gothic, Ugaritic) and an undeci-phered one (Iberian). The experiments showthat incorporating phonetic geometry leadsto clear and consistent gains. Additionally,we propose a measure for language close-ness which correctly identifies related lan-guages for Gothic and Ugaritic. For Iberian,the method does not show strong evidencesupporting Basque as a related language,concurring with the favored position by thecurrent scholarship. Read More

#nlp