NorthEuraLex is a large-scale lexicostatistical database that is being compiled within the EVOLAEMP project. It is unique among databases in providing lexical data from more than twenty language families in a unified IPA encoding. This encoding is generated automatically from the orthographies or standard transcriptions and will continue to be improved. The database is intended to serve as a basis for creating new benchmarks in computational historical linguistics, with the aim of improving computational models of language relationship and language contact.
The current release, version 0.9, covers a list of 1,016 concepts across 107 languages of Northern Eurasia. The focus is on Uralic and Indo-European, but the release also includes all the language families conveniently summarized as Altaic/Transeurasian and Paleosiberian, a selection of Caucasian languages, some major contact languages from adjacent families, and the best-known isolates of Northern Eurasia.
IMPORTANT: The current versions of the wordlists were compiled by non-experts from available resources and are therefore certain to contain many errors and inaccuracies. They are not adequate as a primary reference or data source for any of the languages concerned, and should be used only in computational frameworks that can tolerate some noise. The next major version (planned for autumn 2020) will contain at least 80 additional languages, a first batch of etymological annotations for the larger families, and many updates and corrections based on feedback from experts and native speakers.
Dellert, J., Daneyko, T., Münch, A. et al. Lang Resources & Evaluation (2019). https://doi.org/10.1007/s10579-019-09480-6 (version 0.9).