DIACHRONIC CORPUS ALGORITHMS
DOI: https://doi.org/10.47390/SPR1342V5SI11Y2025N34
Keywords: diachronic corpus, pipeline, sentence segmentation, preprocessing, metadata, NLP.
Abstract
This article describes the construction of a diachronic corpus of Uzbek fiction published between 1991 and 2021, together with the algorithms used to process it. Following corpus-linguistic practice, the work covers text collection, preprocessing, sentence segmentation, metadata creation, and verification. The result is a clean, standardized corpus of 116 works. The corpus algorithms make it possible to trace how linguistic units change over time, to run statistical analyses by genre and demographic characteristics, and to build n-gram models. The corpus is intended as a reliable resource for diachronic research on Uzbek and for applied NLP work.
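The stages named in the abstract (preprocessing, sentence segmentation, n-gram counting over time) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `(year, text)` record layout, the function names, and the regex-based segmentation and tokenization rules are all assumptions chosen for brevity, and the sample sentences are invented.

```python
import re
from collections import Counter, defaultdict


def preprocess(text):
    """Collapse runs of whitespace (assumed cleaning step, not from the article)."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text):
    """Naive segmentation: split after ., !, ? or the ellipsis character."""
    parts = re.split(r"(?<=[.!?…])\s+", text)
    return [p for p in parts if p]


def tokenize(sentence):
    """Lowercase word tokens; keeps word-internal apostrophes as in o‘zbek."""
    return re.findall(r"[^\W\d_]+(?:[‘’'ʻʼ][^\W\d_]+)*", sentence.lower())


def ngram_counts_by_year(records, n=2):
    """Count n-grams per publication year; n-grams never cross sentences.

    `records` is a hypothetical iterable of (year, raw_text) pairs.
    """
    counts = defaultdict(Counter)
    for year, raw in records:
        for sentence in split_sentences(preprocess(raw)):
            tokens = tokenize(sentence)
            counts[year].update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Invented sample data for illustration only.
records = [
    (1991, "Kitob yaxshi.  Kitob yaxshi edi!"),
    (2021, "Yangi kitob chiqdi."),
]
counts = ngram_counts_by_year(records, n=2)
```

Comparing `counts[1991]` with `counts[2021]` gives a first, rough view of how frequent word sequences differ between periods. A production pipeline would also need abbreviation-aware segmentation and the metadata fields (genre, author demographics) mentioned in the abstract.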