International Journal of Advanced Computer Science and Applications (IJACSA), Volume 14 Issue 10, 2023.
Abstract: A bidirectional phonetizer, morphologizer, and diacritizer pipeline (FSPMD) for Modern Standard Arabic (MSA) was developed that integrates pronunciation, concatenative and templatic morphology, and diacritization. Grammar and segmental phonology rules are applied in the forward direction to ensure proper rule ordering and are supplemented with special backward-direction rules. The FSPMD comprises bidirectional finite-state transducers (FSTs) built from an ordered composition of FSTs, unordered parallel FSTs, and unioned FSTs, together with finite-state acceptors for validity checking. The FSPMD has unique, innovative features and can be used as an integrated pipeline or as a standalone phonetizer (FSAP), morphologizer (FSAM), or diacritizer (FSAD). Because the system is bidirectional, it can be used in the forward (generation, synthesis) and backward (analysis, decomposition) directions and can be integrated into systems such as automatic speech recognition (ASR) and language learning tools. The FSPMD is rule-based and avoids stem listings for morphology and pronunciation dictionaries, which makes it scalable and generalizable to similar languages. It models authentic rules at fine granularity, including rewrite and morphophonemic rules, and identifies and uses subcategories such as irregular verbs. Evaluation of FSAP on text from the Tashkeela corpus and Wikipedia showed that the pronunciation system accurately pronounces all text and words; the only errors involved foreign words and misspellings, which are outside the system’s scope. FSAM and FSAD coverage and accuracy were evaluated using the Tashkeela corpus and a gold standard derived from its intersection with the UD_PADT treebank. Coverage for extracting roots and properties from words is 82%. Accuracy is 92% for roots computed from a word, 100% for words generated from a root, 97% for non-root properties, and 84% for diacritization.
FSAM non-root results matched or surpassed those from MADAMIRA; however, root results were not compared because of the concatenative nature of publicly available morphologizers.
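The operations named in the abstract (ordered composition, union, validity acceptors, and running the same machine forward or backward) can be illustrated with a minimal sketch. It is not the authors' implementation: it models each transducer as a small finite relation of (input, output) string pairs rather than as a real FST, and the stage names, the `root+pattern` input notation, and the two toy entries are assumptions made only for illustration.

```python
from typing import FrozenSet, Set, Tuple

# A toy "transducer": a finite relation of (input, output) string pairs.
Relation = FrozenSet[Tuple[str, str]]


def compose(first: Relation, second: Relation) -> Relation:
    """Ordered composition: feed the outputs of `first` into `second`."""
    return frozenset((a, c) for a, b in first for b2, c in second if b == b2)


def union(*relations: Relation) -> Relation:
    """Union of alternative relations (e.g. one per morphological pattern)."""
    return frozenset().union(*relations)


def invert(relation: Relation) -> Relation:
    """Swap input and output, giving the backward (analysis) direction."""
    return frozenset((b, a) for a, b in relation)


def restrict(relation: Relation, acceptor: Set[str]) -> Relation:
    """Keep only pairs whose output is accepted as valid."""
    return frozenset((a, b) for a, b in relation if b in acceptor)


def apply(relation: Relation, text: str) -> Set[str]:
    """Return every output the relation pairs with `text`."""
    return {b for a, b in relation if a == text}


# Hypothetical toy stages; the entries are illustrative, not the paper's rules.
pattern_a: Relation = frozenset({("ktb+CaCaC", "katab")})
pattern_b: Relation = frozenset({("ktb+CaaCiC", "kaatib")})
morphology = union(pattern_a, pattern_b)

phonetizer: Relation = frozenset({
    ("katab", "k a t a b"),
    ("kaatib", "k aa t i b"),
})
valid_pronunciations = {"k a t a b", "k aa t i b"}

# Forward pipeline: morphology first, then pronunciation, then validity check.
forward = restrict(compose(morphology, phonetizer), valid_pronunciations)
# Backward pipeline: the same relation, inverted.
backward = invert(forward)

generated = apply(forward, "ktb+CaCaC")
analyzed = apply(backward, "k aa t i b")
```

Here `generated` holds the pronunciation produced from a root and pattern, and `analyzed` recovers the root and pattern from a pronunciation. A production system would typically use a finite-state toolkit with rule compilation rather than enumerated pairs, which is what allows a rule-based design to avoid stem listings.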
Maha Alkhairy, Afshan Jafri and Adam Cooper, “An Integrated, Bidirectional Pronunciation, Morphology, and Diacritics Finite-State System,” International Journal of Advanced Computer Science and Applications (IJACSA), 14(10), 2023. http://dx.doi.org/10.14569/IJACSA.2023.01410122
@article{Alkhairy2023,
title = {An Integrated, Bidirectional Pronunciation, Morphology, and Diacritics Finite-State System},
journal = {International Journal of Advanced Computer Science and Applications},
doi = {10.14569/IJACSA.2023.01410122},
url = {http://dx.doi.org/10.14569/IJACSA.2023.01410122},
year = {2023},
publisher = {The Science and Information Organization},
volume = {14},
number = {10},
author = {Maha Alkhairy and Afshan Jafri and Adam Cooper}
}
Copyright Statement: This is an open access article licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, even commercially, as long as the original work is properly cited.