Pertanika Journal

An Analysis of the Vietnamese Dictionary from a Computational Linguistics Perspective

Trang Thi My Phan, Hai Van Ba Phan, Tri Quoc Do, Dien Dinh, and Phuong Thi Minh Tran

Pertanika Journal of Social Science and Humanities, Volume 34, Issue 2, April 2026

DOI: https://doi.org/10.47836/pjssh.34.2.20

Keywords: Computational linguistics, letter distribution, part-of-speech distribution, polysemy coefficient, Vietnamese dictionary

Published on: 2026-04-30

Abstract

Dictionaries are essential resources for exploring a language’s lexicon, providing insights into word formation, usage, and linguistic relationships. With the advancement of computational linguistics, applying statistical methods to dictionary data enables researchers to uncover the lexical characteristics of a language. This study examined the Vietnamese Dictionary from a computational linguistics perspective. It applied four statistical techniques (frequency analysis, part-of-speech (POS) distribution analysis, multi-POS coefficient analysis, and polysemy coefficient analysis) to investigate letter distribution, POS characteristics, and polysemy levels; here, multi-POS words are words that can function as more than one part of speech. The findings indicate that the most frequently occurring letters are n, h, a, i, t, g, c, and u, while letters such as q, x, d, v, e, s, ă, k, and r occur less frequently. The letters t, c, n, đ, b, l, and h occur most often in initial position. Nouns account for the largest proportion of lexical entries (44.7%), followed by verbs (31.58%) and adjectives (21.22%). The multi-POS coefficient analysis shows that 90.11% of words have one part of speech, 8.84% can function in two, and fewer than 1% span three or more, which indicates that the Vietnamese lexicon has low syntactic flexibility in terms of POS variation. The polysemy coefficient analysis indicates that particles, pronouns, and verbs exhibit the highest degrees of polysemy. Together, these statistical results characterize the distribution of the Vietnamese lexicon and provide a foundation for further research in lexical semantics, electronic dictionaries, part-of-speech tagging tools, and natural language processing applications.
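The four analyses named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy entries, the function names, and in particular the definition of the polysemy coefficient as the mean number of senses per headword within a POS, which the abstract does not spell out. Letters are counted as precomposed (NFC) characters, which may differ from how the study treats tone marks.

```python
import unicodedata
from collections import Counter, defaultdict

# Hypothetical toy entries: headword -> {POS tag: number of senses}.
# Illustrative only; these are not data from the dictionary studied.
ENTRIES = {
    "nhà": {"noun": 3},
    "đi": {"verb": 4},
    "đẹp": {"adjective": 2},
    "cày": {"noun": 1, "verb": 2},
}


def _nfc(word):
    # Normalize so that a letter plus its diacritics is one code point.
    return unicodedata.normalize("NFC", word).lower()


def letter_frequencies(headwords):
    """Count every alphabetic character across all headwords."""
    return Counter(ch for w in headwords for ch in _nfc(w) if ch.isalpha())


def initial_letter_frequencies(headwords):
    """Count only the first character of each non-empty headword."""
    return Counter(_nfc(w)[0] for w in headwords if w)


def pos_distribution(entries):
    """Percentage of (headword, POS) pairs that fall under each POS."""
    counts = Counter(pos for senses in entries.values() for pos in senses)
    total = sum(counts.values())
    return {pos: 100 * n / total for pos, n in counts.items()}


def multi_pos_distribution(entries):
    """Percentage of headwords that carry exactly k POS tags, for each k."""
    counts = Counter(len(senses) for senses in entries.values())
    total = len(entries)
    return {k: 100 * n / total for k, n in counts.items()}


def polysemy_coefficient(entries):
    """Assumed definition: mean number of senses per headword, per POS."""
    sense_totals = defaultdict(int)
    word_totals = defaultdict(int)
    for senses in entries.values():
        for pos, n_senses in senses.items():
            sense_totals[pos] += n_senses
            word_totals[pos] += 1
    return {pos: sense_totals[pos] / word_totals[pos] for pos in sense_totals}


if __name__ == "__main__":
    print(letter_frequencies(ENTRIES).most_common(3))
    print(initial_letter_frequencies(ENTRIES).most_common(3))
    print(pos_distribution(ENTRIES))
    print(multi_pos_distribution(ENTRIES))
    print(polysemy_coefficient(ENTRIES))
```

On real dictionary data, the same functions would take the full headword list and sense inventory as input; the percentages reported in the abstract come from the study itself, not from this sketch.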

ISSN 0128-7702

e-ISSN 2231-8534

Article ID: JSSH-9377-2025