From Linguistic Forms to Linguistic Contexts

 Lahousseine Id-youss

 KU Leuven




The question of whether Arabic morphology is root-based or word-based is an old but still unsettled debate. Root advocates defend the centrality of consonantal roots, whereas word advocates argue that morphological processes in Arabic operate on words. This article suggests using word space models as a neutral statistical tool to determine which of these two approaches is more adequate.



Understanding the mystery of human language and formulating adequate theories that explain its special character in comparison with other social constructs is truly challenging. It is indeed striking that linguists still struggle to account for a phenomenon which children acquire and master in a few years. The semantic component is perhaps one of the most challenging aspects of language. Meaning, being so slippery, has been excluded by some linguistic theories from their realm of interest, overlooking the fact that language might not have existed without the communicative function it performs. It is difficult to imagine what importance linguistic structures would have had it not been for the meanings associated with them.

Different hypotheses and approaches to human language continue to develop, raising people’s awareness of linguistic matters which they master perfectly well in practice but about which they hardly know anything explicitly. Given the diversity and proliferation of these theories, especially as some of them now and then contradict each other, it seems useful to come up with some form of neutral criterion to judge which of them is more adequate. Some methods from the statistical framework of language can certainly be of great use in this regard.

It is a fortunate coincidence that the context-based conception of meaning, which is statistical in nature, offers useful findings as a possible solution to the problem of meaning. This quantitative framework is flourishing rapidly, benefiting from increasing computational power (Geeraerts, 2009). Computer processing power keeps growing thanks to the fast development of research in this respect, thus opening new horizons for statistical approaches to human language.

The central issue which the present article addresses, and which perhaps requires some neutral standard to be settled, is the old but ongoing debate over whether Arabic is root-based or word-based. While the root-based camp maintains that the root is the central morphological element in Arabic (McCarthy, 1979; Cantineau, 1950a), the word-based camp argues that morphological processes in the Arabic language are word-based (Mahadin, 1997; Benmamou, 1999). Both camps come up with equally strong arguments and linguistic data supporting their views.

In this article, I propose using a recent large-scale quantitative method from corpus linguistics, namely word space models, to settle this morphological issue. The technique has been employed in a variety of experiments and has produced remarkable results in the field of semantic similarity. Its relevance to the problem at hand stems from the fact that the two morphological approaches disagree about whether different word forms sharing the same consonantal root should be semantically related.

The article is organised as follows. The first section addresses some aspects of the root-based approach to Arabic. The second expounds the word-based view. The third outlines the method proposed to judge which of the two views is more adequate. Finally, the last section presents the Latin transcription system employed for the Arabic words cited in this article.


1. Consonantal roots

One of the most influential approaches to Arabic morphology is the root-based one. This view emphasizes the pivotal role of the root in word formation, and it starts from a number of premises relating to the status of consonants as opposed to vowels and affixes. Consonants, vowels and affixes are assumed to form separate entities and to play different morphological roles. The present section explores this approach and sets out some of its main arguments.

Unlike English morphology, where words are built up compositionally from smaller elements, i.e., morphemes, Arabic morphology is largely non-concatenative: the consonantal root is subject to various vocalic operations. For instance, the English word ‘internationalization’, consisting of five distinct morphemes, can be morphologically analyzed as inter-nation-al-iz-ation. Similarly, the word ‘teachers’ is composed of two morphemes, i.e., the base word ‘teacher’ and the plural morpheme ‘-s’. Concatenative processes of this kind are not the primary mechanism of word formation in Arabic. Instead, inflectional and derivational rules function through processes of infixation and circumfixation and through vocalic alternations, taking the consonantal root as their input. For example, the words ‘kitaab’ (book) and ‘kutub’ (books), the first singular and the second plural, can both be reduced to the consonantal root k-t-b. The grammatical feature of number is realized by a change of vowels, while the triliteral consonantal root remains intact.
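The reduction of ‘kitaab’ and ‘kutub’ to a shared consonantal skeleton can be illustrated with a minimal sketch. It assumes the Latin transcription used in this article, in which only a, i and u are vowels; the function name is hypothetical, and the procedure is deliberately naive rather than a morphological analyser.

```python
# Vowel letters of the article's transcription; every other symbol is
# treated as a consonant (an assumption of this sketch).
VOWELS = set("aiu")

def consonantal_skeleton(word: str) -> str:
    """Return the consonants of a transliterated word, joined by hyphens.

    Naive on purpose: it only drops vowels, so affixes such as n-, st- or
    -ta- would be kept as if they were radicals, a doubled radical would be
    counted twice, and the digraph 'th' would be split into two symbols.
    """
    return "-".join(ch for ch in word if ch not in VOWELS)

for form in ("kitaab", "kutub"):
    print(form, "->", consonantal_skeleton(form))
```

Both forms reduce to k-t-b. Affixed forms would need their affixes stripped first, which this sketch does not attempt.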

The root can be defined as an abstract linguistic unit which often consists of three consonants common to a group of words related in terms of meaning (Cantineau, 1950: 120). In the same vein, McCarthy (1979) stresses the idea that the root in Arabic is made up of three, four or five consonants, commonly referred to as radicals, which revolve around a single semantic field and from which all the related verbal and nominal forms are derived. Accordingly, the words ‘katab’ (write), ‘kutib’ (written), ‘kaatib’ (writer), and ‘kaatab’ (write to, correspond) all have the consonantal root k-t-b in common and revolve around the same semantic field, namely that of “writing”.

The two important points to be retained from this definition are (i) that the root consists of consonants and (ii) that these consonants revolve around a single semantic field. To further illustrate the consonantal character of the root and the semantic association amongst the different derived forms, let us look at the data below.

Verb form   Affix                                  Consonantal root   Gloss                    Semantic field
katab       (none)                                 k-t-b              write                    writing
kattab      doubling of the second radical (-t-)   k-t-b              cause x to write         writing
kaatab      -aa-                                   k-t-b              write to/correspond      writing
takaatab    ta-, -aa-                              k-t-b              to write to each other   writing
nkatab      n-                                     k-t-b              subscribe                writing
ktatab      -ta-                                   k-t-b              be registered            writing
staktab     st-                                    k-t-b              make write               writing

These data seem to support McCarthy’s (1979) idea that words in Arabic can be morphologically related although they are not composed of discrete morphemes put together in a concatenative manner. As one can easily notice, morphologically speaking, all these words share nothing but the discontinuous consonantal radicals /k-t-b/, which cluster around the semantic field of writing.

Another strong feature of this root-based view of Arabic morphology is that the consonantal root, vowels, and affixes belong to separate morphologically segmental tiers and that vowels intercalate between the discontinuous consonantal root at a later stage in the derivation process of every single word (McCarthy, 1979). The implication of this stand is that vowels are not part of the root.

An even stronger, and probably more controversial, position with regard to these two morphological entities, which is in keeping with how the root is defined above, is the association of semantic content with consonants and of grammatical content with vowels. In other words, the root-based approach claims that in the Arabic language consonants and vowels play different roles: while the former are lexical, the latter are functional.

As a matter of fact, even if vowels are rarely represented in orthography, their morphological contribution in Arabic, compared to many other languages, is rather powerful. This feature is due to the huge number of permissible internal vowel alternations for every single root, and it is usually referred to in linguistic studies as apophony. The question now is whether this division of roles between vowels and consonants is well-founded.

Indeed, a significant amount of grammatical information such as number, case, aspect, voice, etc., is realized through vocalic alternations internal to the stem. Almost all so-called broken plurals, such as the kitaab/kutub example cited above, could serve as good examples of this phenomenon. McCarthy (1979: 281-284) argues that certain verbal categories such as aspect and voice are always marked by regular and systematic vocalic alternations. The regularity is clear in both the perfective and the imperfective, and also in the active and passive participles. To illustrate the point, let us look at the data below:

Perfective            Imperfective          Participle
Active     Passive    Active     Passive    Active     Passive
katab      kutib      yaktub     yuktab     kaatib     maktuub
fataH      futiH      yaftaH     yuftaH     faatiH     maftuuH
mazzaq     muzziq     yumazziq   yumazzaq   mumazziq   mumazzaq
Darrab     Durrib     yuDarrib   yuDarrab   muDarrib   muDarrab
kattab     kuttib     yukattib   yukattab   mukattib   mukattab
daHraj     duHrij     yudaHrij   yudaHraj   mudaHrij   mudaHraj

These verb forms show that some vowel patterns seem to bear constant meaning: while the open unrounded vowel [a], for instance, always occurs in the perfective active, the close back rounded vowel [u] constantly appears in the perfective passive. Thus, on this view, vowels serve only to indicate the grammatical function of words.
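The regularity in the perfective columns can be checked mechanically. The following is a minimal sketch, assuming the article's transcription (vowels a, i, u; long vowels written double); the function name is hypothetical. It extracts the vowel sequence of each form and prints it next to the form.

```python
VOWELS = set("aiu")

def vocalic_melody(word: str) -> str:
    """Return the vowel sequence of a transliterated word.

    A long vowel, written as a doubled letter such as 'aa', is collapsed
    into a single symbol.
    """
    melody = []
    previous = ""
    for ch in word:
        if ch in VOWELS and ch != previous:
            melody.append(ch)
        previous = ch
    return "-".join(melody)

# Perfective active/passive pairs copied from the data above.
PERFECTIVE_PAIRS = [
    ("katab", "kutib"), ("fataH", "futiH"), ("mazzaq", "muzziq"),
    ("Darrab", "Durrib"), ("kattab", "kuttib"), ("daHraj", "duHrij"),
]

for active, passive in PERFECTIVE_PAIRS:
    print(active, vocalic_melody(active), "|", passive, vocalic_melody(passive))
```

For these six verbs every active form yields the melody a-a and every passive form u-i, which is the pattern the root-based argument relies on.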

Another argument for the functional role of vowels, according to this approach, is McCarthy’s classification of Arabic verbal forms. He claims that Arabic recognizes fifteen triliteral verbal forms and four quadriliteral ones. In other words, every triliteral consonantal root gives rise to fifteen derivational categories and any quadriliteral root generates four. These are commonly referred to as augmented forms or “derived conjugations” (Schramm, 1978: 498). By simply alternating vowels and inserting affixes in a systematic manner, we shift from one form to another. The consonantal root, which is the essential part, is always present irrespective of which affixes and vocalic melodies accompany it. Vowels, therefore, make no lexical contribution.

Now that I have set out some of the basic assumptions of the root-based view of Arabic morphology, I would like to retain a number of points that are inherent to this approach before turning to the word-based stand in the next section. First, the most important morphological unit is the root, which consists of discontinuous consonants. Second, consonants and vowels have different functions: while consonants bear semantic information, vowels carry grammatical content.


2. Word-based conception of Arabic morphology

The word-based view of Arabic morphology has gained considerable attention in recent decades. Its main tenet is that word-formation processes function at the word level rather than at the root level. A morphological rule applies to an already existing word, and both the derived word and the word to which the rule applies are members of the lexicon. In this section, I present this morphological stand, according to which the association between consonants and semantic information is not adequate and vowels have a lexical status and are part of words.

The morphological view which analyzes the meaning of a word as the sum total of the meanings of the morphemes it is composed of, where a morpheme is defined as the minimal meaningful element, was refuted by Lees (1961), Halle (1973), and Aronoff (1976). A ‘blackboard’ need not be black, as its constituent morpheme ‘black-’ suggests. It can be white and still be called a ‘blackboard’.

Although McCarthy (1979: 210) distances himself from such a morpheme-based school, describing his predecessors’ studies as “taxonomic”, he paradoxically worked out Arabic morphology in the same taxonomic fashion. According to him, as explained above, words in Arabic consist mainly of discontinuous consonantal morphemes, affixes, and vowels, each of which bears some kind of information and is represented as a separate entity, and their meanings come together to form the meaning of the derived word.

Mahadin (1987) argues that Arabic, despite the powerful influence of the root in its morphology, is word-based. He demonstrates that in Arabic there is a great deal of vocabulary which shares the same consonantal root but which is not semantically related. For instance, the words ‘aHmar’ (red) and ‘Himaar’ (donkey) both share the root H-m-r, but no semantic similarity can be found between the two words.

To further illustrate this line of thinking, let us look at the data in the table below. Compare the words in Groups 1, 2 and 3 and consider which consonantal radicals each group shares and what semantic relationship there could be between those words.

No.   Word       Consonantal root   Gloss
1.A   bayaaD     b-y-D              whiteness
1.B   bayD       b-y-D              eggs
2.A   jamaal     j-m-l              beauty
2.B   jamal      j-m-l              camel
3.A   Hadath     H-d-th             happen
3.B   Haddath    H-d-th             talk to
3.C   staHdath   H-d-th             invent

The striking fact that these data demonstrate is that the words in 1.A and 1.B, those in 2.A and 2.B, and those in 3.A, 3.B and 3.C, if reduced to a consonantal root, have the same radicals in common. The question then is: what could the semantic similarity be between the notions ‘whiteness’ and ‘eggs’ as in 1, between ‘beauty’ and ‘camel’ as in 2, and finally between ‘happen’, ‘talk to’ and ‘invent’ as in 3? On the word-based view, there is no obvious semantic relationship between those forms. Therefore, the idea that the word forms derived from a particular consonantal root always share some kind of meaning does not seem to hold in every case.

It is now perhaps clear that the consonantal root, which is assumed to semantically distinguish words in McCarthy’s terms, is not always able to do so on its own. Thus, there should be another linguistic device responsible for the difference in meaning between words with the same radicals. According to the word-based view of Arabic morphology, the most plausible solution to this issue comes from the vocalic part of the language: contrary to what the root-based approach holds, vowels do not play a functional role alone but also make a lexical contribution. As a result, advocates of this position argue that vowels form an inseparable part of the word, and they thus reject the idea of discontinuous morphemes discussed in the previous section.

Arabic is full of completely different lexical items which, without their vowels, cannot be told apart. The data below further reinforce the idea that consonants are not the only carriers of lexical meaning; vowels too have a lexical role to play (Mahadin, 1987: 162).

No.   Word form   Consonantal root   Gloss
1.A   wajad       w-j-d              find
1.B   wajid       w-j-d              love
2.A   Saʕr        S-ʕ-r              hair
2.B   Siʕr        S-ʕ-r              poetry
3.A   ʕaqd        ʕ-q-d              contract
3.B   ʕiqd        ʕ-q-d              decade
4.A   qadum       q-d-m              become old
4.B   qadim       q-d-m              come
5.A   ʕirD        ʕ-r-D
5.B   ʕarD        ʕ-r-D

As can be seen from these words, the difference between the Arabic verbs ‘wajid’ and ‘wajad’ lies in the second vowel of the string. While the former takes the close front unrounded vowel, the latter takes the open front unrounded vowel. The same is true of the words ‘Siʕr’, ‘ʕiqd’ and ‘ʕirD’ on the one hand and ‘Saʕr’, ‘ʕaqd’ and ‘ʕarD’ on the other. As for the words ‘qadim’ and ‘qadum’, while the first takes the close front unrounded vowel, the second takes the close back rounded one. Without this vowel contrast, it would be impossible to distinguish these words semantically. Thus, it seems that vowels in Arabic are lexical and form an integral part of the word.
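The claim that these pairs are kept apart by their vowels alone can be verified with a short sketch. It assumes the transcription of this article (vowels a, i, u, with S for the palato-alveolar sibilant); the helper names are hypothetical.

```python
from typing import Tuple

VOWELS = set("aiu")

def split_tiers(word: str) -> Tuple[str, str]:
    """Split a transliterated word into (consonant string, vowel string)."""
    consonants = "".join(ch for ch in word if ch not in VOWELS)
    vowels = "".join(ch for ch in word if ch in VOWELS)
    return consonants, vowels

def differ_only_in_vowels(first: str, second: str) -> bool:
    """True when two words share their consonants but not their vowels."""
    c1, v1 = split_tiers(first)
    c2, v2 = split_tiers(second)
    return c1 == c2 and v1 != v2

# Word pairs taken from the discussion above.
PAIRS = [("wajad", "wajid"), ("Saʕr", "Siʕr"), ("ʕaqd", "ʕiqd"),
         ("qadum", "qadim"), ("ʕirD", "ʕarD")]

for first, second in PAIRS:
    print(first, second, differ_only_in_vowels(first, second))
```

Every pair passes the check, so a representation that keeps only the consonantal skeleton would conflate the two members of each pair.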

To conclude this section, it is useful to highlight the main claims of the word-based view of Arabic morphology. First, the sharing of consonants between words cannot always be taken as an indication of semantic similarity, as there are scores of words which have the same consonantal root in common but are not semantically related. Second, consonants are not the only morphological elements responsible for carrying semantic information. Vowels are lexical entities too; they help to distinguish semantically between a great many lexical items.


3. From linguistic forms to contexts

It seems clear that the traditional purely linguistic arguments offer no clear way to establish which of the two morphological models discussed above is more appropriate. Both root-based and word-based advocates are equipped with an arsenal of linguistic data supporting their viewpoints; and both are so strong that, if one is presented with both options, it can be a real challenge to determine which of them is more adequate.

Consequently, there is a strong need for an impartial statistical method to verify which of the two models fits better. Large-scale quantitative methods from corpus linguistics, especially word space models, can I believe be of great assistance in this respect. The importance of word space models becomes clearer once the root-based versus word-based issue is seen in the light of semantic similarity. Since in both approaches the focus is quite often on the meaning of word forms and on how they relate semantically to each other, the problem can be reformulated as the question of whether semantic similarity should be computed at the root level or at the word level.

The word space model is used to geometrically represent semantically related words on the basis of their distributional patterns (Sahlgren, 2006). It is a corpus-based technique in that it utilizes the distributional properties of words collected over large text data, and the philosophy behind it is Harris’s distributional hypothesis that words with similar meanings tend to occur in similar linguistic contexts (Harris, 1954; Sahlgren, 2006). Linguistic contexts here refer to the surrounding linguistic items with which a particular word co-occurs.
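How distributional patterns might be collected can be sketched as follows. This is only an illustration under stated assumptions: the context of a word is taken to be the tokens within a fixed window of two positions on either side, the corpus is a two-sentence English toy example, and the function name is hypothetical. Sahlgren (2006) discusses other context definitions and weighting schemes.

```python
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Count, for every token, the tokens found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, target in enumerate(tokens):
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            vectors[target].update(neighbours)
    return vectors

# Toy English stand-in for a corpus; a real run would read Arabic newswire text.
toy_corpus = [
    "the writer wrote a long letter",
    "the author wrote a long book",
]

vectors = context_vectors(toy_corpus)
print(vectors["writer"])
print(vectors["author"])
```

In this toy corpus ‘writer’ and ‘author’ end up with identical context counts, which is the kind of distributional overlap the hypothesis associates with similar meaning.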

The distributional hypothesis has been tested and validated by various authors, and the tests have shown impressive results. Rubenstein & Goodenough (1965) compared synonymy judgements made by university students with contextual similarities, and almost the same experiment was repeated by Miller & Charles (1991). Both experiments demonstrated a correlation between the semantic similarity of words and their contextual/distributional properties.

In word space models, semantic similarity is represented in terms of spatial proximity. In other words, the closer one word is to another in the geometric space, the more semantically similar they are. Proximity between words is computed on the basis of their contextual properties, i.e., the more contextual information they share, the closer they will be to each other in the geometric space. For more details on context vectors and how to proceed from distributional statistics to geometric space, as well as on the computational and mathematical apparatus behind the approach as a whole, see Sahlgren (2006).
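Spatial proximity can be made concrete with a small sketch. It assumes cosine similarity as the proximity measure, which is one common choice and not one this article commits to, and it uses invented context counts purely for illustration.

```python
import math
from collections import Counter

def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * v[key] for key, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical context counts, invented for illustration only.
writer = Counter({"wrote": 3, "letter": 2, "book": 1})
author = Counter({"wrote": 2, "book": 3})
camel = Counter({"desert": 4, "rode": 2})

print(round(cosine(writer, author), 3))
print(cosine(writer, camel))
```

Words that share contexts receive a value between 0 and 1 that grows with the overlap, while words with no shared context receive 0.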

An important aspect, which is a prerequisite to applying the word space models technique and upon which adequate results depend, is the extraction of distributional patterns of words from a reliably large and representative corpus. The term “corpus” refers to a large collection of electronic texts gathered by means of explicit criteria (Bowker, 2002). The nature of these criteria in general depends on the nature of the task at hand which the corpus is expected to be used for. Research in this field, thanks to the steady development of computational processing power, is making impressive progress and seems to have promising prospects.

To carry out the experiment I suggest in this article, the Arabic Gigaword Fourth Edition, one of the largest publicly available corpora of Arabic, can certainly be of great use to researchers. The corpus, which is a comprehensive archive of newswire text data, is produced and distributed by the Linguistic Data Consortium (LDC) at the University of Pennsylvania under catalog no. LDC2009T30, and it contains about 848 million tokens (Mohammed, Antonio, Lamia, Pavel & Josef, 2010). A token is a string of characters delimited by white space. It is more appropriate to count tokens rather than words in this context because of the morphological nature of Arabic, where pronouns and definite articles as well as many other linking elements are attached to the word. For instance, while the English expression “his book” is counted as two words, its Arabic translation “kitaabuhu” is counted as one token only. With this morphological fact in mind, the number of words in the Arabic Gigaword Fourth Edition is well beyond the 848 million mentioned above.
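The token definition used here amounts to splitting on white space, as the following minimal sketch shows. The function name is hypothetical, and real corpus preparation would also need to deal with punctuation and markup.

```python
def count_tokens(text: str) -> int:
    """Count whitespace-delimited tokens in a string."""
    return len(text.split())

print(count_tokens("his book"))   # two tokens
print(count_tokens("kitaabuhu"))  # one token
```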

The underlying idea of this article is to test the root-based and word-based approaches to Arabic morphology against the Arabic Gigaword corpus, using word space models to see which of them is more appropriate from the point of view of semantic similarity. If the results of the word space models reflect words’ relations to the consonantal root, then the root-based approach will be judged more adequate for semantic similarity. In other words, if the words closest to a specific target word in the geometric space share the target word’s consonantal root, then semantic similarity between words can legitimately be assessed at the root level rather than the word level. Otherwise, the word-based view must be more appropriate for this task, since words with a high degree of proximity in the geometric space would then exhibit no morphological relation. Since different parameters of the word space model produce different results, comparing the resulting spaces could provide extra evidence for the conclusions drawn in favour of the root-based model, the word-based model, or both.
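The proposed test can be sketched in a few lines. The sketch rests on several assumptions that are not part of the proposal itself: cosine similarity as the proximity measure, a ready-made lookup from words to roots (in practice this would have to come from a root lexicon or a morphological analyser), and invented context counts over three abstract contexts c1 to c3. All function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items())
    norms = (math.sqrt(sum(c * c for c in u.values()))
             * math.sqrt(sum(c * c for c in v.values())))
    return dot / norms if norms else 0.0

def nearest_neighbours(target, vectors, k):
    """Return the k words whose vectors are closest to the target's vector."""
    others = [w for w in vectors if w != target]
    others.sort(key=lambda w: cosine(vectors[target], vectors[w]), reverse=True)
    return others[:k]

def root_sharing_rate(target, vectors, roots, k):
    """Fraction of the k nearest neighbours that share the target's root."""
    neighbours = nearest_neighbours(target, vectors, k)
    if not neighbours:
        return 0.0
    shared = sum(1 for w in neighbours if roots[w] == roots[target])
    return shared / len(neighbours)

# Invented context counts, not corpus data.
vectors = {
    "kitaab": Counter({"c1": 4, "c2": 1}),
    "kutub": Counter({"c1": 3, "c2": 1}),
    "kaatib": Counter({"c1": 1, "c2": 3}),
    "jamal": Counter({"c3": 5}),
    "jamaal": Counter({"c2": 2, "c3": 1}),
}
# Hypothetical root lookup for the same words.
roots = {"kitaab": "k-t-b", "kutub": "k-t-b", "kaatib": "k-t-b",
         "jamal": "j-m-l", "jamaal": "j-m-l"}

print(nearest_neighbours("kitaab", vectors, 2))
print(root_sharing_rate("kitaab", vectors, roots, 2))
```

Averaged over many target words in a real corpus, a consistently high rate would speak for assessing similarity at the root level, whereas a low rate would speak for the word level. The choice of k and of the model parameters would need to be varied, as noted above.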

The hypothesis for such a piece of research could be formulated as follows: if the consonantal root indeed plays such a central role in Arabic lexical semantics, then it can justifiably be assumed that the words most semantically similar to a target word will be morphologically related to its consonantal root. The word-based version of this hypothesis is straightforward and goes as follows: if words in Arabic are semantically independent of their roots, as advocates of the word-based view argue, then it can justifiably be assumed that the words most semantically similar to a target word need not be morphologically related to its consonantal root. However, since these two hypotheses are two sides of the same coin, either will do.

The choice of this mathematical technique to resolve this problem is dictated by its strong scientific basis, in that it makes no a priori assumptions about what is being evaluated and that it is free of human intervention. In fact, it reflects an unbiased position towards these morphological matters. Which of the two views is more adequate is a matter of experiments to be carried out using a reliably huge corpus.

I believe that applying the technique is innovative and that it will offer ‘objective’ empirical evidence for this root-based versus word-based debate. I also believe that the settlement of the issue can make an important contribution to many Arabic natural language processing tasks, especially information retrieval and term extraction.


4. Transcription

In this section, I present the transcription system adopted for the Arabic letters used in the discussion. For ease of reading, I have used the Latin graphemes whose sounds are most similar to the Arabic sounds, with the exception of the voiced pharyngeal fricative, which has no close Latin equivalent and for which I have resorted to the IPA symbol. Each of the two tables below consists of three columns. The left column contains the Arabic letters. The middle column describes the Arabic sounds in terms of their place and manner of articulation as well as their phonation, and the right column gives the corresponding Latin symbols. The sound descriptions are provided to help non-Arabic speakers distinguish between similar symbols such as s and S, especially as Arabic script has no letter case.

Please note that the first table is devoted to consonants and the second to vowels. Not all consonants are listed, as the focus is on the sounds used in the discussion above. It is also worth noting that the palatal approximant and the long close front unrounded vowel are written with the same letter in Arabic, which is why that letter is listed in both tables. The same holds for the labio-velar approximant and the long close back rounded vowel.


Arabic letter   Features                                  Transcription
ب               Voiced bilabial stop                      b
د               Voiced alveolar stop                      d
ض               Voiced emphatic alveolar stop             D
ف               Voiceless labiodental fricative           f
ح               Voiceless pharyngeal fricative            H
ج               Voiced palato-alveolar sibilant           j
ك               Voiceless velar stop                      k
ل               Velarized alveolar lateral approximant    l
م               Bilabial nasal                            m
ن               Alveolar nasal                            n
ق               Voiceless uvular stop                     q
ر               Alveolar trill                            r
س               Voiceless alveolar fricative              s
ش               Voiceless palato-alveolar sibilant        S
ت               Voiceless alveolar stop                   t
ث               Voiceless interdental fricative           th
و               Labio-velar approximant                   w
ي               Palatal approximant                       y
ز               Voiced alveolar fricative                 z
ع               Voiced pharyngeal fricative               ʕ




Arabic vowel   Features                         Transcription
َ               Open front unrounded             a
ُ               Close back rounded               u
ِ               Close front unrounded            i
ا              Long open front unrounded        aa
و              Long close back rounded          uu
ي              Long close front unrounded       ii



By way of conclusion, meaning is more complex than root-based approaches take it to be. For them, meaning can always be sought in roots, and in the case of derivation, the meaning of the derived word can simply be calculated as the meaning of the root plus the meaning of the affix in question. Word-based views are somewhat more sensitive to the complexity of meaning, but they still hold that meaning is to be found in words.

In this article, I have attempted to present some aspects of these two approaches to Arabic morphology and have proposed using word space models to judge which of them is more adequate in terms of semantic similarity. The proposal is at an early stage and can be developed further by research that benefits from the insights of the contextual conception of meaning. The outcome of such research may favour one of the two stands, support both, or reject both, thereby opening a new horizon for a new view of Arabic morphology and semantics.



