From Defining To Semi-Automated Detecting (AN) Two-Word Compounds. A Plan To Enrich A Database of Neologisms

1. Introduction
The continuum from phrasal structures (i.e. free phrases, collocations, fixed expressions or idioms) to multi-word
compounds and/or one-word compounds has been a focus of scholarly interest over the past decades. In
numerous studies, the very same Adjective (A) + Noun (N) structures (henceforth: AN structures) are considered a
type of syntactic[1] or morphological structure, or they are investigated under the umbrella term multi-word
expressions (MWE)[2], thus circumventing the distinction between (morphological) compounds and/or syntactic
constructs. In other studies, an enhanced lexical autonomy is assigned to AN structures by employing specific
morpho-syntactic and/or semantic criteria.[3]
In this contribution, we discuss the need for, and the possibility (as well as the limits) of, discriminating between
two-word compounds and two-word non-compounds. Besides the theoretical importance of a possible distinction
between words and expressions/phrases – a demarcation which lies at the core of the notion of ‘word’ – there is also a
practical need for morphologists and lexicologists to define the limits of a word as a tangible linguistic notion when it
comes to the search for novel formations (neologisms). Neologisms refer, by definition, to new words, while it is rather
disputable whether expressions can indeed exhibit neologicity. In this sense, the question arising is:
under which criteria could an MWE be considered a word (multi-word compound), and thus be classified as a neologism
or not?
Although we recognize the enormous difficulty of operationalizing this distinction among linguistic units, which
rather seem to form a continuum, we will try to shed some light on the issue by following two directions,
which could function in a complementary manner:
A. On a core linguistic (morphosyntactic and semantic), qualitative level.
B. On a corpus linguistic, mostly quantitative level (mainly through corpus analysis).
In the following section, an analysis of MWEs will be conducted, according to the first exploratory axis.
2.1 Previous research
In Anastassiadi-Symeonidi’s comprehensive study (1986) on ‘lexical units’ (for multi-word compounds), the
distinction between syntactic phrases and ‘lexical phrases’ is based on phonological, morphological, syntactic, semantic,
as well as functional criteria. In the case of AN lexical phrases the most decisive feature seems to be the use of pseudo-
adjectives or else relational adjectives, as opposed to qualitative adjectives (1986: 138-43, 148).
More recent studies follow rather onomasiological approaches: Gaeta & Ricca (2009), studying mainly Italian
compounds, as well as compounds from other languages, underline the ‘naming force’ of lexical compounds (p. 38). Their analysis
of compounds’ properties is based on the [±morphological] and [± lexical] parameters and separates the morphological
from the lexical parameter, since there are many ad hoc, purely context-dependent one-word-constructs, especially in
languages like German, which nevertheless do not have the status of a lexical item (e.g. the German ad hoc one-word
structure Entscheidungsort ‘decision place’, p. 37).
Booij (2013) adopts a syntactic interpretation for the Greek AN multi-word compounds, proposing the term
‘syntactic compounds’ for a subset of them as opposed to ‘morphological compounds’ such as nixtopuli ‘night bird’,
mainly by adherence to the criterion of (non) internal inflection (Principle of Lexical Integrity).[4] Nevertheless, Booij
(2009), in his constructionist –and onomasiological– analysis, comes up with a proposal according to which “The Greek
and Dutch A+N phrasal names discussed above need to be listed in the lexicon, even though they are formed according
to the rules or schemas of phrasal syntax, if they are the conventional names for certain concepts” (Booij 2009: 235, see
also Booij 2013: 31).
1 For Greek see Booij (2013: 181). See also below.

2 For Greek see Mpakakou-Orphanou (2005: 70). For the ‘confusion’ on terminology see Mini & Fotopoulou (2009: 491).
3 For Greek see Anastassiadi-Symeonidi (1986), Ralli (2007, 2013), Christofidou (2012). For English see Lieber (2005) among others. For further
discussion on these perspectives see Christofidou (2012).

4 Booij (2013: 181-2).
Ralli & Stavrou (1998) further examine AN structures with relational adjectives and discriminate between AN
syntactic constructions (constructs) such as theatriki kritiki ‘drama review’ and morphological formations (compounds)
such as pedhiki xara ‘playground’, employing mainly syntactic and semantic criteria[5], since one shared feature
between the so-called morphological multi-word structures and one-word compounds is the loss of their compositional
meaning.[6] Ralli (2007) approaches the phenomenon as a continuum (p. 245) and proposes a similar set of
formal/syntactic criteria by adopting the term ‘loose multi-word compounds’. She further proposes that the constructs
with argument-denoting or ‘taxonomic’ modifiers (see also Ralli 2013) should not be regarded as loose multi-word
compounds (phrasal compounds in Ralli 2013) but as compound-like phrases (or constructs in Ralli 2013).
In Table 1 the main criteria for syntactic autonomy of multi-word (‘loose’) compounds, employed by Ralli 2007,
are given (see also 2013, cf. Anastassiadis-Symeonidis 1986). According to Ralli’s analysis the asterisk signifies
unacceptability and therefore the AN structures involved denote loose/phrasal compounds whilst the structures in bold
are not compounds (cf. 1', 2', 3').
1. inseparability: *to uranio, kirios, tokso [the celestial, mostly, bow] ‘mostly the rainbow’ vs. i theatriki, kirios,
kritiki [the drama, mostly, review] ‘mostly the drama review’; *i pedhiki i xara [the play the ground] ‘the
playground’ vs. i theatriki i kritiki [the drama the review] ‘the drama review’
1'. But: i dhimosii, kirios, ipalili me enoxlun [the civil, mostly, servants bother me]
to steghno to katharizma ine pio asfales ‘[the] dry [the] cleaning is more safe’
2. non-modification: *kirios uranio tokso [mostly celestial bow] ‘mostly rainbow’ vs. kirios viomixaniki zoni
[mostly industrial area]
2'. But: ston kirios trito kozmo simveni na… [in mostly third world happens to]
3. fixed order: *xara pedhiki [ground play] ‘playground’ vs. kritiki theatriki [review drama] ‘drama review’
3'. But: ipalilos dhimosios dhen ise [servant civil you are not] ‘you are not a civil servant’
4. inability to use the modifier in a predicative position: *to tokso ine uranio [the bow is celestial] (for rainbow)
vs. i kritiki ine theatriki [the review is theatrical] (for drama review)
4'. But: to katharizma itan steghno [the cleaning was dry]
Table 1. Main criteria for syntactic autonomy of AN multi-word compounds in the literature (1, 2, 3, 4) and
counter-examples (1', 2', 3', 4')
Christofidou (2012), on a par with Ralli & Stavrou (1998) (and Ralli 2007, 2013), attempts to further discriminate between
multi-word compounds (Ralli: loose/phrasal compounds) and multi-word compound-like structures (or constructs in
Ralli 2013) by assuming that there is a kind of qualitative difference between them. Nevertheless, Christofidou (2012)
challenges the criteria for ‘syntactic autonomy’ proposed by Ralli (2013: 247, cf. Ralli & Stavrou 1998, Ralli 2007): although
important, they are neither sufficient nor strict enough for the distinction between multi-word compounds and
multi-word compound-like structures (‘constructs’). For instance, concerning the distinction between the multi-word
compounds tritos kozmos or uranio tokso vs. multi-word ‘constructs’ like atomiki vomva or theatriki kritiki, Christofidou
(2012) argues that many of the ‘compounds’ could be used syntactically in the same way as the ‘constructs’. There
is no reason to accept the string i theatriki i kritiki or the string i viomixaniki, kirios, zoni (which according to Ralli 2007
are acceptable) but not the strings to steghno to katharizma or i dhimosii, kirios, ipalili (see Table 1), which are
unacceptable according to Ralli’s analysis.[7]
According to most recent approaches (Ralli 2013, cf. 2007, Booij 2013, Gaeta & Ricca 2009, Christofidou 2012):
a. The structures under discussion are not morphological units / ‘objects’ (Ralli 2013: 250) (since they have
internal inflection) but they are “phrasal objects bearing an atomic status”, so they constitute specific syntactic
units with lexical function (see also Booij 2013[8], Gaeta & Ricca 2009).
b. The notion of word can be formally diverse within the same language (e.g. black bird and eagle, uranio tokso
‘rainbow’ and sinefo ‘cloud’) as well as among the languages of the world. For example, German or Dutch ad
5 Ralli & Stavrou (1998: 244-9).

6 This criterion is crucial even in the case of formations comprising the same relational adjective (cf. the morphological expression pedhiki xara
‘playground’ vs. the construct pedhiki parastasi ‘performance for/of children’, ibid. (244, 258).
7 Most examples in Ralli and Christofidou belong to rather emphatic contexts.
8 They function as names for concepts.
hoc one-word compounds could have phrase status ([+morphological, –lexical] according to Gaeta & Ricca
2009) and they are not necessarily lexical units (Booij 2009[9], 2013, Downing 1977[10]).
The above discussion suggests that neither morphological nor syntactic analysis alone can account sufficiently for
the phenomenon. Therefore, for the discrimination between multi-word compounds, multi-word constructs and
other multi-word structures, the criteria should be primarily semantic on the
qualitative/theoretical level (for the quantitative level see section 3).
Within a semantic analysis the discussion could be based either:
• on onomasiological approaches (e.g. Štekauer 2005, see also Gaeta & Ricca 2009: 38, Booij 2009, 2013,
Mpakakou 2005)
or
• on more analytical semantic approaches, like semantic transparency (Dressler 2006, Gries 2015) and/or
compositionality.
Although onomasiological approaches are quite appealing, mostly for ad hoc or nonce formations (see
Christofidou 1994: 24-31), and we do not deny the importance of the naming function as a criterion, we think it cannot
be the main one, since there are difficulties:
a. in defining intersubjectively whether/when a structure has a naming function or not, mostly when it comes to
verbs or two-word adjectives (e.g. orthio xiliometro [standing kilometer] ‘very tall’)
b. in showing that multi-word compounds exclusively, and not other multi-word expressions, exhibit a naming
function. According to Booij, an adherent of the onomasiological approach: “In conclusion, phrases can
function as names, and there is no simple one-to-one correlation between the form of a linguistic expression
and its function as name or description” (Booij 2009: 224, cf. Downing 1977),
c. the proponents of onomasiological approaches rarely explain how the naming function criterion could be
applied. It is worth mentioning that Körtvélyessy, Štekauer & Zimmermann (2015: 85ff) select semantic
transparency as criterion for their onomasiological approach, which brings them back to more analytical
semantic approaches.
We propose (following Christofidou 2012)[11] to further enrich the natural parameter of semantic transparency
(within Natural Theory, Dressler 1987, 2006 among others), a notion closely connected to compositionality. Frege’s
well-known principle has been broadly employed over the decades for discriminating between compounds and other
free structures. Since the terms compositionality and transparency have been (over)used in various ways, we will
also employ the notions of predictability (Panagl 1987: 133 and Gries 2015: 136) and motivation, a notion similar to
analyzability (Langacker 1987: 298, 2013: 61, Bybee 2010: 45, cf. Dressler 1987: 114[12]), in order to achieve a clearer
semantic and linguistically differentiating description.
2.2 Motivation, analyzability, predictability and transparency
Among Natural Theory linguists, Panagl (1987) equates the notion of transparency with predictability (see also
Gries 2015, cf. Dressler 2005: 271), which to our mind should be distinguished from analyzability (Langacker 1987)
and/or morphosemantic motivation (Dressler 1987, 2005). As Bybee (2010: 45) points out, many expressions or idioms
may be non-compositional but analyzable, since “analyzability would include the language user’s recognition of the
individual words and morphemes of an expression as well as its morphosyntactic structure”. For example, the meaning
of ilektroniko taxidhromio (‘e-mail’) can be motivated/analyzed by the semantics of its components (‘electronic
exchange of messages/letters’) but its meaning is not uniquely predictable on a semantic level (but cf. Dressler 2005:
271), since it could also mean e.g. ‘an electronically organized post office’. In this sense, we propose to discriminate
between two aspects of the transparency of a multi-word structure: a. the semantic motivation of its parts/components and
9 Booij (2009: 224): “The use of appositional constructions without determiners and prepositions for names can also be observed in Dutch personnel
advertisements. Dutch compounds are right-headed, and hence the following left-headed expressions must be seen as phrases with a special syntax
used for coining names for particular jobs (source NRC-Handelsblad 28 Sept. 2008): senior-adviseur installatietechniek ‘senior adviser equipment
technics’ manager kredietrisicomanagement ‘manager credit risk management’ sectormanager infrastructuur ‘sector manager infrastructure’.
10 As early as 1977, Downing discussed the problem posing the question how the string ‘apple-juice seat’, referring to a chair with an apple-juice
package on it, could be treated linguistically.

11 cf. Ralli & Stavrou (1998) on Greek.
12 Dressler 1987, p. 114: “the meaning of the complex word should be invariantly related to the whole form, irrespective of semantic or
morphological compositionality (motivation). And this allows semantic opacity”.

b. the predictability of the meaning as a whole. The meaning of most AN structures can be analyzed, so that the semantic
motivation of the parts can be recognized, but it is not always uniquely predictable. Nevertheless, if a structure is
semantically predictable, its analyzability/motivation is presupposed. Thus, the key notion for compoundhood seems to
be context-free non-predictability, since the important question about compounds lies not in recognizing
the contribution of the parts to the whole but in recognizing which aspect of the (metaphorical) meaning of the
part is activated within the compound as a whole.[13]
In our analysis, we will employ the notion of semantic transparency, discerning two kinds of transparency:
motivational[14] transparency and predictable transparency. The transparency that enables recognition of the semantic
motivation of the compound components will be called motivational transparency, whilst the transparency which
not only enables recognition of the components’ motivation but also predicts the actual (not potential!) meaning of the
compound, and only that, will be called predictable transparency[15].
In this sense, we hypothesize that the non-predictability of a structure’s meaning (from its parts), i.e. the semantic
inseparability of the components, could be the main criterion for the discrimination between multi-word compounds and
other multi-word structures, since morphosyntactic inseparability is a more relative criterion (as we have shown in 2.1).
Structures whose meaning is predictable from the meaning of their parts are closer to phrases, while the meaning of
multi-word compounds should not be uniquely predictable, i.e. the compound should be semantically inseparable,
even if motivated.
Nevertheless, it is broadly accepted that the phenomenon of compoundhood is of a rather gradient nature, thus
we can assume that a prototypical analysis within Natural Morphology (Christofidou 2012, see also Dressler 2006 and
Christofidou 1994: 63-82) would be most appropriate to capture and describe the notion of compoundhood.
2.3. A prototypical approach to AN compounds within Natural Morphology
Natural Theory adopts the gradient nature of most linguistic phenomena, like the continuity of derivation and
inflection (Dressler 1987, Christofidou, Doleschal & Dressler 1989). We assume that at the core of our investigation into
identifying AN compounds should primarily be the search for the properties of the one-word compound, i.e. of
compoundhood in Greek (cf. Dressler 2006: 43-44). From our previous analysis it seems that the main
prototypical property of compoundhood should be inseparability, both formal and semantic.
• formal (morphosyntactic) inseparability (morphonological unity and syntactic autonomy)

prototypical: one-word compounds with morphonological unity[16], i.e. one accent in Greek, no internal inflection
(arxondospito ‘house of rich people’) / less prototypical: multi-word compounds, i.e. only syntactic autonomy (omorfi
pedhiki xara ‘nice playground’ but *pedhiki omorfi xara)
note:
peripheral cases of morphological unity panepistimiupoli ‘university campus’, Aleksandrupoli (a Greek city) etc.
peripheral cases of (non) syntactic atomicity, e.g. i dhimosii, kirios, ipalili ‘the civil, mainly, servants’ etc.
• semantic inseparability (non-predictable, from the component parts, meaning / holistic meaning):
prototypical: non-motivated and non-predictable compounds/ less prototypical: motivated but non-predictable
compounds (see Table 2.)
peripheral case of predictable compounds (in series formations with classifying adjectives), see Table 2
and in a secondary prototypical hierarchy (see Dressler 2006):
prototypical: non-motivated for both components / less prototypical: non-motivated only for the head and even less
prototypical only for the modifier (see also Libben 1998)
13 In the German classic example Putzfrau (‘cleaning lady’) it is the durative and classifying meaning of the first compound component that is
selected, i.e. not every cleaning woman is a Putzfrau but the woman who does cleaning as a job (Herbermann 1981, Kastovsky 1982, 2005).
14 Henceforth we will use only the term motivation, since our theoretical argumentation stays within Natural Theory approaches while analyzability
comes from the cognitive theoretical paradigm.

15 Although the notion is similar to the notion of compositionality we prefer this term in the sense we analyzed above.
16 In Greek two-word compounds (i.e. non-morphological compounds) cannot be as prototypical as one-word compounds.
According to the prototypical property of semantic inseparability (for formal inseparability see the discussion in
section 2.1), the following primary and secondary (see a, b, c, d) hierarchy for unpredictable AN structures emerges,
as shown in Table 2[17]:
I. Prototypical hierarchy of (AN) compoundhood according to motivational and predictable transparency
Semantically non-predictable AN compounds
a. Non-literally motivated – thus non-predictable (non-literal meaning of both components)

xrisi tomi ‘golden mean’
prasino fos ‘green light’
skiodhis kivernisi ‘shadow government’
vathi kratos ‘deep state’
b. Non-literally motivated - thus non-predictable (non-literal meaning of the head only)

pedhiki xara ‘playground’
uranio tokso ‘rainbow’
kratiki mixani ‘state machinery’
ekdhotikos ikos ‘publishing house’
c. Non-literally motivated - thus non-predictable (non-literal meaning of the modifier only)

mavri aghora ‘black market’
petrina xronia ‘a tough period in a man’s (or country’s) life’
tritos kozmos ‘third world’
d. Literally motivated but still non-(uniquely) predictable (literal meaning of both the modifier and the head;
mostly semantic narrowing)
emboriko kendro (‘mall’ – not the shops in the city center)
kendriki aghora (‘the main city food-market’ – not every market in the center of a city)[18]
aghrotiko iatrio (‘provincial medical center’)
II. Peripheral compound-like structures
Semantically motivated and predictable (mostly (non-qualitative) classifying adjectives)
theatriki/musiki/kinimatoghrafiki… kritiki ‘drama/music/movie…review’

iliaki/thermiki/piriniki/atomiki/ kimatiki…energhia ‘solar/thermal/nuclear/atomic/wave…power or energy’
elastiko/plires/miomeno orario ‘flextime/full-time/part-time…work’
piriniki/thermopiriniki/ orologhiaki/atomiki …vomva ‘nuclear/thermonuclear/time/atomic…bomb’
ikistiki/viomixaniki/emporiki zoni ‘residential/industrial/commercial…zone’
steghastika/epixirimatika/katanalotika…dhania ‘housing/business/consumer…loans’
dhiethnes / emboriko / astiko / dhiikitiko… dhikeo ‘international/commercial/civil/administrative…law’
Table 2. Prototypical hierarchy of (AN) compoundhood
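The hierarchy in Table 2 can be read as a small decision procedure over three judgments per structure. The following Python sketch is only an illustration under that reading: the class `ANStructure`, its field names and the function `classify` are hypothetical names introduced here, and the judgments themselves (literal vs. non-literal component meaning, predictable vs. non-predictable whole) still have to be supplied by hand by a linguist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ANStructure:
    """A hand-annotated AN structure (all field names are illustrative)."""
    form: str
    modifier_literal: bool  # modifier keeps its literal (dictionary) meaning
    head_literal: bool      # head keeps its literal (dictionary) meaning
    predictable: bool       # meaning of the whole is uniquely predictable


def classify(s: ANStructure) -> str:
    """Return the Table 2 label: 'a', 'b', 'c', 'd' (compounds) or 'II' (peripheral)."""
    if s.predictable:
        # Predictability presupposes motivation of both parts (section 2.2).
        if not (s.modifier_literal and s.head_literal):
            raise ValueError("a predictable structure must be literally motivated")
        return "II"
    if not s.modifier_literal and not s.head_literal:
        return "a"  # non-literal meaning of both components
    if s.modifier_literal and not s.head_literal:
        return "b"  # non-literal meaning of the head only
    if not s.modifier_literal and s.head_literal:
        return "c"  # non-literal meaning of the modifier only
    return "d"      # literal for both, yet not uniquely predictable


# Annotations follow the placement of these examples in Table 2.
examples = [
    ANStructure("xrisi tomi", False, False, False),
    ANStructure("pedhiki xara", True, False, False),
    ANStructure("mavri aghora", False, True, False),
    ANStructure("emboriko kendro", True, True, False),
    ANStructure("theatriki kritiki", True, True, True),
]
for ex in examples:
    print(ex.form, "->", classify(ex))
```

The sketch covers only semantic inseparability; formal (morphosyntactic) inseparability would need a separate set of judgments.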
17 It is important to note that only a compound whose component has kept its literal (conventional) meaning, as found in
dictionaries, can be considered semantically motivated; every non-literal (unconventional), metaphoric or metonymic use or
sense not found in dictionaries should be considered not semantically motivated.
18 cf. predictable structures like ipethria aghora ‘open market’, topiki aghora ‘local market’ etc.
The transparency of these compound-like structures is enhanced, to the point of becoming predictable, by the series
formation to which they belong: the formation schema enables a more unique meaning assignment to all members
of the schema, following a kind of product-oriented analysis (Bybee 1995) with a rather fixed semantic relation between
the compound components within this schema (see Christofidou 1994: 66-82, mainly 77-81, and 2001 for ‘series
formation’ within the same formation patterns, and Schreuder & Baayen 1997: 122 for ‘family frequency’).
Summarizing the previous analysis, we propose that the more inseparable (morphosyntactically autonomous and
semantically unpredictable) an AN structure is, the more prototypical it is concerning the property of compoundhood.
In proposing a gradient analysis based on prototypicality:
• first, we employ a hierarchically gradient approach instead of a continuum approach or a multi-level analysis[19].
The continuum approach remains rather vague, and the level-centered analysis is often confronted with the problem
of transitional cases;
• second, we propose a more unifying approach for the type of compound-like AN structures with classifying
adjectives (whose status is controversial), placing them at the periphery of our prototypical
hierarchy.
In the second part of our study (sections 3 and 4) we test our main theoretical proposals on authentic data extracted
from sections of the Centre’s monitor newspaper corpus (cf. below). Both neologic and conventionalized AN candidate
compounds are analyzed with the aid of corpus linguistic methods and software, aiming at the exploration of further
criteria which could delineate on empirical grounds the behavior of neologic AN compounds, as compared to other AN
structures in the corpus.
3. Quantitative analysis
3.1. Related Work
Most researchers agree that it is highly important to represent MWEs adequately in the lexicon
and to build MWE pools/repositories accessible by NLP techniques. The task of automatic MWE detection attracts
significant attention both as an independent problem and as a subtask for resolving other issues including semantic role
labeling, word sense disambiguation, POS-tagging, parsing, machine translation, information retrieval, and so on. In
the last decades, several MWE encoding theories have been suggested and partially implemented, such as Lexical
Conceptual Structure (Jackendoff 1990), Generative Lexicon (Pustejovsky 1995), Lexical Functions (Mel’cuk 1996)
and FrameNet (Fillmore & Baker 2001).
Several attempts at parsing Greek MWEs have been made, mostly based on NLP techniques and
computational models. Maynard & Ananiadou (2000) presented the TRUCKS model. This model made use of different
types of information - syntactic, semantic, terminological and statistical - seeking particularly to identify those parts of
the context which are most relevant to terms. It identified, disambiguated and ranked candidate terms from an initial
corpus of sublanguage texts. Gotsoulia et al. (2007) proposed an initial network of Greek words and frame-semantic
descriptions that would reliably contribute to the multilingual dimension of the FrameNet; their methodological issues
related to the development of a Greek lexical resource were based on the theory of Frame Semantics and supported by
corpus evidence. Michou & Seretan (2009) presented a tool for extracting multi-word expressions from corpora in
Modern Greek, which was used with a parallel concordancer to augment the lexicon of a rule-based machine translation
system. The tool was part of a larger extraction system that contained the Greek parser, its lexical database, the extraction
and concordancing system. Fotopoulou et al. (2009) introduced an approach to the development of an algorithm for the
automatic detection of MWEs.
Furthermore, there are studies on Greek MWE detection that also take advantage of specific morpho-semantic
and/or syntactic criteria. Ananiadou & Zervanou (2004) introduced automatic detection of MWE terms with NLP
techniques. More specifically, they reviewed statistical indexing techniques such as FASIT, linguistic
methods that use language pre-processing tactics and text annotations (LEXTER, EMPathIE, PASTA), hybrid methods
that exploit the advantages of the aforementioned methods (TERMS, ACABIT, TRUCKS) and machine learning
19 See e.g. Foufi’s (2010: 656-657) proposal on fixed and semi-fixed multi-word compounds and cf. with our proposal on peripheral compound-
like structures.
techniques. Linardaki et al. (2010) also employed morpho-syntactic criteria, focusing on nominal MW
constructions, towards a dictionary of Multiword Expressions for Greek built via automatic extraction and human validation.
They investigated the use of a knowledge-poor statistical approach based on specific association measures. Their manual
evaluation showed that the automatic approach performs well enough to help in the construction of a lexical resource.
Samaridi & Markantonatou (2014) reported an effort to integrate verb MWEs into an LFG grammar of Modern Greek based
on an annotated text. Their lemmatized, tagged texts were fed to an MWE filter and the output was formatted to feed an
LFG/XLE grammar that had been developed independently.
The challenges that MWEs pose for Greek are still a field of ongoing research. Nevertheless, none of
the aforementioned studies attempted to deal specifically with multi-word neologisms, an even more
complex task within an already demanding linguistic area.
3.2 ΝΕΟΔΗΜΙΑ: Collecting, cataloguing and analyzing neologisms
ΝΕΟΔΗΜΙΑ is a research programme of the Research Centre for Scientific Terms and Neologisms [ReCSTeN] at
the Academy of Athens, designed to accomplish the tasks of (semi-automated) detection, linguistic analysis and
monitoring of Greek Neologisms and Terminology (see Christofidou et al. 2013, Christofidou, Karasimos &
Afentoulidou 2014 on the architecture of the database and on lexicographical, morphological, textlinguistic, webometric
levels of analysis and queries). The system focuses on the identification of one-word neologisms; candidate lists of
multi-word formations are provided for manual inspection only for n-grams whose words are connected with dashes. Any
further extension of the system to cover the identification of multi-word novel formations would require a solid
extraction/identification strategy taking into consideration two parameters that are fundamental to our research on Greek
neologisms: the lexicological, which can be paraphrased as ‘identify multi-word sequences which have word status,
i.e. build a system able to make intelligent guesses about multi-word compounds’, and the neological, i.e. ‘identify
multi-word compounds which fulfil the criteria for novelty, as predefined in our research’. For the purposes of the present
study, the two parameters are explored in detail, making use of data from the corpus component of ΝΕΟΔΗΜΙΑ, which consists
of newspaper articles continuously collected through web crawling from the online editions of five Greek newspapers,
Ethnos, H Kathimerini, ToVima, TaNea, ProtoThema (cf. below)[20].
3.3. From Bigrams to list comparisons and application of criteria
Since our newspaper corpus has so far not been fully linguistically annotated, our realistic goal
is to detect relatively high-frequency bigrams with the help of online corpus software (Sketch Engine[21]). For use with this
software, we selected two periods of our corpus (which contains metadata) in an untagged version: mid-Oct–Nov–Dec
2015 (sub-corpus 1) and Feb–Mar–mid-Apr 2016 (sub-corpus 2), the most recent versions of our corpus as of April 2016,
when the selection took place.
                    Tokens         Words
Sub-corpus 1
  Oct 2015        4,749,848     3,859,413
  Nov 2015        8,825,562     7,161,301
  Dec 2015        7,110,945     5,750,718
Sub-corpus 2
  Feb 2016        8,806,501     7,146,834
  Mar 2016       12,834,274    10,380,398
  Apr 2016        5,595,016     4,502,509
Table 3. Tokens, words and sentences in Sub-corpus 1 and 2
20 For issues on sampling, collection, and pre/post-processing of the corpus see Afentoulidou & Christofidou in press.
21 Available at https://www.sketchengine.co.uk
We created two lists of bigrams (one per sub-corpus). The more than 12,000 most frequent bigrams
contained a variety of structures (like N + N, Adj + N, V + ART + N)[22].
The absence of annotation and parsing for the sub-corpora does not allow us to design a specific extraction of AN
bigrams. However, a minimal “parsing” step, which excluded function words such as articles, prepositions and
conjunctions and ignored bigrams that are named entities, like
Evropaiki Enosi ‘European Union’, Nea Dhimokratia ‘New Democracy party’, Dhiethnes Nomizmatiko Tamio
‘International Monetary Fund’ etc., provided us with shorter lists.
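The filtering step described above can be sketched as follows. This is a minimal illustration, not the procedure actually run in Sketch Engine: the function name `candidate_bigrams`, the `min_count` threshold and the two tiny word lists are hypothetical, and the input is assumed to be a list of lowercased, transliterated tokens.

```python
from collections import Counter

# Hypothetical, deliberately tiny samples; a real run would need full lists.
FUNCTION_WORDS = {"o", "i", "to", "ton", "tin", "ke", "se", "me", "apo"}
NAMED_ENTITIES = {("evropaiki", "enosi"), ("nea", "dhimokratia")}


def candidate_bigrams(tokens, min_count=2):
    """Count adjacent token pairs and keep frequent ones that contain no
    function word and are not listed as named entities."""
    counts = Counter(zip(tokens, tokens[1:]))
    return {
        bigram: n
        for bigram, n in counts.items()
        if n >= min_count
        and not (set(bigram) & FUNCTION_WORDS)
        and bigram not in NAMED_ENTITIES
    }


tokens = ("to prasino fos ke to prasino fos apo tin evropaiki enosi "
          "ke tin evropaiki enosi").split()
print(candidate_bigrams(tokens))
```

Without part-of-speech tags such a filter cannot tell AN bigrams from N + N or other content-word pairs, which is why the resulting lists still require manual inspection.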
Applying the aforementioned qualitative criteria (section 2) as an extra filter, we came up with a (much shorter)
list of candidate AN compounds (neologic or not) which appeared in both subcorpora. In Table 4 we give some examples
of candidate AN compounds with their relative frequency in the first and in the second period:
                                                   Sub-corpus 1:        Sub-corpus 2:
                                                   Oct/Nov/Dec 2015     Feb/Mar/April 2016
proti katikia[23] ‘main residence’                 56.19 per million     9.77 per million
kokina dhania ‘past due loans’                     56.10 per million    37.79 per million
mesos oros ‘average’                               56.00 per million    50.80 per million
dhiefthinon simvulos ‘chief executive’             50.48 per million    60.40 per million
prosfighikes roes ‘refugee flows’                  44.58 per million    75.20 per million
aksiomatiki andipolitefsi ‘main opposition’        43.57 per million    47.18 per million
fisiko aerio ‘gas’                                 39.21 per million    28.85 per million
kinoniki dhiktiosi ‘social networking’             28.87 per million    31.10 per million
mistikes ipiresies ‘secret service’                24.03 per million    17.48 per million
kini ghnomi ‘public opinion’                       23.50 per million    28.80 per million
prasino fos ‘green light’                          22.39 per million    16.74 per million
ethniki odhos ‘national road’                      19.44 per million    60.80 per million
ghenikos ghramateas ‘general secretary’            18.00 per million    22.59 per million
kokini ghrami ‘red line’                           13.01 per million    15.35 per million
odhikos xartis ‘road map’                          11.17 per million     7.19 per million
xorika idhata ‘territorial sea’                     5.56 per million    14.91 per million
ithiko pleonektima ‘moral advantage’                4.04 per million     5.00 per million
Table 4. Examples of candidate AN compounds from the two subcorpora and their relative frequency, in descending
order of frequency in subcorpus 1
According to our proposal (see section 2), the AN structures presented in Table 4 should be considered
compounds, since they are semantically non-predictable, except for prosfighikes roes and ithiko pleonektima. These
two structures are interesting cases: their components co-occur quite frequently in newspaper discourse, but we would be
reluctant to label them compounds, because they are semantically predictable. For such cases in particular, the following
quantitative analysis will be most helpful.
After filtering the candidate AN compounds out of the list on theoretical grounds, we need to face the issue of
discriminating between neologic and non-neologic AN compounds. Concerning the quantitative results, however, the
major issue is twofold: (a) some candidate multi-word neologisms are almost at the bottom of the bigram lists (e.g.
odhikos xartis), since they are recently introduced terms, and (b) their frequency measurements can also be misleading,
since they depend strongly on the current news in newspapers (topicality; see sections 5 and 6 below)24. For
example, the occurrences of kokina dhania decreased, those of prosfighikes roes increased, and posotiki xalarosi
(‘Quantitative Easing’) disappeared altogether in Sub-corpus 2. Such issues underline the need for combined research25
at the quantitative level, besides the qualitative one.
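The fluctuation between the two periods can be made explicit with a few lines of Python. This is only a minimal sketch: the per-million values are those reported in Table 4, whereas the helper names `per_million` and `change_ratio`, and the reading of ratios far from 1 as a possible sign of topicality, are illustrative assumptions rather than part of the procedure reported here.

```python
# Relative frequencies (per million words) as reported in Table 4:
# (sub-corpus 1, sub-corpus 2)
REL_FREQ = {
    "kokina dhania": (56.10, 37.79),
    "prosfighikes roes": (44.58, 75.20),
    "mesos oros": (56.00, 50.80),
}

def per_million(raw_count, corpus_size):
    """Normalise a raw count to occurrences per million running words."""
    return raw_count * 1_000_000 / corpus_size

def change_ratio(freq_p1, freq_p2):
    """Ratio of period-2 to period-1 relative frequency (1.0 = stable)."""
    return freq_p2 / freq_p1

for structure, (p1, p2) in REL_FREQ.items():
    # Values well below or above 1 point to topic-driven fluctuation
    print(f"{structure}: x{change_ratio(p1, p2):.2f}")
```

A ratio close to 1 (as for mesos oros) indicates a stable item, while ratios well below or above 1 (kokina dhania, prosfighikes roes) reflect the fluctuation discussed above.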
For the needs of the quantitative analysis we selected ten AN structures from our short list (see examples in Table
4) according to the following criteria: the first eight are expected, according to our theoretical proposal, to exhibit
statistical support for compoundhood. Four of them (in bold in Table 5) seem to be neologisms, since they are not listed
22 Or word strings without syntactic structure (V + COMP, COMP + ART, P + ART) etc.
23 In the measurements, all inflectional types of each multi-word compound are included.
24 Note the fluctuation of the relative frequency between the subcorpora.
25 See also Kilgarriff et al. 2015.
in the main Modern Greek dictionaries26. The last two structures (in Table 5 with question marks) are not expected to
gain statistical support as to their compound status, since they are still semantically predictable though very frequent (to
our native speaker’s intuition as well). The four compounds in bold kokina dhania, kinoniki dhiktiosi, odhikos xartis,
prasino fos are considered as candidates for neologic compounds, since they fulfil, besides the semantic criteria stated
above, the classic lexicographical criterion: they are not attested in the dictionaries.27
proti katikia ‘main residence’: semantically non-predictable modifier (it is not one’s first/main house, it is the house one declares to the state
as one’s main residence)
dhiefthinon simvulos ‘chief executive’: semantically non-predictable head (since the literal meaning of simvulos is ‘consultant’)
aksiomatiki andipolitefsi ‘main opposition party’: semantically non-predictable modifier (since the literal meaning of aksiomatikos-i-o is
‘based on axiom’)
kokini ghrami ‘red line’: semantically non-predictable compound meaning: ‘limit one cannot overstep’ (metaphorical use of both modifier
and head)
kokina dhania ‘past due loans’: semantically non-predictable modifier (metaphorical use of the adjective kokinos ‘red’)
kinoniki dhiktiosi ‘social networking’: semantically non-predictable head (the noun was not used before to designate human contacts for
fun using internet)
prasino fos ‘green light’: semantically non-predictable compound, meaning ‘permission’ (metaphorical use of both modifier and the head)
odhikos xartis ‘road map’: semantically non-predictable compound in the sense of ‘a plan for achieving a goal’ (mainly metaphorical use
of both modifier and head)
?prosfighikes roes ‘refugee flows’: predictably transparent, since the words prosfighikes and roes are used in their literal meaning as given
in the dictionaries.28
?ithiko pleonektima ‘ethical advantage’: predictably transparent, since both components are used in their literal meanings.
Table 5. Selected AN structures from Subcorpus 1 and Subcorpus 2 based on semantic criteria for further
quantitative analysis
4. Evidence for Compoundhood
We proceed to the quantification of our theoretical proposal in inferential statistical terms, through a corpus-based
approach. Research on the semi-automatic extraction of MWEs has consistently emphasized the decisive role
of the statistical association measures employed for the study of collocational phenomena in natural languages, in order to
both filter and rank candidate MWEs. The notion of collocation, i.e. the tendency of certain words to occur near each
other with a frequency far greater than chance, stands as an analytical intermediary between multiword status and raw
corpus data. The literature on collocations is vast; one point of convergence between the various operationalizations of the
term is that the concept can be employed empirically as a textual phenomenon, i.e. the links between pairs of words
emerge as recurrent patterns in the texts of language corpora, providing evidence for the predictability29 of the
connection between words. To this end, multiple lexical association measures of collocational strength (henceforth AMs)
have been proposed (Evert 2007, Gries 2015) to quantify the degree of attraction (or repulsion) between co-occurring
words (lexical collocations), or between words, syntactic patterns and constructions (grammatical collocations, colligations,
collocational frameworks, collostructions), by calculating, very roughly, how the observed distributions in the dataset
differ from the distributions expected if there were no association between them. Different AMs highlight different
aspects of collocativity as a function of statistical predictability and assign scores of association to word co-occurrences.
A mathematical distinction is made between (1) AMs that emphasize statistical significance, for instance the
Log-likelihood Ratio (G2) (cf. Dunning 1993), and (2) AMs that emphasize effect size, for instance the Mutual
Information (MI) formulas. Furthermore, some statistical procedures display a strong bias towards exclusive
word combinations (i.e. rare but strong co-occurrences), as opposed to methods that assign greater scores to frequent
co-occurrences with large observed frequencies (for instance, MI vs. t-scores, the latter promoting function words in
26 Dictionary of Modern Greek Language, Babiniotis and Dictionary of Koini Modern Greek, Triantafyllidis Foundation. Prasino fos was
lemmatized in phrases dino/perno to prasino fos (‘I give/take permission’) but we assume that it has recently gained a more autonomous word
status meaning ‘permission’ even in other contexts.
27 Some of them occur only in the Dictionary of the Academy of Athens, 2014, which contains a great amount of recent neologisms.
28 The Dictionary of Modern Greek Language, Babiniotis gives for the lemma roes as one of its main (with the indication general) meanings the
meaning of ‘continuous movement of news/ information/ passengers etc.’, since roes is fully conventionalized in this sense (e.g. xrimatikes roes
‘money flow’ etc.).
29 We should note that predictability of co-occurrence (and contingency) in the literature on word combinations is a statistical notion: when
word_n occurs, another word_(n+1) can be predicted (and vice versa), that is, collocates are more likely to co-occur. Thus, the term differs from
the notion of semantic predictability (cf. sections 2-3).
the ranked lists of collocates produced). There are, of course, measures that try to compensate for both the exclusivity
and the low-frequency bias (one of them is the Dice coefficient). The selection of one measure over another depends
on the research objectives, since no measure is considered better than the others and all of them have limitations30. For
instance, experimental research on the performance of different AMs (with frequency-based measurements as a baseline
for the comparison) suggests that “some AMs are much better suited for identifying some classes of collocations than
others” (Evert 2008, Evert & Krenn 2001, Krenn & Evert 2001)31. It is therefore advisable to examine linguistic phenomena
with multiple methods, since each of them can offer a different perspective on the phenomena under
investigation.
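The contrast between significance-oriented and effect-size-oriented measures can be illustrated with a short Python sketch. It uses the common textbook formulations of G2, MI, MI3, Dice and the t-score computed from a 2x2 contingency table; these are assumptions made for illustration and need not coincide with the exact formulas of any particular software package (which may, for instance, apply a correction for the collocation span). The function name `association_scores` and the counts in the usage line are hypothetical.

```python
import math

def association_scores(o11, f_node, f_coll, n):
    """Association scores for one (node, collocate) pair.

    o11    -- number of co-occurrences of node and collocate (must be > 0)
    f_node -- frequency of the node word
    f_coll -- frequency of the collocate
    n      -- sample size (e.g. number of tokens)
    """
    # Observed 2x2 contingency table
    observed = [
        o11,                        # node with collocate
        f_node - o11,               # node without collocate
        f_coll - o11,               # collocate without node
        n - f_node - f_coll + o11,  # neither of the two
    ]
    # Frequencies expected if node and collocate were independent
    expected = [
        f_node * f_coll / n,
        f_node * (n - f_coll) / n,
        (n - f_node) * f_coll / n,
        (n - f_node) * (n - f_coll) / n,
    ]
    e11 = expected[0]
    # Log-likelihood ratio: empty cells contribute nothing to the sum
    g2 = 2 * sum(o * math.log(o / e)
                 for o, e in zip(observed, expected) if o > 0)
    return {
        "G2": g2,                             # statistical significance
        "MI": math.log2(o11 / e11),           # effect size, favours rare pairs
        "MI3": math.log2(o11 ** 3 / e11),     # MI with cubed observed count
        "Dice": 2 * o11 / (f_node + f_coll),  # symmetric overlap of the two words
        "t": (o11 - e11) / math.sqrt(o11),    # favours frequent pairs
    }

# Hypothetical counts, for illustration only
scores = association_scores(o11=150, f_node=400, f_coll=300, n=1_000_000)
```

Because all five scores are derived from the same four cells, differences between the resulting rankings stem only from how each formula weights the observed against the expected frequency.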
4.1 Collocational behaviour of the noun heads
For the needs of this study we make use of the corpus analysis methods implemented in the WordSmith Tools
Software Package v. 7.0 (Scott, 2016, henceforth WST)32. More specifically, we follow the WST collocation
relationships procedure in order to compute AM scores for the ten target AN structures. Concordance lines are first
generated for the nouns (i.e. the node words) and lists of their collocates are retrieved (a) with a window span of
+/-5 words (to account for positional variability), (b) delimited by sentence boundaries, and (c) with a co-occurrence
frequency threshold of f ≥ 2. WST compares the lists of collocates for a given node with the WordList of the whole
corpus and computes the AM values. The collocation matrices can be rearranged to provide multiple rankings according
to different AMs, depending on the needs of the user (emphasizing either statistical significance or the strength of the
relationship between collocate and node). The program calculates seven AMs, three of which were judged relevant to our
research questions: the Log Likelihood Ratio (G2), reported as “the most appropriate and convenient measure” of
statistical significance in research on collocation (Evert 2007), Dice (Evert 2008), and MI3. The three measures create,
in our view, a ‘balanced’ collocational profile of the selected nouns: all of them take contingency information into
account and measure the relationships in both directions, as a whole (from w_a to w_b and from w_b to w_a). G2 ranks word pair
types by statistical significance, while MI3 promotes inferences on effect size (without the strong low-frequency bias of MI).
In our experiments, MI and t-scores introduced ‘noise’ into the collocate rankings, since the former gave high scores to
very infrequent modifiers (e.g. learned sequences, sequences with proper names etc.), whilst the latter ranked function
words at the top33. The exclusion of z-scores was compensated by the inclusion of the Dice coefficient. Dice and
log-Dice have a strong lexicographic orientation, and are quite promising for the extraction of terms and “rigid word
combinations with almost total association” (Evert 2007, Kilgarriff & Kosem 2012).
We used an enhanced XML version of our corpus which, besides annotation for the metadata categories common to all
XML versions (name of newspaper, date of publication, URL, title of the article), had two extra features vital to our
research questions: a filesystem architecture in which each file contains just one newspaper article rather than a
collection of articles per source per day (a requirement for the sound application of dispersion measures in section 5),
and information on text categories (a prerequisite for section 6). As processing large numbers of small files is
computationally demanding both for computers of medium capacity and for software applications, we piloted the enhanced
XML version only on the newspaper feed for which we had the most data, i.e. Proto Thema. The corpus was partitioned into
successive periods of time, following the distinctions initially made in section 3, but including all data up to the
latest additions (end of 2016), in order to gain a wider scope for the comparison of patterns across time:
30 The most important limitation being that of bidirectionality (Gries 2015, Levshina 2015), a more notable constraint, in our view, than the
assumptions of randomness that AMs make for a phenomenon (that is, linguistic behavior) that is not random at all, or of normality (or not) of
distributions. Almost all AMs are symmetrical in that “they do not distinguish whether word1 is more predictive of word2 or the other way round”
(Gries 2015: 139). Moreover, AMs cannot be straightforwardly adapted to word co-occurrences beyond the bigram.
31 See also Inger Lyse and Andersen (2012) regarding the extraction of MWEs for the purposes of the Norwegian Newspaper Corpus.
32 Both the Sketch Engine and WordSmith Tools are very powerful commercial corpus processing solutions. For the remaining sections of this
study, corpus analyses were conducted with the latter, because the program runs locally on a PC, can process up to 2 billion TEXT/XML files per
corpus without concatenating them in one document, offers enhanced tokenization options as compared to the Sketch Engine and implements
multi-word extraction utilities, as well as diachronic analysis through time-lined dispersion statistics. With the Sketch Engine, web access is a
prerequisite. Moreover, as far as the analysis of Greek corpora is concerned, one cannot make immediate use of the two tools most relevant to our
research, Word Sketches and Trends.
33 The Log-Likelihood Ratio is also expected to rank frequent grammatical words at the top, but WST’s formula implements a “correction for
collocation span” proposed by Stefan Evert (Mike Scott, personal communication, cf.
http://lexically.net/downloads/version7/HTML/index.html?formulae.htm).
Proto Thema                   Tokens (running words excluding punctuation)    Number of Newspaper Articles
P1: Oct-Nov-Dec [34]          6.072.678                                       20.458
P2: Jan-Feb-Mar-Apr           9.669.920                                       30.935
P3: May-June-Jul-Aug          9.015.884                                       30.450
P4: Sept-Oct-Nov-Dec          10.963.664                                      34.564
Total                         35.752.146                                      116.407 [35]
Table 6. Details of the pilot corpus
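The retrieval settings described in section 4.1 (a span of +/-5 words, sentence boundaries as limits, a co-occurrence threshold of f ≥ 2) can be sketched in Python as follows. The sketch is not a description of how any corpus tool is implemented; it assumes that the corpus has already been split into tokenized sentences, and the function name `window_collocates` and the toy sentences are hypothetical.

```python
from collections import Counter

def window_collocates(sentences, node_forms, span=5, min_freq=2):
    """Collect collocates of a node within +/-span tokens of the same sentence.

    Returns {collocate: (total, left, right)} for collocates whose total
    co-occurrence frequency reaches min_freq.
    """
    left, right = Counter(), Counter()
    for tokens in sentences:  # one sentence at a time: windows never cross boundaries
        for i, tok in enumerate(tokens):
            if tok not in node_forms:
                continue
            for w in tokens[max(0, i - span):i]:   # up to `span` tokens to the left
                left[w] += 1
            for w in tokens[i + 1:i + 1 + span]:   # up to `span` tokens to the right
                right[w] += 1
    total = left + right
    return {w: (c, left[w], right[w]) for w, c in total.items() if c >= min_freq}

# Toy input, invented for illustration
toy = [
    ["ta", "kokina", "dhania", "afksanonde"],
    ["nea", "kokina", "dhania"],
    ["dhania", "kokina"],
]
print(window_collocates(toy, {"dhania"}))
```

The three values returned for each collocate correspond to a total, a left-hand and a right-hand co-occurrence count, i.e. the kind of information needed to inspect positional preferences alongside raw frequency.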
From a close examination of the lists of the top-10 collocates for each target node, the following observations can
be made:
(1) The adjectival collocates forming the target AN structures, which were manually selected from the word-bigram
list on the grounds of human judgement and according to the theoretical proposal, came up in the very first
position in the extensive collocate list of every noun node as far as the Log-Likelihood statistic is concerned (see
Table 7 below). This pattern was replicated across the other two measures with 100% consistency.
(2) The lists were fairly consistent across the three AMs, providing empirical validation for a core of primary
collocates surrounding each node in the specific genre.
(3) The Log-Likelihood scores, as expected, were highly significant, surpassing the 15.13 critical value for p < 0,0001
by thousands36; i.e. for the top collocates there was more than enough evidence to reject the hypothesis of
independence, i.e. that there is no association between the combined word types. Similarly, MI3 and
Dice scores were many times above the values recommended in the WST default thresholds. Therefore, the test values
per se exhibit statistical soundness both in terms of significance and effect size.
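The relation between a Log-Likelihood score and its p value can be approximated outside the concordancing software. The sketch below rests on the standard assumption that G2 for a 2x2 table is asymptotically chi-square distributed with one degree of freedom, in which case the upper-tail probability equals erfc(sqrt(G2/2)); the function name `g2_p_value` is hypothetical, and the scores passed to it are the critical value quoted above and two Log_L values from Table 7.

```python
import math

def g2_p_value(g2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(g2 / 2))

print(g2_p_value(15.13))     # close to 0.0001, the threshold quoted above
print(g2_p_value(177.66))    # lowest Log_L score in Table 7
print(g2_p_value(20253.62))  # highest Log_L score in Table 7
```

For scores in the thousands the probability is too small to be represented as a floating-point number and is printed as 0.0, which is one practical reason for reporting the score itself rather than the p value.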
Convergence between the different association measures in the top-10 lists of collocates computed
during the collocation identification process for the selected nodes provides strong cumulative support for the assumption
that the target AN structures are strongly associated in quantitative terms, and confirms on empirical grounds their high
collocational strength both in the corpus and in journalistic discourse (at least in the selected newspaper).
Collocate Node Log_L Dice MI3 Texts Total (raw frequencies) Total Left Total Right
aksiomatikis andipolitefsis 20253,62 0,53 34,18 1138 1514 1511 3
kinonikis dhiktiosis 16420,76 0,47 33,77 1177 1409 1409 0
kokina dhania 13669,74 0,47 33,19 677 1103 1093 10
dhiefthinon simvulos 13288,70 0,54 33,50 720 929 924 5
prosfighikon roon 9606,19 0,71 33,42 509 621 621 0
prasino fos 7443,94 0,26 31,11 557 656 656 0
protis katikias 5622,02 0,23 30,26 342 541 541 0
kokines ghrames 3582,71 0,28 29,97 211 267 266 1
odhiko xarti 2398,69 0,24 28,69 174 191 190 1
ithiko pleonektima 2208,16 0,26 28,67 123 170 170 0
prasinon dhanion 177,66 0,02 19,20 17 24 20 4
Table 7. Top collocates of the target noun nodes (heads) (AN structures ranked according to the Log-L Ratio)
4.2 Collocational behaviour of the adjective modifiers
Furthermore, the results were replicable in a second case study involving the reverse, corpus-driven procedure:
extraction and identification with WST of all adj_1-9 + w_n pairs in the corpus (adj = all inflectional forms of the nine
adjectives in Table 7). A WordList Index of the whole corpus was computed. Word bigrams of the selected adjectival
forms were retrieved within a span of 5 words to the right of the node only, respecting sentence boundaries, with a
co-occurrence frequency ≥ 5 and a maximum frequency cut-off point of 0,1% (i.e. words whose corpus frequency exceeded
this percentage were ignored). The pairs were finally ranked according to the three statistical association measures
of the first case study (concerning the collocational status of the noun nodes). The hypothesis was that if the particular
34 Actually, mid-October, see section 3.

35 A shell script eliminated 7,328 articles written in English (ASCII encoding).
36 There is no way to output specific p values for LL scores as computed with WST for candidate collocates.
AN structures kept coming up in those lists as well, that preference could be interpreted as further evidence of their
statistical predictability. Indeed, that was the case; all of them were ranked in the first or second position in the lists
of candidate collocations, leaving room for other salient AN structures to appear as significant (for instance prosfighiki
krisi ‘refugee crisis’ or proti fora ‘first time’). So, in their competition with other nodes, our target AN structures
displayed high degrees of association, in line with the theoretical arguments upon which they were
selected. The procedure can be generalized to the extraction of all co-occurring word pairs within a window span and a
frequency threshold preselected by the researcher. The ranking according to various AMs, although computationally
‘expensive’ as compared to the extraction of immediately adjacent word pairs ranked by mere frequency of
co-occurrence, moved highly plausible word combinations to the top of the list and almost eliminated the
non-meaningful word associations reported in section 3. The present study follows specific software implementations, which
of course impose their own restrictions, but the method, combined with linguistic information added to the texts (POS
tagging) and a recursion mechanism37, could serve as a general heuristic for the extraction of all meaningful word
sequences (of various lengths) in our corpus38.
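The reverse procedure of this section (adjectival forms as starting points, a right-hand span of 5 words, a minimum co-occurrence frequency and a maximum frequency cut-off) can be sketched as follows. This is an illustrative approximation under simplifying assumptions, not a reimplementation of the software used: sentences are assumed to be pre-tokenized, ranking is done by the Dice coefficient only, and the function name `right_span_pairs` and the toy data are hypothetical.

```python
from collections import Counter

def right_span_pairs(sentences, adjective_forms, span=5,
                     min_cooc=5, max_rel_freq=0.001):
    """Rank (adjective, word) pairs with the word up to `span` tokens to the right."""
    word_freq = Counter(tok for sent in sentences for tok in sent)
    n_tokens = sum(word_freq.values())
    pairs = Counter()
    for sent in sentences:  # sentence boundaries are respected
        for i, tok in enumerate(sent):
            if tok not in adjective_forms:
                continue
            for w in sent[i + 1:i + 1 + span]:
                # Frequency cut-off: skip words above the relative-frequency limit
                if word_freq[w] / n_tokens <= max_rel_freq:
                    pairs[(tok, w)] += 1  # a word repeated in the window counts each time
    kept = {p: c for p, c in pairs.items() if c >= min_cooc}
    # Rank the surviving pairs by the Dice coefficient
    dice = {(a, w): 2 * c / (word_freq[a] + word_freq[w])
            for (a, w), c in kept.items()}
    return sorted(dice.items(), key=lambda item: item[1], reverse=True)

# Toy input with relaxed thresholds, invented for illustration
toy = [["ta", "kokina", "dhania"], ["ta", "kokina", "dhania"], ["ta", "nea", "dhania"]]
ranked = right_span_pairs(toy, {"kokina", "nea"}, min_cooc=2, max_rel_freq=1.0)
```

With realistic data the thresholds would stay at the values given in the text (co-occurrence ≥ 5, cut-off 0,1%); they are relaxed here only because the toy sample is tiny.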
4.3 Collocation networks
The mere computation of AMs as displayed in Table 7, although it provides necessary evidence for collocational strength
on statistical grounds (and thus for a high degree of form contingency), still does not seem sufficient as a criterion for
multi-word compoundhood as analyzed in section 2. Therefore, a closer analysis was conducted, taking into account the dynamics
of the positional patterns among the target AN structures, as identified in the WST collocation matrices.
Corpus linguistics software packages are quite capable of illustrating the order and positional variation of a node’s
collocates, a useful diagnostic tool for the fixedness of its components (Herold & Stathi 2007). L(eft)-1 is by far the
preferred position for all ten AN collocations. The L2-L5 and R(ight)1-R5 span is positionally dispreferred,
with the exception of PROSFIGHIKI ROI39, which displays positional variation mostly in L3. It is interesting that
PROSFIGHIKI ROI is the only collocation which exhibits the alternative N+Ngen (ROI PROSFIGHON) structure using the
same thematic components.
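A positional profile of the kind discussed above can be approximated with the following Python sketch, which counts how often a collocate occurs at each offset (L5 to L1, R1 to R5) from the node within the same sentence. It is only an illustration under the assumption of pre-tokenized sentences; the function name `positional_profile` and the toy sentences are hypothetical.

```python
from collections import Counter

def positional_profile(sentences, node_forms, collocate_forms, span=5):
    """Count collocate occurrences at each offset from the node (L5..L1, R1..R5)."""
    profile = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok not in node_forms:
                continue
            lo, hi = max(0, i - span), min(len(sent), i + span + 1)
            for j in range(lo, hi):
                if j != i and sent[j] in collocate_forms:
                    # Left positions are labelled L1..L5, right positions R1..R5
                    label = f"L{i - j}" if j < i else f"R{j - i}"
                    profile[label] += 1
    return profile

# Toy input, invented for illustration
toy = [
    ["ta", "kokina", "dhania"],
    ["kokina", "ine", "ta", "dhania"],
    ["dhania", "kokina"],
]
print(positional_profile(toy, {"dhania"}, {"kokina"}))
```

A profile concentrated in L1 points to a fixed adjective-noun order, whereas counts spread over several positions indicate positional variation of the kind noted for PROSFIGHIKI ROI.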
Table 8 is an enhanced version of Table 7 on the paradigmatic axis; only the first and second most strongly
associated adjectives for each node are included, as filtered from the corpus and ranked by LL, MI3 and Dice. The lists
are ordered by decreasing Log-Likelihood scores: word pairs at the top are ‘more collocational’.
Collocate (ADJs) Node Log_L Number of Texts Number of Hits in the Corpus MI3 Dice
aksiomatikis andipolitefsis ‘opposition party’ 20253,62 1138 1514 34,18 0,53
elasonos ‘minor’ andipolitefsis 330,92 24 27 22,26
esokomatiki ‘intraparty’ andipolitefsi 30 34 0,02
kinonikis dhiktiosis ‘networking’ 16420,76 1177 1409 33,77 0,47
epaghelmatikis ‘professional’ dhiktiosis 36,96 6 7 14,14 0,01
kokina dhania ‘loans’ 13669,74 677 1103 33,19 0,47
steghastika ‘housing’ dhania 2135,63 113 169 27,89
omologhiaku ‘bond’ dhaniou 18 27 0,11
dhiefthinon simvulos ‘executive’ 13288,70 720 929 33,50 0,54
dhimotikos ‘municipal’ simvulos 1612,32 97 125
dhimotiki’ simvuli 55 72 27,45 0,21
prosfighikon roon ‘flows’ 9606,19 509 621 33,42 0,71
metanasteftikon ‘immigrant’ roon 4558,67 255 301 31,24 0,42
prasino fos ‘light’ 7443,94 557 656 31,11 0,26
apleto ‘ample’ fos 695,24 52 56 24,47 0,03
protis katikias ‘residence’ 5622,02 342 541 30,26 0,23
eksoxiki ‘country’ katikia 1347,67 73 92 27,32
eksoxikes katikies 41 50 0,14
kokines ghrames ‘lines’ 3582,71 211 267 29,97 0,28
mesea ‘middle’ ghrami 1490,46 158 158
meseas ghramis 76 76 26,11 0,13
odhiko xarti ‘map’ 2398,69 174 191 28,69
odhikos xartis 85 105 0,44
37 See Wahl & Gries (in progr.)

38 For comparative experimental work on MWE extraction involving both techniques cf. inter alia Bartsch & Evert (2014); Krenn & Evert (2001).
39 Small capitals are the typographic equivalent to ‘all inflectional forms of the AN structure’.
dhasikon ‘forest’ xarton 619,61 19 36 27,26 0,35

ithiko pleonektima ‘advantage’ 2208,16 123 170 28,67
ithiku pleonektimatos 39 43 0,62
sighritika ‘comparative’ pleonektimata 800,21 59 65
sighritiko pleonektima 40 43 25,59
sighritikon pleonektimaton 10 10 0,38
Table 8. The top-2 strongly associated adjectival collocates for the structures under investigation according
to three AMs (Log-Likelihood Ratio, MI3, and Dice).
The following recurrent pattern in AM values can be observed: in the rankings computed with the LL statistic,
the scores of the top-2 adjectival collocates of each noun node (which exemplify its behavior on the paradigmatic
axis) dropped considerably from the first to the second. More specifically, there was a considerable distance in
collocational strength between the AN structures under investigation for compoundhood and the AN structures immediately
following them (although in both cases the results are statistically significant, the LL scores in the second case
provided much less evidence to reject the null hypothesis). Differences in the degree of ‘statistical confidence’ were
also observed, for all of the aforementioned structures, in the other collocation metrics (which focus on effect size,
i.e. the relative strength of associations). The three collocations with a much ‘smoother’ difference in LL rankings
between the first and the second top adjectival collocate are kokines ghrames, prosfighikon roon and ithiko pleonektima
(and, marginally, odhiko xarti). For the nodes of these four structures, the next adjectival collocate was statistically
closer to the first on the list and no great discrepancies were observed in their scores of collocational strength. A
similar pattern of ‘collocational distance’ in the rankings of adjective collocates for the selected noun nodes was
observed for Dice and MI3.
It seems, then, that an analysis focused on the paradigmatic axis, i.e. one that gathers corpus evidence for word
associations by generating ranked lists of collocates restricted to specific lexical categories (adjectives, nouns
etc.), could serve as a justified starting point for our research on the multiword status of candidate neologisms. The need
to take into consideration not only the strength of a given collocational relationship but also the level of competition
from other collocate types for the slots around the node word has recently been explored in the literature on collocations
through the notion of ‘collocation networks40’ (Brezina, McEnery & Wattam 2015). A comparison of the adjectival base forms
‘preferred’ by the three AMs for the selected noun nodes (top-2 ‘best’ collocates) revealed that LL and MI3 ‘select’
exactly the same adjectives in these positions. Only the Dice coefficient highlighted two new adjective collocates, which
were placed lower in the MI3 and LL lists (90% agreement41).
Under the hypothesis that differences in distribution reflect differences in functional characteristics, such as word
status, the distances in AM scores reported above provided strong evidence, which we could not neglect, for varying
degrees of collocativity among the target AN structures; it was an indication that the word pairs under investigation did
not behave uniformly as far as their collocational prominence on the paradigmatic (and syntagmatic) axis was
concerned. Statistical measures of association and collocation matrices, however, do only that: their estimations
serve to identify and rank collocates, or to place them within a specific collocation span. The score values ‘resist’ common
mathematical transformations42 and not all of them can be used to compare words (or word pairs) from different corpora.
For the purposes of the present study, in order to quantify more explicitly the comparison between the
highest-scoring AN structures which emerged from the computation of AMs in our pilot corpus,
relative frequencies for all inflectional forms of the combinations were calculated and compared
by means of relative frequency ratios (RFRs). RFRs are reported as a reliable method for comparing multiword
expressions between corpora (Damerau, 1993), being a statistically intuitive continuation of the log-likelihood ratio
(Manning & Schütze, 1999, p. 175)43. Table 9 reports the results of this comparison:
40 The research hypothesis that collocates of words do not occur in isolation, instead, they are part of a network of semantic relationships which
reveals their meaning and the semantic structure of a text or a corpus.
41 If we compare collocate-types (with no extrapolation to lemmas), alignment with LL scores decreases: 85% for MI3 and 60% for Dice,
similarities which are actually much more than expected, given the competing characteristics in the mathematical foundations of the three measures
of association.
42 It is ‘meaningless’ and mathematically unsound to calculate, for instance, ratios between different scores of the logarithmic G2 and MI3 formulas.
Many thanks to the mathematician George Konstantakis for guidance on this matter.
43 We adapt the RFR comparison method for the same MWEs between corpora to the comparison of different MWEs within corpora.
Collocation pairs to be contrasted    RFRs between the top-2 adjectival collocates for each target node
KINONIKI - EPAGHELMATIKI DHIKTIOSI 202
AKSIOMATIKI -ELASON ANDIPOLITEFSI 64,65
KOKINO – OMOLOGHIAKO DHANIO 44,7
AKSIOMATIKI- ESOKOMATIKI ANDIPOLITEFSI 44,3
PRASINO – APLETO FOS 11,71
KOKINO - STEGHASTIKO DHANIO 6
PROTI- EKSOXIKI KATIKIA 4,92
DHIEFTHINON - DHIMOTIKOS SIMVULOS 4,83
ODHIKOS – DHASIKOS XARTIS 4,33
KOKINI – MESEA GHRAMI 2,34
PROSFIGHIKI- METANASTEFTIKI ROI 2,04
ITHIKO - SIGHRITIKO PLEONEKTIMA 1,8
Table 9. Ranking of collocation pairs (RFRs from highest to lowest)
The right column can be read as follows: KINONIKI DHIKTIOSI is 202 times more frequent in the corpus than
EPAGHELMATIKI DHIKTIOSI, the co-occurrence frequency of AKSIOMATIKI ANDIPOLITEFSI is 64,65 times higher than that of
ELASON ANDIPOLITEFSI, and so on. ITHIKO PLEONEKTIMA, on the other hand, is only 1,8 times more frequent than SIGHRITIKO
PLEONEKTIMA.
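The arithmetic behind an RFR can be sketched in a few lines of Python. The function names are hypothetical; the corpus size is the token total of Table 6 and the two counts are the hits reported in Table 8 for prasino fos (656) and apleto fos (56). Note that Table 8 lists hits for individual inflected forms, whereas the RFRs of Table 9 were computed over all inflectional forms, so form-level counts serve here only as an illustration.

```python
def relative_frequency(count, corpus_size):
    """Occurrences of an item divided by the size of the corpus."""
    return count / corpus_size

def rfr(count_a, count_b, size_a, size_b):
    """Relative frequency ratio of item A (in corpus A) to item B (in corpus B)."""
    return relative_frequency(count_a, size_a) / relative_frequency(count_b, size_b)

N = 35_752_146  # total tokens of the pilot corpus (Table 6)
print(round(rfr(656, 56, N, N), 2))  # 11.71
```

When both items come from the same corpus, as in the adaptation of the method used here, the corpus size cancels out and the RFR reduces to the ratio of the raw counts; for this pair the form-level result coincides with the value of 11,71 given in Table 9.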
What the measures of collocational significance as well as the RFRs seem to suggest is that, if we depict collocational
phenomena on a continuum from stronger to ‘looser’ connections, some collocates in the collocational profiles
‘stand out’ in terms of salience, i.e. in their tendency to cluster with a given node many times more often than
other collocates. This gradient structure, with some collocates highly associated with a given node inside a word’s
collocational paradigm and others much more loosely connected, seems to bear evidence for increased predictability
of co-occurrence. Some AN collocations are thus salient as units in the network of multiword combinations for a given
head, and statistically idiosyncratic. Hence, the AN structures which display the ‘salience characteristic’, as
exemplified in our newspaper corpus, seem to bear sound evidence for ‘unit-hood’, at least on statistical grounds.
Support for this line of argumentation comes from the type of AN structures found at the end of the list in Table
9, namely prosfighiki roi and ithiko pleonektima: for the AN structures which, according to our theoretical criteria, fail
the diagnostic conditions for compoundhood, mainly because of their semantic predictability, there is no considerable
‘salience characteristic’ separating the first from the second highest-ranked collocate44 (see Table 9). The node develops,
of course, a network of collocations of various strengths, but none stands out considerably from the others if we order
those collocations by strength of link (see Graph 1b below). Furthermore, only PROSFIGHIKI ROI has
developed a collocational network of alternative phrasings (NN-gen structures with the same thematic components),
albeit considerably weaker in terms of effect size and significance:
Collocate Node Dice MI3 Log_L Ratio of reduction in collocational strength (computed for Dice)
prosfighikon roon 0,71 33,42 9606,19
prosfighikes roes 0,52 32,56 7901,87
prosfighon roi 0,04 24,63 1270,24 17,75
prosfighon roes 0,03 23,67 982,57 9,78
Table 10. Alternative phrasings of PROSFIGHIKI ROI
Graphs 1a45 and 1b46 visualize the results obtained through collocational analysis, for two different word pairs
selected from Table 9; one displaying the ‘salience characteristic’, as opposed to a second one, much lower in the cline
of associations. Indeed, as compared to the other collocates of their respective nodes, the salience of the structure
44 KOKINI GHRAMI was selected as a candidate compound according to our semantic criteria, but according to our quantitative analysis it is
situated nearly at the end of the collocational salience cline. A tentative explanation would take into consideration the tendency of the head to form
series of abstract metaphorical terms (see section 2, I). A concordance analysis and a closer look at the texts themselves revealed the rather specific
metonymic use of MESEA GHRAMI in football-related journalistic jargon. Moreover, the AN collocation immediately following mesea ghrami was
proti ghrami, which is also mostly used metonymically/metaphorically. Whether members of the same class are expected to exhibit similar
distributional behaviour in their instantiations is an open question for a new study taking the prototypicality parameter into account.
45 Exported with the help of WST on the basis of LL scores.
46 Generated for the Dice coefficient, with the help of #LancsBox (Lancaster Desktop Corpus Toolbox), an open source software package for the
analysis of language data and corpora (Brezina et al. 2015).

KOKINO DHANIO is clearly illustrated in the graphs, especially in contrast to the non-salience of the structure ITHIKO
PLEONEKTIMA.
The findings seem to support our theoretical analysis. All ten preselected AN structures were ranked to the top
(1 position) for three different lexical association measures. Moreover, seven out of the eight AN structures47 which
st
were manually selected as candidate compounds (see criteria in section 2), displayed a relative salience characteristic in
terms of the adjective modifiers of the noun heads (adjectival collocates as compared to the immediately following
adjective modifiers of the same head). On the other hand, the adjectives of the two remaining structures (PROSFIGHIKI
ROI and ITHIKO PLEONEKTIMA) displayed no outstanding difference as compared to the immediately following
frequently associated adjective modifiers of the same noun node (andaghonistiko/sighritiko and metanasteftikon
respectively). We should note that PROSFIGHIKI ROI and ITHIKO PLEONEKTIMA were selected in order to test the
theoretical predictions because of their topicality, despite their semantic predictability (for both modifier and head, see
sections 2-3).
Graph 1a. Word cloud graph for KOKINO DHANIO vs. ITHIKO PLEONEKTIMA (LL, 5LR, f ≥2)
Graph 1b. Collocation network graph for KOKINA DHANIA vs. ITHIKO PLEONEKTIMA (Dice, 5LR, f≥5, f≥4)
47 Except for the structure kokini ghrami (see above).

These observations point to the assumption that the relation of the modifier (collocate) to the head (node) in the
first seven cases48 (three already attested in MG dictionaries, and four unattested, see section 3.3) could be of a different
status in terms of compoundhood as compared to the other two (also unattested) collocations (prosfighiki roi, ithiko
pleonektima), thus supporting our theoretical proposal. We can, therefore, assume that the statistical criterion of word
association salience provides evidence of compound status for the seven selected AN structures, lending
additional support to the theoretical semantic criterion of predictable transparency. The assumption that outstanding
differences in the rankings of collocates for a given head (node) are an effective predictor not only of the
statistical but also of the linguistic strength of association between words seems promising to us and needs to be further
investigated with a larger set of multi-word combinations.
Peripheral, compound-like structures
Leaving the salience characteristic as a predictor for compoundhood aside, we will next examine the ‘bigger picture’,
i.e. the wider network of collocates for a given node: the collocate ranking by AMs for the eight selected AN structures
highlighted, on statistical grounds, a series of adjectival modifiers of a specific semantic category. Those adjectives occur
in AN formations such as epixirimatika / katanalotika / steghastika … dhania, i.e. [ADJpurpose+loans], or
evropaikos / edhafikos / pankozmios xartis, i.e. [ADJarea+map]. If we look at the graphs, these structures form a kind
of circle of semantically similar adjective-collocates ‘concentric’ to the target nodes dhania and xartis49 respectively,
though still away from the center (node) and its salient collocate50, see Graph 1a (above) and 1c (below):
Graph 1c. Word cloud graph for the node xartis (according to LL, 5LR, f≥2)
The relative proximity of a series of semantically similar classifying modifiers related to the same head/node as
illustrated above, seems to support our theoretical assumption that there is a type of less prototypical, i.e. peripheral
compound-like structures (see section 2.3, II). They represent less ‘tight’ structures than the compounds, since they are
predictably transparent within a (productive) formation schema (see section 2.3, II), but yet tighter structures than others,
since they tend to denote terms with classifying modifiers. In this sense, they should constitute a transitional area
between compounds and other structures, since they belong to the periphery within the prototypical scale of
compoundhood as we structured it in section 2 (see also Christofidou 1994, 2012, cf. the notion of constructs in Ralli
201351).
Combining the results of the AMs and the relative frequency ratios between the top-2 adjective collocates of the
target nodes with our theoretical proposal for compoundhood, we arrive at the following proposal:
48 KOKINI GHRAMI is a marginal case and further monitoring will be needed.

49 Proniakos (‘welfare’) is the only adjective of xartis in the concentric circle of adjectives nearest to the center that does not designate an area.
50 The modifier eksipiretoumena is considered part of the structure mi-eksipiretoumena dhania (‘non-performing loans’), i.e. a word trigram with
a participle, so it was excluded from the analysis.
51 The term construct is analyzed and used by Ralli 2013 in a similar but not the same sense mostly because of the different criteria and the overall
description of the phenomenon (see also Ralli 2007).

a. if an AN structure is non-predictably transparent and statistically salient in the collocation network it should be
considered as a compound, e.g. kokina dhania
b. if an AN structure is predictably transparent and statistically non-salient it should not be considered as a
compound (yet), e.g. ithiko pleonektima
c. if an AN structure is predictably transparent exhibiting more than two classifying modifiers for the same head
with similar scores relatively closely associated to the head it should be considered as a peripheral (non-
prototypical) compound-like structure, e.g. epixirimatika/katanalotika/steghastika… dhania
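The three cases (a) to (c) can be read as a small decision procedure. The following Python sketch is only an illustration under stated assumptions: the class `ANStructure`, its field names and the function `classify` are hypothetical, salience and predictable transparency are reduced to yes/no judgements, and no numerical threshold for salience is implied by the study. Case (c) is checked before case (b), since a structure with a series of similarly scored classifying modifiers may also be non-salient.

```python
from dataclasses import dataclass


@dataclass
class ANStructure:
    name: str
    predictably_transparent: bool        # manual semantic judgement (hypothetical encoding)
    salient: bool                        # stands out among the collocates of its head
    similar_classifying_modifiers: int   # classifying modifiers of the same head with similar scores


def classify(s: ANStructure) -> str:
    """Apply cases (a) to (c) in a fixed order; anything else stays undecided."""
    if not s.predictably_transparent and s.salient:
        return "compound"                                   # case (a)
    if s.predictably_transparent and s.similar_classifying_modifiers > 2:
        return "peripheral compound-like structure"         # case (c)
    if s.predictably_transparent and not s.salient:
        return "not a compound (yet)"                       # case (b)
    return "undecided"                                      # marginal, to be monitored


# Illustrative encodings of the examples given in (a) to (c)
print(classify(ANStructure("kokina dhania", False, True, 0)))
print(classify(ANStructure("ithiko pleonektima", True, False, 0)))
print(classify(ANStructure("epixirimatika dhania", True, False, 3)))
```

A structure that is non-predictably transparent but not salient falls through to "undecided", which is how a marginal case such as KOKINI GHRAMI would surface for further monitoring in this sketch.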
5. Beyond word frequency
As mentioned in section 3, our definition of neological formations to be recorded in ΝΕΟΔΗΜΙΑ for subsequent
monitoring and publication in the Bulletin of Scientific Terms and Neologisms sets a crucial prerequisite: emphasis is
not placed on nonce, ad hoc formations produced only for occasional use. Instead, research focuses on new words in a
process of consolidation, i.e. those that seem to follow a path which could lead to complete assimilation into
the lexicon of the speech community52. Moreover, the candidate neologisms, besides their dynamic character, must fail
the lexicographic condition, i.e. inclusion in reference dictionaries of Modern Greek (see section 2). As a first empirical
diagnostic criterion of the ongoing consolidation process, Christofidou et al. (2013) developed a series of
methodological steps making use of page count information returned for specific queries to popular search engines of
the web (Google and Bing). The second – and more robust – line of evidence is based on corpus data from our monitor
corpus of journalistic texts. Frequencies of word occurrences, as well as their change patterns over periods of time can
now directly be observed and statistically assessed53. The degree to which their use increases or decreases reflects
(gradual or sudden) growth, stability or obsolescence. Furthermore, normalized frequency thresholds can categorize
words into frequency bands; all AN combinations examined in the previous sections had a total frequency of ≤ 10 per
million words54. Although this kind of word ‘profiling’ is valuable for making further extrapolations, for instance how
novel or specialized vocabulary compares to more frequently-used words in the specific genre, it cannot always
efficiently discriminate between words included in the same frequency band, in our case, between the AN structures
themselves.
Frequency distributions from newspaper corpora are also notoriously affected by topicality spikes, which are
often responsible for numerous fluctuations in average frequencies. In those cases, trends computed by word frequencies
only, may not be representative of the process of consolidation that is underway. For instance, in Figure 1 there is a clear
fall in the relative frequency (and popularity) of KOKINO DHANIO, within a 15-month period, especially if one considers
the overall growth of the size of the corpus55. Should KOKINO DHANIO be recorded as a neological compound in our
database? In search of answers, we conducted a contrastive examination with PRASINO DHANIO (‘green loan’), which
is also a semantically non-predictable AN structure (though not manually selected). PRASINO was much lower in the list
of collocates of DHANIO, albeit statistically significant (Log-Likelihood 139,70, p < 0,0001). The relative frequency
curve of PRASINO DHANIO (see Graphs 2 and 3) seems to be exponential in the beginning56 (P2 to P3), but after the end
of July 2016 the MW-compound falls into disuse by the journalists of the specific newspaper and is no longer attested
in the corpus57. Should PRASINO DHANIO be recorded in our database then? And what about PRASINO FOS (Graph 3) and
its frequency fluctuations, which are clearly connected to specific circumstances that authorize (‘give green light to’)
52 According to Schmid’s (2008, 2015) emergentist dynamic sociocognitive model of Entrenchment and Conventionalization, cognitive processes
in the mind of language users interact with social processes taking place in social situations and speech communities, but are at the same time
distinct. Entrenchment (an essentially ‘Langackerian’ notion) “is concerned with the degree to which lexemes are holistically established on the
microlevel of the mental systems of individuals”, while “Conventionalization refers to the degree to which lexemes are established on the
macrolevel of the speech community” (Kerremans, 2015: 58). Language corpora, therefore, can only give us insights into the latter process.
53 For instance, through the computation of trends. There is a Trends function in the Sketch Engine for Neologisms and Diachronic Analysis
(https://www.sketchengine.co.uk/user-guide/user-manual/trends/), but it is available only for preselected corpora. As a result, we chose to explore
variations in frequency over time with the help of the WST Time-line function, as well as dispersion statistics.
54 In the 35-million words pilot corpus, spanning across fifteen months of news coverage.
55 We did not compute any correlation statistics (Kendall’s τ) to test for significant fluctuations across time; the statement is based merely on the
more than twofold drop in relative frequency (2,41 times less frequent) across varying corpus sizes, i.e. between P1 and P4 (where P4 corpus
data >> P1 corpus data).
56 Which reminds us of “the curve of an ideal neologism in a time-sliced corpus” (Cabré & Nazar 2012). In order to extract novel formations they
compare the relative frequency distribution of words in the corpus with an exponential function.
57 This terminus ante quem seems to be reliable enough, at least as far as 2016 is concerned. In the general corpus of all (five) newspapers,
mentioned in section X, which as of December 31, 2016 comprised 100.003.873 words, only two occurrences of PRASINO DHANIO were recorded,
the first in August 2016 (HKathimerini), the second in October 2016 (Ethnos).
further actions or events in a country’s financial system or even a football team’s agenda? We cannot decide on
frequency profiles alone. Frequency considerations are important, but not decisive predictors of the consolidation
process of novel formations.
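The relative frequencies discussed in this section are counts normalized per million words, and the fall of KOKINO DHANIO is described as a ratio between two such values. The sketch below shows both computations in Python. The function names and the raw count are hypothetical; the two relative frequencies are assumed to be the P1 and P4 values of KOKINO DHANIO as plotted in Graph 2, an assumption that reproduces the figure of 2,41 given in footnote 55.

```python
def per_million(count: int, corpus_size: int) -> float:
    """Relative frequency normalized to occurrences per million words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count * 1_000_000 / corpus_size


def fold_change(rel_freq_start: float, rel_freq_end: float) -> float:
    """Ratio start/end: values above 1 mean the item has become less frequent."""
    if rel_freq_end <= 0:
        raise ValueError("rel_freq_end must be positive")
    return rel_freq_start / rel_freq_end


# Hypothetical raw count in a 35-million-word corpus: exactly at the 10 pmw band limit
print(per_million(350, 35_000_000))           # 10.0

# Assumed P1 and P4 relative frequencies of KOKINO DHANIO (read off Graph 2)
print(round(fold_change(93.70, 38.95), 2))    # 2.41
```

Because both values are already normalized by the size of their period, the ratio is not distorted by the growth of the corpus between P1 and P4.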
Graphs 2 and 3. Relative frequency fluctuations in the news - moving average
[Two panels, each titled ‘Corpus size vs. Relative Frequency’, plotting corpus size (in words) and relative frequency (per million words) across Periods 1 to 4. Graph 2: KOKINO DHANIO and PRASINO DHANIO. Graph 3: PRASINO FOS and PRASINO DHANIO.]
It has long been observed in the literature that insights from frequency data should be complemented and
augmented with information on the dispersion of elements in the corpus as an indicator of their overall importance (cf.
Gries 2008)58. As opposed to occurrence repetition (frequency), dispersion is a measure of occurrence regularity and
predictability of co-occurrence (Gablasova, Brezina & McEnery 2017: 5-7). A uniform distribution of a word in the
news feeds, as they are naturally and linearly produced through time, not only bears evidence of recurrent, almost
permanent, topicality, but mainly reflects wide (i.e. not occasional) use, regardless of any fluctuations in the frequencies of
occurrences. WST computes time-lined dispersion plots59, using Juilland’s D formula for the calculation of spread
values60. If the value is close to 1, the word is evenly distributed throughout the year, given the amount
of word data searched. As the value approaches 0, the theoretical minimum, we infer that the word is under-dispersed
in the corpus and shows a preference for some periods only. Each newspaper article is temporally ordered within the
corpus according to its timestamp in the RSS news feed. Chronological information for each corpus file is extracted
with the help of a WST utility program which can parse corpus filenames to recognize dates using a user-defined regular
expression (to that purpose, special naming conventions were implemented at the stage of the compilation of the
corpus). The following table reports the dispersion results for the entire 15-month period, in decreasing order.
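To make the behaviour of the spread value concrete, the sketch below implements the textbook formulation of Juilland’s D, D = 1 − V/√(n−1), where V is the coefficient of variation (population standard deviation divided by the mean) of the item’s frequencies across n corpus parts. It is an illustration only: the function name and the toy counts are hypothetical, and the sketch does not claim to reproduce the way WST segments the timeline or handles sparse data.

```python
import math
from typing import Optional, Sequence


def juilland_d(freqs: Sequence[float], sizes: Optional[Sequence[float]] = None) -> float:
    """Juilland's D over n corpus parts.

    freqs: occurrences of the item in each part.
    sizes: optional part sizes; if given, frequencies are first turned into
           rates so that parts of unequal size remain comparable.
    """
    n = len(freqs)
    if n < 2:
        raise ValueError("at least two corpus parts are required")
    rates = list(freqs) if sizes is None else [f / s for f, s in zip(freqs, sizes)]
    mean = sum(rates) / n
    if mean == 0:
        raise ValueError("the item does not occur in any part")
    # population standard deviation of the per-part rates
    sd = math.sqrt(sum((r - mean) ** 2 for r in rates) / n)
    return 1 - (sd / mean) / math.sqrt(n - 1)


# Toy counts over four equally sized parts
print(juilland_d([5, 5, 5, 5]))    # perfectly even spread: 1.0
print(juilland_d([10, 0, 0, 0]))   # all occurrences in one part: approximately 0
```

The two toy calls correspond to the theoretical maximum and minimum described above; real items fall somewhere in between.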
AN structure                 D (spread)
KINONIKI DHIKTIOSI           0,921
DHIEFTHINON SIMVULOS         0,901
PRASINO FOS                  0,899
AKSIOMATIKI ANDIPOLITEFSI    0,893
KOKINO DHANIO                0,851
ODHIKOS XARTIS               0,843
KOKINI GHRAMI                0,833
ITHIKO PLEONEKTIMA           0,786
PROSFIGHIKI ROI              0,785
PROTI KATIKIA                0,711
PRASINO DHANIO               0,544
Table 11. Dispersion results (spread) for target AN structures in the corpus
Most MW-structures are evenly dispersed in the corpus, since D values approximate the theoretical maximum of 1
(maximally even distribution). Both candidate neological compounds and conventionalized ones have similar D values
(cf. KINONIKI DHIKTIOSI, PRASINO FOS, KOKINO DHANIO, ODHIKOS XARTIS vs. DHIEFTHINON SIMVULOS, AKSIOMATIKI
ANDIPOLITEFSI, KOKINI GHRAMI, PROTI KATIKIA), a fact that supports the argument of an ongoing consolidation process
taking place. At the end of the cline, PRASINO DHANIO, is the most unevenly dispersed (and under-mentioned as grey
58 “Simple word-frequency counts can be misleading, since a word might have a high frequency because it is overused in a much smaller number
of texts, or parts of texts, within the corpus” (Leech, Rayson & Wilson 2001: 17).
59 In order to study change through time, i.e. diachrony in synchrony (Renouf 2002); for the WST Timeline function see
http://lexically.net/downloads/version7/HTML/index.html?text_dates_and_timelines.htm.
60 Like all statistical measures, the calculation of Juilland’s D has important limitations (Biber, Reppen, Schnur & Ghanem 2016; Gries 2008), which
in our case are eliminated as far as possible (for instance, a small number of corpus-parts). In general, Juilland’s D is a popular, highly appreciated
dispersion measure (Leech et al. 2001: 18).
lines suggest) in the corpus: from the visualized timelines below (Figure 3), its distribution is rather skewed and
occasion-specific (first emergence in February, under-mentioned from February to July 2016, and then no further
occurrences), so despite its analyzable semantics, statistically speaking, there might be speakers who are not at all
familiar with the term.
Graph 4. (from top to bottom): The distribution of PRASINO DHANIO, KOKINO DHANIO, PRASINO FOS in the corpus (output from WST: time-lined
per month with frequency normalization to thousands, green lines= relative frequency moving average, grey lines/rectangles=amount of text
being examined).
The columns below show the percentage change between P1 and P2, P2 and P3, P3 and P4, as well as P1 and P4 (a value of
1,52 corresponds to a 52% increase). For KINONIKI DHIKTIOSI, spread increased by 52% between P1 and P2, then decreased
by 3% between P2 and P3. Later on, between P3 and P4, there were no fluctuations in dispersion across the corpus. Within
the 15-month period there was a total increase of 49%: at the end of 2016 the AN structure was more uniformly dispersed
in the corpus than it was in October 2015.
Target AN structure   Change    Dispersion       Change         Dispersion      Change    Dispersion    Overall Change   Overall
                      P1 > P2   Trend1           P2 > P3        Trend2          P3 > P4   Trend3        P1 > P4          Trend
KINONIKI DHIKTIOSI    1,52      Increase         0,97           Decrease        1,00      Stability     1,49             Increase
PRASINO FOS           1,47      Increase         0,96           Decrease        1,18      Increase      1,66             Increase
KOKINO DHANIO         1,39      Increase         1,01           Increase        1,03      Increase      1,46             Increase
ODHIKOS XARTIS        1,47      Increase         0,67           Decrease        1,50      Increase      1,49             Increase
PRASINO DHANIO        N/A       First evidence   0,00 (fn. 61)  Last evidence   N/A       No evidence   N/A              Inconclusive evidence
Table 12. Percentage Changes in Dispersion Values as computed with WST per neological AN compound
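The change values in Table 12 can be read as ratios between the dispersion value of a later period and that of an earlier one. The Python sketch below shows how such a ratio maps onto a percentage and a trend label. It is a hedged illustration: the function names, the tolerance used for ‘Stability’ and the two D values in the first call are hypothetical and are not taken from WST.

```python
def change_ratio(d_before: float, d_after: float) -> float:
    """Ratio of two dispersion values; 1.0 means no change."""
    if d_before <= 0:
        raise ValueError("d_before must be positive")
    return d_after / d_before


def percent_change(ratio: float) -> int:
    """Express a ratio as a rounded percentage change (1.52 -> 52, 0.97 -> -3)."""
    return round((ratio - 1.0) * 100)


def trend(ratio: float, tolerance: float = 0.005) -> str:
    """Label a ratio; the tolerance for 'Stability' is an assumed value."""
    if abs(ratio - 1.0) <= tolerance:
        return "Stability"
    return "Increase" if ratio > 1.0 else "Decrease"


# Hypothetical D values for two consecutive periods
print(round(change_ratio(0.50, 0.76), 2))        # 1.52

# Ratios as reported for KINONIKI DHIKTIOSI in Table 12
for r in (1.52, 0.97, 1.00, 1.49):
    print(r, percent_change(r), trend(r))
```

With the assumed tolerance, the labels agree with the table for the rows shown, including the small increases of 1,01 and 1,03 reported for KOKINO DHANIO.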
To answer the practical questions raised at the beginning of this section: on the basis of further evidence from
dispersion statistics, both PRASINO FOS and KOKINO DHANIO will be recorded as neological formations in ΝΕΟΔΗΜΙΑ.
PRASINO DHANIO, on the contrary, will be excluded from the catalogue; its distributional information will be stored and
monitored further until the specific candidate compound re-emerges, at which point the evaluation process will be repeated.
After sufficient bottom-up experimentation with our data, we will be in a position to impose specific thresholds
concerning significant changes in distributional profiles. In conclusion, every time we assess a candidate MWE for
inclusion in our system, we make sure that we select not only words which are frequent but words with wide usage
(evenly dispersed in the corpus). One way to do that is to combine frequency trends across time with dispersion
information, all of which can be observed and stored.
6. Contextual expansion
Both duration and extent of exposure to MW coinages, as measured by dispersion statistics in newspaper corpora (see
preceding section), are not the sole indicators of an ongoing consolidation process within a community of discourse
receivers (newspaper readers) and producers (journalists, audience commentators). As reported in Christofidou et al.
(2014), monitoring the life-cycle of neological formations involves exploration of purely contextual variables, such as
diffusion into different contexts of use. In that study, webometric analysis (Thelwall, in progr.) was introduced as a
61 100% change: WST nullifies D values when there is not enough data: for P2 there were 15 occurrences in 9 texts out of 30.935; for P3 there
were 28 occurrences in 20 texts out of 30.450.
method of capturing contextual variation through web impact reports (WIRs, cf. Christofidou et al. 2014, chapter 3
onwards). For the needs of this section, a webometric analysis was conducted only on a preliminary/exploratory basis,
in order (1) to capture any fluctuations in web domain counts for all MW-compounds under monitoring between
different periods of measurement (i.e. how many different websites mention the terms, calculated as an abstraction from
all individual matching URL counts) and (2) to provide an assessment of their presence in online documents. WIRs were
generated through queries to the Bing Web Search API (now part of Microsoft Cognitive Services) for (a) the end of April
2016, (b) the beginning of June 2016 and (c) the end of January 2017. The matrices were post-processed with the computation
of Shift-Ratios and Shift-Scores between (a) and (b), (b) and (c), and (a) and (c) (Eu 2008; Christofidou et al.
2014: 1856). Due to space limitations, we only report a summary of the results: there were no excessive fluctuations in
web domain counts. Moreover, the overall hit counts were high for both conventionalized and neologistic structures.
The web as corpus analysis revealed the popularity, topicality, as well as stable presence of the MW-compounds under
scrutiny in multiple sources and web documents.
When implementing the reverse procedure, a web for corpus analysis (corpus compilation from web resources,
i.e. crawling, in our case, of selected newspaper RSS feeds), contextual expansion of neological formations as a predictor
of an ongoing consolidation process is limited by the sampling decisions and the type of the corpus. So, as an equivalent
to web domains, newspaper section information was added to all texts of the enhanced version of our pilot corpus62. For
the 116.407 Greek articles in UTF-8 and UTF-16 (for processing with WST), a @URLclass attribute was added to every
<Article> element, with multiple values (one for each Newspaper Section). The annotation process was conducted in a
rule-based, semi-automatic way, following classificatory information from link semantics, i.e. the URL-naming patterns
that Proto Thema employs to publish content in different sections of its web platform (for details see Afentoulidou &
Christofidou, in progr.). The following Newspaper Sections were identified:
"Greece_Politics", "Greece_Society", "Greece_Business_and_Finance", "World_News", "Culture_and_Arts", "Science_and_Technology",
"Sports", "Health", "Environment", "Leisure_Activities_Travel", "Leisure_Activities_Car_and_Speed", "Life & Style", "Food_and_Recipes",
"Opinion", "Advertorial".
Section-specific searches were performed with WST and the figures below visualize the results. Graph 5 displays
all AN structures in decreasing order of the number of newspaper sections in which they appear. Since all of them
belong to domains which enjoy widespread popularity in journalistic discourse (business, finance, politics, society), the
further the consolidation of those terms in journalistic discourse proceeds, the more they diffuse into
different text types and fields of discourse (discourse domains) and the less they seem to be restricted to specific
semantic domains63. Domain expansion could of course be a by-product of topicality and news salience, but in any case
it represents the degree of exposure to repeated communicative events (and thus constitutes evidence of the process of
consolidation), which acts like a ‘spreading-activation’ factor by fostering the network of novel word associations to
specific situations across various text types (e.g. opinion vs. news articles) or thematic sections (politics vs. lifestyle)64.
For instance, Graph 6 suggests that ODHIKOS XARTIS is mainly used in political and economic contexts (and
in opinion articles as well), but has at the same time developed a preference for other domains, such as society or
international relations65. Conversely, PRASINO DHANIO seems to be more specialized and restricted to specific economic
contexts, notably if we compare it with KOKINO DHANIO, which has expanded into international relations, society, and even
into domains marginal to its meaning, for example lifestyle66.
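The two steps described in this section, rule-based assignment of newspaper sections from URL patterns and counting the distinct sections in which a structure occurs, can be sketched as follows in Python. Everything specific in the sketch is an assumption made for illustration: the URL patterns are invented and do not reflect the actual naming scheme of any newspaper, the function names are hypothetical, and the toy items `AN_1` and `AN_2` do not stand for any structure in the study.

```python
import re
from collections import defaultdict

# HYPOTHETICAL URL patterns mapped onto section labels from the list above
SECTION_RULES = [
    (re.compile(r"/politics/"), "Greece_Politics"),
    (re.compile(r"/economy/"), "Greece_Business_and_Finance"),
    (re.compile(r"/opinion/"), "Opinion"),
    (re.compile(r"/sports/"), "Sports"),
]


def url_classes(url):
    """Return every section label whose pattern matches the article URL."""
    return [label for pattern, label in SECTION_RULES if pattern.search(url)]


def sections_per_structure(hits):
    """hits: iterable of (AN structure, article URL) pairs.

    Returns (structure, number of distinct sections) pairs, sorted by
    decreasing number of sections and then alphabetically.
    """
    seen = defaultdict(set)
    for structure, url in hits:
        seen[structure].update(url_classes(url))
    return sorted(((s, len(v)) for s, v in seen.items()),
                  key=lambda item: (-item[1], item[0]))


# Toy data: invented URLs for two placeholder structures
hits = [
    ("AN_1", "https://news.example/economy/article/1/"),
    ("AN_1", "https://news.example/politics/article/2/"),
    ("AN_1", "https://news.example/opinion/article/3/"),
    ("AN_2", "https://news.example/economy/article/4/"),
    ("AN_2", "https://news.example/economy/article/5/"),
]
print(sections_per_structure(hits))
# [('AN_1', 3), ('AN_2', 1)]
```

An article whose URL matches no rule receives no label in this sketch; in a semi-automatic workflow such cases would be the ones passed on for manual inspection.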
62 Most newspapers host online editions and we consider newspaper sections in those editions to denote a purely ‘navigational’ notion, devoid of
any special implications in terms of genre theory and text typologies. Most sections of a newspaper are domain-specific, but the creation of a
section is not always topic-oriented, since different text-types, such as opinion articles or advertorials (which encompass many topics) are
considered autonomous sections (whilst interviews are dispersed within different sections and are not published in a separate newspaper column).
For a genre-aware taxonomy of newspaper articles see Afentoulidou and Christofidou (in progr.).
63 Words seem to become more established in the lexicon of the speakers (microlevel) and the language community (macrolevel) when they appear
in multiple types of source (see Kerremans 2015: 53 who also reviews psycholinguistic studies on the facilitatory role of the diversity of contexts
in language processing).
64 “Diffusion models portray society as a huge learning system” (Hamblin, Miller & Saxton 1979 cited by Kerremans 2015: 64).
65 Or -as witnessed by many hapax legomena (denoted by *), health and fitness, environment, civilization, even advertorials.
66 The role of overt co-textual variables in the consolidation process is also very strong (but out of the scope of this paper); markers of punctuation,
metalinguistic markers i.e. explanatory phrases, reduce ambiguity and make more explicit to the readers the meaning and use of novel formations.
Their presence in the corpus seems to be inversely proportional to the degree of lexicalization of the terms they modify: [..] δεν γλιτώνουν ούτε οι
δανειολήπτες [..] που είναι ενήμεροι, έχουν, δηλαδή, «πράσινα» δάνεια. […] trans. The economic meltdown will inevitably affect borrowers […]
even those who are creditworthy and have no outstanding obligations, i.e. those who have “green loans”. (Proto Thema, 22-05-2016). [..] στις
εταιρείες «distress funds» [..], οι οποίες, σαν κυνηγοί κεφαλών, θέλουν να αγοράσουν τα λεγόμενα «κόκκινα δάνεια» (δηλαδή τα ληξιπρόθεσμα
ενυπόθηκα τραπεζικά δάνεια). trans. The Government paves the way for distress fund companies which, acting as headhunters, want to purchase
the so-called “red loans” (that is, the overdue mortgage loans). (Proto Thema, 28-04-2016). On the role of further textual factors influencing the
Graph 5. All AN structures: Newspaper Sections per AN structure
AN structure                 No of Sections
KINONIKI DHIKTIOSI           15
DHIEFTHINON SIMVULOS         14
PRASINO FOS                  13
ODHIKOS XARTIS               11
PROSFIGHIKI ROI              11
AKSIOMATIKI ANDIPOLITEFSI    8
KOKINI GHRAMI                8
KOKINO DHANIO                7
PROTI KATIKIA                6
ITHIKO PLEONEKTIMA           4
PRASINO DHANIO               3
Graph 6. ODHIKOS XARTIS: Frequencies (PMW) per Newspaper Sections (y-axis: Freq. PMW)
Greece_
Leisure_ Culture_
Greece_ Business Greece_ Life & Environ Advertor
World Opinion Activitie Health* and_Arts
Politics _and_Fin Society Style ment* ial*
s_Travel *
ance
Period 1 22,18 33,40 3,56 10,13 11,71 0 0 0 0 2,35 67,3
Period 2 15 16,62 2,28 2,31 0 1,98 0 0 0 0 0
Period 3 22,10 29,65 5,95 5,14 6,33 0 0 0 0 0 0
Period 4 45,55 37,37 5,41 5,77 19 0 15,67 6,06 15,35 0 46,49
Graph 7. PRASINO DHANIO: Frequencies (PMW) per Newspaper Sections per Time Period
Section                        Period 1   Period 2   Period 3   Period 4
Greece_Business_and_Finance    0          19,39      23,2       0
Greece_Politics                0          0,58       4,55       0
Opinion                        0          0          9,5        0
7. Concluding remarks
After exploring aspects of the two decisive parameters (compoundhood and neologicity of words) as spelled out in
section 3, which determine the final inclusion (or not) of candidate MWEs in ΝΕΟΔΗΜΙΑ, we are in a position, on the
basis of the corpus-based perspective adopted, to conclude that the behaviour of seven of the ten MWEs discussed in
comprehension of novel formations (e.g. cohesive devices like root base or whole-word repetitions, semantically related lexemes etc.), see
Christofidou (1994). PRASINO DHANIO, for instance, always appears in the texts in a contrastive function with its antonym, KOKINO DHANIO.
this study, as observed in our newspaper corpus, provides strong evidence for compoundhood: AKSIOMATIKI
ANDIPOLITEFSI, PROTI KATIKIA, DHIEFTHINON SIMVULOS, KOKINO DHANIO, KINONIKI DHIKTIOSI, PRASINO FOS,
ODHIKOS XARTIS (section 4). Moreover, the last four, on the basis of time-lined dispersion statistics and contextual
variability, fulfil all requirements to be recorded in our database, in the specialized fields for neological multi-word
compounds (sections 5 and 6). On the other hand, PROSFIGHIKI ROI and ITHIKO PLEONEKTIMA, albeit novel formations,
do not fulfil the compoundhood requirement yet (sections 2-4).
We have discussed how the two parameters can be operationalized and further refined both in theory and practice,
through purely theoretical-linguistic criteria, under the epistemological paradigm of Natural Theory (sections 1-3), as
well as empirical corpus-linguistic factors (sections 4-6). For the semi-automatic extraction, identification and
exploration of all AN structures in our data we used available commercial corpus processing systems and software (The
Sketch Engine and WordSmith Tools). A small set of ten AN structures was selected from word bigram lists after the
application of specific filters to remove ‘noise’ together with human judgment, according to our theoretical predictions.
The AN structures were then subjected to a corpus linguistic analysis, using a collocate-node approach within a
preselected window span.
The computation of three Lexical Association Measures revealed that our target AN structures were the most
prominent AN co-occurrences within the network of collocational preferences of their head nouns. Although there was
actually a cline of associations, some collocations were much more salient, ‘standing out’ in relation to all the others in
the same network. With the help of further comparisons (Relative Frequency Ratios), we interpreted the existence of
these statistically idiosyncratic, highly salient collocates as empirical evidence for compoundhood, confirming the
theoretical hypothesis upon which they were selected. Further research is in progress at our Centre to explore the notion
of prototypicality hierarchies (see section 2)67, in relation to corpus data.
The integration into our line of thinking of distributional and contextual factors, such as time-lined dispersion
statistics and contextual expansion provided us with valuable insights concerning the dynamic notion of the
consolidation process which was taking place and allowed us to make further decisions concerning the parameter of
neologicity of words. It seems, then, that the degree of conventionalization of a novel linguistic unit, simple or complex,
can be determined in terms of its frequency in the sample, its dispersion across the corpus and across time, and the type of
source and field of discourse it occurs in.
The usefulness of such methods combining qualitative and quantitative research has led us to the conclusion
that the traditional database approach to the study of neology is not enough; by carefully studying and delineating the
various parameters that shed light on different linguistic phenomena, in an era dominated by the abundance of data, we
must move to prediction-based models and machine-learning techniques68.
7.1. Future implications
On the basis of this study, further improvements of our system of identifying, recording and monitoring neologisms in
the Greek Press can be made on a bigger scale: first of all, pre-linguistic analysis (POS tagging and lemmatization) is
needed, in our view, as a prerequisite for a more focused application of the quantitative methods. In that way collocation
statistics could be calculated on-demand, for specific word classes and grammatical relations. The collocational profile
of every word in our corpus could then be used to produce collocationality scores in terms of lexical association rankings
(see Renouf, 1993; Collier, 1993 on an already implemented large-scale procedure concerning English neologisms). On
the basis of this information, users of such a system would be able to observe how collocational rankings have changed
across time, to identify emerging or obsolete collocates, and to detect significant differences in their distributional patterns.
A ‘compounding filter’ would spot highly salient collocates which, according to the present study, also have a good
chance of being true multi-word compounds. Further work is needed to assess the effectiveness of this approach; for
instance, to experiment with the effectiveness of more recent association measures reported in the literature in order to
extract meaningful and novel word co-occurrences, and of course to adapt AMs to other types of word combinations,
beyond the bigram. But this would be a research topic of its own, combined with the rest of the criteria as explored in
this paper.
67 Especially for marginal cases such as KOKINI GHRAMI.

68 In this line of research, we continue our collaboration with Iraklis Varlamis, Professor of Informatics, and his team at Harokopio University of
Athens. The exploration and fine-tuning of the quantitative variables as set out in this study (Afentoulidou ms. in progress) to build such an NLP
system can only be a line of research of an interdisciplinary nature.
References
Afentoulidou, V., & Christofidou, A. (in progr.). A monitor corpus of Modern Greek Neologisms. Design criteria and
Genre considerations. In A. Christofidou (Ed.), Bulletin of Scientific Terms and Neologisms 14. Academy of
Athens.
Ananiadou, S. & Zervanou, Κ. (2004). Problems and methods of automatic term recognition in computational systems
[in Greek]. In Μ. Katsoyannou & Ε. Efthymiou (Eds.), Greek Terminology: research and implementations
(Institute for Language and Speech Processing). Athens: Kastaniotis.
Anastassiadi-Symeonidi, Α. (1986). The neology in Modern Greek Language [in Greek]. Yearbook of the School
of Philosophy of Aristotle University of Thessaloniki, app. 65. Τhessaloniki: Aristotle University.
Bartsch, S., & Evert, S. (2014). Towards a Firthian Notion of Collocation. In A. Abel & L. Lemnitzer (Eds.),
Vernetzungsstrategien, Zugriffsstrukturen und automatisch ermittelte Angaben in Internetwörterbüchern (Vol. 2,
pp. 48-61). Mannheim: Institut für Deutsche Sprache.
Biber, D., Reppen, R., Schnur, E., & Ghanem, R. (2016). On the (non)utility of Juilland’s D to measure lexical dispersion
in large corpora. International Journal of Corpus Linguistics, 21(4), 439–464.
Booij, G. (2009). Phrasal names: a constructionist analysis. Word Structure, 2, 219-240. Edinburgh: Edinburgh
University Press.
Booij, G. (2013). Construction morphology. Oxford: Oxford University Press.
Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks.
International Journal of Corpus Linguistics, 20(2), 139-173.
Bybee, J. (1995). Regular morphology and the lexicon. Language and Cognitive Processes, 10(5).
Bybee, J. (2005). Morphology. Amsterdam: Benjamins.
Bybee, J. (2010). Language, Usage and Cognition. Cambridge: Cambridge University Press.
Cabré, T., & Nazar, R. (2012). Towards a new approach to the study of neology. Neologica, 6, 63-80.
Christofidou, A., Doleschal, U. & Dressler, W.U. (1989). Gender agreement via derivational morphology in Greek.
Glossologia, 9, 69-79.
Christofidou, A. (1994). Okkasionalismen in poetischen Texten: Eine Fallstudie am Werk von O. Elytis. Tübingen: G.
Narr.
Christofidou, A. (2001). Poetic neologisms and their functions [in Greek]. Athens: Gutenberg.
Christofidou, A. (2012). Multi-word compounds, series formations and paleonymic pairs. A contribution to the Greek
notion of word [in Greek]. Ζ. Gavriilidou, Α. Efthymiou, Ε. Thomadaki & P. Kambakis-Voughiouklis (Eds.),
Selected Papers of the 10th International Conference of Greek Linguistics (pp. 1235-1245). Komotini: University
of Thrace.
Christofidou, A., Afentoulidou, V., Karasimos, A. & Dimitropoulou, E. (2013). The electronic program ΝεοΔημία. In
A. Christofidou (Ed.) Creation and form in language, Βulletin of Scientific Terms and Neologisms 12 (pp. 198-
243). Athens: Academy of Athens.
Christofidou, A., Karasimos, A., Afentoulidou, V. (2014). Monitoring and classification of neologisms using the
electronic program ΝΕΟΔΗΜΙΑ. An approach to novel loanwords [in Greek]. Proceedings of the 11th International
Conference of Greek Linguistics, online resource. Rhodes: University of the Aegean.
Collier, A. (1993). Issues of Large-scale Collocational Analysis. Paper presented at the 13th International Conference
on English Language Research on Computerized Corpora, Nijmegen 1992.
Damerau, F. (1993). Generating and evaluating domain-oriented multi-word terms from texts. Information Processing
and Management: an International Journal, 29(4), 433-447.
Downing, P. (1977). On the creation and use of English compound nouns. Language 53, 810-842.
Dressler, W.U. (1987). Word formation (WF) as part of natural Morphology. In W.U. Dressler (Ed.), Leitmotifs in
natural morphology (pp. 99-126). Amsterdam: Benjamins.
Dressler, W.U. (2005). Universal, system-independent morphological naturalness. In P. Štekauer & R. Lieber
(Eds.), Handbook of Word-Formation (pp. 267-284). Dordrecht: Springer.
Dressler, W.U. (2006). Compound types. In G. Libben & G. Jarema (Eds.) The Representation and Processing of
Compound Words (pp. 23-44). Oxford: Oxford University Press.
Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1),
61-74.
Eu, J. (2008). Testing search engine frequencies: Patterns of inconsistency. Corpus Linguistics and Linguistic Theory,
4(2), 177-207.
Evert, S. (2007). Corpora and collocations (Extended Manuscript). [A shorter version was published in 2009: In A.
Lüdeling & M. Kytö (Eds.), Corpus Linguistics: An International Handbook (Vol. 2, pp. 1212-1248). Berlin: Walter
de Gruyter]. Retrieved from http://www.stefan-evert.de/Publications.html
Evert, S. (2008). A Lexicographic Evaluation of German Adjective-Noun Collocations. In Proceedings of the LREC
Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), held at Marrakech, Morocco.
Evert, S., & Krenn, B. (2001). Methods for the Qualitative Evaluation of Lexical Association Measures. In Proceedings
of ACL 2001: 39th Annual Meeting on Association for Computational Linguistics, pp. 188-195, Association for
Computational Linguistics.
Fillmore, C.J. & Baker, C.F. (2001). Frame Semantics for Text Understanding. In Proceedings of WordNet and Other
Lexical Resources Workshop, pp. 371-375, held at North American Association for Computational Linguistics,
Pittsburgh.
Fotopoulou, A., Zourari, M., Mini, Μ. & Giannopoulos, G. (2009). Automatic recognition and Multi-Word Nominal
Expression Extraction from corpora [in Greek]. Studies in Greek Linguistics, 29, 620-633. Thessaloniki: Institute
for Modern Greek Studies.
Foufi, V. (2010). Automatic recognition of multi word compounds in Modern Greek texts. [in Greek] MEG 30. INS:
Thessaloniki.
Foufi, V. (2014). Morphological, semantic and syntactic description of MW AN compounds [in Greek]. Enyalio Idryma:
Thessaloniki
Gablasova, D., Brezina, V., & McEnery, T. (2017). Collocations in Corpus-Based Language Learning Research:
Identifying, Comparing, and Interpreting the Evidence. Language Learning, 67(1), 1-25.
Gaeta, L. & Ricca, D. (2009). Composita solvantur: Compounds as lexical units or morphological objects? Rivista di
Linguistica 21(1), 35-70.
Gotsoulia, V., Desipri, E., Koutsombogera, M., Prokopidis, P., Papageorgiou, H. & Markopoulos, G. (2007). Towards
a frame semantics lexical resource for Greek. In K. De Smedt, J. Hajič & S. Kübler (Eds.), Proceedings of the Sixth
International Workshop on Treebanks and Linguistic Theories (NEALT Proceedings Series) 1, 55-59.
Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4),
403-437.
Gries, S. T. (2015). 50-something years of work on collocations. What is or should be next…*. In S. Hoffmann, B.
Fischer-Starcke, & A. Sand (Eds.), Current Issues in Phraseology (Vol. 74, pp. 135-164): John Benjamins
Publishing Company.
Hamblin, R. L., Miller, J. L. L., & Saxton, E. D. (1979). Modeling Use Diffusion. Social Forces, 57(3), 799-811.
Herbermann, C.-P. (1981). Wort, Basis, Lexem und die Grenze zwischen Lexicon und Grammatik. München: Fink.
Herold, A., & Stathi, K. (2007, 27–30 July). Measuring syntagmatic Fixedness of Multi-Word Expressions. Paper
presented at the Corpus Linguistics Conference 2007, University of Birmingham, UK.
Inger Lyse, G., & Andersen, G. (2012). Collocations and statistical analysis of n-grams. Multiword expressions in
newspaper text. In G. Andersen (Ed.), Exploring Newspaper Language: Using the web to create and investigate a
large corpus of modern Norwegian (pp. 79–110). Amsterdam / Philadelphia: John Benjamins Publishing Company.
Jackendoff, R. (1990). Semantic Structures. Cambridge, Mass.: MIT Press.
Kastovsky, D. (1982). Word-formation. A functional view. Folia Linguistica, 16, 181-198.
Kastovsky, D. (2005). Hans Marchand and the Marchandeans. In P. Štekauer & R. Lieber (Eds.), Handbook of
Word-Formation (pp. 99-124). Dordrecht: Springer.
Kerremans, D. (2015). A Web of New Words. A Corpus-Based Study of the Conventionalization Process of English
Neologisms. Frankfurt am Main: Peter Lang.
Kilgarriff, A., Busta, J. & Rychlý, P. (2015). Diacran: a framework for diachronic analysis. Retrieved from
www.sketchengine.co.uk/wp-content/uploads/Diacran_CL2015.pdf
Kilgarriff, A., & Kosem, I. (2012). Corpus tools for lexicographers. In S. Granger & M. Paquot (Eds.), Electronic
Lexicography: Oxford University Press.
Körtvélyessy, L., Štekauer, P. & Zimmermann, J. (2015). Word-formation strategies: semantic transparency vs. formal
economy. In L. Bauer, L. Körtvélyessy & P. Štekauer (Eds.), Semantics of Complex words (pp. 85-113). Dordrecht:
Springer.
Krenn, B., & Evert, S. (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. Paper
presented at the ACL/EACL 2001: Workshop on the Computational Extraction, Analysis and Exploitation of
Collocations, Toulouse, France.
Langacker, R. (1987). Foundations of cognitive grammar. Vol. I. Stanford: Stanford University Press.
Langacker, R. (2013). Essentials of cognitive grammar. Oxford: Oxford University Press.
Leech, G., Rayson, P., & Wilson, A. (2001). Word Frequencies in Written and Spoken English. Based on the British
National Corpus: Routledge.
Levshina, N. (2015). How to do Linguistics with R. Data exploration and statistical analysis: John Benjamins.
Libben, G. (1998). Semantic transparency in the processing of compounds: consequences for representation, processing
and impairment. Brain and Language 61, 30-44.
Lieber, R. (2005). English Word-Formation Processes. In P. Štekauer & R. Lieber (Eds.), Handbook of Word-Formation
(pp. 375-427). Dordrecht: Springer.
Linardaki, E., Ramisch, C., Villavicencio, A., & Fotopoulou, A. (2010). Towards the Construction of Language
Resources for Greek Multiword Expressions: Extraction and Evaluation. In S. Piperidis, M. Slavcheva & C. Vertan
(eds), Proceedings of the LREC Workshop on Exploitation of multilingual resources and tools for Central and
(South) Eastern European Languages, Seventh International Conference on Language Resources and Evaluation,
Valetta, Malta, p. 31–40.
Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA:
The MIT Press.
Maynard, D. & Ananiadou, S. (2001). Trucks: A Model for Automatic Multi-word Term Recognition. Journal of
Natural Language Processing, 8(1), 101-125 doi: 10.5715/jnlp.8.101
Mel’čuk, I.A. (1996). Lexical Functions: A tool for the description of lexical relations in the lexicon. In L. Wanner (ed.),
Lexical Functions in Lexicography and Natural Language Processing (pp. 37–102). Amsterdam: John Benjamins.
Michou, A. & Seretan, V. (2009). A Tool for Multi-Word Expression Extraction in Modern Greek Using Syntactic
Parsing. In Proceedings of the EACL 2009 (pp. 45–48), Association for Computational Linguistics.
Mini, M. & Fotopoulou, A. (2009). Typology of MW verbal phrases in MG Dictionaries [in Greek]. Proceedings of the
18th International Symposium on Theoretical and Applied Linguistics (pp. 491-503). Τhessaloniki: Aristotle University.
Mpakakou-Orfanou, K. (2005). The Word of Modern Greek in the Language System and the Text [in Greek]. Journal
Parousia, Monograph Series n. 65. Athens: Kapodistrian University.
Panagl, O. (1987). Productivity and diachronic change in Morphology. In W.U. Dressler (ed.), Leitmotifs in natural
morphology (pp. 127-151). Amsterdam: Benjamins.
Pustejovsky, J. (1995). The Generative Lexicon. Cambridge, MA: MIT Press.
Ralli, A. & Stavrou, M. (1998). Morphology-Syntax Interface: A-N Compounds vs A-N Constructs in Modern Greek.
G. Booij & J. van Marle (Eds.) Yearbook of Morphology 1997 (pp. 243-264). Dordrecht: Kluwer.
Ralli, A. (2013). Compounding in Modern Greek. Dordrecht: Springer.
Ralli, Α. (2007). Word Compounding: Cross-linguistic morphological approach [in Greek]. Αthens: Patakis.
Renouf, A. (1993). Sticking to the text: a corpus linguist's view of language. Aslib Proceedings, 45(5), 131-136.
doi:10.1108/eb051316
Renouf, A. (2002). The Time Dimension in Modern English Corpus Linguistics. Paper presented at the Fourth
International Conference on Teaching and Language Corpora, Graz 19-24 July, 2000.
Samaridi, N. & Markantonatou, S. (2014). Parsing Modern Greek verb MWEs with LFG/XLE grammars. In The 10th
Workshop on Multiword Expressions (MWE 2014), Workshop at EACL 2014 (pp. 33-37). Gothenburg, Sweden.
Schmid, H.-J. (2008). New Words in the Mind: Concept-formation and Entrenchment of Neologisms. Anglia -
Zeitschrift für englische Philologie, 126(1), 1-36.
Schmid, H.-J. (2015). A blueprint of the Entrenchment-and-Conventionalization Model. Yearbook of the German
Cognitive Linguistics Association, 3(1), 3-25.
Schreuder, R., & Baayen, R. H. (1997). How complex simplex words can be. Journal of Memory and Language, 37,
118-139.
Scott, M. (2016). WordSmith Tools version 7.0. Lexical Analysis Software. Liverpool.
Štekauer, P. (2005). Onomasiological approach to word-formation. In P. Štekauer & R. Lieber (Eds.) Handbook of
word-formation (pp. 207-232). Dordrecht: Springer.
Thelwall, M. (in progr.). Big Data and Social Web Research Methods. An updated and extended version of the book:
Introduction to Webometrics. [in-progress draft]: University of Wolverhampton.
Wahl, A., & Gries, S. T. (in progr.). Multi-word expressions: A novel computational approach to their bottom-up
statistical extraction. In P. Cantos (Ed.), Lexical collocation analysis: advances and applications. Berlin & New
York: Springer.
