This repository contains Ancient Greek texts which have been automatically tokenized, POS-tagged, sentence-split, and lemmatized. The texts come from the following repositories, which currently contain most of the Ancient Greek texts freely accessible over the internet:
The tokenization, POS tagging, and sentence splitting rely on the data provided in:
Refer to these repositories for further documentation. In the present repository, the POS tag and the word form of each token have been automatically linked to those contained in Morpheus (see the "Morpheus" folder) and MorpheusUnderPhilologic. Since these two databases also contain lemmata, the lemmata could be extracted automatically.
The XML structure of each file is self-explanatory, and the abbreviations are expanded at the beginning of each file. For convenience I give an example here:
<s n="2">
<t p="4" n="1" a="[1]" o="p-s---mn-" u="1">
<f>ὃς</f>
<l i="234">
<l1 o="pr-s---mn-">ὅς</l1>
</l>
</t>
<t p="4" n="2" a="[1]" o="p-p---fa-" u="2">
<f>τάσδε</f>
<l i="5901">
<l1 o="pd-p---fa-">ὅδε</l1>
<l2>ὅδε</l2>
</l>
</t>
<!-- further t elements -->
</s>
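A fragment like the one above can be read with Python's standard library. The following is only a minimal sketch: the SAMPLE string is copied from the example, and read_tokens is a hypothetical helper name, not something shipped with this repository. It assumes that each t element holds one f element and zero or more l elements whose l1/l2 children carry the lemma candidates.

```python
import xml.etree.ElementTree as ET

# Sample sentence copied from the example fragment (comment line omitted).
SAMPLE = """<s n="2">
  <t p="4" n="1" a="[1]" o="p-s---mn-" u="1">
    <f>ὃς</f>
    <l i="234">
      <l1 o="pr-s---mn-">ὅς</l1>
    </l>
  </t>
  <t p="4" n="2" a="[1]" o="p-p---fa-" u="2">
    <f>τάσδε</f>
    <l i="5901">
      <l1 o="pd-p---fa-">ὅδε</l1>
      <l2>ὅδε</l2>
    </l>
  </t>
</s>"""


def read_tokens(sentence):
    """Return one dict per <t> element found inside a parsed <s> element."""
    tokens = []
    for t in sentence.iter("t"):
        lemmata = []
        for l in t.findall("l"):
            for cand in l:
                # Keep only the lemma candidates (<l1/> and <l2/>).
                if cand.tag in ("l1", "l2"):
                    lemmata.append({
                        "source": cand.tag,      # which database it came from
                        "lemma": cand.text,
                        "tag": cand.get("o"),    # None when @o is absent
                    })
        tokens.append({
            "form": t.findtext("f"),
            "tagger_tag": t.get("o"),
            "position": int(t.get("u")),
            "lemmata": lemmata,
        })
    return tokens


tokens = read_tokens(ET.fromstring(SAMPLE))
for tok in tokens:
    candidates = sorted({c["lemma"] for c in tok["lemmata"]})
    print(tok["position"], tok["form"], tok["tagger_tag"], candidates)
```

For a whole file, the same helper could be applied to every s element returned by ET.parse(path).getroot().iter("s"), where path is whichever file you are working with.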
Read the above XML fragment as follows:

- s element: the sentence element, where @n is the sentence number
- t element: the token element, which carries a number of attributes describing its position and morphological analysis:
  - @p: passage-level CTS URN
  - @n: position of the token in @p
  - @a: nth occurrence of that token in @p
  - @o: morphological analysis of the token as provided automatically by the Mate tagger (this analysis follows the Morpheus format explained below)
  - @u: position of the token within the s(entence) element
- f element: the word form of the token
- l element: possible lemmata extracted from Morpheus (<l2/>) and PerseusUnderPhilologic (<l1/>), found by matching their word forms AND POS tags with those found in the present database. In <l1/>, @o contains the original PerseusUnderPhilologic POS tag (see the abbreviations below), which can be more informative than the Morpheus one. For example, ὃς in the above example is analyzed in PerseusUnderPhilologic as a relative pronoun (o="pr-s---mn-": see "r" in second position). Similarly, ὅδε is analyzed as a demonstrative pronoun, while Morpheus simply treats it as a pronoun. One token may have more than one <l1/> and/or <l2/> element associated with it.

The Morpheus POS tag in t/@o consists of 9 characters, each of which has an unambiguous meaning:
- 1: part of speech
  - n: noun
  - v: verb
  - a: adjective
  - d: adverb
  - l: article
  - g: particle
  - c: conjunction
  - r: preposition
  - p: pronoun
  - m: numeral
  - i: interjection
  - u: punctuation
- 2: person
  - 1: first person
  - 2: second person
  - 3: third person
- 3: number
  - s: singular
  - p: plural
  - d: dual
- 4: tense
  - p: present
  - i: imperfect
  - r: perfect
  - l: pluperfect
  - t: future perfect
  - f: future
  - a: aorist
- 5: mood
  - i: indicative
  - s: subjunctive
  - o: optative
  - n: infinitive
  - m: imperative
  - p: participle
- 6: voice
  - a: active
  - p: passive
  - m: middle
  - e: medio-passive
- 7: gender
  - m: masculine
  - f: feminine
  - n: neuter
- 8: case
  - n: nominative
  - g: genitive
  - d: dative
  - a: accusative
  - v: vocative
  - l: locative
- 9: degree
  - c: comparative
  - s: superlative

The meaning of the abbreviations in t/l/l1/@o (used in MorpheusUnderPhilologic) is the same as in Morpheus (see above), except for the first two characters. Read them like this:
- ae: proper adjective (e.g., Ἀθηναῖος)
- ne: proper noun (e.g., Ζεύς)
- d-: adverb (e.g., οὐ)
- dd: demonstrative adverb (e.g., ταύτῃ)
- de: proper name adverb (e.g., Ἀθήναζε)
- di: interrogative adverb (e.g., ποῦ)
- dr: relative adverb (e.g., οἷ)
- dx: indefinite adverb (e.g., που)
- c-: conjunction (e.g., καί)
- r-: preposition
- p-: pronoun
- pa: definite article
- pc: reciprocal pronoun (e.g., ἀλλήλους)
- pd: demonstrative pronoun (e.g., οὗτος)
- pi: interrogative pronoun (e.g., τίς)
- pk: reflexive pronoun (e.g., σεαυτόν)
- pp: personal pronoun (e.g., με)
- pr: relative pronoun (e.g., ὅς)
- ps: possessive pronoun (e.g., ἐμός)
- px: indefinite pronoun (e.g., τις)
- m-: numeral
- i-: interjection (e.g., ὀτοτοί)
- e-: exclamation
- y-: math term or abbreviation for all of Euclid's ΑΒΓ geometrical figures
- g-: particle
- gm: modal particle (e.g., κε)

In version (1.2.5):
In version (1.2.4):
In version (1.2.3):
In version (1.2.2):
In version (1.2.1):
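
The positional tags described earlier (the 9-character tag in t/@o and the two-character codes at the start of t/l/l1/@o) can be decoded with simple lookup tables. The following is a minimal sketch, not part of this repository: decode_tag and decode_prefixed_tag are hypothetical helper names, the tables are transcribed from the lists above, and two things are assumptions inferred from the example tags rather than stated rules: that "-" marks a position that does not apply, and that a t/l/l1/@o tag is 10 characters long (two-character code followed by positions 2 to 9).

```python
# Position-by-position tables for the 9-character tag (position 1 is index 0).
FIELDS = [
    ("part of speech", {"n": "noun", "v": "verb", "a": "adjective",
                        "d": "adverb", "l": "article", "g": "particle",
                        "c": "conjunction", "r": "preposition", "p": "pronoun",
                        "m": "numeral", "i": "interjection", "u": "punctuation"}),
    ("person", {"1": "first person", "2": "second person", "3": "third person"}),
    ("number", {"s": "singular", "p": "plural", "d": "dual"}),
    ("tense", {"p": "present", "i": "imperfect", "r": "perfect",
               "l": "pluperfect", "t": "future perfect", "f": "future",
               "a": "aorist"}),
    ("mood", {"i": "indicative", "s": "subjunctive", "o": "optative",
              "n": "infinitive", "m": "imperative", "p": "participle"}),
    ("voice", {"a": "active", "p": "passive", "m": "middle",
               "e": "medio-passive"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("case", {"n": "nominative", "g": "genitive", "d": "dative",
              "a": "accusative", "v": "vocative", "l": "locative"}),
    ("degree", {"c": "comparative", "s": "superlative"}),
]

# Two-character codes used at the start of t/l/l1/@o.
PREFIXES = {
    "ae": "proper adjective", "ne": "proper noun", "d-": "adverb",
    "dd": "demonstrative adverb", "de": "proper name adverb",
    "di": "interrogative adverb", "dr": "relative adverb",
    "dx": "indefinite adverb", "c-": "conjunction", "r-": "preposition",
    "p-": "pronoun", "pa": "definite article", "pc": "reciprocal pronoun",
    "pd": "demonstrative pronoun", "pi": "interrogative pronoun",
    "pk": "reflexive pronoun", "pp": "personal pronoun",
    "pr": "relative pronoun", "ps": "possessive pronoun",
    "px": "indefinite pronoun", "m-": "numeral", "i-": "interjection",
    "e-": "exclamation", "y-": "math term or abbreviation",
    "g-": "particle", "gm": "modal particle",
}


def decode_tag(tag):
    """Decode a 9-character tag into {category: value}, skipping '-' slots."""
    if len(tag) != len(FIELDS):
        raise ValueError("expected a 9-character tag, got %r" % tag)
    decoded = {}
    for (name, values), char in zip(FIELDS, tag):
        if char == "-":  # assumption: '-' means the category does not apply
            continue
        decoded[name] = values.get(char, "unknown code %r" % char)
    return decoded


def decode_prefixed_tag(tag):
    """Decode a 10-character tag whose first two characters are a PREFIXES code."""
    if len(tag) != len(FIELDS) + 1:
        raise ValueError("expected a 10-character tag, got %r" % tag)
    # Drop the second character so the rest lines up with the 9-character layout.
    decoded = decode_tag(tag[0] + tag[2:])
    if tag[:2] in PREFIXES:
        decoded["part of speech"] = PREFIXES[tag[:2]]
    return decoded


print(decode_tag("p-s---mn-"))
print(decode_prefixed_tag("pr-s---mn-"))
```

Codes that are not listed in the tables are reported as unknown instead of raising an error, so tags from other versions of the data can still be inspected.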