"Understanding the Chicago Homer" is one of three help documents. It discusses the goals, features, and limitations of the Chicago Homer in an analytical and historical manner. For instructions on how to use the site, consult Using the Chicago Homer. The tutorial, What can you do with the Chicago Homer? develops in detail two scenarios of use for a student of intermediate Greek and a reader without Greek.
While you can navigate the Chicago Homer successfully without knowing too much about how the individual parts work and fit together, for serious work you will want to understand its architecture and the query potential of the data. In the light of possible future revisions of the interface, it is especially useful to know what kinds of questions can in principle be addressed to the data as opposed to only the questions that it is possible to ask through the current interface.
The Chicago Homer is a bilingual database that uses the search and display capabilities of electronic texts to make the distinctive features of Early Greek epic accessible to readers with and without Greek. Its component parts are
The most salient feature of the Chicago Homer is its ability to make visible the network of phrasal repetition that is so distinctive a feature of Homeric poetry. We reserve the rest of this introduction to a brief discussion of repetitions before turning to a detailed account of the texts and translation, the database and its parts, and the user interface.
In his short book Corpus, Concordance, Collocation (Oxford, 1991), the British linguist John Sinclair discusses the comparative merits of accounting for linguistic usage by the "open choice principle" or the "idiom principle." On the former principle you assume that the author of an utterance can at any point choose from a large range of options to continue or complete a statement, subject only to the constraints of grammaticalness. But this principle does not seem to provide a sufficient explanation for the actual constraints faced by speakers, and a second principle is needed:
The principle of idiom is that a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analyzable into segments. To some extent, this may reflect the recurrence of similar situations in human affairs; it may illustrate a natural tendency to economy of effort; or it may be motivated in part by the exigencies of real-time conversation. (Sinclair, 110)
It is apparent from the slightest acquaintance with Homeric poetry that it depends in a particularly extensive and distinctive fashion on the "idiom principle." It is not much of an exaggeration to say that the Homeric Question is a question about phrasal repetition. Imitators and parodists from Vergil through Fielding and Joyce recognized this when they focused on phrasal repetition as the essence of Homeric poetry. In the twentieth century, Milman Parry, building on generations of earlier research on conventional elements in Greek epic, gave a comprehensive and systematic account of Homeric poetry by organizing it under the paradigm of "oral poetry," and despite many attacks and modifications, this paradigm has not shifted very much.
The Chicago Homer for the first time gathers the phenomena of phrasal repetition in a systematic manner and makes them accessible to analysis by scholars with technical philological interests as well as by nonspecialist readers and critics. There is no other reference work that lets a user draw up lists of repeated phrases by different parameters, such as words contained, location, length, or frequency. More broadly, the Chicago Homer is the only tool that lets a user evaluate a repeated phrase against the resonances as well as the mere noise of other repetitions. It can do so because of the deliberately mechanical way in which the index of repetitions was generated in the first place and because it includes devices not only for displaying repeated elements in the text but for filtering them by various criteria.
For any sequence of two or more words that is repeated anywhere in the corpus of Early Greek epic, the Chicago Homer contains a database entry identifying its locations and other properties. It does not matter whether the repeated sequence is a mere bit of syntactic glue such as "but he," a noun-epithet formula such as "rosy-fingered Dawn," or an extended narrative stretch, such as the story of the shroud Penelope wove for Laertes. The database treats them all as instances of one word followed by one or more words in more than one location.
The index of repetitions is thus not restricted to "formulae," or phrases in the narrower sense of idiomatic collocations. It does not claim to pick out the verbal patterns humans process as meaningful, and it contains only the word strings a computer can be taught to recognize as repetitions, but it does contain all of those. This is a disadvantage for a user who wants to focus immediately on repetitions that play a significant role in the poetic economy of the epics. But it is a great advantage for a user who wants to know how formulaic repetition functions against the broader background of verbal repetition. The Chicago Homer gives a user an opportunity to see how much "repeated stuff" there is in the first place, how much of it is just noise, and how much of it resonates in the environment of a distinct mode of composition.
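The mechanical character of such an index can be illustrated with a minimal sketch. It is not the procedure actually used to build the Chicago Homer index; the function name `index_repetitions`, the `max_len` cutoff, and the toy English word stream are assumptions made for illustration only. The sketch records every sequence of two or more words that occurs in more than one location, with no attempt to separate "formulae" from syntactic glue.

```python
from collections import defaultdict

def index_repetitions(words, max_len=None):
    """Map each word sequence of length >= 2 that occurs more than once
    to the list of positions at which it starts."""
    n = len(words)
    max_len = max_len or n
    positions = defaultdict(list)
    for length in range(2, max_len + 1):
        for start in range(n - length + 1):
            positions[tuple(words[start:start + length])].append(start)
    # Keep only sequences found in more than one location.
    return {seq: locs for seq, locs in positions.items() if len(locs) > 1}

# Toy English stand-in for a stream of Greek words (not corpus data).
text = ("but he spoke and rosy-fingered dawn rose "
        "but he slept and rosy-fingered dawn rose").split()
reps = index_repetitions(text, max_len=4)

print(reps[("but", "he")])                # syntactic glue is indexed too
print(reps[("rosy-fingered", "dawn")])    # so is a noun-epithet phrase
```

Because nothing is filtered at this stage, any judgment about which repetitions matter is left to later filtering by length, frequency, or content.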
The display features of the Chicago Homer are invaluable for making judgments of this kind. In one of Milman Parry's papers, there is a famous page that marks the repetitions in the opening lines of each epic and seeks to give a visual representation of the ways in which every passage of Homeric verse is shot through with formulaic materials. Parry uses different kinds of underlining to mark different types of repetition, and the footnotes at the bottom of the page act as hyperlinks.
Effective as this demonstration was, it suffers from two disadvantages:
The Chicago Homer lets the reader turn any segment into a fuller version of "Parry's page." You can display a line range of any chosen length with all of the repetitions that occur in it. In practice, this is rarely advisable since the dense network of overlapping repetitions clutters up the screen and makes the text unreadable. But you can filter repetitions by criteria of length or frequency, and a general filter allows you to screen out the large set of two- or three-word strings that are dominated by function words and for many purposes can be dismissed as "junk repetitions," although for some purposes they provide crucial evidence.
Since the repetitions appear as links on the screen, you can go from one occurrence to another, traveling along the neural networks of bardic memory, or simulating the experience of an ancient listener for whom many phrases in the text would resonate with a general or specific memory of that phrase in other contexts.
The Chicago Homer can also generate a list of repeated phrases that appear in one section of a poem and another. This is a particularly useful feature for a reader without Greek who wants to compare different passages. It is impossible for any translator of Homer, however faithful, to replicate the network of repetitions with any precision. But a reader who is struck by resemblances between the deaths of Hektor and Patroklos can easily establish the powerful network of phrases that uniquely link those two passages.
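The comparison of two passages can be pictured as a set operation. The following sketch is an assumption about how such a list could be computed, not a description of the site's own query. It collects the word sequences that two passages have in common and, if the rest of the corpus is supplied, keeps only those that occur nowhere else. The function names and the toy English passages are hypothetical.

```python
def ngrams(words, min_len=2):
    """All contiguous word sequences of at least min_len words."""
    return {tuple(words[i:i + k])
            for k in range(min_len, len(words) + 1)
            for i in range(len(words) - k + 1)}

def linking_phrases(passage_a, passage_b, rest_of_corpus=None, min_len=2):
    """Phrases shared by two passages, longest first.
    If rest_of_corpus is given, drop phrases that also occur there,
    leaving only the phrases unique to the pair."""
    shared = ngrams(passage_a, min_len) & ngrams(passage_b, min_len)
    if rest_of_corpus is not None:
        shared -= ngrams(rest_of_corpus, min_len)
    return sorted(shared, key=lambda p: (-len(p), p))

# Toy English word lists, not quotations from any translation.
a = "his soul fled from his limbs and went to hades".split()
b = "then his soul fled from his limbs mourning".split()
rest = "from his ship he went to the shore".split()

links = linking_phrases(a, b, rest)
print(links[0])
```

In this toy example the shorter shared string "from his" is dropped because it also occurs in `rest`, while the long shared phrase survives.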
More of the scholarly and technical details are discussed in the special section on repetitions. We conclude this introductory discussion by drawing attention to the fact that very specific capabilities of electronic search and display technologies are used here to represent the most distinctive features of Homeric poetry as a form of composition that has its roots in a preliterate world. It would not be possible to do anything like it in a print medium. By the same token, there probably is not another set of texts for which so precise and exhaustive a charting of phrasal repetition would be either possible or illuminating.
The Greek Texts and English Translations
Except for fragments, the Chicago Homer includes all the 32,468 lines of hexametric verse that make up the corpus of Early Greek epic: the Iliad (15,693), the Odyssey (12,110), the poems attributed to Hesiod (2,333), and the Homeric Hymns (2,332).
The texts in the Chicago Homer are derived from the electronic texts used in the Perseus Project. For the Iliad, Hesiod, and the Homeric Hymns, the Perseus texts are the digital transcriptions of the Oxford Classical Texts that are also used in the Thesaurus Linguae Graecae. The Perseus text of the Odyssey is a version of the Loeb Library text scanned at the University of Chicago in 1989. The basic SGML encoding for all these texts was done by Lauren Burka and Chiara Thayer for the Perseus Project. Craig Berry and Bill Parod developed a modified version of the TEI drama DTD (i.e. the "document type definition" "drama," using the Text Encoding Initiative guidelines) to encode additional information, in particular speeches and speaker identities.
We collated the Perseus texts of Homer with the electronic version of Helmut von Thiel's text, and where the texts diverge, we followed that text in most instances. Von Thiel's edition has a marked preference for the readings of the vulgate text, on the sensible ground that this is the text that was read through much of antiquity.
In order to make searches more predictable, we have standardized orthographic conventions across the texts. This has affected capitalization (reserved for proper names), the treatment of diacritical marks, and occasional spelling variants. For example, whereas such forms as "polemon de" or "hala de" (to war, towards the sea) appear in our source texts as single words (polemonde) or as word strings (polemon de), we have consistently treated such forms as single word forms and classified them as adverbial forms.
The resultant text is best described as a standard edition tweaked a little to make it compatible with the database environment in which it functions.
The Chicago Homer uses Richmond Lattimore's translations of the Iliad and Daryl Hine's translations of the Theogony, Works and Days, and the Homeric Hymns. Lattimore's translations and Hine's translation of the hymns were digitally transcribed by the University of Chicago Press. The translations of the Theogony and Works and Days were specially prepared for the Chicago Homer.
In the rare instances where the text of Lattimore's translation differs from ours (mainly in the order of lines), we have adjusted the translation to fit our text. The most significant of these is Iliad 8.547-553. Iliad 8.548 and 8.550-552 are usually omitted, and Lattimore does not translate them. We have included the lines and a translation of them by Ahuvia Kahane.
The current version of the Chicago Homer does not include a translation of the Odyssey or of The Shield of Herakles, a short narrative poem transmitted under the name of Hesiod but certainly not by him.
The Chicago Homer includes the classic German translations of the Iliad and Odyssey by Johann Heinrich Voss. The texts were taken from the German Project Gutenberg.
Text Display and Character-Set Problems
In the Chicago Homer, you can display the original, the translation, or both in interlinear fashion. The interlinear display draws attention to the exceptional fidelity of Lattimore's translation: in a high proportion of cases, the semantic content of the English will sit in the same part of the line as in the Greek, and in most cases most of the Greek content is caught in the same English line. This is very useful for the student with little Greek. Daryl Hine's translations of Hesiod and the Homeric Hymns do not maintain quite so strict a line-to-line correspondence, but the general point remains.
In German translations of Homer line-by-line correspondence has traditionally been easier to achieve because German supports serious hexametric verse in a way in which English does not. The Voss translations work very well in an interlinear display, and English readers whose German is as good as or better than their Greek may find Voss useful for making sense of difficult passages in the Odyssey.
For a reader without Greek, the display of the original text in transliteration is a great help. In an ordinary day in the modern world, educated speakers of English may use hundreds of words derived from Greek, and with a little effort it is possible to put that tacit knowledge of the meaning of many Greek words to use through a joint navigation of the translation and the original. Transliteration is the key to such an effort because it removes the barriers of a foreign alphabet that looks more different than it is.
A transliterated version of an ancient Greek text is not necessarily a second-best solution or Greek at one remove. The modern Greek alphabet developed from the practices of medieval scribes and Renaissance typesetters. Thucydides or Plato would have a very hard time making sense of a page of their writings in an Oxford Classical Text edition. They would not recognize many of the characters and would be puzzled by the elaborate system of accents and breathing marks bequeathed to us by late antiquity as a classic example of a priestly cure that is worse than the disease.
Modern transliteration depends on Roman conventions of representing Greek writing and is in some key respects closer than our Greek to what the tyrant Peisistratos, Thucydides, or Plato saw on a scroll. First, transliteration emphasizes the commonality between the Greek and Roman alphabets, which are both derived from Semitic alphabets and are variants of a character set in which roughly two-thirds of some two dozen signs have kept remarkably stable associations with particular sounds (e.g. A, B, M, O, T), while the remaining third has wandered (e.g. H, P, X). The forms of Greek and Roman capital letters were much closer to each other than the medieval and modern minuscule and cursive scripts in which Roman "a" and Greek "alpha" try their best to hide the fact that they are the same letter.
The letter "h" plays a special role in the difference between Greek and Roman scripts. The Romans quite intelligently used the combinations "ch," "ph," and "th" to mark the aspirated stops "chi," "phi," and "theta." This recognizes the presence of systematic variance and is arguably superior to the Greek practice of using different signs, which suggest no relationship between the phonemes. The Romans also used "h" to mark the rough breathing on an initial vowel. In this they followed what was standard Greek practice until the fifth century. In the earliest spelling of the name "Hektor" on a Greek vase of the sixth century, it appears as HEKTOR.
Fifth-century Greeks felt less need for marking initial breathings than for distinguishing between short and long vowels. They used the sign "h" to mark "eta" or a long "e." Thus Roman transliteration is actually more faithful than the modern ancient Greek alphabet to the kind of distinctions made or not made prior to the fifth century. Modern Roman transliterations add the distinction between long and short vowels, thus creating a hybrid of Old and New Attic spelling. It is a clean, accurate, and familiar system.
The point is worth making since the display of ancient Greek on a computer screen is still limited. There is a scale problem with the glyphs of the Greek character set: screen resolution is simply too coarse for a legible and pleasing representation of the squiggly little accents and breathing marks on top of letters of ordinary size. This problem is likely to persist for some time.
Be that as it may, readers who do not know Greek or find Greek on the screen hard on their eyes can take comfort in the fact that a transliterated Iliad on a computer screen is no further from the texts read by Thucydides, Herodotus, and Plato than an Oxford or Teubner text. In some important respects it is closer.
On a modern computer, ancient Greek texts are stored according to a protocol known as "beta code." It was developed in the sixties as a way of storing Greek texts on mainframe computers using only the character set available on a standard IBM keyboard (which then was the only character set available).
In beta code, the letters of the Greek alphabet are mapped to their closest Roman equivalents. This produces straightforward correspondences for the seventeen letters A, B, D, E, G, I, K, L, M, N, O, P, R, S, T, U, and Z. The use of H for long "e" and X for the aspirated guttural depends on quite ancient orthographic conventions. The letters C, F, Q, W, and Y are respectively mapped to the x-sound "xi," the aspirated labial and dental "phi" and "theta," the long o "omega," and the double consonant "psi."
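The letter mapping just described can be written out as a small table. The sketch below is an illustration rather than code from the Chicago Homer. It lists the beta code letters in Greek alphabetical order and pairs them with the Unicode capital letters Alpha through Omega. The names `BETA_LETTERS` and `to_greek` are hypothetical.

```python
# Beta code letters in Greek alphabetical order:
# alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu
# nu xi omicron pi rho sigma tau upsilon phi chi psi omega
BETA_LETTERS = "ABGDEZHQIKLMNCOPRSTUFXYW"

# Unicode Greek capitals U+0391..U+03A9; U+03A2 is an unassigned slot.
GREEK_CAPITALS = [chr(c) for c in range(0x0391, 0x03AA) if c != 0x03A2]

BETA_TO_GREEK = dict(zip(BETA_LETTERS, GREEK_CAPITALS))

def to_greek(unaccented_beta):
    """Convert unaccented beta code to Greek capital letters.
    Characters outside the 24-letter table (spaces, for example)
    are passed through unchanged."""
    return "".join(BETA_TO_GREEK.get(ch, ch) for ch in unaccented_beta.upper())

print(to_greek("MHNIN AEIDE QEA"))
```

Only the 24 letters are handled here; accents, breathings, and capitalization markers would need separate treatment.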
In beta code, the accents and breathing marks of later scribes are represented by various symbols on the modern typewriter: the opening and closing parentheses ( ) stand for rough and smooth breathing. The forward and backward slashes / and \ stand for acute and grave accents. The = sign is used for the circumflex. Sixties mainframes were like ancient scribes in having no lowercase letters. Thus capitalization had to be marked in a special way. Beta code uses the asterisk * to mark a capital letter.
The resultant code is extremely ugly to the human eye and very difficult to use on a keyboard without errors. It is, however, very easily processed by a computer. The difficulty of correctly entering accented beta code has led developers of search programs to provide users with the option to enter a search term in unaccented beta code. The search is then executed on a text with the instruction to ignore the special characters. Most users depend on unaccented beta code searches because the time cost of correctly entering accented beta is not worth the benefit of avoiding occasional ambiguity in search results. This throws an interesting retrospective light on the cost/benefit aspects of accents and breathing marks in the representation of ancient Greek.
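An unaccented search of this kind amounts to discarding every non-letter symbol on both sides of the comparison. The sketch below shows one way to do that. It is an assumption about the general technique, not the code of any particular search program. The accented line is our own beta code transcription of Iliad 1.1 (with + for the diaeresis) and is not taken from the Chicago Homer data.

```python
import re

def strip_beta_diacritics(beta):
    """Reduce accented beta code to uppercase unaccented beta code by
    dropping everything that is neither a letter nor whitespace."""
    return re.sub(r"[^A-Za-z\s]", "", beta).upper()

def find_unaccented(term, lines):
    """Return the indexes of lines containing the term, ignoring
    accents, breathings, and capitalization markers on both sides."""
    needle = strip_beta_diacritics(term)
    return [i for i, line in enumerate(lines)
            if needle in strip_beta_diacritics(line)]

# Iliad 1.1 in accented beta code (our transcription, an assumption).
line = r"MH=NIN A)/EIDE QEA\ *PHLHI+A/DEW *)AXILH=OS"

print(strip_beta_diacritics(line))
print(find_unaccented("qea", [line]))
```

The price of this convenience is the occasional ambiguity mentioned above: two words that differ only in accent or breathing become indistinguishable once the symbols are stripped.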
Reading a Greek text in uppercase unaccented beta is an illuminating experience. It presents the text with a level of orthographic information very close to what the sixth-century Athenian tyrant Peisistratos would have seen, but in a familiar graphic environment and with word division, a crucial feature unknown to ancient Greek readers anywhere. Here for instance are the opening lines of the Iliad:
MHNIN AEIDE QEA PHLHIADEW AXILHOS
Here are the same lines without word division:
MHNINAEIDEQEAPHLHIADEWAXILHOS
Unaccented beta code, crude workaround as it may be, turns out to capture the orthographic structure of the "Urtext" with surprising precision. It is certainly a lot more like the writing on a putative sixth-century scroll than the ancient Greek we see on a modern printed page.
There are some special problems involved in the representation of Greek characters on the Web. If you only read the text, you can ignore them. But if you enter Greek terms for searches, it helps to know a little about the steps by which a character appears on the screen. It is also useful to know that some awkward features of the Chicago Homer result from the deep constraints still imposed on multilingual environments by the basic conventions computers use to represent written characters.
In any computing environment the alphabetical and numerical characters we read are mapped to binary numbers that the computer can process. Because these mappings are fixed in a protocol known as ASCII, we can forget about them in the ordinary business of using (or even programming) computers. On the other hand, because the letters of the alphabet are mapped to "eight-bit characters," there is only a limited set of them (256 to be precise). If you stick to English and math, these limits are theoretical. But as soon as you venture beyond the computer's Adamic language, i.e. English as spelled on an IBM keyboard of the fifties, you bump into the limits of the mapping scheme, and until very recently you have had to rely on various workarounds, none of them quite satisfactory. Unicode, a "sixteen-bit" mapping scheme, has 65,536 possibilities and permits the mapping of pretty much every symbol in every writing scheme to a sixteen-bit character that we can "set and forget" once the scheme has been implemented. But especially for interactive environments, Unicode is still a work in progress. We have used it for the display of the Greek text, but we have stored the information in the data tables in plain ASCII because in any environment where text must roundtrip from user input to the output of search results, too many things can and do go wrong if one moves outside of ASCII. But chiefly we cannot yet take for granted widespread user competence in generating Unicode input from a standard keyboard.
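The limits described here are easy to observe directly. The following sketch uses Python's built-in string encoding and is an illustration only. The choice of the word "psuchê" as the test case and the use of the UTF-16 encoding to stand in for a "sixteen-bit" scheme are assumptions of this example.

```python
# Number of distinct values available in eight and in sixteen bits.
eight_bit_values = 2 ** 8
sixteen_bit_values = 2 ** 16

plain = "psuche"            # lower ASCII only
accented = "psuch\u00ea"    # "psuchê", with e-circumflex (U+00EA)

# Plain ASCII text encodes without trouble, one byte per character.
ascii_bytes = plain.encode("ascii")

# The accented form has no ASCII representation at all.
try:
    accented.encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

# In a sixteen-bit encoding every character here takes two bytes.
utf16_bytes = accented.encode("utf-16-be")

print(eight_bit_values, sixteen_bit_values)
print(len(ascii_bytes), fits_in_ascii, len(utf16_bytes))
```

A single circumflex is enough to push a search term outside ASCII, which is why round trips between user input and stored data are fragile.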
We assume that for most readers with some Greek, entering words and phrases in unaccented beta code is by far the easiest way to execute a search. You only need to learn seven characters, most of which have some logic or mnemonic aspect to them:
As for the rest, you enter 'a' for 'alpha', 's' for 'sigma', and so forth.
As stated above, accented beta code is very difficult to enter accurately and hardly ever of much use in weeding out false positives.
In transliterated Greek, characters with diacritical marks are stored as entity references: the character you see as 'ê' is "really" '&ecirc;'. Thus a search for 'psuchê' will yield no results. Clever programmers have designed a workaround for that. You can enter the search term "psuche^," a string that consists entirely of lower ASCII characters. The computer will map this representation to "psuch&ecirc;" and retrieve the appropriate string.
You only need to remember two conditions or four separate sequences to input transliterated characters:
Do not try to trick the computer by entering the search term "psuch_" on the assumption that this would be the appropriate search term for the string "psuch" followed by any single character. You might think that "psuchê" is "psuch" followed by a single character, but since the computer stores "psuch&ecirc;" it will treat it as the string "psuch" followed by seven characters.
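The workaround and the wildcard pitfall can both be reproduced in a few lines. This is a sketch under stated assumptions: the stored form is taken to be the HTML entity reference `&ecirc;` (seven characters long, which matches the count given above), the mapping table and function name are hypothetical, and the entry for `o^` is an assumed parallel case that the text does not spell out. A regular expression stands in for the database's single-character wildcard.

```python
import re

# Typed ASCII sequence -> stored entity reference.
CARET_TO_ENTITY = {
    "e^": "&ecirc;",   # given in the text
    "o^": "&ocirc;",   # assumed by analogy, not confirmed by the text
}

def to_stored_form(term):
    """Rewrite caret sequences in a search term as entity references."""
    for typed, entity in CARET_TO_ENTITY.items():
        term = term.replace(typed, entity)
    return term

stored = to_stored_form("psuche^")
print(stored)

# "psuch" plus exactly one arbitrary character does not match the stored
# string, because the entity reference occupies seven characters.
one_char = re.fullmatch(r"psuch.", stored)
seven_chars = re.fullmatch(r"psuch.{7}", stored)
print(one_char is None, seven_chars is not None)
```

The same mismatch arises for any pattern that counts characters, since what is counted is the stored string and not the glyph on the screen.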
And so on. This is very tedious business, and we dwell on it here in some detail to make the point that modern computers, however powerful they are in many ways, are still poorly equipped to deal with more than one alphabet at a time. Seemingly simple problems bump up against constraints that sit very low on the evolutionary tree of modern computing, and the solutions typically involve awkward workarounds.
The English-Greek Index
The Chicago Homer identifies English words and phrases that are used to translate nouns and adjectives in Lattimore's translations of the Iliad and Odyssey. The idea behind this feature is simple. If readers without Greek could access nouns and adjectives in the original they would have a fairly reliable tool for tracing semantic fields with some precision. In practice the English-Greek index is a rough-and-ready tool with limitations, compromises, and a margin of error on the order of 10 percent for false positives and false negatives. Even so, the index is a useful feature for many purposes, especially when it is used with an awareness of its shortcomings, which ultimately result from the impossibility of coordinating the corresponding parts of original and translation with a high degree of precision.
We constructed the index by first creating a list of Greek nouns and adjectives that occur more than once, ignoring all hapax legomena, or nonce words. We then created a table that showed for each word occurrence the Greek and English texts of the relevant line. Going through that table word by word, we recorded the English words or phrases used to translate the Greek word.
The result of this effort is a table that associates each Greek word with one or more English terms. Lattimore translates "mênis" sometimes as "anger" and sometimes as "wrath." Thus the table includes two rows for the word "mênis":
But there are ten rows for the word "anger" because Lattimore uses that word to translate a variety of Greek words:
In a final step, this table was used to generate links in the display of the English text. When a reader clicks on the word "anger" in the English text, the click triggers a request to the database to look up all the rows that contain "anger" in the English column and retrieve a list of all the Greek words. It is not difficult even for Greekless readers to identify the Greek word that is used in the particular line that triggered the request, and readers can then follow the occurrences of the Greek word rather than the English word. English readers can also think of the word list generated by "anger" as a Greek thesaurus for the semantic field "anger."
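The lookup triggered by a click can be sketched as a query against a two-column table. The example below uses Python's standard `sqlite3` module with an in-memory database. The table name `gloss`, the column names, and the function names are assumptions, and only the two rows for "mênis" come from the text. The "cholos" row is an illustrative placeholder and does not reproduce an actual index entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gloss (greek TEXT, english TEXT)")
conn.executemany(
    "INSERT INTO gloss VALUES (?, ?)",
    [
        ("mênis", "anger"),    # from the text
        ("mênis", "wrath"),    # from the text
        ("cholos", "anger"),   # illustrative placeholder row
    ],
)

def greek_words_for(english):
    """All Greek words recorded as translated by the given English term."""
    rows = conn.execute(
        "SELECT DISTINCT greek FROM gloss WHERE english = ? ORDER BY greek",
        (english,),
    )
    return [row[0] for row in rows]

def english_terms_for(greek):
    """All English terms recorded for the given Greek word."""
    rows = conn.execute(
        "SELECT DISTINCT english FROM gloss WHERE greek = ? ORDER BY english",
        (greek,),
    )
    return [row[0] for row in rows]

print(greek_words_for("anger"))
print(english_terms_for("mênis"))
```

Because the table associates words rather than individual occurrences, the reader still has to pick out which of the returned Greek words stands in the line that was clicked.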
The words of a translation almost never map against the original in a very precise or consistent fashion. Often a word is translated by a phrase in the other language. In such instances you can record the phrase as the translation. But in many cases the semantic force of a word is diffused across the translation in such a way that one cannot identify a phrase that may be said to be largely a translation of a particular word. Very common words such as "will," "must," "might," or "like" create problems of their own. If in a given passage a Greek word meaning something like "necessity" is translated by the English auxiliary "must," the cost of creating that particular link is that every English occurrence of "must" will be underlined in the text. The cost of such false positives is clearly too high, and we have eliminated from the index all entries where a Greek noun or adjective is translated by an English word with an extremely high frequency. Similarly, we excluded the Greek word for "all" (pas).
English homonyms create problems. Take the example of "boulê," which Lattimore in Iliad 1.5 translates as "will." A table row that associates "boulê" with "will" introduces a link on every occurrence of the English word "will." In this particular instance one can sidestep the problem by recording "the will" as the translation of "boulê."
One could ask why we did not create a complete index that relates every word occurrence of a Greek noun or adjective to its particular translation in that location. The answer is that the added benefits of such an index would be out of all proportion to the cost of preparing it. The current index is something of a compromise. It is easily updated and if users encounter errors, we encourage them to report them to us.
A printed phone book lets me look up the unknown number of a known person. It is of no use in looking for the name of the person who lives three houses down the street from me. But if the same information is kept in a database, I can start from the person or the address. This is a critical difference, and its importance increases when more than two pieces of information are associated in a particular data row.
In the Chicago Homer the texts are enveloped by a set of indexes that are kept in a database environment and permit "omnidirectional" searches. Every word or repeated phrase in the text is associated with various forms of lexical, morphological, narratological, and locational information, and any combination of these attributes can be used to define a search object and provide the list of objects that meet the search parameters.
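An "omnidirectional" search can be pictured as a filter that accepts any combination of attributes. The sketch below is not the Chicago Homer's schema. The field names (`form`, `lemma`, `work`, `book`, `line`, `position`) and the `search` function are hypothetical, and the three records simply describe the first three words of Iliad 1.1 in transliteration.

```python
# Hypothetical records, one per word occurrence.
occurrences = [
    {"form": "mênin", "lemma": "mênis", "work": "Iliad",
     "book": 1, "line": 1, "position": 1},
    {"form": "aeide", "lemma": "aeidô", "work": "Iliad",
     "book": 1, "line": 1, "position": 2},
    {"form": "thea", "lemma": "thea", "work": "Iliad",
     "book": 1, "line": 1, "position": 3},
]

def search(records, **criteria):
    """Return the records matching every attribute given in criteria.
    Any attribute, or any combination, can serve as the starting point."""
    return [r for r in records
            if all(r.get(key) == value for key, value in criteria.items())]

# Start from the word and find its location ...
by_lemma = search(occurrences, lemma="mênis")
# ... or start from the location and find the words.
by_place = search(occurrences, work="Iliad", book=1, line=1)

print([(r["book"], r["line"]) for r in by_lemma])
print([r["form"] for r in by_place])
```

This is the phone book analogy in miniature: the same rows answer the question from either direction.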
The database of the Chicago Homer divides the texts of Early Greek epic into 230,825 word occurrences, and it may be said to "know" the following properties of each occurrence:
The known facts about each word occurrence can be combined, counted, and manipulated in various ways. The following facts about the first word of the Iliad are not especially interesting but illustrate the query potential of the data in this form. "Mênin" is
The Chicago Homer also "knows" that the first word of the Iliad is not part of a repetition but that the lemma "mênis" occurs in eleven different (and overlapping) repeated strings.
Word Division and Lemmatization
The founding step of the database in the Chicago Homer is the division of the text into words and the treatment of each word occurrence as a token with a set of properties. While word division is inherited from the world of printers and scribes, it becomes something different in a digital world. The space between written words is an event of variable significance. Readers tacitly negotiate the wobbly margins of orthographic conventions for splitting or joining lexical units, and they recognize phenomena that are loosely joined or weakly divided, such as "mother tongue," "fatherland," "Web site," "Web-site," or "Website."
In a digital environment, the blank space between character strings assumes the role of the absolute separator that permits the sorting and counting of phenomena. In any text-based electronic project, one must acknowledge the violence of this first cut and be very clear about the fact that word division does not "carve nature at its joints" or recover the constituent elements of sentences. It is not evident that sentences are made from words, and it does not take much reflection to see the pitfalls in the question "What is a word?"
It is, for instance, very easy to prove the false precision of a statement like "The corpus of Early Greek epic consists of 230,825 words." Greek is full of particles and other little words that bond with what they precede or follow. The orthographic conventions treat those bonds differently. Thus the combination of a preposition with a noun ("epi nêas") is represented as two words. But the combination of a preposition with a verb counts as one word ("epimeidaô"), although the placement of verb augments in past tense forms suggests that the component elements of verb and prefix were perceived as quite distinct. What is the difference between the word "hode" and the phrase "ho de"? And so it goes. On the other hand, conventional word division is good enough for many purposes, and its gray zones are well known. To transcribe the printer's space between words as the computer's blank space is a reductive act, but it has heuristic validity and introduces errors of a familiar kind. Their impact can be balanced by a constant and skeptical awareness that when you say, "The corpus of Early Greek epic consists of 230,825 words," you are really saying, "If you base the electronic transcription of the text on traditional orthographic practices, you will end up with a document consisting of 230,825 word tokens." Because so roundabout a way of talking will soon exhaust the reader's patience, we continue to say things like "The corpus of Early Greek epic consists of 230,825 words," but such statements should always be read as shorthand for their more qualified versions.
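The dependence of the word count on orthographic convention can be shown with the examples just given. In this minimal sketch the function name `count_tokens` is an assumption, and splitting on whitespace stands in for "the computer's blank space."

```python
def count_tokens(text):
    """Count word tokens by treating whitespace as the absolute separator."""
    return len(text.split())

# The same linguistic material yields different counts depending on
# whether convention writes it with or without a space.
examples = ["ho de", "hode", "epi nêas", "epimeidaô"]
counts = {text: count_tokens(text) for text in examples}
print(counts)
```

Any total produced this way counts word tokens as the transcription defines them and claims nothing further about the constituents of sentences.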
Lemmatization is a procedure with even larger gray zones than word division. The need for it arises from the observation that many word forms are "really" variants of the "same" basic word. When people look for a word they often want all the word forms that belong together. The solution is to bundle such word forms under a head word, lexical item, or "lemma." "Houses" and "house," "loved," "loves," and "love," or "take," "took," and "taken" are obviously "different forms" of the "same word." But "went" and "was" are not quite so obviously forms of "go" and "be," and if they are, why do grammar books not classify "any" as the form of "some" you use in questions or negative statements?
Questions of this kind arise as soon as one moves beyond the obvious cases. But while lemmatization at the margins is decidedly not a way of "carving nature at the joints," it is important to remember that it works well enough for most practical purposes. The Chicago Homer largely follows the lemmatization practices of the Liddell-Scott-Jones dictionary. There is, however, a fair amount of fudging in that dictionary. Take the word "helôria" from the fourth line of the Iliad. In LSJ, this is referred to the lemma "helôrion," and we learn that "helôrion, to = helôr." The lemma "helôr" is given full treatment, including a translation as "spoil, prey." The combination of editorial and typographic practice leaves it conveniently open whether "helôrion" and "helôr" are independent lemmata.
Our tendency with problem cases, which run in the low hundreds, has been to lump rather than split. As a result the lemma count in the Chicago Homer is 9,391, when a policy of consistent splitting would have produced roughly 9,600 lemmata. Thus the problem area includes about 2 percent of the lemmata and something like 0.025 percent of word occurrences. We have lumped on the assumption that for most purposes users will find the results of lumping more informative. If, for example, we had classified "helôria" as a form of the lemma "helôrion," a user clicking on its sole occurrence in Iliad 1.4 would be informed that it is a hapax legomenon. Because we have classified it as a form of "helôr," the user who clicks on the lemma sees immediately that the ten Homeric occurrences are divided as follows: "helôr" (8), "helôra" (1), and "helôria" (1). Given the distribution of these particular forms it seems much more plausible to argue for a single lemma than to assume that there is a third-declension lemma "helôr" and a second-declension lemma "helôrion." But regardless of the judgment on a particular case, the policy of lumping problematic cases has the advantage of producing errors of the "false positive" kind, which are easier to spot than the "false negatives" that would follow from a policy of splitting.
Subject to the qualifications discussed above, one can say, then, that there are 9,391 lemmata and 230,825 word occurrences in Early Greek epic, and the table below gives a little more detailed information about how many "different words" occur in, or are unique to, each text. It is possible to argue on the basis of those figures that the lexical density of Homer is quite similar to that of Shakespeare.
Homer scholars have a special term for a unique word, or hapax legomenon, and one sometimes hears references to dislegomena, or words that occur only twice. Beyond that, one does not get much detailed information from dictionaries or commentaries about how common or rare a word is, largely because accurate frequency information is very difficult to compile and maintain in a print environment. On the other hand, frequency is an important property of a word. Computers are very good at counting, and in the Chicago Homer we have tried to summarize quite detailed frequency-based information in such a way that it tells users at a glance how common a word is.
Raw counts give a poor idea of relative frequency because the poems differ in their dimensions. The Iliad is longer by a third than the Odyssey. It has somewhat more narrative (55 percent), whereas the Odyssey has somewhat less narrative (45 percent). The other poems are much shorter. The Theogony and Works and Days are each about the length of a long book of the Iliad. The Shield of Herakles and the four major Homeric Hymns are each about the length of an Odyssean book.
Counts are more informative if their values are converted into relative frequencies. This is usually done by dividing counts by the total number of words. The resultant fractions are accurate but unintuitive. If I am told that the relative frequency of the word "anêr" is 0.00528, I am not much wiser for the information. But if I am told that it occurs 375.9 times per 10,000 lines, I can calculate in my head that it occurs roughly every 25 lines, or once per page. For this reason, we have chosen to express counts as frequencies per 10,000 lines. This produces figures of the same order of magnitude as the raw counts for each epic. The decimal points add no useful precision, but they do mark the status of frequencies as derived figures.
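The conversion described here is simple arithmetic, and a minimal sketch may make it concrete. The function names are our own illustrations, not part of the Chicago Homer, and the count of 50 in 2,000 lines is a made-up figure; only the value 375.9 comes from the text above.

```python
def per_10k_lines(count, total_lines):
    """Express a raw count as occurrences per 10,000 lines (one decimal)."""
    return round(count * 10_000 / total_lines, 1)


def lines_per_occurrence(freq_per_10k):
    """Average number of lines between occurrences for a given frequency."""
    return 10_000 / freq_per_10k


# Hypothetical example: a word that occurs 50 times in a 2,000-line poem.
print(per_10k_lines(50, 2_000))

# The frequency quoted above for "anêr": roughly one occurrence
# every 25 to 27 lines.
print(round(lines_per_occurrence(375.9), 1))
```

Normalizing by lines rather than by words keeps the figures on a scale that a reader can relate to a page of text.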
The majority of words are used quite differently in narrative and speech. It is therefore useful to compute narrative and spoken frequencies separately, and we have done so. Basic frequency information about each word is provided in a table that appears in the right margin of the Chicago Homer whenever you click on the word. The information tells you at a glance whether a word is more common in narrative than in speech or more common in one epic than in the other.
The division of the text into lines of narrative and lines of speech is obviously useful for the Iliad and Odyssey as well as for the hymns to Demeter, Apollo, Hermes, and Aphrodite, which follow the narrative conventions of Homeric poetry. But in Hesiod's Theogony, spoken lines are very rare, and it is not obvious whether to classify Works and Days as a form of narrative or speech. Similarly, the later Homeric Hymns, which are largely short prayers or praise poems, are not usefully divided into narrative and spoken parts. For this reason, all lines of Hesiod and the later hymns have been classified as narrative. But narrative in those poems is not quite the same as Homeric narrative, and one needs to be cautious in comparing numbers.
The database contains information about the counts of lemmata, word forms, repeated phrases, and their variants. In many searches, these figures appear in parentheses after a word or phrase. Regardless of the search parameter, the numbers in parentheses always refer to the count of the word or phrase in the entire corpus and are retrieved from a fixed table in the database. It is theoretically possible to generate "on the fly" counts that respect the parameters of a particular search. We have chosen not to do so because it would complicate the underlying queries and significantly slow down performance.
Every lemma in the Chicago Homer is assigned to a word type as follows (numbers in parentheses refer to the count of types in each category):
a. General noun (2,270)
b. The subtype "adj_noun," which identifies a few words that systematically straddle the categories of adjective and noun (11)
a. Personal name (1,324)
b. Place name (287)
c. The subtype "adj_name," which identifies a few words that systematically straddle the categories of adjective and name (38)
a. General adjective (2,304)
b. The subtype "pron_adj," which includes a short list of words like "autos," "allos," and "heteros" (11)
a. Personal pronoun (5)
b. Possessive pronoun (14)
c. Demonstrative pronoun (9)
d. The deictic "ho," which is hardly ever used as an article in Homeric Greek but is classified as such here (1)
e. Relative pronoun (8)
f. Interrogative pronoun (3)
g. Indefinite pronoun (1)
These conventional word classes work well enough for most purposes but have some problems at the margins. The biggest limitation is the following: the word type classifies a lemma in general but does not provide for the use of a word as something else. This problem comes up especially with adverbs. The category "adverb" is reserved for words that are used only as adverbs. It does not apply to the adverbial forms of adjectives, which are marked as morphologically distinct but counted as instances of the word type "adjective." If you are interested in adverbial usage, you can work around this by looking for adverbs and adjectival forms with the adverbial markers. On the other hand, Homer frequently uses the neuter accusative forms of adjectives in adverbial fashion. This usage is not caught at all in the database. Similarly, the database does not distinguish between adverbial and prepositional uses of words like "epi" or "kata."
The Morphological Table
Beyond associating a location with a spelling and a lemma, the Chicago Homer also associates it with a "word state," or description through the appropriate categories of tense, mood, voice, case, gender, person, and number. The resultant combination of spelling, lemma, and word state is a "morph_id," or distinct word form. There are 35,497 distinct word forms in the corpus.
The identification of word forms is based on Morpheus, the parser Gregory Crane developed for the Perseus Project. Morpheus is a bundle of rules that take a character string and assign to it the possible grammatical descriptions the string could have on the assumption that it is the spelling of a Greek word. Users clicking on a word in a Perseus text get one or more descriptions associated with the word and choose the one that fits the context.
Through the courtesy of the Perseus Project, we had access to the master file of all the descriptions of words in the Perseus corpus, and we used it to disambiguate all instances in which there is more than one description of a word form. As a result, the Chicago Homer provides for every word occurrence the proper grammatical description of the spelling in that location. While this is occasionally helpful in making sense of a difficult passage, the real utility of such disambiguation lies elsewhere. Through the association of word states with locations, inflectional categories become identifiable objects about whose frequency and distribution, separately or in combination, one can raise particular questions. At a concrete level, when confronted with a word of an unusual grammatical type, I can ask for other words that match its description (e.g. aorist optative passive forms, of which there are very few). More generally, I can get quite precise answers about the distribution of very common phenomena, such as the accusative or the active voice (both of them are more common in the Odyssey). Except for the Bible, there probably is no other ancient text with a comparably dense tagging of grammatical features.
The query potential of the morphological data depends on the accuracy of the tagging. As with other linguistic phenomena, morphological tagging is unproblematic in a sufficiently large number of cases to make the procedure worthwhile. But an awareness of its muddy margins is necessary to assess the level of confidence with which statements about aggregate data can be treated. First of all, the data are derived from the texts of standard editions. They exclude variant readings, and they include an indeterminate but small universe of conjectural readings. Second, it is not always possible to assign a univocal inflectional description to a word form. Scholars disagree about whether "ephato" is an imperfect or an aorist, and there are other examples. Third, there are bound to be some residual errors in the database, about which we say more below.
The total margin of uncertainty from all sources, however, is probably no more than 1 percent in terms of word forms or word occurrences, and one needs to evaluate the vigor of dissent about variants or conflicting interpretations against the background of substantial agreement about the parsing of word forms in particular locations. The Chicago Homer for the first time brings the inflectional phenomena of Early Greek epic into the search environment of an electronic database. The inquiries into usage made possible by this step are extremely unlikely to be affected by the margin of disagreement, uncertainty, or error in the data.
For the purposes of the database, a "morph_id" is a unique combination of a spelling, a lemma, and a word state. Just as we did not incorporate textual variants into the database, so we did not include possible variant interpretations of word states. The design of the data architecture, however, allows for such extensions at a later stage. There are also ambiguities relating to spelling. We decided to ignore positional accent variation. Thus oxytones and enclitics always are assigned the same morph_id regardless of accentuation in a particular location.
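The idea of a morph_id as a unique combination of spelling, lemma, and word state can be sketched as follows. This is an illustrative model only: the class names, the sequential numbering, and the format of the word-state strings are assumptions, not the actual structure of the Chicago Homer tables.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordForm:
    """One distinct word form: the triple that a morph_id stands for."""
    spelling: str
    lemma: str
    word_state: str  # hypothetical format, e.g. "neut nom pl"


class MorphRegistry:
    """Hands out one integer id per distinct (spelling, lemma, state)."""

    def __init__(self):
        self._ids = {}

    def morph_id(self, spelling, lemma, word_state):
        form = WordForm(spelling, lemma, word_state)
        # Reuse the existing id, or assign the next free number.
        return self._ids.setdefault(form, len(self._ids) + 1)

    def __len__(self):
        return len(self._ids)


registry = MorphRegistry()
nominative = registry.morph_id("ta", "ho", "neut nom pl")
accusative = registry.morph_id("ta", "ho", "neut acc pl")
again = registry.morph_id("ta", "ho", "neut nom pl")

# Same spelling and lemma, different word state: two distinct forms.
print(nominative, accusative, again, len(registry))
```

Under this model, variant interpretations of a word state could later be added as further triples without changing the ids already assigned.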
Capitalization in our database is a name marker. Thus the upper- and lowercase versions of a spelling mark the difference between a common noun and a proper name. They have different morph_id's. Where a common noun appears as a personal name, it has its own lemma. Where a common noun appears as a weak personification (Hatred, Strife, etc., in Hesiod), it is capitalized, shares the same lemma with its common noun, but has a separate morph_id. It is useful to point out that all capitalization in ancient texts rests on later editorial decisions and that deep skepticism about it is in order, especially when it involves the translation of abstract nouns into mythological entities.
There are probably a few errors left in the morphological table. Since they can be immediately and easily corrected in the database, we will be very grateful for reports of them and have created an "error button" that triggers an e-mail message to the editors. Residual errors are much more likely to be dumb than interesting. In disambiguating word forms with multiple word states, we had to make many decisions about whether "ta" is an accusative or nominative, whether "panta" is a masculine accusative or neuter plural, or whether "keleue" is an imperfect or imperative. This is an error-prone business when you do it thousands of times, and despite several rounds of proofreading some mistakes remain.
Some idiomatic uses defy the morphological parser. Homer frequently uses the accusative neuter of an adjective in an adverbial manner. In all these instances, the parser identifies the form, but does not mark its special use. We have dealt in a slightly unconventional way with a related problem. Forms like "autothi," "allothen," and "oikade" are adverbial forms that are clearly related to the lemmata "autos," "allos," and "oikos" but are not contained in the case system of classical Greek. We have treated all such forms as instances of a putative locative and marked them as "adverbial" in the category of case, but we have not distinguished between the "from," "towards," and "in" forms represented by the different suffixes "then," "de," and "thi." Our goal was not to achieve the last degree of philological precision but simply to make sure that word forms like "Abudothen" and "Abudothi" would not appear as stand-alone words but be classified as forms of the place name "Abudos."
Morpheus includes information about inflectional types as well as word states. Thus it states that an aorist is a second aorist or that a noun belongs to the "-o" declension, and so forth. This information is captured in our database but requires further refinement and is not yet available to users.
In addition to lexical and morphological features, the Chicago Homer also tags some narratological features. Because in Early Greek epic a speech always begins at the opening of a line, it is very easy to tag the difference between narrative and speech. We have also identified the speaker of each speech, and each speaker is characterized by the attributes of mortality and gender. Thus it is possible to ask whether male and female speakers differ in their use of adjectives.
The speaker tagging extends to disguises and reported speeches. The database captures situations of the kind when Odysseus speaks in disguise as a beggar and reports the speech of a Cretan. In specifying speakers in queries, users may either choose "Odysseus," which includes every line in which Odysseus is the primary speaker, or the narrower category of Odysseus as a beggar, and so forth.
There is the question whether Odysseus's account of his adventures in Odyssey 9-12 is a very long speech or a different kind of narrative. One can argue on either side of this question, but we have chosen to classify it as narrative. To be precise, we have classified the opening lines of Odyssey 9 as speech and switched to narrative at Odyssey 9.39, when the speaker clearly goes into narrative mode. The decision to do it this way is quite pragmatic but also follows the evidence: with very few exceptions, the language of Odysseus's narrative is more like narrative than speech. There are other cases where speech could be classified as narrative, notably Menelaos's story (Od. 4.333-592). But you get into gray areas very quickly, and the line between Odysseus's narrative and everything else is most clearly drawn.
Aristotle observed that Homer differed from other epic poets by talking less in his own person and letting the characters speak in their own voice. The Homeric corpus is evenly divided between narrative (13,865) and spoken (13,858) lines. Within each epic, the balance is a little different: 55 percent of the lines in the Iliad are narrative as against 45 percent of the lines in the Odyssey (counting Odysseus's narrative as narrative).
The most obvious benefit of the narratological tagging appears in the frequency information about common words. Nearly all of them are used quite differently in narrative and speech, and in most cases the ratios of narrative to speech are very similar across the Iliad and Odyssey. But users need to be aware of some problems with the binary division of the text into narrative and speech. It works very well for Homer and the four major Homeric Hymns, which follow the Homeric pattern of a relatively even balance of narrative and speech (1103:813). But the division breaks down for Hesiod and the later hymns. In the Theogony there are a few spoken lines, and in Works and Days there are similarly some quotations. But these works are not really divided into narrative and speech. It would be more appropriate to say that the Theogony is a peculiar kind of narrative and Works and Days is a peculiar kind of speech. Similarly, the later and shorter Homeric Hymns are prayers or praise poems, where the distinction is also problematical. We have mapped all of Hesiod and the later hymns to "narrative," and one needs to be aware that the category of narrative has a slightly different meaning for those poems.
The Shield of Herakles gets somewhat rough treatment from this classification. It is of course problematical to group it with Hesiod in the first place, because there is a consensus that it is not by Hesiod, even though it was transmitted under his name. Secondly, the 480 lines of this poem include 73 lines of speech, which are lumped under the general category of Hesiodic narrative.
In the introductory section we discussed some general points about the treatment of phrasal repetition as the most distinctive feature of the Chicago Homer. Here we deal with the technical details, and in good Homeric narrative fashion we describe the repetitions module by giving an account of how it was made.
We began by making a lemmatized version of the text in which every word form was replaced by its corresponding lemma. In this version the opening line of the Iliad reads:
mênis aeidô thea Pêlêïadês Achileus
The purpose of lemmatization was to allow for "fuzzy pattern matching" by stripping inflected words of their grammatical fuzz. We also mapped personal and possessive pronouns respectively to "personal pronoun" and "possessive pronoun."
This reduced version of the text served as the basis for identifying repeated sequences of two or more lemmata. The "independently recurring string" turned out to be the critical concept in this exercise. It involved the removal of all repeated strings that only occur as parts of longer strings. Consider a text in which the phrase "the Miami airport" is repeated. If the text also includes the sentence "Let's go to the Miami" (name of a bar) as well as the sentence "Miami airport T-shirts are outrageously expensive," then a list of independently recurring strings would include "the Miami airport," "the Miami," and "Miami airport."
But if the strings "the Miami" and "Miami airport" occurred only in the longer string "the Miami airport," they would be dependent substrings and be removed from the list of repeated phrases.
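The principle of the independently recurring string can be sketched in a few lines. What follows is our own illustrative reconstruction, not the Perl script used for the Chicago Homer; the function names are hypothetical, and the rule that a repeated string is kept if at least one of its occurrences lies outside every longer repeated string is our reading of the description above.

```python
from collections import defaultdict


def repeated_ngrams(tokens, n):
    """Map each n-gram that occurs at least twice to its start positions."""
    positions = defaultdict(list)
    for i in range(len(tokens) - n + 1):
        positions[tuple(tokens[i:i + n])].append(i)
    return {g: starts for g, starts in positions.items() if len(starts) > 1}


def independent_repeats(tokens, min_len=2):
    """Return repeated strings that do not occur only inside longer ones.

    The result maps each kept string (as a tuple) to its total count.
    """
    result = {}
    n = min_len
    current = repeated_ngrams(tokens, n)
    while current:
        longer = repeated_ngrams(tokens, n + 1)
        longer_starts = {s for starts in longer.values() for s in starts}
        for gram, starts in current.items():
            # An occurrence at i sits inside a longer repeated string
            # if a repeated (n+1)-gram starts at i or at i - 1.
            free = [i for i in starts
                    if i not in longer_starts and i - 1 not in longer_starts]
            if free:
                result[gram] = len(starts)
        current, n = longer, n + 1
    return result


text = ("we reached the Miami airport yesterday and the Miami airport "
        "was crowded so let's go to the Miami where Miami airport "
        "T-shirts are expensive")
for phrase, count in independent_repeats(text.split()).items():
    print(" ".join(phrase), count)
```

If the clauses about the bar and the T-shirts are dropped from the sample text, only "the Miami airport" survives, because the two shorter strings then occur only inside it.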
A Perl script defining repeated lemma strings using the principle of substrings generated a list of some 36,000 types that occur in some 192,000 locations, ranging in length from 2 to 123 words and in frequency from 2 to 3,152. A second step associated each repeated lemma string with the literal strings that actually occur in the text. There are about 88,000 "rep_variants" that correspond to the 36,000 repeated lemma strings. The most frequent lemma string is the combination of deictic "ho" followed by a particle (3,152). The most frequent literal word string is "te kai" (642).
It is apparent from this sketch that phrasal repetition is a very pervasive phenomenon. There are about as many repeated phrases as there are word forms, and there are almost as many "repetition occurrences" as there are word occurrences. And since there are only 230,000 word occurrences but some 192,000 occurrences of repeated phrases that are at least two words long and repeated at least once, it is also clear that repetitions involve an enormous amount of overlap. Thus in Iliad 1.12, the repeated phrase "êlthe thoas epi nêas Achaiôn" (he went towards the fast ships of the Achaeans) includes the following independently occurring substrings:
The Perl script makes no assumptions about which of these word strings are idioms, phrases, or formulae, and this particular example shows quite clearly why it would be very difficult to distinguish between strings that are or are not phrases. The Perl script also makes no assumptions about line or sentence endings as natural phrase stoppers. If the last word of a line/sentence and the first word of the next line/sentence are repeated anywhere as a sequence, this counts as a repetition regardless of whether it "makes sense."
There is obviously a lot of noise in the aggregate of phrasal repetition produced by our Perl script. In order to reduce the noise level, we developed a filter to screen out two- or three-word strings that consist entirely or predominantly of very common function words.
These words include
This function-word filter screens out about half of all repeated phrases. One might be tempted to call the filtered materials "junk repetitions," but while many of them no doubt are, some of them are distinctive stylistic markers. Thus the most common of all phrases, the combination of deictic "ho" with the particle "de," is considerably more common in the Odyssey than in the Iliad, and phrases like "ou gar" or "autar epei" are noticeable Odyssey markers. Still, for most purposes these repetitions are profitably screened out as a kind of linguistic background noise, and the student interested in them is likely to focus on them and select subsets of them through particular searches.
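A filter of this kind might look like the following sketch. The word list is a small illustrative sample drawn from the phrases mentioned in this section, not the actual list used in the Chicago Homer, and reading "predominantly" as "more than half" is our assumption.

```python
# Illustrative sample only; the real list of function words is longer.
FUNCTION_WORDS = {"ho", "de", "te", "kai", "ou", "gar", "autar", "epei"}


def is_function_word_phrase(lemmata, function_words=FUNCTION_WORDS,
                            max_len=3):
    """True if a two- or three-word string is mostly function words."""
    if not 2 <= len(lemmata) <= max_len:
        return False
    hits = sum(1 for lemma in lemmata if lemma in function_words)
    # "Predominantly" is taken here to mean more than half of the words.
    return hits * 2 > len(lemmata)


phrases = [("te", "kai"), ("autar", "epei"), ("thoos", "epi", "naus")]
kept = [p for p in phrases if not is_function_word_phrase(p)]
print(kept)
```

Because the filter only hides phrases from the default display, the screened-out strings remain available to searches that target them directly.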
Because we recorded length and frequency as properties of repetitions, it is possible to filter the display of repetitions in the text by those categories. Other uses of the repetition table depend on searches that generate lists. For many purposes, location is an important property of a repeated phrase. Thus one may look for shared repeated phrases in the accounts of the deaths of Patroklos and Hektor or phrases that occur in the last book of the Iliad and the Odyssey.
Not all repeated phrases are captured in our repetition tables. Consider the following lines from the Iliad:
IL.1.3 pollas d' iphthimous psuchas Aïdi proïapsen
IL.11.55 pollas iphthimous kephalas Aïdi proïapsein.
A human reader quickly sees the pattern behind these lines: pollas . . . iphthimous . . . Aïdi proïaps(en/ein). But the computer neither sees that "pollas iphthimous" and "pollas d' iphthimous" are really the same pattern nor does it recognize that "kephalas" and "psuchas" are variants in a pattern that extends across the entire line. Similarly, the computer is thrown off by some minor variations in Odysseus's repetition of Agamemnon's offer in Iliad 9 and does not capture the full extent of verbatim repetition. False negatives of this kind may run in the low hundreds, but in nearly all cases the phrasal repetition that is captured gives a human reader enough evidence to supply the rest.
It is also worth pointing out that the repetition tables do not capture many forms of more general syntactic or phonetic resemblance between lines and phrases. But here too the mechanically generated list of literal strings provides a powerful guide to spotting patterns that are not directly caught. Thus it seems fair to say that our deliberately mechanical procedure for generating repeated strings directly captures or points to virtually all repeated phenomena that are of likely interest to a student of Homeric composition.
The User Interface
A General Caution
A user interface mediates between the information and the user. In the technology we know as the book, the user interface is so naturalized that we take it for granted. If we reflect on it, many technical details may be obscure but the basic design is fully transparent. The book shares this virtue of transparency with the bicycle.
The user interface of a modern and "user-friendly" computer application is a much more opaque phenomenon. Not only is it a veil that may hide more than it reveals, but it usually makes it difficult to ascertain whether limitations of an application are functions of the interface, the information, or both. This opacity of user-friendly interfaces is a particularly important thing to keep in mind when considering a Web-accessible database like the Chicago Homer. A database organizes information in ways that will facilitate some queries but complicate or rule out others. The query potential of the data is limited by the architecture of the data and the syntax of "structured query language," or SQL, the lingua franca of databases. But when a database is accessed over the Web, its query potential is much more severely constrained by Web-based data access routines, which have largely been developed to make it very easy for customers to get quick answers to the questions a business thinks its customers want to ask.
Within the constraints of current technology, a Web-accessible database is a system in which the designers must limit the query potential of the data to a small set of canned questions. This may be an advantage in designing Web sites for online shopping; it is a huge drawback in presenting scholarly information. We have done our best to design the interface for the Chicago Homer in a manner that will maximize the user's ability to define queries in an easy-to-use environment. But we remain keenly aware of the very large gap between the query potential of the data and the queries that can be run through the current generation of Web-based query tools.
The Display of Repetitions in the Text
Some aspects of the user interface are much less subject to this general caution about current limits. The most striking feature of the Chicago Homer is a form of data visualization that produces results impossible to achieve in a print environment and is also an advance over the characteristic list display of databases. We refer to the ability to make repeated phenomena visible in the text as links from which the reader can explore the neural networks of poetic memory. It is worth describing in some detail what happens in this situation:
Or more briefly, a database output, originally derived from the text, is visibly re-embedded in the text by projecting it on a Web page. In this instance Web technology actually enhances the query potential of the database: seeing the repeated strings in context is more informative than seeing them as a list. And the cooperation of Web and database technology here serves to highlight a very specific feature of Homeric poetry. If it is the case that reading a passage of Homer with full understanding is to be aware of its repetitive echoes, then this particular combination of electronic "search and display" visualizes or simulates readerly competence. At the very least, it tells readers where to locate the echoes they have not yet heard.
How Search Forms Can Mediate SQL Queries
The rest of the interface is more conventional and consists of input and output forms. Input forms let users fill in blanks or specify options, which are then translated into SQL and passed on to the database. Output forms display the returns of the search to the user.
Every search conducted on the Chicago Homer is an SQL query addressed to a set of underlying tables. Users need not know SQL or even be aware of the fact that they are "speaking" SQL without knowing. On the other hand, if you use the Chicago Homer in any systematic fashion, it is useful to know the way in which the interface both helps and limits the formulation of searches.
An SQL query is a command of the type "Show me all the words where the speaker is 'Penelope' and the word type is 'noun'." In this example the phrases "speaker" and "word type" refer ultimately to fields in the database, and the words in single quotation marks refer to particular values that can appear in those fields. On the search forms of the Chicago Homer, the syntax of the SQL query is translated into a simple layout, which permits users to specify various combinations of data fields and data values. The advantage of this method is obvious: for any query that can be fitted into the syntax of the search form, the formulation of the question is extremely simple and requires no technical knowledge at all on the part of the user. The disadvantage of this method is equally obvious: if a question does not fit the syntax of the search form, the user cannot ask it at all.
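The way a search form mediates an SQL query can be sketched with Python's built-in sqlite3 module. Everything specific here is hypothetical: the table name "words," its columns, the list of permitted fields, and the placeholder rows are invented for illustration and do not reflect the actual schema or data of the Chicago Homer.

```python
import sqlite3

# Fields the imaginary search form exposes; anything else is rejected.
ALLOWED_FIELDS = {"speaker", "word_type", "lemma"}


def build_query(criteria):
    """Turn form selections (field -> value) into SQL plus parameters."""
    unknown = set(criteria) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unsupported search fields: {sorted(unknown)}")
    fields = sorted(criteria)
    sql = "SELECT location, spelling FROM words"
    if fields:
        sql += " WHERE " + " AND ".join(f"{f} = ?" for f in fields)
    sql += " ORDER BY location"
    return sql, [criteria[f] for f in fields]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE words "
    "(location TEXT, spelling TEXT, lemma TEXT, word_type TEXT, speaker TEXT)"
)
# Placeholder rows, not taken from the corpus.
conn.executemany("INSERT INTO words VALUES (?, ?, ?, ?, ?)", [
    ("loc-1", "form-a", "lemma-a", "noun", "Penelope"),
    ("loc-2", "form-b", "lemma-b", "verb", "Penelope"),
    ("loc-3", "form-c", "lemma-a", "noun", "Odysseus"),
])

sql, params = build_query({"speaker": "Penelope", "word_type": "noun"})
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

The sketch also shows the limitation discussed above: a question that involves a field outside the permitted set cannot be asked through the form, even if the underlying table could answer it.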
The success of this project, then, depends on the skill with which the designers of the search forms have anticipated the types of questions users are likely to ask. It is important to be explicit on this point because in practice users will want answers to questions that were not anticipated by the designer of the search form. We invite users to tell us about "unanswerable questions" because search forms can be changed. Sometimes it is quite easy to accommodate a new type of question that looks very difficult. On the other hand, some seemingly simple question may be impossible to implement either because it turns out that it cannot be stated in the syntax of SQL or because it is not supported by the structure of the tables, or because it would take too long. But the basic fact remains this: if the user interface is designed to enable SQL queries by users who do not know SQL, it is important to make sure that the simplified questions on the search form are in fact the questions that users want to ask. And only users know the answer to that.
Input and Output Forms
In manipulating the various parts of the user interface, it helps to know a little about the layering technology (I-frames) that it employs. The interface consists of an upper frame with a set of control buttons and a lower frame with six different forms. The buttons on the upper frame control the appearance and disappearance of the six forms in the lower frame. All forms are actually present on the user's machine, stacked like sheets of paper, but the user only sees the form currently on top of the stack. The site is quite memory intensive because it requires a lot of data to be present at the same time.
Two of the six forms control user input. The other four forms provide different kinds of output. On the input form for Browse mode, users give instructions about how text is to be displayed. On the input form for Search mode, users formulate the parameters for a query. The four output forms show
Using the site means shuttling between these six forms. Whenever you issue a command, the current top form will go to the bottom of the pile. It does not, however, go away, and it keeps the last information loaded on it, whether that was a set of instructions or the returns from a search.