The Forensic Linguistics Society

Forensic Linguistics Works

Purple padders, lexical twisters, heavy lifters, subtle literals and poetic elaborators: finding the source text in naïf plagiarism inquiries by John Olsson

Abstract

It is sometimes not the occurrence of plagiarism which is in doubt, but its direction, i.e. which one of two (or more) texts is the source text and which is the copy text.

Most plagiarists are naïf plagiarists: they do not have the skill to cite properly, or they deliberately purloin the text of others, changing it in insignificant ways which, however, cannot escape detection.

In this paper some famous literary, political and corporate infringements are examined in order to look for common techniques. The reader is then shown how these techniques can be observed and even, on some occasions, measured.

Introduction

Some copyright infringement and plagiarism cases do not hinge on whether plagiarism has taken place, but on which of two texts temporally preceded the other. In this paper I would like to explore a number of ways of dating texts so that it can be established, in any particular case, which of two claimed originals preceded which. I will first look informally at surface readability features, and – later – relate these examples to well-known readability measures. Other intersections between plagiarism and readability by previous researchers are referred to later in this paper, but I would draw the reader's attention here to the work of Knight, Almeroth and Bimber (2004), where university examiners measure the readability of each individual sentence within a suspect text and then look for instances of sentences which fall outside pre-established tolerances (see also Olsson 2004: 114-115). Such techniques can be used for determining the existence of plagiarism, as well as the direction of plagiarism, though in the case of Knight et al (2004) this latter point was not specifically addressed.

Direction of plagiarism – previous research

Thus far I have been unable to find any references to any linguistic method for detecting the direction of plagiarism, and in fact not many regarding the direction of plagiarism itself (other than one reference to biblical scholarship – mentioned later in this paper), although comments about plagiarism by Anson and Farris were helpful with respect to voice and authorship (Anson and Ferris, 1998: 57-58), while a more recent reference appears to be a Bayesian statistical approach by Deardorff regarding the possibility of the St. Matthew gospel having been 'plagiarised' from the Talmud of Jmmanuel (see Deardorff 2003, OL). However, as far as I know, this method has not been generalised to other plagiarism claims.

'Original' and 'copy' texts

I will be using the term 'original' or 'source' here in a relative sense, to mean the earlier of two texts in a given inquiry. This implies that, for the purposes of this discussion at least, the issue of intertextuality will not be discussed. This is not to say that intertextuality is not an important issue in plagiarism (see Anson and Ferris, ibid).

Several well known cases, or – since not every case ends up in litigation – allegations of plagiarism will be considered. The Appendix contains five pairs of excerpts. These are described as follows:

1. The claim that famous blind author Helen Keller's Frost King was plagiarised from Margaret Canby's Frost Fairies.
2. The claim that extracts from Richard Condon's The Manchurian Candidate were plagiarised from Robert Graves' I, Claudius.
3. The claim that parts of Martin Luther King Junior's famous I have a dream speech were taken from a speech given by Archibald Carey at the Republican National Convention of 1952.
4. The claim that Stephen Ambrose's Citizen Soldiers had been lifted from Joseph Balkoski's Beyond the Beachhead.
5. The claim that Raytheon CEO Bill Swanson plagiarised his 25 Unwritten Rules of Management from the work of a much earlier author, W.J. King's Unwritten Laws of Engineering.

The Appendix contains excerpts from all ten works (in 5 pairs). The advantage of using these excerpts and the works from which they are drawn is twofold: firstly, in each case it is known which work was published first. Secondly, in each instance examples of the plagiarism are clear – they contain substantial lexical similarities or identical strings of 6 or more words. The issue here is not whether plagiarism occurred but by what method we can ascertain – if faced with an investigation where the provenance of disputed texts is not known – which of two texts was produced earlier, since the earlier text is by definition the 'original' or source text and the later text is, inevitably, the 'copy' text. Text dating in this sense, therefore, means ascertaining the earlier of two texts, whether 'earlier' in this context is a day or 50 years.
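The criterion of 'identical strings of 6 or more words' can be checked mechanically. What follows is only a minimal sketch and not a procedure used in this paper: the function names, the tokenisation rule and the default threshold are my own assumptions, and the two sample strings are taken from the Balkoski and Ambrose excerpts.

```python
import re

def words(text):
    """Lower-case the text and split it into word tokens (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def shared_runs(text_a, text_b, min_len=6):
    """Return the maximal runs of at least min_len consecutive words found in both texts."""
    a, b = words(text_a), words(text_b)
    runs = set()
    # prev[j] holds the length of the common run ending at a[i-2], b[j-1]
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                # The run is maximal when the next words differ or a text ends
                at_end = i == len(a) or j == len(b) or a[i] != b[j]
                if at_end and curr[j] >= min_len:
                    runs.add(" ".join(a[i - curr[j]:i]))
        prev = curr
    # Longest runs first; ties broken alphabetically for a stable order
    return sorted(runs, key=lambda r: (-len(r.split()), r))

balkoski = ("The body remained on display throughout July 19. The 29ers and some of "
            "the few civilians remaining in the city adorned the site with flowers.")
ambrose = ("Howie's body remained on display throughout the next day, July 19. The GIs and "
           "some of the few civilians remaining in the town adorned the site with flowers.")

print(shared_runs(balkoski, ambrose))
```

With the threshold at six words, the only run reported for these two short samples is the nine-word string 'and some of the few civilians remaining in the'; lowering min_len to 5 would also pick up the shorter shared phrases on either side of it.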

The purpose of this paper, therefore, is to use the well-known instances of textual similarity cited here as a kind of 'howdunnit' of techniques for future plagiarism cases of the naïf type (see the section Naïf vs Skilled Plagiarism below). The 'howdunnit' aspect of things turns into a tool for the linguist: having found the technique employed by a plagiarist, it is then relatively straightforward to attribute the original or source text to an earlier date, i.e. to 'date' the competing texts.

On a personal note I would say that, considering how much effort some students and others, including very well known writers, expend on plagiarism, my investigation into this area has been something of a disappointment. I have not found any noble intellects, any great thinkers among my plagiarists. On the contrary, their techniques are banal, their repertoire of tricks dismal, and their efforts at escaping detection pitiable. On the bright side, however, this does give the linguist hope in that it shows that most types of naïf plagiarism are laughably easy to detect, and that attributing the source text is often even easier.

Types of text dating

Text dating can be considered to be of three types. The first is diachronic dating, which is where a significant time period has elapsed between the earlier and later text. A suggested minimum time lapse for the observation of significant cultural changes to feed through to the language in the form of discourse styles, lexicon, phrasing and politeness forms would probably be about 20 years for most genres. However, one cannot be inflexible: a single event can change perceptions across the world, and so a lesser time period could be considered. On the other hand some genres exhibit change much more conservatively and so in those cases 30 or 50 years may be more appropriate.

The second type of text dating to consider is cultural dating (i.e. the process of text dating by reference to culture), where one author may originate from a different culture than the other author, for example a UK author plagiarising from a US author or vice versa. In a case like this we might expect to find instances of different lexical tokens for the same entity, different modes of address, a greater commonality of a certain phrase in the one culture rather than the other, etc. The issue here would be an inappropriate or infrequent cultural token appearing in one or other of the texts relative to its appropriacy or frequency in the other. Thus, if the UK author were plagiarising from the US author, s/he may have used a term not commonly used by British writers which also appears in the US author's work but, crucially, is entirely 'at home' in that setting. It is not of course the 'culture' which is being dated, but the cultural token which is being assessed as the earlier or later occurrence based on its goodness of 'fit' within the language of the host culture. Culture, in this sense, need not be 'national' culture – it could be based on age, gender, political affiliation, ethnic or religious group, class, etc.

Lastly, I would like to mention what I term archaeological dating, which is the process of excavating two synchronous or almost synchronous texts to find which of the two presents symptoms of originality and which presents symptoms of copying – relative to each other.

As one would expect, there is some overlap between these three types of text dating. It is not uncommon, for instance, to find some aspects of both archaeological and cultural interest between two texts in a given investigation, or, on the other hand, between cultural and diachronic issues between them.

I will refer further to the different types of text dating in the Conclusion of this paper: suffice it to say for the moment that the first four examples given in this paper fall into the category of archaeological text dating.

A note on plagiarism

Plagiarism is the intended theft of a writer's work by another, with the motive by the copy author of passing the copy work off as an original. The motive is important because not every instance of copying can be thought of as plagiarism. Biblical scholars, for example, have grappled for centuries with what is known as the Synoptic Problem, that is to say the dating of the synoptic Gospels of Matthew, Mark and Luke. A number of methodologies have been proposed for just this authorship problem, from literary interrelationships (Stein, 1987), to questions of order of pericopae ('incident' or 'story unit') and, more recently, factor analysis of lexical words (Miyake, Akoma et al, 2005). However, whichever of the synoptic writers was prior to the others, there is no suggestion that any of these writers were plagiarists in the sense we use the term today since the motive of personal gain appears to have been entirely absent.

Naïf vs. skilled plagiarism

There are two types of naïf plagiarism. The first is typically that of the inexperienced or badly taught student who perhaps does not understand how to cite the work of others appropriately and so blurs the distinction between adequate and inadequate citation. In some cultures it is considered important to defer to the opinions of distinguished scholars and so exact quotations are often relied on heavily, frequently without appropriate quotation resources or citation. Although these practices may not be considered plagiarism in the writer's home culture, such students find adaptation to cultures where plagiarism is more conservatively interpreted to be somewhat difficult.

The second type of naïf plagiarism is where the writer intends to plagiarise but is not very skilled at doing so, at worst leaving whole phrases and even sentences intact from the source text (literal plagiarism), at 'best' interlacing source text with his/her own, altering the sequence of clauses, inserting the odd phrase – all in an effort to disguise the source text. This is known as mosaic plagiarism.

Skilled plagiarism is where the writer succeeds in altering the original to such an extent that little of the original lexis, idiom or syntax is left, and in fact what has been taken is purely at the conceptual level. We can refer to this as ideational or conceptual plagiarism. It is still plagiarism, however, because the writer has neither added anything to the bank of ideas, nor attempted to modify or re-interpret them, but has simply re-phrased, and perhaps re-ordered them.

Skilled plagiarism is relatively rare: most of the plagiarism I have encountered in working for universities, professional writers and private companies has been of the naïf type, with a mixture of the literal and mosaic sub-types. In the remainder of this paper, the examples I will be giving will be of the naïf type, since this is what most readers are likely to encounter most of the time.

The hypothesis of originality

I suggest that in a plagiarism situation the source author has certain lexical and discourse strategies open to him/her which are not open to the copy author. These translate into three main areas of difference and are particularly applicable where the texts under consideration are aimed at a general audience as opposed to a specialised one:

  • Conciseness: It is proposed that the original author has the luxury of being concise. S/he is originating the text and can use brief, tried-and-tested expressions whereas the copy author does not have this advantage, but must adapt well-known phrases and expressions in an attempt to disguise the plagiary. At a psycholinguistic level we can speak of a least-effort situation: the original writer, being the original writer (sic.), does not have to hunt around for words and phrases as much as the copy writer does. “The availability of a word is positively correlated with its frequency” (Cancho and Sole, 2002: 1). On the other hand, the copy writer uses the original work or excerpt as a template, which must be altered: this requires the cognitive effort of changing the original expressions into acceptable alternatives – hardly 'least effort'. Put another way: you cannot simplify that which is already simple, and in all probability the process will result in the copy author having to resort to greater phrasal redundancy, resulting in a text which is loose and verbose;
  • Commonality of lexicon: Similarly, the original author has access to a common lexicon, whereas the copy author has to change common forms into less common ones – again in an attempt to escape detection;
  • Order, logic and consistency: The original author, especially if well-versed in the text topic, will in all probability have devoted a great deal of thought to the sequence of material in the text. This sequence will very possibly be a more logical sequence than that proposed by the copy author who – perhaps not seeing the schemata underlying the original author's work – randomly alters the sequence of textual items in a bid to avoid being noticed. This can lead to gaps in the copy text's logic. Finally, the original author – if s/he was committed to producing a worthwhile text – will most likely exhibit a consistency of format with regard to how points are raised, how counter-arguments are aired (and perhaps defeated) and so on. The copy author, at this level, will be exhibiting a different kind of 'least effort': the plagiary's very raison d'être is to save time and effort, and therefore it is in the nature of naïf plagiarists to be less caring and less careful than their originating models simply because they wish to put the least effort possible into the production – otherwise, why not originate a work rather than copy one?

The method

As a result of the above areas of difference, a composite method is proposed to address the question of text dating. That is to say, no single technique will be applicable in every case, but the techniques are related to each other, and hence can be considered to constitute a composite method.

The method is in three parts corresponding to the above three areas of difference, although, as noted, not every part of the method can be used in each investigation.

As regards frequency, I will be basing my counts on frequencies which occur in the Google search engine. Though there are problems with using the Internet for this purpose, particularly with older texts, it is generally accepted to have “tremendous potential value as a linguistic resource” (Kehoe and Renouf, OL). Although Google does not represent a structured corpus it has the advantage of openness and transparency, ease of use, a vast database of more than a trillion words, and is accessible to all. As anyone who has attempted to gain access to online academic corpora will know, the same cannot always be said about those corpora. WebCorp is a search engine which operates on top of search engines like Google, and can be a useful way of checking usages. I have not done so with regard to the present paper, but using WebCorp in the past I have found Google to be generally very reliable.
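The comparisons reported in the case outlines ('approximately 230 times less common' and the like) are simple ratios of two hit counts. The sketch below shows only that arithmetic. The function name is my own, and the two counts are HYPOTHETICAL figures invented for the illustration: they are not the Google counts on which this paper's ratios are based.

```python
def times_less_common(count_source, count_copy):
    """Return how many times rarer the copy expression is than the source expression."""
    if count_copy <= 0:
        raise ValueError("the copy expression must occur at least once")
    return count_source / count_copy

# HYPOTHETICAL hit counts, invented purely to demonstrate the calculation.
# They are not real search-engine figures.
hits = {"treasure": 23_000_000, "vast wealth": 100_000}

ratio = times_less_common(hits["treasure"], hits["vast wealth"])
print(f"'vast wealth' is about {ratio:.0f} times less common than 'treasure'")
```

A ratio well above 1 is what the hypothesis of originality predicts when the first count belongs to the source text's expression and the second to the copy text's alternative.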

I will now outline the nature of the plagiary method in each of the five cases referred to above. Each outline will be preceded by a short background note on the texts and their authors.

Helen Keller and Margaret Canby

The reader will have some sympathy for Helen Keller in this case. Not only was she blind and deaf, but at the time of writing 'The Frost King' she was only 11 years old (see Appendix for excerpts from both texts). Moreover, she had never intended to publish the children's story but had sent it to a friend out of gratitude to that friend for finding her a teacher. It was this friend who actually had the story published. Additionally, Helen Keller had a prodigious memory and claimed in her defence that she must have read the story some years before and remembered it, though this would have involved, in some cases, an exact memorising of quite lengthy phrases (see Hjelmquist, 1984, as to the practicalities of this). Whether or not, therefore, we regard this case as deliberate plagiarism on Helen Keller's part, there are interesting linguistic phenomena to be observed in the resemblances between the two texts.

Keller's method: Keller's method was to make slight alterations to common words and phrases. In Canby's text a character called King Frost tries to imagine what to do with his treasures. He decides to give them to Santa Claus for the benefit of the poor. He calls his fairies and asks them to convey the treasures to Santa's palace.

We have the following phrases from (C)anby (with their (K)eller parallels given below):

C: One day King Frost was trying to think of some good that he could do with his treasure
K: one day King Frost was surveying his vast wealth and thinking what good he could do with it
C: suddenly he concluded to send some of it to his kind neighbour, Santa Claus
K: he suddenly bethought him of his jolly old neighbour, Santa Claus
C: So he called together his merry little fairies
K: So he called together the merry little fairies of his household

In each case we can see that Keller uses less common words to say the 'same' thing – 'vast wealth' instead of 'treasure' (approximately 230 times less common in the language according to Google), 'bethought' instead of 'concluded' (approximately 450 times less common). We see in the last example that Keller has King Frost calling 'together the merry little fairies of his household' whereas Canby just has 'called together his merry little fairies', the first being a somewhat clumsy construction.

Later Canby's Frost 'told them' (i.e. the fairies) whereas Keller's Frost 'bade them' (about 30 times less common); Frost believes, according to Canby, that Santa 'will know how to make good use of [the treasures]' whereas Keller's Frost is sure that Santa 'is the very man to dispose of them satisfactorily'. I found on Google that to 'make use of' is more than twice as common as 'dispose of' – this aside from the fact that Keller's construction is longer and that 'dispose of' is ambiguous, a semantic flaw that 'satisfactorily' does little to repair. 'Good', in turn, is three times more common than 'satisfactory', which is about five times more common than its adverb.

In the last paragraph, each story concludes with the fable that each year King Frost brings the colours of autumn to the leaves and trees, and that this is his real treasure. Canby uses the phrase 'from that time' while Keller writes 'ever since that time': the former is approximately 50 times more common than the latter. Keller writes 'I cannot imagine…' whereas Canby has 'I do not know…'. Do constructions are almost twice as common as can constructions in the language – both in full and contracted forms. 'Know' is about ten times more common than 'imagine'.

In conclusion we note that Keller's method is to take original expressions and alter them – in some cases slightly, and in others in major ways. She is forced, for the most part, to rely on a lexis which is significantly less common than that of Canby. Her text is less accessible as a result. At the end of this paper I will give Flesch and Flesch-Kincaid readability statistics for all of the text excerpts referred to here. As may be expected from my previous remarks, Canby scores much higher on the reading ease scale than Keller and obtains a much lower grade level too.

On the basis of the above, therefore, I believe there is little difficulty with the proposition that the Canby text shows more signs of originality than the Keller text. The differences are mainly at the lexical, and occasionally at the phrasal level. Over the years I have developed a nomenclature for types of plagiarism, the Keller version being named 'the lexical switcher', for reasons given above.

Richard Condon and Robert Graves

The literal plagiarisms between these works are not extensive. Condon (The Manchurian Candidate) seems to have 'borrowed' from a wide bank of authors, of which Graves' I, Claudius was only one. In the source text, Graves writes: 'He knew that the marriage was impious: this knowledge, it seems, affected him nervously, putting an inner restraint on his flesh'. Condon's contribution is: 'Johnny knew in his superstitious heart of hearts that his marriage to Raymond's mother was an impious thing and this knowledge, it seems, affected him nervously, putting an inner restraint upon his flesh'.

Those who have had the painful task of reading The Manchurian Candidate will know that Condon is at his best when working with clichés, e.g. 'in his superstitious heart of hearts', and in flabbing out words into phrases – hence 'impious' becomes 'an impious thing'. Other examples of Condon's Faulkneresque attempts include: 'he decided that his mind must be bent or that he was drunk with compassion, or something else improbable like that' and 'owing to his endemic mopery, this one had to work nights, because, by now, it must be dark in St. Louis'.

Where Keller and Condon diverge is that Keller shuns common source words and phrases in favour of rare ones, while Condon pads out existing words into phrases and existing phrases into longer ones, producing somewhat purple prose in the process. He is therefore, in my litany of plagiarists, a 'purple padder'. Looking at Graves and Condon, I believe the reader will have little difficulty in attributing a greater originality to Graves than to Condon.

Archibald Carey and Martin Luther King Jr.

King is well known for having plagiarised almost half his doctoral dissertation from another theology student, rejoicing in the name of Jack Boozer, at Boston University (Wall Street Journal, November 9, 1990). What few recognised until recently was that his famous 'I have a dream' speech was also partly borrowed – from a speaker at the Republican National Convention in 1952. Thus where Carey has:

Let Freedom Ring. Not only from the Green Mountains and White Mountains of Vermont and New Hampshire; not only from the Catskills of New York; but from the Ozarks in Arkansas, from the Stone Mountain in Georgia, from the Blue Ridge Mountains of Virginia

King has:

And so let freedom ring from the prodigious hilltops of New Hampshire. Let freedom ring from the mighty mountains of New York. Let freedom ring from the heightening Alleghenies of Pennsylvania. Let freedom ring from the snow-capped Rockies of Colorado. Let freedom ring from the curvaceous slopes of California. But not only that; let freedom ring from Stone Mountain of Georgia. Let freedom ring from Lookout Mountain of Tennessee. Let freedom ring from every hill and molehill of Mississippi - from every mountainside.

King's style of plagiarism here seems to have been to change the actual names of the mountains: with the exception of Stone Mountain of Georgia, he doesn't refer to any of the mountains mentioned by Carey by name. He then uses alliteration ('mighty mountains') or a kind of assonance ('heightening Alleghenies'). He enters a purple vein when he talks about the 'curvaceous slopes of California' and the 'prodigious hilltops of New Hampshire'.

What is interesting when comparing King with Condon is that both writers rely on 'purple' devices and both pad out the existing text. However, King differs from Condon in that he departs further from his source than Condon does. I would label King's style of plagiarism as elaboration rather than padding. Given the highly specific kind of oratory King was creating, this kind of plagiarist is relatively rare – a 'poetic elaborator'. Looking at both King and Condon we find little difficulty in believing that, respectively, Carey's and Graves' texts were the source texts – they are shorter, pithier and relatively simple in structure.

Balkoski and Ambrose

Joseph Balkoski

The next morning, the 29ers draped the body with the Stars and Stripes and hoisted it on top of a huge pile of stones that once had been a wall of Sainte Croix Church, one block west of the cemetery. The body remained on display throughout July 19. The 29ers and some of the few civilians remaining in the city adorned the site with flowers.

Stephen Ambrose

Men from the 3rd Battalion draped the body with the Stars and Stripes and hoisted it on top of a huge pile of stones that had once been a wall in the Saint Croix Church, a block from the cemetery. Howie's body remained on display throughout the next day, July 19. The GIs and some of the few civilians remaining in the town adorned the site with flowers.

Despite his bare-faced copying of Balkoski, Ambrose's attempts at alteration are quite subtle. He hardly changes a thing. Success for the plagiarist in this case depends on the source text being less well known than the copy. However, the changes he does make do affect the frequency of certain phrases. Thus, whereas Balkoski wrote: 'one block [from]' Ambrose has 'a block from'. In general 'a' is much more common than 'one' as a determiner, but – curiously – the phrase in the original text, 'one block from', is twice as common as the copy phrase 'a block from'. Another curiosity is that although there are many more towns than cities in the world, the word 'city' is six times more common than 'town'. I would describe Ambrose as a 'subtle literal' plagiarist: he copies whole sentences almost word for word, and when he does change something it is usually very minor. He does not elaborate or pad out what he writes. Provided there is little chance of the reader having first encountered the source text, the copy text is quite believable. Subtle literal plagiarists are quite dangerous: they build their reputation by preying on little known works published by smaller publishing houses than their own. When they are finally exposed – as Alex Haley was – it is difficult for people to believe the extent of their criminality, and their reputations often remain significant, while the 'lesser' author is cast into oblivion. In fact, very often they become even more famous because of their plagiarism.

Looking at the texts of Balkoski and Ambrose it is not easy to detect which is likely to have been the source text. However, fine-grained measures of common words and phrases can be of help in this kind of plagiarism investigation.

WJ King and Bill Swanson

I have left the most recent case for last, that of Raytheon Chief Executive Officer Bill Swanson who lifted (quite literally) huge chunks out of a much earlier work. Swanson's gift to corporate literature was his 2004 '25 Unwritten Rules of Management', copied from the 1944 work of WJ King who had penned an almost unknown guide called 'The Unwritten Laws of Engineering'. In many cases Swanson copied WJ King word for word, but often seemed to be compelled to elaborate in a rather clumsy manner.

Thus, where WJ King had written “Promises, schedules, and estimates are necessary and important instruments in a well-ordered business” Swanson felt compelled to add to that homily: “…You must make promises. Don't lean on the often-used phrase, "I can't estimate it because it depends on too many uncertain factors."” Where WJ King wrote – quite pithily – “Be as particular as you can in the selection of your boss” Swanson makes heavy going of it, as in: “Work for a boss with whom you are comfortable telling it like it is. Remember that you can't pick your relatives, but you can pick your boss”.

Swanson is thus a 'heavy lifter' kind of plagiarist: he lifts whole passages word for word, but then fumbles with them by adding little homespun clichés of his own. Works of the type '10 Unwritten Rules' or 'The Unwritten Commandments', if well written, have one thing in common: each rule or commandment is brief and to the point (consider the brevity of the Ten Commandments: 'Thou shalt not steal', 'Thou shalt not kill', etc. Most of the Ten Commandments are no more than 5 or 6 words in length). Thus, in the present instance, WJ King's 'Laws' averaged 13.37 words with a deviation of 4.8. Swanson, on the other hand, has a mean of 16.5 words with a deviation of 8.1.
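The mean length and deviation quoted above can be obtained with a few lines of code. This is a sketch only: it assumes that a 'word' is any whitespace-separated token and that 'deviation' means the population standard deviation, neither of which is specified in this paper. The two sample rules are the WJ King quotations given earlier, not his full list, so the output will not match the figures for the complete texts.

```python
import statistics

def rule_length_stats(rules):
    """Return (mean, population standard deviation) of rule lengths in words."""
    lengths = [len(rule.split()) for rule in rules]
    return statistics.mean(lengths), statistics.pstdev(lengths)

# Two of WJ King's rules as quoted in this paper (an illustrative sample only)
wj_king_sample = [
    "Promises, schedules, and estimates are necessary and important "
    "instruments in a well-ordered business",
    "Be as particular as you can in the selection of your boss",
]

mean_len, deviation = rule_length_stats(wj_king_sample)
print(f"mean = {mean_len:.2f} words, deviation = {deviation:.2f}")
```

Run over each author's complete list of rules, a lower mean indicates the more concise text and a lower deviation the more consistent one.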

We find that the source text is thus not only more concise than its copy, it is also often more consistent – either in how it sequences points, or with regard to sentence length, lexis and so on. A good source text will often operate over a single register, whereas copy texts will sometimes mix registers. For these reasons I believe readers would have no difficulty in attributing the WJ King text as the source.

Readability scores

One readily available way to quantify the above observations is to grade the excerpts according to their readability scores. Most computer users have this facility within the Word program. It is a powerful method and uses the Flesch Ease of Reading Formula, given as follows:

206.835 - 1.015 × (total words / total sentences) - 84.6 × (total syllables / total words)

It also gives the grade formula, whereby the appropriate reading level of a text as applied to the US school grade system (Years 1 – 12) can be assessed. This formula is as follows:

0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) - 15.59

These formulae are used here in preference to other, equally excellent formulae because of their ready availability and ease of use.
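For readers who wish to compute the two scores without Word, the formulae can be sketched as follows. The two scoring functions follow the formulae above directly. The counting helpers, however, are my own rough assumptions: sentences are split on full stops, question marks and exclamation marks, and syllables are estimated by counting vowel groups. Word counts syllables and sentences by its own rules, so scores obtained this way should be expected to differ somewhat from those reported in the tables.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per group of consecutive vowels, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def text_counts(text):
    """Return (total words, total sentences, total syllables) for a text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables

def flesch_reading_ease(words, sentences, syllables):
    """Flesch Ease of Reading score: higher means easier to read."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level: lower means easier to read."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

sample = "Thou shalt not steal. Thou shalt not kill."
counts = text_counts(sample)
print(counts)
print(round(flesch_reading_ease(*counts), 3))
print(round(flesch_kincaid_grade(*counts), 2))
```

The sample consists of two four-word sentences of one-syllable words, which is about as easy as a text can be: the ease score comes out above 100 and the grade level below zero, a reminder that neither formula is bounded at the easy end of the scale.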

In four out of the five cases referred to above the source authors had higher ease of reading scores than their imitators and lower grade level scores. This was entirely as predicted (see Hypothesis of Originality section). We would expect, on the basis of the foregoing arguments, that those who imitate make their texts complicated and less accessible to readers. The results of the readability tests are given in the tables below, first for originators and then for their copyists:

Table 1: Readability scores for the Originating authors

The Originators    Flesch Ease of Reading    Grade Level    ER Quot Diff*
Canby              98.2                      1.0            1.24
Graves             52.5                      11.1           1.3
Carey              76.5                      5.3            1.3
Balkoski           74.1                      8.2            1.03
WJ King            56                        7.95           0.9

*Obtained by dividing the originator's Ease of Reading Score by the Imitator's score

Table 2: Readability scores for the Imitating authors

The Imitators      Flesch Ease of Reading    Grade Level
Keller             79.1                      5.0
Condon             40                        12
ML King Jr         57.2                      7.5
Ambrose            71.8                      8.8
Swanson            62.45                     7.3

As can be seen from the last column of Table 1 the originators' texts have a quotient difference (obtained by dividing the originator's Ease of Reading Score by the imitator's score) of between 0.9 and 1.3. In the case of Balkoski and Ambrose the score is low: 1.03. This reflects the fact that Ambrose's imitation is almost 100 per cent literal and therefore the difference between the two with regard to ease of reading is likely to be insignificant. Nevertheless, in four out of five cases the originator's score is in fact higher than the imitator's, which leads me to the necessity of explaining why, in my view, the fifth score does not conform to this pattern.
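The quotient differences can be recomputed from the two tables as a check. The sketch below uses only the Ease of Reading scores reported in Tables 1 and 2; the variable names are my own.

```python
# (originator's score, imitator's score) taken from Tables 1 and 2
ease_scores = {
    "Canby / Keller": (98.2, 79.1),
    "Graves / Condon": (52.5, 40),
    "Carey / ML King Jr": (76.5, 57.2),
    "Balkoski / Ambrose": (74.1, 71.8),
    "WJ King / Swanson": (56, 62.45),
}

# Quotient difference: the originator's score divided by the imitator's score
quotients = {pair: orig / copy for pair, (orig, copy) in ease_scores.items()}

for pair, q in quotients.items():
    verdict = "originator easier to read" if q > 1 else "imitator easier to read"
    print(f"{pair}: {q:.2f} ({verdict})")

# Number of pairs in which the originator has the higher Ease of Reading score
originator_easier = sum(1 for q in quotients.values() if q > 1)
print(originator_easier)  # prints 4
```

A quotient above 1 means that the originator's text is the easier of the two to read, which is the pattern predicted by the hypothesis of originality.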

Earlier, I classified text dating into three categories. The first four investigations referred to in this paper fell into the category of archaeological text dating, so-called because the inquiry necessitated a certain amount of digging around (hence 'archaeological') within each text pair to find out which was the source text. The only text pair which does not fall into this category is the last inquiry discussed, which was the WJ King and Bill Swanson pair, where the type of dating was diachronic. As can be seen from the above tables the 'readability' approach did not work with regard to the WJ King – Swanson pair. However, all is not lost. Inevitably a text is an event in history and so always reveals itself as a cultural artefact of a particular time and place. WJ King's manual on engineering secrets was written in 1944, in an age when most probably almost all engineers were men. Swanson's contribution on the other hand is a product of the 21st century, an age of supposed egalitarianism, or at least where the appearance of anything less than egalitarianism is not generally tolerated. In this regard we note that whereas Swanson, the imitator, has 'Don't overlook the fact that you are working for a boss. Keep him or her informed…', WJ King, the originator, has this somewhat dated formulation: 'Confirm your instructions and the other fellow's commitments in writing. Do not assume that the job will be done'. Some of his lexis is also rather archaic and culturally outdated: 'In carrying out a project do not wait for foremen, vendors, and others to deliver the goods; go after them and keep everlastingly after them'. Nowadays we would write 'foreperson' rather than 'foreman/men' and the word 'everlastingly' is so very antiquated and hence rare, that it is about fifteen times less common than the equally ancient 'evermore'.

Our assessment, therefore, of the source text between WJ King and Swanson would not be able to rely on readability statistics: rather we would have to look for small clues relating to minor but crucial differences between the time periods. I would suggest that nothing in language happens by accident: one instance of a thing can be enough, if it is in itself strongly indicative of a time and place. We need to look for the micro-clues rather than the big numbers in this kind of plagiarism situation.

The reader will probably have noticed that I have made no mention of the third type of text dating, which I earlier termed cultural text dating. I could only find one example of this, and since it was so short and something of a 'one-off' I have not so far mentioned it. It concerns US Senator Joe Biden who allegedly borrowed part of the UK Labour Leader Neil Kinnock's speech when running for the Democratic nomination for US President back in 1988. In his speech Kinnock had said:

Why am I the first Kinnock in a thousand generations to be able to get to university? Why is Glenys the first woman in her family in a thousand generations to be able to get to university?

At a later stage Biden said:

I started thinking as I was coming over here, why is it that Joe Biden is the first in his family ever to go to a university? Why is it that my wife who is sitting out there in the audience is the first in her family to ever go to college?

I have to admit to having stared quite hard at these two examples before realising that in the US context the phrase 'going to university' – in Biden's speech – is quite unusual, whereas it is very common in the UK. Biden actually says 'go to a university' which is even rarer than 'go to university'. He is much more on home ground, in the US context, when he says 'go to college', as he does when referring to Ms Biden.

Table 3: Frequency of phrases relating to 'college' and 'university' on Google

Phrase                Google Web wide search    Google UK search
go to university      1,430,000                 339,000
go to a university    98,200                    6,940
go to college         20,500,000                181,000
go to a college       105,000                   458
get to college        238,000                   11,000

'Go to university' and 'go to college' exhibit the highest frequencies of the examples in the table above. The usual ratio of web wide searches to UK searches is between 10 and 15, reflecting the distribution of web domains around the world. We note in this respect that 'go to university' is only just over 4 times more common worldwide than it is on UK domains. This suggests that 'go to university' is much more frequent in the UK than it is elsewhere. On the other hand, 'go to college' is 113 times more common worldwide than it is on UK domains. Allowing for duplications, errors and mis-registrations, it nevertheless seems to be much less common for a non-UK speaker to talk about going to university than it is for her/him to talk about going to college. In saying 'go to university', therefore, it seems that Senator Biden was going somewhat against the grain with regard to American, or at least non-British, usage. Had Neil Kinnock said 'Why am I the first Kinnock … to go to college', British speakers would have found it equally unusual.
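The worldwide-to-UK ratios quoted above can be recomputed from the Table 3 counts. The following is a minimal sketch, assuming the counts are as printed in the table; the function and variable names are illustrative only, and the row with 20,500,000 web-wide hits is read as 'go to college', as the discussion above implies.

```python
# Hit counts copied from Table 3: (web-wide, UK-only).
# Assumption: the 20,500,000 / 181,000 row is the 'go to college' row.
counts = {
    "go to university": (1_430_000, 339_000),
    "go to a university": (98_200, 6_940),
    "go to college": (20_500_000, 181_000),
    "go to a college": (105_000, 458),
}


def web_to_uk_ratio(web_hits: int, uk_hits: int) -> float:
    """How many times more common a phrase is web-wide than on UK domains."""
    return web_hits / uk_hits


ratios = {phrase: web_to_uk_ratio(*hits) for phrase, hits in counts.items()}
for phrase, ratio in ratios.items():
    print(f"{phrase:<20} {ratio:7.1f}")
```

On these figures 'go to university' comes out at roughly 4.2 and 'go to college' at roughly 113. A ratio well below the usual band of 10 to 15 points towards UK usage, and one well above it points away from UK usage.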

I suggest, therefore, that in the Kinnock-Biden case, there is clear evidence that Kinnock is using the language of his British culture, whereas Biden is using language which is not typical of American (and other non-British) culture. This helps us to pinpoint Kinnock as the originator of the text, not Biden, hence the designation of this type of text dating as 'cultural'.

Interestingly, the connection between readability and plagiarism is not entirely new. Glatt and Haertel (1982) developed a way of using the cloze test to check for plagiarism. They blanked out every fifth word of a suspect text and asked the suspect plagiarist to fill in the blanks. They found that those who had plagiarised consistently produced lower scores on the cloze test than those who had not plagiarised. Much earlier, WL Taylor (1953) had suggested the use of the cloze test as a readability test: if a panel of ordinary readers were unable to fill in more than 40% of blanked out words correctly, it was a sign that a text was 'unreadable'. Anything above 60% meant that the text was highly readable. More recently Knight, Almeroth and Bimber (2004) found that using sentences with lower readability scores from suspect texts was more likely to assist in Internet searching for plagiarism. What Knight et al did was to measure the readability score of each sentence in a text, then put to one side those sentences with grade level scores below 10 and submit those to Internet searches. They found this technique highly successful, and the work I have reported in the present study (during which time I was unaware of their work) bears out their findings, though I would suggest the grade level of 10 is somewhat arbitrary: in general it would be better to take each text on a case-by-case basis, determine the mean grade level and then find those sentences in that text which deviate most from that mean. Alternatively, the authors could take the mean for a class or group of students and find deviations in individual students' sentences from that mean.
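The suggestion of measuring each sentence against the text's own mean grade level can be sketched as follows. This is an illustration under stated assumptions, not the system described by Knight et al: the grade measure used here is the Flesch-Kincaid grade level formula, the syllable counter is a crude vowel-group heuristic, and the function names, the sample text and the threshold of 1.5 standard deviations are all hypothetical choices.

```python
import re
from statistics import mean, pstdev


def count_syllables(word: str) -> int:
    """Crude estimate (assumption): number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def grade_level(sentence: str) -> float:
    """Flesch-Kincaid grade level computed for a single sentence."""
    words = re.findall(r"[A-Za-z']+", sentence)
    if not words:
        return 0.0
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) + 11.8 * syllables / len(words) - 15.59


def outlying_sentences(text: str, threshold: float = 1.5) -> list[tuple[str, float]]:
    """Sentences whose grade level lies more than `threshold` standard
    deviations away from the mean grade level of the same text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    grades = [grade_level(s) for s in sentences]
    if len(grades) < 2:
        return []
    centre, spread = mean(grades), pstdev(grades)
    if spread == 0:
        return []  # every sentence has the same grade level
    return [(s, g) for s, g in zip(sentences, grades)
            if abs(g - centre) / spread > threshold]


# Invented sample text: four plain sentences and one ornate one.
sample = (
    "The cat sat. The dog ran. The sun set. The bird flew. "
    "Notwithstanding considerable institutional opposition, the administration "
    "unilaterally implemented extraordinarily comprehensive restructuring."
)
flagged = outlying_sentences(sample)
for sentence, grade in flagged:
    print(f"{grade:5.1f}  {sentence}")
```

The same function could be pointed at a class mean instead of a single text's mean by computing `centre` and `spread` over all students' sentences and then testing one student's sentences against those values.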

Conclusion

Naïf plagiarists have fewer resources than source or original writers. They therefore have to rely on a few cheap circus tricks to get by. These can be summarised as follows:

1. The use of relatively rare words to substitute for common ones: thus the perfectly good 'good' becomes 'satisfactory', 'conclude' or 'think' becomes 'bethought himself' – an ugly, archaic construction if ever there was one, 'treasure' becomes 'vast wealth', and the perfectly serviceable 'told' becomes 'bade'. Measurable symptom: rarer lexis and readability.
2. The use of padding and elaboration: thus we don't have a character who mopes but one who is possessed of 'endemic mopery'. Measurable symptom: longer words and sentences, hence readability issues.
3. The collapse of logic: Condon's character 'knows in his superstitious heart of hearts'. This is a breakdown in logic: superstition is not knowing. You cannot know something if you are superstitious about it – the two together are a contradiction in terms. Measurable symptom: poor readability because of prolixity.
4. Inconsistency in format, layout or sequence: Here the writer will resort to the use of wildly improbable words or phrases to pad out a source text and render it unrecognisable, alter the sequence of phrases to render the original discourse intent unintelligible, and generally butcher a perfectly harmless work for the sake of momentary glory. These acts of literary homicide often result in text which is badly ordered, poorly connected in logical terms or – more likely still – simply unreadable. Measurable symptom: none in general, but look for excessive repetition and redundancy which may lead to readability differences.
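Two of the measurable symptoms listed above, longer words and longer sentences, can be counted directly. The sketch below applies such counts to the Graves and Condon pair quoted in the Appendix. It is illustrative only: the function name and the choice of measures are assumptions, and these surface counts are not the readability scores reported in the tables.

```python
import re


def surface_measures(text: str) -> dict[str, float]:
    """Word count, mean sentence length in words, mean word length in characters."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text)
    if not words:
        raise ValueError("text contains no words")
    return {
        "words": len(words),
        "words_per_sentence": len(words) / len(sentences),
        "chars_per_word": sum(len(w) for w in words) / len(words),
    }


# The Graves / Condon pair as quoted in the Appendix.
graves = ("He knew that the marriage was impious: this knowledge, it seems, "
          "affected him nervously, putting an inner restraint on his flesh.")
condon = ("Johnny knew in his superstitious heart of hearts that his marriage "
          "to Raymond's mother was an impious thing and this knowledge, it seems, "
          "affected him nervously, putting an inner restraint upon his flesh.")

for name, text in (("Graves", graves), ("Condon", condon)):
    print(name, surface_measures(text))
```

On this pair the padding shows up at once in the word counts: Condon's single sentence runs to 33 words against 21 for Graves.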

The intention in this paper has been to show the ease with which two competing texts can be dated relative to each other on the basis of a few simple observations and techniques, and to show that most types of naïf plagiarism fall into one of only a small number of categories: padding, lexical substitution, lifting and fumbling, subtly altering and – rarely – poetically elaborating. In carrying out this intention I have also introduced the notion of classifying text dating (or typing) into three types: diachronic, cultural and archaeological. It is hoped that readers will find these classifications and descriptions useful devices in their own investigations into plagiarism. More than this, it is hoped that others will continue to research the question of applying readability statistics to university and college essays with a view to assessing methods of testing for plagiarism. If a larger study were undertaken it would most likely be possible to carry out an inferential statistical analysis of the relative readability quotients of source vs. imitation texts. In the present instance this has not been done, both because of the relatively low number of examples under consideration and because the intention was to do no more than outline the method and its application; it remains a project for the future.

Appendix: The text pairs

Margaret Canby

He has two neighbours, who live still farther north; one is King Winter, a cross and churlish old monarch, who is hard and cruel, and delights in making the poor suffer and weep; but the other neighbour is Santa Claus, a fine, good-natured, jolly old soul, who loves to do good, and who brings presents to the poor, and to nice little children at Christmas.

Well, one day King Frost was trying to think of some good that he could do with his treasure; and suddenly he concluded to send some of it to his kind neighbour, Santa Claus, to buy presents of food and clothing for the poor, that they might not suffer so much when King Winter went near their homes. So he called together his merry little fairies, and showing them a number of jars and vases filled with gold and precious stones, told them to carry those carefully to the palace of Santa Claus, and give them to him with the compliments of King Frost. "He will know how to make good use of the treasure," added Jack Frost; then he told the fairies not to loiter by the way, but to do his bidding quickly.

From that time, I suppose, it has been part of Jack Frost's work to paint the trees with the glowing colours we see in the autumn; and if they are not covered with gold and precious stones, I do not know how he makes them so bright; do you?

Helen Keller

Well, one day King Frost was surveying his vast wealth and thinking what good he could do with it, he suddenly bethought him of his jolly old neighbour, Santa Claus. "I will send my treasures to Santa Claus," said the King to himself. "He is the very man to dispose of them satisfactorily, for he knows where the poor and the unhappy live, and his kind old heart is always full of benevolent plans for their relief." So he called together the merry little fairies of his household and, showing them the jars and vases containing his treasures, he bade them carry them to the palace of Santa Claus as quickly as they could.

Ever since that time it has been King Frost's great delight to paint the leaves with the glowing colors we see in the autumn, and if they are not covered with gold and precious stones I cannot imagine what makes them so bright, can you?

Robert Graves

He knew that the marriage was impious: this knowledge, it seems, affected him nervously, putting an inner restraint on his flesh.

Richard Condon

Johnny knew in his superstitious heart of hearts that his marriage to Raymond's mother was an impious thing and this knowledge, it seems, affected him nervously, putting an inner restraint upon his flesh.

Archibald Carey

Not only from the Green Mountains and White Mountains of Vermont and New Hampshire; not only from the Catskills of New York; but from the Ozarks in Arkansas, from the Stone Mountain in Georgia, from the Blue Ridge Mountains of Virginia

Martin Luther King Jr

And so let freedom ring from the prodigious hilltops of New Hampshire. Let freedom ring from the mighty mountains of New York. Let freedom ring from the heightening Alleghenies of Pennsylvania. Let freedom ring from the snow-capped Rockies of Colorado. Let freedom ring from the curvaceous slopes of California. But not only that; let freedom ring from Stone Mountain of Georgia. Let freedom ring from Lookout Mountain of Tennessee. Let freedom ring from every hill and molehill of Mississippi - from every mountainside.

Joseph Balkoski

The next morning, the 29ers draped the body with the Stars and Stripes and hoisted it on top of a huge pile of stones that once had been a wall of Sainte Croix Church, one block west of the cemetery. The body remained on display throughout July 19. The 29ers and some of the few civilians remaining in the city adorned the site with flowers.

Stephen Ambrose

Men from the 3rd Battalion draped the body with the Stars and Stripes and hoisted it on top of a huge pile of stones that had once been a wall in the Saint Croix Church, a block from the cemetery. Howie's body remained on display throughout the next day, July 19. The GIs and some of the few civilians remaining in the town adorned the site with flowers.

WJ King and Bill Swanson

(Note: Swanson's contributions are numbered)
- "Cultivate the habit of 'boiling matters down' to their simplest terms."
20. Boil matters down to the simplest terms: the proverbial "elevator speech" is best.
- "Do not get excited in engineering emergencies -- keep your feet on the ground."
21. Don't get excited in engineering emergencies. Keep your feet on the ground.
- "Cultivate the habit of making brisk, clean-cut decisions."
22. Cultivate the habit of making quick, clean-cut decisions.
- "Promises, schedules, and estimates are necessary and important instruments in a well-ordered business"
17. Promises, schedules and estimates are important instruments in a well-ordered business. You must make promises. Don't lean on the often-used phrase, "I can't estimate it because it depends on too many uncertain factors."

Bibliography

Anson CM and C Farris. 1998. Under Construction: Working at the Intersections of Composition Theory, Research, and Practice. Utah State University Press.
Cancho R and R Sole. 2002. "Least effort and the origins of scaling in human language". Proceedings of the National Academy of Sciences of the United States of America, 100(3), 788-791.
Deardorff JW. 2003. On the accumulation of individual probabilities. OL, found at http://www.tjresearch.info/cumulate.htm on 20 May 2006.
Glatt BS and EH Haertel. 1982. "The use of the cloze testing procedure for detecting plagiarism". Journal of Experimental Education, 50, 127-136.
Hjelmquist E. 1984. "Memory for conversations". Discourse Processes, 7, 321-336.
Kehoe A and A Renouf. 2002. WebCorp: Applying the Web to Linguistics and Linguistics to the Web. WWW2002 Conference, Honolulu, Hawaii (OL).
Knight A, K Almeroth and B Bimber. 2004. An Automated System For Plagiarism Detection Using The Internet. World Conference on Educational Multimedia, Hypermedia & Telecommunications (ED MEDIA), Lugano, Switzerland, June 2004 (OL).
Miyake M, H Akama, N Makoshi and M Nakagawa. 2005. Computational Approach to the Synoptic Problem. Department of Human System Science, Tokyo Institute of Technology.
Olsson J. 2004. Forensic Linguistics: An introduction to language, crime and the law. Continuum.
Stein RH. 1987. The Synoptic Problem: An Introduction. Baker Books.
Taylor WL. 1953. "Cloze procedure: A new tool for measuring readability". Journalism Quarterly, 30, 415-433.


© 2007-2014 The Forensic Linguistics Society