Spotlight on stylometric text analysis - The Avon Novels

From MandrakeWiki
Word Cloud - The Avon Novels

The Corpus

The Story of the Phantom

"The Story of the Phantom" is a series of 15 novels, published by Avon Publications in the U.S. from 1972 to 1975, based on Lee Falk's Phantom stories. When released the adaptor of issues 2 and 10 was not credited, and issue 15 was credited as Carson Bingham. Lee Falk did correct this using an "Author's note" in the books.

Issue(s) | Adapted by
1, 6, 9, 12, 15 | Lee Falk
2, 3 | Basil Copper
4, 5, 7, 8, 10, 11 | Ron Goulart (pen name: Frank S. Shawn)
13 | Warren Shanahan
14 | Bruce Cassiday (pen name: Carson Bingham)

Flash Gordon

"Flash Gordon" is a series of 6 novels, published by Avon Publications in the U.S. from 1974 to 1975, based on Alex Raymond's Flash Gordon stories. When released the adaptor of the four first issues was credited as Con Steffanson, the two last one credited as Carson Bingham. Later Ron Goulart said he wrote the first three novels and Bruce Cassiday the three last ones.

Issue(s) | Adapted by
1, 2, 3 | Ron Goulart (pen name: Con Steffanson)
4, 5, 6 | Bruce Cassiday (pen name: Carson Bingham)

Supplementary text

The Avon novels include only one novel written by Shanahan and two written by Copper. To provide a better basis for comparison in the further analysis, three novelettes by these authors are added to the corpus:

  • Warren Shanahan (using his pen name: W. J. Saber): "Your Mission- Block the Brenner Pass!" and "Find and Destroy the Nazis Secret Wolf-Pack Base".
  • Basil Copper: "The Long Rest".

The Avon novels have two different protagonists (the Phantom and Flash Gordon) in stories set in different environments. To get an indication of whether this affects the analysis results, four novels by Burroughs are included in the corpus: one with the protagonist Tarzan in a jungle environment, two with the protagonist John Carter on the planet Mars, and "At the Earth's Core", which is set in the underground world of Pellucidar:

  • Edgar Rice Burroughs: "A Princess of Mars", "At the Earth's Core", "Tarzan of the Apes" and "The Warlord of Mars".

Statistics

Supplementary text

The novels by Edgar Rice Burroughs range from approximately 49,000 to 85,000 words, and the novelettes by Copper and Shanahan from approximately 14,000 to 17,000 words.

The Avon Novels

The Avon novels vary between approximately 30,000 and 56,000 words. Lee Falk's first novel has the most words, while Goulart's tenth novel about the Phantom has the fewest.

  • All of Goulart's novels are in the lower tier of word count, between approximately 30,000 and 37,000.
  • All the novels written by Falk are in the upper tier, between approximately 47,000 and 56,000 words.
  • Cassiday, Shanahan and Copper's novels are approximately 42,000 to 50,000 words.

Analysis using the 'stylo' package in RStudio

Analysis I - The Phantom novels

The corpus consists of the 15 novels, each from chapter 1 to the end of the novel.

A Bootstrap consensus tree analysis for the 100 MFW[footnotes 1] to 1,000 MFW (with an incremental step size of 50 words) branches by author in agreement with Lee Falk's correction in the Author's note.

The novels by Shanahan (13) and Bingham (14) branch close together, which is not uncommon when the corpus contains only one text by an author. To illustrate this, I added some novels by Edgar Rice Burroughs: first "Tarzan of the Apes" (fig. 2), then "A Princess of Mars" (fig. 3), and finally "At the Earth's Core" (fig. 4). When two or more novels by an author are in the corpus, the analysis branches according to the author.
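As a rough illustration of the sweep described above, the following Python sketch lists the MFW list sizes that such a run would visit and shows one way to pick the most frequent words. It is not the stylo implementation; the function names `mfw_cutoffs` and `most_frequent_words`, the lowercasing and the whitespace splitting are assumptions made for the example.

```python
from collections import Counter


def mfw_cutoffs(start=100, stop=1000, step=50):
    """Return the MFW list sizes visited by the sweep, both ends included."""
    return list(range(start, stop + 1, step))


def most_frequent_words(texts, n):
    """Return the n most frequent words across all texts.

    Assumption: words are lowercased and split on whitespace only.
    """
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return [word for word, _ in counts.most_common(n)]


cutoffs = mfw_cutoffs()
print(len(cutoffs), cutoffs[0], cutoffs[-1])  # prints: 19 100 1000
```

With the settings of Analysis I the sweep covers 19 list sizes (100, 150, ..., 1,000). The idea of a consensus tree is that one tree is built per list size and only the groupings that recur in enough of those trees are kept.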

Analysis II

The corpus consists of all novels and novelettes, each from chapter 1 to the end.

  • The first analysis is a cluster analysis for the 100 MFW using the Classic Delta distance. (fig. 5)
  • The second analysis is the same as the first, but with 1,500 MFW. (fig. 6)
  • The third analysis is a Bootstrap consensus tree analysis for the 100 MFW to 3,000 MFW (with an incremental step size of 50 words). (fig. 7)

The second and third analyses branch by author in agreement with Lee Falk's correction in the Author's note and Goulart's comments regarding the "Flash Gordon" novels.
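Classic Delta is usually described as Burrows's Delta: the relative frequencies of the MFW are converted to z-scores across the corpus, and the distance between two texts is the mean absolute difference of their z-scores. The sketch below follows that description in plain Python. It is an illustration under assumptions (whitespace tokenization, lowercasing, sample standard deviation, the hypothetical name `classic_delta`), not the code that stylo runs.

```python
from collections import Counter
from statistics import mean, stdev


def relative_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary word in one text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for an empty text
    return [counts[word] / total for word in vocabulary]


def classic_delta(texts, n_mfw):
    """Pairwise Delta distances for a dict {name: text} with at least two texts."""
    corpus_counts = Counter()
    for text in texts.values():
        corpus_counts.update(text.lower().split())
    vocabulary = [w for w, _ in corpus_counts.most_common(n_mfw)]

    names = sorted(texts)
    freqs = {n: relative_frequencies(texts[n], vocabulary) for n in names}

    # z-score every word's frequency across the texts in the corpus
    z = {n: [] for n in names}
    for i in range(len(vocabulary)):
        column = [freqs[n][i] for n in names]
        mu, sigma = mean(column), stdev(column)
        for n in names:
            z[n].append((freqs[n][i] - mu) / sigma if sigma > 0 else 0.0)

    # Delta = mean absolute difference of z-scores over the MFW
    return {
        (a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

A cluster analysis then groups the texts by these pairwise distances; a smaller Delta means two texts use the most frequent words in more similar proportions.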

Analysis using the Voyant Tools (local Voyant Server)

When it comes to statistics, there may be some sources of error in the corpus.

  • The Avon Novels have some typographical errors that have not been corrected.
  • The proofreading of the OCR[footnotes 2] output may itself introduce some typographical errors.

Punctuation marks

The only authors who use parentheses in this corpus are Shanahan and Falk. Shanahan uses parentheses eight times in his novel. Falk uses them in all his novels, from 15 times (The Phantom #9) to 53 times (The Phantom #15).

The use of exclamation marks also distinguishes the various authors:

  • Goulart: 15 - 64
  • Falk: 68 - 87
  • Copper: 110 - 207
  • Shanahan: 141
  • Cassiday: 229 - 370
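Counts like these can be reproduced outside Voyant with a few lines of Python. The sketch below is only an illustration: the names `punctuation_profile` and `per_10k_words` are hypothetical, and counting opening parentheses (one per parenthesised passage) is an assumption about how the figures above were obtained.

```python
def punctuation_profile(text):
    """Count opening parentheses and exclamation marks in a text."""
    return {
        "parentheses": text.count("("),
        "exclamation_marks": text.count("!"),
    }


def per_10k_words(count, text):
    """Scale a raw count to a rate per 10,000 whitespace-separated words."""
    n_words = len(text.split())
    return 10000 * count / n_words if n_words else 0.0


sample = "Stop! (He did not.) Stop, I say!"
profile = punctuation_profile(sample)
print(profile)  # prints: {'parentheses': 1, 'exclamation_marks': 2}
```

The figures in the list above are raw counts per novel. Because the novels differ in length, a rate per 10,000 words may be a fairer basis for comparing authors.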

Tokenization

Tokenization[footnotes 3] affects how Voyant Tools calculates the number of words. This analysis uses automatic tokenization.

Tokenization | Count | Tokens | Notes
Automatic | 3 | What's, voyant, tools.org | the hyphen is split but the tools.org is considered a URL token; tokens are lowercase
Word Boundaries | 5 | What, s, voyant, tools, org | any non-word character is a delimiter; tokens are lowercase
Whitespace Only | 2 | What's, voyant-tools.org? | punctuation is kept in tokens and case is unchanged
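The two simpler modes in the table can be approximated with Python's standard library. This is a sketch, not Voyant's own code: the function names are hypothetical, the sample string is reconstructed from the Whitespace Only row, and the Automatic mode is left out because its URL handling cannot be reproduced reliably with a single regular expression.

```python
import re


def whitespace_only(text):
    """Split on whitespace; punctuation and case are kept."""
    return text.split()


def word_boundaries(text):
    """Treat every non-word character as a delimiter; tokens are lowercased."""
    return re.findall(r"\w+", text.lower())


sample = "What's voyant-tools.org?"
print(len(whitespace_only(sample)))  # prints: 2
print(word_boundaries(sample))       # prints: ['what', 's', 'voyant', 'tools', 'org']
```

The same sentence yields two or five tokens depending on the mode, which is why word counts are only comparable when they come from the same tokenization setting.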

Notes

  1. MFW = Most Frequent Words
  2. OCR = Optical Character Recognition - the process that converts an image of text into a machine-readable text format
  3. Tokenization = The process of identifying words, or sequences of Unicode letter characters that should be considered as a unit