Spotlight on stylometric text analysis - Newspaper stories

From MandrakeWiki

Revision as of 11:41, 28 July 2023

Phantom Sundays

1939-1946

The Corpus

The Phantom Sundays
  • PS_002 "The Precious Cargo of Colonel Winn"
  • PS_003 "The Fire Goddess"
  • PS_004 "The Beachcomber"
  • PS_005 "The Saboteurs"
  • PS_006 "The Return of the Sky Band"
  • PS_007 "The Impostor"
  • PS_008 The Marshall Sisters Pt.1: "Castle in the Clouds"
  • PS_009 The Marshall Sisters Pt.2: "The Ismani Cannibals"
  • PS_010 The Marshall Sisters Pt.3: "Hamid the Terrible"
  • PS_011 "The Childhood of the Phantom"
  • PS_012 "The Golden Princess"
  • PS_013 "The Strange Fisherman"
  • PS_014 "Queen Pera the Perfect"
Mandrake the Magician Sundays
  • MS_012 "The Ghost Bear of Glass Mountain"
  • MS_013 "The Theatre Mysteries"
  • MS_023 "Mystery of the Girls with Red Hair"
  • MS_024 "Cloud City"
  • MS_025 "Gloria Golden"
  • MS_026 "The Garden of Wuzzu"
  • MS_027 "The Circus Adventure"
  • MS_028 "The Santa Claus Pirates"
Comics written by Alfred Bester
  • "Starman", The Menace of the Invisible Raiders! (Adventure Comics #67, October 1941)
  • "Starman", The Blaze of Doom! (Adventure Comics #68, November 1941)
  • "Starman", The Little Man Who Wasn't There! (Adventure Comics #78, September 1942)
  • "Green Lantern", The Man Who Wanted the World! (Green Lantern #10, winter 1943)
Voyant Tools

An analysis using Voyant Tools shows that the corpus consists of 82,678 total words and 7,251 unique word forms.

In this corpus, the stories by Bester have a higher vocabulary density: four of the five highest values belong to them. Vocabulary density is a measurement of vocabulary usage in comparison to the length of a document. (Think of how many words will be read on average before a new word is encountered.)

Rank      1                   2               3               4               5
Highest:  GL-10-1943 (0.405)  PS_009 (0.387)  AC_078 (0.380)  AC_067 (0.371)  AC_068 (0.365)
Lowest:   PS_006 (0.198)      PS_010 (0.206)  PS_011 (0.223)  PS_007 (0.236)  PS_004 (0.236)
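As a rough illustration of the measure, vocabulary density can be read as the ratio of unique word forms to total words. The sketch below assumes that definition and a simple lowercase tokenizer (the `tokenize` helper is hypothetical); Voyant's own tokenization may differ, so the values it reports will not necessarily match.

```python
import re

def tokenize(text):
    """Hypothetical tokenizer: lowercase words, optionally with one apostrophe."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def vocabulary_density(text):
    """Unique word forms divided by total words (0.0 for an empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Corpus-level figure computed from the word counts quoted above.
corpus_density = 7251 / 82678
print(round(corpus_density, 3))
```

For the corpus as a whole this gives roughly 0.088, lower than any single story in the table. That is expected, since the ratio tends to fall as a text gets longer and common words repeat.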

Comparing the readability index, the childhood story (PS_011) has the lowest score. The readability index is an estimate of how difficult a text is to read, made by measuring the text's complexity. Measurable attributes such as word lengths, sentence lengths, and syllable counts give us ways to measure that complexity. Voyant Tools uses the Coleman–Liau index, whose output approximates the U.S. grade level.

Rank      1               2               3               4               5
Highest:  MS_028 (7.863)  MS_026 (7.730)  MS_024 (7.378)  MS_027 (7.360)  MS_025 (7.222)
Lowest:   PS_011 (3.547)  PS_004 (4.493)  MS_012 (4.668)  MS_013 (4.767)  PS_014 (4.836)
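For reference, the Coleman–Liau index is computed as 0.0588 × L − 0.296 × S − 15.8, where L is the average number of letters per 100 words and S the average number of sentences per 100 words. Unlike some readability formulas, it relies on letter counts rather than syllables. The sketch below is a minimal version of that formula; the word and sentence splitting rules are assumptions and are likely simpler than what Voyant Tools does, so results may differ from the table.

```python
import re

def coleman_liau(text):
    """Coleman-Liau index: 0.0588 * L - 0.296 * S - 15.8."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not words:
        raise ValueError("text contains no words")
    letters = sum(ch.isalpha() for ch in text)
    # Assumption: each run of . ! or ? ends one sentence; at least one sentence.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = letters / len(words) * 100    # letters per 100 words
    S = sentences / len(words) * 100  # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8

print(coleman_liau("The Phantom rides. He is fast!"))
```

Short words and many short sentences both pull the score down, which may be part of why brief speech-bubble text scores at a low grade level.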


A stylometric text analysis of the Avon Novels using the 'stylo' package in RStudio shows that stylometry can be used to identify the author of a book. The text in the comic strips is a bit different from that of a novel. In the novels the dialogue looks roughly like this: "Hello," said the Phantom. In the Sunday stories the speech bubbles read more like this: Hello!

For a similar analysis I made a corpus of the text from the Phantom Sundays (2-14):

  • PS_002 "The Precious Cargo of Colonel Winn"
  • PS_003 "The Fire Goddess"
  • PS_004 "The Beachcomber"
  • PS_005 "The Saboteurs"
  • PS_006 "The Return of the Sky Band"
  • PS_007 "The Impostor" at OWI
  • PS_008 The Marshall Sisters Pt.1: "Castle in the Clouds"
  • PS_009 The Marshall Sisters Pt.2: "The Ismani Cannibals"
  • PS_010 The Marshall Sisters Pt.3: "Hamid the Terrible"
  • PS_011 "The Childhood of the Phantom"
  • PS_012 "The Golden Princess"
  • PS_013 "The Strange Fisherman"
  • PS_014 "Queen Pera the Perfect"

Two analyses were done: the first with 0-902 MFW (most frequent words) 2-grams (fig. 1) and the second with 0-902 MFC (most frequent characters) 3-grams (fig. 2), both using the Bootstrap Consensus Tree. The MFW 2-gram result groups all stories close together in writing style. The MFC 3-grams show slightly larger variation, but this is most likely due to the amount of dialogue from different characters in the stories. The analysis shows no clear indication that anyone other than Lee Falk was the author of these stories.
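To make the two feature types concrete, here is a minimal sketch of how word 2-grams and character 3-grams can be counted and ranked by frequency. It is not the 'stylo' implementation: the function names, the tokenizer, and the sample sentence are assumptions made for illustration, and the Bootstrap Consensus Tree step is not reproduced.

```python
import re
from collections import Counter

def word_ngrams(text, n=2):
    """Overlapping word n-grams, each joined into one space-separated string."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n=3):
    """Overlapping character n-grams over lowercased, whitespace-normalized text."""
    chars = " ".join(text.lower().split())
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

def most_frequent(features, top=100):
    """Rank features by count and keep the `top` most frequent ones."""
    return Counter(features).most_common(top)

sample = "The Phantom rides. The Phantom never sleeps."  # illustrative text only
print(most_frequent(word_ngrams(sample), top=3))
print(most_frequent(char_ngrams(sample), top=3))
```

Character 3-grams pick up punctuation and fragments of names as well as word parts, so they can react more strongly to who is speaking than word 2-grams do.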

A third analysis added some Mandrake Sundays to the corpus:

  • MS_012 "The Ghost Bear of Glass Mountain"
  • MS_013 "The Theatre Mysteries"
  • MS_023 "Mystery of the Girls with Red Hair"
  • MS_024 "Cloud City"
  • MS_025 "Gloria Golden"
  • MS_026 "The Garden of Wuzzu"
  • MS_027 "The Circus Adventure"
  • MS_028 "The Santa Claus Pirates"

This is a cluster analysis using the 100 MFW (fig. 3). Here one sees a greater distance between the Phantom stories and the Mandrake stories, but otherwise there is no indication of a ghost-writer.
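A cluster analysis of this kind starts from pairwise distances between the texts. The text does not say which distance measure was used, so the sketch below assumes Burrows's Delta over z-scored relative frequencies of the most frequent words, a common choice in stylometry. The function names and the use of the population standard deviation are assumptions, and the clustering step itself (building the dendrogram) is left out.

```python
import re
from collections import Counter
from statistics import mean, pstdev

def relative_freqs(text):
    """Relative frequency of each lowercase word in one text."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    total = len(words)
    return {w: c / total for w, c in Counter(words).items()}

def delta_matrix(texts, n_mfw=100):
    """texts: dict of label -> text. Returns {(a, b): distance} for each pair."""
    labels = list(texts)
    freqs = {label: relative_freqs(texts[label]) for label in labels}
    # Rank words by summed relative frequency across all texts.
    totals = Counter()
    for f in freqs.values():
        for word, value in f.items():
            totals[word] += value
    mfw = [word for word, _ in totals.most_common(n_mfw)]

    z = {label: [] for label in labels}
    for word in mfw:
        column = [freqs[label].get(word, 0.0) for label in labels]
        sd = pstdev(column)
        if sd == 0:
            continue  # identical frequency everywhere: no contrast to measure
        mu = mean(column)
        for label, value in zip(labels, column):
            z[label].append((value - mu) / sd)

    distances = {}
    for i, a in enumerate(labels):
        for b in labels[i + 1:]:
            diffs = [abs(x - y) for x, y in zip(z[a], z[b])]
            distances[(a, b)] = mean(diffs) if diffs else 0.0
    return distances
```

Called with a dictionary such as `{"PS_002": text_a, "MS_012": text_b, ...}` (hypothetical variables holding the story texts), it returns one distance per pair, which a hierarchical clustering routine could then turn into a tree.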