Spotlight on stylometric text analysis - Newspaper stories

From MandrakeWiki
Revision as of 17:56, 27 July 2023 by The Clay Camel

A stylometric text analysis of the Avon Novels using the 'stylo' package in RStudio shows that it is possible to use stylometry to identify the author of a book. The text in the comic strips is a bit different from that of the novels. In the novels the dialogue looks like this: "Hello," said the Phantom. But in the Sunday stories the speech bubbles are more like this: Hello!

For a similar analysis I made a corpus of the text from the Phantom Sundays (2-14):

  • PS_002 "The Precious Cargo of Colonel Winn"
  • PS_003 "The Fire Goddess"
  • PS_004 "The Beachcomber"
  • PS_005 "The Saboteurs"
  • PS_006 "The Return of the Sky Band"
  • PS_007 "The Impostor" at OWI
  • PS_008 The Marshall Sisters Pt.1: "Castle in the Clouds"
  • PS_009 The Marshall Sisters Pt.2: "The Ismani Cannibals"
  • PS_010 The Marshall Sisters Pt.3: "Hamid the Terrible"
  • PS_011 "The Childhood of the Phantom"
  • PS_012 "The Golden Princess"
  • PS_013 "The Strange Fisherman"
  • PS_014 "Queen Pera the Perfect"

Two analyses were done: the first using 0-902 most frequent words (MFW) as 2-grams (fig. 1) and the second using 0-902 most frequent characters (MFC) as 3-grams (fig. 2), both with the Bootstrap Consensus Tree. The MFW 2-gram result groups all stories close together in writing style. The MFC 3-gram result shows slightly larger variation, but this is most likely due to the amount of dialogue from different characters in the stories. The analysis shows no clear indication that anyone other than Lee Falk was the author of these stories.
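The two feature types can be illustrated with a minimal Python sketch. This is not how the 'stylo' package is implemented; the function names and the simple regex tokenizer are assumptions made for illustration only. It extracts word 2-grams and character 3-grams from a text and counts the most frequent ones.

```python
import re
from collections import Counter

def word_ngrams(text, n=2):
    """Return word n-grams from lower-cased text (simple regex tokenizer, an assumption)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n=3):
    """Return overlapping character n-grams, with whitespace runs collapsed to one space."""
    chars = re.sub(r"\s+", " ", text.lower()).strip()
    return [chars[i:i + n] for i in range(len(chars) - n + 1)]

def most_frequent(features, top=100):
    """Return the `top` most frequent features with their counts."""
    return Counter(features).most_common(top)

# Hypothetical sample text, not taken from the corpus
sample = "Hello! The Phantom rides. The Phantom waits."
top_word_bigrams = most_frequent(word_ngrams(sample, 2))
top_char_trigrams = most_frequent(char_ngrams(sample, 3))
```

Character n-grams also pick up punctuation and spacing, which may be one reason they react more strongly to differences in dialogue than word n-grams do.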

A third analysis added some Mandrake Sundays to the corpus:

  • MS_012 "The Ghost Bear of Glass Mountain"
  • MS_013 "The Theatre Mysteries"
  • MS_023 "Mystery of the Girls with Red Hair"
  • MS_024 "Cloud City"
  • MS_025 "Gloria Golden"
  • MS_026 "The Garden of Wuzzu"
  • MS_027 "The Circus Adventure"
  • MS_028 "The Santa Claus Pirates"

This is a cluster analysis using the 100 MFW (fig. 3). Here one sees a greater distance between the Phantom stories and the Mandrake stories, but otherwise no indication of a ghost-writer.
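A cluster analysis of this kind starts from pairwise distances between the texts. The page does not state which distance measure was used; the sketch below assumes Burrows's Delta, a common choice in stylometry, computed over the most frequent words of the whole corpus. All function names are hypothetical, and the population standard deviation is used for simplicity.

```python
import re
from collections import Counter
from statistics import mean, pstdev

def tokenize(text):
    """Lower-case the text and split it into words (simple tokenizer, an assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def delta_matrix(texts, n_mfw=100):
    """texts: dict of name -> raw (non-empty) text. Returns dict of (a, b) -> Delta."""
    tokens = {name: tokenize(t) for name, t in texts.items()}
    corpus_counts = Counter(w for toks in tokens.values() for w in toks)
    vocab = [w for w, _ in corpus_counts.most_common(n_mfw)]
    names = list(texts)
    # Relative frequency of each MFW in each text
    freqs = {}
    for name in names:
        counts = Counter(tokens[name])
        freqs[name] = [counts[w] / len(tokens[name]) for w in vocab]
    # z-score each word's frequency across the texts
    z = {name: [] for name in names}
    for i in range(len(vocab)):
        column = [freqs[name][i] for name in names]
        mu, sigma = mean(column), pstdev(column)
        for name in names:
            z[name].append((freqs[name][i] - mu) / sigma if sigma else 0.0)
    # Delta is the mean absolute difference of z-scores
    return {(a, b): mean(abs(x - y) for x, y in zip(z[a], z[b]))
            for a in names for b in names}
```

The resulting distances could then be passed to any hierarchical clustering routine; a larger Delta between the PS and MS texts than within each group would correspond to the separation seen in fig. 3.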

A further analysis using Voyant Tools shows that the thirteen Phantom Sundays consist of 52,982 total words and 5,282 unique word forms.

Vocabulary Density: (a measurement of vocabulary usage in comparison to the length of a document. Think of how many words will be read on average before a new word is encountered.)

Rank     1               2               3               4               5
Highest  PS_009 (0.387)  PS_005 (0.357)  PS_014 (0.307)  PS_012 (0.284)  PS_008 (0.282)
Lowest   PS_006 (0.198)  PS_010 (0.206)  PS_011 (0.223)  PS_007 (0.236)  PS_004 (0.236)
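As a rough illustration, vocabulary density can be computed as the number of unique word forms divided by the total number of words. The sketch below assumes that definition and uses a simple regex tokenizer, so its results will not match Voyant Tools exactly; the function name is hypothetical.

```python
import re

def vocabulary_density(text):
    """Unique word forms divided by total words (assumed definition)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Corpus-level figures reported above for the thirteen Phantom Sundays
phantom_density = 5282 / 52982  # about 0.0997
```

The corpus-level value (about 0.100) is lower than that of any single story because longer texts repeat more words, so densities are best compared between texts of similar length.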

Readability Index: (an estimation of how difficult a text is to read. The estimation is made by measuring a text's complexity. Measurable attributes of texts such as word lengths, sentence lengths, syllable counts, and so on give us ways to measure the complexity of a text.)

Rank     1               2               3               4               5
Highest  PS_008 (7.115)  PS_012 (6.912)  PS_003 (6.895)  PS_005 (6.728)  PS_007 (6.038)
Lowest   PS_011 (3.547)  PS_004 (4.493)  PS_014 (4.836)  PS_009 (5.473)  PS_002 (5.551)
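The page does not say which readability formula lies behind these numbers. The sketch below assumes the Coleman-Liau index, which uses letters per 100 words (L) and sentences per 100 words (S): CLI = 0.0588 L - 0.296 S - 15.8. The word and sentence splitting are simplifications, so the values will differ from those reported by Voyant Tools.

```python
import re

def coleman_liau(text):
    """Coleman-Liau index with naive word and sentence splitting (assumptions)."""
    words = re.findall(r"[A-Za-z]+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    letters = sum(len(w) for w in words)
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = len(sentences) / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8
```

Short exclamations such as those in speech bubbles raise the sentence count per 100 words and therefore lower the index, which fits the generally low values for comic strip text.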

Adding the eight Mandrake Sundays to the corpus, it now consists of 75,018 total words and 6,607 unique word forms.

Vocabulary Density:

Rank     1               2               3               4               5
Highest  PS_009 (0.387)  MS_025 (0.358)  PS_005 (0.357)  MS_027 (0.331)  MS_028 (0.320)
Lowest   PS_006 (0.198)  PS_010 (0.206)  PS_011 (0.223)  PS_007 (0.236)  PS_004 (0.236)

Readability Index:

Rank     1               2               3               4               5
Highest  MS_028 (7.863)  MS_026 (7.733)  MS_024 (7.378)  MS_027 (7.360)  MS_025 (7.222)
Lowest   PS_011 (3.547)  PS_004 (4.493)  MS_012 (4.668)  MS_013 (4.767)  PS_014 (4.836)