The number of lines for each character by percentage of the series

Submitted 7 months ago by danielquinn@lemmy.ca to startrek@startrek.website

It would seem that I have far too much time on my hands. After the post about a Star Trek “test”, I started wondering if there could be any data to back it up and… well here we go:

The Next Generation

Name	Percentage of Lines
PICARD	20.16
RIKER	11.64
DATA	10.1
LAFORGE	6.93
WORF	6.14
TROI	5.4
CRUSHER	5.11
WESLEY	2.32

DS9

Name	Percentage of Lines
SISKO	13.0
KIRA	8.23
BASHIR	7.79
O’BRIEN	7.31
ODO	7.26
QUARK	6.98
DAX	5.73
WORF	3.18
JAKE	2.31
GARAK	2.29
NOG	2.01
ROM	1.89
DUKAT	1.76
EZRI	1.53

Voyager

Name	Percentage of Lines
JANEWAY	17.7
CHAKOTAY	8.76
EMH	8.34
PARIS	7.63
TUVOK	6.9
KIM	6.57
TORRES	6.45
SEVEN	6.1
NEELIX	4.99
KES	2.06

Enterprise

Name	Percentage of Lines
ARCHER	24.52
T’POL	13.09
TUCKER	12.72
REED	7.34
PHLOX	5.71
HOSHI	4.63
TRAVIS	3.83
SHRAN	1.26

Discovery

Note: This is a limited dataset, as the source site only has transcripts for seasons 1, 2, and 4.

Name	Percentage of Lines
BURNHAM	22.92
SARU	8.2
BOOK	6.21
STAMETS	5.44
TILLY	5.17
LORCA	4.99
TARKA	3.32
TYLER	3.18
GEORGIOU	2.96
CULBER	2.83
RILLAK	2.17
DETMER	1.97
OWOSEKUN	1.79
ADIRA	1.63
COMPUTER	1.61
ZORA	1.6
VANCE	1.07
CORNWELL	1.07
SAREK	1.06
T’RINA	1.02

If anyone is interested, here’s the (rather hurried) Python used:

#!/usr/bin/env python

#
# This script assumes that you've already downloaded all the episode lines from
# the fantastic chakoteya.net:
#
# wget --accept=html,htm --relative --wait=2 --include-directories=/STDisco17/ http://www.chakoteya.net/STDisco17/episodes.html -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Enterprise/ http://www.chakoteya.net/Enterprise/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/Voyager/ http://www.chakoteya.net/Voyager/episode_listing.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/DS9/ http://www.chakoteya.net/DS9/episodes.htm -m
# wget --accept=html,htm --relative --wait=2 --include-directories=/NextGen/ http://www.chakoteya.net/NextGen/episodes.htm -m
#
# Then you'll probably have to convert the following files to UTF-8 as they
# differ from the rest:
#
# * Voyager/709.htm
# * Voyager/515.htm
# * Voyager/416.htm
# * Enterprise/41.htm
#

import re
from collections import defaultdict
from pathlib import Path

EPISODE_REGEX = re.compile(r"^\d+\.html?$")
LINE_REGEX = re.compile(r"^(?P<name>[A-Z']+): ")

EPISODES = Path("www.chakoteya.net")
DISCO = EPISODES / "STDisco17"
ENT = EPISODES / "Enterprise"
TNG = EPISODES / "NextGen"
DS9 = EPISODES / "DS9"
VOY = EPISODES / "Voyager"


class CharacterLines:
    def __init__(self, path: Path) -> None:
        self.path = path
        self.line_count = defaultdict(int)

    def collect(self) -> None:
        for episode in self.path.glob("*.htm*"):
            if EPISODE_REGEX.match(episode.name):
                for line in episode.read_text(encoding="utf-8").split("\n"):
                    if m := LINE_REGEX.match(line):
                        self.line_count[m.group("name")] += 1

    @property
    def as_percentages(self) -> dict[str, float]:
        total = sum(self.line_count.values())
        r = {}
        for k, v in self.line_count.items():
            percentage = round(v * 100 / total, 2)
            if percentage > 1:
                r[k] = percentage
        return dict(reversed(sorted(r.items(), key=lambda item: item[1])))

    def render(self) -> None:
        print(self.path.name)
        print("| Name             | Percentage of Lines |")
        print("| ---------------- | ------------------- |")
        for character, pct in self.as_percentages.items():
            print(f"| {character:16} | {pct} |")


if __name__ == "__main__":
    for series in (TNG, DS9, VOY, ENT, DISCO):
        counter = CharacterLines(series)
        counter.collect()
        counter.render()
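
The header comment in the script says a handful of files need converting to UTF-8 first, but doesn't show how. Below is a minimal sketch of that step. The `convert_to_utf8` helper and its `source_encoding` default of `cp1252` are assumptions for illustration: the post doesn't say which encoding those files actually use, so the default may need changing.

```python
from pathlib import Path


def convert_to_utf8(path: Path, source_encoding: str = "cp1252") -> None:
    """Re-encode a single file in place as UTF-8.

    source_encoding is a guess (hypothetical default): adjust it if the
    decoded text looks wrong or decoding raises UnicodeDecodeError.
    """
    # Decode the raw bytes using the assumed original encoding.
    text = path.read_bytes().decode(source_encoding)
    # Write bytes back directly so line endings are left untouched.
    path.write_bytes(text.encode("utf-8"))
```

Calling it once per file listed in the header comment (for example `convert_to_utf8(Path("www.chakoteya.net") / "Voyager" / "709.htm")`) should be enough before running the main script.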


Comments


Corgana@startrek.website 7 months ago
Fascinating stuff. I love that you did this. I’m surprised Morn didn’t rank higher considering how chatty he is in every scene.

- ericjmorey@discuss.online 7 months ago
  Number of lines vs number of words spoken vs length of time speaking probably would have a lot of variation in results.
  
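
The lines-versus-words distinction raised above can be checked with a small variation on the counting step. This is only a sketch: `SPOKEN_REGEX` and `count_lines_and_words` are hypothetical names, and it assumes each speech sits on the same physical line as its `NAME: ` prefix, which may not hold for every transcript. Any HTML tags left in the text would also be counted as words.

```python
import re
from collections import defaultdict

# Same speaker pattern as the original script, plus a group for the speech.
SPOKEN_REGEX = re.compile(r"^(?P<name>[A-Z']+): (?P<text>.*)$")


def count_lines_and_words(lines: list[str]) -> dict[str, tuple[int, int]]:
    """Return {speaker: (line_count, word_count)} for transcript lines."""
    line_count: dict[str, int] = defaultdict(int)
    word_count: dict[str, int] = defaultdict(int)
    for line in lines:
        if m := SPOKEN_REGEX.match(line):
            name = m.group("name")
            line_count[name] += 1
            # Whitespace-separated tokens stand in for "words" here.
            word_count[name] += len(m.group("text").split())
    return {name: (line_count[name], word_count[name]) for name in line_count}


sample = ["PICARD: Make it so.", "DATA: Aye, sir.", "PICARD: Engage.", "[Bridge]"]
print(count_lines_and_words(sample))
# → {'PICARD': (2, 4), 'DATA': (1, 2)}
```

Speaking time can't be recovered from transcripts alone, so word counts are probably the closest proxy available here.
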
milkisklim@lemm.ee 7 months ago
This definitely goes to show why people felt Discovery was the Michael Burnham show. Not that she had an unusual number of lines, but that no one else spoke even half as much as she did, with the remaining lines spread across more characters than in the other series.

Also, does GEORGIOU count for both prime and mirror versions of the character?

- danielquinn@lemmy.ca 7 months ago
  That was my takeaway as well. I just wish I had data for the other seasons. It’d be interesting to see how that might change the percentages as they are.
  
  As for GEORGIOU, I’m reasonably sure that this refers to both versions of her.
  
- exocrinous@startrek.website 7 months ago
  Georgiou also got fridged for Michael’s character development. And then we follow Michael over the timeskip. Right out the gate, the universe exists to tell a story about Michael.
  
- rob_t_firefly@lemmy.world 7 months ago
  Regarding Georgiou: as the prime version of Georgiou’s lines basically amounted to “Hi!” “Oh crap!” “Bye!”, the overall math shouldn’t be too affected.
  
deegeese@sopuli.xyz 7 months ago
Thanks for sharing. I notice chakoteya.net has TOS scripts. Is there any reason they weren’t included in the analysis?

- danielquinn@lemmy.ca 7 months ago
  Honestly, it’s 'cause I forgot to include it! I’ll see if I can add it tonight. Check back in 24hrs :-)
  
  - deegeese@sopuli.xyz 6 months ago
    Thanks for the update.
    
    Poor Chekov has almost no lines, but Koenig was great as Bester on B5.
    
ValueSubtracted@startrek.website 7 months ago
Wow, Tarka was a chatty sonofagun.

Indy@startrek.website 7 months ago
This is beautiful! I love data and I’m delighted you were inspired by my post to gather the data.

Thank you for doing this!

usernamefactory@lemmy.ca 7 months ago
Fascinating! It would be illuminating to see this broken up by season as well. Seven of Nine’s relatively low ratio, for instance, can definitely be attributed to her late arrival to the series. In the later seasons, I suspect her percentage could be rivalling Janeway’s.

Conversely, it’s impressive Lorca ranks as highly as he does, given he was gone by the end of Disco season one. But since he was simultaneously captain and antagonist while he was around, I guess it isn’t that surprising.

clay_pidgin@sh.itjust.works 7 months ago
Maybe the two Dax hosts on DS9 should be combined, as they didn’t overlap.
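
Combining the two hosts could be done by folding one transcript name into another before the percentages are computed. A rough sketch, where `ALIASES` and `merge_aliases` are hypothetical names and not part of the original script:

```python
from collections import defaultdict

# Hypothetical alias table: transcript name -> name to count it under.
# EZRI -> DAX follows the suggestion above.
ALIASES = {"EZRI": "DAX"}


def merge_aliases(line_count: dict[str, int], aliases: dict[str, str]) -> dict[str, int]:
    """Return a new count dict with aliased names folded into their targets."""
    merged: dict[str, int] = defaultdict(int)
    for name, count in line_count.items():
        # Names without an alias are kept under their own name.
        merged[aliases.get(name, name)] += count
    return dict(merged)
```

Going by the DS9 table in the post, a combined DAX would come to about 7.26% (5.73 + 1.53, subject to rounding), roughly level with ODO.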
