
Tuesday, October 1, 2024

A House of Sky and Breath (Sarah J. Maas)

Five years ago I reported on the frequency of throat bobs in Sarah J. Maas's books. She has since published four more books.  I've read three of them and am happy to share an updated chart of the progression of throat bobs across Maas's oeuvre.  The steady rate since page 4800 has continued, with a few runs here and there during especially emotional moments.

One major change is the number of unnamed characters whose throats bob.  For the first 7000 pages or so, only named characters' throats bobbed, but now four unnamed characters' throats have bobbed in the last 2000+ pages (though I suspect that the unnamed alpha mystic will get a name in House of Flame and Shadow).

  




House of Sky and Breath boasts the third-most throat bobs among the novels:


... and Bryce is now tied with Aelin as the queen of throat bobbing, with a whole book left in which to take first place.




Saturday, December 21, 2019

Netflix DVD rental history

We joined Netflix in late November, 2004, with the 3 DVDs out at a time plan. Nadal had yet to play at the French Open; Federer won his second Wimbledon and the U.S. Open; Phelps won 6 gold medals at the Olympics; the Boston Red Sox had won their first World Series in 86 years. And we had a 6-month-old. So we were trendsetters and cut the cable.
Fifteen years later, we're dinosaurs holding on to an old service. Yes, we have Netflix streaming, but had been holding on to 1 DVD out at a time for the past five years in order to see new releases a little sooner. However, we knew we would want Disney+, and the local library tends to carry new DVD releases, so it was an easy switch this fall.
Now, of course, I wanted my Netflix DVD rental history, but Netflix doesn't provide a download, so I had to select, copy, and paste from a webpage.

The resulting file was not a nicely formatted CSV. This provided a nice opportunity to practice some general file processing.
Examining the structure of the file, each rental takes up several lines, which in order contain:
  1. A number indicating the order of the rental, from the most recent rental to the oldest (exactly 1500 rentals!)
  2. The title of the rental
  3. A blank line
  4. The release date, rating, and runtime, concatenated into a single string
  5. A blank line
  6. The ship date from Netflix's processing facility and the return date
Some rentals have an additional three lines, indicating:
  1. Whether the original shipment was damaged
  2. A blank line
  3. The ship date from Netflix's processing facility and the return date for the replacement disc


From this file format, I wanted to create an initial output table where each row was a separate rental, and each column was a line from the file (I would further process the output table later). My initial parsing script identifies the start of a new rental by checking whether a line's value is an integer, reads the intervening lines as elements of a list, and builds the rental history as a list of lists to be transformed into an output table.
The core of this parsing code is as follows:
with open(filepath) as fp:
    line = fp.readline()
    record = []
    viewingHistory = []
    while line:
        try:
            # A line that parses as an integer marks the start of a new rental
            int(line)
            viewingHistory.append(record)
            record = []
        except ValueError:
            record.append(line)
        line = fp.readline()
    # This final append is necessary in order to get the final record in
    viewingHistory.append(record)
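(If you were doing this with pandas rather than Dataiku, turning that list of lists into the initial output table is a one-liner; this wasn't part of my actual flow, just a sketch.)

import pandas as pd

# Each inner list becomes a row of the initial output table; records with
# extra lines (damaged discs) are padded with NaN in the trailing columns.
table = pd.DataFrame(viewingHistory)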
With the output table, I could use some of the visual tools in Dataiku to further prepare the data. Specifically, I:
  1. Removed the empty "junk" columns
  2. Removed the initial empty row created by my parsing script; I didn't bother to think of a clever way to not produce it
  3. Parsed out the ship date and return date from the column containing that information
  4. Computed the difference between the ship and return dates to give how long we had the disc out, in days (a rough pandas sketch of steps 3 and 4 appears after the rating code below)
  5. Parsed out the release date, rating, and run time from the column containing that information. This required using a Python code step, and checking the various MPAA and TV ratings in an order that avoids partial matches (for example, "PG-13" before "PG").
def process(row):
    # In 'row' mode, the process function must return the full row.
    # The 'row' argument is a dictionary of the row's columns and can be
    # modified in place.  Here, we add three new columns.
    # The ratings are listed in an order that avoids partial matches
    # (e.g., "PG-13" before "PG", and "NR" before "R").
    ratings = ["PG-13","NC-17","TV-PG","TV-14","TV-MA","TV-Y7","PG","NR","G","R"]
    for rating in ratings:
        split = row["infoblob"].split(rating)
        if (split[0] != row["infoblob"]):
            # A successful split means the rating was found: everything before
            # it is the release date, everything after is the run time.
            row["releaseDate"] = split[0]
            row["rating"] = rating
            row["length"] = split[1]
            break
    return row
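(For what it's worth, steps 3 and 4 could also be sketched in plain pandas instead of Dataiku's visual steps; the column name "shipReturn" and the date format below are made up for illustration, not the actual export format.)

import pandas as pd

# Hypothetical ship/return column, formatted as "MM/DD/YY - MM/DD/YY"
df = pd.DataFrame({"shipReturn": ["11/29/04 - 12/06/04", "12/07/04 - 12/20/04"]})

# Step 3: split out the ship date and the return date
parts = df["shipReturn"].str.split(" - ", expand=True)
df["shipDate"] = pd.to_datetime(parts[0], format="%m/%d/%y")
df["returnDate"] = pd.to_datetime(parts[1], format="%m/%d/%y")

# Step 4: how long we had the disc out, in days
df["daysOut"] = (df["returnDate"] - df["shipDate"]).dt.days
print(df)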
With the prepared dataset, we can do some visualizations. An obvious start is the number of rentals per year. As a coworker noted, 2010 was an exciting year...

We can also look at the number of rentals by release date. As expected, most are from recent years, but we clearly had a backlog of old movies we were working through, too.
Note that this only has 958 records, instead of 1500, because television shows did not have release dates. Further work on the project might be to capture that information, especially since it's a third of the rentals.

This is a visualization of how we worked through the backlog of old movies. The Y-axis is the age of the rental movie in years, so it's not just the release date of the movie that matters, but the release date relative to the year in which we're watching it. For the first several years, there's quite a lot of scatter in the age of the movies, but this peters out as we work through the backlog, and by mid-2014, we're renting only the occasional "old movie" and are focused primarily on new releases. This is also when we decided to go to 1 DVD out at a time.

I'm not sure there's anything to build a model on. That may come later.

Monday, March 24, 2014

Sudoku data

I was given the 2012 and 2013 "Original Sudoku Calendars" as presents prior to the starts of those years, so naturally I recorded the time it took me to complete each puzzle and the date on which I completed it.  A quick perusal of the data shows:
  1. I'm not particularly good at Sudoku; I even started keeping track of when I screwed up and had to erase completely and start over.
  2. There are some missing times, either because I forgot to record them, because I was unable to finish the puzzle, or (in the case of a few at the beginning of 2012) because I hadn't started keeping the dataset yet.
  3. 2013 saw the introduction of "visual" sudoku, in which various symbols are used in place of Arabic numerals.  For a variety of reasons, I'm even worse at these than "regular" sudoku.
  4. I tended to work on the puzzles in clumps, finishing several each evening for a few days (or an airplane flight), and then not doing them for a week to a month (or more).
While working on the puzzles in 2012, I suspected that the "Medium" difficulty puzzles weren't any easier than the "Hard" difficulty puzzles; also, after spending so many hours doing sudoku puzzles, I wondered whether I'd gotten any better at solving them.  Well, after some painstaking record keeping, we can answer that with data science!  So, first a quick look at some descriptive statistics of the time it took to complete each puzzle by puzzle difficulty:

Well, hell.  The Puzzle Difficulty is sorted in alphabetical order because I used text in the Excel spreadsheet instead of a numeric encoding (1 = Very Easy, 2 = Easy, and so on), so the table's out of order.  So, after a little recoding:
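(If the data lived in pandas rather than Excel and a stats package, something like the following would do the recoding; the column names and values here are made up for illustration.)

import pandas as pd

# Toy example of the recode: order the difficulty labels explicitly so that
# summaries sort Very Easy -> Easy -> Medium -> Hard instead of alphabetically.
df = pd.DataFrame({
    "PuzzleDifficulty": ["Hard", "Easy", "Medium", "Very Easy", "Hard"],
    "Minutes": [42.0, 11.5, 18.0, 7.0, 55.0],
})
order = ["Very Easy", "Easy", "Medium", "Hard"]
df["PuzzleDifficulty"] = pd.Categorical(df["PuzzleDifficulty"],
                                        categories=order, ordered=True)
print(df.groupby("PuzzleDifficulty", observed=False)["Minutes"].describe())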

Better!  And I even remembered to save the image as a hard-G GIF rather than a JPG this time, so there's no moiré pattern.  What jumps out at me from this is that:
  1. The mean time to complete for each category is about 4 minutes more than for the previous category.  That's eerie.  It's also not great starting evidence for my theory about Medium vs. Hard.
  2. There are many more missing "Hard" puzzles.  
  3. The median values are all lower than the mean values, suggesting that the distribution of time to complete is skewed for each difficulty level.  That's kind of a "duh" observation, because it was to be expected, but I should check just how skewed the distributions are.
  4. The maximum value for the "Easy" puzzles is higher than for the "Medium" puzzles.  This is, I think, due to the "Visual" puzzles.
Breaking the information down further by year, and putting it in a graph so that we can better see the distribution of values, suggests that:
  1. I shouldn't be worried about skewness when going to perform statistical tests; yes, there are some outlying values, especially on Easy, but I think these are all the "Visual" puzzles.
  2. It looks like I may have gotten better at the Easy and Medium puzzles from 2012 to 2013, but was no better at the Hard puzzles.  
  3. My theory that I was no better at the Medium puzzles than the Hard puzzles (in 2012) is alive!





A quick look at boxplots, paneled by whether I messed up and whether the puzzle was "visual", confirms for me that I should go ahead and start building models.

So let's do a general linear model of the time to complete based on difficulty, whether the puzzle was visual, whether I messed up, and the year, with all two-way interactions.
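(The model above was fit in a stats package's GLM procedure; here's a rough sketch of an equivalent in Python with statsmodels, using a hypothetical export of the puzzle log and made-up column names.)

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical export of the puzzle log; the file name and the column names
# (Minutes, PuzzleDifficulty, Visual, MessedUp, Year) are assumptions.
df = pd.read_csv("sudoku_times.csv")

# Main effects plus all two-way interactions, analogous to the model described
# above.  (The Visual*Difficulty and Visual*Year terms will come out redundant,
# since all visual puzzles were Easy puzzles in 2013.)
model = smf.ols(
    "Minutes ~ (C(PuzzleDifficulty) + C(Visual) + C(MessedUp) + C(Year))**2",
    data=df,
).fit()

# Tests of effects, roughly comparable to a tests-of-effects table
print(sm.stats.anova_lm(model, typ=2))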
Lots of good stuff here.

  1. The puzzle difficulty contributes the most to the model (duh, but good to have the confirmation)
  2. Whether the puzzle was visual and whether I messed up and had to start over had roughly equal effects on the model.  
  3. year did not have a significant effect on the model, which suggests that I did not, overall, get better from 2012 to 2013; however, the interaction of puzzle difficulty and year is significant, which means that the differences between 2012 and 2013 that we saw in the Easy and Medium puzzles may be real effects, and not just noise
  4. The interaction of PuzzleDifficulty and Visual is a redundant effect, because the visual puzzles were all of Easy difficulty.  Likewise, the visual puzzles all appeared in 2013, so Visual*year is redundant.
  5. PuzzleDifficulty*Messedup is statistically significant.  This means that messing up on a Hard puzzle will have a different, and almost certainly greater, effect on the time to complete the puzzle than messing up on an Easy puzzle (duh, but good to have the reminder)
  6. Visual*Messedup is also statistically significant. This means that messing up on a Visual puzzle (which we have seen take longer than regular Easy puzzles) will have a different, and almost certainly greater, effect on the time to complete than messing up on a regular Easy puzzle.  This is really the same type of effect as PuzzleDifficulty*Messedup
  7. Messedup*year is not significant.  Screwing up in 2013 was no better or worse than screwing up in 2012.

We could, at this point, look at the parameter estimates table, but it's big and messy and won't tell us anything important that we can't get from the tests of effects table above and the estimated marginal means table below.

  1. Messing up roughly doubles the amount of time spent on a puzzle.  This makes sense; often, it wasn't until I was near the end that I'd realize that two 7s were in the same row, column, or square (or the like).
  2. Visual puzzles took about twice as long to finish as regular Easy puzzles, and messing up a visual puzzle compounded the error.
  3. I did, in fact, get better at Easy and Medium puzzles from 2012 to 2013.
  4. (*) I appear not to have gotten better at Hard puzzles from 2012 to 2013, and I appear to have been no better at Medium puzzles than Hard puzzles in 2012.

I'm putting an asterisk next to this last conclusion for the important reason that most of the missing Hard puzzles are from 2012.  It's extremely likely that these are puzzles that I failed to solve at the time, and would have taken me a long time to solve if I'd kept at them.  I would hazard a guess that I did, in fact, improve on Hard puzzles from 2012 to 2013, and that I took less time to finish the Medium puzzles than Hard puzzles in 2012; my belief that the Hard puzzles in 2012 were no more difficult than the Medium puzzles was predicated upon a lack of completion times for the hardest of the Hard puzzles.

Now, I could make certain assumptions about how long it would have taken me to complete the missing puzzles and re-run the model to do some "What if?" analysis, but I want to stop here for now.  There are also some other methods for modeling whether I got better at solving these puzzles over time; maybe I'll look at them later.


I should also note that I didn't receive the 2014 calendar, and so will continue the dataset using puzzles from http://www.websudoku.com/ (though I haven't attempted any this year, so the future of this dataset is actually not very clear).  The biggest benefit to using puzzles from websudoku.com is that I can link to the exact puzzle, so that my own dataset could be merged with datasets kept by other people recording their times.

Friday, September 13, 2013

This rabbit hole leads to Triple Town data analysis

Well, how did I get here?
  1. Commenting on a previous post, a friend suggested a new game to try.  
  2. The Wikipedia page for Gone Home notes it uses the Unity Engine.
  3. The Unity Engine page lists Triple Town as a client app.
  4. The Triple Town page notes some research done on the distribution of tiles.
  5. This Triple Town Tribune post by Andrew Brown mentions that the data was collected by David de Kloet.
  6. The data was originally shared in the comments of this post, which I found by searching on "David de Kloet triple town" in G+. (NOTE: I also asked Andrew if he had a copy of the data, which he was kind enough to share)
Now, the distributions are a good start, but my first question is: does the probability of getting a particular tile next depend upon the tile you've just been given?  For example, if you've just been given a bear, are you more or less likely than the overall estimated probability of 0.15 to get another bear with your next tile?

A quick way to start examining this is to look at a crosstabulation of the tile you've just been given by the next tile.  In this table, the rows show the current tile, and the columns show the next tile.  Looking at the first row, this means that of the 1043 total times that a Bear appeared, it was followed by another Bear 157 times, by a Bush 157 times, by Grass 637 times, and so on.


The raw counts are useful, but it can be hard to compare rows to see if they're different.  So now let's look at the row proportions.  Again, the rows show the current tile, and the columns show the next tile.  Looking at the first row, what this means is that of all the times that a Bear appeared, it was followed by another Bear 15.1% of the time, by a Bush 15.1% of the time, by Grass 61.1% of the time, and so on.  Bears appear pretty consistently around the overall average of 15.1% of the time, though they seem to appear less often after a Hut (11.4%) and more often after a Tree (18.4%).  However, Huts and Trees don't appear very often, so these differences could be due to chance.
 

So let's look at the chi-square tests.  SPSS Statistics produces both Pearson and Likelihood Ratio chi-square tests.  Since the significance values of both tests are under 0.05, this suggests that there is, in fact, a relationship between what tile you have now and the next tile you'll get, but I'm a little worried about the large number of cells with low expected counts.  That can throw the results of the tests off.
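(For anyone following along outside SPSS, the crosstab and both chi-square tests can be reproduced in Python along these lines; the file name and column name are my assumptions about how the shared data might be formatted.)

import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical tile log: one row per tile drawn, in order
tiles = pd.read_csv("tripletown_tiles.csv")["tile"]

# Crosstab of the current tile vs. the next tile
table = pd.crosstab(tiles[:-1].values, tiles[1:].values,
                    rownames=["current"], colnames=["next"])
print(table)
print(table.div(table.sum(axis=1), axis=0).round(3))  # row proportions

# Pearson and likelihood-ratio chi-square tests
chi2, p, dof, expected = chi2_contingency(table)
lr, lr_p, _, _ = chi2_contingency(table, lambda_="log-likelihood")
print(f"Pearson chi-square = {chi2:.1f} (p = {p:.4f}), "
      f"likelihood ratio = {lr:.1f} (p = {lr_p:.4f})")
# And the thing I'm worried about: cells with low expected counts
print((expected < 5).sum(), "cells have expected count < 5")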

So another set of tests to look at is the pairwise comparisons of the column proportions.  (NOTE: I really want to compare the row proportions, but that's not an option, so I've reorganized the table so that the current tile is in the columns and the next tile is in the rows.)  At any rate, the tests suggest that when your current tile is a Tree, the distribution of your next tile is different from when your current tile is a Bear, Bush, or Grass.

Looking back at the table of proportions, what's not clear from the test is whether the detected statistically significant differences come from the relatively higher rate of Bears and lower rate of Bushes when the current tile is a Tree, or from the relatively higher rate of Huts and lower rates of Bots and Trees.  The latter differences arise from very rare events, and I wouldn't trust results based on them.  We could re-run the test while ignoring those columns, but for now I'm pretty comfortable saying that there is no practically significant relationship between the current tile and the probability distribution of the next tile.


These tables were produced using this SPSS Statistics syntax on this text file.

Tuesday, August 20, 2013

We're out of beta, we're releasing on time

At the risk of corporate shilling, I'm actually kinda excited to finally be able to talk about the product I've been working on, because we shipped v1.0 at the end of May***.  It's called IBM SPSS Analytic Server, and its purpose is to coordinate the execution of analytic jobs (like, building a neural network to predict loads on an electric utility's network) in a distributed environment (like Hadoop).  The 1.0 version of Analytic Server is integrated with IBM SPSS Modeler 15, an existing data mining workbench, and IBM SPSS Analytic Catalyst 1, which is a totally new thin-client application that automates some exploratory steps in order to jump-start deeper analysis of data.

If interested, you can read more about it at the Analytic Server information center.


*** Yes, yes, I meant to post about it then, but have been swamped by other things.  Also, slacking off on posting this has made this my 500th published post on this blog.  Woo.

Thursday, July 18, 2013

Statistical follies: "peak" cars in VT

UVM economist Art Woolf suggests 
Vermont may be on the declining side of a “peak cars” curve.
...and offers the following graph as evidence.
 

The problem is that simply looking at the number of registrations is misleading, because registrations are strongly associated with the size of the driving population.  This is noted in the title and article, but the author doesn't actually, you know, deal with it.  We're really interested in whether the typical Vermonter is more or less likely to own a car now than 10-20 years ago.   A more useful chart would look at the number of registrations, relative to the size of the driving population (say, by plotting the ratio of registrations to population on the vertical axis) over time. 
Moreover, the numbers in the bar chart in the article don't match the numbers from the census (the census is counting buses, but the census numbers are lower than Woolf's, so maybe Vermont has negative buses?).  And, of course, the article contains no references or (heaven forbid, because it's only 2013) links to the data source the author is using.
 
Using Census motor vehicle registration and population data, I see something more like:

Year                        1980  1985  1990  1995  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009
Registrations (thousands)    347   398   462   492   515   534   537   516   523   508   588   565   581   557
Population (thousands)       511   530   563   589   609   612   615   617   618   619   620   620   621   622
Registrations / population  0.68  0.75  0.82  0.84  0.85  0.87  0.87  0.84  0.85  0.82  0.95  0.91  0.94  0.90
 
The last row is the ratio of registrations to population, suggesting that the rate of vehicle ownership has increased over the last 30 years.  I'm a little suspicious (okay, a lot suspicious) of the enormous jump from 2005 to 2006, but I have no experience with the data from which I can form a hypothesis, other than, "someone should check that out."
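(For completeness, here's roughly how the chart I'm asking for could be built from the Census figures above; the numbers are just the table re-typed, in thousands.)

import matplotlib.pyplot as plt

# Census figures from the table above, in thousands
years = [1980, 1985, 1990, 1995, 2000, 2001, 2002, 2003,
         2004, 2005, 2006, 2007, 2008, 2009]
registrations = [347, 398, 462, 492, 515, 534, 537, 516, 523, 508, 588, 565, 581, 557]
population    = [511, 530, 563, 589, 609, 612, 615, 617, 618, 619, 620, 620, 621, 622]

ratio = [r / p for r, p in zip(registrations, population)]
plt.plot(years, ratio, marker="o")
plt.xlabel("Year")
plt.ylabel("Registrations per capita")
plt.title("Vermont motor vehicle registrations relative to population")
plt.show()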

Monday, October 22, 2012

The future of statistical journals

Larry Wasserman's Rant on Refereeing comes hard on the heels of Karl Rohe's excellent Tale of Two Researchers in the latest Amstat News.  

What's interesting to me is that the journals stopped being useful for keeping up-to-date on cutting edge research long ago; 15+ years ago, technical reports were readily available for download (in postscript format!) from departmental websites, and e-mail chains kept people within a field apprised of new work that was ready to be published.  By the time an article was actually published in a journal, everyone who "mattered" had already read it and weighed in on it, even if they weren't referees.  Things like arXiv are simply the natural evolution of this process.

So why hasn't the vestigial journal apparatus finally fallen away?  Presumably because it takes more than a couple of decades to change something that has been around for a few hundred years.  (duh)  Again, I think even 15 years ago, your fellows should have already known whether you're doing good work, regardless of where you've published**.  So is the real problem the administration, which only has where and how often you've published to go by when determining whether to give you tenure?  Even then, things like CiteSeer make it easy to count citations, even for unpublished work.

So... it's the researchers themselves who will finally have to kill the journals by refusing to submit papers to anything but arXiv (or equivalent)?

** then again, I've been out of academia for 14 years, so maybe I'm totally wrong about this and the field of statistics hasn't progressed in that time

Tuesday, September 25, 2012

Poster child for the problems with WAR

These are Adam Dunn's 2004-2010 seasons.  While I think that the 2004 season was his best and 2006 his worst in this time period**, he was generally a model of consistency.  Unfortunately, while WAR and the human eye agree about his 2004 season, I can't wrap my head around the idea that 2009 was his worst NL season and 2010 was his third best.

Year   GP   AB    H  2B  3B  HR   BB   SO  SB  CS    AVG    OBP    SLG    OPS   WAR
2004  161  568  151  34   0  46  108  195   6   1  0.266  0.388  0.569  0.957   4.4
2005  160  543  134  35   2  40  114  168   4   2  0.247  0.387  0.540  0.927   2.6
2006  160  561  131  24   0  40  112  194   7   0  0.234  0.365  0.490  0.855   0.1
2007  152  522  138  27   2  40  101  165   9   2  0.264  0.386  0.554  0.940   1.2
2008  158  517  122  23   0  40  122  164   2   1  0.236  0.386  0.513  0.899   0.6
2009  159  546  146  29   0  38  116  177   0   1  0.267  0.398  0.529  0.927  -0.6
2010  158  558  145  36   2  38   77  199   0   1  0.260  0.356  0.536  0.892   2.2


Looking at Dunn's Player Value -- Batters table on baseball-reference.com, the problem is the vagaries of DWAR.  The OWAR rankings of his seasons fairly closely follow OPS (I'm assuming any slight differences are due to the fact that what constitutes a "good" OPS changed slightly from season to season), and thus the wild variation in DWAR, which is due more to the small-sample nature of defensive statistics than any actual change in performance, dominates the year-to-year differences in WAR.

Year  OWAR  DWAR
2004   4.8  -1.2
2005   4.0  -2.3
2006   1.9  -2.4
2007   3.7  -3.2
2008   3.0  -3.2
2009   3.7  -5.2
2010   3.4  -2.1


While WAR isn't simply OWAR + DWAR, DWAR clearly plays an important role in devaluing WAR as an estimate of player worth in a given year, and I'd rather look at OWAR.  But if OWAR closely follows OPS over the course of a generation of players, then I'd rather simply look at OPS, which is more intuitive and immediately evident from the seasonal stats.

** I want to be absolutely, positively clear that we're talking about 2004-2010, and not looking at his 2011 season, which was arguably the worst all-time.

Friday, September 7, 2012

dWAR

I like numbers, and not surprisingly, I like baseball.  More specifically, I like looking at the numbers that the game of baseball produces, and the statistics that researchers have come up with to compare player performance.

But I have difficulty loving defensive wins above replacement, partly because dWAR is susceptible to wild swings from season to season, but especially because believing in dWAR means I have to believe that Dave Winfield was as bad a fielder as, or worse than, Manny Ramirez.  (Scroll to the "Player Value" table and see that Winfield's career dWAR is lower than Manny's.)  I'm just having a hard time with that idea.

Saturday, June 23, 2012

Problems in reporting high school dropout rates

I first saw FlowingData mention a display to call attention to the insanely high number of high school dropouts.  It says 857 per hour.  Wow.  Do they really mean in the U.S. alone?  That would be 7.5 million per year, or more than the entire class of seniors.  That can't be right.  Well, according to the New York Times report, it's "every single hour, every single school day".... ah, that's getting closer to real numbers, and according to the College Board's own page on the installation, it's "More than 1.2 million students drop out of school every year, which averages out to 6,000 students every school day and 857 every hour".
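(Back-of-the-envelope arithmetic on where the College Board's numbers seem to come from; the 200-day school year and 7-hour school day are my guesses at their assumptions.)

# Rough sanity check on the quoted figures; the school-year length and
# school-day length here are my assumptions, not the College Board's stated ones.
dropouts_per_year = 1_200_000
per_school_day = dropouts_per_year / 200       # ~6,000 per school day
per_school_hour = per_school_day / 7           # ~857 per school hour

# What 857/hour implies if you (incorrectly) apply it around the clock, all year
naive_annual = 857 * 24 * 365                  # ~7.5 million
print(round(per_school_day), round(per_school_hour), naive_annual)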


Okay, that sounds better, but still not quite right.  The 2010 American Community Survey reports that 16.8% of 18-24 year-olds do not have a high school education (this number drops to 14.4% for the 25 and older population, presumably because some of those 18-20 year-olds eventually finish or get their GEDs).  That 18-24 year-old demographic constitutes about 10% of a 300 million person population, or very roughly 4.5 million people for each age category.  


The College Board wants us to equate the 1.2 million dropouts with people who never get a high school degree (or equivalent), but clearly that's not the case.  Some of those kids go back and get their degrees.  And, likely, some of those kids are going back and dropping out again, and getting counted twice, or three or more times.


A sixth of the adult population without a high school degree is bad enough without trying to make it 25% through shady accounting.  I'm a little surprised FlowingData didn't catch this (and worse, reported 857/hour without the "while school is in session" qualification).