Saturday, December 21, 2019

Netflix DVD rental history

We joined Netflix in late November, 2004, with the 3 DVDs out at a time plan. Nadal had yet to play at the French Open; Federer won his second Wimbledon and the U.S. Open; Phelps won 6 gold medals at the Olympics; the Boston Red Sox had won their first World Series in 86 years. And we had a 6-month-old. So we were trendsetters and cut the cable.
Fifteen years later, we're dinosaurs holding on to an old service. Yes, we have Netflix streaming, but had been holding on to 1 DVD out at a time for the past five years in order to see new releases a little sooner. However, we knew we would want Disney+, and the local library tends to carry new DVD releases, so it was an easy switch this fall.
Now, of course, I wanted my Netflix DVD rental history, but Netflix doesn't provide a download, so I had to select, copy, and paste from a webpage.

The resulting file was not a nicely formatted CSV, which made it a good opportunity to practice some general file processing.
Looking at the structure of the file, each rental takes up several rows; in order, the rows contain:
  1. A number indicating the order of the rental, from the most recent rental to the oldest (exactly 1500 rentals!)
  2. The title of the rental
  3. A blank line
  4. The release date, rating, and runtime; concatenated into a single string
  5. A blank line
  6. The ship date from Netflix's processing facility and the return date
Some rentals have an additional three lines, indicating:
  1. Whether the original shipment was damaged
  2. A blank line
  3. The ship date from Netflix's processing facility and the return date for the replacement disc


From this file format, I wanted to create an initial output table where each row was a separate rental and each column was a line from the file (I would process the output table further later). My initial parsing script identified a new rental by checking whether a line was an integer, and collected the intervening lines as elements of a list, so the rental history became a list of lists to be transformed into an output table.
The core of this parsing code is as follows:
with open(filepath) as fp:
    record = []
    viewingHistory = []
    for line in fp:
        try:
            # A line that parses as an integer is a rental number,
            # so it closes the current record and starts a new one
            int(line)
            viewingHistory.append(record)
            record = []
        except ValueError:
            # Any other line belongs to the current rental
            record.append(line)
    # This final append is necessary in order to get the final record in
    viewingHistory.append(record)
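As a side note, here is a minimal sketch of one way to skip the empty first record. It is not the script I actually ran: the function name `parse_rentals` and the sample text are made up for illustration, and it uses `str.isdigit()` instead of `int()` to spot rental numbers.

```python
import io


def parse_rentals(lines):
    """Group lines into one list per rental; a digits-only line starts a new rental."""
    history = []
    record = None  # no rental is open until the first rental number is seen
    for raw in lines:
        line = raw.strip()
        if line.isdigit():
            # Close the previous rental, if there was one, and open a new one
            if record is not None:
                history.append(record)
            record = []
        elif record is not None:
            # Blank lines are kept as empty strings so the columns stay aligned
            record.append(line)
    if record is not None:
        history.append(record)
    return history


# Hypothetical sample in the pasted layout; titles, blobs, and dates are placeholders
sample = io.StringIO(
    "2\n"
    "Some Title\n"
    "\n"
    "2004PG-131h 30m\n"
    "\n"
    "Shipped and returned dates\n"
    "1\n"
    "Another Title\n"
    "\n"
    "1999R2h 5m\n"
    "\n"
    "Shipped and returned dates\n"
)
rentals = parse_rentals(sample)
```

Like the script above, this treats any all-digit line as a rental number, so a title that is purely a number would be split into a separate record.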
With the output table, I could use some of the visual tools in Dataiku to further prepare the data. Specifically, I:
  1. Removed the empty "junk" columns
  2. Removed the initial empty row created by my parsing script; I didn't bother to think of a clever way to not produce it
  3. Parsed out the ship date and return date from the column containing that information
  4. Computed the difference between the ship and return dates to give how long we had the disc out, in days
  5. Parsed out the release date, rating, and runtime from the column containing that information. This required using a Python code step, and manually checking the various MPAA ratings in an order that would produce the desired results.
def process(row):
    # In 'row' mode, the process function
    # must return the full row.
    # The 'row' argument is a dictionary of columns of the row
    # You may modify the 'row' in place to
    # keep the previous values of the row.
    # Here, we add three new columns.
    # Longer ratings come first so that, e.g., "PG-13" is matched before "PG" or "G".
    ratings = ["PG-13","NC-17","TV-PG","TV-14","TV-MA","TV-Y7","PG","NR","G","R"]
    blob = row["infoblob"]
    for rating in ratings:
        if rating in blob:
            # Split only on the first occurrence of the rating
            before, after = blob.split(rating, 1)
            row["releaseDate"] = before
            row["rating"] = rating
            row["length"] = after
            break
    return row
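To see why the order of the ratings list matters, here is a small standalone sketch. The helper `split_infoblob` and the sample string are hypothetical; in particular, the exact layout of the concatenated string is an assumption.

```python
RATINGS = ["PG-13", "NC-17", "TV-PG", "TV-14", "TV-MA", "TV-Y7", "PG", "NR", "G", "R"]


def split_infoblob(blob, ratings=RATINGS):
    """Return (release_date, rating, length) for the first rating found in blob, else None."""
    for rating in ratings:
        if rating in blob:
            release_date, length = blob.split(rating, 1)
            return release_date, rating, length
    return None


blob = "2004PG-131h 30m"  # hypothetical release year + rating + runtime

# Longest ratings first: "PG-13" is found before "PG" or "G" get a chance
in_order = split_infoblob(blob)

# Shortest ratings first: "G" matches inside "PG-13" and the split lands in the wrong place
short_first = split_infoblob(blob, ratings=["G", "PG", "PG-13"])
```

With the list ordered as in the code step, `in_order` is `("2004", "PG-13", "1h 30m")`. Checking "G" first gives `("2004P", "G", "-131h 30m")`, which is the failure the manual ordering avoids.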
With the prepared dataset, we can do some visualizations. An obvious start is the number of rentals per year. As a coworker noted, 2010 was an exciting year...

We can also look at the number of rentals by release date. As expected, most are recent years, but we clearly had a backlog of old movies we were working through, too.
Note that this only has 958 records, instead of 1500, because television shows did not have release dates. Further work on the project might be to capture that information, especially since it's a third of the rentals.

This is a visualization of how we worked through the backlog of old movies. The Y-axis is the age of the rental movie in years, so it's not just the release date of the movie that matters, but the release date relative to the year in which we're watching it. For the first several years, there's quite a lot of scatter in the age of the movies, but this peters out as we work through the backlog, and by mid-2014, we're renting only the occasional "old movie" and are focused primarily on new releases. This is also when we decided to go to 1 DVD out at a time.

I'm not sure there's anything to build a model on. That may come later.

Tuesday, October 29, 2019

Milford, a rabbit hole

For reasons I don't now remember, I was looking at Branchville, NJ on Google Maps and noticed something...



Why the heck is West Milford the easternmost of the area Milfords?  Wikipedia says West Milford was "settled by disenchanted Dutch from Milford, New Jersey (later renamed by the British as Newark)", but it's not at all clear whether that Milford is the present-day Newark.  It's possible that this is so, but the history of Newark page suggests that Milford was a proposed name, but never the official name, and the proposer of the name "Milford" for Newark was an Englishman and not Dutch.  Quick googling doesn't resolve the issue. Given that West Milford is far more north than west of Newark, I'm kind of hoping there was a Dutch settlement originally named Milford, renamed Newark, and then lost to time.

Thursday, October 10, 2019

The Trials of Apollo, Book Four: The Tyrant's Tomb; Rick Riordan; 2019

In this edition of the annals of bad editing:
Reyna conceded this with a nod.  "For years, I was supposed to be a good little sister to Hylla in a tough family situation.  Then, on Calypso's island, I was supposed to be an obedient servant. [...]"
Calypso, of course, was alone on her island.  Reyna and Hylla were Circe's servants.

Thursday, August 29, 2019

No Time to Spare; Ursula K. Le Guin; 2017


I've been slowly reading through Le Guin's last few collections, reluctant to let them "go" in the sense that, once completed, there will be nothing new from her that I have left to read.

No Time to Spare is a fantastic title, but the subtitle Thinking About What Matters is somewhat curious because this is a collection of blog posts, and while they are more finely written than the typical internet fare, they are largely about seemingly inconsequential matters.

So is the subtitle a poor one?  Or after 80, are matters that seem inconsequential to the young (or even middle aged) quite important because there is no time to spare?

The post "Belief in Belief" struck me as important.  In essence, it points out the importance of choosing words carefully to improve communication; specifically, she looks at the oppositional "belief in God" versus "belief in science", and offers:
I don't believe in Darwin's theory of evolution.  I accept it.  It isn't a matter of faith, but of evidence.
Wise words.

But troublesome for the choice of words in Bayesian statistics, which heavily uses "belief" to describe probabilities of material events, and not spiritual truths.  

Thursday, June 27, 2019

Throat bobbing in the works of Sarah J. Maas

My family discovered the Throne of Glass series a few years ago, and convinced me to read the A Court of ... books in 2017.  Maas is not a critical darling, but she writes characters I want to read about, she manages an epic cast without losing control of any of the subplots, every book in a series advances the main plot, and she publishes regularly.

Sometime during the second book of that series, A Court of Mist and Fury, I noticed that in moments of extreme emotion, Maas would describe the character's physical reaction to the emotion as a throat bob.  For example, Rhys is reliving a painful memory here:
His throat bobbed.  I could tell it was rage, and pain, that kept him from telling me outright -- not mistrust.
This is a slightly unusual physical description, but cool and somewhat unique to Maas among the authors I've read.  After I started A Court of Wings and Ruin, I noticed that the number of throat bobs had skyrocketed.  Being a data guy, I naturally had to start keeping count in this spreadsheet.

After recording 15 throat bobs in the 669 pages of A Court of Wings and Ruin, I was hooked.  I needed a more complete analysis of throat bobs in Sarah J. Maas's oeuvre.  Earlier this year, I finally started and finished the Throne of Glass series, went back through A Court of Thorns and Roses to comb for the last few throat bobs, and built out a dashboard of visualizations.  The highlights include:



The number of throat bobs by book, ordered from the books with the most bobs to the fewest.  Note that:

  • Throne of Glass itself is missing, because there were zero bobs in it.  
  • Empire of Storms has more bobs per page than Kingdom of Ash, because Kingdom of Ash is a longer book.




The number of throat bobs by character, ordered from the characters with the most bobs to the fewest.  Note that Aelin's series has 8 books to Rhysand's 4.  I'm kind of shocked how high Gavriel and Yrene are on this list, though this may in part be because the bob rate accelerated around the time they became important characters.  (See the following charts.)




The number of throat bobs by year of publication date.  The x-axis is wonky; that rightmost bar is 2018, but appears centered around the middle of 2017. (sigh)




The number of throat bobs by quarter and year of publication date.  This shows us where in the publication history those books with high bob counts are, plus the number of books published each year. 




Saving the best for last, this is the number of bobs by cumulative page count of Maas's oeuvre.  You can see that the rate starts slow, picks up just before the 2,000th page, then kicks into high gear from roughly page 3,800 to page 4,800.  After that, it slows down a bit, but the rate of throat bobs per page is still higher than it was before page 3,800.

Hopefully, this project will continue for many more years to come.

Friday, March 22, 2019

Database marketing fail

I recently received the following email, which is hilarious for a number of reasons:

  • We did own a Jetta GL.  A 1996 Jetta GL.  I know it's worth nothing.
  • Heritage Toyota should know that, because they rejected the Jetta GL as a trade-in back in 2011.
  • Since we eventually purchased our current vehicle from them, Heritage Toyota should know the make, model, and year of our current vehicle.

Heritage Toyota may very well just be sending spam emails using a purchased database of car owners and their last known vehicle, but they should really merge that database with their own internal database in order to send better targeted emails.