Saturday, December 21, 2019

Netflix DVD rental history

We joined Netflix in late November, 2004, with the 3 DVDs out at a time plan. Nadal had yet to play at the French Open; Federer won his second Wimbledon and the U.S. Open; Phelps won 6 gold medals at the Olympics; the Boston Red Sox had won their first World Series in 86 years. And we had a 6-month-old. So we were trendsetters and cut the cable.
Fifteen years later, we're dinosaurs holding on to an old service. Yes, we have Netflix streaming, but had been holding on to 1 DVD out at a time for the past five years in order to see new releases a little sooner. However, we knew we would want Disney+, and the local library tends to carry new DVD releases, so it was an easy switch this fall.
Now, of course, I wanted my Netflix DVD rental history, but Netflix doesn't provide a download, so I had to select, copy, and paste from a webpage.

The resulting file was not a nicely formatted CSV. This provided a nice opportunity to practice some general file processing.
Examining the structure of the file, each rental takes up several rows; each row in order contains:
  1. A number indicating the order of the rental, from the most recent rental to the oldest (exactly 1500 rentals!)
  2. The title of the rental
  3. A blank line
  4. The release date, rating, and runtime; concatenated into a single string
  5. A blank line
  6. The ship date from Netflix's processing facility and the return date
Some rentals have an additional three lines, indicating:
  1. Whether the original shipment was damaged
  2. A blank line
  3. The ship date from Netflix's processing facility and the return date for the replacement disc


From this file format, I wanted to create an initial output table where each row was a separate rental, and each column was a line from the file (I would further process the output table later). My initial parsing script needed to identify a new rental based upon whether the value of the line was an integer, read intervening lines as elements of a list, and so the rental history was a list of lists to be transformed into an output table.
The core of this parsing code is as follows:
with open(filepath) as fp:
    line = fp.readline()
    record = []
    viewingHistory = []
    while line:
        try:
            int(line)
            # print(record)
            viewingHistory.append(record)
            record = []
        except:
            record.append(line)
        line = fp.readline()
    # This final append is necessary in order to get the final record in
    viewingHistory.append(record)
With the output table, I could use some of the visual tools in Dataiku to further prepare the data. Specifically, I:
  1. Removed the empty "junk" columns
  2. Remove the initial empty row created by my parsing script; I didn't bother to think of a clever way to not produce it
  3. Parsed out the ship date and return date from the column containing that information
  4. Computed the difference between the ship and return date to give how long we had the disk out in days
  5. Parsed out the release date, rating, and run time from the column containing that information. This required using a Python code step, and manually checking the various MPAA ratings in an order that would produce the desired results.
def process(row):
    # In 'row' mode, the process function 
    # must return the full row.
    # The 'row' argument is a dictionary of columns of the row
    # You may modify the 'row' in place to
    # keep the previous values of the row.
    # Here, we simply add two new columns.
    ratings = ["PG-13","NC-17","TV-PG","TV-14","TV-MA","TV-Y7","PG","NR","G","R"]
    for rating in ratings:
        split = row["infoblob"].split(rating)
        if (split[0] != row["infoblob"]):
            row["releaseDate"] = split[0]
            row["rating"] = rating
            row["length"] = split[1]
            break
    return row
With the prepared dataset, we can do some visualizations. An obvious start is the number of rentals per year. As a coworker noted, 2010 was an exciting year...

We can also look at the number of rentals by release date. As expected, most are recent years, but we clearly had a backlog of old movies we were working through, too.
Note that this only has 958 records, instead of 1500, because television shows did not have release dates. Further work on the project might be to capture that information, especially since it's a third of the rentals.

This is a visualization of how we worked through the backlog of old movies. The Y-axis is the age of the rental movie in years, so it's not just the release date of the movie that matters, but the release date relative to the year in which we're watching it. For the first several years, there's quite a lot of scatter in the age of the movies, but this peters out as we work through the backlog, and by mid-2014, we're renting only the occasional "old movie" and are focused primarily on new releases. This is also when we decided to go to 1 DVD out at a time.

I'm not sure there's anything to build a model on. That may come later.