statistical analysis – Brian Shaler

Part of the purpose of any social web site is to build a network of friends. Using the recently released Digg API, I created a map of Digg users and how they’re connected to each other.

The Map [Link]

On the map, users are organized by the length of time they have had accounts on Digg. The oldest accounts are at the center and accounts created in the last few months are around the edges. The map only includes users who utilize Digg’s friendship feature.
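
The layout rule described above (oldest accounts at the center, newest at the edges) can be sketched in a few lines. This is only an illustration, not the code that rendered the map. The function name `radial_position`, the linear age-to-radius scaling, and the angle parameter are all assumptions, since the post does not say how users were spread around the circle.

```python
import math


def radial_position(age_days, max_age_days, angle_radians, map_radius=1.0):
    """Place a user so that older accounts sit nearer the center.

    age_days: how long the account has existed.
    max_age_days: age of the oldest account on the map.
    angle_radians: direction from the center (a free parameter here;
        how the real map chose angles is not stated).
    """
    if max_age_days <= 0:
        raise ValueError("max_age_days must be positive")
    # Clamp so no account lands outside the map.
    fraction_of_max = min(max(age_days / max_age_days, 0.0), 1.0)
    # Oldest account -> radius 0 (center); newest -> map_radius (edge).
    radius = (1.0 - fraction_of_max) * map_radius
    return (radius * math.cos(angle_radians), radius * math.sin(angle_radians))
```

With this scaling, an account half as old as the oldest one lands halfway between the center and the edge, whatever angle it is given.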

Usefulness

I’m known for making an argument that data visualization can be very useful. Charts and graphs, while they may be aesthetically pleasing, can point out trends and habits on a broad scale that would probably be missed using typical statistical analysis.

However, I’ll be the first to say that the Digg friendship map I created has very little value as a practical analysis tool. It’s an idea I’ve had in my head for a few months, and I finally got around to building it. I thought it would be fun. I thought it would be neat.

Making it fun and neat

The map itself, as a JPG image, may already appeal to people with an interest in data visualization. However, if I’m going to make something fun and neat, I’m going to try to appeal to more people than just data visualization enthusiasts.

When I rendered the image, I stored the coordinates for every user in a database. This lets me go back afterwards, query the database for a specific user, and retrieve that data point. I created a simple Flash interface where people can type in their own (or others’) Digg usernames to find out where they are on the map.
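
Storing each user's coordinates at render time and looking them up later can be sketched with Python's built-in `sqlite3` module. The table name `user_points`, the column names, and the case-insensitive username match are assumptions made for illustration. The post does not say which database or schema the map actually uses.

```python
import sqlite3


def build_coordinate_store(points):
    """Create an in-memory table of (username, x, y) rows.

    points: iterable of (username, x, y) tuples produced while rendering.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE user_points ("
        " username TEXT COLLATE NOCASE PRIMARY KEY,"
        " x REAL NOT NULL,"
        " y REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO user_points (username, x, y) VALUES (?, ?, ?)", points
    )
    conn.commit()
    return conn


def find_user(conn, username):
    """Return (x, y) for a username, or None if the user is not on the map."""
    return conn.execute(
        "SELECT x, y FROM user_points WHERE username = ?", (username,)
    ).fetchone()
```

A front end such as the Flash interface would then only need to send a username and draw a marker at the returned point.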

Check it out!

The Challenge

In the last seven weeks, DiggTaggr has delivered about 115,000 sets of links to relevant stories to several thousand unique users. This is a pretty good-sized dataset to tinker with, so I decided to hack through it and see if I could present the data in an interesting way.

I have to admit, I was partially inspired to do this by Stamen Design’s data visualizations of Digg’s traffic. If you haven’t seen their scatter-plots, you should check them out.

Graph #1: User ID vs. Story ID

This was my first attempt at displaying the dataset in an interesting manner. Two stories related to DiggTaggr hit the front page and are labeled on the graph. DiggTaggr debuted on Friday, February 2nd, and received a complete redesign on the 4th.

You can see the curve of new users accelerating through most of the graph. This illustrates that there are fewer and fewer new users. There are also grid-like patterns emerging. Horizontal lines represent highly active users, while the dark horizontal gaps represent users who tried the tool and stopped using it. Vertical lines represent high activity during peak hours, while dark vertical gaps represent low activity on weekends.

Graph #2: Time vs. User ID

The Story ID axis in the previous graph gave a fairly accurate chronological reference, but if it’s time you want, it’s time you should use.

This graph illustrates peak hours and peak days of the week in a more explicit way. I labeled the distinct patterns of weekdays and weekends. You can see that the pattern is more clearly defined in a certain area of the graph. These users involuntarily grouped themselves together by seeing the tool first thing Monday morning (the white horizontal line at the top of that section of users), while daily users had already seen the tool for 2 days.
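
Peak hours and peak days can also be checked numerically by counting events per weekday and hour. The sketch below assumes each delivery is logged with a `datetime` timestamp. The function name `activity_by_weekday_and_hour` is hypothetical and is not taken from DiggTaggr.

```python
from collections import Counter
from datetime import datetime


def activity_by_weekday_and_hour(timestamps):
    """Count events per (weekday, hour) pair.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    return Counter((ts.weekday(), ts.hour) for ts in timestamps)


def weekend_share(timestamps):
    """Fraction of events that fall on Saturday or Sunday."""
    timestamps = list(timestamps)
    if not timestamps:
        return 0.0
    weekend = sum(1 for ts in timestamps if ts.weekday() >= 5)
    return weekend / len(timestamps)
```

A low `weekend_share` would match the weekend dips that are visible in the graph.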

The graph is color-coded to see how quickly users went through 40 Digg stories using DiggTaggr. Some users quickly went to red, while others used the tool less frequently.
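
A color scale like this one can be sketched as a blend that reaches red once a user has gone through 40 stories. The post only says that users "went to red". The blue starting color, the linear blend, and the function name `progress_color` are assumptions.

```python
def progress_color(stories_viewed, target=40):
    """Map a story count to an (R, G, B) tuple.

    0 stories -> blue (assumed starting color); target or more -> red.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    # Clamp to the 0..target range so heavy users stay pure red.
    fraction = min(max(stories_viewed, 0), target) / target
    red = round(255 * fraction)
    return (red, 0, 255 - red)
```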

Graph #3: Stories Viewed vs. Time

Okay, okay. I was having a little bit of fun with this one. This one took over an hour to render on my laptop, partially because each dot had its own database query to determine how many previous instances there were for that user.
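
One way to avoid a query per dot is to sort the events once and keep a running counter for each user. This is a sketch, not a description of how the graph was actually rendered. The event shape `(timestamp, user_id)` and the function name `stories_viewed_series` are assumptions.

```python
from collections import defaultdict


def stories_viewed_series(events):
    """Compute each user's running story count in a single pass.

    events: iterable of (timestamp, user_id) pairs in any order.
    Returns a list of (timestamp, user_id, stories_viewed_so_far),
    sorted by timestamp, ready to plot as one dot per entry.
    """
    counts = defaultdict(int)
    points = []
    for timestamp, user_id in sorted(events):
        counts[user_id] += 1  # previous instances for this user, plus this one
        points.append((timestamp, user_id, counts[user_id]))
    return points
```

The sort dominates the cost, so the whole dataset is handled in one pass instead of one database round trip per dot.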

Each squiggly line represents a user. When the line is vertical, the user is viewing stories quickly one after another. You can see that only a handful of users have made it to the 1,000 story mark. I should give them a prize!

The density at the bottom left illustrates the high volume of new users. Some Digg at a rapid pace and shoot up, while others are more moderate and gradually climb. The density at the bottom tells us that a high percentage of DiggTaggr users either rarely visit Digg or uninstalled the tool.

Conclusions

Data visualization is still fascinating and fun.

DiggTaggr has sent almost half a million links to relevant stories to its users.

Yesterday, I chose to parse datasets instead of going outside. geek++

I still enjoy hearing from users. Feedback is always welcome and appreciated.
Email: brian@shaler.name

Data Visualization: Mapping the Digg Community