Sunday, January 29, 2012

The Birthday Paradox

This is another post related to my wandering thoughts on my daily commute.  In my first couple of weeks here, I was behind a car with a personalized license plate.  It was hard to figure out the meaning, which helped me remember it when I was behind it again a few days later.  This happened again with a different car more recently, I saw the same car several days apart driving in different places.  Both of these times I was in the city on a crowded interstate (yes, Hawai'i has an interstate ... think about that for a moment); we were surrounded by a large number of cars.  It seems like the chance of seeing the same car twice should be very small, which got me to wondering just how many cars there are around me on a daily basis.  

Our numerical intuition tends to be linear.  We have trouble with anticipating processes that grow exponentially and/or lead to very large numbers.  One of these miss-intuitions leads to the Birthday Paradox.  There are 366 possible birthdays, so it seems like you would need a room full of a lot of people, say 183 or half this size, to have an even 50/50 chance of two people sharing the same birthday.  However, only 23 people are required.  In other words, in a room of 23 people, there is a 50% chance at least two people share the same birthday.

Why is this number so small?  The answer lies in the process of pairwise comparisons and how they grow.  If we have two objects there is only one comparison (A to B) to be made between them.  If we have three then there are three ways to compare them in pairs (A to B, A to C and B to C).  If we have four there are six ways to compare (AB, AC, AD, BC, BD, CD).  With five objects we have ten possible comparisons; with six, fifteen; and so on.  Each object we add to the list can be compared to every preceding object, so the numbers start to grow quickly.  With 10 people in the room we have 45 pairs of birthdays to compare.  With 23, the 50/50 chance of the same birthday, we have 253 birthday pairs; almost 70% of the total number of possible birthdays. 

Another way to visualize these pairwise comparisons is that they are the triangular numbers.  If you stack plastic cups upside down, these are the number of cups required to make triangles of various sizes. 


To get to a greater height of the triangle, the base has to be widened, which adds to the number in increasing amounts.  By the way, cup stacking is an organized sport.  Here is a video. 



The triangular numbers are also a diagonal on Pascal's triangle


The first diagonal is all ones; the second is the counting numbers (1,2,3,4..), and the third are the triangular numbers (1,3,6,10,...).  The fourth row are the tetrahedral numbers (1,4,10,...), or what you get if you stack cups in a three dimensional triangle (pyramid).  The fifth row would be the number of cups required to stack in a four dimensional triangle, except that we can't reach in that direction.  Going the other way, the counting numbers, second diagonal, are what you get if you stack in one dimension (one on top of the other) and the first diagonal, all ones, are stacking in zero dimensions.  Zero dimensions is just a point so there is nowhere to stack beyond the first cup. 

The number of pairwise comparisons (triangular numbers) is given by n(n-1)/2 if n is the number of objects to compare.  This is approximately n^2/2 (or n squared divided by two, which is a parabola).  Here is a graph showing how the number of pairwise comparisons (y-axis) grows with n (x-axis), the number of objects to compare. 


The curve grows at an increasingly steep angle so that with 1,000 objects to compare there are half a million possible pairwise comparisons.  The squares of numbers grows much faster than the numbers themselves. 

Flipping this around and going back to the birthday paradox, the number of individuals needed to get the first match does not grow as fast as the total number of matches possible.  Suppose we were comparing something else, like auto license plate numbers on an island, you only have to sample a small number of the total to get a match and this sample is a smaller and smaller fraction of the total as the total number grows.  Here is a graph to illustrate. 


The total number of cars is on the x-axis.  The number of randomly sampled cars needed for a 50/50 probability of sampling the same car twice is on the y-axis.  So if there were a million registered cars on the island, you would only have to sample 1,400 to have an even chance of getting the same one twice. 


This graph shows how the fraction sampled until the first match is a smaller and smaller proportion of the total as the total grows.  With 10,000 cars you only need to sample 1.4% of the total; at one million cars this is 0.14%. 

This property, of rapidly shrinking fractions required to work with large numbers of unknown identities, can be taken advantage of in various ways, one way is called the birthday attack in cryptography, where matching hash functions can be found between documents to transfer digital signatures.  Another way can be to estimate the total size of a population from a smaller sample. 

There are approximately 700,000 automobiles registered in Oahu.  However, we all have daily patterns and regions we tend to, or not to, be in; depending on this overlap I am more likely to encounter some of these cars than others.  How many?  In other words, effectively what is the number of cars I am usually around while driving?  Both to answer my original question and because I might want to know how many people are likely to see a bumper sticker or advertisement I put on the car. 

To test this out I surreptitiously took pictures of cars while V was driving, usually to the grocery store and back, but also along my regular commute route to work.  (Yes, I know this seems strange, but it is harmless I think.)  Over a couple week period I had pictures of 590 cars, that I entered each day into a database--it seems like a lot but you can collect plate numbers quickly with just a few minutes of entering.  And there was one match!  I took a picture of the same car twice, five days apart, in two places.  Here are the photos.  (I blocked out the plate number for privacy.  I don't think this is entirely necessary, but it doesn't hurt.) 


Turing this around, if I were randomly sampling from approximately 250,000 cars I would expect a 50% probability of a match in a sample of 590 photos.  This is about 1/3 of the island, more than I expected! 

This reminds me of a trip I made with some friends to New York City in 1993, we saw the same person, a perfect stranger, on the street in different places on different days.  There was a story then, that was repeated on the drive to the city: there are so many people in NYC that if you pass someone on the street 10 years would have to go by before passing them again.  This may (or may not) be true for a specific person, but not for all the people you pass on a given day. 

This also helps explain seemingly amazing coincidences, like the woman that won the New Jersey lottery twice in four months.

Say I am behind or adjacent to ~100 cars each day to and from work and the grocery store.  Then on the next day 100 more cars.  There are 100^2 or 10,000 possible matches between the two days.  On the third day there is 100 more cars.  Then there are 100^3 or one million possible matches between the days, but I am only around approximately a quarter of a million cars each day according to my results above, so I have been behind the same car twice in these three days (on average, according to these rough guesses, I am behind the same car every 2.7 days, (log(250,000)/log(100))). 

In case anyone is curious, there were a few out of state plates, people that shipped their cars to Hawai'i.  These were from Alaska, Arizona, California, Florida, Georgia, and Nevada.  Plus there were a few federal government plates. 

There were also quite a few personalized plates, which is what started me on this track in the first place.  In contrast to the serial plates these are meant to be advertised so I will list them here.  Some are puzzling to figure out and remain a complete mystery to me. 

1J-ACE
4-BENJI
AGAPE
AMR-C
BELA
BGOODE
BIGDDY
CHZLNG
DEL
DFS-1
DS24ME
E-SABIR
EASE-UP
ENDURE
FEARY2
GODLDY
GRNWNV
GRRR
H8MYEX
HEL-LMS
I 2SEW
IGUANA
INA
IVIMAR
JAZLVR
JULZ87
JZANO
KAFFEE
KAI-EA
KALAIS
LABETE
LIKE 9
LU-SAN
M LITE
MBPPRP
MINI U
MOVE-ON
NELAZ-1
PERFEK
PRED8R
RAD
RDRK
RILLOZ
RLVRIA
SCTTSH
SENIE
SHANDL
SKTDVL
SMEEN
SRCHNG
SSS-1
ST33L3
SURFZ
TWIN-GL
VELINA
ZAYNAH

No comments: