Monday, April 12, 2010

TAOCP–Random Sampling and Shuffling

Many data processing application call for an unbiased choice of n records at random from a file contains N records. For example, you have a log file contains more than 100K records in your server, you need sample 1K records to analyst the data. If the number of record is huge, so that it’s impossible for you to hold all data in your memory, sometimes the individual record is also large.

Some methods have been devised for this problem. The most obvious approach is to select each record with probability n/N. The model is sometimes appropriate.In the Donald Knuth’s TAOCP , he gave us another approach to do the random sampling.You can go through the 3.2.4 to get more detail about it. In this sections, there are three algorithms. The first one is Selection sampling technique,which select n items from a set with N items. The second is Reservoir sampling, in this case, the N is unknown. Suppose we select n items at random from a file, without knowing exactly how many are present in this file. So this algorithm creates a auxiliary file contains all records that are candidates for the final sample.The third algorithm is Shuffling. I use the C# to implement three algorithms according to the TAOCP Vol.2.

  • This is selection sampling.I omit the process of generating a array(set in the code) for sampling.

  • This is reservoir sampling.The comment part is for generating a large data file with 1000000 random items.

  • This is shuffling
int []set=new int[101];
Random r = new Random();
for (int i = 0; i < 101; i++)
   
set[i] = r.Next(1000);

int timeShuffling = 100;
while (timeShuffling > 1)
{
   
int k = (int)Math.Floor(r.NextDouble()*timeShuffling) + 1;
   
int temp = set[k];
   
set[k] = set[timeShuffling];
   
set[timeShuffling] = temp;
}

No comments:

Post a Comment