|Pivotal HD||All Versions|
The purpose of TeraSort is to test the CPU and memory power of the cluster by sorting 1 TB of data on a 10-byte ASCII key in the shortest amount of time possible. Benchmark results will vary depending on available cluster resources.
Use the following command to run TeraGen:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 10000000000 /teraInput
Argument 1: Number of 100-byte rows to generate (10,000,000,000 rows x 100 bytes = 1 TB in this example).
Argument 2: HDFS path where the generated data will be written.
TeraGen runs map tasks to generate the data and does not run any reduce tasks. The default number of map tasks is set by the "mapreduce.job.maps=2" parameter. Its only purpose is to generate the 1TB of random data in the following format: "10 bytes key | 2 bytes break | 32 bytes ascii/hex | 4 bytes break | 48 bytes filler | 4 bytes break | \r\n".
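Because every TeraGen row is exactly 100 bytes, the row-count argument can be derived from any target dataset size. A minimal sketch of that arithmetic (the 1 TB target below is an assumption matching this article's example):

```shell
#!/bin/sh
# Derive TeraGen's first argument (row count) from a target data size.
# Assumption: decimal 1 TB target, matching the example in this article.
TARGET_BYTES=1000000000000   # 1 TB
ROW_SIZE=100                 # each TeraGen row is exactly 100 bytes
ROWS=$((TARGET_BYTES / ROW_SIZE))
echo "$ROWS"                 # prints 10000000000, the first argument to teragen
```

To generate a smaller test set, change TARGET_BYTES accordingly (e.g. 10000000000 for 10 GB).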
Use the following command to run TeraSort:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /teraInput /teraOutput
13/09/23 21:30:21 INFO mapreduce.Job: Running job: job_1379996975669_0001
13/09/23 22:25:54 INFO terasort.TeraSort: done
This will create a series of map tasks that sort on the ASCII key; there will be one map task for each HDFS block of input data. By default there will be one reduce task, as defined by "mapreduce.job.reduces=1".
In the example below we force 8 reducers using the switch "-D mapred.reduce.tasks=8". This should be tuned based on the number of nodes in the cluster so you use its full capacity.
The data will be partitioned 1:1 against the reduce tasks: one partition for every reduce task.
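A sketch of a terasort run with forced reducers, plus the 1:1 partition arithmetic. The command is only echoed here so the snippet runs anywhere; on a cluster node, drop the leading echo (jar path and HDFS paths mirror those used in this article):

```shell
#!/bin/sh
# Force 8 reducers and estimate each reducer's share of the input.
REDUCERS=8
# Echoed rather than executed; run without 'echo' on a cluster node.
echo hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
  terasort -D mapred.reduce.tasks=$REDUCERS /teraInput /teraOutput

# With a 1:1 partition-to-reducer ratio, each reducer handles an equal slice:
TOTAL_BYTES=1000000000000            # 1 TB input, per this example
echo $((TOTAL_BYTES / REDUCERS))     # prints 125000000000 (125 GB per partition)
```

Keeping each partition well within a single node's spill capacity is the usual reason to raise the reducer count as the cluster grows.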
NOTE: Yahoo runs its TeraSort benchmarks after changing the replication factor to 1 on the parent directory.
Use the following command to run TeraValidate:
hadoop jar /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar teravalidate -D mapred.reduce.tasks=8 /teraOutput /teraValidate
The command above reads the output data and ensures that each key is less than or equal to the next key across the entire dataset, i.e. that the output is globally sorted.
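The check TeraValidate performs can be sketched locally with a tiny file in place of the HDFS output (file name and contents here are purely illustrative; the real tool reads the terasort output partitions):

```shell
#!/bin/sh
# Sketch of TeraValidate's core check: the data is valid when every key
# is <= the key that follows it. 'sort -c' performs exactly that ordering
# check, exiting 0 only if the input is already sorted.
printf 'aaa\nbbb\nccc\n' > /tmp/keys.txt   # illustrative stand-in for the output
if sort -c /tmp/keys.txt 2>/dev/null; then
  echo "sorted"        # prints "sorted" for this input
else
  echo "NOT sorted"
fi
```

TeraValidate additionally verifies per-record checksums, which this local sketch omits.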