MapReduce secondary sort: how MapReduce secondary sort works

Next, a worked example gives an intuitive view of how secondary sort works.

The input file sort.txt contains:

40 20

40 10

40 30

40 5

30 30

30 20

30 10

30 40

50 20

50 50

50 10

50 60

The output file contents (sorted in ascending order; the dashed lines separate the key groups) are as follows:

30 10

30 20

30 30

30 40

--------

40 5

40 10

40 20

40 30

--------

50 10

50 20

50 50

50 60

The output shows that the keys are sorted in ascending order and that, within each key, the values are also sorted in ascending order. This is the result of a secondary sort.
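To make the target ordering concrete, here is a minimal plain-Java sketch (no Hadoop involved) that sorts the sample pairs by the first field and then by the second. The class name SecondarySortDemo and its helper methods are illustrative assumptions, not part of the original example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SecondarySortDemo {
    // Sorts {first, second} pairs by first, then by second, both ascending.
    static List<int[]> sortPairs(List<int[]> pairs) {
        List<int[]> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]));
        return sorted;
    }

    // Renders each pair as a "first second" line.
    static String render(List<int[]> pairs) {
        StringBuilder sb = new StringBuilder();
        for (int[] p : pairs) {
            sb.append(p[0]).append(' ').append(p[1]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The twelve records of sort.txt, in input order.
        List<int[]> input = Arrays.asList(
            new int[]{40, 20}, new int[]{40, 10}, new int[]{40, 30}, new int[]{40, 5},
            new int[]{30, 30}, new int[]{30, 20}, new int[]{30, 10}, new int[]{30, 40},
            new int[]{50, 20}, new int[]{50, 50}, new int[]{50, 10}, new int[]{50, 60});
        System.out.print(render(sortPairs(input)));
    }
}
```

On a single machine one comparator is enough. The MapReduce version has to get the same ordering out of the shuffle, which is why the work is split across a partitioner, a sort comparator, and a grouping comparator.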

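The MapReduce secondary sort process in detail

This example needs two comparisons: sort by the first field, then, among records with the same first field, sort by the second field. To do this we build a composite class, IntPair, with two fields. Partitioning on the first field keeps records with the same first value together, and the comparison within each partition then orders them by the second field. The process has the following steps.

1. Custom key

Every custom key should implement the WritableComparable interface, because a key must be both serializable and comparable. The methods a WritableComparable key provides are listed below.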

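// Deserialize: rebuild an IntPair from the binary data in the stream
public void readFields(DataInput in) throws IOException

// Serialize: convert an IntPair to binary so it can be sent over a stream
public void write(DataOutput out) throws IOException

// Compare keys
public int compareTo(IntPair o)

// Used by the default partitioner class, HashPartitioner
public int hashCode()

// Equality check (Object supplies a default implementation)
public boolean equals(Object right)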
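The serialization pair above can be tried without a cluster. The following is a sketch only: PairSketch is a hypothetical stand-in that copies the method shapes of a WritableComparable key but deliberately does not implement the Hadoop interface, so it compiles with the JDK alone. The roundTrip helper is likewise an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a Hadoop key (does not implement WritableComparable).
class PairSketch implements Comparable<PairSketch> {
    int first;
    int second;

    PairSketch() {
    }

    PairSketch(int first, int second) {
        this.first = first;
        this.second = second;
    }

    // Serialize both fields as 4-byte ints, in a fixed order.
    void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Deserialize in exactly the order used by write().
    void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Order by first, then by second.
    @Override
    public int compareTo(PairSketch o) {
        int byFirst = Integer.compare(first, o.first);
        return byFirst != 0 ? byFirst : Integer.compare(second, o.second);
    }

    // Writes a pair to a byte array and reads it back into a new object.
    static PairSketch roundTrip(PairSketch p) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            p.write(new DataOutputStream(bytes));
            PairSketch copy = new PairSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PairSketch copy = roundTrip(new PairSketch(40, 5));
        System.out.println(copy.first + " " + copy.second);
    }
}
```

The point to notice is that readFields must read fields in the same order and with the same widths that write used; the framework relies on this when it moves keys between map and reduce tasks.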

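2. Custom partitioner

The custom partitioner class FirstPartitioner performs the first comparison on the key: it looks only at the first field and decides which partition (and therefore which reducer) each key is sent to.

public static class FirstPartitioner extends Partitioner<IntPair, IntWritable>

Set the Partitioner on the job with the setPartitionerClass() method.

job.setPartitionerClass(FirstPartitioner.class);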
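The partitioning rule itself is plain arithmetic on the first field, so it can be sketched outside Hadoop. In this illustration the class PartitionSketch and the method partitionFor are hypothetical names, and masking with Integer.MAX_VALUE is an assumption chosen here to keep the result non-negative even if the multiplication overflows.

```java
class PartitionSketch {
    // Picks a partition from the first field only, so every key that shares
    // a first value is routed to the same partition.
    static int partitionFor(int first, int numPartitions) {
        // Clear the sign bit so the remainder can never be negative.
        return ((first * 127) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int numPartitions = 3;
        int[] firsts = {30, 40, 50};
        for (int first : firsts) {
            System.out.println(first + " -> partition " + partitionFor(first, numPartitions));
        }
    }
}
```

Because the second field never enters the calculation, records such as (40, 5) and (40, 30) always meet in the same reducer, which is a precondition for sorting and grouping them together.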

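3. Key comparator class

This is the second comparison on the key. It sorts all keys, ordering IntPair by first and then by second. The class is a comparator and can be implemented in two ways.

1) Extend WritableComparator.

public static class KeyComparator extends WritableComparator

It must have a constructor and override the following method.

public int compare(WritableComparable w1, WritableComparable w2)

2) Implement the RawComparator interface.

With either approach, set the key comparator class on the Job with the setSortComparatorClass() method.

job.setSortComparatorClass(KeyComparator.class);

Note: if no custom SortComparator class is set, keys are sorted with the key's own compareTo() method by default.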
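The RawComparator route compares keys while they are still serialized bytes. The sketch below illustrates that idea in plain Java: it borrows the parameter shape of RawComparator's compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2), but RawCompareSketch, encode, and compareRaw are hypothetical names and the class does not implement the Hadoop interface. It assumes each key was written as two big-endian 4-byte ints, which is the layout two writeInt calls produce.

```java
import java.nio.ByteBuffer;

class RawCompareSketch {
    // Encodes a pair as two big-endian ints (ByteBuffer's default byte order).
    static byte[] encode(int first, int second) {
        return ByteBuffer.allocate(8).putInt(first).putInt(second).array();
    }

    // Compares two serialized pairs without building objects:
    // first field, then second field on a tie.
    static int compareRaw(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        ByteBuffer x = ByteBuffer.wrap(b1, s1, l1);
        ByteBuffer y = ByteBuffer.wrap(b2, s2, l2);
        int byFirst = Integer.compare(x.getInt(), y.getInt());
        if (byFirst != 0) {
            return byFirst;
        }
        return Integer.compare(x.getInt(), y.getInt());
    }

    public static void main(String[] args) {
        byte[] a = encode(40, 5);
        byte[] b = encode(40, 10);
        System.out.println(compareRaw(a, 0, a.length, b, 0, b.length));
    }
}
```

Skipping deserialization is the usual reason to prefer this form: the sort phase compares keys many times, and the byte-level version avoids allocating an object for each comparison.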

4. Grouping comparator

In the reduce phase, when the framework builds the value iterator for a key, all keys with the same first field belong to the same group and share one value iterator. This comparator can be defined in two ways.

1) Extend WritableComparator.

public static class GroupingComparator extends WritableComparator

It must have a constructor and override the following method.

public int compare(WritableComparable w1, WritableComparable w2)

2) Implement the RawComparator interface.

With either approach, set the grouping class on the Job with the setGroupingComparatorClass() method.

job.setGroupingComparatorClass(GroupingComparator.class);
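What grouping does to an already sorted key stream can be simulated in a few lines. This is a sketch under assumptions: GroupingSketch, group, and render are hypothetical names, keys are modeled as {first, second} arrays, and the input is assumed to be sorted by (first, second) beforehand, as the sort phase would leave it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {
    // Walks keys already sorted by (first, second) and opens a new group
    // whenever the grouping comparator reports a different first field.
    static List<List<int[]>> group(List<int[]> sortedKeys) {
        Comparator<int[]> grouping = Comparator.comparingInt(k -> k[0]);
        List<List<int[]>> groups = new ArrayList<>();
        int[] previous = null;
        for (int[] key : sortedKeys) {
            if (previous == null || grouping.compare(previous, key) != 0) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    // Renders each group as "first: second second ...".
    static String render(List<List<int[]>> groups) {
        StringBuilder sb = new StringBuilder();
        for (List<int[]> g : groups) {
            sb.append(g.get(0)[0]).append(':');
            for (int[] key : g) {
                sb.append(' ').append(key[1]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<int[]> sorted = Arrays.asList(
            new int[]{30, 10}, new int[]{30, 20}, new int[]{40, 5}, new int[]{40, 10});
        System.out.print(render(group(sorted)));
    }
}
```

Since grouping only compares adjacent keys, each group's values come out in the order the sort comparator produced, which is how one reduce call sees the second field in ascending order.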

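Also note that if the reduce input and output types differ, the Combiner cannot reuse the Reducer class, because the Combiner's output is the input to reduce. In that case a separate Combiner has to be defined.

Code implementation

Hadoop's examples package ships with a MapReduce secondary sort. The code below is an improved version of that example.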

package com.buaa;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

/**
 * @ProjectName SecondarySort
 * @PackageName com.buaa
 * @ClassName IntPair
 * @Description Wraps the key/value of each sample record into one composite key, implementing the WritableComparable interface and overriding its methods
 * @Author 劉吉超
 * @Date 2016-06-07 22:31:53
 */
public class IntPair implements WritableComparable<IntPair> {
    private int first;
    private int second;

    // No-argument constructor, needed so the framework can create instances before readFields()
    public IntPair() {
    }

    public IntPair(int left, int right) {
        set(left, right);
    }

    public void set(int left, int right) {
        first = left;
        second = right;
    }

    // Deserialize: read the fields in the same order write() emits them
    @Override
    public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Serialize both fields
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Order by first, then by second
    @Override
    public int compareTo(IntPair o) {
        if (first != o.first) {
            return first < o.first ? -1 : 1;
        } else if (second != o.second) {
            return second < o.second ? -1 : 1;
        } else {
            return 0;
        }
    }

    @Override
    public int hashCode() {
        return first * 157 + second;
    }

    @Override
    public boolean equals(Object right) {
        if (right == null)
            return false;
        if (this == right)
            return true;
        if (right instanceof IntPair) {
            IntPair r = (IntPair) right;
            return r.first == first && r.second == second;
        } else {
            return false;
        }
    }

    public int getFirst() {
        return first;
    }

    public int getSecond() {
        return second;
    }
}

package com.buaa;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * @ProjectName SecondarySort
 * @PackageName com.buaa
 * @ClassName SecondarySort
 * @Description TODO
 * @Author 劉吉超
 * @Date 2016-06-07 22:40:37
 */
@SuppressWarnings("deprecation")
public class SecondarySort {

    public static class Map extends Mapper<LongWritable, Text, IntPair, IntWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            int left = 0;
            int right = 0;
            // Blank lines produce no tokens and are skipped
            if (tokenizer.hasMoreTokens()) {
                left = Integer.parseInt(tokenizer.nextToken());
                if (tokenizer.hasMoreTokens())
                    right = Integer.parseInt(tokenizer.nextToken());
                // Composite key (left, right); the value repeats right
                context.write(new IntPair(left, right), new IntWritable(right));
            }
        }
    }

    /**
     * Custom partitioner class FirstPartitioner: partitions by the first field of IntPair
     */
    public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
        @Override
        public int getPartition(IntPair key, IntWritable value, int numPartitions) {
            // Mask the sign bit rather than call Math.abs, which stays negative for Integer.MIN_VALUE
            return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
        }
    }

    /**
     * Custom GroupingComparator class: groups the data within each partition
     */
    @SuppressWarnings("rawtypes")
    public static class GroupingComparator extends WritableComparator {
        protected GroupingComparator() {
            super(IntPair.class, true);
        }

        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            IntPair ip1 = (IntPair) w1;
            IntPair ip2 = (IntPair) w2;
            int l = ip1.getFirst();
            int r = ip2.getFirst();
            return l == r ? 0 : (l < r ? -1 : 1);
        }
    }

    public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(IntPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            for (IntWritable val : values) {
                context.write(new Text(Integer.toString(key.getFirst())), val);
            }
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // Load the configuration
        Configuration conf = new Configuration();
        // If the output path already exists as a directory, delete it
        Path mypath = new Path(args[1]);
        FileSystem hdfs = mypath.getFileSystem(conf);
        if (hdfs.isDirectory(mypath)) {
            hdfs.delete(mypath, true);
        }

        // Job.getInstance replaces the deprecated Job constructor
        Job job = Job.getInstance(conf, "secondarysort");
        // Main class
        job.setJarByClass(SecondarySort.class);

        // Input path
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        // Output path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Mapper
        job.setMapperClass(Map.class);
        // Reducer
        job.setReducerClass(Reduce.class);

        // Partitioner
        job.setPartitionerClass(FirstPartitioner.class);

        // This example defines no custom SortComparator (no job.setSortComparatorClass() call);
        // keys are sorted with IntPair's compareTo method instead

        // Grouping comparator
        job.setGroupingComparatorClass(GroupingComparator.class);

        // Map output key type
        job.setMapOutputKeyClass(IntPair.class);
        // Map output value type
        job.setMapOutputValueClass(IntWritable.class);

        // Reduce output key type
        job.setOutputKeyClass(Text.class);
        // Reduce output value type
        job.setOutputValueClass(IntWritable.class);

        // Input format
        job.setInputFormatClass(TextInputFormat.class);
        // Output format
        job.setOutputFormatClass(TextOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
