Java学习-Day15

Huffman 编码 (建树)

一、描述

在之前的基础上增加了两个函数. 一个是构建字母表 constructAlphabet() , 另一个就是通过字母表来建立起 Huffman 树 constructTree().

二、构建字母表

1. 三部分

完整的字母映射其实是需要存储出现字符和每个字符出现次数的两个辅助结构才能达到查询的目的.

在代码中这三部分使用三个数组来表示. 分别是 alphabet 、 charCounts 和 charMapping .

alphabet 中存储的是输入字符串中出现了哪些字符.

charCounts 中存储的是每个字符出现的次数, 我们可以把它看做每个字符的权重. 需要注意的是 charCounts 数组的长度为 alphabet 长度的两倍减一. 因为构建 Huffman 树时, 当节点结合之后会产生新的权值. 多出来的部分用于存储这部分的值. 在初始化后字符和它的权重在不同数组的下标是一样的可以做到一一对应.

charMapping 存储的是某个字符在 alphabet 或 charCounts 的数组下标.

整体流程是先获取字符的 ASCII 码, 以 ASCII 码为下标在 charMapping 查找到该字符在 alphabet 和 charCounts 中的 下标从而获得需要的数据.

2. 具体代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
/**
*********************
* Construct the alphabet. The results are stored in the member variables
* charMapping and alphabet.
*********************
*/
public void constructAlphabet() {
// Initialize.
Arrays.fill(charMapping, -1);

// The count for each char. At most NUM_CHARS chars.
int[] tempCharCounts = new int[NUM_CHARS];

// The index of the char in the ASCII charset.
int tempCharIndex;

// Step 1. Scan the string to obtain the counts.
char tempChar;
for (int i = 0; i < inputText.length(); i++) {
tempChar = inputText.charAt(i);
tempCharIndex = (int) tempChar;

System.out.print("" + tempCharIndex + " ");

tempCharCounts[tempCharIndex]++;
} // Of for i

// Step 2. Scan to determine the size of the alphabet.
alphabetLength = 0;
for (int i = 0; i < 255; i++) {
if (tempCharCounts[i] > 0) {
alphabetLength++;
} // Of if
} // Of for i

// Step 3. Compress to the alphabet
alphabet = new char[alphabetLength];
charCounts = new int[2 * alphabetLength - 1];

int tempCounter = 0;
for (int i = 0; i < NUM_CHARS; i++) {
if (tempCharCounts[i] > 0) {
alphabet[tempCounter] = (char) i;
charCounts[tempCounter] = tempCharCounts[i];
charMapping[i] = tempCounter;
tempCounter++;
} // Of if
} // Of for i

System.out.println();
System.out.println("The alphabet is: " + Arrays.toString(alphabet));
System.out.println("Their counts are: " + Arrays.toString(charCounts));
System.out.println("The char mappings are: " + Arrays.toString(charMapping));
}// Of constructAlphabet

3. 运行截图

三、建立 Huffman 树

1. 描述

找 charCounts 中最小的两个值作为左右子节点, 然后加起来构成一个新的值加入到 charCounts 中, 这里就对应了之前为什么要设置 charCounts 的长度为 alphabet 的两倍减一.

然后就是就是将这些节点连接起来, 和之前构造树结构是一样的处理方法.

2. 具体代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
/**
*********************
* Construct the tree.
*********************
*/
public void constructTree() {
// Step 1. Allocate space.
nodes = new HuffmanNode[alphabetLength * 2 - 1];
boolean[] tempProcessed = new boolean[alphabetLength * 2 - 1];

// Step 2. Initialize leaves.
for (int i = 0; i < alphabetLength; i++) {
nodes[i] = new HuffmanNode(alphabet[i], charCounts[i], null, null, null);
} // Of for i

// Step 3. Construct the tree.
int tempLeft, tempRight, tempMinimal;
for (int i = alphabetLength; i < 2 * alphabetLength - 1; i++) {
// Step 3.1 Select the first minimal as the left child.
tempLeft = -1;
tempMinimal = Integer.MAX_VALUE;
for (int j = 0; j < i; j++) {
if (tempProcessed[j]) {
continue;
} // Of if

if (tempMinimal > charCounts[j]) {
tempMinimal = charCounts[j];
tempLeft = j;
} // Of if
} // Of for j
tempProcessed[tempLeft] = true;

// Step 3.2 Select the second minimal as the right child.
tempRight = -1;
tempMinimal = Integer.MAX_VALUE;
for (int j = 0; j < i; j++) {
if (tempProcessed[j]) {
continue;
} // Of if

if (tempMinimal > charCounts[j]) {
tempMinimal = charCounts[j];
tempRight = j;
} // Of if
} // Of for j
tempProcessed[tempRight] = true;
System.out.println("Selecting " + tempLeft + " and " + tempRight);

// Step 3.3 Construct the new node.
charCounts[i] = charCounts[tempLeft] + charCounts[tempRight];
nodes[i] = new HuffmanNode('*', charCounts[i], nodes[tempLeft], nodes[tempRight], null);

// Step 3.4 Link with children.
nodes[tempLeft].parent = nodes[i];
nodes[tempRight].parent = nodes[i];
System.out.println("The children of " + i + " are " + tempLeft + " and " + tempRight);
} // Of for i
}// Of constructTree

3. 运行截图

Huffman 编码 (编码与解码)

一、描述

将一棵 Huffman 树转换为 Huffman 编码. 毕竟一棵只在内存里面的树对于现实没有很大作用, 我们需要的是将信息压缩以及将压缩信息还原.

二、编码

1. 描述

将一段字符串转换为一段二进制, 这里为了简单起见用字符串代表二进制.

从根开始, 向左编码 0 , 向右编码 1.

2. 具体内容

输入

一行字符串

输出

由 0 1 组成的字符串

具体代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
/**
*********************
* Encode the given string.
*
* @param paraString The given string.
*********************
*/
public String coding(String paraString) {
String resultCodeString = "";

int tempIndex;
for (int i = 0; i < paraString.length(); i++) {
// From the original char to the location in the alphabet.
tempIndex = charMapping[(int) paraString.charAt(i)];

// From the location in the alphabet to the code.
resultCodeString += huffmanCodes[tempIndex];
} // Of for i
return resultCodeString;
}// Of coding

运行截图

三、解码

1. 描述

将一段二进制字符串转换为一个字符串.

节点从根开始, 二进制字符串扫描从左到右. 扫描遇到 0 , 节点往左节点移动. 扫描遇到 1 , 节点往右节点移动. 若解析到字符节点退回到根节点位置.

这里字符节点的判断是通过左子树是否为 null. 因为根据 Huffman 树的结构, 字符节点所在的位置都是叶节点.

2. 具体内容

输入

由 0 1 组成的字符串

输出

一行字符串

具体代码

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
/**
*********************
* Decode the given string.
*
* @param paraString The given string.
*********************
*/
public String decoding(String paraString) {
String resultCodeString = "";

HuffmanNode tempNode = getRoot();

for (int i = 0; i < paraString.length(); i++) {
if (paraString.charAt(i) == '0') {
tempNode = tempNode.leftChild;
System.out.println(tempNode);
} else {
tempNode = tempNode.rightChild;
System.out.println(tempNode);
} // Of if

if (tempNode.leftChild == null) {
System.out.println("Decode one:" + tempNode);
// Decode one char.
resultCodeString += tempNode.character;

// Return to the root.
tempNode = getRoot();
} // Of if
} // Of for i

return resultCodeString;
}// Of decoding

运行截图

总结

编码和解码需要注意的是字符的取值必须要是之前在文本文件中出现过的, 那个文本文件实际上模拟的就是人书信时各字符出现的样例. 当然这部分并没有出现对不识别字符的处理.

附录(完整代码)

这部分代码比较多, 所以我把它拆开, 以及其中还有一些辅助函数.

完整代码如下

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
package datastructure.tree;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.stream.Collectors;

/**
* Huffman tree, encoding, and decoding. For simplicity, only ASCII characters
* are supported.
*
* @author Shihuai Wen Email:wshysxcc@outlook.com
*/
public class Huffman {
/**
* An inner class for Huffman nodes.
*/
class HuffmanNode {
/**
* The char. Only valid for leaf nodes.
*/
char character;

/**
* Weight. It can also be double.
*/
int weight;

/**
* The left child.
*/
HuffmanNode leftChild;

/**
* The right child.
*/
HuffmanNode rightChild;

/**
* The parent. It helps constructing the Huffman code of each character.
*/
HuffmanNode parent;

/**
*******************
* The first constructor
*******************
*/
public HuffmanNode(char paraCharacter, int paraWeight, HuffmanNode paraLeftChild, HuffmanNode paraRightChild,
HuffmanNode paraParent) {
character = paraCharacter;
weight = paraWeight;
leftChild = paraLeftChild;
rightChild = paraRightChild;
parent = paraParent;
}// Of HuffmanNode

/**
*******************
* To string.
*******************
*/
public String toString() {
String resultString = "(" + character + ", " + weight + ")";

return resultString;
}// Of toString

}// Of class HuffmanNode

/**
* The number of characters. 256 for ASCII.
*/
public static final int NUM_CHARS = 256;

/**
* The input text. It is stored in a string for simplicity.
*/
String inputText;

/**
* The length of the alphabet, also the number of leaves.
*/
int alphabetLength;

/**
* The alphabet.
*/
char[] alphabet;

/**
* The count of chars. The length is 2 * alphabetLength - 1 to include non-leaf
* nodes.
*/
int[] charCounts;

/**
* The mapping of chars to the indices in the alphabet.
*/
int[] charMapping;

/**
* Codes for each char in the alphabet. It should have the same length as
* alphabet.
*/
String[] huffmanCodes;

/**
* All nodes. The last node is the root.
*/
HuffmanNode[] nodes;

/**
*********************
* The first constructor.
*
* @param paraFilename The text filename.
*********************
*/
public Huffman(String paraFilename) {
charMapping = new int[NUM_CHARS];

readText(paraFilename);
}// Of the first constructor

/**
*********************
* Read text.
*
* @param paraFilename The text filename.
*********************
*/
public void readText(String paraFilename) {
try {
inputText = Files.newBufferedReader(Paths.get(paraFilename), StandardCharsets.UTF_8).lines()
.collect(Collectors.joining("\n"));
} catch (Exception ee) {
System.out.println(ee);
System.exit(0);
} // Of try

System.out.println("The text is:\r\n" + inputText);
}// Of readText

/**
*********************
* Construct the alphabet. The results are stored in the member variables
* charMapping and alphabet.
*********************
*/
public void constructAlphabet() {
// Initialize.
Arrays.fill(charMapping, -1);

// The count for each char. At most NUM_CHARS chars.
int[] tempCharCounts = new int[NUM_CHARS];

// The index of the char in the ASCII charset.
int tempCharIndex;

// Step 1. Scan the string to obtain the counts.
char tempChar;
for (int i = 0; i < inputText.length(); i++) {
tempChar = inputText.charAt(i);
tempCharIndex = (int) tempChar;

System.out.print("" + tempCharIndex + " ");

tempCharCounts[tempCharIndex]++;
} // Of for i

// Step 2. Scan to determine the size of the alphabet.
alphabetLength = 0;
for (int i = 0; i < 255; i++) {
if (tempCharCounts[i] > 0) {
alphabetLength++;
} // Of if
} // Of for i

// Step 3. Compress to the alphabet
alphabet = new char[alphabetLength];
charCounts = new int[2 * alphabetLength - 1];

int tempCounter = 0;
for (int i = 0; i < NUM_CHARS; i++) {
if (tempCharCounts[i] > 0) {
alphabet[tempCounter] = (char) i;
charCounts[tempCounter] = tempCharCounts[i];
charMapping[i] = tempCounter;
tempCounter++;
} // Of if
} // Of for i

System.out.println();
System.out.println("The alphabet is: " + Arrays.toString(alphabet));
System.out.println("Their counts are: " + Arrays.toString(charCounts));
System.out.println("The char mappings are: " + Arrays.toString(charMapping));
}// Of constructAlphabet

/**
*********************
* Construct the tree.
*********************
*/
public void constructTree() {
// Step 1. Allocate space.
nodes = new HuffmanNode[alphabetLength * 2 - 1];
boolean[] tempProcessed = new boolean[alphabetLength * 2 - 1];

// Step 2. Initialize leaves.
for (int i = 0; i < alphabetLength; i++) {
nodes[i] = new HuffmanNode(alphabet[i], charCounts[i], null, null, null);
} // Of for i

// Step 3. Construct the tree.
int tempLeft, tempRight, tempMinimal;
for (int i = alphabetLength; i < 2 * alphabetLength - 1; i++) {
// Step 3.1 Select the first minimal as the left child.
tempLeft = -1;
tempMinimal = Integer.MAX_VALUE;
for (int j = 0; j < i; j++) {
if (tempProcessed[j]) {
continue;
} // Of if

if (tempMinimal > charCounts[j]) {
tempMinimal = charCounts[j];
tempLeft = j;
} // Of if
} // Of for j
tempProcessed[tempLeft] = true;

// Step 3.2 Select the second minimal as the right child.
tempRight = -1;
tempMinimal = Integer.MAX_VALUE;
for (int j = 0; j < i; j++) {
if (tempProcessed[j]) {
continue;
} // Of if

if (tempMinimal > charCounts[j]) {
tempMinimal = charCounts[j];
tempRight = j;
} // Of if
} // Of for j
tempProcessed[tempRight] = true;
System.out.println("Selecting " + tempLeft + " and " + tempRight);

// Step 3.3 Construct the new node.
charCounts[i] = charCounts[tempLeft] + charCounts[tempRight];
nodes[i] = new HuffmanNode('*', charCounts[i], nodes[tempLeft], nodes[tempRight], null);

// Step 3.4 Link with children.
nodes[tempLeft].parent = nodes[i];
nodes[tempRight].parent = nodes[i];
System.out.println("The children of " + i + " are " + tempLeft + " and " + tempRight);
} // Of for i
}// Of constructTree

/**
*********************
* Get the root of the binary tree.
*
* @return The root.
*********************
*/
public HuffmanNode getRoot() {
return nodes[nodes.length - 1];
}// Of getRoot

/**
*********************
* Pre-order visit.
*********************
*/
public void preOrderVisit(HuffmanNode paraNode) {
System.out.print("(" + paraNode.character + ", " + paraNode.weight + ") ");

if (paraNode.leftChild != null) {
preOrderVisit(paraNode.leftChild);
} // Of if

if (paraNode.rightChild != null) {
preOrderVisit(paraNode.rightChild);
} // Of if
}// Of preOrderVisit

/**
*********************
* Generate codes for each character in the alphabet.
*********************
*/
public void generateCodes() {
huffmanCodes = new String[alphabetLength];
HuffmanNode tempNode;
for (int i = 0; i < alphabetLength; i++) {
tempNode = nodes[i];
// Use tempCharCode instead of tempCode such that it is unlike
// tempNode.
// This is an advantage of long names.
String tempCharCode = "";
while (tempNode.parent != null) {
if (tempNode == tempNode.parent.leftChild) {
tempCharCode = "0" + tempCharCode;
} else {
tempCharCode = "1" + tempCharCode;
} // Of if

tempNode = tempNode.parent;
} // Of while

huffmanCodes[i] = tempCharCode;
System.out.println("The code of " + alphabet[i] + " is " + tempCharCode);
} // Of for i
}// Of generateCodes

/**
*********************
* Encode the given string.
*
* @param paraString The given string.
*********************
*/
public String coding(String paraString) {
String resultCodeString = "";

int tempIndex;
for (int i = 0; i < paraString.length(); i++) {
// From the original char to the location in the alphabet.
tempIndex = charMapping[(int) paraString.charAt(i)];

// From the location in the alphabet to the code.
resultCodeString += huffmanCodes[tempIndex];
} // Of for i
return resultCodeString;
}// Of coding

/**
*********************
* Decode the given string.
*
* @param paraString The given string.
*********************
*/
public String decoding(String paraString) {
String resultCodeString = "";

HuffmanNode tempNode = getRoot();

for (int i = 0; i < paraString.length(); i++) {
if (paraString.charAt(i) == '0') {
tempNode = tempNode.leftChild;
System.out.println(tempNode);
} else {
tempNode = tempNode.rightChild;
System.out.println(tempNode);
} // Of if

if (tempNode.leftChild == null) {
System.out.println("Decode one:" + tempNode);
// Decode one char.
resultCodeString += tempNode.character;

// Return to the root.
tempNode = getRoot();
} // Of if
} // Of for i

return resultCodeString;
}// Of decoding

/**
*********************
* The entrance of the program.
*
* @param args Not used now.
*********************
*/
public static void main(String args[]) {
Huffman tempHuffman = new Huffman("D:/wenshihuai/huffmantext-small.txt");
tempHuffman.constructAlphabet();
tempHuffman.constructTree();

HuffmanNode tempRoot = tempHuffman.getRoot();
System.out.println("The root is: " + tempRoot);
System.out.println("Preorder visit:");
tempHuffman.preOrderVisit(tempHuffman.getRoot());

tempHuffman.generateCodes();

String tempCoded = tempHuffman.coding("a-efx");
System.out.println("Coded: " + tempCoded);
String tempDecoded = tempHuffman.decoding(tempCoded);
System.out.println("Decoded: " + tempDecoded);

}// Of main

} // Of class Huffman