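Java Learning - Day 15

Huffman Coding (Building the Tree)

I. Description

Two functions are added on top of the previous code. One is constructAlphabet(), which builds the alphabet; the other is constructTree(), which builds the Huffman tree from that alphabet.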

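II. Building the alphabet

1. Three parts

Besides the mapping itself, a complete character mapping needs two auxiliary structures: one storing which characters occur and one storing how many times each occurs. Lookups only work with all three in place.

In the code these three parts are three arrays: alphabet, charCounts, and charMapping.

alphabet stores which characters occur in the input string.

charCounts stores how many times each character occurs; we can treat this count as the character's weight. Note that charCounts has length 2 * alphabetLength - 1. Building the Huffman tree merges two nodes at a time, and each merge produces a new weight. With n leaves there are n - 1 merges, so the extra n - 1 slots hold those merged weights. After initialization, a character and its weight sit at the same index in their respective arrays, so the two correspond one to one.

charMapping is indexed by character code and stores that character's index in alphabet and charCounts, or -1 if the character does not occur.

The overall flow is: take the character's ASCII code, use it as the index into charMapping to find the character's index in alphabet and charCounts, and read the required data from there.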
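The lookup flow above can be sketched in isolation. This is a minimal illustration, not the article's class: the name AlphabetLookupSketch and the helper countOf are hypothetical, and the constructor simply mirrors what constructAlphabet() does, assuming every input character has a code below 256.

```java
import java.util.Arrays;

class AlphabetLookupSketch {
    static final int NUM_CHARS = 256;

    char[] alphabet;
    int[] charCounts;
    int[] charMapping = new int[NUM_CHARS];

    AlphabetLookupSketch(String paraText) {
        Arrays.fill(charMapping, -1);
        int[] tempCounts = new int[NUM_CHARS];
        int tempLength = 0;
        for (int i = 0; i < paraText.length(); i++) {
            // Assumption: every character code is below NUM_CHARS.
            if (tempCounts[paraText.charAt(i)]++ == 0) {
                tempLength++;
            }
        }
        alphabet = new char[tempLength];
        // Room for the leaves plus the weights of the merged nodes.
        charCounts = new int[Math.max(2 * tempLength - 1, 0)];
        int tempCounter = 0;
        for (int i = 0; i < NUM_CHARS; i++) {
            if (tempCounts[i] > 0) {
                alphabet[tempCounter] = (char) i;
                charCounts[tempCounter] = tempCounts[i];
                charMapping[i] = tempCounter++;
            }
        }
    }

    /** Character code -> charMapping -> shared index -> count (0 if absent). */
    int countOf(char paraChar) {
        int tempIndex = paraChar < NUM_CHARS ? charMapping[paraChar] : -1;
        return tempIndex < 0 ? 0 : charCounts[tempIndex];
    }

    public static void main(String[] args) {
        AlphabetLookupSketch tempSketch = new AlphabetLookupSketch("abbccc");
        System.out.println(Arrays.toString(tempSketch.alphabet)); // prints [a, b, c]
        System.out.println(tempSketch.countOf('c')); // prints 3
        System.out.println(tempSketch.countOf('z')); // prints 0
    }
}
```

For "abbccc" the alphabet has three entries, so charCounts has length 5: three leaf weights followed by two slots that stay 0 until the tree is built.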

2. Code

/**
 *********************
 * Construct the alphabet. The results are stored in the member variables
 * charMapping and alphabet.
 *********************
 */
public void constructAlphabet() {
	// Initialize.
	Arrays.fill(charMapping, -1);

	// The count for each char. At most NUM_CHARS chars.
	int[] tempCharCounts = new int[NUM_CHARS];

	// The index of the char in the ASCII charset.
	int tempCharIndex;

	// Step 1. Scan the string to obtain the counts.
	char tempChar;
	for (int i = 0; i < inputText.length(); i++) {
		tempChar = inputText.charAt(i);
		tempCharIndex = (int) tempChar;

		System.out.print("" + tempCharIndex + " ");

		tempCharCounts[tempCharIndex]++;
	} // Of for i

	// Step 2. Scan to determine the size of the alphabet.
	alphabetLength = 0;
	for (int i = 0; i < NUM_CHARS; i++) {
		if (tempCharCounts[i] > 0) {
			alphabetLength++;
		} // Of if
	} // Of for i

	// Step 3. Compress to the alphabet
	alphabet = new char[alphabetLength];
	charCounts = new int[2 * alphabetLength - 1];

	int tempCounter = 0;
	for (int i = 0; i < NUM_CHARS; i++) {
		if (tempCharCounts[i] > 0) {
			alphabet[tempCounter] = (char) i;
			charCounts[tempCounter] = tempCharCounts[i];
			charMapping[i] = tempCounter;
			tempCounter++;
		} // Of if
	} // Of for i

	System.out.println();
	System.out.println("The alphabet is: " + Arrays.toString(alphabet));
	System.out.println("Their counts are: " + Arrays.toString(charCounts));
	System.out.println("The char mappings are: " + Arrays.toString(charMapping));
}// Of constructAlphabet


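III. Building the Huffman tree

1. Description

Find the two smallest values in charCounts to serve as the left and right children, then add them to form a new value that is stored in charCounts. This is why charCounts was given length 2 * alphabetLength - 1 earlier.

The nodes are then linked together, in the same way as when we built tree structures before.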
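To see how the extra slots fill up, here is a small sketch that performs only the weight merging, without creating any nodes. The names MergeWeightsSketch, selectMinimal, and mergeWeights are hypothetical; the selection rule (the first smallest unprocessed weight wins) is meant to match the description above, and at least one leaf is assumed.

```java
import java.util.Arrays;

class MergeWeightsSketch {
    /** Index of the smallest unprocessed weight among the first paraLimit slots. */
    static int selectMinimal(int[] paraCounts, boolean[] paraProcessed, int paraLimit) {
        int resultIndex = -1;
        for (int j = 0; j < paraLimit; j++) {
            if (!paraProcessed[j]
                    && (resultIndex == -1 || paraCounts[j] < paraCounts[resultIndex])) {
                resultIndex = j;
            }
        }
        return resultIndex;
    }

    /** Leaf weights in, all 2n - 1 node weights out. Assumes n >= 1. */
    static int[] mergeWeights(int[] paraLeafWeights) {
        int tempLeaves = paraLeafWeights.length;
        int[] resultCounts = Arrays.copyOf(paraLeafWeights, 2 * tempLeaves - 1);
        boolean[] tempProcessed = new boolean[resultCounts.length];
        for (int i = tempLeaves; i < resultCounts.length; i++) {
            // Pick the two smallest unprocessed weights, then store their sum in slot i.
            int tempLeft = selectMinimal(resultCounts, tempProcessed, i);
            tempProcessed[tempLeft] = true;
            int tempRight = selectMinimal(resultCounts, tempProcessed, i);
            tempProcessed[tempRight] = true;
            resultCounts[i] = resultCounts[tempLeft] + resultCounts[tempRight];
        }
        return resultCounts;
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 3, 6]
        System.out.println(Arrays.toString(mergeWeights(new int[] { 1, 2, 3 })));
    }
}
```

With weights 1, 2, 3 the first merge joins 1 and 2 into slot 3, and the second joins the leaf 3 with the merged 3 into slot 4, which becomes the root. The last slot always equals the sum of all leaf weights.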

2. Code

/**
 *********************
 * Construct the tree.
 *********************
 */
public void constructTree() {
	// Step 1. Allocate space.
	nodes = new HuffmanNode[alphabetLength * 2 - 1];
	boolean[] tempProcessed = new boolean[alphabetLength * 2 - 1];

	// Step 2. Initialize leaves.
	for (int i = 0; i < alphabetLength; i++) {
		nodes[i] = new HuffmanNode(alphabet[i], charCounts[i], null, null, null);
	} // Of for i

	// Step 3. Construct the tree.
	int tempLeft, tempRight, tempMinimal;
	for (int i = alphabetLength; i < 2 * alphabetLength - 1; i++) {
		// Step 3.1 Select the first minimal as the left child.
		tempLeft = -1;
		tempMinimal = Integer.MAX_VALUE;
		for (int j = 0; j < i; j++) {
			if (tempProcessed[j]) {
				continue;
			} // Of if

			if (tempMinimal > charCounts[j]) {
				tempMinimal = charCounts[j];
				tempLeft = j;
			} // Of if
		} // Of for j
		tempProcessed[tempLeft] = true;

		// Step 3.2 Select the second minimal as the right child.
		tempRight = -1;
		tempMinimal = Integer.MAX_VALUE;
		for (int j = 0; j < i; j++) {
			if (tempProcessed[j]) {
				continue;
			} // Of if

			if (tempMinimal > charCounts[j]) {
				tempMinimal = charCounts[j];
				tempRight = j;
			} // Of if
		} // Of for j
		tempProcessed[tempRight] = true;
		System.out.println("Selecting " + tempLeft + " and " + tempRight);

		// Step 3.3 Construct the new node.
		charCounts[i] = charCounts[tempLeft] + charCounts[tempRight];
		nodes[i] = new HuffmanNode('*', charCounts[i], nodes[tempLeft], nodes[tempRight], null);

		// Step 3.4 Link with children.
		nodes[tempLeft].parent = nodes[i];
		nodes[tempRight].parent = nodes[i];
		System.out.println("The children of " + i + " are " + tempLeft + " and " + tempRight);
	} // Of for i
}// Of constructTree
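The double scan in constructTree() takes time linear in the number of nodes for every merge. A common alternative, which is not what this article's code does, is to keep the weights in a min-heap so that each selection is logarithmic. The sketch below uses java.util.PriorityQueue and, to stay short, tracks only weights: it returns the total number of code bits, which is the sum of all merged weights. The class and method names are hypothetical.

```java
import java.util.PriorityQueue;

class HeapMergeSketch {
    /** Total number of code bits needed for the given leaf weights. */
    static long encodedLength(int[] paraLeafWeights) {
        PriorityQueue<Long> tempQueue = new PriorityQueue<>();
        for (int tempWeight : paraLeafWeights) {
            tempQueue.add((long) tempWeight);
        }

        long resultBits = 0;
        while (tempQueue.size() > 1) {
            // Remove the two smallest weights and put their sum back.
            long tempMerged = tempQueue.poll() + tempQueue.poll();
            resultBits += tempMerged;
            tempQueue.add(tempMerged);
        }
        return resultBits;
    }

    public static void main(String[] args) {
        System.out.println(encodedLength(new int[] { 1, 2, 3 })); // prints 9
    }
}
```

For weights 1, 2, 3 the merges produce 3 and then 6, so the text needs 9 bits: the character with weight 3 gets a 1-bit code and the other two get 2-bit codes.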


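Huffman Coding (Encoding and Decoding)

I. Description

Convert a Huffman tree into Huffman codes. A tree that lives only in memory is of little practical use; what we need is to compress information and to restore the compressed information.

II. Encoding

1. Description

Convert a string into a binary sequence. For simplicity, a string of characters stands in for the binary data here.

Starting from the root, going left encodes 0 and going right encodes 1.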
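The left-is-0, right-is-1 rule can be sketched as a top-down walk that collects one code per leaf. The generateCodes() method in the appendix climbs from each leaf to the root through the parent links instead; both directions yield the same codes. The Node class and the names below are hypothetical stand-ins for illustration, and every internal node is assumed to have two children.

```java
import java.util.Map;
import java.util.TreeMap;

class CodeAssignmentSketch {
    /** Hypothetical minimal node: a leaf has no children. */
    static class Node {
        final char character;
        final Node leftChild;
        final Node rightChild;

        Node(char paraCharacter, Node paraLeftChild, Node paraRightChild) {
            character = paraCharacter;
            leftChild = paraLeftChild;
            rightChild = paraRightChild;
        }
    }

    /** Walk down from paraNode, extending the prefix with 0 (left) or 1 (right). */
    static void assign(Node paraNode, String paraPrefix, Map<Character, String> paraCodes) {
        if (paraNode.leftChild == null) {
            // A leaf: the path taken so far is this character's code.
            paraCodes.put(paraNode.character, paraPrefix);
            return;
        }
        assign(paraNode.leftChild, paraPrefix + "0", paraCodes);
        assign(paraNode.rightChild, paraPrefix + "1", paraCodes);
    }

    static Map<Character, String> codesFor(Node paraRoot) {
        Map<Character, String> resultCodes = new TreeMap<>();
        assign(paraRoot, "", resultCodes);
        return resultCodes;
    }

    public static void main(String[] args) {
        // Tree for weights a = 1, b = 2, c = 3: c on the left, (a, b) on the right.
        Node tempInner = new Node('*', new Node('a', null, null), new Node('b', null, null));
        Node tempRoot = new Node('*', new Node('c', null, null), tempInner);
        System.out.println(codesFor(tempRoot)); // prints {a=10, b=11, c=0}
    }
}
```

The most frequent character, c, ends up closest to the root and therefore gets the shortest code.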

2. Details

Input

A line of text

Output

A string consisting of 0s and 1s

Code

/**
 *********************
 * Encode the given string.
 * 
 * @param paraString The given string.
 *********************
 */
public String coding(String paraString) {
	String resultCodeString = "";

	int tempIndex;
	for (int i = 0; i < paraString.length(); i++) {
		// From the original char to the location in the alphabet.
		tempIndex = charMapping[(int) paraString.charAt(i)];

		// From the location in the alphabet to the code.
		resultCodeString += huffmanCodes[tempIndex];
	} // Of for i
	return resultCodeString;
}// Of coding


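III. Decoding

1. Description

Convert a binary string back into a string of characters.

Start at the root and scan the binary string from left to right. On a 0, move to the left child; on a 1, move to the right child. When a character node is reached, output its character and return to the root.

A character node is detected by checking whether its left child is null. This works because in a Huffman tree every character sits at a leaf, and every internal node has both children.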
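The walk described above can be tried on a tiny hand-built tree. This is a sketch under assumptions: the Node class, sampleTree, and decode are hypothetical names, the input is assumed to contain only '0' and '1', and it is assumed to end exactly on a leaf.

```java
class DecodeWalkSketch {
    /** Hypothetical minimal node: a leaf has no children. */
    static class Node {
        final char character;
        final Node leftChild;
        final Node rightChild;

        Node(char paraCharacter, Node paraLeftChild, Node paraRightChild) {
            character = paraCharacter;
            leftChild = paraLeftChild;
            rightChild = paraRightChild;
        }
    }

    /** Tree with codes c = 0, a = 10, b = 11. */
    static Node sampleTree() {
        Node tempInner = new Node('*', new Node('a', null, null), new Node('b', null, null));
        return new Node('*', new Node('c', null, null), tempInner);
    }

    static String decode(Node paraRoot, String paraBits) {
        StringBuilder resultText = new StringBuilder();
        Node tempNode = paraRoot;
        for (int i = 0; i < paraBits.length(); i++) {
            // 0 moves to the left child, anything else to the right child.
            tempNode = paraBits.charAt(i) == '0' ? tempNode.leftChild : tempNode.rightChild;

            if (tempNode.leftChild == null) {
                // Reached a leaf: emit its character and restart at the root.
                resultText.append(tempNode.character);
                tempNode = paraRoot;
            }
        }
        return resultText.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode(sampleTree(), "01011")); // prints cab
    }
}
```

No separators are needed between codes: because characters live only at leaves, no code is a prefix of another, so the walk always knows where one character ends.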

2. Details

Input

A string consisting of 0s and 1s

Output

A line of text

Code

/**
 *********************
 * Decode the given string.
 * 
 * @param paraString The given string.
 *********************
 */
public String decoding(String paraString) {
	String resultCodeString = "";

	HuffmanNode tempNode = getRoot();

	for (int i = 0; i < paraString.length(); i++) {
		if (paraString.charAt(i) == '0') {
			tempNode = tempNode.leftChild;
			System.out.println(tempNode);
		} else {
			tempNode = tempNode.rightChild;
			System.out.println(tempNode);
		} // Of if

		if (tempNode.leftChild == null) {
			System.out.println("Decode one:" + tempNode);
			// Decode one char.
			resultCodeString += tempNode.character;

			// Return to the root.
			tempNode = getRoot();
		} // Of if
	} // Of for i

	return resultCodeString;
}// Of decoding


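Summary

One thing to note when encoding and decoding: every character must have appeared in the text file, which in effect serves as a sample of how often each character occurs in ordinary written text. The code here does not handle unrecognized characters: for a character that never appeared, charMapping holds -1, so coding() fails with an ArrayIndexOutOfBoundsException when it indexes huffmanCodes.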
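One possible way to handle unrecognized characters is to check the mapping before using it and to fail with a clear message. The sketch below is an assumption about how this could be done, not part of the original class: SafeEncodingSketch and encode are hypothetical names, and the mapping and code arrays are passed in so the example stands alone.

```java
import java.util.Arrays;

class SafeEncodingSketch {
    static String encode(String paraString, int[] paraCharMapping, String[] paraCodes) {
        StringBuilder resultCode = new StringBuilder();
        for (int i = 0; i < paraString.length(); i++) {
            char tempChar = paraString.charAt(i);
            // Characters beyond the mapping table are treated like unseen ones.
            int tempIndex = tempChar < paraCharMapping.length ? paraCharMapping[tempChar] : -1;
            if (tempIndex < 0) {
                throw new IllegalArgumentException(
                        "Character not in the alphabet: '" + tempChar + "' at position " + i);
            }
            resultCode.append(paraCodes[tempIndex]);
        }
        return resultCode.toString();
    }

    public static void main(String[] args) {
        // Mapping and codes for the alphabet {a, b, c} with codes a = 10, b = 11, c = 0.
        int[] tempMapping = new int[256];
        Arrays.fill(tempMapping, -1);
        tempMapping['a'] = 0;
        tempMapping['b'] = 1;
        tempMapping['c'] = 2;
        String[] tempCodes = { "10", "11", "0" };

        System.out.println(encode("cab", tempMapping, tempCodes)); // prints 01011
        try {
            encode("cax", tempMapping, tempCodes);
        } catch (IllegalArgumentException ee) {
            System.out.println(ee.getMessage());
        }
    }
}
```

Throwing is only one choice; skipping the character or reserving an escape code would also work, at the cost of a more complicated decoder.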

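Appendix (complete code)

There is a fair amount of code, so I split it up in the sections above. The complete version also contains some helper functions.

The complete code is as follows: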

package datastructure.tree;

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.stream.Collectors;

/**
 * Huffman tree, encoding, and decoding. For simplicity, only ASCII characters
 * are supported.
 * 
 * @author Shihuai Wen Email:wshysxcc@outlook.com
 */
public class Huffman {
	/**
	 * An inner class for Huffman nodes.
	 */
	class HuffmanNode {
		/**
		 * The char. Only valid for leaf nodes.
		 */
		char character;

		/**
		 * Weight. It can also be double.
		 */
		int weight;

		/**
		 * The left child.
		 */
		HuffmanNode leftChild;

		/**
		 * The right child.
		 */
		HuffmanNode rightChild;

		/**
		 * The parent. It helps constructing the Huffman code of each character.
		 */
		HuffmanNode parent;

		/**
		 ******************* 
		 * The first constructor
		 ******************* 
		 */
		public HuffmanNode(char paraCharacter, int paraWeight, HuffmanNode paraLeftChild, HuffmanNode paraRightChild,
				HuffmanNode paraParent) {
			character = paraCharacter;
			weight = paraWeight;
			leftChild = paraLeftChild;
			rightChild = paraRightChild;
			parent = paraParent;
		}// Of HuffmanNode

		/**
		 ******************* 
		 * To string.
		 ******************* 
		 */
		public String toString() {
			String resultString = "(" + character + ", " + weight + ")";

			return resultString;
		}// Of toString

	}// Of class HuffmanNode

	/**
	 * The number of characters. 256 covers ASCII (0-127) and the codes up to 255.
	 */
	public static final int NUM_CHARS = 256;

	/**
	 * The input text. It is stored in a string for simplicity.
	 */
	String inputText;

	/**
	 * The length of the alphabet, also the number of leaves.
	 */
	int alphabetLength;

	/**
	 * The alphabet.
	 */
	char[] alphabet;

	/**
	 * The count of chars. The length is 2 * alphabetLength - 1 to include non-leaf
	 * nodes.
	 */
	int[] charCounts;

	/**
	 * The mapping of chars to the indices in the alphabet.
	 */
	int[] charMapping;

	/**
	 * Codes for each char in the alphabet. It should have the same length as
	 * alphabet.
	 */
	String[] huffmanCodes;

	/**
	 * All nodes. The last node is the root.
	 */
	HuffmanNode[] nodes;

	/**
	 *********************
	 * The first constructor.
	 * 
	 * @param paraFilename The text filename.
	 *********************
	 */
	public Huffman(String paraFilename) {
		charMapping = new int[NUM_CHARS];

		readText(paraFilename);
	}// Of the first constructor

	/**
	 *********************
	 * Read text.
	 * 
	 * @param paraFilename The text filename.
	 *********************
	 */
	public void readText(String paraFilename) {
		try {
			inputText = Files.newBufferedReader(Paths.get(paraFilename), StandardCharsets.UTF_8).lines()
					.collect(Collectors.joining("\n"));
		} catch (Exception ee) {
			System.out.println(ee);
			System.exit(1);
		} // Of try

		System.out.println("The text is:\r\n" + inputText);
	}// Of readText

	/**
	 *********************
	 * Construct the alphabet. The results are stored in the member variables
	 * charMapping and alphabet.
	 *********************
	 */
	public void constructAlphabet() {
		// Initialize.
		Arrays.fill(charMapping, -1);

		// The count for each char. At most NUM_CHARS chars.
		int[] tempCharCounts = new int[NUM_CHARS];

		// The index of the char in the ASCII charset.
		int tempCharIndex;

		// Step 1. Scan the string to obtain the counts.
		char tempChar;
		for (int i = 0; i < inputText.length(); i++) {
			tempChar = inputText.charAt(i);
			tempCharIndex = (int) tempChar;

			System.out.print("" + tempCharIndex + " ");

			tempCharCounts[tempCharIndex]++;
		} // Of for i

		// Step 2. Scan to determine the size of the alphabet.
		alphabetLength = 0;
		for (int i = 0; i < NUM_CHARS; i++) {
			if (tempCharCounts[i] > 0) {
				alphabetLength++;
			} // Of if
		} // Of for i

		// Step 3. Compress to the alphabet
		alphabet = new char[alphabetLength];
		charCounts = new int[2 * alphabetLength - 1];

		int tempCounter = 0;
		for (int i = 0; i < NUM_CHARS; i++) {
			if (tempCharCounts[i] > 0) {
				alphabet[tempCounter] = (char) i;
				charCounts[tempCounter] = tempCharCounts[i];
				charMapping[i] = tempCounter;
				tempCounter++;
			} // Of if
		} // Of for i

		System.out.println();
		System.out.println("The alphabet is: " + Arrays.toString(alphabet));
		System.out.println("Their counts are: " + Arrays.toString(charCounts));
		System.out.println("The char mappings are: " + Arrays.toString(charMapping));
	}// Of constructAlphabet

	/**
	 *********************
	 * Construct the tree.
	 *********************
	 */
	public void constructTree() {
		// Step 1. Allocate space.
		nodes = new HuffmanNode[alphabetLength * 2 - 1];
		boolean[] tempProcessed = new boolean[alphabetLength * 2 - 1];

		// Step 2. Initialize leaves.
		for (int i = 0; i < alphabetLength; i++) {
			nodes[i] = new HuffmanNode(alphabet[i], charCounts[i], null, null, null);
		} // Of for i

		// Step 3. Construct the tree.
		int tempLeft, tempRight, tempMinimal;
		for (int i = alphabetLength; i < 2 * alphabetLength - 1; i++) {
			// Step 3.1 Select the first minimal as the left child.
			tempLeft = -1;
			tempMinimal = Integer.MAX_VALUE;
			for (int j = 0; j < i; j++) {
				if (tempProcessed[j]) {
					continue;
				} // Of if

				if (tempMinimal > charCounts[j]) {
					tempMinimal = charCounts[j];
					tempLeft = j;
				} // Of if
			} // Of for j
			tempProcessed[tempLeft] = true;

			// Step 3.2 Select the second minimal as the right child.
			tempRight = -1;
			tempMinimal = Integer.MAX_VALUE;
			for (int j = 0; j < i; j++) {
				if (tempProcessed[j]) {
					continue;
				} // Of if

				if (tempMinimal > charCounts[j]) {
					tempMinimal = charCounts[j];
					tempRight = j;
				} // Of if
			} // Of for j
			tempProcessed[tempRight] = true;
			System.out.println("Selecting " + tempLeft + " and " + tempRight);

			// Step 3.3 Construct the new node.
			charCounts[i] = charCounts[tempLeft] + charCounts[tempRight];
			nodes[i] = new HuffmanNode('*', charCounts[i], nodes[tempLeft], nodes[tempRight], null);

			// Step 3.4 Link with children.
			nodes[tempLeft].parent = nodes[i];
			nodes[tempRight].parent = nodes[i];
			System.out.println("The children of " + i + " are " + tempLeft + " and " + tempRight);
		} // Of for i
	}// Of constructTree

	/**
	 *********************
	 * Get the root of the binary tree.
	 * 
	 * @return The root.
	 *********************
	 */
	public HuffmanNode getRoot() {
		return nodes[nodes.length - 1];
	}// Of getRoot

	/**
	 *********************
	 * Pre-order visit.
	 *********************
	 */
	public void preOrderVisit(HuffmanNode paraNode) {
		System.out.print("(" + paraNode.character + ", " + paraNode.weight + ") ");

		if (paraNode.leftChild != null) {
			preOrderVisit(paraNode.leftChild);
		} // Of if

		if (paraNode.rightChild != null) {
			preOrderVisit(paraNode.rightChild);
		} // Of if
	}// Of preOrderVisit

	/**
	 *********************
	 * Generate codes for each character in the alphabet.
	 *********************
	 */
	public void generateCodes() {
		huffmanCodes = new String[alphabetLength];
		HuffmanNode tempNode;
		for (int i = 0; i < alphabetLength; i++) {
			tempNode = nodes[i];
			// Use tempCharCode instead of tempCode such that it is unlike
			// tempNode.
			// This is an advantage of long names.
			String tempCharCode = "";
			while (tempNode.parent != null) {
				if (tempNode == tempNode.parent.leftChild) {
					tempCharCode = "0" + tempCharCode;
				} else {
					tempCharCode = "1" + tempCharCode;
				} // Of if

				tempNode = tempNode.parent;
			} // Of while

			huffmanCodes[i] = tempCharCode;
			System.out.println("The code of " + alphabet[i] + " is " + tempCharCode);
		} // Of for i
	}// Of generateCodes

	/**
	 *********************
	 * Encode the given string.
	 * 
	 * @param paraString The given string.
	 *********************
	 */
	public String coding(String paraString) {
		String resultCodeString = "";

		int tempIndex;
		for (int i = 0; i < paraString.length(); i++) {
			// From the original char to the location in the alphabet.
			tempIndex = charMapping[(int) paraString.charAt(i)];

			// From the location in the alphabet to the code.
			resultCodeString += huffmanCodes[tempIndex];
		} // Of for i
		return resultCodeString;
	}// Of coding

	/**
	 *********************
	 * Decode the given string.
	 * 
	 * @param paraString The given string.
	 *********************
	 */
	public String decoding(String paraString) {
		String resultCodeString = "";

		HuffmanNode tempNode = getRoot();

		for (int i = 0; i < paraString.length(); i++) {
			if (paraString.charAt(i) == '0') {
				tempNode = tempNode.leftChild;
				System.out.println(tempNode);
			} else {
				tempNode = tempNode.rightChild;
				System.out.println(tempNode);
			} // Of if

			if (tempNode.leftChild == null) {
				System.out.println("Decode one:" + tempNode);
				// Decode one char.
				resultCodeString += tempNode.character;

				// Return to the root.
				tempNode = getRoot();
			} // Of if
		} // Of for i

		return resultCodeString;
	}// Of decoding

	/**
	 *********************
	 * The entrance of the program.
	 * 
	 * @param args Not used now.
	 *********************
	 */
	public static void main(String args[]) {
		Huffman tempHuffman = new Huffman("D:/wenshihuai/huffmantext-small.txt");
		tempHuffman.constructAlphabet();
		tempHuffman.constructTree();

		HuffmanNode tempRoot = tempHuffman.getRoot();
		System.out.println("The root is: " + tempRoot);
		System.out.println("Preorder visit:");
		tempHuffman.preOrderVisit(tempHuffman.getRoot());

		tempHuffman.generateCodes();

		String tempCoded = tempHuffman.coding("a-efx");
		System.out.println("Coded: " + tempCoded);
		String tempDecoded = tempHuffman.decoding(tempCoded);
		System.out.println("Decoded: " + tempDecoded);

	}// Of main

} // Of class Huffman
