Base64 Encoding
Abstract
Base64 content-transfer-encoding, commonly called Base64 encoding, is defined in RFC 2045 [4]. It is a method designed to represent an arbitrary sequence of octets (8-bit bytes) in a printable text form. As a result, Base64 encoding allows binary data to pass through channels designed for plain ASCII text, such as SMTP [3] [2]. It also allows binary data to be embedded in media that support ASCII text only, such as XML files (see [6] for one approach).
Alphabet
An alphabet of 64 encoding characters is used (hence the name Base64), so each encoding character represents 6 bits. The alphabet is a printable subset of US-ASCII. Table 1 shows the Base64 alphabet, mapping each value to its encoding character.
Value | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
Character | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q |
Value | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 |
Character | R | S | T | U | V | W | X | Y | Z | a | b | c | d | e | f | g | h |
Value | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 |
Character | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y |
Value | 51 | 52 | 53 | 54 | 55 | 56 | 57 | 58 | 59 | 60 | 61 | 62 | 63 | (pad) |
Character | z | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | + | / | = |
The subset for the Base64 alphabet is carefully chosen so that it is represented identically in all versions of ISO 646 [1], including US-ASCII, and so that all of its characters are also available in all versions of EBCDIC.
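As a rough illustration, the alphabet of Table 1 can be held in a string so that a 6-bit value maps to its character by index, while a reverse table maps a character back to its value. The class and method names below (`Base64Alphabet`, `charFor`, `valueOf`) are hypothetical, chosen only for this sketch.

```java
import java.util.Arrays;

class Base64Alphabet {
    // The 64 encoding characters of Table 1, indexed by their 6-bit value.
    static final String ALPHABET =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // Reverse lookup: ASCII code -> 6-bit value, or -1 if not in the alphabet.
    static final int[] VALUES = new int[128];
    static {
        Arrays.fill(VALUES, -1);
        for (int v = 0; v < ALPHABET.length(); v++) {
            VALUES[ALPHABET.charAt(v)] = v;
        }
    }

    // Maps a value in 0..63 to its encoding character.
    static char charFor(int value) {
        return ALPHABET.charAt(value & 0x3F);
    }

    // Maps an encoding character to its value; the pad character '=' yields -1.
    static int valueOf(char ch) {
        return ch < 128 ? VALUES[ch] : -1;
    }

    public static void main(String[] args) {
        System.out.println(charFor(0));    // A
        System.out.println(charFor(63));   // /
        System.out.println(valueOf('x'));  // 49
        System.out.println(valueOf('='));  // -1
    }
}
```

The pad character is deliberately kept out of the lookup table, since it does not stand for a 6-bit value.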
Encoding
The encoding process consists of representing groups of 3 input octets (24 bits) as output strings of 4 encoded characters. Consider the input as a linear stream of octets. Proceeding from left to right, the input is divided into 24-bit groups, each formed by 3 consecutive octets of the input stream. Each 24-bit group is then treated as 4 concatenated 6-bit groups. Each 6-bit group is a binary number representing a decimal value between 0 and 63. That value is used as an index into the array of the Base64 alphabet shown in Table 1, and the corresponding encoded character is placed in the output string.
As a simple example, consider as input a sequence of 3 octets whose decimal values are 197, 22 and 251. The 24-bit group formed from these 3 octets is 110001010001011011111011. It is treated as four 6-bit groups: 110001, 010001, 011011 and 111011. Their respective decimal values are 49, 17, 27 and 59. Using the Base64 alphabet, the resulting output string is xRb7.
Table 2 decomposes the encoding process, the first row being a representation
of the input in ISO Latin-1 characters.
Input character | Å | SYN | û |
Octet value | 197 | 22 | 251 |
Octet bits | 11000101 | 00010110 | 11111011 |
6-bit groups | 110001 | 010001 | 011011 | 111011 |
6-bit value | 49 | 17 | 27 | 59 |
Encoded character | x | R | b | 7 |
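The regrouping of 3 octets into four 6-bit values can be sketched in a few lines of Java. This is only an illustration using the octets 197, 22 and 251 from Table 2; the class and method names (`EncodeGroup`, `encodeGroup`) are assumptions made for the example.

```java
class EncodeGroup {
    static final char[] EN64 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray();

    // Encodes exactly three octets (given as ints in 0..255) into four characters.
    static String encodeGroup(int o0, int o1, int o2) {
        // Concatenate the three octets into one 24-bit group.
        int group = ((o0 & 0xFF) << 16) | ((o1 & 0xFF) << 8) | (o2 & 0xFF);
        char[] out = new char[4];
        for (int i = 0; i < 4; i++) {
            // Take the 6-bit groups from left to right: shifts of 18, 12, 6 and 0.
            int value = (group >> (18 - 6 * i)) & 0x3F;
            out[i] = EN64[value];
        }
        return new String(out);
    }

    public static void main(String[] args) {
        System.out.println(encodeGroup(197, 22, 251)); // xRb7
    }
}
```

The four values extracted here are 49, 17, 27 and 59, which index the characters x, R, b and 7 in Table 1.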
The Base64 encoding rules specify that the output stream (the resulting encoded bytes) must be represented in lines of no more than 76 characters each. A line break is defined as the sequence CR LF.
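For readers on Java 8 or later, the standard library's `java.util.Base64.getMimeEncoder()` follows this line-length rule, so it offers a quick way to observe the wrapping. The sketch below is not part of the implementation discussed in this article; the class and method names (`MimeLineWrap`, `encodeMime`) are hypothetical.

```java
import java.util.Base64;

class MimeLineWrap {
    // Encodes data with the JDK's MIME encoder, which breaks lines
    // after 76 characters using CR LF.
    static String encodeMime(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] data = new byte[100];            // 100 zero octets -> 136 encoded characters
        String encoded = encodeMime(data);
        for (String line : encoded.split("\r\n")) {
            System.out.println(line.length());  // 76, then 60
        }
    }
}
```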
Padding
What if the input stream is not a multiple of 3 octets? If fewer than 24 bits are available at the end, zero bits are added on the right to form an integral number of 6-bit groups. Since Base64 input is an integral number of octets, there are 3 possible endings:
- The input ends with a whole 24-bit group. The output is a multiple of 4 Base64 encoded characters. No special action is needed.
- The input ends with two octets, that is, a 16-bit group. Two zero bits are added to form three whole 6-bit groups, which translate into 3 Base64 encoded characters. One padding character '=' is then appended to make the output a multiple of 4 characters.
- The input ends with one octet, that is, an 8-bit group. Four zero bits are added to form two whole 6-bit groups, which translate into 2 Base64 encoded characters. Two padding characters '=' are then appended.
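The three endings can be checked with a small sketch. The helper names (`PaddingCases`, `padCount`, `encodedLength`) are hypothetical, and the JDK's `java.util.Base64` encoder (Java 8 or later) is used here only to display the result for 1, 2 and 3 input octets.

```java
import java.util.Arrays;
import java.util.Base64;

class PaddingCases {
    // Number of '=' characters needed for an input of n octets.
    static int padCount(int n) {
        return (3 - n % 3) % 3;
    }

    // Encoded length without line breaks: 4 characters per
    // (possibly partial) 3-octet group.
    static int encodedLength(int n) {
        return 4 * ((n + 2) / 3);
    }

    public static void main(String[] args) {
        byte[] input = { (byte) 197, 22, (byte) 251 };
        for (int n = 1; n <= 3; n++) {
            byte[] prefix = Arrays.copyOf(input, n);
            String encoded = Base64.getEncoder().encodeToString(prefix);
            System.out.println(n + " octet(s): " + encoded + " (" + padCount(n) + " pad)");
        }
        // Encoded strings printed: xQ== then xRY= then xRb7
    }
}
```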
Decoding
The decoding process is the reverse of the encoding process: each group of 4 encoded characters yields four 6-bit values, which form a 24-bit group that is split into 3 octets. In the previous example, the bottom row of Table 2 is the input and the top row is the output.
All line breaks or other characters not in the Base64 alphabet are to be ignored by the decoding software. The same applies to any illegal sequences of characters in the Base64 encoding, such as "====".
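One way to observe this lenient behaviour without writing a decoder is the JDK's `java.util.Base64.getMimeDecoder()` (Java 8 or later), which is documented to skip line separators and other characters outside the Base64 alphabet. The sketch below only covers that case and makes no claim about how malformed padding is handled; the names `LenientDecode` and `decodeLenient` are assumptions made for the example.

```java
import java.util.Base64;

class LenientDecode {
    // Decodes MIME-style Base64 text; the JDK's MIME decoder skips line
    // breaks and other characters that are not in the Base64 alphabet.
    static byte[] decodeLenient(String text) {
        return Base64.getMimeDecoder().decode(text);
    }

    public static void main(String[] args) {
        // The encoded example from Table 2, with a CR LF in the middle.
        byte[] out = decodeLenient("xR\r\nb7");
        for (byte b : out) {
            System.out.println(b & 0xFF); // 197, 22, 251
        }
    }
}
```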
Java Implementation
First things first: let's define the maximum length of the encoded data lines, and the Base64 alphabet as an array of char.
 1: private static final int LINE_MAX_LEN = 76;
 2: private static final char[] EN64 =
 3:     "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/".toCharArray();
For simplicity's sake, let's assume in the following code that the input src is a java.io.InputStream, that the output dest is a java.io.Writer, and that java.util.Arrays is imported. Note that for better efficiency, the actual classes of src and dest should be some sort of buffered stream and buffered writer respectively.
The encoding process takes 3 octets from the input and represents them as 4 encoded characters. We will use an array of byte to hold the 3 input octets, and an array of char for the 4 output characters, as defined in lines 4-5. Each 3-octet group is read from src and then translated into 4 encoded characters by using each 6-bit value as an index into the Base64 alphabet array EN64 (lines 10-16). Note that there could be one or two padding characters '=', depending on the number of octets n read from the input.
 4: byte[] b = new byte[3];
 5: char[] c = new char[4];
 6: int k = 0;
 7: while (true) {
 8:     // Collect up to three bytes; a single read() may return fewer
 9:     Arrays.fill(b, (byte) 0);
10:     int n = 0, r;
11:     while (n < 3 && (r = src.read(b, n, 3 - n)) != -1) n += r;
12:     if (n == 0) break; // end of input, nothing left to encode
13:     c[0] = EN64[(b[0] >> 2) & 0x3F];
14:     c[1] = EN64[((b[0] & 0x3) << 4) | ((b[1] >> 4) & 0xF)];
15:     c[2] = (n < 2) ? '=' : EN64[((b[1] & 0xF) << 2) | ((b[2] >> 6) & 0x3)];
16:     c[3] = (n < 3) ? '=' : EN64[b[2] & 0x3F];
17:
18:     // Ensure that the encoded data have a maximum of
19:     // LINE_MAX_LEN (76) characters per line.
20:     if (k < LINE_MAX_LEN) k += 4;
21:     else {
22:         dest.write("\r\n");
23:         k = 4;
24:     }
25:     dest.write(c, 0, 4);
26: }
Even though RFC 2045 [4] allows output lines of fewer than 76 characters, we break each full output line at exactly 76 characters (lines 20-25). Using shorter lines increases the size of the output, as more CR LF line breaks are needed.
The decoding process works in reverse to the encoding process, and is left as an exercise to the reader.
Conclusion
There are other popular methods for binary data encoding, such as the hexadecimal representation, uuencode or Base85 [5]. However, Base64 is relatively compact (4 output characters for every 3 input octets, against 2 characters per octet for hexadecimal) and more portable, as its alphabet is represented identically in all versions of ISO 646 [1] and its characters are available in all versions of EBCDIC. These properties make Base64 a premier cross-platform binary transport encoding method.
References
- [1] ISO 646
- "Information technology -- ISO 7-bit coded character set for information interchange", 1991.
- [2] RFC 822
- "Standard for the format of ARPA Internet text messages", David H. Crocker, August 1982.
- [3] RFC 821
- "Simple Mail Transfer Protocol", Jonathan B. Postel, August 1982.
- [4] RFC 2045
- "Multipurpose Internet Mail Extensions (MIME) Part One: Format of Internet Message Bodies", N. Freed and N. Borenstein, November 1996.
- [5] RFC 1924
- "A Compact Representation of IPv6 Addresses", R. Elz, 1 April 1996.
- [6] Java Tip 117
- "Java Tip 117: Transfer binary data in an XML document", Odysseas Pentakalos, September 2001.
Copyright © 2003-2004, Northwest Summit. All rights reserved.