Transferring Data across different platforms
31 Oct 07, 10:38AM
The standardisation of compilers has made developing software for multiple platforms comparatively easy, and it is rarely necessary to concern ourselves with the instruction set of the underlying chipset. However, subtle differences between systems still exist, the main ones being how text files are stored and the Little and Big Endian methods for storing numbers. Most of the time these issues can be ignored; however, when data is transferred from one platform to another, problems can arise.
With text documents, there are many differences in how systems handle certain ASCII characters. For example, most Mac word processors ignore control characters 0-31 and define tabs by length rather than by spaces. However, the real difference between PCs, Unix machines and Macs is the insertion of a Carriage Return (CR), a Line Feed (LF) or both at the end of each line. The history behind this is simple: in the days of the typewriter, CR returned the print head to column 0, while LF advanced the paper by one line.
Today, sending both CR and LF is redundant, but each major platform handles the line ending differently:
- PCs insert both a CR and an LF (ASCII codes 13 and 10) at the end of each line
- Macs insert only a CR (ASCII code 13)
- Unix systems insert only an LF (ASCII code 10)
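For illustration, here is a minimal sketch (the function name normalise_line_endings and the use of standard C stdio are my own choices, not taken from the article) that copies a text file while converting lone CR and CR+LF line endings to a single LF. Both files are assumed to be opened in binary mode so the C runtime does not translate line endings by itself.

#include <stdio.h>

/* Copy 'in' to 'out', converting lone CR (Mac) and CR+LF (PC) line endings
   to a single LF (Unix). */
void normalise_line_endings( FILE *in, FILE *out )
{
    int c;
    while( (c = fgetc( in )) != EOF )
    {
        if( c == '\r' )                      /* CR: either Mac, or the first half of CR+LF */
        {
            int next = fgetc( in );
            if( next != '\n' && next != EOF )
                ungetc( next, in );          /* lone CR - push the following character back */
            fputc( '\n', out );              /* emit a single LF in both cases */
        }
        else
            fputc( c, out );
    }
}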
Numbers are also stored differently on the PC and Unix platforms, due to differences in their underlying hardware. One (bad) way around this is to store numbers in text format, but it is much more efficient to store them in binary format, so that the data on disk exactly matches the data in memory.
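As a quick sketch of the binary approach (the file name value.bin is made up for the example), a double can be written to disk and read back with fwrite and fread, so that the bytes on disk are exactly the bytes held in memory:

#include <stdio.h>

int main( void )
{
    double value = 3.14159;

    /* Write the 8 raw bytes of the double straight to disk. */
    FILE *f = fopen( "value.bin", "wb" );
    if( !f ) return 1;
    fwrite( &value, sizeof(double), 1, f );
    fclose( f );

    /* Read them back; the in-memory representation is restored exactly,
       provided the reading machine uses the same byte order. */
    double copy = 0.0;
    f = fopen( "value.bin", "rb" );
    if( !f ) return 1;
    fread( &copy, sizeof(double), 1, f );
    fclose( f );

    printf( "%f\n", copy );
    return 0;
}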
So what's all the fuss about? A single byte can store only 2^8 different values, such as the integers 0 to 255. For larger numbers, more bytes are needed. For example, 2 bytes together (also known as a "word") can store 2^16 values, or the integers 0 to 65535. As with decimal counting (where "tens" are more significant than "units"), the bytes within a word are referred to as the "high-order" and the "low-order" byte. The bytes in other data types, such as the 8-byte double, are also ordered in a specific way.
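To make the high-order/low-order distinction concrete, this small sketch (my own example, not from the article) splits the 16-bit value 0x1234 into its two bytes:

#include <stdio.h>

int main( void )
{
    unsigned short word = 0x1234;              /* a 16-bit value (4660 in decimal) */

    unsigned char high = (word >> 8) & 0xFF;   /* high-order byte: 0x12 */
    unsigned char low  =  word       & 0xFF;   /* low-order byte:  0x34 */

    printf( "high = 0x%02X, low = 0x%02X\n", high, low );
    return 0;
}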
The problem arises because bytes are stored sequentially, yet within a multi-byte number the ordering is only implicit. Thus, any program running on an Intel-based architecture (e.g. Windows, DOS) "knows" that the low-order byte always comes first. However, on traditional Unix hardware such as SPARC and PowerPC (including the PowerPC-based Mac), numbers are led by their high-order byte.
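A common way to see which ordering a machine uses is to look at the first byte of a known multi-byte value in memory; the following sketch (mine, for illustration) does exactly that:

#include <stdio.h>

int main( void )
{
    unsigned short word = 0x0001;
    unsigned char *first_byte = (unsigned char *)&word;

    /* If the low-order byte (0x01) is stored first, this machine puts the
       least significant byte at the lowest address; otherwise the
       high-order byte comes first. */
    if( *first_byte == 0x01 )
        printf( "low-order byte stored first\n" );
    else
        printf( "high-order byte stored first\n" );

    return 0;
}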
These schemes are known as "Little Endian" and "Big Endian" respectively, and numerical data cannot be copied directly between them. However, it's easy to convert from one to the other by simply reversing the byte order. The following routine demonstrates this for the numerical type double:
/* Reverse the byte order of a double in place, converting it between
   the Little Endian and Big Endian representations. */
void reverse_bytes( double *Data )
{
    union
    {
        double as_number;
        char   as_bytes[8];    /* the same 8 bytes, viewed individually */
    } forward, reverse;

    forward.as_number = *Data;

    /* Copy the bytes across in the opposite order. */
    for( int a = 0, b = 7; a <= 7; a++, b-- )
        reverse.as_bytes[a] = forward.as_bytes[b];

    *Data = reverse.as_number;
}
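A typical use of the routine, assuming reverse_bytes from above is in scope and the (made-up) file data_from_other_platform.bin holds a single double written on a machine with the opposite byte order, might look like this:

#include <stdio.h>

int main( void )
{
    double value;
    FILE *f = fopen( "data_from_other_platform.bin", "rb" );
    if( !f ) return 1;

    if( fread( &value, sizeof(double), 1, f ) == 1 )
    {
        reverse_bytes( &value );    /* swap from the foreign byte order to the native one */
        printf( "%f\n", value );
    }

    fclose( f );
    return 0;
}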