Encoding 101 - Part 4
In this final part of the series, we discuss how to prevent encoding problems.
So, what can we do?
Can we reliably detect the encoding used to produce a given lump of binary text data? No, unfortunately not. We can make an educated guess (or write code to the same effect), but it won’t be reliable; there simply isn’t enough information to be 100% certain of the result. We could write a detection method and refine it with new logic every time it guessed wrong, but it would only be valid for the specific integration scenario we wrote it for. It might eventually become more than 99% accurate, but it is not economically feasible to maintain and update such complicated logic for each and every integration we do.
What we can do is to follow a set of best practices to minimize the possibility of encoding errors. I will outline those here.
Use proper data formats
There are many text-based data formats which either contain information about their encoding or at least provide the option to include it. All EDIFACT formats, for instance, explicitly state which encoding each file uses. Likewise, XML files can contain an XML declaration (although it is not mandatory) naming the encoding they use, so all XML-based formats can do the same. Using formats like these, and the options they provide for declaring the encoding, can alleviate many of the issues you might otherwise face. Whether you are on the sending or receiving end of an integration, always seek to use proper data formats.
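As a sketch of how such self-describing formats work in practice, here is a small Python example (the element names are made up for illustration). Python’s standard `xml.etree.ElementTree` writes an XML declaration that matches the encoding you choose, and its parser reads that declaration back when decoding the file:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Build a small document containing a non-ASCII character.
root = ET.Element("order")
ET.SubElement(root, "customer").text = "Søren"

path = os.path.join(tempfile.mkdtemp(), "order.xml")

# Writing with an explicit encoding produces a matching declaration,
# e.g. <?xml version='1.0' encoding='ISO-8859-1'?>
ET.ElementTree(root).write(path, encoding="ISO-8859-1", xml_declaration=True)

# The parser reads the declaration and decodes the bytes accordingly,
# so the text round-trips correctly even though the file is not UTF-8.
customer = ET.parse(path).find("customer").text
print(customer)  # Søren
```

Because the encoding travels with the data, the receiver never has to guess.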
Use structured interfaces
There are many ways to send data from one system to another. Some of these offer the sender the possibility of including information about the data they are sending, such as which encoding it uses. When transferring data via HTTP (e.g. sending data to a web service), the charset parameter of the Content-Type header can be used to indicate the encoding of the data.
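A minimal sketch of honoring that header on the receiving side, in Python (the helper function name is my own invention). The standard library’s `email.message.Message` understands the same parameter syntax that Content-Type uses, so it can extract the charset for us:

```python
from email.message import Message

def decode_http_body(content_type: str, body: bytes) -> str:
    # Parse the Content-Type header value; email.message handles the
    # parameter syntax (e.g. "text/xml; charset=ISO-8859-1").
    msg = Message()
    msg["Content-Type"] = content_type
    # Fall back to UTF-8 if no charset parameter was supplied.
    charset = msg.get_content_charset() or "utf-8"
    return body.decode(charset)

body = "Grüße".encode("iso-8859-1")
text = decode_http_body("text/xml; charset=ISO-8859-1", body)
print(text)  # Grüße
```

The key point: the receiver decodes with the encoding the sender declared, not with a guess.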
Agree on which encoding to use
If you have no other option for exchanging data than sending or receiving plain text files of some kind (e.g. CSV files, which contain data fields separated by a separator character, often a semicolon), make an explicit agreement with the sender or receiver about which encoding will be used. As long as both parties always stick to the agreed encoding for each integration, there won’t be any issues.
Additionally, if you are not restricted to a specific encoding by either end of the integration but can actually choose yourself which one to use, always use UTF-8. It is the industry standard, and the default most systems will fall back on in the absence of actual information about the encoding used.
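In code, "sticking to the agreed encoding" simply means passing it explicitly at every read and write, rather than relying on the platform default (which differs between, say, Windows and Linux). A small Python sketch with a hypothetical CSV file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "export.csv")

# Always pass the agreed-upon encoding explicitly; never rely on the
# platform default encoding, which varies between operating systems.
with open(path, "w", encoding="utf-8") as f:
    f.write("name;city\nJosé;Málaga\n")

with open(path, encoding="utf-8") as f:
    content = f.read()

print(content)
```

The same principle applies in any language: if both ends hard-code the agreed encoding, the platform default can never silently interfere.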
Be aware of encoding issues
Even if you do everything “the right way”, you cannot entirely prevent encoding issues from cropping up from time to time. Anytime your code needs to process data that came from somewhere else, there is a chance that the “somewhere else” delivers data containing wrongly encoded text. To this end it is useful to have a basic understanding of how text encodings work and the kind of problems they can give rise to.
Do not trust tools blindly
As developers we use various tools to display and/or edit text files. Generally, these tools do not ask which encoding to use when you open a file; they simply make the best guess they can at the correct encoding and then use that.
Sometimes it is not so much a guess as it is simply reading which encoding the file claims to be, if it is in a format that includes encoding information, e.g. an XML file with encoding information in its XML declaration.
Other times it is an actual guess: the tool tries to decode the file using the most likely encoding and, if that does not produce any invalid characters, concludes that it was correct.
Often, having opened a file, the tool will display the name of an encoding somewhere in its UI (Notepad++ does this, for instance). The intuitive conclusion is that this is the actual encoding of the file. Not necessarily: all it means is that this is the encoding the tool used to decode the file. It may well be correct, but you cannot always know for sure.
There can be several reasons for a tool to use the wrong encoding to decode a file:
The file contained wrong encoding information
All the bytes present in the file matched valid character codes in the used encoding, even if some of the decoded characters were not the same as the ones that were originally encoded
The file, while encoded using a different encoding, only contained characters that were the same in both encodings
In the first two scenarios you may be able to see that something is wrong if the rendering of the file shows unexpected characters.
The third one is trickier. If, for instance, a file only contains characters from the ASCII range, there is really no way to determine which encoding originally produced it, since most encodings use the exact same codes for those characters. For the same reason it doesn’t matter which encoding you use to decode that file; it will produce the correct characters regardless.
What you have to remember is that this doesn’t mean the next file from the same source will also decode correctly with that same encoding. If you try to determine which encoding a given data source uses, you need to look at data containing characters outside the ASCII range.
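Both points can be demonstrated in a few lines of Python: ASCII-only bytes decode identically under many encodings, while non-ASCII bytes decoded with the wrong encoding can produce perfectly "valid" yet garbled text:

```python
# The same ASCII-only bytes decode identically under many encodings,
# so this file tells us nothing about which encoding produced it.
data = b"Hello, world"
assert data.decode("utf-8") == data.decode("iso-8859-1") == data.decode("cp1252")

# Bytes outside the ASCII range are a different story. Here the UTF-8
# bytes for "æøå" are wrongly decoded as ISO-8859-1. Every byte maps to
# a valid character, so no error is raised, yet the result is garbage.
original = "æøå"
garbled = original.encode("utf-8").decode("iso-8859-1")
print(garbled)  # Ã¦Ã¸Ã¥ (valid characters, wrong text)
```

This is exactly the third scenario above: the tool (or your code) reports success, and only a human looking at the output can tell that something went wrong.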
Conclusion
The existence and usage of multiple different text encodings frequently leads to problems, particularly in integration. A problem caused by an encoding error can be tremendously difficult to trace and solve, due to the fact that on its way from the sending to the receiving system, a given batch of data may be encoded and decoded many times. This can make it very difficult to find out where the error actually occurs.
It is therefore always wise to be aware of encoding and to be explicit about it. When planning a new exchange of data between two systems, always try to determine exactly which encoding each system uses to generate its outgoing transmissions, and which encoding it expects for its incoming ones. That way you probably won’t have any nasty surprises down the road.