The importance of test data in integration solutions

Let me preface this by talking a bit about the various circumstances that can surround the development of a new integration. Whether you are working internally or externally, there will be some people in charge of either end of the integration, the sending and the receiving end, and these people can be more or less knowledgeable about exactly what goes on at their respective ends.

The integration that needs to be made can involve new or established document types, new or established partners, or even just be a migration of an existing integration to a new platform.

The data formats involved can be ones the integration solution already uses, or wholly new ones that need to be implemented from scratch. They can be official commonly used standards, or be completely custom made in-house formats. They can have complete official up-to-date technical specifications, or be entirely undocumented.

In short, when it comes to making integration solutions, the amount of information you will have to work from can vary enormously, and so can the involvement of the people at either end.

At one end of the spectrum you can have a new document type being exchanged between existing partners, meaning the actual transmission of data is already established. Both the input and output formats are from official standards with proper documentation, and you have been given an actual mapping specification detailing which fields in the input format go where in the output format. The people responsible for both the sending and receiving systems are technically competent, they are available and responsive for communication, and are directly involved in the testing process.

This is the dream scenario. It doesn’t happen often, but it does happen, and when it does you don’t have much to worry about. Everything will be just fine.

However, in most cases you will have to deal with more unknowns than that, and the less certainty there is about various parts of the requirements, the more important it becomes to have access to good test data.

With that in mind, let us imagine the worst scenario possible and examine the role of test data from that perspective.

“Let’s make a new integration!”

One of the simplest and most frequently encountered integration scenarios is as follows: your solution will receive some text-based data files, convert the data to a different text-based data format, and output the resulting files somewhere else. It doesn’t really matter whether the input and output are actual files in a file system, or whether the data is exchanged in some other manner; the same rules apply in any case. In order to show you what needs to be done, the customer sends you two text files, one an example of an input file, the other an output file. In many cases, the customer will firmly believe that this is all you need. I mean, you can see that it looks like this, and you just need to change it so it looks like that. How hard can that be?
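To make the setup concrete, here is a minimal sketch of such a flow in Python: read a text-based input file, transform each record, and write the result somewhere else. The field names, delimiters and formats are purely hypothetical placeholders, not anything taken from a real specification.

```python
from pathlib import Path

def transform_record(record: dict) -> dict:
    # Map fields from the (hypothetical) input format to the output format.
    return {"OrderNumber": record["order_id"], "Date": record["order_date"]}

def run_integration(input_path: Path, output_path: Path) -> None:
    # Read the input file, split it into records, map each record, write the result.
    lines = input_path.read_text(encoding="utf-8").splitlines()
    records = [dict(zip(["order_id", "order_date"], line.split(";"))) for line in lines]
    out = [f'{r["OrderNumber"]},{r["Date"]}' for r in map(transform_record, records)]
    output_path.write_text("\n".join(out), encoding="utf-8")
```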

Leaving aside the issue of the complete lack of proper requirement specifications, there are several aspects of the files themselves you need to think about. Depending on the formats involved and the complexity of the data manipulations needed to transform one into the other, it may actually look doable, but beware: if you go blindly ahead and make something which can indeed perform the transformation of the input file into the output file, you risk running into some nasty surprises down the road.

Where did the files come from?

This is the first question you need to ask. It’s a very important question, the answer to which determines what other questions you may need to ask. The best possible answer you can get is that both files are actual data files taken from the production environments of the involved systems. The worst possible answer you can get is “They were made manually by one of our employees who knows about these things”.

Now, apart from the fact that “knowing about things” is both a subjective and a relative concept, a manually created file inherently lacks certainty about several aspects of the data format it represents. Let us have a look at some of the things that could potentially be wrong with the files you have been given.

Structural differences

Very few people are experts at everything they do, and whoever made the files may have an imperfect understanding of the file formats involved, and so may make mistakes even though, as far as they know, what they are doing is correct. Also, humans are fallible, and a simple typo in an example file may result in quite large problems later in the project.

Even if the person at the other end knows exactly what they are doing, and is creating the data structure with 100% accuracy according to the specifications they have been given, the specifications themselves may be out of date, and not actually identical to the data the system now produces.

Apart from the data structure itself, two things in particular are very easy for someone manually generating data files to get wrong, simply because they are invisible under normal circumstances.

Line break characters

Different systems use different ways of indicating a line break. The three most common ones are CR (carriage return), LF (line feed) and CRLF (both in sequence). When someone creates a text file manually, it will typically use whatever kind of line break the system they create it on uses. If your data file is a flat file (either delimited or positional), it most likely contains a single record per line, so in order to split the data into individual records, your code has to split it at the line break character. Assuming that you have no explicit information about the type of line break used in the files you will eventually receive from the sending system, you will write your code to work with the example file you have been given. If it turns out that the files generated by the sending system use a different type of line break, your code will most likely not work. Hopefully this happens before your solution gets put into production.

People generally don’t think about line breaks, because in a text editor they are visually identical. You can see that the line does break, which is what you wanted it to do, so you think no more about it. Eventually though, some code somewhere is going to have to determine what to do with the data by reading the actual characters it contains, even the ones that are normally invisible. Typically in a flat file, a line break marks the end of a record, so if the data uses a different kind of line break than your solution is looking for, various bad things can happen. If you have defined the end of a record as CRLF but the data only uses LF, it will never find the end of the first record. If it is the other way around, it will find the end of the record, but the last field in the record will now end with a CR character which will be seen as part of the data content even though it isn’t.
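As a small illustration, here is a sketch (in Python, with hypothetical record data) of the difference between splitting records on a hard-coded line break and normalizing the line breaks first, so the split no longer depends on which system produced the file.

```python
def split_records_fragile(data: str) -> list[str]:
    # Works only if the sending system really does use CRLF.
    return data.split("\r\n")

def split_records_robust(data: str) -> list[str]:
    # Normalize CRLF and CR to LF before splitting, and drop the empty
    # trailing record caused by a final line break.
    normalized = data.replace("\r\n", "\n").replace("\r", "\n")
    return [line for line in normalized.split("\n") if line]

print(split_records_fragile("REC1;A\nREC2;B"))  # one giant "record" containing a stray LF
print(split_records_robust("REC1;A\nREC2;B"))   # ['REC1;A', 'REC2;B']
```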

Encoding – a developer’s nightmare

Another thing which people tend to forget about because it is usually not visible is which encoding the file uses. When looking at a text file in an editor, it usually doesn’t matter whether the text is encoded as UTF-8, Windows-1252, ISO-8859-1 or any of a number of other encodings, because the editor simply shows the characters the file contains, not the byte values used to represent those characters. For your part though, if your solution is responsible for reading the file itself from somewhere (a folder, FTP server, etc.), you will need to take the encoding into account in your code. Otherwise you run the risk of having some characters interpreted wrongly, and thus actually changing the data contents of the file.
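As a quick illustration of the problem (a sketch only; the text and encodings are arbitrary examples), the same bytes decode to different characters depending on the encoding your code assumes:

```python
raw = "Blåbærgrød 20 m²".encode("windows-1252")  # the bytes as the sender wrote them

wrong = raw.decode("utf-8", errors="replace")    # assuming the wrong encoding
right = raw.decode("windows-1252")               # using the encoding the sender actually used

print(wrong)  # non-ASCII characters are garbled, so the data content has effectively changed
print(right)  # Blåbærgrød 20 m²
```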

However, text encoding is such a deep and detailed subject that I shall not delve into it any further in this article. For further information on the subject, please read my series of blog posts “Encoding 101”.

What percentage of possible scenarios do the files cover?

Many types of business document can look very different from one specific document instance to the next, depending on the details of each document. The possible differences between them are determined by the data format used. As such, if you do not have an actual format specification but only a few examples to work from, chances are you will not have a complete picture of what you can expect to have to deal with in the future. Let us look at some of the types of differences you will commonly run into.

Optional fields

Most complex data formats for business documents contain optional fields or entire optional structures within them. In order to facilitate all necessary usage scenarios, it has to be possible to include many kinds of data that are not applicable to all usage scenarios. When a scenario does not call for a specific kind of data, it would be awkward to still include all of the fields for that data and just leave them empty, so it makes sense to simply make those fields optional in the format.

If a format does contain optional fields, and you do not have a specification for the format itself, the chances of the example files you have been given containing between them every possible field in the format are slim. The likelihood that your integration solution will eventually receive documents containing fields you have never seen before and therefore not taken into account is correspondingly high.
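A minimal sketch of what this means in practice: the mapping has to decide explicitly what to do when an optional field is missing, rather than assuming the fields seen in the example files are always there. The field names below are hypothetical.

```python
def map_order(record: dict) -> dict:
    out = {"OrderNumber": record["order_id"]}      # mandatory field
    if "delivery_note" in record:                  # optional field: may simply be absent
        out["DeliveryNote"] = record["delivery_note"]
    return out

print(map_order({"order_id": "1001"}))
print(map_order({"order_id": "1002", "delivery_note": "Leave at gate"}))
```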

Repeatable data

Many hierarchical data formats contain, somewhere in their structure, one or more repeatable records, such as lines on an order or invoice. It is quite common to receive test files with only a single instance of each repeatable record type. This is problematic, because it means that the integration logic for dealing with repetitions will not be tested properly and may cause issues down the line.

Apart from that, while it should be obvious to most people that an order or invoice can contain multiple order/invoice lines, other kinds of repeatable structures may exist within a format that are not necessarily obvious to someone who is not used to dealing with a particular kind of document on a daily basis. Therefore, if you do not have a specification for a format which explicitly shows that a specific part of the data structure can occur more than once, and you have no examples of it doing so, your solution is probably not equipped to deal with multiple instances of said structure.
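Sketched in code, the difference between a mapping written against a test file with exactly one line and one that handles any number of repetitions looks something like this (the invoice structure is a hypothetical illustration):

```python
invoice = {
    "invoice_number": "INV-42",
    "lines": [
        {"item": "A100", "qty": 3},
        {"item": "B200", "qty": 1},
    ],
}

# Fragile: hard-codes the single line seen in the test file.
first_line_only = {"Item": invoice["lines"][0]["item"], "Qty": invoice["lines"][0]["qty"]}

# Robust: maps every repetition of the line structure.
all_lines = [{"Item": line["item"], "Qty": line["qty"]} for line in invoice["lines"]]
```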

“Most common”

People have a tendency, when describing a given data flow, to focus exclusively on the most common and obvious usage of it. In doing so they may completely forget to mention any special scenarios that may occur far less often, but which may require special handling when they do. It is therefore not uncommon for an integration specialist to have to go back and change a solution because they suddenly find out that the data flow they have developed can be used in several other ways than the one the customer initially told them about.

An argument I hear frequently from customers is “this should cover most cases, so let’s just start with this, and we’ll handle everything else manually to begin with”. This can be fine if you are dealing with traffic on the order of 20 documents per week. However, if you are processing 10,000 documents per day, manual handling may not be feasible. In that scenario, if your integration covers 95% of cases, that amounts to 500 failed documents per day, or worse, 500 documents that have gone through the system and been sent on, but do not contain the data they were supposed to.

Another argument I often hear during development is “this should cover most cases, so let’s get this developed and tested, and we’ll look at the less frequent scenarios afterwards”. The problem with that approach is that once you introduce the additional scenarios into the solution, you may have to make so many changes to it that it is essentially a new solution. As such, most of the time spent on the initial development and testing has now been wasted, which wouldn’t have happened if you had spent the necessary time to uncover all the requirements to begin with. The end result will generally also be better when the requirements are known up front: creating a solution knowing that it will need to handle a range of specific scenarios usually makes for a much better structured solution than one that was made for a single scenario and then later modified to handle a couple more.

Case study: Invoice

Let us, as an example, look at an invoicing flow. Broadly speaking, there are two types of invoices: normal invoices, and credit notes. A credit note is basically the opposite of an invoice. Rather than asking for money, it is a notification of money you are due.

In some cases, invoices and credit notes are handled using different file formats. Quite commonly though, they use the same file format. In that case, there are two common ways to distinguish them from one another. One is for the data to contain some sort of flag or code indicating whether this file is an invoice or a credit note. Another is to simply use negative amounts for credit notes and positive ones for invoices.
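To illustrate, here is a sketch of converting from one convention to the other: the sending format flags credit notes with a document type code, while the receiving format (in this hypothetical example) expects credit notes to carry negative amounts. The code values and field handling are assumptions for illustration only.

```python
def map_amount(document_type: str, amount: float) -> float:
    # Translate the "type code" convention into the "signed amount" convention.
    if document_type == "CREDIT_NOTE":
        return -abs(amount)   # credit notes become negative in the target format
    return abs(amount)        # ordinary invoices stay positive

print(map_amount("INVOICE", 125.50))      #  125.5
print(map_amount("CREDIT_NOTE", 125.50))  # -125.5
```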

Let us say we have gotten a task from a customer to implement an integration for invoices. The solution will receive invoices in one format from external systems, and transform them to another format, which is the one used internally.

If the only documentation or specification we have to work from are the two example files mentioned above, they will almost certainly be examples of an ordinary invoice. Working from that, we make a solution which transforms one into the other. It is tested, approved, and put into production. And then disaster strikes, when the first credit notes arrive.

It turns out that the external data format uses a code field to distinguish between invoices and credit notes, while the internal system uses negative amounts instead, or it uses a completely separate format for credit notes. As a result, due to the way our integration has been made, to the best of our ability given the information we have received, all credit notes are converted to invoices in the integration. Chances are, this is going to make someone unhappy, particularly the people who are now being asked to pay the money they were supposed to receive.

Case study: Shipping booking

When a company needs to ship goods of some kind from one place to another, they need to book the shipping. Whether they have their own shipping department or contract out the shipping to one or more logistics companies, the shipment needs to be booked so the people in charge of doing the shipping can make sure the goods are picked up at the correct location and time at one end, and are delivered to the correct location at the correct time at the other.

To facilitate this, some kind of shipping booking document is usually involved. It tells the recipient where and when to pick up the goods, where and when to deliver them, and also provides details of the dimensions, weight, packaging, etc. of the goods involved, all so that it can be ensured that the correct number of people are assigned to the job, and that the chosen method of delivery is adequate to the task. After all, if you are shipping fifty pallets of canned tomato sauce to a buyer four countries away, you can’t send a single guy in a van.

This is all fairly logical and intuitive, but there is one aspect of shipping that is often overlooked, which is the concept of dangerous goods. Dangerous goods are goods that, due to one or more aspects of their nature, have special shipping requirements. These requirements are usually regulated by law, so that it is actually a punishable offense not to adhere to them, since failing to do so may pose a risk to other goods, other people or the environment. There are many different classes of dangerous goods with different kinds of shipping requirements. Some are fairly obvious, like flammable chemicals that have to be kept within a certain temperature range during shipping. Others are surprising, like pistachio nuts, which are classified as dangerous goods because under the right (or more likely, wrong) circumstances they have been known to spontaneously combust.

What all of this leads up to is that a shipping booking document format will almost certainly have provisions for providing information about dangerous goods. However, depending on what sort of company you are dealing with, you may not have been given any test data containing such information. Since information on dangerous goods is exceedingly important when it comes to shipping, this could potentially spell disaster later on. Nobody wants to be the cause of a truck exploding because the integration solution responsible for converting the booking document from one format to another neglected to map the information that the shipment had to be kept below 30°C at all times.

What to do?

So what can we do to give ourselves and the customer the highest possible chance of success when it comes to test data? Well, there are several things we can do.

Ask questions

This is the most important step. Find out the nature of the test files. Ask for more of them, if possible. Ask if they cover all data points in the format, as well as all usage scenarios. If one or more of the formats involved are internal and not an official standard, ask for specifications of the formats. Anything at all you are unsure of, ask about it.

Some people are afraid to ask questions because they think it makes them look unprofessional. In reality it is the other way around. Asking the right questions, thereby minimizing the amount of surprises down the road, is the professional route to take.

Get more data

If possible, get more test data. I would almost say you can never have too much. As an optimum scenario, before going live your integration should have been tested on a month of production data. This is not always necessary or possible, but it is a good thing to aim for. A month of production data will usually contain most of the variants you are likely to encounter, and will go a long way towards protecting you against nasty surprises.

Involve the recipient

If at all possible, make sure the eventual recipient of the data is involved. Have them sign off on the resulting files. Having them verify the files ensures that the integration lives up to its intended purpose. They are the only ones who can find certain types of problems. It may turn out that your customer’s ERP system only sends their internal product ID for each product, but the receiver system requires an EAN number. Since this requires changes to the customer’s ERP system, this might delay the production date. As such, the earlier in the process things like this are caught the better, so try to involve the recipient as early as possible.

Beware a lack of test data

Just a few words of caution here. I have seen many integrations that were developed, tested and signed off on with only a single fake test file of garbage data. In such a case, all you really know is that the integration can handle that one file. In many cases, that integration will fail immediately as soon as it goes to production. Don’t do that. Have higher standards. And if your customer won’t provide better test data for whatever reason, make sure to tell them the possible repercussions. But do try your best to give the project the best chance of success by adhering to the principles outlined above. It’s all well and good to be able to document that it isn’t your fault when disaster strikes, but it is still far better to ensure that it doesn’t strike in the first place.

Conclusion

When it comes to integration, test data is king. The better your test data, the better your chances of making something that will work on the first try, without any nasty surprises.
