(July 2013)


For the TL;DR crowd: If your programs are sending and receiving data structures, you are probably using high-level encapsulations - e.g. XML or JSON. There are, however, far more efficient solutions in terms of memory, CPU and network requirements ; ASN.1 is one of them.

Below you will find a short introduction to the why and how of serialization with ASN.1, as well as hands-on sessions with an open-source ASN.1 compiler implemented under the auspices of the European Space Agency.

(Do you already know about ASN.1 and/or XML/XSD/JSON? Then feel free to skip this section. Otherwise keep reading).

If you code for a living, you will inevitably end up in a position where two processes are communicating over a link. The link-layer technology itself is not important in this discussion: it can be sockets, or pipes, or whatever else you fancy. What is important is how you handle the problem of sending your data "across the wire".

In the simplest of cases, you are writing the code at both ends - coding both the server and the client, in the same language.

The easy way - memory dump

In that case, things appear easy enough - for example, assuming you write in C/C++ and send data over sockets, you can just send a memory dump of the message itself:


// The message type
struct foo {
    long msgId;
    char myflag;
    float value;
    unsigned char szDescription[10];
    bool isReady;
};

// Sender
void somefuncThatRunsInTheSendingProcess()
{
    // Populate message data
    ...
    struct foo b;
    b.msgId = ...
    b.value = ...
    ...
    // Send them across
    send(..., &b, sizeof(b), ...);
    ...
}

// Recipient
void somefuncThatRunsOnTheReceivingProcess()
{
    ...
    struct foo b;

    // Receive message data
    recv(..., &b, sizeof(b), ...);

    // Work with them
    ...
}

(how we would handle messages with pointers or references inside them is left as an exercise for the reader - and yes, ignore packet fragmentation for now).

Looks easy and safe enough - block until the data of the structure are read, and work with them.

Until it becomes clear that...
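
Two of the things that can bite here, endianness and word sizes (both come up again later in this article), are easy to demonstrate. The following is only a sketch, using Python's struct module to stand in for what different C compilers and CPUs would do with struct foo ; the field values are made up for illustration.

```python
import struct

# Values for the fields of struct foo:
# long msgId, char myflag, float value, unsigned char szDescription[10], bool isReady
fields = (1, b"a", 2.5, b"Hello", True)

# '<' = little-endian, '>' = big-endian ; both use fixed sizes and no padding
little = struct.pack("<lcf10s?", *fields)
big = struct.pack(">lcf10s?", *fields)

# '@' = whatever sizes and alignment the local C compiler uses
native_size = struct.calcsize("@lcf10s?")
packed_size = struct.calcsize("<lcf10s?")

print("little-endian msgId bytes:", little[:4])
print("big-endian msgId bytes:   ", big[:4])
print("native size:", native_size, "- packed size:", packed_size)
```

The same msgId ends up as different bytes depending on byte order, and the native layout is typically larger than the sum of the fields, because of padding and because long is not the same size everywhere. A receiver built with different rules will read garbage from the raw dump.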

Using higher-level languages

Some of you may be thinking: 'Yeah, C/C++ causes this kind of mayhem ; use a higher level language'. It could be argued however that there are a lot of markets where using low-level languages like C/C++ is mandatory. Embedded systems are a particularly good example - weak CPUs, little memory available...

But OK, I'll follow along - let's use Python instead:

+-----------------------+   # Sender
|   Data structure      |   a = socket(...)
| inside Sender process |   ...
+-----------------------+   a.send( pickle.dumps(myvariable) )
           |
           |      +------------+
           `----->| Serializer |
                  +------------+
                         |
                         |
                         | (Link layer - sockets, etc)
                         |
                         |
                         |  +--------------+
                         `->| Deserializer |
                            +--------------+
                                         |
                                         |
b = socket(...)                          |   +-------------------------+
...                                      |   |     Data structure      |
myvariable = pickle.loads( b.recv(...))  `-->| inside receiver process |
                                             +-------------------------+

(if you are thinking about it, for now ignore the fact that the recv() call may fetch only part of the serialized message because of packet fragmentation. For the purposes of this example, assume that we get the exact packet of data sent by the sender in that single recv() call).
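
For the curious, one common way to deal with partial reads is to prefix every message with its length, and loop on recv() until that many bytes have arrived. This is just a sketch: the helper names (send_msg, recv_exact, recv_msg) and the 4-byte length prefix are my own choices for illustration, and pickle should only ever be used to decode data from a sender you trust.

```python
import pickle
import socket
import struct


def send_msg(sock, obj):
    """Serialize obj and send it, prefixed by a 4-byte big-endian length."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock, n):
    """Keep calling recv() until exactly n bytes have been collected."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-message")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_msg(sock):
    """Read the length prefix, then the payload, then deserialize."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))


# A connected pair of sockets stands in for the sender and the receiver
a, b = socket.socketpair()
send_msg(a, {"msgId": 1, "isReady": True})
received = recv_msg(b)
a.close()
b.close()
print(received)
```

Back to the serializer itself.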

This is a definite improvement: Python comes pre-loaded with a generic serializer. We no longer have to care about what exists inside myvariable: integers, floats, strings, lists, dictionaries - they will all safely migrate across different platforms, different CPU endianness, different CPU word sizes, etc:

$ ipython
Python 2.7.3 (default, Mar 5 2013, 01:19:40)
Type "copyright", "credits" or "license" for more information.

In [1]: import pickle

In [2]: a = {'msgId': 1, 'myflag': 'a', 'value': 255, 
        'szDescription': 'Hello', 'isReady': True}

In [3]: pickle.dumps(a)
Out[3]: "(dp0\nS'msgId'\np1\nI1\nsS'myflag'\np2\nS'a'\np3\nsS'
         szDescription'\np4\nS'Hello'\np5\nsS'value'\np6\nI255
         \nsS'isReady'\np7\nI01\ns."

In [4]: print pickle.loads(pickle.dumps(a))
{'msgId': 1, 'myflag': 'a', 'szDescription': 'Hello',
 'value': 255, 'isReady': True}

We have no clue about how pickle.dumps encodes things, but we don't care ; as long as they decode fine on the other end, why should we?

Great.

But does this mean that we will code all our software in Python from now on?

Unfortunately, not always an option...

I wish :‑)

As a developer, you will inevitably come to the position where you need to work with other people who don't care for your preferred language.

Think about it...

How can you send your data structures across, to a program that is written in a completely different language?

Abstract Syntax Notation One (ASN.1)

You saw above that Python did a marvelous job of hiding the message encoding details by automating the handling of different types in the pickle module. Wouldn't it be fabulous if we had this kind of machinery across different languages?

Guess what - we do. Since the 1980s, in fact - it's called ASN.1.

The idea behind it is very simple: specify your exchanged message data types in a data description language:

Message ::= SEQUENCE {
    msgId INTEGER,
    myflag INTEGER,
    value REAL,
    szDescription OCTET STRING (SIZE(10)),
    isReady BOOLEAN
}

The language uses simple constructs to describe data types [1]. SEQUENCEs are what you would call structs or records in other languages - and as you can see, they contain descriptions of their fields. The usual basic types are there: BOOLEAN, INTEGER, REAL, ENUMERATED, OCTET STRING, etc - and SEQUENCEs can contain not only them, but also other SEQUENCEs, or arrays (SEQUENCE OFs).

Once we have written our ASN.1 grammar, we then feed it to an ASN.1 compiler - a tool that reads the specification, and emits, in our desired target language(s), (a) the language-specific type declarations, and (b) two functions per type: an encoder and a decoder, that encode and decode type instances to/from bitstreams.
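
To make the "(a) type declaration, (b) encoder and decoder" idea concrete, here is a hand-written Python sketch of the shape of such output for our Message. It is emphatically not what any ASN.1 compiler emits: the names (message_encode, message_decode) and the fixed big-endian layout are assumptions of mine, chosen only to show the three pieces side by side.

```python
import struct
from dataclasses import dataclass


@dataclass
class Message:
    """(a) The type declaration, mirroring the ASN.1 SEQUENCE."""
    msgId: int
    myflag: int
    value: float
    szDescription: bytes  # exactly 10 octets
    isReady: bool


# Hypothetical wire layout: big-endian, no padding.
# 8-byte int, 8-byte int, 8-byte double, 10 raw bytes, 1-byte boolean.
_LAYOUT = struct.Struct(">qqd10s?")


def message_encode(msg):
    """(b1) The encoder: type instance -> bytes."""
    return _LAYOUT.pack(msg.msgId, msg.myflag, msg.value,
                        msg.szDescription, msg.isReady)


def message_decode(data):
    """(b2) The decoder: bytes -> type instance."""
    return Message(*_LAYOUT.unpack(data))


original = Message(1, 97, 2.5, b"Hello\0\0\0\0\0", True)
wire = message_encode(original)
roundtrip = message_decode(wire)
print(len(wire), roundtrip == original)
```

The point of an ASN.1 compiler is that you never write (or maintain) this kind of code by hand - it is derived from the grammar, for every type, in every target language.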

A hands-on example

First, download the Data Modelling Tools ; a tool suite that contains a free, open-source ASN.1 compiler developed under the supervision of the European Space Agency.

Installing it is easy:

$ wget https://download.tuxfamily.org/taste/DMT/DMT-latest.tgz
...
$ tar zxvf DMT-latest.tgz
...
$ cd DMT-r*
$ ./install.sh

Install in what folder [Default: /opt/DMT] /home/ttsiod/DMT

Installation completed successfully.

The install script will tell you if you are missing any dependencies, and suggest installing them. It will also indicate what to do next:

Now please setup your shell startup scripts to source...

    source /home/ttsiod/DMT/DMT_vars.csh

...if you use C-shells, or

    source /home/ttsiod/DMT/DMT_vars.sh

...otherwise (bash, etc). You can of course also source these right 
now in your current shell, and then run any of the DMT tools ;
they will exist in the PATH (asn1.exe, asn2dataModel, aadl2glueC)

To learn what you can do with the DMT tools, start by reading
'/home/ttsiod/DMT/README' - this will point you to additional docs.

I am using bash, so I follow the second path:

$ echo '. /home/ttsiod/DMT/DMT_vars.sh' >> ~/.bashrc
$ source /home/ttsiod/DMT/DMT_vars.sh

$ which asn1.exe
/home/ttsiod/DMT/tools/asn1scc.3/asn1.exe

$ asn1.exe
No input files
...
Current Version is: 3.0.976 
Usage:

asn1  <OPTIONS> file1, file2, ..., fileN 

Where OPTIONS are:
...

The ASN.1 compiler is up and running [2].

Here's what the ASN.1 compiler creates when it is invoked on our simple ASN.1 grammar:

$ cat sample.asn
MY-MODULE DEFINITIONS AUTOMATIC TAGS ::= BEGIN

Message ::= SEQUENCE {
    msgId INTEGER,
    myflag INTEGER,
    value REAL,
    szDescription OCTET STRING (SIZE(10)),
    isReady BOOLEAN
}

END
$ asn1.exe -c -uPER sample.asn
$ ls -l
total 220
-rw-r--r-- 1 ttsiod ttsiod 22131 Jun 18 17:36 acn.c
-rw-r--r-- 1 ttsiod ttsiod 19479 Jun 18 17:36 asn1crt.c
-rw-r--r-- 1 ttsiod ttsiod 16467 Jun 18 17:36 asn1crt.h
-rw-r--r-- 1 ttsiod ttsiod 12757 Jun 18 17:36 ber.c
-rw-r--r-- 1 ttsiod ttsiod 48772 Jun 18 17:36 real.c
-rw-r--r-- 1 ttsiod ttsiod   254 Jun 18 17:36 sample.asn
-rw-r--r-- 1 ttsiod ttsiod 51789 Jun 18 17:36 sample.c
-rw-r--r-- 1 ttsiod ttsiod  4098 Jun 18 17:36 sample.h
-rw-r--r-- 1 ttsiod ttsiod 24443 Jun 18 17:36 xer.c

$ gcc -c -I . *.c
$ echo $?
0

So, a number of .c and .h files were generated, which GCC then successfully compiled.

The 'gateway' - the only file you need to care about - is sample.h. Remember in our description above, we said:

... an ASN.1 compiler reads this specification, and emits, in our desired target languages, (a) the language-specific type declaration, and (b) two functions per type: an encoder and a decoder, that encode and decode type instances to/from bitstreams.

This is the type declaration that ASN1SCC generated for our message:

$ cat sample.h
...
typedef struct {
    asn1SccSint msgId;
    asn1SccSint myflag;
    double value;
    Message_szDescription szDescription;
    flag isReady;
} Message;
...

asn1SccSint is a typedef inside the run-time library (asn1crt.h) - and is defined as a 64-bit int. Similarly, flag is typedef-ed to bool. So, the ASN.1 compiler generated a semantically-equivalent transformation of the ASN.1 grammar, into our target language's declaration of the corresponding types.

We also said the ASN.1 compiler generates two functions - an encoder, and a decoder. And indeed:

$ cat sample.h
...
flag Message_Encode(
    const Message* val, ByteStream* pByteStrm, int* pErrCode, 
    flag bCheckConstraints);
flag Message_Decode(
    Message* pVal, ByteStream* pByteStrm, int* pErrCode);
...

If we also add a constraint to a field in the grammar - for example, limiting msgId to the range 0..10000:

Message ::= SEQUENCE {
    msgId INTEGER (0..10000),
    myflag INTEGER,
    value REAL,
    szDescription OCTET STRING (SIZE(10)),
    isReady BOOLEAN
}

...then the ASN.1 compiler generates an error code definition:

#ifndef ERR_Message_msgId 
#define ERR_Message_msgId     1003  /*(0 .. 10000)*/
#endif

...and it is this error code that will be stored inside pErrCode, if we violate the constraint. That is, if we call Encode with an invalid value inside the .msgId field of the val argument, the encoder will report this error code.

In case you missed it, or it wasn't clear enough:

In ASN.1, we can specify not only the field types, but also limits on their values - and have them automatically checked!
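
In spirit, the generated check behaves something like the following Python sketch. The error code 1003 and the range come from the #define shown above ; the function name, and the use of 0 to mean "no error", are assumptions made for the sake of the example.

```python
ERR_MESSAGE_MSGID = 1003  # value of ERR_Message_msgId in the generated header


def check_msg_id(msg_id):
    """Return (ok, err_code), loosely mimicking the flag / pErrCode pair."""
    if 0 <= msg_id <= 10000:
        return True, 0  # 0 as "no error" is an assumption
    return False, ERR_MESSAGE_MSGID


ok_result = check_msg_id(5)
bad_result = check_msg_id(10001)
print(ok_result, bad_result)
```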

And that's the main idea. You can use this generated C code in your projects - it will just work. There are no external dependencies, no libraries to speak of, the code is there, open, for you to use as you please. Note that the encoders will properly handle all manner of potential mischief: endianness of the platform you compile it on, word sizes, etc. You can be sure that by using ASN.1, your encoded messages (that is, the representations inside the ByteStreams) can be sent to whatever platform you fancy, and they will decode fine, into the receiving platform's variables.

(Note: ASN1SCC is made specifically for embedded, safety-critical systems, so it only addresses ASN.1 grammars containing bounded (in size) messages. ASN.1 itself has no such limitation - e.g. you can model open-ended SEQUENCE OFs with it).

What if I have more than one message? Complex ones?

Actually, that's what ASN.1 was built for: to allow easy specification of all the messages that will be exchanged between your apps, regardless of their complexity. Here's a more advanced example, showing ENUMERATED types, nesting inside SEQUENCEs, etc:

MyInt ::= INTEGER (0 .. 20)

TypeEnumerated ::= ENUMERATED {
    red(0),
    green(1),
    blue(2)
}

My2ndEnumerated ::= TypeEnumerated

AComplexMessage ::= SEQUENCE {
    intVal          INTEGER(0..10),
    int2Val         INTEGER(-10..10),
    int3Val         MyInt (10..12),
    intArray        SEQUENCE (SIZE (10)) OF INTEGER (0..3),
    realArray       SEQUENCE (SIZE (15)) OF REAL (0.1 .. 3.14),
    octStrArray     SEQUENCE (SIZE (20)) OF OCTET STRING (SIZE(1..10)),
    enumArray       SEQUENCE (SIZE (12)) OF TypeEnumerated,
    enumValue       TypeEnumerated,
    enumValue2      ENUMERATED {
        truism(0),
        falsism(1)
    },
    label           OCTET STRING (SIZE(10..40)),
    bAlpha          T-BOOL,
    bBeta           BOOLEAN,
    sString         T-STRING,
    ...
}

The language is fairly simple, so you should be able to figure out what is going on. If not, you can study Olivier Dubuisson's freely available book for an extensive treatment of ASN.1 - all the way up to advanced features.

Processing this grammar with any ASN.1 compiler (including ASN1SCC), you would be all set to use the ready-made message definitions for AComplexMessage, TypeEnumerated, ... and their corresponding encoders/decoders.

In plain words - the complexity and the number of messages don't matter when you use ASN.1.

And this works for any language?

Yes, there are ASN.1 compilers for almost any language you can think of. ASN1SCC in particular has been developed under the supervision of the European Space Agency, and it targets C and Ada, with specific emphasis on embedded, safety-critical systems - for which it does some pretty amazing things:

Suffice to say, if you are involved with embedded development, it's worth taking a look.

Isn't XML/JSON better than this?

Well, it depends on your definition of 'better'.

If you value optimal encoding/decoding performance, minimal encoded message size, guarantees of code safety, and minimal power requirements for encoding/decoding messages, then no, XML is most definitely NOT better. That's why your mobile phone has used ASN.1 encoding while you were reading this article. I am not kidding - almost every single signalling message that your phone sends to the local cell tower, is encoded via ASN.1!
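
To get a feeling for the size argument, here is a rough Python comparison between a JSON rendering of our sample message and a compact binary one. The binary layout below is hand-picked by me for illustration (it is not an ASN.1 encoding), and the field values are made up.

```python
import json
import struct

message = {"msgId": 1, "myflag": 15, "value": 2.5,
           "szDescription": "Hello", "isReady": True}

# Text encoding: field names and punctuation travel with every message
as_json = json.dumps(message).encode("utf-8")

# Hypothetical binary layout: 2-byte msgId, 1-byte myflag, 4-byte float,
# 10-byte description, 1-byte boolean - both sides must agree on it
as_binary = struct.pack(">HBf10s?",
                        message["msgId"],
                        message["myflag"],
                        message["value"],
                        message["szDescription"].encode("ascii"),
                        message["isReady"])

print("JSON bytes:  ", len(as_json))
print("binary bytes:", len(as_binary))
```

The binary form is several times smaller, and decoding it involves no text parsing at all. A bit-oriented ASN.1 encoding can pack things tighter still, since it knows the constraints of each field.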

If on the other hand...

... then yes, XML/JSON may be a better match for you.

What about CORBA? Or other middleware tech?

Let me repeat: If you care about optimal encoding/decoding performance, optimal memory use, ...

Remember, when we are speaking about ASN.1, we are looking at technology that was built by the Ancients. Being optimal wasn't a choice, back then - it was mandatory. You didn't have resources to waste. When you use ASN.1, you simply automate the parts of message marshalling that can be automated, without losing any performance or wasting any memory.

Do a low-level comparison of ASN.1 with any other technology that involves marshalling, and I guarantee you will be paying something at runtime: memory use, performance, or both.

How does ASN.1 represent the messages on the wire?

ASN.1 comes with a set of predefined rules that specify how encoding is done. You choose one when you invoke the ASN.1 compiler on your ASN.1 grammar - ASN1SCC, for example, currently supports four:

Note that choosing an encoding has zero impact on your type declarations - you can switch between encodings without changing anything in your user code, except the name of the encoding/decoding function you call: e.g. instead of the default encoding (UPER), where you call Message_Decode, you'd call Message_BER_Decode - etc.
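
As a taste of why constraints matter on the wire: as I understand the unaligned PER rules (ITU-T X.691), an INTEGER constrained to a range is encoded as its offset from the lower bound, in the minimum number of bits that can represent the range. The Python sketch below covers only that simple case ; the function names are mine, and it returns strings of '0'/'1' characters purely for readability.

```python
def uper_constrained_int_bits(lower, upper):
    """Bits used for an INTEGER constrained to lower..upper."""
    value_count = upper - lower + 1
    return (value_count - 1).bit_length()


def uper_encode_constrained_int(value, lower, upper):
    """Encode value as its offset from lower, in a fixed-width bit field."""
    if not lower <= value <= upper:
        raise ValueError("value violates its constraint")
    width = uper_constrained_int_bits(lower, upper)
    if width == 0:
        return ""  # a single-valued range needs no bits at all
    return format(value - lower, "b").zfill(width)


msg_id_bits = uper_constrained_int_bits(0, 10000)   # msgId INTEGER(0..10000)
myflag_bits = uper_constrained_int_bits(0, 255)     # myflag INTEGER(0..255)
int2val_bits = uper_constrained_int_bits(-10, 10)   # int2Val INTEGER(-10..10)
encoded_msg_id = uper_encode_constrained_int(5, 0, 10000)

print(msg_id_bits, myflag_bits, int2val_bits, encoded_msg_id)
```

So a msgId limited to 0..10000 costs 14 bits instead of the 64 bits of the in-memory asn1SccSint - the tighter your constraints, the smaller the encoded message.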

OK, I see it now - this can be useful

ASN.1 basically allows your programs to communicate with implementations coded in other languages, by establishing common ground - through a simple [1] data definition language. If you use it, arbitrarily complex messages are handled easily (think of arrays containing unions containing structs - or, in ASN.1 parlance, SEQUENCE OFs containing CHOICEs containing SEQUENCEs). You don't have to ever implement any serializers/deserializers of the messages, and provided you use a compiler like ASN1SCC, you also get guarantees of correctness, type safety and performance - for free.

And then, you realize you can do some magic...

Equally important - and this is a matter for another blog post, but consider this a teaser - by using ASN.1 to define your messages, you can then automatically create many things that depend on the message definitions: in the case of the work I've done for the European Space Agency, we've built automatic translators of ASN.1 messages towards...

Two examples of ASN.1-based mappers: A SWIG-based Python wrapper, and a SQL exporter

The following is a hands-on example of how our SWIG-based Python mapper - bundled in our DMT tools - wraps around the UPER C functions generated by ASN1SCC, and allows Python code to speak with ASN.1 UPER encoded data:

$ cat sample.asn
MY-MODULE DEFINITIONS AUTOMATIC TAGS ::= BEGIN

Message ::= SEQUENCE {
    msgId INTEGER(0..10000),
    myflag  INTEGER(0..255),
    value REAL(0.0 .. 3.14159),
    szDescription OCTET STRING (SIZE(1..10)),
    isReady BOOLEAN
}

END
$ mkdir pythonOutput
$ asn2dataModel.py -toPython -o pythonOutput/ sample.asn
$ cd pythonOutput
$ make -f Makefile.python
$ ...

... which then gives you a full API over your ASN.1 types, through Python classes:

$ cat test.py
from sample_asn import *

# Create a buffer big enough for the maximum form of the PER-encoded data
d1 = DataStream(DV.Message_REQUIRED_BYTES_FOR_ENCODING)

# Create an instance of the message and set the fields
msg = Message()
msg.msgId.Set(5)
msg.myflag.Set(15)
msg.value.Set(2.718)
msg.szDescription.SetFromPyString("Hello!")
msg.isReady.Set(True)

# Encode the message as per ASN.1 UPER encoding rules
msg.Encode(d1)

# Get the encoded binary data - send them with sockets/pipes/whatever...
b = d1.GetPyString()

# ...and eventually place them in a buffer and use them 
# as a source for decoding on the receiver code:

d2 = DataStream(DV.Message_REQUIRED_BYTES_FOR_ENCODING)
d2.SetFromPyString(b)
msgReceived = Message()
msgReceived.Decode(d2)

print "Received msgId:", msgReceived.msgId.Get()
print "\nAll fields:\n"
msgReceived.PrintAll()
$ python ./test.py
Received msgId: 5

All fields:

msgId:
    5
myflag:
    15
value:
    2.718
szDescription:
    Hello!
isReady:
    True
$

The code shows a full round-trip that passes from structure, to byte buffer, and back to structure.

Similarly, here's our SQL mapper, automatically generating the SQL schema for storing/retrieving our messages:

$ asn2dataModel.py -toSQL sample.asn

$ ls -l
total 16
drwxr-xr-x  2 ttsiod ttsiod 4096 Jun 20 11:15 ./
drwxr-xr-x 18 ttsiod ttsiod 4096 Jun 20 10:55 ../
-rw-r--r--  1 ttsiod ttsiod  261 Jun 20 11:15 sample.asn
-rw-r--r--  1 ttsiod ttsiod 1232 Jun 20 11:15 sample.sql

$ cat sample.sql
--  SQL statements for types used in "sample.asn"

CREATE TABLE Message_msgId (
    id int PRIMARY KEY,
    data int NOT NULL, CHECK(data>=0 and data<=10000)
);

CREATE TABLE Message_myflag (
    id int PRIMARY KEY,
    data int NOT NULL, CHECK(data>=0 and data<=255)

...

CREATE TABLE Message (
    id int NOT NULL,
    msgId_id int NOT NULL,
    myflag_id int NOT NULL,
    value_id int NOT NULL,
    szDescription_id int NOT NULL,
    isReady_id int NOT NULL,
    CONSTRAINT Message_pk PRIMARY KEY (id),
    CONSTRAINT msgId_fk FOREIGN KEY (msgId_id)
        REFERENCES Message_msgId(id),
    CONSTRAINT myflag_fk FOREIGN KEY (myflag_id)
        REFERENCES Message_myflag(id),
    CONSTRAINT value_fk FOREIGN KEY (value_id)
        REFERENCES Message_value(id),
    CONSTRAINT szDescription_fk FOREIGN KEY (szDescription_id)
        REFERENCES Message_szDescription(id),
    CONSTRAINT isReady_fk FOREIGN KEY (isReady_id)
        REFERENCES TaStE_Isready_type(id));
$ 

As you can see, the transformation is also converting ASN.1 constraints to SQL constraints - and more importantly, works regardless of the complexity of the message.
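
You can watch such a constraint at work without setting up a database server. The sketch below feeds the Message_msgId table definition from above to an in-memory SQLite database (SQLite is my choice for the demo ; the generated SQL is not tied to it) and tries to store a value outside 0..10000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Table definition copied from the generated sample.sql
conn.execute("""
CREATE TABLE Message_msgId (
    id int PRIMARY KEY,
    data int NOT NULL, CHECK(data>=0 and data<=10000)
)
""")

# A value inside the ASN.1 range is accepted
conn.execute("INSERT INTO Message_msgId VALUES (1, 5)")

# A value outside the range is refused by the database itself
try:
    conn.execute("INSERT INTO Message_msgId VALUES (2, 10001)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT data FROM Message_msgId").fetchall()
conn.close()
print(rejected, stored)
```

The limit was written once, in the ASN.1 grammar, and it is enforced both in the encoders and in the database.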

Conclusion

Putting it simply: ASN.1 is another technology that is optimal for certain problem domains - and yet people will ignore it and pay the penalty in performance, memory and robustness.

By modelling your system's messages, ASN.1 also allows for lots of automatic code generation. In our case, we identified tremendous opportunities for automation, and have made a number of ASN.1-based code generators, that, among other things, automatically...

Fellow developers, have a look. You may find out ASN.1 can make your work easier, simpler, and more efficient.

Notes

  1. If you are unfortunate enough to come into contact with lots of Telecom standards, you will often see how a simple and useful idea can become ridiculously complex as feature creep takes hold. To whatever extent you can, resist this - in the case of ASN.1, I humbly suggest that you use only the basic principles: type specification and constraints.

  2. The compiler is an F# application, so it can run (via Mono) under Linux and OS X, or natively under Windows (to put it simply: it works on all major platforms). Why it is written in F# is a matter for another blog post - suffice to say, OCaml (the mother of F#) is a language with a strong type system that prevents many potential issues, detecting them at compile-time. It is very important for any code generator (and an ASN.1 compiler is exactly that!) to detect as many errors as possible at compile-time.

  3. Since ASN1SCC targets embedded platforms, memory is an issue (we don't want to allocate stuff from the heap, since the heap may run out - when you're in space, what can you do if you run out of heap?...). The compiler therefore emits #defines that allow us to reserve the necessary memory during compile-time:

void foo()
{
    static Message decodedPDU;
    static byte encBuff[Message_REQUIRED_BYTES_FOR_ENCODING];
    static BitStream bitStrm;

    BitStream_Init(&bitStrm, encBuff, Message_REQUIRED_BYTES_FOR_ENCODING);
    flag ret = Message_Encode(pVal, &bitStrm, pErrCode, TRUE);
    ...
}


profile for ttsiodras at Stack Overflow, Q&A for professional and enthusiast programmers
GitHub member ttsiodras
 
Updated: Sat Oct 8 12:39:34 2022