Monday, September 1, 2014

Type-agnostic int/float greater/smaller number comparison


It happens sometimes: a problem sits in your mind for a while and one day the solution suddenly comes.
I know that what I'm going to explain may be obvious to many, probably it is, but because it is also quite simple and nicely low-level I'd like to share it in this post.
The problem: let's suppose I want to be able to compare two numbers without knowing whether they are floating point or integers. This basically means finding a single procedure that is good for both types; we could also call it "type agnostic".
Why should I desire such a thing?
There could be more than one reason: 
  • Depending on the architecture, integer comparison can be faster.
  • You don't want to add an extra if on the type, breaking the pipeline, just to choose between two procedures.
  • You want to keep your OpenCL or CUDA kernels as simple and as fast as possible, and so on...
A good starting point is to consider that integers may come to us either signed or unsigned, whereas floating points are always signed! Since we want to use the same assembly instruction for both types and we need to deal with the float sign, the best candidate on the integer side is the signed one.
For our convenience we are going to focus on 32-bit comparisons, but everything we are going to say also applies to 16, 64 and 128 bits.
As usual we need a little (and incomplete) overview... let's start by seeing how int and float numbers are encoded. Of course experts can skip this part.
Signed integers in modern computers are encoded using the two's complement representation.
We all know how to count in base 2: 0, 1, 10, 11, 100, 101... etc etc
Two's complement uses the first bit to store the sign: 0 means + and 1 means -. So, for the non-negative values of a 32-bit integer (whose bit patterns coincide with the unsigned uint32_t ones), the sequence is:

0 -> 00000000000000000000000000000000
1 -> 00000000000000000000000000000001
2 -> 00000000000000000000000000000010
3 -> 00000000000000000000000000000011
4 -> 00000000000000000000000000000100 and so on...

Talking about negative numbers, we already introduced the sign bit (the first one); the surprising fact is that all the other bits are inverted compared to what we would expect.

-1 -> 11111111111111111111111111111111
-2 -> 11111111111111111111111111111110
-3 -> 11111111111111111111111111111101
-4 -> 11111111111111111111111111111100
-5 -> 11111111111111111111111111111011 etc...

I think this is quite intuitive, but if you want to be a little more formal, here is some pseudocode to decode a two's complement bit sequence:

if (sign_bit == 0)
    value = decode_unsigned(bits_sequence)
else
    value = -1 * (decode_unsigned(invert(bits_sequence)) + 1)

My advice, if you are new to this, is to play with it a little bit.
You can do that by taking advantage of the fact that in gdb you can change variable values and print their bit representation directly.
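If you prefer something you can compile and tweak, a quick throwaway sketch like this one simply reprints the tables above (the helper name print_bits is just for illustration):

#include <stdint.h>
#include <stdio.h>

/* Print the 32-bit two's complement pattern of a signed integer. */
static void print_bits(int32_t value)
{
    uint32_t bits = (uint32_t)value;          /* same bit pattern, no sign issues */
    printf("%11d -> ", value);
    for (int i = 31; i >= 0; i--)
        putchar(((bits >> i) & 1u) ? '1' : '0');
    putchar('\n');
}

int main(void)
{
    int32_t samples[] = { 0, 1, 2, 3, 4, -1, -2, -3, -4, -5 };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        print_bits(samples[i]);
    return 0;
}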


Coming to the floating point encoding (defined in the IEEE 754) the situation is slightly more complex (but not that much).
For 32-bit floats (IEEE 754 single precision) the layout is: 1 sign bit, 8 exponent bits and 23 mantissa bits.

Again we find the first bit used as the sign bit, and again 0 means + and 1 means -.
The next 8 bits encode the exponent as an unsigned number with a bias of 127, i.e. the actual exponent is the stored value minus 127 (we will see why later). The remaining 23 bits are used for the mantissa (or fractional part).
The mantissa encoding works like this: every bit is assigned a value that becomes smaller moving to the right. In detail, the first one (position 22) has value 1/2, the second (position 21) 1/4 and the third 1/8. The idea behind it is pretty clear: if we take any bit in the mantissa, the sum of the values of all the bits to its right is always smaller! Once we have computed the mantissa value we add 1 to it, so the mantissa always ranges between 1 and 2.
You can aggregate sign, exponent and mantissa using this formula: value = (-1)^s * m * 2^e
where s is the sign bit, m is the mantissa value and e is the (unbiased) exponent value.
Last but not least, there are two notable exceptions to this rule:
  1. If all the exponent bits are 1 and all the mantissa bits are 0, the value is +Infinity or -Infinity (depending on the sign bit).
  2. If all the exponent bits are 1 and the mantissa is non-zero, the value is NaN.
So there are definitely quite a few bit combinations that don't correspond to regular numbers. The reason why integer greater/smaller comparison is typically faster than the float one is that with the integer encoding the CPU doesn't have to deal with this kind of complexity.
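To make the encoding concrete, here is a small sketch that pulls a float apart and rebuilds its value with the formula above (it ignores the two special cases and denormalized numbers; compile with -lm):

#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    float f = -6.25f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);                         /* raw bit pattern    */

    uint32_t sign     = bits >> 31;                         /* 1 bit              */
    int32_t  exponent = (int32_t)((bits >> 23) & 0xFF) - 127;  /* 8 bits, bias 127 */
    uint32_t fraction = bits & 0x7FFFFF;                    /* 23 bits            */

    double mantissa = 1.0 + fraction / (double)(1 << 23);   /* implicit leading 1 */
    double value = (sign ? -1.0 : 1.0) * mantissa * pow(2.0, exponent);

    printf("sign=%u exponent=%d mantissa=%f -> %f (original %f)\n",
           (unsigned)sign, (int)exponent, mantissa, value, (double)f);
    return 0;
}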

Now that we know "everything" about number encoding, at first sight we are not much further ahead than before. To solve our problem we must find some kind of similarity between the comparison process for signed ints and the one for floats.

Let's analyse how we would decide whether signed int1 > signed int2... the procedure would be (a small sanity-check sketch follows the list):
  1. Compare the sign bits; if they are different we already have a winner! unless...
  2. If both numbers are positive, move from left to right searching for the first bit that differs between the two numbers. The one with a 1 in that position is the greater.
  3. If both numbers are negative... do the same!! Because the bits of a negative number in two's complement notation are "inverted", but the sign of the number is also negative, the two effects cancel out:
-1 * -1 = 1
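Here is that procedure sketched in C and checked against the native < operator (the name bitwise_less_than is just for illustration):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static int bitwise_less_than(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a, ub = (uint32_t)b;
    uint32_t sign_a = ua >> 31, sign_b = ub >> 31;

    if (sign_a != sign_b)                      /* step 1: different signs      */
        return sign_a > sign_b;                /* the negative one is smaller  */

    for (int i = 30; i >= 0; i--) {            /* steps 2 and 3: the same scan */
        uint32_t bit_a = (ua >> i) & 1u, bit_b = (ub >> i) & 1u;
        if (bit_a != bit_b)                    /* first different bit decides  */
            return bit_a < bit_b;
    }
    return 0;                                  /* equal */
}

int main(void)
{
    int32_t samples[] = { -100, -5, -4, -1, 0, 1, 3, 100 };
    int n = (int)(sizeof samples / sizeof samples[0]);

    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            assert(bitwise_less_than(samples[i], samples[j]) ==
                   (samples[i] < samples[j]));

    printf("bitwise comparison matches the native one\n");
    return 0;
}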

Not only do we now know one reason why the two's complement notation was chosen, but we have also learned a really important lesson from point 2 of the procedure above: bits on the left are always more important than bits on the right, so we only need to find the first different bit to decide which number is greater.

Now let's take a look at a couple of important similarities in the floating point encoding that we can take advantage of:
  1. The first bit represents the sign.
  2. Bit significance also decreases from left to right.
In fact the only difference is that when the number is negative, the remaining 31 bits are not inverted!
So what we have to do is simply take the floating point numbers we want to compare and re-encode them in this new way. After that, for greater/smaller comparison operations, we can forget completely that they were ever floating point numbers.

We will check if the float is negative and in that case we will have to:
  1. Invert all the bits using the bitwise NOT operator "~".
  2. Set the first bit back to 1 using the bitwise OR operator "|", with a bit sequence whose first bit is 1 and all the others are 0 as the second operand. In hexadecimal notation this 32-bit sequence is 0x80000000.
Finally here is the function:
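A minimal version looks like this (assuming IEEE 754 single precision floats and two's complement integers; the name float_to_orderable_int is arbitrary):

#include <stdint.h>
#include <string.h>

/* Re-encode a float so that comparing the results as plain signed 32-bit
 * integers gives the same ordering as comparing the original floats. */
static int32_t float_to_orderable_int(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);      /* grab the raw bit pattern       */

    if (bits & 0x80000000u)              /* negative: sign bit is set      */
        bits = ~bits | 0x80000000u;      /* 1. invert all the bits         */
                                         /* 2. set the sign bit back to 1  */
    return (int32_t)bits;
}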



Of course, even if it is less readable, we can also write it as a macro.
And here is the test program with the macro version in it.
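A sketch of such a test could look like this (the macro is the same transformation as above applied to the raw bit pattern; its name, the seed and the iteration count are arbitrary):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Macro version: takes the raw float bit pattern as a uint32_t and yields a
 * comparable int32_t (the uint-to-int cast relies on two's complement). */
#define FLOAT_BITS_TO_ORDERABLE_INT(b) \
    (((b) & 0x80000000u) ? (int32_t)(~(b) | 0x80000000u) : (int32_t)(b))

int main(void)
{
    srand(1234);

    for (long n = 0; n < 1000000; n++) {
        /* build two random 32-bit patterns byte by byte */
        uint32_t pa = 0, pb = 0;
        for (int k = 0; k < 4; k++) {
            pa = (pa << 8) | (uint32_t)(rand() & 0xFF);
            pb = (pb << 8) | (uint32_t)(rand() & 0xFF);
        }

        float fa, fb;
        memcpy(&fa, &pa, sizeof fa);
        memcpy(&fb, &pb, sizeof fb);

        /* keep only valid numbers: NaN is the only value for which x != x */
        if (fa != fa || fb != fb)
            continue;
        /* +0.0 and -0.0 compare equal as floats but have different patterns */
        if (fa == 0.0f && fb == 0.0f)
            continue;

        int32_t ia = FLOAT_BITS_TO_ORDERABLE_INT(pa);
        int32_t ib = FLOAT_BITS_TO_ORDERABLE_INT(pb);

        /* the integer comparison must agree with the float comparison */
        assert((fa > fb) == (ia > ib));
        assert((fa < fb) == (ia < ib));
    }

    printf("all comparisons agree\n");
    return 0;
}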


As you can see, I'm extracting random bit sequences to use as sources. I'm also careful to verify that every bit sequence I use is a valid floating point value, using the != operator (NaN is the only value for which x != x).

Conclusions: even if I know I'm not the first to use ideas similar to this one, I think it was quite a fun use case to introduce number encoding and a couple of bitwise operators.
But most importantly, I think we've learned that sometimes, to solve a problem, we can change the encoding of the data instead of the algorithm that processes it.

The small code examples can also be cloned from here.

NOTE: A colleague of mine has correctly pointed out that I didn't treat denormalized numbers:
A denormalized number is a floating point number in which all the exponent bits are zero. In that case, to decode the number, we simply don't add 1 to the mantissa value. Denormalized numbers were introduced to encode numbers so small that the minimum exponent (-126) is not sufficient; in this way it is possible to encode smaller numbers, paying the price in precision.
Again, here is the standard for more details.
Because the smallest normalized number is bigger than the biggest denormalized one, the left bits are still always more important than the right ones, so this comparison method also works in this case (as proved by the test source provided).

Tuesday, July 29, 2014

Memory Hacks - Freeze and reload a C program - part II


"Here we come to the break"

Now that in the previous post we obtained a way to execute our program with "stable" addresses, we can enter the real game.
In this post we are going to cover the heap part; before starting to code, a quick (and incomplete) overview is (again) necessary.
What we have to know about the heap (or data segment) is that it is simply a portion of memory that the operating system gives us to store data dynamically.
Using the picture above again, we can see where the heap is typically located nowadays. We can also observe that it grows upward.


Obviously we all know that "dynamically" means we don't know how much memory the program will need before the execution itself.
The Linux kernel gives us this classical UNIX interface to tackle the problem.

#include <unistd.h>
int brk(void *addr);

void *sbrk(intptr_t increment);

A call to brk sets the new end of the data segment to the value specified in addr, and on success returns 0.
Calling sbrk, it is possible to increment (or decrement) the size of the data segment by a certain value. This function returns the previous end of the data segment, so if the increment is 0 it simply returns the current end of the heap.
For more info on these two functions the man pages will obviously be more exhaustive.
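Just to get a feel for the interface, a minimal usage sketch:

#define _DEFAULT_SOURCE           /* for the brk()/sbrk() declarations in unistd.h */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    void *old_end = sbrk(0);      /* increment 0: just query the current end  */
    sbrk(4096);                   /* grow the data segment by 4 KiB           */
    void *new_end = sbrk(0);

    printf("old end of the data segment: %p\n", old_end);
    printf("new end of the data segment: %p\n", new_end);

    brk(old_end);                 /* shrink it back to where it was           */
    return 0;
}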
It's time now for two observations:
  1. The entire problem of dynamic allocation is brilliantly solved at the interface level with only two simple functions.
  2. Thinking about it, we realize that we don't actually need both of them to solve the problem. Maybe for this reason neither of the two is specified in the C or POSIX standards, but traditionally UNIX systems implement at least one. In fact, taking a look at the source of UNIX Version 6 in the Lions' book, I find only traces of sbrk (at the time named sbreak). But the search provides me with another interesting/funny result in the form of a comment:

/* break system call.
* -- bad planning: "break" is a dirty word in C.
*/

Luckily nobody's perfect...

Now that we have introduced the UNIX interface, we can point out that malloc is in fact a user space function in the standard library that ultimately calls sbrk.
Instead of looking at the glibc implementation, if you are interested in a simpler but working malloc you can find one in the K&R. So now it is clear that all the allocators we commonly use in the end go through either brk or sbrk. Working on top of the operating system interface will make our implementation agnostic about the kind of memory allocator used by the program (if any is used at all).

Ok let's go with the implementation!
The first thing to do when we launch our program is to find and store the heap start address. It's very important to do this before any allocation in the data segment. Given the interface described above, this simply translates to:
heap_start = sbrk(0);
Now we need the function, to be called before freezing our program, that saves the content of the data segment to a file. With our previous knowledge the implementation is pretty straightforward.
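A minimal sketch of such a store_heap (error handling kept to the bare minimum):

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Dump everything between heap_start and the current end of the data
 * segment into "heap.data". */
void store_heap(void *heap_start)
{
    void *heap_end = sbrk(0);     /* current end of the data segment */
    size_t size = (size_t)((char *)heap_end - (char *)heap_start);

    FILE *f = fopen("heap.data", "wb");
    if (f == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
    }
    fwrite(heap_start, 1, size, f);
    fclose(f);
}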


store_heap simply reads the current end address of the data segment and, taking the initial address as its argument, dumps the portion of memory between the two values directly into a binary file named heap.data.

We now need the symmetric function that will load back the previously stored data.
This is a little bit more tricky... We are using fopen from stdio.h to keep our life as simple as possible, but fopen itself allocates some memory in the data segment.
We will have to:
  1. open the file
  2. store the new heap end
  3. allocate the necessary memory to load back our data
  4. write the data in memory (starting just after the portion allocated by fopen)
  5. close the file
  6. shift our heap content to the original position
  7. move the end of the data segment back to its original position.
This is basically what our load_heap is doing.
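Sketched in code, following the seven steps above (again with minimal error handling), load_heap could look like this:

#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Reload "heap.data" and put the data segment back exactly as it was
 * when store_heap was called. */
void load_heap(void *heap_start)
{
    /* 1. open the file (this allocates some heap memory itself) */
    FILE *f = fopen("heap.data", "rb");
    if (f == NULL) {
        perror("fopen");
        exit(EXIT_FAILURE);
    }

    /* 2. store the new heap end, i.e. just after fopen's allocations */
    char *scratch = sbrk(0);

    /* how big is the dump? */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    /* 3. allocate the memory needed to load back our data */
    sbrk(size);

    /* 4. read the dump just after the portion used by fopen */
    if (fread(scratch, 1, (size_t)size, f) != (size_t)size) {
        perror("fread");
        exit(EXIT_FAILURE);
    }

    /* 5. close the file */
    fclose(f);

    /* 6. shift the heap content back to its original position
     *    (memmove because the two regions may overlap) */
    memmove(heap_start, scratch, (size_t)size);

    /* 7. move the end of the data segment back to its original position */
    brk((char *)heap_start + size);
}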


We are almost there; the only thing missing is the main function.

Here is an example (written as simply as possible):
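Something along these lines (a sketch using the store_heap/load_heap functions above; the hardcoded address and the strings are purely illustrative):

#define _DEFAULT_SOURCE
#include <malloc.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

void store_heap(void *heap_start);
void load_heap(void *heap_start);

int main(int argc, char *argv[])
{
    void *heap_start;
    char *str;

    mallopt(M_MMAP_MAX, 0);           /* tell malloc to never use mmap          */
    heap_start = sbrk(0);             /* heap start, before any allocation      */

    if (argc > 1 && strcmp(argv[1], "reload") == 0) {
        /* second run: restore the frozen heap and reuse the old pointer */
        load_heap(heap_start);
        str = (void *)0x602010;       /* hardcoded from the first run (see below) */
        printf("reloaded string: %s\n", str);
    } else {
        /* first run: allocate, fill and freeze */
        str = malloc(64);
        strcpy(str, "hello, frozen heap!");
        printf("str lives at %p\n", (void *)str);
        store_heap(heap_start);
    }
    return 0;
}

Run it once (through the ASLR-disabling launcher from part I) to freeze the heap, then run it again with the reload argument to get the string back.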


You will easily notice a couple of details:

#include <malloc.h>

mallopt(M_MMAP_MAX, 0);


This is because the Linux kernel actually provides a third way to allocate memory: anonymous memory mapping through mmap.

Because this is typically convenient for big portions of memory, malloc could sometimes use it as well.

If (as I am assuming here) we are interested in using malloc, it's a good idea to instruct it not to use mmap, with the code just shown.

Moreover you will have noticed that I stored the value of str directly in the code with:

str = (void*)0x602010;

I did this simply because, when reloading the program, we will not have the correct pointer value in str.
Since this variable lives on the stack, we will learn how to restore it in the next post. For now I have simply read the pointer value from the print in the first execution and hardcoded it in the code.

Do the same, because depending on various factors this address will probably be different on your computer!
Here is an example of how to use our little library:

And here is the output on my computer.



To prove that we are not doing anything too illegal, here is the valgrind output with no errors.


Next post stack time!

Thursday, July 10, 2014

Memory Hacks - Freeze and reload a C program - part I


Some years ago I was studying some memory topics, in short the kind of stuff that makes every C coder worthy of the name happy...
As a result I asked myself: is it possible, in a simple way, to save "the state" of a generic C program?
It would be really nice to save the state of a program to the hard disk while it's running and reload it when needed, restarting it from the saved state!
Even if I think the resulting solution is hardly applicable in production, I suppose it could be quite instructive for anyone trying to get into these kinds of topics.
Because the main focus of this post is to build a simple example of a C program with freeze, save and restart capabilities, we will only introduce some concepts in a really informal way. More research on these topics is left to the interested reader.
First we have to take a look at the classical memory layout of a C program.


Thanks to virtual memory, every program thinks it has all the memory for itself, from address 0 up to 2^64 (we are obviously talking about 64-bit systems).
We will see some more details later, but we will focus on 3 main points:
  1. Stack: the history of the called functions is on the stack (basically what gdb shows with the backtrace command while stopped at a breakpoint). The values of local variables are also stored there. As you can see, the stack starts from a really high address and "grows" downward, towards zero.
  2. Heap: the heap (sometimes referred to as the data segment) holds all the dynamically allocated memory, so typically most of our data.
  3. CPU registers: the registers hold values used by the processor during computation. The register we are most interested in is the instruction pointer, also called the program counter. This register simply stores the address of the next instruction the processor will execute. Instructions (i.e. our program) are stored in the text segment (see the first figure). It's quite clear that the instruction pointer is a critical value to save for our purpose, because it basically holds the position of the execution of our program inside the code.
If you don't know anything about these topics and you really want to understand what is going on in your computer, I strongly suggest the book ProgrammingGroundUp.
One last important point: for simplicity we are assuming a program without global variables (their use is, in general, not good practice anyway).
Before starting we face the first problem...
Try to compile and execute this example:
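Something as simple as this will do (a sketch: it just prints the address of a stack variable, of main and of a malloc'd block):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int stack_var = 0;                      /* lives on the stack */
    void *heap_ptr = malloc(16);            /* lives on the heap  */

    printf("stack variable address:  %p\n", (void *)&stack_var);
    printf("main function address:   %p\n", (void *)&main);
    printf("heap allocation address: %p\n", heap_ptr);

    free(heap_ptr);
    return 0;
}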


Here is a screenshot of the output on my computer.
First we can observe that, as expected, the stack variable has a really high address. The main function address (which lives in the text segment) and the address of the memory allocated on the heap via malloc are low values, also as expected.

But bad news!! The addresses of the memory allocated on the heap and of the stack variable keep changing between runs!
In fact only the address of the main function (which is stored in the text segment) stays stable.
The problem is pretty clear: if the heap and the stack do not start from the same addresses as before, when we reload the saved data it will all end up at a new location, shifted by a certain value. In this situation all the pointers will point to the wrong addresses and nothing will work.

But with the bad news some good news is coming too... In the lower buffer of trusty emacs you can see the output of the same program running under gdb.
The good news is pretty evident: there the addresses are stable. Why?
The answer is Address Space Layout Randomization.
ASLR (for friends) is a technique adopted to mitigate buffer overflow attacks; it basically consists in randomly changing the initial addresses of the stack and the heap at every program run.
While this technique is good for security reasons, it is not good for our purpose, so we first have to disable ASLR in some way.
Luckily gdb disables ASLR to provide stable addresses across different debug sessions (a treatment that not all debuggers will grant you...).
So, to do the same, it was only necessary to take a quick look at the gdb source code and extrapolate something like this:
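A minimal sketch of such a launcher (using personality() with ADDR_NO_RANDOMIZE, essentially what gdb does under the hood):

#include <stdio.h>
#include <stdlib.h>
#include <sys/personality.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s program [args...]\n", argv[0]);
        return EXIT_FAILURE;
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* child: disable ASLR for this process, then exec the target,
         * which inherits the modified personality */
        if (personality(personality(0xffffffff) | ADDR_NO_RANDOMIZE) == -1)
            perror("personality");
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(EXIT_FAILURE);
    }

    /* parent: simply wait for the child to finish */
    waitpid(pid, NULL, 0);
    return EXIT_SUCCESS;
}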

We now have a tool that can spawn a child process after having set the "personality" of the executable properly. In this way the kernel is told not to apply ASLR to the new process.

It works!
The next part follows here.