Category Archives: tech

NoSQL–dive into graph DB

I am used to relational databases – I have worked with Oracle databases, a little with MySQL and currently with Microsoft SQL Server. The problem with these databases is that it is hard to show relationships between objects in a natural way.

In a relational DB the tables are the objects, and a relationship between them is either part of one of those tables or (what makes such schemas more complicated!) an additional join table for many-to-many relations.

One idea to solve this issue is a different approach – graph databases. These use graph structures to represent and store data. There are three main elements – nodes, properties and edges. Nodes, as usual, are objects such as people, accounts or cars. This part is easy to understand.

Properties are details about a particular node – for example, for a person we can store properties such as name, surname and age.

The connections between nodes are called edges, and they represent the relationship between two nodes. This is where most of the really important information is stored. And this is also the main difference from relational databases.
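To make the model concrete, here is a minimal C# sketch of the three elements (the class and property names are my own, purely for illustration – this is not Neo4j's API):

using System.Collections.Generic;

// A node with arbitrary properties (e.g. a person, an account, a car).
class Node
{
    public Dictionary<string, object> Properties = new Dictionary<string, object>();
}

// An edge: the relationship between two nodes – this is where the interesting
// information lives (an edge can carry properties of its own as well).
class Edge
{
    public Node From;
    public Node To;
    public string Label;   // e.g. "KNOWS", "OWNS"
    public Dictionary<string, object> Properties = new Dictionary<string, object>();
}

// Example usage:
// var tom = new Node(); tom.Properties["name"] = "Tom";
// var car = new Node(); car.Properties["model"] = "Corsa";
// var owns = new Edge { From = tom, To = car, Label = "OWNS" };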

I would like to investigate this topic a bit more deeply, so I plan to play with the Neo4j database (I know it is written in Java, but thanks to its REST API we can access it from many programming languages – and obviously I will use the .NET platform). It is an open source graph database which supports ACID transactions.


Hashtable and Dictionary

As I mentioned in the previous post – the .NET runtime is now open source. This means that we can dive into its code. What is important – most of the runtime code is written in C++ (as you would obviously expect!).

As part of the runtime I have gone through the mscorlib project. What got me interested were the Dictionary and Hashtable implementations. There is quite sophisticated arithmetic in there, as we might have expected.

They share the code

The similarity between the two is obvious – the idea is the same: using hashes to store values in buckets. More importantly, the Dictionary code is essentially a copy of the initial Hashtable implementation. This means that if a bug is fixed in either of them, the fix needs to be made in the other one as well.

Hashtable has thread safety for multiple readers and a single writer built into some of its methods and properties. As you probably know, a hash table (aka hash map) is a data structure that maps keys to values and uses a hash function to calculate the index (bucket) under which a value is stored.

In .NET implementation the hash function uses double hashing which looks like this:

h(key, n) = h1(key) + n*h2(key)

where n is the number of collisions for a bucket. h1 is our implementation of GetHash(key) (or the default one if we did not override it). h2 is more complicated – it is derived from h1 and always yields a value between 1 and hashsize-1.
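To show what such probing looks like in practice, here is a simplified sketch written by me – it only illustrates the idea of double hashing and is not the actual mscorlib code:

// Simplified double-hashing probe illustrating h(key, n) = h1(key) + n*h2(key).
static int FindSlot(object key, object[] buckets)
{
    int hashsize = buckets.Length;
    int h1 = key.GetHashCode() & 0x7FFFFFFF;     // non-negative hash of the key
    int h2 = 1 + (h1 % (hashsize - 1));          // always between 1 and hashsize-1

    for (int n = 0; n < hashsize; n++)           // n = number of collisions so far
    {
        int index = (int)(((long)h1 + (long)n * h2) % hashsize);
        if (buckets[index] == null)
            return index;                        // free bucket found
    }
    return -1;                                   // every bucket was occupied
}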

The array where values are stored is initialized with 3 elements when we do not specify the size (we should not treat this as a constant from version to version – it can differ). Even when we specify a smaller number, there is a check so that “awfully small sizes” are avoided. The procedure for extending the size of the array is also very interesting.

Resize hash array

When the array reaches its maximum number of occupied buckets (the load factor reaches the maximum load factor), it needs to be resized. The number of buckets in the hashtable is “automatically increased by approximately a factor of two”. What does that mean? It means that the size is increased to the smallest prime number which is larger than twice the current size of the hashtable.

How is this organized in the code? First of all, 72 precomputed prime numbers are stored in a readonly array, so there is no need to calculate them. The biggest one is 7 199 369 (that is quite a big number). When the hashtable grows even bigger, the next prime number is calculated on the fly.
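The “find the next prime” part can be sketched like this – a naive version written by me just to illustrate the idea, not the BCL's helper code (which, as mentioned, first consults its precomputed array):

// Naive primality test, good enough to illustrate the resizing idea.
static bool IsPrime(int candidate)
{
    if (candidate < 2) return false;
    for (int divisor = 2; divisor * divisor <= candidate; divisor++)
        if (candidate % divisor == 0) return false;
    return true;
}

// Smallest prime greater than or equal to min.
static int GetNextPrime(int min)
{
    for (int i = min; ; i++)
        if (IsPrime(i)) return i;
}

// Example: a table of 7 buckets doubles to 14, and GetNextPrime(14) == 17,
// so the resized table would get 17 buckets.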

We know that by (roughly) doubling we preserve the asymptotic complexity of hashtable operations such as Add. What is more important – having a prime number of buckets guarantees that the double-hashing probe sequence visits every bucket, so it never loops forever over the same few buckets full of collisions.

Summary

By releasing the .NET Framework as open source, Microsoft allowed us developers to verify the code they wrote. We can even make changes and build our own version of the .NET Framework. I think it is very interesting to check how some of these objects were implemented by the devs at Microsoft.

In this short post I dug a little bit into the Hashtable implementation in mscorlib.

Windows 10–why not Windows 9?

This question has been bothering me since I heard that the next release of Microsoft’s OS will be named Windows 10. Where is 9, everyone asked – me included.

Then my friend sent me this one:
[Image: a screenshot suggesting that existing code checks whether the OS name starts with “windows 9” (i.e. Windows 95/98)]

This got me thinking – maybe this is not only a rumor but actually the truth?

Another hint is here – https://searchcode.com/?q=if%28version%2Cstartswith%28%22windows+9%22%29 
[Image: searchcode.com results for the query above]

There are “About 4,342 results” (and that is only one sample site!), so maybe skipping 9 is not a “mistake” at all – maybe it saved us another millennium-bug situation when people upgrade their operating system.
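The checks those results point to look roughly like the snippet below – a hypothetical reconstruction of mine, not code copied from any particular project:

// Hypothetical legacy check: detect Windows 95/98 by the version string prefix.
static bool IsWindows9x(string osName)
{
    // "Windows 95" and "Windows 98" both start with "windows 9"...
    // ...and so would a hypothetical "Windows 9" – hence the false positive.
    return osName.ToLowerInvariant().StartsWith("windows 9");
}

// IsWindows9x("Windows 98") -> true  (intended)
// IsWindows9x("Windows 9")  -> true  (the suspected reason for jumping straight to 10)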

BTW – I wonder what they discovered the first time they ran such code against the name “Windows 9”, before deciding to change it to 10.

System.Collections.Generic.List–internal structure

Ever wondered how List<T> is implemented? Do you know what the default capacity of List<T> is when you call its constructor? If not, read on – after this post you should be able to answer such questions. This is not a post for ordinary developers 🙂

Capacity vs. Count

These two can sometimes be troublesome and it is easy to mistake one for the other. But most of us get it right – the count is the actual number of elements in the List<T>, whereas the capacity is the total number of elements the internal data structure can hold without resizing.

As you can expect – behind the curtain there is a data structure like this:

private T[] _items;

The List stores its elements in this array – all the methods operate on it.

So in other words:

  • count – the number of elements actually stored in the array,
  • capacity – the length of the array.

Default capacity

Ok, so there is an array behind the List<T> object – but what is the default capacity when you call the default ctor of the list? We do not provide any elements at all, and we do not say how many elements we expect either.

Here I have to mention that this value is not part of any contract and can change in the future. I have checked it on .NET Framework v4.0 (and earlier) – it is still the same value.

private const int _defaultCapacity = 4;

Not a big number, you say. So the next question is – what happens when you would like to add a fifth element?

Going further, let’s check the Add(T item) method, where the first two lines look like this:

if (this._size == this._items.Length)
        this.EnsureCapacity(this._size + 1);

Analysing the EnsureCapacity(int) method leads us to this operation, executed when there is no more space to add a new element to the array:

int num = this._items.Length == 0 ? _defaultCapacity : this._items.Length * 2;

As you can see, this line is responsible not only for extending the capacity (it doubles!) but also for setting the default storage capacity.

This gives us a quite important piece of information – when you call the List() ctor, the internal array used for storing data is actually EMPTY! That is quite a surprise, you have to admit…
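You can observe this behaviour yourself with a short snippet – the values in the comments are what I would expect from the mechanism described above (they are an implementation detail and may differ between framework versions):

var list = new List<int>();
Console.WriteLine(list.Capacity);   // 0 – the internal array starts empty

list.Add(1);
Console.WriteLine(list.Capacity);   // 4 – the default capacity kicks in on the first Add

for (int i = 2; i <= 5; i++)
    list.Add(i);
Console.WriteLine(list.Capacity);   // 8 – doubled when the fifth element did not fit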

Summary

So, in other words, List<T> is an implementation on top of a plain array – a data structure where growing is normally a nightmare. Thankfully, the folks at Microsoft shipped the BCL with a great implementation of the List<T> generic collection, which gives us the speed known from the array data structure and convenient resizing and usage at the same time.

Object initializers

When you would like to create an object, you simply call the ctor of a specific class. But sometimes building plenty of constructors just to add another parameter is a waste of time (yours and that of the developer who will use your code). This is why object initializers were introduced in C# 3.0.

But let’s start with a simple example where we have a class such as the one below.

class Person
{
    public string Name { get; set; }
    public string Surname { get; set; }

    public Person(string name)
    {
        Name = name;
    }
}

With that, we used to initialize it like this:

var p1 = new Person("Tom");
p1.Surname = "Kuj";

But starting from C# 3.0 you can also do it like this:

var p2 = new Person("Tom2") { Surname = "Kuj2" };

Ok – at first glance this seems to be just a language (compiler) feature that lets us write the initialization differently. BUT the generated code also executes differently, which can have a significant influence on your multithreaded application.

IL code examination

Let’s check the IL code generated by the compiler.

.locals init (
  [0] class TestObjectInitialazerConsoleApplication.Person p1,
  [1] class TestObjectInitialazerConsoleApplication.Person p2,
  [2] class TestObjectInitialazerConsoleApplication.Person '<>g__initLocal0'
)
IL_0000: nop
IL_0001: ldstr "Tom"
IL_0006: newobj instance void TestObjInitApp.Person::.ctor(string)
IL_000b: stloc.0
IL_000c: ldloc.0
IL_000d: ldstr "Kuj"
IL_0012: callvirt instance void TestObjInitApp.Person::set_Surname(string)
IL_0017: nop
IL_0018: ldstr "Tom2"
IL_001d: newobj instance void TestObjInitApp.Person::.ctor(string)
IL_0022: stloc.2
IL_0023: ldloc.2
IL_0024: ldstr "Kuj2"
IL_0029: callvirt instance void TestObjInitApp.Person::set_Surname(string)
IL_002e: nop
IL_002f: ldloc.2
IL_0030: stloc.1
IL_0031: ret

A first look at this code does not reveal much, but on closer inspection there is an interesting flow.

First of all, let’s look at the locals initialization – we have 3, not 2, locals in there!!! There are the two named p1 and p2, and there is something called ‘<>g__initLocal0’ – this is just a temporary variable which will be used during the object creation (you probably see where this is going :-)).

For our investigation the most interesting lines start at IL_0022, where the object created at IL_001d is stored into local [2], and all the following work happens on that one. Then, in lines IL_002f and IL_0030, the temporary variable is assigned to p2.
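Translated back into C#, the generated code is roughly equivalent to the snippet below (the temporary name is mine – the compiler uses an unspeakable name like ‘<>g__initLocal0’):

// Roughly what the compiler generates for: var p2 = new Person("Tom2") { Surname = "Kuj2" };
Person temp = new Person("Tom2");   // construct into a temporary first
temp.Surname = "Kuj2";              // run all initializer assignments on the temporary
Person p2 = temp;                   // only now is the fully initialized object assigned to p2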

Summary

So object initialization uses an extra temporary variable, and only when the initialization is finished is it assigned to our variable. But what do we gain from that? Why did the designers decide not to work on the target variable directly?

The answer is multithreading. You can imagine a situation in which some thread reads the variable after the ctor has been called but before all the properties have been set. That is why using an object initializer is a better idea – it makes the assignment more “atomic”, so to speak. The variable will never hold an object in a state we did not expect – it is either not assigned yet, or assigned and fully initialized.

Additional summary (EDIT)

This seems to be a path the designers took with more than just object initializers. If you inspect the IL code of collection initializers, they work exactly the same way (with a temporary variable).
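For example, a collection initializer is conceptually expanded in the same fashion (again, the temporary name below is mine):

// var numbers = new List<int> { 1, 2, 3 }; becomes, conceptually:
List<int> temp = new List<int>();   // compiler-generated temporary
temp.Add(1);
temp.Add(2);
temp.Add(3);
List<int> numbers = temp;           // assigned only after all Add calls have completed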

Virtual method dispatch – how does it work?

Working with object-oriented programs is mostly very straightforward. But sometimes the execution looks like a bit of magic. Especially with virtual methods, where the compiler only knows the static type at compile time BUT the actual method is chosen at runtime. How does it work? I will try to explain that in this post.

First of all, we have to understand what our execution looks like from the intermediate language point of view.

There are two IL instructions for calling methods – call and callvirt. The assumption you make based on their names is correct – the first is used for calling static and non-virtual methods, the second for virtual ones (REMEMBER: static methods can NEVER be virtual – after all, you have to specify the type name to call them!). One thing worth mentioning is that CLR virtual dispatch works at the method level – e.g. a property is treated as two methods (get_X and set_X).

The rule you should remember about virtual methods is rather simple – the IL references the method on the declared (least derived) type that introduces it, but at runtime the most derived override of that method is executed. In other words, the runtime calls the most specific implementation it can find for the actual object.

Learn by example

I know this could be quite confusing – and learning by reading alone is not always the best, or the easiest, way. Let’s consider a simple hierarchy of classes:

class AClass
{
    public virtual void Do()
    {
        Console.WriteLine("AClass here!");
    }
}
class BClass : AClass
{
    public override void Do()
    {
        Console.WriteLine("BClass here!");
    }
}
class CClass : AClass
{
    public void Do()
    {
        Console.WriteLine("CClass here!");
    }
}

The structure is rather simple, so let’s use this implementation in some sample code:

AClass a = new AClass();
AClass b = new BClass();
AClass c = new CClass();

a.Do();
b.Do();
c.Do();

The question arises – would you be able to predict that the output looks like the screenshot below?

[Screenshot of the console output: “AClass here!”, “BClass here!”, “AClass here!”]

As you can see, there are two interesting cases (let’s leave the a object aside, it is just way too simple! :-)) – the b and c objects. Each of them has its own story, so let’s talk about it now.

b and c objects and the override keyword

This is cool – you have a variable of type AClass, but you assign to it an object whose type is BClass. The runtime therefore knows the actual type of the object. Let’s dive into the generated IL code:

IL_0000: nop
IL_0001: newobj instance void TestVirtualityApp.AClass::.ctor()
IL_0006: stloc.0
IL_0007: newobj instance void TestVirtualityApp.BClass::.ctor()
IL_000c: stloc.1
IL_000d: newobj instance void TestVirtualityApp.CClass::.ctor()
IL_0012: stloc.2
IL_0013: ldloc.0
IL_0014: callvirt instance void TestVirtualityApp.AClass::Do()
IL_0019: nop
IL_001a: ldloc.1
IL_001b: callvirt instance void TestVirtualityApp.AClass::Do()
IL_0020: nop
IL_0021: ldloc.2
IL_0022: callvirt instance void TestVirtualityApp.AClass::Do()
IL_0027: nop

As you might expect, there are two interesting lines for our derived objects – IL_001b and IL_0022. Both of them use the callvirt instruction to call the AClass::Do() method. That is what you would expect from the compiler – the static type of the variable is AClass, so that is the method being referenced. But the reason we get such results at runtime is of course the instruction itself.

As you can read in the spec, this instruction performs a late-bound method call on an object. This is where the whole “magic” lies – the late-bound call finds that BClass overrides the original virtual method Do() from AClass (because of the override keyword). This is polymorphism: for a variable of the base class type, the most derived implementation is called.

But why hasn’t the c object’s Do() method been called? Everything becomes clear if you look at line IL_0022 where the call is made. We call the AClass method, and there is no connection between Do() in CClass and Do() in AClass. Even though the method name is the same, we are calling Do() declared on AClass, and the runtime sees no relationship between that method and the one in CClass. There would be such a relationship if the override keyword had been used.

The new keyword – not only for object creation

The designers of the language found this mechanism quite confusing for most developers and introduced the new keyword in the context of hiding methods. It must not be confused with the new operator used for ctor calls (how the compiler distinguishes the two is another topic).

You will get warning CS0114 from the compiler when you write a method with the same signature as a virtual method in the base class and do not specify either override or new (the warning states that the member hides the inherited one). Using the new keyword makes the code more readable – there is no confusion about whether this is a separate, hiding method or an override.
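For example, CClass from the sample above could be rewritten like this to silence the warning and make the intent explicit (my own modified version of the class):

class CClass : AClass
{
    // 'new' states explicitly that this method HIDES AClass.Do() instead of overriding it,
    // so warning CS0114 disappears and the intent is clear to the reader.
    public new void Do()
    {
        Console.WriteLine("CClass here!");
    }
}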

EDIT: One interesting thing to notice here – the different behaviour of Rebuild and Build. During the former there will be a warning, because the compiler does a full build of the project. The latter will not find any change (if none was made), so the file is not recompiled and there is no warning message! Quite confusing…

The easiest way to deal with such cases is simply to play with the code – you can start with the one I posted and extend it with more classes and dependencies… I hope this background gives you a good starting point to explore this very interesting topic.

Intelligent alarm clock – take your smartphone into bed…

[Image: sleep graph made last night]

There is always a problem with waking up at 7am. Who doesn’t struggle with snoozing the alarm – I do. But fortunately I also have an iPod, which has an accelerometer – like almost every modern smartphone. I am using the Sleep Cycle app – above you can see my sleep from last night.

There are plenty of such applications nowadays for every OS on the market (I also have one on my Nokia Lumia 800) – I suggest giving it a try. More importantly, this little fellow gives you plenty of statistics – before you go to bed you can select what you did that particular day (worked out, drank alcohol etc.). Based on that and the sleep quality measured each night, you get the whole picture.

But the most important fact is… IT WORKS! You select the hour you would like to be woken up and the time window in which the alarm can activate, and… magically it detects the right moment for you to get up!