Saturday, October 27, 2012

Understanding computer performance and architectures

Hi all.
This article is co-authored with David Turner. David watched my YouTube video showing a workstation running Linux and multitasking beyond what is expected for that hardware.
David started communicating with me as he has the same hardware base I have and uses Windows, so we were both curious, confused... and I think that part of his brain was telling him "fake... it's got to be another fake YouTube crap movie".
So I channelled him to this blog and the latest post at the time, about the Commodore Amiga and its superiority by design. Dave replied with a lot of confusion, as most of the knowledge in it was too technical. We then decided that I would write this article and he would criticize me whenever I got too technical and difficult to understand, forcing me to write more "human" and less "techie".
So he is co-author, as he is criticising the article into human-readable knowledge. This article will be split into lessons, so over time it will turn into a series of articles.

Note: this article will change over time as David forces me to explain things better. Don't just read it once and give up if you don't understand; comment, and register to be notified about the updates.

Starting things up...(update 1)

Lesson 1 : The hardware architecture and the kernels.

Hardware architecture is always the foundation of things. You may have the best software on earth, but if it runs on bad hardware... instead of running, it will crawl.
Today's computers are a strange thing to buy. There is increasingly less support for non-Intel architectures, which is plain stupid, because variety generates competition instead of monopoly, and competition generates progress and improvement. Still, most computers today are Intel architecture.

Inside the Intel architecture world, there is another heavyweight that seems to work in bursts. That would be AMD.
AMD started as a maker of Intel clones, and then decided to develop the technology further. They were the first to bring 64-bit instructions and hardware to the x86 world, with the renowned Athlon 64. At that time, instead of copying Intel, AMD decided to follow their own path and created something better than Intel. Years later, they did it again with the multi-core CPU. As expected, Intel followed and got back on the horse, so now we have to watch AMD build more low-budget clones of Intel until they decide to get back to the drawing board and innovate.

So what is the main difference between the 2 contenders in the Intel architecture world?
Back in the first Athlon days, Intel focused development of the CPU chip on pure speed by means of frequency increases. The result is that (physics 101) the more current you pass through a circuit, and the less pure its copper/gold/silicon, the more atoms of resisting material there are to oppose the current and generate heat. So Intel developed ways to use less and less material (creating less resistance, requiring less power and generating less heat). That's why Intel CPUs are built on smaller manufacturing processes than most competitors: 65nm, 45nm, 32nm and so on. For that reason they can run at higher speeds, and that kept Intel's development focused not on optimizing the way the chip works, but rather on the way they build the chips.
AMD, on the other hand, doesn't have the same size as Intel and doesn't sell as many CPUs, so optimizing chip fabrication would have a cost that is difficult to recover. The only way was to improve chip design. That's why an Athlon chip at 2GHz could be faster than an Intel at 2.6GHz: it was better in design and in execution of instructions.
Since the market really doesn't know what it is buying and just looks at specs, AMD was forced to change their product branding to the xx00+ scheme... 3200+ meaning that the 2.5GHz chip inside would compare to (at least) a Pentium at 3.2GHz in performance. That same branding evolved to the dual core. Since Intel publicized their Hyper-Threading CPU (copying the AMD efficiency leap design, but adding a new face to it called the virtual CPU), AMD decided to evolve into the dual-core CPU (Intel patented Hyper-Threading and, though using the AMD design as inspiration, they managed to lock AMD out of marketing their own designs that way... somehow I feel that Intel really has a lot in common with today's Apple!)... and continued the naming, calling it the 5000+ for the CPU with 2 cores, each a 2500+ at 2GHz.
So at this point in time AMD and Intel could compete on CPU speed. Would the AMD Athlon 64 5000+ dual core @ 2GHz per core be as fast as an Intel Core 2 Duo dual core @ 2.5GHz!? Not quite. Speed is not always about the GHz, as AMD already proved with the Athlon's superior design.
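To make the "speed is not just GHz" point concrete, here is a minimal Python sketch of the usual back-of-the-envelope model: throughput is roughly clock rate multiplied by the average number of instructions finished per cycle (IPC). The function name and the IPC figures are hypothetical, picked only to illustrate the idea; they are not measurements of any real Athlon or Pentium.

```python
def instructions_per_second(clock_ghz: float, ipc: float) -> float:
    """Rough throughput model: clock rate times average instructions per cycle."""
    return clock_ghz * 1e9 * ipc


# Hypothetical IPC values, for illustration only (not benchmark results).
chip_a = instructions_per_second(clock_ghz=2.0, ipc=1.5)  # lower clock, more work per cycle
chip_b = instructions_per_second(clock_ghz=2.6, ipc=1.0)  # higher clock, less work per cycle

# Under these assumed numbers the 2.0GHz chip gets more done per second.
print(chip_a > chip_b)
```

With these assumed numbers the lower-clocked chip wins, which is the whole idea behind the xx00+ branding: the rating tries to describe the work done, not the clock.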
At some point in time your CPU needs to input/output to memory, and this is the REAL BIG difference in architecture between AMD and Intel.
Intel addresses memory through the chip-set (with the exception of the latest Core iX families). Most chip-sets are designed for the consumer market, so they were designed for a single-CPU architecture. AMD, again needing to maximize production and adaptability, designed their Athlon 64 with a built-in memory controller. So the Athlon 64 has a direct path (full bandwidth, high priority and very, very fast) to memory, while Intel has to ask the chip-set for permission and channel memory linkage through it. This design removes the chip-set memory bandwidth bottleneck and allows for better scalability.
The result? Look at most AMD Athlon, Opteron or Phenom multi-CPU boards and you will find one memory bank per CPU, while Intel (again) tried to boost the speed of the chip-set and hit a brick wall immediately. That's why Intel motherboards for servers rarely go beyond 2 CPUs, while AMD has motherboards with 8 CPUs and more. Intel and its race for GHz rendered it less efficient and a lot less scalable.
If you ever stopped to wonder how Intel managed such a big performance increase out of the Core technology (that big leap that the Core i3, i5 and i7 have when compared to the design they are based on, the Core 2 Duo and Core 2 Quad), the answer is simple... they already had the GHz performance, and when they added a DDR memory controller to the CPU, they jumped into AMD performance territory! Simple and effective... with a much higher CPU clock. AMD slept for too long, and now Intel rules the entire market with the exception of the supercomputing world.

The Video and the AMD running Linux.
This small difference in architectures plays an important role in the video I've shown, with Linux being able to multitask like hell. The ability to channel data to and from memory directly means the CPU can process a lot of data in parallel, without constantly asking the chip-set (and waiting for the opportunity) to move data.

So the first part of this first "lesson" is done.
Yes, today's Intel Core i5 and i7 are far more efficient than the AMD equivalents, but still not as scalable, meaning that in big computing AMD is the only way to go in the x86-compatible world. AMD did try that next leap with the APU recently, but devoted too much time to the development of the hardware and forgot about the software needed to run it properly. I'll leave this to the second part of this "lesson". They also chose ATI as their partner for GPUs... Not quite the big banger. NVIDIA would be the one to choose. Raw processing power is NVIDIA's ground, while ATI is more focused on the purity of colour and contrast. So when AMD tried to fuse the CPU and the GPU (creating the APU), they could have created a fully integrated HUGE processing engine... but instead they just managed to create a processing chip-set. Lack of vision? Lack of money? Bad choice in the partnership (as NVIDIA is the master of GPU supercomputing)? I don't know yet... but I screamed "way to go AMD" when I heard about the concept... only to shout "stupid stupid stuuupid people" some months later when it came out.

The software architecture to run on the hardware architecture.
Operating systems are composed of 2 major parts: the presentation layer (normally called the GUI, or Graphical User Interface), which communicates between the user (and the programs) and the kernel layer, and obviously the kernel layer, which interfaces between the presentation layer and the hardware. Pictures and icons apart, the most important part of a computer next to the hardware architecture is the kernel architecture.
There are 4 types of kernels:
   - Microkernel - This is coded in a very direct and simple way. It is built with performance in mind. Microkernels are normally included in routers, printers, or simple peripherals that have a specific usage and don't need to "try to adapt to the user". They are not complex and so eat very few CPU cycles to work, meaning speed and efficiency. They are, however, very inflexible.
   - Monolithic kernels - Monolithic kernels are BIG and heavy. They try to include EVERYTHING. So it's a kernel that is very easy to program for, as most features are built in and it supports just about any usage you can think of. The downside is that it eats up lots of CPU cycles verifying and comparing things, because it tries to consider just about every possible usage. Monolithic kernels are very flexible, at the cost of a lot of memory usage and heavy execution.
   - Hybrid kernels - The hybrid kernel type is a mix. You have a core kernel module that is bigger than the rest and, while loading, that module controls which other modules are loaded to support each function. These kernels are not as heavy as the monolithic ones, as they only load what they need to work with, but they have to contain a lot of memory protection code to keep one module from using another module's memory space. So they are not as heavy as the monolithic, but not necessarily faster.
   - Atypical kernels - Atypical kernels are all those kernels out there that don't fit into these categories, mainly because they are too crazy, too good or just too exquisite to be sold in numbers big enough to create their own class. Examples of these are the brilliant Amiga kernels and all the wannabes sprung from them (BeOS, AROS, etc.), mainframe operating system kernels and so on. #REFERENCE nr1 (check the end of the article)#

For the record, I personally consider Linux to be an atypical kernel. A lot of people think Linux is monolithic, and they would be partly right. Some others would consider it to be hybrid, and they would be partly right too.
The Linux kernel is a single monolithic code block, like any monolithic kernel; however, that kernel is compiled to match the hardware. When you install your copy of Linux, the system probes the hardware you have and then chooses the best code base to use for it. For instance, why would you need the kernel base to have code made for the 386 CPU or the Pentium MMX if you have a Core 2 Duo or a 64-bit AMD Opteron? The Linux kernel is matched to your CPU and the code is optimized for it. When you install software that needs direct hardware access (drivers, virtualization tools, etc.) you need the source code (or at least the headers) for your kernel installed and a C compiler, for one simple reason -> the kernel modules installed to support those calls to hardware are compiled for you against your own kernel. So you have a hybrid-made-monolithic kernel design. Not as brilliant as the Amiga OS kernel, but considering that the Amiga OS kernel needs the brilliant Amiga hardware architecture, the Linux kernel is the best thing around for the Intel-compatible architecture.
Do I mean that Linux is better for AMD than for Intel? Irrelevant! AMD is better than Intel if you need heavy memory usage. Intel is better than AMD if you need raw CPU power for rendering. Compared to today's Windows, Linux is the better choice, regardless of architecture. However, AMD users have more to "unleash" when converting to Linux, as Windows is Intel-biased on purpose, and less memory efficient.

Resources are limited!
Why is Linux so much more efficient than Windows on the same hardware?
The Windows kernel is either monolithic (Win 9x) or hybrid (the NT family: NT, 2000, XP, 2003, Vista/7, 2008, 8). However, the base of a hybrid kernel is always the CPU instructions and commands, and that is always a big chunk.
Since Microsoft made a crusade against open source, they have to keep up their "propaganda" and ship a pre-compiled (and closed) CPU kernel module (and this is 50% of why I don't like Windows... they are being stubborn instead of efficient). So, while much better than Windows 2000, XP and 7 still have to load first a huge chunk of code that has to handle everything from the 386 to the i7 cores, future generations and beyond. That means they always operate in a compromise mode and will always have unused code in memory. Microsoft also has a very close relationship with Intel and tends to favour it over AMD, making any Windows run better on Intel than on AMD... this is very clear when you dig around the AMD FTP site and find several drivers to increase Windows speed and stability on AMD CPUs... and find nothing like that for Intel. There is a reason people call the PC a wintel machine.
So, to start: Linux has a smaller memory footprint than Windows, it makes fuller use of the CPU instruction set than Windows' "compatibility mode", and it takes advantage of AMD's excellent memory-to-CPU bus.
Apart from that, there is also the way Windows manages memory. Windows (up until the Vista/7 kernel) was not very good at managing memory. When you use software, the system is instancing objects of code and data in memory. Windows addresses memory in chunks made of 4KB pages. So if you have 8KB of code, it will look for a chunk with 2 free pages of 4KB and then use it... If, however, your code is made of 2 objects, one of 2KB and another of 10KB, Windows will allocate a chunk of one page for the first one, and then a chunk of 3 pages for the second. You'll consume 4+12KB = 16KB for 12KB of code. This causes the so-called memory fragmentation. If your computer only had 16KB of memory, in this last case you would not be able to allocate memory for the next 4KB of code. Although you have 4KB of free memory, it is fragmented into 2 pieces and, since it is not contiguous, you would not have space to allocate the next 4KB.
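The arithmetic of that example can be checked with a short Python sketch. It assumes the simplified model used in this article (every object is rounded up to whole 4KB pages); the function names are mine, just for illustration. Strictly speaking, the space lost inside each allocated page is usually called internal fragmentation.

```python
PAGE_KB = 4  # page size used in the example


def pages_needed(size_kb: int, page_kb: int = PAGE_KB) -> int:
    """Whole pages needed to hold size_kb (ceiling division)."""
    return -(-size_kb // page_kb)


def reserved_kb(sizes_kb, page_kb: int = PAGE_KB) -> int:
    """Total memory reserved when each object is rounded up to whole pages."""
    return sum(pages_needed(size, page_kb) * page_kb for size in sizes_kb)


objects = [2, 10]                 # the 2KB and 10KB objects from the example
used = sum(objects)               # 12 KB actually needed
reserved = reserved_kb(objects)   # 4 + 12 = 16 KB actually reserved
wasted = reserved - used          # 4 KB lost inside partially used pages

print(used, reserved, wasted)     # prints: 12 16 4
```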
The memory fragmentation syndrome grows exponentially if you use a framework to build your code on. Enter .NET. .NET is very good for code prototyping, but it is easy to code for because the guys building it created objects with a lot of functionality built in (to support any possible usage)... much like the classic monolithic kernel. The result is that, if you examine memory, you'll find that a simple window with a combo box and an OK button means hundreds if not thousands of objects instanced in memory... for nothing, as you'll only be using 10% of the coded objects' functionality.
Object-oriented programming creates code objects in memory. A single "class" is instanced several times to support different usages of the same object type, but as different objects. After usage, the memory is freed and returned to the operating system for re-use.
Now picture that your code creates PDF pages. The PDF stamper works with pages that are stamped individually and then glued together in sequence. So your code would be instancing, then freeing to re-instance a bigger object, then freeing again to re-instance an even bigger one... and so on.
For instance (__ marks a free 1K unit):
    Memory in pages:
|-1K 1K 1K 1K-|-1K 1K 1K 1K-|-1K 1K 1K 1K-|-1K 1K 1K 1K-|-1K 1K 1K 1K-|-1K 1K 1K 1K-|-1K 1K 1K 1K-|-1K 1K 1K 1K-|-1K 1K 1K 1K-|
    Your code:
    code instance 1 6K

|-C1 C1 C1 C1-|-C1 C1 __ __-|-
    Then you add another object to support your data (increasing as you process it) called C2
    code instance 2 10K

|-C1 C1 C1 C1-|-C1 C1 __ __-|-C2 C2 C2 C2-|-C2 C2 C2 C2-|-C2 C2 __ __-|-
    Then you free your first instance as you no longer need it.

|-__ __ __ __-|-__ __ __ __-|-C2 C2 C2 C2-|-C2 C2 C2 C2-|-C2 C2 __ __-|-
    And then you need to create a new code instance to support even more data, called C3. This time you need 18KB, so:
    code instance 3 18K

|-__ __ __ __-|-__ __ __ __-|-C2 C2 C2 C2-|-C2 C2 C2 C2-|-C2 C2 __ __-|-C3 C3 C3 C3-|-C3 C3 C3 C3-|-C3 C3 C3 C3-|-C3 C3 C3 C3-|...... and you've run out of memory!!
I know that today's computers have gigs of RAM, but today's code also eats up megs of RAM, and we work with video and sound, and we use .NET to do it... you get the picture.
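The C1/C2/C3 walk-through can be replayed with a toy allocator in Python. This is only a sketch of the simplified model drawn in this article: 9 pages of 4KB, each object needs contiguous whole pages, and the allocator takes the first gap that fits. The class and method names are mine, and a real operating system with virtual memory is considerably more sophisticated than this.

```python
PAGE_KB = 4
TOTAL_PAGES = 9  # the 36 KB of memory drawn in the diagram


class PagedMemory:
    """Toy first-fit allocator over whole pages (illustration only)."""

    def __init__(self, total_pages: int):
        self.pages = [None] * total_pages  # None = free, otherwise the owner's name

    def allocate(self, name: str, size_kb: int) -> bool:
        need = -(-size_kb // PAGE_KB)  # round up to whole pages
        run = 0                        # length of the current run of free pages
        for i, owner in enumerate(self.pages):
            run = run + 1 if owner is None else 0
            if run == need:
                for j in range(i - need + 1, i + 1):
                    self.pages[j] = name
                return True
        return False  # no contiguous gap is big enough

    def free(self, name: str) -> None:
        self.pages = [None if owner == name else owner for owner in self.pages]

    def free_pages(self) -> int:
        return self.pages.count(None)


mem = PagedMemory(TOTAL_PAGES)
ok1 = mem.allocate("C1", 6)    # takes 2 pages
ok2 = mem.allocate("C2", 10)   # takes 3 pages
mem.free("C1")                 # leaves a 2-page hole at the start
ok3 = mem.allocate("C3", 18)   # needs 5 contiguous pages

print(ok1, ok2, ok3, mem.free_pages())  # prints: True True False 6
```

Six pages (24KB) are free at the end, yet the 18KB request fails, because the free space is split into a 2-page hole and a 4-page hole.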

Linux and Unix have a dynamic way to address memory and normally re-arrange memory (memory optimization and de-fragmentation) to avoid this syndrome.
In the Unix/Linux world you have brk, mmap and malloc:
   - brk - can adjust the end of the chunk to match the requested memory, so a 6KB code would eat 6KB instead of 8KB
   - malloc - can grow memory both ways (start and end) and re-allocate more memory as your code grows (something wonderful for object-oriented programming, because the code starts with little data and then grows as the program and the user start working with it). In Windows this is handled either with a huge chunk pre-allocation (even if you don't use it), or by jumping your code instance from place to place in memory (increasing the probability of fragmentation). The only problem with malloc is that it is very good at allocating memory and not so good at releasing it. So mmap entered the equation.
   - mmap - works like malloc, but it is useful for allocating large memory chunks and it is also very good at releasing them back. When you encode video or work on large objects in memory, mmap is the "wizard" behind all that Linux performance over Windows. The more data you move in and out of memory, the more perceptible this is.
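To give a feel for the large-chunk behaviour described in the list above, here is a minimal Python sketch using the standard `mmap` module. Passing -1 as the file descriptor asks for an anonymous mapping (memory not backed by a file); on Unix/Linux the module wraps the mmap system call, while on Windows it uses a different mechanism underneath. The 16 MiB size is arbitrary, chosen just for the example.

```python
import mmap

SIZE = 16 * 1024 * 1024  # 16 MiB, an arbitrary size for illustration

# fileno -1 requests an anonymous mapping (not backed by a file).
buf = mmap.mmap(-1, SIZE)

buf[:5] = b"hello"       # the mapping behaves like a writable byte buffer
first = bytes(buf[:5])
length = len(buf)

buf.close()              # unmaps the region, handing it back to the operating system
print(first, length)
```

The point to notice is the last step: closing the mapping releases the whole region in one go, which is the "very good at releasing it back" part.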

There is also something important here. If you think about it, who does the memory moving in an Intel architecture? The CPU... so even under Windows, with stuff constantly being moved around in memory, the AMD performs better because of the in-CPU memory controller, while the Intel platform needs to channel everything through the chip-set.
CPU architectures (both Intel and AMD) have, under normal conditions, a "stack" of commands, and not all of them use the entire CPU processing power. So Intel uses the "virtual processor" in Hyper-Threading, letting 2 different code threads be calculated at once, while AMD works its architecture with simultaneous execution (everything from cache to CPU registers is parallel) and a doubled bus speed (a 100MHz bus works as a 200MHz bus inside the CPU, allowing the system to divide or share CPU resources, while communication from outside happens at half the processing speed). So if you enter 2x 32-bit instructions (on a 64-bit Athlon, for instance), in theory, if those instructions really are 32-bit only and take the same number of CPU cycles to work out, the CPU returns both results at once. Without this technology, the CPU would accept one instruction at a time and reply accordingly.
Does the Intel CPU return better MIPS on CPU tests? Yup. Most CPU testing software feeds in big calculation instructions that eat up the whole CPU execution stack, so no parallelization is possible (that part of AMD's execution optimization... and of Intel's Hyper-Threading), and since the Intel CPU runs at a higher clock speed (all those GHz), the results favour it. Still, in real life, unless you are rendering 3D, AMD holds the ground in true usable speed. Especially under a good operating system that takes advantage of this and doesn't cripple RAM as it uses it.
It's simple if you think about it.
Both the AMD Athlon 64 running at 2GHz and the Intel Core 2 at 2.5GHz have a 64-bit architecture. If they both get 2x 32-bit instructions, the Core 2 will use the real CPU for the first 32-bit instruction and the Hyper-Threading second virtual CPU for the second instruction... and will do this at 2.5GHz.
At the same time, the AMD would receive the 2 instructions at once, side by side, on its one and only CPU, but would then process each instruction internally at double the speed. So the 0.5GHz that the AMD lacks is compensated by the fact that, internally, it writes and reads instructions, results and data twice as fast. If, however, you send a full 64-bit calculation, neither CPU will be able to parallelize the execution stack... so the advantage of the double data rate inside the Athlon is gone and the only thing in play from that point on is GHz... and the Intel has more!

So, to conclude this first "lesson":

Linux on a good hardware architecture will multitask way better than Windows because:
   - AMD has a memory controller built into the CPU and, as a result, a direct memory connection.
   - It can take direct advantage of AMD memory bandwidth and CPU functions, because the kernel is CPU and hardware matched.
   - The kernel is lighter because it is hardware matched.
   - The kernel doesn't need a lot of memory protection because it's "monolithic" in part.
   - Most code for Linux is written in C and C++, so it has no .NET weight behind it (and neither does the operating system).
   - Linux handles memory. Windows juggles things until it "starts to drop"... or crash :S.

The P.S. part :)

Comment: You like Amiga a lot. Are you implying one can still buy one?
Reply: Yup and no. Yes, you can still use an Amiga today. Yes, you still have hardware and software updates today that keep the Amiga alive.
No, not the Commodore USA one, as it's just another wintel computer named Amiga... a grotesque thing for a purist like me.
Keep in mind that the Amiga was so advanced that, if you are looking to buy a computer that is 10 years into the future, then you have no Amiga to buy. The NATAMI project is the best so far, but from what I've read, it's just an up-to-date version of the old Amiga... good and faithful, but not the BANG the Amiga was, and kept being until Commodore went under. The new Amiga can't just be an update, because the old one with today's hardware mods can already do that! The new Amiga has to show today what wintels will do 10 years from now.
Maybe I can gather enough money to build it myself... I've got the basic schematics and hardware layout, and I call this Project TARA (The Amiga Reborn Accurately).
