Wednesday, November 11, 2009

I haven't forgotten this...

... it's just that qemu is really tightly coded, even though the code it makes is a bit less than optimal ;)

See, in non-kqemu/kvm mode the code generator pipeline is _ -> TCG -> _ (guest code in, host code out), with several choices of _ on each side, so qemu's regular mode can't take x86-on-x86 shortcuts.

So, time to stop distracting myself with fixing qemu's output and look @ what I wanted to do from the start... setting it up to use it for precise kernel/code profiling. ;)

(valgrind can actually do that, but only for user mode linux, and i dunno if it needs patches)

Sunday, November 8, 2009

Optimizing QEMU a lil'...

Looking at the assembly output, there are a lot of redundant loads/stores. How do you fix this in the code generator without breaking things left and right? Make a non-mandatory cache and have non-cache-aware instructions write the cache out.

Stay tuned for a patch that'll have the core caching code... and a couple of cached instructions.
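
To make the cache idea concrete, here is a toy sketch in C. It is not qemu's TCG code and not the promised patch: the struct, the function names, and the four-register host (r0..r3) are all made up for illustration. The emitter remembers which env slot each host register mirrors, cache-aware loads and stores consult that record, and anything that isn't cache-aware goes through emit_uncached(), which writes the cache out first.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_REGS 4          /* toy host with four registers, r0..r3 (assumption) */
#define NO_SLOT  (-1)

/* Hypothetical emitter state, not a qemu structure. */
struct reg_cache {
    int slot[NUM_REGS];     /* env offset mirrored by each register, or NO_SLOT */
    int dirty[NUM_REGS];    /* nonzero: the register is newer than memory */
    int emitted;            /* number of instructions actually emitted */
};

static void cache_init(struct reg_cache *c)
{
    for (int r = 0; r < NUM_REGS; r++) {
        c->slot[r] = NO_SLOT;
        c->dirty[r] = 0;
    }
    c->emitted = 0;
}

/* Emit the deferred store for register r, if it owes one. */
static void cache_writeback(struct reg_cache *c, int r)
{
    assert(r >= 0 && r < NUM_REGS);
    if (c->slot[r] != NO_SLOT && c->dirty[r]) {
        printf("mov r%d,0x%x(%%ebp)\n", r, c->slot[r]);
        c->emitted++;
        c->dirty[r] = 0;
    }
}

/* Call before emitting anything that overwrites register r. */
static void cache_clobber(struct reg_cache *c, int r)
{
    cache_writeback(c, r);
    c->slot[r] = NO_SLOT;
}

/* Write everything out and forget it. */
static void cache_flush(struct reg_cache *c)
{
    for (int r = 0; r < NUM_REGS; r++)
        cache_clobber(c, r);
}

/* Cache-aware load of an env slot into register r. */
static void cached_load(struct reg_cache *c, int slot, int r)
{
    if (c->slot[r] == slot)
        return;                         /* redundant load elided */
    cache_clobber(c, r);
    for (int o = 0; o < NUM_REGS; o++)  /* make memory current before reading it */
        if (c->slot[o] == slot)
            cache_writeback(c, o);
    printf("mov 0x%x(%%ebp),r%d\n", slot, r);
    c->emitted++;
    c->slot[r] = slot;
}

/* Cache-aware store of register r to an env slot, deferred until a flush. */
static void cached_store(struct reg_cache *c, int r, int slot)
{
    if (c->slot[r] != slot)
        cache_writeback(c, r);          /* r may still owe an older store */
    for (int o = 0; o < NUM_REGS; o++)  /* other copies of the slot are now stale */
        if (o != r && c->slot[o] == slot) {
            c->slot[o] = NO_SLOT;
            c->dirty[o] = 0;
        }
    c->slot[r] = slot;
    c->dirty[r] = 1;
}

/* Every emitter that is not cache-aware goes through here. */
static void emit_uncached(struct reg_cache *c, const char *text)
{
    cache_flush(c);
    printf("%s\n", text);
    c->emitted++;
}
```

The point of the design is that the cache stays optional: an instruction emitter that has never heard of it remains correct as long as it goes through the flushing path, so emitters can be converted one at a time.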

edit: dang this is complex. ;) I think having a two stage emit phase might help... then you could do things like:

0x00f65b54: xor %ebx,%ebx
0x00f65b56: add %ecx,%ebx
to: mov %ecx, %ebx

0x00f65c98: mov %eax,%edx
0x00f65c9a: mov %edx,%ecx
0x00f65c9c: mov $0x44,%ebx
0x00f65ca1: mov %eax,0x2c(%ebp)
0x00f65ca4: mov %edx,0x0(%ebp)
to:
mov %eax, %ecx
mov $0x44, %ebx
mov %eax, 0x2c(%ebp)
mov %eax, 0x0(%ebp)

and finally:
0x00f66372: mov 0x0(%ebp),%eax
0x00f66375: mov 0x0(%ebp),%edx
0x00f66378: and %edx,%eax
to:
0x00f66372: mov 0x0(%ebp),%eax
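
The first rewrite above (xor + add becoming a single mov) is the kind of thing a second emit stage could do as a peephole pass. Here's a rough sketch in C over a made-up two-operand IR; the struct, the enum and the function name are assumptions for illustration, not anything from qemu. Note that it is only safe when nothing later reads the flags the add would have set, which this toy doesn't check.

```c
#include <assert.h>
#include <stddef.h>

enum op { OP_MOV, OP_XOR, OP_ADD, OP_AND };

/* Hypothetical IR in AT&T operand order: "op src,dst". */
struct insn {
    enum op op;
    int src;    /* register number, x86 encoding: eax=0 ecx=1 edx=2 ebx=3 */
    int dst;
};

/*
 * Rewrite "xor r,r ; add s,r" as "mov s,r", in place.
 * Returns the new instruction count.
 * Assumes the flags produced by the add are dead (not checked here).
 */
static size_t peephole_xor_add(struct insn *code, size_t n)
{
    size_t out = 0;

    assert(code != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int is_zeroing = code[i].op == OP_XOR && code[i].src == code[i].dst;

        if (is_zeroing && i + 1 < n &&
            code[i + 1].op == OP_ADD &&
            code[i + 1].dst == code[i].dst &&
            code[i + 1].src != code[i].dst) {   /* "add r,r" must not be folded */
            struct insn mov = { OP_MOV, code[i + 1].src, code[i + 1].dst };
            code[out++] = mov;
            i++;                                /* skip the add we just folded */
        } else {
            code[out++] = code[i];
        }
    }
    return out;
}
```

Running it over { xor %ebx,%ebx ; add %ecx,%ebx } leaves one instruction, mov %ecx,%ebx. The other two rewrites need more than pattern matching on neighbours (you have to know %edx is dead afterwards), which is part of why this gets complex fast.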


More stuff to look at...

(largely a note to self) - points out that depmod+modprobe does a lot more than it needs to. It shouldn't be too hard to make a binary map to easily jump from a PCI ID to a module.
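
A minimal sketch of what such a binary map could look like in C: a table sorted by (vendor << 16) | device, searched with bsearch(). The struct and function names are made up, the three entries are just illustrative, and real module aliases also match on subsystem IDs, class and wildcards, which this ignores. A real version would presumably be generated at depmod time and mmap()ed rather than compiled in.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical on-disk record: fixed size, sorted by id. */
struct pci_map_entry {
    uint32_t id;            /* (vendor << 16) | device */
    const char *module;
};

/* Illustrative entries only; must stay sorted by id for bsearch(). */
static const struct pci_map_entry pci_map[] = {
    { 0x10ec8139, "8139too" },
    { 0x10ec8168, "r8169"   },
    { 0x8086100e, "e1000"   },
};

static int cmp_entry(const void *key, const void *elem)
{
    uint32_t k = *(const uint32_t *)key;
    uint32_t e = ((const struct pci_map_entry *)elem)->id;

    return (k > e) - (k < e);
}

/* Returns the module name for a PCI ID, or NULL if the map has no entry. */
static const char *pci_lookup(uint16_t vendor, uint16_t device)
{
    uint32_t key = ((uint32_t)vendor << 16) | device;
    const struct pci_map_entry *hit =
        bsearch(&key, pci_map, sizeof pci_map / sizeof pci_map[0],
                sizeof pci_map[0], cmp_entry);

    return hit ? hit->module : NULL;
}
```

With this, pci_lookup(0x8086, 0x100e) costs a handful of comparisons instead of parsing a text file on every lookup.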

I need to look at oprofile and other kernel profiling stuff.

Also, research using qemu(+kvm) - at the very least it's a nice sandbox. In addition you can do a lot of traces pretty easily in an emulator...

Saturday, November 7, 2009

General tips pt. 1 (of many?)

- Use 64-bit distributions whenever you can. They're tuned to produce code for modern CPUs (all floating point goes through SSE2, for instance) and there are various 64-bit-only assembly optimizations. With the new glibc release, Core i* users will gain SSE4.2-accelerated strcpy/strcmp/etc. routines that can be up to 10-12x faster, according to H.J. Lu... but only on x86-64.

- If you have a recent Radeon card and don't need 3D (yet), look at using the free radeon(hd) drivers which have faster 2D and tear-free video playback.

- Don't run more than your RAM can handle. If you have <=512MB of RAM, running OpenOffice and Firefox together is generally a bad idea.

- Keep an eye on Firefox when running Flash - many Flash pages keep running in the background and suck up all your CPU. In general, Firefox tabs can also nom memory. It's probably best to restart Firefox occasionally, especially if you have <=1GB of RAM.

- Consider using suspend (or maybe hibernate) when you're done working instead of shutting down.

Friday, November 6, 2009

The first almost trivially tiny lil' thing...

(or: Chad spends way too much time researching trivial stuff.)

The locales code in [e]glibc isn't caching directory names correctly, so you have something like this:

open("/usr/lib/locale/en_US.UTF-8/LC_TELEPHONE", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/lib/locale/en_US.utf8/LC_TELEPHONE", O_RDONLY) = 3

about 12 times for each setlocale() call - which happens to be done by most/all processes.

I wrote this mini-program to see how many microseconds a bad open() takes:


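#include <fcntl.h>   /* open(), O_RDONLY */
#include <unistd.h>  /* close() */

int main(void)
{
    int i;
    /* the path doesn't exist, so every open() fails with ENOENT */
    for (i = 0; i < 1000000; i++) close(open("/usr/lib/locale/en_US.UTF-8/LC_IDENTIFICATION", O_RDONLY));
    return 0;
}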

On a Core Duo locked to 1.0 GHz, it takes about 3.75 usec per call. This is minuscule, but multiply it by 12 (about 45 usec per setlocale() call), and then by the # of times setlocale() is called, and eventually you get to about a second (that's roughly 22,000 calls).
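
If you'd rather have the program report the per-call number itself instead of dividing the output of time(1) by hand, something like this sketch should work. The function name is made up, and unlike the mini-program it doesn't call close() on the failed descriptor, so it measures the failing open() alone and the numbers won't match exactly.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <time.h>
#include <unistd.h>

/*
 * Average cost, in microseconds, of one failing open() on path.
 * Returns -1.0 if any open() succeeded, since then we aren't
 * measuring the failure case at all.
 */
static double usec_per_failed_open(const char *path, long iterations)
{
    struct timespec start, end;
    long failures = 0;

    assert(iterations > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            failures++;
        else
            close(fd);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (failures != iterations)
        return -1.0;

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    return ns / 1e3 / iterations;
}
```

Call it with the UTF-8 spelling of your locale path and a million iterations, then multiply the result by 12 to estimate the overhead of one setlocale() call.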

So, instead of actually fixing the locale code (and dealing with getting it upstream), a very simple fix for Ubuntu 9.10 et al. is:

sudo ln -sf /usr/lib/locale/en_US.utf8 /usr/lib/locale/en_US.UTF-8
^ your actual locale goes here.

This should cache nicely.

Hello world!

The goal of this blog and my "November Project" (instead of, say, a nano) is to find ways to streamline Linux and write about them here.

Some things I'm planning to do this month:

- Development of an OpenWRT image for fully isolated benchmarking.

- Documenting use of oprofile, ltrace, strace, and CPU performance counters to find areas of improvement.

- System call usage optimization (repeated attempts to open() files in non-existent directories, etc)

- Memory allocation optimizations (reduce #'s of mallocs and frees needed)

- Generally finding new bottlenecks and removing them.

Hopefully this'll be good for everybody. :)