# Redi/S - Performance

Questions on any of this? Either ping [@helje5](https://twitter.com/helje5) on Twitter, or join the `#swift-nio` channel on the [swift-server Slack](https://t.co/W1vfsb9JAB).

## Todos

There are still a few things which could easily be optimized a lot, regardless of bigger architectural changes:

- an integer-backed store for strings (INCR/DECR)
- proper in-place modifications for sets

## Copy on Write

The current implementation is based around Swift's value types. The idea is/was to make heavy use of the Copy-on-Write feature and thereby unblock the database thread as quickly as possible. For example, when we deliver a result, we only grab the result in the locked DB context; all the rendering and socket delivery happens on a NIO eventloop thread.

The same goes for persistence. We can grab the current value of the database dictionary and persist that, w/o any extra locking (though C Redis is much more efficient w/ the fork approach ...).

> There is another flaw here. The "copy" will happen in the database scope,
> which obviously is sub-optimal. (Redis CoW by forking the process is much
> more performant ...)

## Data Structures

Redi/S uses just regular Swift data structures (and is therefore also a test of the scalability of those).

Most importantly, it currently uses `Array`s for lists! 🤦‍♀️ That means: RPUSH is reasonably fast, but occasionally requires a realloc/copy. LPUSH is very slow, because every insert at the front has to shift the whole array.

Plan: to make LPUSH faster we could use the `NIO.CircularBuffer`, [if we get some more methods](https://github.com/apple/swift-nio/issues/279) on it. The real fix is to use proper lists etc. But if we approach this, we also need to reconsider CoW.

## Concurrency

How many eventloop threads are the sweet spot?

- Is it 1, avoiding all synchronization overhead?
- Is it `System.coreCount`, putting all CPUs to work?
- Is it `System.coreCount / 2`, excluding hyper-threads?

We benchmarked the server on a 13" MBP (2 cores, 4 hyperthreads) and on a MacPro 2013 (4 cores, 8 hyperthreads). Surprisingly, *2* seems to be the sweet spot. Not quite sure yet why. Is that when the worker thread is saturated? It doesn't seem so.

Running the MT-aware version on a single eventloop thread halves the performance. Notably, running a single-thread optimized version still reached ~75% of the dual-thread variant (but at a lower CPU load).

## Tested Optimizations

Trying to improve performance, we've tested a few setups we thought might do the trick.

### Command Name as Data

This version uses a Swift `String` to represent command names. That appears to be wasteful (because a Swift string is an expensive Unicode string), but actually seems to have no measurable performance impact. We tested a branch in which the command name is wrapped in a plain `Data` and used that as the key.

Potential follow-up: command lookup seems to play no significant role, but one thing we might try is to wrap the `ByteBuffer` in a small struct w/ an efficient and targeted, case-insensitive hash.

### Avoid NIO Pipeline for non-BB

The "idea" in NIO is that you form a pipeline of handlers. At the base of that pipeline is the socket, which pushes `ByteBuffer`s into the pipeline and receives `ByteBuffer`s from it. The handlers can then perform a set of transformations. And one thing they can do is parse the `ByteBuffer`s into higher level objects.
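For illustration, such a handler looks roughly like the sketch below. The type and names are simplified stand-ins (the real parser lives in NIORedis), and it uses the NIO 1 API (`ctx:`) this document is based on:

```swift
import NIO

// Simplified stand-in for NIORedis.RESPValue, only for this sketch.
enum RESPValue {
    case simpleString(ByteBuffer)
}

/// Sketch of a pipeline handler that turns inbound `ByteBuffer`s into
/// higher level `RESPValue`s and passes those on to the next handler.
final class RESPParserHandler: ChannelInboundHandler {
    typealias InboundIn  = ByteBuffer
    typealias InboundOut = RESPValue

    func channelRead(ctx: ChannelHandlerContext, data: NIOAny) {
        var buffer = self.unwrapInboundIn(data)      // unbox the ByteBuffer
        // A real RESP parser would run here; we just wrap the whole buffer.
        if buffer.readableBytes > 0,
           let slice = buffer.readSlice(length: buffer.readableBytes) {
            // box the parsed value again and pass it down the pipeline
            ctx.fireChannelRead(self.wrapInboundOut(.simpleString(slice)))
        }
    }
}
```

It is that last `wrapInboundOut` call, which boxes a non-`ByteBuffer` value into a `NIOAny`, that turns out to be surprisingly expensive.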
This is what we did in the original 0.5.0 release:

```
Socket =(BB)=> NIORedis.RESPChannelHandler =(RESPValue)=> RedisServer.RedisCommandHandler
Socket <=(BB)= NIORedis.RESPChannelHandler <=(RESPValue)= RedisServer.RedisCommandHandler
```

When values travel the NIO pipeline, they are boxed in `NIOAny` objects. Crazy enough, just this boxing has a very high overhead for non-ByteBuffer objects, i.e. putting `RESPValue`s in and out of `NIOAny` while passing them from the parser to the command handler takes about *9%* of the runtime (at least in a sample below ...).

To work around that, `RedisCommandHandler` is now a *subclass* of `RESPChannelHandler`. This way we never wrap non-ByteBuffer objects in `NIOAny`, and the pipeline looks like this:

```
Socket =(BB)=> RedisServer.RedisCommandHandler : NIORedis.RESPChannelHandler
Socket <=(BB)= RedisServer.RedisCommandHandler : NIORedis.RESPChannelHandler
```

We do not have a completely idle system for more exact performance testing, but this seems to lead to a 3-10% speedup (measurements vary quite a bit).

Follow-up:

- get `MemoryLayout.size` down to max 24, and we can avoid a malloc
  - but `ByteBuffer` (and `Data`) are already 24
- made `RESPError` class backed in swift-nio-redis. Reduces the size of `RESPValue` from 49 to 25 bytes (still 1 byte too much)
- @weissi suggests backing `RESPValue` w/ a class storage as well, we might try that. Though it takes away yet another Swift feature (enums) for the sake of performance.

### Worker Sync Variants

#### GCD DispatchQueue for synchronization

Originally the project used a `DispatchQueue` to synchronize access to the in-memory databases. The overhead of this is pretty high, so we switched to a RWLock for a ~10% speedup.

But you don't lock a NIO thread, you say?! Well, this is all very fast in-memory database access, which in *this specific case* is actually faster than capturing a dispatch block and submitting that to a queue (which also involves a lock ...).

#### NIO.EventLoop instead of GCD

We wondered whether a `NIO.EventLoop` might be faster than a `DispatchQueue` as the single-threaded synchronization point for the worker thread (`loop.execute` replacing `queue.async`). There is no measurable difference; GCD is a tiny bit faster.

#### Single Threaded

Also tested a version with no threading at all (Node.js/Noze.io style). That is, not just lowering the thread count to 1, but taking out all `.async` and `.execute` calls. This is surprisingly fast; the synchronization overhead of `EventLoop.execute` and `DispatchQueue.async` is very high. Running a single-thread optimized version still reached ~75% of the dual-thread variant (but at a lower CPU load).

Follow-up: if we took out the CoW data structures, which wouldn't be necessary in the single-threaded setup anymore, it seems quite likely that this would be faster than the threaded variant.

## Instruments

I've been running Instruments on Redi/S, with SwiftNIO 1.3.1. Below are annotated callstacks.

Notes:

- just the `NIOAny` boxing (passing RESPValues in the NIO pipeline) has an overhead of *9%*!
  - this probably implies that just directly embedding NIORedis into RedisServer would lead to that speedup.
- from `flush` to `Posix.write` takes NIO another 10%

### Single Threaded

This is the single threaded version, to remove synchronization overhead from the picture.
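For orientation, the GET path being sampled boils down to roughly the following (an illustrative sketch with made-up names, not the actual Redi/S source): parsing, the dictionary lookup and the reply rendering all happen inline on the one eventloop thread, with no `.async`/`.execute` hop in between.

```swift
import Foundation
import NIO

// Hypothetical single-threaded GET path; a plain dictionary stands in for
// the real RESPValue-based database, owned by the one eventloop thread.
var database = [Data: Data]()

func handleGET(key: Data, ctx: ChannelHandlerContext) {
    var out = ctx.channel.allocator.buffer(capacity: 64)
    if let value = database[key] {                  // direct dictionary access
        _ = out.write(staticString: "$")            // RESP bulk string reply
        _ = out.write(string: String(value.count))
        _ = out.write(staticString: "\r\n")
        _ = out.write(bytes: value)
        _ = out.write(staticString: "\r\n")
    } else {
        _ = out.write(staticString: "$-1\r\n")      // RESP nil reply
    }
    ctx.writeAndFlush(NIOAny(out), promise: nil)    // flush → Posix.write
}
```

The trace below was captured while running: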
```
redis-benchmark -p 1337 -t get -n 1000000 -q
```

- Selector.whenReady: 98.4%
  - KQueue.kevent 2.1%
  - handleEvent 95.4%
    - readFromSocket 89.8%
      - Posix.read 8.7%
      - RedisChannel.read() 77.2%
        - decodedValue(_:in:) 71.2%
          - 1.3% alloc/dealloc
          - decodedValue(_:in:) 68.8%
            - wrapInboundOut: 1.8%
            - RedisCommandHandler: 66.2% (parsing ~11%)
              - unwrapInboundIn: 1.7%
              - parseCommandCall: 4.7%
                - dealloc 1.3%
                - stringValue 1.3% (getString)
                - uppercased 0.7%
              - callCommand: 55.3%
                - alloc/dealloc 2%
                - withKeyValue 51.6%
                  - release_Dealloc 1.6%
                  - Data init, still using alloc! 0.2%
                  - Commands.GET 48.4%
                    - ctx.write 46.8%
                      - writeAndFlush 45%
                        - RedisChannelHandler.write 8%
                          - specialized RedisChannelHandler.write 6.7%
                            - unwrapOutboundIn 2.6%
                            - wrapOutboundOut 0.6%
                            - ctx.write 2.8%
                              - unwrap 2.5%
                        - flush 36.2%
                          - pendingWritesManager 32.7%
                            - Posix.write 26.3%
                        - NIOAny 1.2%
                          - allocated boxed opaque existential

### Multi Threaded w/ GCD Worker Thread

- Instruments crashed once, so numbers are not 100% exact, but very close

```
redis-benchmark -p 1337 -t set -n something -q
```

- GCD: worker queue 17.3%
  - GCD overhead till callout: 3%
  - worker closure: 14.3%
    - SET: 13.8%, 12.8% in closure
      - ~2% own code
      - 11% in:
        - 5% nativeUpdateValue(_:forKey:)
        - 1.3% nativeRemoveObject(forKey:)
        - 4.7% SelectableEventLoop.execute (malloc + locks!)
  - Summary: raw database ops 5.3%, write-sync 4.7%, GCD sync 3%+, own ~2%
- EventLoops: 82.3%, .run 81.4%
  - PriorityQueue: 4.8%
    - alloc/free 2.1%
  - invoke
  - READ PATH - 37.9%
    - selector.whenReady 36.1%
      - KQueue.kevent (6.9%)
      - handleEvent (28.7%)
        - readComplete 2.1%
          - flush 1.4% **** removed flush in cmdhandler
        - readFromSocket (25%)
          - socket.read 5.3%
            - Posix.read 4.9%
            - alloc 0.7%
          - invokeChannelRead 18.2%
            - RedisChannel.read 17.6% (overhead Parser=>Cmd: 5.2%) **
              - 0.4% alloc, 0.3% unwrap
              - BB.withUnsafeRB 16.6% (parser)
                - decoded(value:in:) 14.9%
                  - dealloc 0.5%, ContArray.reserveCap 0.2%
                  - decoded(value:in:) 13.5% (recursive top-level array!)
                    - wrapInboundOut 0.7%
                    - fireChannelRead 12.6%
                      - RedisCmdHandler 12.4% **
                        - unwrapInboundIn 1.1%
                        - parseCmdCall 2.1%
                          - RESPValue.stringValue 0.6%
                          - dealloc 0.6%
                          - upper 0.4%, hash 0.1%
                        - callCommand 6.7%
                          - RESPValue.keyValue 1.4%
                            - BB.readData(length:) DOES AN alloc?
                            - the release closure!
                          - Commands.SET 4.8%
                            - ContArray.init 0.2%
                            - runInDB 3.3% (pure sync overhead)
  - WRITE PATH - 31.1% (dispatch back from DB thread)
    - Commands.set 30.4%
      - cmdctx.write 30% (29.6% specialized)
        - 1.2% own rendering overhead
        - writeAndFlush 28.5%
          - flush 18.7%
            - socket flush 17.9%
              - Posix.write 14%
          - write 9.6%
            - RedisChannelHandler.write 9.6%
              - specialized 8.7% ???
                - ByteBuffer.write - 3%
                - unwrapOutboundIn - 1.4%
                - ctx.write 1.2% (bubble up)
                - integer write 1% (buffer.write(integer:endianness:as:)) ****
              - NIOAny 0.8%
  - 1.5% dealloc
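For reference, the hop pattern that gives this trace its shape is roughly the following (hypothetical names; a sketch of the pattern, not the actual Redi/S source): the command handler dispatches the database mutation onto the GCD worker queue, and the reply is rendered and flushed after hopping back onto the channel's eventloop via `SelectableEventLoop.execute`.

```swift
import Foundation
import Dispatch
import NIO

// Hypothetical stand-ins for the real types, just to show the hop pattern.
var database = [Data: Data]()
let workerQueue = DispatchQueue(label: "worker")   // the "GCD: worker queue" frames

func handleSET(key: Data, value: Data, ctx: ChannelHandlerContext) {
    let loop = ctx.eventLoop
    workerQueue.async {                            // GCD overhead till callout
        database[key] = value                      // nativeUpdateValue(_:forKey:)
        loop.execute {                             // SelectableEventLoop.execute
            var out = ctx.channel.allocator.buffer(capacity: 5)
            _ = out.write(staticString: "+OK\r\n") // render the RESP reply
            ctx.writeAndFlush(NIOAny(out), promise: nil)  // WRITE PATH → Posix.write
        }
    }
}
```

Each of the two `async`/`execute` hops shows up as its own allocation and locking cost in the trace above, which is exactly the overhead the single-threaded variant avoids.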