Friday, January 6, 2017

Getting all parent directories of a path

edit: reddit updates

A few minutes ago I needed to solve the trivial problem of getting all parent directories of a path. It's very easy to do imperatively, but that simply wouldn't satisfy me, so I challenged myself to do it declaratively in Python.

The problem is simple, but let me put an example on the table, so it's even easier to imagine what we're talking about.

Given some path, e.g.

/home/szborows/code/the-best-project-in-the-world

You want the following list of parents:

/home/szborows/code
/home/szborows
/home
/

It's trivial to do this using split and then some for loop. How to make it more declarative?
Thinking more mathematically (mathematicians will perhaps cry out to heaven for vengeance after reading on, but let me at least try...), we simply want all subsets of some ordered set S that form a prefix of S. So we can generate pairs of numbers (1, y), representing all prefixes, where y belongs to [1, len S). We can actually ignore the constant 1 and just operate on single numbers.
In Python, to generate numbers starting from the number of path components and going down, we can simply use range() together with [::-1] (an idiom that reverses a collection). Then join() can be applied to the split path, sliced from 1 to y. That's it. And now a demonstration:

>>> path = '/a/b/c/d'
>>> ['/' + '/'.join(path.split('/')[1:l]) for l in range(len(path.split('/')))[::-1] if l]
['/a/b/c', '/a/b', '/a', '/']
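As an aside - since Python 3.4 the standard library can produce this list directly via pathlib, though the point of the exercise here is to roll it by hand:

```python
from pathlib import PurePosixPath

path = '/home/szborows/code/the-best-project-in-the-world'
# .parents yields the ancestors from the immediate parent down to the root
[str(p) for p in PurePosixPath(path).parents]
# ['/home/szborows/code', '/home/szborows', '/home', '/']
```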

But what about performance? Which will be faster - the imperative or the declarative approach? Intuition suggests that the imperative version will win, but let's check.

On the picture below you can see timeit (n=1000000) results for my machine (i5-6200U, Python 3.5.2+) for three paths:

short_path = '/lel'
regular_path = '/jakie/jest/srednie/zagniezdzenie?'
long_path = '/z/lekka/dlugasna/sciezka/co/by/pierdzielnik/mial/troche/roboty'

Implementations used:

def imper1(path):
    result = []
    for i in range(1, len(path.split('/'))):
        y = '/'.join(path.split('/')[:i]) or '/'
        result.append(y)
    return result

def imper2(path):
    i = len(path) - 1
    l = []
    while i > 0:
        while i != 0 and path[i] != '/':
            i -= 1
        l.append(path[:i] or '/')
        i -= 1
    return l

def decl1(path):
    return ['/' + '/'.join(path.split('/')[1:l])
            for l in range(len(path.split('/')))[::-1] if l]

def decl2(path):
    return ['/' + '/'.join(path.split('/')[1:-l])
            for l in range(-len(path.split('/'))+1, 1) if l] 
 
# decl3 hidden. read on ;-)
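The measurements themselves can be reproduced with timeit; a sketch, repeating decl1 here so the snippet is self-contained, with a smaller repeat count than the n=1000000 used for the chart:

```python
import timeit

def decl1(path):
    return ['/' + '/'.join(path.split('/')[1:l])
            for l in range(len(path.split('/')))[::-1] if l]

paths = {
    'short': '/lel',
    'regular': '/jakie/jest/srednie/zagniezdzenie?',
    'long': '/z/lekka/dlugasna/sciezka/co/by/pierdzielnik/mial/troche/roboty',
}

for name, path in paths.items():
    # time 10000 calls per path; the chart used number=1000000
    elapsed = timeit.timeit(lambda: decl1(path), number=10000)
    print('{}: {:.3f}s'.format(name, elapsed))
```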


It started with imper1 and decl1. I noticed that the imperative version was faster, so I tried to speed up the declarative function by replacing [::-1] with some number tricks. It helped, but not to the extent I anticipated. Then I thought about speeding up imper1 by using lower-level constructs; unsurprisingly, while loops and explicit index checks were faster. Let me ignore decl3 for now and play a little with CPython bytecode.

Looking at my results, not everything is so obvious. decl{1,2} turned out to have decent performance with a 4-part path, which looks like a reasonable average.

I disassembled decl1 and decl2 to see the difference in byte code. The diff is shown below.

30 CALL_FUNCTION    1 (1 positional, 0 keyword pair) | 30 CALL_FUNCTION    1 (1 positional, 0 keyword pair)
33 CALL_FUNCTION    1 (1 positional, 0 keyword pair) | 33 CALL_FUNCTION    1 (1 positional, 0 keyword pair)
36 CALL_FUNCTION    1 (1 positional, 0 keyword pair) | 36 UNARY_NEGATIVE
39 LOAD_CONST       0 (None)                         | 37 LOAD_CONST       4 (1)  
42 LOAD_CONST       0 (None)                         | 40 BINARY_ADD
45 LOAD_CONST       5 (-1)                           | 41 LOAD_CONST       4 (1)  
48 BUILD_SLICE      3                                | 44 CALL_FUNCTION    2 (2 positional, 0 keyword pair)
51 BINARY_SUBSCR



As we can see, [::-1] is implemented as three loads, a BUILD_SLICE and a BINARY_SUBSCR. I think this could be optimized if we had a special opcode like e.g. BUILD_REV_SLICE. My slightly-optimized decl2 is faster because one UNARY_NEGATIVE plus one BINARY_ADD costs less than a LOAD_CONST, a BUILD_SLICE and a BINARY_SUBSCR. The performance gain here is pretty obvious: no matter what, decl2 must be faster than decl1.

What about decl2 vs imper1?
It's more complicated, and it was a little surprise that the seemingly shorter bytecode can turn out slower than its longer counterpart.

  3           0 BUILD_LIST               0        
              3 STORE_FAST               1 (result)
                                             
  4           6 SETUP_LOOP              91 (to 100)                     
              9 LOAD_GLOBAL              0 (range)
             12 LOAD_CONST               1 (1)
             15 LOAD_GLOBAL              1 (len)
             18 LOAD_FAST                0 (path)
             21 LOAD_ATTR                2 (split)
             24 LOAD_CONST               2 ('/')
             27 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             30 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             33 CALL_FUNCTION            2 (2 positional, 0 keyword pair)
             36 GET_ITER                        
        >>   37 FOR_ITER                59 (to 99)
             40 STORE_FAST               2 (i)

  5          43 LOAD_CONST               2 ('/')
             46 LOAD_ATTR                3 (join)
             49 LOAD_FAST                0 (path)
             52 LOAD_ATTR                2 (split)
             55 LOAD_CONST               2 ('/')
             58 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             61 LOAD_CONST               0 (None)
             64 LOAD_FAST                2 (i)
             67 BUILD_SLICE              2
             70 BINARY_SUBSCR
             71 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             74 JUMP_IF_TRUE_OR_POP     80
             77 LOAD_CONST               2 ('/')
        >>   80 STORE_FAST               3 (y)

  6          83 LOAD_FAST                1 (result)
             86 LOAD_ATTR                4 (append)
             89 LOAD_FAST                3 (y)
             92 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             95 POP_TOP
             96 JUMP_ABSOLUTE           37
        >>   99 POP_BLOCK

  7     >>  100 LOAD_FAST                1 (result)
            103 RETURN_VALUE


The culprit was the LOAD_CONST in decl{1,2} that loads the list comprehension as a separate code object, which is then called like a function. Let's see how it looks, just for the record.

>>> dis.dis(decl2.__code__.co_consts[1])
 21           0 BUILD_LIST               0
              3 LOAD_FAST                0 (.0)
        >>    6 FOR_ITER                51 (to 60)
              9 STORE_FAST               1 (l)
             12 LOAD_FAST                1 (l)
             15 POP_JUMP_IF_FALSE        6
             18 LOAD_CONST               0 ('/')
             21 LOAD_CONST               0 ('/')
             24 LOAD_ATTR                0 (join)
             27 LOAD_DEREF               0 (path)
             30 LOAD_ATTR                1 (split)
             33 LOAD_CONST               0 ('/')
             36 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             39 LOAD_CONST               1 (1)
             42 LOAD_FAST                1 (l)
             45 UNARY_NEGATIVE
             46 BUILD_SLICE              2
             49 BINARY_SUBSCR
             50 CALL_FUNCTION            1 (1 positional, 0 keyword pair)
             53 BINARY_ADD
             54 LIST_APPEND              2
             57 JUMP_ABSOLUTE            6
        >>   60 RETURN_VALUE


So this is what list comprehensions look like when compiled to byte code. Nice! Now the performance results make more sense. In the project I was working on, my function for getting all parent paths was called in one place and contributed perhaps less than 5% of the execution time of the whole application, so it would not make sense to optimize this piece of code. But it was a delightful journey into the internals of CPython, wasn't it?
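The listings above can be reproduced with the dis module; a sketch, repeating decl2 so it's self-contained:

```python
import dis

def decl2(path):
    return ['/' + '/'.join(path.split('/')[1:-l])
            for l in range(-len(path.split('/')) + 1, 1) if l]

dis.dis(decl2)  # byte code of decl2 itself

# the comprehension body is compiled into a nested code object
# stored among the function's constants
nested = [c for c in decl2.__code__.co_consts if hasattr(c, 'co_code')]
if nested:  # absent on CPython 3.12+, where comprehensions are inlined
    dis.dis(nested[0])
```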

Now, let's get back to decl3. What have I done to make my declarative implementation 2x faster on the average case and for the right-side outliers? Well... I just reluctantly resigned from putting everything in one line and saved path.split('/') into a separate variable. That's it.
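The post doesn't show decl3, so here is a sketch of what it presumably looks like given that description - the split computed once, everything else unchanged:

```python
def decl3(path):
    parts = path.split('/')  # split once, instead of once per iteration
    return ['/' + '/'.join(parts[1:l])
            for l in range(len(parts))[::-1] if l]

decl3('/a/b/c/d')  # ['/a/b/c', '/a/b', '/a', '/']
```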

So what are learnings?
  • declarative method turned out to be faster than hand-crafter imperative one employing low-level constructs.
    Why? Good question! Maybe because bytecode generator knows how to produce optimized code when it encounters list comprehension? But I have written no CPython code, so it's only my speculation.
  • trying to put everything in one line can hurt - in described case split() function was major performance dragger
reddit-related updates:
Dunj3 outpaced me ;) - his implementation, which is better w.r.t. both "declarativeness" and performance:
list(itertools.accumulate(path.split('/'), curry(os.sep.join)))
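For completeness: curry is not in the standard library; one way to sketch the same idea with stdlib pieces only is to adapt os.sep.join (which takes a single iterable) to the two-argument signature accumulate expects. Note the result also includes '' for the root and the full path itself:

```python
import itertools
import os

def parents_acc(path):
    # adapt os.sep.join to the (accumulated, next_part) signature
    join2 = lambda acc, part: os.sep.join((acc, part))
    return list(itertools.accumulate(path.split(os.sep), join2))

parents_acc('/a/b/c/d')
# ['', '/a', '/a/b', '/a/b/c', '/a/b/c/d'] on POSIX
```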

syntax highlighting done with https://tohtml.com/python/

Tuesday, January 3, 2017

Logstash + filebeat: Invalid Frame Type, received: 1

A post for googlers who stumble upon the same issue - it seems that "overconfiguration" is not a great idea for Filebeat and Logstash.

I decided to explicitly set ssl.verification_mode to none in my Filebeat config and then got the following Filebeat and Logstash errors:

filebeat_1  | 2017/01/03 07:43:49.136717 single.go:140: ERR Connecting error publishing events (retrying): EOF
filebeat_1  | 2017/01/03 07:43:50.152824 single.go:140: ERR Connecting error publishing events (retrying): EOF
filebeat_1  | 2017/01/03 07:43:52.157279 single.go:140: ERR Connecting error publishing events (retrying): EOF
filebeat_1  | 2017/01/03 07:43:56.173144 single.go:140: ERR Connecting error publishing events (retrying): EOF                                                          
filebeat_1  | 2017/01/03 07:44:04.189167 single.go:140: ERR Connecting error publishing events (retrying): EOF



logstash_1       | 07:42:35.714 [Api Webserver] INFO  logstash.agent - Successfully started Logstash API endpoint {:port=>9600}                                                                                                         
logstash_1       | 07:43:49.135 [nioEventLoopGroup-4-1] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 3                                      
logstash_1       | 07:43:49.139 [nioEventLoopGroup-4-1] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 1                                      
logstash_1       | 07:43:50.150 [nioEventLoopGroup-4-2] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 3                                      
logstash_1       | 07:43:50.154 [nioEventLoopGroup-4-2] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 1                        
logstash_1       | 07:43:52.156 [nioEventLoopGroup-4-3] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 3                                      
logstash_1       | 07:43:52.157 [nioEventLoopGroup-4-3] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 1                                      
logstash_1       | 07:43:56.170 [nioEventLoopGroup-4-4] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 3                                      
logstash_1       | 07:43:56.175 [nioEventLoopGroup-4-4] ERROR org.logstash.beats.BeatsHandler - Exception: org.logstash.beats.BeatsParser$InvalidFrameProtocolException: Invalid Frame Type, received: 1


It seems it's better to stay quiet with Filebeat :) Hopefully this helped to resolve your issue.
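For reference, the fragment in my filebeat.yml looked roughly like this (the hosts value is illustrative; the ssl line is the point). Deleting the explicit ssl.verification_mode entry and leaving the default made the errors go away:

```yaml
output.logstash:
  hosts: ["logstash:5044"]
  # spelling this out explicitly triggered the Invalid Frame Type errors;
  # removing the line (keeping the default) fixed the connection
  ssl.verification_mode: none
```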

Tuesday, December 13, 2016

std::queue's big default footprint in assembly code

Recently I've been quite busy and now I'm kind of scrounging my way back into the C++ world. A friend of mine told me about the IncludeOS project and I thought it might be a pretty good exercise to put my hands on the keyboard and help in this wonderful project.

To be honest, the learning curve is quite steep (or I'm getting too old to learn this fast) and I'm still distracted by a lot of other things, so no big deliverables so far... but just by watching the discussion on Gitter and integrating it with what I know, I spotted a probably obvious, but slightly surprising thing about std::queue.

std::queue is not a container. Wait, what?, you ask. It's a container adapter. It doesn't have an implementation of its own. Instead, it takes another container, uses it as the underlying storage and just provides a convenient interface for the end user. By the way, it isn't the only one - there are others, like std::stack and std::priority_queue, to name a few.

One of the dimensions in which C++ shines is its options for customizing stuff. We can customize things like memory allocators. In container adapters, we can swap out the underlying container if we decide that the one chosen by the library writers isn't a good match for us.

By default, perhaps because std::queue requires fast access at both the beginning and the end, its underlying container is std::deque. std::deque provides O(1) complexity for pushing/popping at both ends. A perfect match, isn't it?

Well, yes - if you care about performance at the cost of increased binary size. As it turns out, simply changing std::deque to std::vector:

std::queue<int> qd;
std::queue<int, std::vector<int>> qv;

The generated assembly code for x86-64 clang 3.8 (-O3 -std=c++14) is 502 and 144 lines, respectively.

I know that in most contexts binary size is a secondary consideration, but I still believe it's an interesting fact that the difference is so big. In other words, there must be a lot going on under the bonnet of std::deque. I don't recommend changing deque to vector in production - it can seriously damage your performance.

You can play around with the code here: https://godbolt.org/g/XaLhS7 (code based on Voultapher example).

Wednesday, September 7, 2016

elasticdiff

Recently I needed to compare two ElasticSearch indices. To be honest, I was pretty sure I'd find something on the Internet. It was a surprise that no such tool existed. I thought this was a good time to pay off a portion of my debt to open source :)

Enter ElasticDiff

elasticdiff is a simple project hosted on GitHub which allows you to compare two ES indices without pain. It's in early development, so don't expect much, but for simple indices it works pretty well.

Usage is quite trivial:
python3 elasticdiff.py -t docType -i key http://1.2.3.4/index1 http://2.3.4.5/index1



Output was designed to imitate diff command:
only in left: a5
only in left: a1
only in right: a4
only in right: a2
entries for key a3 differ
--- 

+++ 

@@ -1,4 +1,4 @@

 {
   "key": "a3",
-  "value": "DEF"
+  "value": "XYZ"
 }
Summary:
2 entries only in left index
2 entries only in right index
1 common entries
  0 of them are the same
  1 of them differ

Hopefully someone will find this tool useful.

More information is available at the GitHub page. I'm planning to release this on PyPI when it gets more mature, and I will update this post when that happens.


Sunday, May 8, 2016

me @ ACCU 2016 - what every C++ programmer should know about modern compilers

In Poland we have this proverb: "pierwsze koty za płoty"*. Basically, we say it when we do something for the first time in our lives. After finding a translation, I was quite amused - "the first pancake is always spoiled". Recently I had my debut abroad at the ACCU 2016 conference. You decide whether it was a spoiled pancake or not :)

My short 15-minute talk was about modern compilers. I start with a basic introduction, which may be a bit surprising, then I move to the C++ standard's perspective and undefined behavior, then there's a part about optimizations, and finally I cover the ecosystem that has grown around compilers in recent years (sanitizers, third-party tools based on compilers, etc.). Enjoy!


Slides: with comments / without comments

GCC 6.1 released [link]
If GCC 6.1 had been released two weeks earlier, I would definitely have included a slide about how verbose and helpful compilers are nowadays.
Several years ago they were unhelpful and very quiet - even if something was going terribly wrong (which the programmer would discover at run time) they would happily compile, assuming that the programmer knew what she was doing.
And now... what a change. Compilers have become so friendly that GCC 6.1 can even let you know that you have unmerged git content in a file!

* direct translation doesn't work :) - "first cats over the fences".

Thursday, March 17, 2016

Really simple JSON server

As some of you may already know, there's this json-server project, which aims to provide an easy way to create fake servers returning JSON responses over HTTP. It's a great project, but if you need something simpler that does not necessarily follow all of the REST rules, then you're out of luck.

For instance, if you use json-server with PUT/POST requests, the underlying database will change.

really-simple-json-server, on the other hand, is an effort to create a really simple JSON server. To show you how simple it is, let's have a look at some example routes (example.json).

{
    "/config": {"avatar_upstream_server": "172.17.42.42"},
    "/u/1/friends": ["slavko", "asia", "agata", "zbyszek", "lucyna"],
    "/u/2/friends": ["adam", "grzegorz"],
    "/u/3/friends": [],
    "/u/4/friends": ["slavko"]
}

Then, assuming all dependencies (see below) are installed, it's just a matter of starting the server:

$ ./server.py --port 1234 example.json

And we can start querying the server!

$ curl http://localhost:1234/config
{"avatar_upstream_server": "172.17.42.42"}

The project uses Python 3.5.1 along with the aiohttp package. It ships with a Docker image, so it's pretty easy to start hacking.
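The core idea can be sketched with nothing but the standard library (the real project uses aiohttp; this stdlib version is just an illustration of the routing scheme, not the project's code):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def make_server(routes, port=0):
    """Serve each path in `routes` as a canned JSON response."""
    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path in routes:
                body = json.dumps(routes[self.path]).encode()
                self.send_response(200)
                self.send_header('Content-Type', 'application/json')
                self.send_header('Content-Length', str(len(body)))
                self.end_headers()
                self.wfile.write(body)
            else:
                self.send_error(404)

        def log_message(self, *args):  # silence per-request logging
            pass

    # port=0 binds an ephemeral port; pass 1234 to mimic the example above
    return HTTPServer(('127.0.0.1', port), Handler)
```

With routes loaded from example.json, make_server(routes, 1234).serve_forever() answers the same curl query shown earlier.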

$ docker build -t szborows/python351_aiohttp .
$ docker run -it -v $PWD:/app:ro szborows/python351_aiohttp /bin/bash -c "/app/server.py --port 1234 /app/example.json"