Jython, My Love Affair Continues: Interpreters Are Not Evil!

Java is the world's most popular language and Python is the third.
Jython is sitting pretty right now; more people need to realise how
cool it really is!

Two and a half years ago when I started working with Morgan Stanley I only know of Python; now I am a keen Python evangelist and a huge fanboy of Jython.

Sadly, I have also resigned from my roles as a consultant to Morgan Stanley. Do not worry though, I hope to be starting a new and exciting challenge in the new year.

Anyhow, I digress. Back to Jython (and by extension). Python is an interpreter; I have been known not to favour interpreters in the past. In all honestly, I pretty much do not favour them now for the same reasons. But, the use cases for them is increasing. This is due to two effects: there is hybrid programming and just in time compilation.

Now, neither of these are new. They are old concepts. Java started to use JIT back in the nineties; I am sure there are earlier examples than that, but I expect Java is the technology which introduced JIT to most people. The thing is that JIT had gotten a lot better over the years. This means that Python does not need to be slow; Pypy is quite quick indeed: 

So - even though interpreter can be evil...
An older post from Nerds Central.
... I am no longer convinced they are always evil. I guess that is only reasonable as actually I lot of my doctorate was performed in the interpreted language Tcl. This is where we get back to the other use case: hybrid programming.

I hybrid programming, the interpreter is elevated; it starts to define the order of action and their interactions but the interpreter does not perform those actions itself. The following piece of Sonic Field Python performs convolution reverberation:

@sf_parallel
def reverbInner(signal,convol,grainLength):
    mag=sf.Magnitude(+signal)
    if mag>0:
        signal_=sf.Concatenate(signal,sf.Silence(grainLength))
        len=sf.Length(+signal_)
        signal_=sf.FrequencyDomain(signal_)
        signal_=sf.CrossMultiply(convol,signal_)
        signal_=sf.TimeDomain(signal_)
        newMag=sf.Magnitude(+signal_)
        signal_=sf.NumericVolume(signal_,mag/newMag)        
        # tail out clicks due to amplitude at end of signal
        return sf.Realise(sf.Clean(sf.Cut(0,len,signal_)))
    else:
        -convol
        return sf.Realise(signal)

@sf_parallel
def reverberate(signal,convol):
    print "Reverberate"
    grainLength = sf.Length(+convol)
    convol_=sf.FrequencyDomain(sf.Concatenate(convol,sf.Silence(grainLength)))
    signal_=sf.Concatenate(signal,sf.Silence(grainLength))
    out=[]
    for grain in sf.Granulate(signal_,grainLength):
        (signal_i,at)=grain
        signal_i=sf.Realise(signal_i)
        out.append((reverbInner(signal_i,+convol_,grainLength),at))
    -convol_

    return sf.Clean(sf.Normalise(sf.MixAt(out)))

This has some aspects of declarative programming about it. Whilst we see loops that which they are looping over is immensely complex compared to the loop its self. The Python has no notion of the underlying mathematics; indeed, even the type of the objects with which it is working  are not relieved to the Python. Further, there is an abstraction and optimisation layer underneath this Python which means how the mathematical challenge described by Python is solved is not entierly expressed here; many optimisations and much reordering can occur.

The interpreted, dynamic language Python lends it self to expressing concepts at a high level. I personally find it easy to lay out complex ideas in Python. Using hybrid programming, the Python then 'gets out of the way' of performance. The type safe, heavy lifting language underneath (in this case Java) can handle the job of getting the work done. This means my hybrid program is not wasting significant computer power on interpretation; it is not evil.

My hope is that over time the barriers between Python and that which it is driving will drop further. Python can become faster, even in the fast of dynamic dispatch, give enough effort. Indeed, even Java and C++ face dynamic dispatch performance issues. C++ has shown us the solution is to materialise semantically-dynamic dispatch into static dispatch at compile time. I am sure that eventually similar approaches will be further developed for the JIT compilers of dynamic languages and the performance barriers will fall further. I have some ideas how this can be achieved for Java as well - maybe I will even get the time to try them one day.

In summary, a confluence of JIT, materialised dynamic dispatch and hybrid programming is permitting the old paradigm of hybrid programming to have new life. Over time this approach may come to be completely dominant. As long as we retain the performance goals of hybrid programming and do not collapse into old, bad habits of attempting to implement too much in the interpreter and forgetting the evil that lies there, I can safely say this future is good, not evil.

Isn't sun.misc.Unsfe going away in Java 9

Dispel the serpents of doubt.
[ I love mixing mythologies ]

No - from the man himself.

So stop worrying. Removing the performance benefits of sun.misc.Unsafe would fatally wound Java and other JVM languages. Whilst the Unsafe class itself is labile, the functionality it contains will remain.

Recently I was presenting at a Java technology internal conference for a client. Brian Goetz also presented. Aside from his usual, and basically incorrect, statements about the quality of the lambda implementation in Java 8 and that it takes linear time to scan a class path (sorry - but I have had to say that - please stop saying these things Brain, you make yourself look silly) he did say some really interesting, and reassuring stuff about Unsafe and project Panama.

What is the situation?

I am going to paraphrase from memory now so these are not Mr Goetz's words exactly but more what I took away from them:
  1. Unsafe came about as a 'dumping ground' for features which were required for performance reasons. Each feature in Unsafe is there to service other parts of the JDK.
  2. Over time the community learned to use it for other things apart from internal JDK classes.
  3. Now Oracle see Unsafe as an 'incubator' for new features.
  4. Features are/will-be tried out in Unsafe and then can migrate through two paths: First - they are rubbish and will be dropped. Second - they are great, the community loves them and so they will be solidified in to a main stream JDK class.
So, off heap access is going nowhere. However, in Java 9 the enhanced volatile access code in Unsafe is going to migrate to project Valhalla: http://openjdk.java.net/jeps/193

BG also noted that the off heap access in Unsafe is going to stay there for Java 9. Whilst Panama is likely to completely replace Unsafe offheap eventually, that will not be in 9.

What does this mean for techniques like that I described in the Java Advent Calendar this year? Well, it means that in Java 9 they will just work. In Java 10 and above the low level calls might change. This makes it a good idea to hide those calls behind an abstraction. Now, that abstraction must be final and avoid virtual dispatch (ouch - stop shouting - yes I did mean that) but the high performance Java community are getting pretty good at that sort of thing these days. Once we have these abstractions in place we can migrate from 9 to 10 seamlessly.

Performance, Performance, Performance

There is a lot of concern about the idea 'Unsafe is going away'. At that conference mentioned, someone asked Brain Goetz something like 'can you promise the performance features of Unsafe are not going away'. He broadly did. That is important as it indicates sanity. We might think that Java is a 'Blue Collar Language' (what ever that means); but performance matters to blue collar coders just as much as white collar ones; actually, it probably matters more. Whilst researchers and theorists might love their functional definitions and declarative abstractions, people in day jobs selling yet another plastic action figure on Black Friday need one thing and one thing only - performance.

Seriously, whilst the chattering classes are flocking in their thousands to electric vehicles and champagne bars, the blue collar workers of the US are out buying 500hp pickup trucks and knocking back JD. The picture is much the same with vodka and golf GTIs in the UK.

Unsafe let Java properly break into performance territory. With technology like https://lmax-exchange.github.io/disruptor/ the concept of low latency Java was born [yes, it does use Unsafe - see  here https://lmax-exchange.github.io/disruptor/docs/com/lmax/disruptor/util/Util.html ]. Azul Zing is breaking down the garbage collection barriers to huge memory sizes https://www.azul.com/products/zing/. Even super heavy lifting technology like COBOL is now running on the JVM https://www.youtube.com/watch?v=ZqjI3AZJhI8.

For Java to retain its number one place as the world's most popular programming language (http://nerds-central.blogspot.co.uk/2015/11/cobol-in-top-20-of-tiobe-what-does-this.html) it absolutely must get more performant, never less. Oracle certainly bang on about performance enough (https://blogs.oracle.com/thejavatutorials/entry/learn_more_about_performance_and) so I can only assume they are sane enough to realise that performance is absolutely key to the success of Java and therefore their ongoing sales of support and Java based products.


A work stealing task schedular which supports a simple closure based submission system, lazy evaluation and recursive submission.

Performance, and we are cooking.
A work stealing task schedular which supports a simple closure based submission system, lazy evaluation and recursive submission. 

This has also been posted on Sonic Field.

The schedular has gone though a lot of changes over the years; this new work steeling feature, I believe, really fixes a lot of the deficiencies in the previous design without adding and complexity to the user experience.


So let's start by an explanation of the schedular in general. This is a task based lazy schedular. We create a closure around a 'task' and then return a 'SuperFuture' for it. The thing about SuperFuture is that it does not do anything until it has to. Unlike some Future based task modules, this one does not compute stuff as it turns up; rather it computes stuff as it is needed. 

When I restarted work on this a few days again I realised I had forgotten how the previous, simpler version worked. To avoid this mistake again, I have very, very heavily commented the code. So, rather than duplicate everything, I leave the code and comments to explain the rest.



Note - this code is all AGPL 3 - please respect copyright.


# For Copyright and License see LICENSE.txt and COPYING.txt in the root directory
import threading
import time
from java.util.concurrent import Callable, Future, ConcurrentLinkedQueue, \
                                 ConcurrentHashMap, Executors, TimeUnit
from java.util.concurrent.locks import ReentrantLock

from java.lang import System
from java.lang import ThreadLocal
from java.lang import Thread
from java.util import Collections
from java.util.concurrent import TimeUnit
from java.util.concurrent.atomic import AtomicLong, AtomicBoolean

"""
Work Stealing Lazy Task Scheduler By Dr Alexander J Turner

Basic Concepts:
===============

- Tasks are closures. 
- The computation of the task is called 'execution'.
- They can be executed in any order* on any thread.
- The data they contain is immutable.
- Making a task able to be executed is called submission.
- Execution is lazy**.
- Tasks are executed via a pool of threads and optionally one 'main' thread.
- Ensuring the work of a task is complete and acquiring the result (if any)
  is called 'getting' the task.

Scheduling Overview:
====================

* Any order, means that subitted tasks can be executed in any order though there
can be an order implicit in their dependencies.

IE taskA can depend on the result of taskB. Therefore.
- taskA submits taskA for execution.
- taskB submits taskC, taskD and taskE for execution.
- taskB now 'gets' the results of taskC, taskD and taskE

In the above case it does not matter which order taskC, taskD and taskE are
executed.

** Lazy execution means that executing a task is not done always at task 
submission time. Tasks are submitted if one of the following conditions is 
met.
- the maximum permitted number of non executing tasks has been reached. See
  'the embrace of meh' below.
- the thread submitting the task 'gets' and of the tasks which have not yet
  been submitted.
- a thread would is in the process of 'getting' a task but would block waiting
  for the task to finish executing. In this results in 'work stealing' (see
  below) where any other pending tasks for any threads any be executed by the
  thread which would otherwise block.

Embrace of meh
==============

Meh, as in a turm for not caring or giving up. This is a form of deadlock in
pooled future based systems where the deadlock is causes by a circular 
dependency involving the maximum number of executors in the pool rather than a
classic mutex deadlock. Consider this scenario:

- There are only two executors X and Y
- taskA executes on X
- taskA sumbits and then 'gets' taskB
- taskB executes on Y
- taskB submits and then 'gets' taskC
- taskC cannot execute because X and Y are blocked
- taskB cannot progress because it is waiting for taskC
- taskA cannot progress because it is waiting for taskB
- the executors cannt free to execute task as they are blocked by taskA and 
  taskB
- Meh

The solution used here is a soft upper limit to the number of tasks submitted
to the pool of executors. When that upper limit is reached, new tasks are 
not submitted for asynchronous execution but are executed immediately by the
thread which submits them. This prevents exhaustion of the available executors
and therefore prevents the embrace of meh.

Exact computation of the number of running executors is non tricky with the
code used here (java thread pools etc). Therefore, the upper limit is 'soft'.
In other words, sometimes more executors are used than the specified limit. This
is not a large issue here because the OS scheduler simply time shares between 
the executors - which are in fact native threads.

The alternative of an unbounded upper limit to the number of executors in not
viable; every simple granular synthesis or parallel FFT work can easily 
exhaust the maximum number of available native threads on modern machines. Also,
current operating systems are not optimised for time sharing between huge
numbers of threads. Finally, there is a direct link between the number of
threads running and the amount of memory used. For all these reasons, a soft
limited thread pool with direct execution works well.

Work Stealing
=============
Consider:
- taskA threadX submits taskB
- taskA threadX gets taskB
- taskB threadY submits taskC and taskD
- taskB threadY gets taskC
- there are no more executors so taskC is executed on threadY
- at this point taskC is pending execution and taskA threadX is waiting for the
- result of taskB on threadY.
- threadX can the stop waiting for taskB and 'steal' taskC.

Note that tasks can be executed in any order on any thread.

"""

# CONSTANTS AND CONTROL VARIABLES
# ===============================

# The maximum number of executors. Note that in general the system
# gets started by some 'main' thread which is then used for work as 
# well, so the total number of executors is often one more than this 
# number. Also note that this is a soft limit, synchronisation between
# submissions for execution is weak so it is possible for more executors
# to scheduled.
SF_MAX_CONCURRENT = int(System.getProperty("sython.threads"))

# Tracks the number of current queue but not executing tasks
SF_QUEUED         = AtomicLong()

# Tracks the number of executors which are sleeping because they are blocked
# and are not currently stealing work
SF_ASLEEP         = AtomicLong()

# Marks when the concurrent system came up to make logging more human readable
SF_STARTED        = System.currentTimeMillis()

# Causes scheduler operation to be logged
TRACE             = str(System.getProperty("sython.trace")).lower()=="true"

# A thread pool used for the executors 
SF_POOL    = Executors.newCachedThreadPool()

# A set of tasks which might be available for stealing. Use a concurrent set so
# that it shares information between threads in a stable and relatively 
# efficient way. Note that a task being in this set does not guarantee it is
# not being executed. A locking flag on the 'superFuture' task management
# objects disambiguates this to prevent double execution. 
SF_PENDING = Collections.newSetFromMap(ConcurrentHashMap(SF_MAX_CONCURRENT*128,0.75,SF_MAX_CONCURRENT))

# EXECUTION
# =========

# Define the logger method as more than pass only is tracing is turned on
if TRACE:
    # Force 'nice' interleaving when logging from multiple threads
    SF_LOG_LOCK=ReentrantLock()
    print "Thread\tQueue\tAsleep\tTime\tMessage..."
    def cLog(*args):
        SF_LOG_LOCK.lock()
        print "\t".join(str(x) for x in [Thread.currentThread().getId(),SF_QUEUED.get(),SF_ASLEEP.get(),(System.currentTimeMillis()-SF_STARTED)] + list(args))
        SF_LOG_LOCK.unlock()
else:
    def cLog(*args):
        pass

cLog( "Concurrent Threads: " + SF_MAX_CONCURRENT.__str__())
    
# Decorates ConcurrentLinkedQueue with tracking of total (global) number of
# queued elements. Also remaps the method names to be closer to python lists
class sf_safeQueue(ConcurrentLinkedQueue):
    # Note that this is actually the reverse of a python pop, this is actually
    # equivalent to [1,2,3,4,5].pop(0).
    def pop(self):
        SF_QUEUED.getAndDecrement()
        r = self.poll()
        SF_PENDING.remove(r)
        return r
    
    def append(self, what):
        SF_QUEUED.getAndIncrement()
        SF_PENDING.add(what)
        self.add(what)

# Python implements Callable to alow Python closers to be executed in Java
# thread pools
class sf_callable(Callable):
    def __init__(self,toDo):
        self.toDo=toDo
    
    # This method is that which causes a task to be executed. It actually
    # executes the Python closure which defines the work to be done
    def call(self):
        cLog("FStart",self.toDo)
        ret=self.toDo()
        cLog("FDone")
        return ret

# Holds the Future created by submitting a sf_callable to the SF_POOL for
# execution. Note that this is also a Future for consistency, but its work
# is delegated to the wrapped future. 
class sf_futureWrapper(Future):
    def __init__(self,toDo):
        self.toDo   = toDo

    def __iter__(self):
        return iter(self.get())
    
    def isDone(self):
        return self.toDo.isDone()
    
    def get(self):
        return self.toDo.get()

# Also a Future (see sf_futureWrapper) but these execute the python closer
# in the thread which calls the constructor. Therefore, the results is available
# when the constructor exits. These are the primary mechanism for preventing
# The Embrace Of Meh.
class sf_getter(Future):
    def __init__(self,toDo):
        self.toDo=toDo
        cLog("GStart",self.toDo)
        self.result=self.toDo()
        cLog("GDone")

    def isDone(self):
        return True

    def get(self):
        return self.result

# Queues of tasks which have not yet been submitted are thread local. It is only
# when a executor thread would become blocked that we go to work stealing. This
# class managed that thread locallity.
# TODO, should this, can this, go to using Python lists rather than concurrent
# linked queues.
class sf_taskQueue(ThreadLocal):
    def initialValue(self):
        return sf_safeQueue()

# The thread local queue of tasks which have not yet been submitted for
# execution
SF_TASK_QUEUE=sf_taskQueue()

# The main coordination class for the schedular. Whilst it is a future
# it actually deligates execution to sf_futureWrapper and sf_getter objects
# for synchronous and asynchronous operation respectively
class sf_superFuture(Future):

    # - Wrap the closure (toDo) which is the actual task (work to do)
    # - Add that task to the thread local queue by adding self to the queue
    #   thus this object is a proxy for the task.
    # - Initialise a simple mutual exclusion lock.
    # - Mark this super future as not having been submitted for execution. This
    #   is part of the mechanism which prevents work stealing resulting in a
    #   task being executed twice.
    def __init__(self,toDo):
        self.toDo=toDo
        queue=SF_TASK_QUEUE.get()
        queue.append(self)
        self.mutex=ReentrantLock()
        self.submitted=False

    # Used by work stealing to submit this task for immediate execution on the
    # the executing thread. The actual execution is delegated to an sf_getter
    # which executes the task in its constructor. This (along with submit) use
    # the mutex to manage the self.submitted field in a thread safe way. No
    # two threads can execute submit a super future more than once because
    # self.submitted is either true or false atomically across all threads. 
    # The lock has the other effect of synchronising memory state across cores
    # etc.
    def directSubmit(self):
        # Ensure this cannot be executed twice
        self.mutex.lock()
        if self.submitted:
            self.mutex.unlock()
            return
        self.submitted=True
        self.mutex.unlock()
        # Execute
        self.future=sf_getter(self.toDo)
    
    # Normal (non work stealing) submition of this task. This might or might not
    # result in immediate execution. If the total number of active executors is
    # at the limit then the task will execute in the calling thread via a
    # sf_getter (see directSubmit for more details). Otherwise, the task is 
    # submitted to the execution pool for asynchronous execution.
    #
    # It is important to understand that this method is not called directly
    # but is called via submitAll. submitAll is the method which subits tasks
    # from the thread local queue of pending tasks.
    def submit(self):
        # Ensure this cannot be submitted twice
        self.mutex.lock()
        if self.submitted:
            self.mutex.unlock()
            return
        self.submitted=True
        self.mutex.unlock()
        
        # See if we have reached the parallel execution soft limit
        count=SF_POOL.getActiveCount()
        cLog("Submit")
        if count<SF_MAX_CONCURRENT:
            # No, so submit to the thread pool for execution
            task=sf_callable(self.toDo)
            self.future=sf_futureWrapper(SF_POOL.submit(task))
        else:
            # Yes, execute in the current thread
            self.future=sf_getter(self.toDo)
        cLog("Submitted")

    # Submit all the tasks in the current thread local queue of tasks. This is
    # lazy executor. This gets called when we need results. 
    def submitAll(self):
        queue=SF_TASK_QUEUE.get()
        while(len(queue)):
            queue.pop().submit()

    # The point of execution in the lazy model. This method is what consumers
    # of tasks call to get the task executed and retrieve the result (if any).
    # This therefore acts as the point of synchronisation. This method will not
    # return until the task wrapped by this super future has finished executing.
    #
    # A note on stealing. Consider that we steal taskA. TaskA then invokes
    # get() on taskB. taskB is not completed. The stealing can chain here; 
    # whilst waiting for taskB to complete the thread can just steal another
    # task and so on. This is why we can use the directSubmit for stolen tasks.  
    def get(self):
        cLog( "Submit All")
        # Submit the current thread local task queue
        self.submitAll()
        cLog( "Submitted All")
        t=System.currentTimeMillis()
        c=1
        # There is a race condition where by the submitAll has been called
        # which recursively causes another instance of get on this super future
        # which results in there being no tasks to submit but we get to this 
        # point before self.future has been set. This tends to resolve itself
        # very quickly as the empty calls to submit all do not do very much work 
        # so the origianl progresses and sets the future. Rather than a complex
        # and potentially brittle locking system, we just spin waiting for the
        # future to be set. This works fine on my Mac as it only ever seems to 
        # spin once, so the cost is pretty much the same as a locking approach
        # basically, one quantum. If this starts to spin a lot in the future
        # a promise/future approach could be used. 
        while not hasattr(self,"future"):
            c+=1
            Thread.yield()
        t=System.currentTimeMillis()-t
        cLog( "Raced: ", c ,t)
        
        # This is where the work stealing logic starts
        # isDone() tells us if the thread would block if get() were called on
        # the future. We will try to work steal if the thread would block so as
        # not to 'waste' the thread. This if block is setting up the
        # log/tracking information
        if not self.future.isDone():
            SF_ASLEEP.getAndIncrement();
            cLog("Sleep")
            nap=True
        else:
            nap=False
            cLog("Get")
        # back control increasing backoff of the thread trying to work steal.
        # This is not the most efficient solution but as Sonic Field tasks are
        # generally large, this approach is OK for the current use. We back of
        # between 1 and 100 milliseconds. At 100 milliseconds we just stick at
        # polling every 100 milliseconds.
        back=1
        # This look steal work until the get() on the current task will 
        # not block
        while not self.future.isDone():
            # Iterate over the global set of pending super futures
            # Note that the locking logic in the super futures ensure 
            # no double execute.
            it=SF_PENDING.iterator()
            while it.hasNext():
                try:
                    # reset back (the back-off sleep time)
                    back=1
                    toSteal=it.next()
                    it.remove()
                    cLog("Steal",toSteal.toDo)
                    # Track number of tasks available to steal for logging
                    SF_ASLEEP.getAndDecrement();
                    # The stollen task must be performed in this thread as 
                    # this is the thread which would otherwise block so is
                    # availible for further execution
                    toSteal.directSubmit()
                    # Now we manage state of 'nap' which is used for logging
                    # based on if we can steal more (i.e. would still block)
                    # Nap also controls the thread back off
                    if self.future.isDone():
                        nap=False
                    else:
                        nap=True
                        SF_ASLEEP.getAndIncrement();
                except Exception, e:
                    # All bets are off
                    cLog("Failed to Steal",e.getMessage())
                    # Just raise and give up
                    raise
            # If the thread would block again or we are not able to steal as
            # nothing pending then we back off. 
            if nap==True:
                if back==1:
                    cLog("Non Pending")
                Thread.sleep(back)
                back+=1
                if back>100:
                    back=100
        if nap:
            SF_ASLEEP.getAndDecrement();
        # To get here we know this get will not block
        r = self.future.get()
        cLog("Wake")
        # Return the result if any
        return r

    # If the return of the get is iterable then we delegate to it so that 
    # this super future appears to be its embedded task
    def __iter__(self):
        return iter(self.get())
            
    # Similarly for resource control for the + and - reference counting 
    # semantics for SF_Signal objects
    def __pos__(self):
        obj=self.get()
        return +obj

    def __neg__(self):
        obj=self.get()
        return -obj

# Wrap a closure in a super future which automatically 
# queues it for future lazy execution
def sf_do(toDo):
    return sf_superFuture(toDo)
    
# An experimental decorator approach equivalent to sf_do
def sf_parallel(func):
    def inner(*args, **kwargs): #1
        return func(*args, **kwargs) #2
    return sf_do(inner)

# Shut the execution pool down. This waits for it to shut down
# but if the shutdown takes longer than timeout then it is 
# forced down.
def shutdown_and_await_termination(pool, timeout):
    pool.shutdown()
    try:
        if not pool.awaitTermination(timeout, TimeUnit.SECONDS):
            pool.shutdownNow()
            if not pool.awaitTermination(timeout, TimeUnit.SECONDS):
                print >> sys.stderr, "Pool did not terminate"
    except InterruptedException, ex:
        pool.shutdownNow()
        Thread.currentThread().interrupt()

# The default shutdown for the main pool  
def shutdownConcurrnt():
    shutdown_and_await_termination(SF_POOL, 5)