Lua Lanes - multithreading in Lua

Description

Lua Lanes is a Lua extension library providing the possibility to run multiple Lua states in parallel. It is intended to be used for optimizing performance on multicore CPU's and to study ways to make Lua programs naturally parallel to begin with.

Lanes is included into your software by the regular require "lanes" method. No C side programming is needed; all APIs are Lua side, and most existing extension modules should work seamlessly together with the multiple lanes.

See comparison of Lua Lanes with other Lua multithreading solutions.

Features:

Lanes have separated data, by default. Shared data is possible with Linda objects.
Communications is separate of threads, using Linda objects.
Data passing uses fast inter-state copies (no serialization required)
"Deep userdata" concept, for sharing userdata over multiple lanes
Millisecond level timers, integrated with the Linda system.
Threads can be given priorities -2..+2 (default is 0).
Lanes are cancellable, with proper cleanup.
No application level locking - ever!

Limitations:

coroutines are not passed between states
sharing full userdata between states needs special C side preparations (-> deep userdata)
network level parallelism not included

Supported systems

Lua Lanes supports the following operating systems:

Mac OS X PowerPC / Intel (10.4 and later)
Linux x86
Windows 2000/XP and later (MinGW or Visual C++ 2005/2008)

The underlying threading code can be compiled either towards Win32 API or Pthreads. Unfortunately, thread prioritation under Pthreads is a JOKE, requiring OS specific tweaks and guessing undocumented behaviour. Other features should be portable to any modern platform.

Building and Installing

Lua Lanes is built simply by make on the supported platforms (make-vc for Visual C++). See README for system specific details and limitations.

To install Lanes, all you need are the lanes.lua and lua51-lanes.so|dll files to be reachable by Lua (see LUA_PATH, LUA_CPATH). Or use Lua Rocks package management.

 > luarocks search lanes
 ... output listing Lua Lanes is there ...
 > luarocks install lanes
 ... output ...

Creation

The following sample shows preparing a function for parallel calling, and calling it with varying arguments. Each of the two results is calculated in a separate OS thread, parallel to the calling one. Reading the results joins the threads, waiting for any results not already there.

 require "lanes"
 f= lanes.gen( function(n) return 2*n end )
 a= f(1)
 b= f(2)
 print( a[1], b[1] ) -- 2 4

func= lanes.gen( [libs_str | opt_tbl [, ...],] lane_func )




lane_h= func( ... )

The function returned by lanes.gen() is a "generator" for launching any number of lanes. They will share code, options, initial globals, but the particular arguments may vary. Only calling the generator function actually launches a lane, and provides a handle for controlling it.

Lanes automatically copies upvalues over to the new lanes, so you need not wrap all the required elements into one 'wrapper' function. If lane_func uses some local values, or local functions, they will be there also in the new lanes.

libs_str defines the standard libraries made available to the new Lua state:

	(nothing)	no standard libraries (default)
	`"base"` or `""`	root level names, `print`, `assert`, `unpack` etc.
	`"coroutine"`	`coroutine.*` namespace (part of base in Lua 5.1)
	`"debug"`	`debug.*` namespace
	`"io"`	`io.*` namespace
	`"math"`	`math.*` namespace
	`"os"`	`os.*` namespace
	`"package"`	`package.*` namespace and `require`
	`"string"`	`string.*` namespace
	`"table"`	`table.*` namespace
	`"*"`	all standard libraries

Initializing the standard libs takes a bit of time at each lane invocation. This is the main reason why "no libraries" is the default.

opt_tbl is a collection of named options to control the way lanes are run:

	`.cancelstep` N / true	By default, lanes are only cancellable when they enter a pending `:receive()` or `:send()` call. With this option, one can set cancellation check to occur every `N` Lua statements. The value `true` uses a default value (100).
	`.globals` globals_tbl	Sets the globals table for the launched threads. This can be used for giving them constants. The global values of different lanes are in no manner connected; modifying one will only affect the particular lane.
	`.priority` -2..+2	The priority of lanes generated. -2 is lowest, +2 is highest. Implementation and dependability of priorities varies by platform. Especially Linux kernel 2.6 is not supporting priorities in user mode.

Free running lanes

The lane handles are allowed to be 'let loose'; in other words you may execute a lane simply by:

 lanes.gen( function() ... end ) ()

Normally, this kind of lanes will be in an eternal loop handling messages. Since the lane handle is gone, there is no way to control such a lane from the outside, nor read its potential return values. Then again, such a lane does not even normally return.

Status

str= lane_h.status

The current execution state of a lane can be read via its status member, providing one of these values: ⁽²

	`"pending"`	not started, yet
	`"running"`	running
	`"waiting"`	waiting at a Linda `:receive()` or `:send()`
	`"done"`	finished executing (results are ready)
	`"error"`	met an error (reading results will propagate it)
	`"cancelled"`	received cancellation and finished itself

This is similar to coroutine.status, which has: "running" / "suspended" / "normal" / "dead". Not using the exact same names is intentional.

Results and errors

A lane can be waited upon by simply reading its results. This can be done in two ways.

[val]= lane_h[1]

Makes sure lane has finished, and gives its first (maybe only) return value. Other return values will be available in other lane_h indices.

If the lane ended in an error, it is propagated to master state at this place.

[...]|[nil,err,stack_tbl]= lane_h:join( [timeout_secs] )

Waits until the lane finishes, or timeout seconds have passed. Returns nil on timeout, nil,err,stack_tbl if the lane hit an error, or the return values of the lane. Unlike in reading the results in table fashion, errors are not propagated.

stack_tbl is an array of "<filename>:<line>" strings, describing where the error was thrown. Use table.concat() to format it to your liking (or just ignore it).

If you use :join, make sure your lane main function returns a non-nil value so you can tell timeout and error cases apart from succesful return (using the .status property may be risky, since it might change between a timed out join and the moment you read it).

 require "lanes"
 f= lanes.gen( function() error "!!!" end )
 a= f(1)
 --print( a[1] ) -- propagates error
 v,err= a:join() -- no propagation
 if v==nil then
 error( "'a' faced error"..tostring(err) ) -- manual propagation
 end

Cancelling

bool= lane_h:cancel( [timeout_secs=0.0,] [force_kill_bool=false] )

Sends a cancellation request to the lane. If timeout_secs is non-zero, waits for the request to be processed, or a timeout to occur. Returns true if the lane was already done (in "done", "error" or "cancelled" status) or if the cancellation was fruitful within timeout period.

If the lane is still running and force_kill is true, the OS thread running the lane is forcefully killed. This means no GC, and should generally be the last resort.

Cancellation is tested before going to sleep in receive() or send() calls and after executing cancelstep Lua statements. A currently pending receive or send call is currently not awakened, and may be a reason for a non-detected cancel.

Finalizers

set_finalizer( finalizer_func )

void= finalizer_func( [error] )

The error call is used for throwing exceptions in Lua. What Lua does not offer, however, is scoped finalizers that would get called when a certain block of instructions gets exited, whether through peaceful return or abrupt error.

Since 2.0.3, Lanes prepares a function set_finalizer for doing this. Any functions given to it will be called in the lane Lua state, just prior to closing it. They are not called in any particular order.

An error in a finalizer itself overrides the state of the regular chunk (in practise, it would be highly preferable not to have errors in finalizers). If one finalizer errors, the others may not get called.

Lindas

Communications between lanes is completely detached from the lane handles themselves. By itself, a lane can only provide return values once it's finished, or throw an error. Needs to communicate during runtime are handled by Linda objects, which are deep userdata instances. They can be provided to a lane as startup parameters, upvalues or in some other Linda's message.

Access to a Linda object means a lane can read or write to any of its data slots. Multiple lanes can be accessing the same Linda in parallel. No application level locking is required; each Linda operation is atomic.

 require "lanes"
 local linda= lanes.linda()
 local function loop( max )
 for i=1,max do
 print( "sending: "..i )
 linda:send( "x", i ) -- linda as upvalue
 end
 end
 a= lanes.gen("",loop)( 10000 )
 while true do
 local val= linda:receive( 3.0, "x" ) -- timeout in seconds 
 if val==nil then
 print( "timed out" )
 break
 end
 print( "received: "..val )
 end

Characteristics of the Lanes implementation of Lindas are:

keys can be of number, string or boolean type
values can be any type supported by inter-state copying (same limits as for function parameters and upvalues)
consuming method is :receive (not in)
non-consuming method is :get (not rd)
two producer-side methods: :send and :set (not out)
send allows for sending multiple values -atomically- to a given key
receive can wait for multiple keys at once
individual keys' queue length can be limited, balancing speed differences in a producer/consumer scenario (making :send wait)

h= lanes.linda()

bool= h:send( [timeout_secs,] key, ... )
[val, key]= h:receive( [timeout_secs,] key [, ...] )

= h:limit( key, n_uint )

The send and receive methods use Linda keys as FIFO stacks (first in, first out). Timeouts are given in seconds (millisecond accuracy). If using numbers as the first Linda key, one must explicitly give nil as the timeout parameter to avoid ambiguities.

By default, stack sizes are unlimited but limits can be enforced using the limit method. This can be useful to balance execution speeds in a producer/consumer scenario.

Note that any number of lanes can be reading or writing a Linda. There can be many producers, and many consumers. It's up to you.

send returns true if the sending succeeded, and false if the queue limit was met, and the queue did not empty enough during the given timeout.

Equally, receive returns a value and the key that provided the value, or nothing for timeout. Note that nils can be sent and received; the key value will tell it apart from a timeout.

Multiple values can be sent to a given key at once, atomically (the send will fail unless all the values fit within the queue limit). This can be useful for multiple producer scenarios, if the protocols used are giving data in streams of multiple units. Atomicity avoids the producers from garbling each others messages, which could happen if the units were sent individually.

When receiving from multiple slots, the keys are checked in order, which can be used for making priority queues.

linda_h:set( key, [val] )
[val]= linda_h:get( key )

The table access methods are for accessing a slot without queuing or consuming. They can be used for making shared tables of storage among the lanes.

Writing to a slot overwrites existing value, and clears any possible queued entries. Table access and send/receive can be used together; reading a slot essentially peeks the next outcoming value of a queue.

Granularity of using Lindas

A single Linda object provides an infinite number of slots, so why would you want to use several?

There are some important reasons:

Access control. If you don't trust certain code completely, or just to modularize your design, use one Linda for one usage and another one for the other. This keeps your code clear and readable. You can pass multiple Linda handles to a lane with practically no added cost.
Namespace control. Linda keys have a "flat" namespace, so collisions are possible if you try to use the same Linda for too many separate uses.
Performance. Changing any slot in a Linda causes all pending threads for that Linda to be momentarily awakened (at least in the C level). This can degrade performance due to unnecessary OS level context switches.

On the other side, you need to use a common Linda for waiting for multiple keys. You cannot wait for keys from two separate Linda objects at the same time.

Actually, you can. Make separate lanes to wait each, and then multiplex those events to a common Linda, but... :).

Timers

= lanes.timer( linda_h, key, date_tbl|first_secs [,period_secs] )

Timers can be run once, or in a reoccurring fashion (period_secs > 0). The first occurrence can be given either as a date or as a relative delay in seconds. The date table is like what os.date("*t") returns, in the local time zone.

Once a timer expires, the key is set with the current time (in seconds, same offset as os.time() but with millisecond accuracy). The key can be waited upon using the regular Linda :receive() method.

A timer can be stopped simply by first_secs=0 and no period.

 require "lanes"
 local linda= lanes.linda()
 -- First timer once a second, not synchronized to wall clock
 --
 lanes.timer( linda, "sec", 1, 1 )
 -- Timer to a future event (next even minute); wall clock synchronized 
 --
 local t= os.date( "*t", os.time()+60 ) -- now + 1min
 t.sec= 0
 lanes.timer( linda, "min", t, 60 ) -- reoccur every minute (sharp)
 while true do
 local v,key= linda:receive( "sec", "min" )
 print( "Timer "..key..": "..v )
 end

NOTE: Timer keys are set, not queued, so missing a beat is possible especially if the timer cycle is extremely small. The key value can be used to know the actual time passed.

Design note: Having the API as lanes.timer() is intentional. Another alternative would be linda_h:timer() but timers are not traditionally seen to be part of Lindas. Also, it would mean any lane getting a Linda handle would be able to modify timers on it. A third choice could be abstracting the timers out of Linda realm altogether (timer_h= lanes.timer( date|first_secs, period_secs )) but that would mean separate waiting functions for timers, and lindas. Even if a linda object and key was returned, that key couldn't be waited upon simultaneously with one's general linda events. The current system gives maximum capabilities with minimum API, and any smoothenings can easily be crafted in Lua at the application level.

Locks etc.

Lanes does not generally require locks or critical sections to be used, at all. If necessary, a limited queue can be used to emulate them. lanes.lua offers some sugar to make it easy:

 lock_func= lanes.genlock( linda_h, key [,N_uint=1] )
 lock_func( M_uint ) -- acquire
 ..
 lock_func( -M_uint ) -- release

The generated function acquires M entries from the N available, or releases them if the value is negative. The acquiring call will suspend the lane, if necessary. Use M=N=1 for a critical section lock (only one lane allowed to enter).

Note: The locks generated are not recursive. That would need another kind of generator, which is currently not implemented.

Similar sugar exists for atomic counters:

 atomic_func= lanes.genatomic( linda_h, key [,initial_num=0.0] )
 new_num= atomic_func( [diff_num=+1.0] )

Each time called, the generated function will change linda[key] atomically, without other lanes being able to interfere. The new value is returned. You can use either diff 0.0 or get to just read the current value.

Note that the generated functions can be passed on to other lanes.

Other issues

Limitations on data passing

Data passed between lanes (either as starting parameters, return values, upvalues or via Lindas) must conform to the following:

Booleans, numbers, strings, light userdata, Lua functions and tables of such can always be passed.
Cyclic tables and/or duplicate references are allowed and reproduced appropriately, but only within the same transmission.
- using the same source table in multiple Linda messages keeps no ties between the tables
Objects (tables with a metatable) are copyable between lanes.
- metatables are assumed to be immutable; they are internally indexed and only copied once per each type of objects per lane
C functions (lua_CFunction) referring to LUA_ENVIRONINDEX or LUA_REGISTRYINDEX might not work right in the target
- rather completely re-initialize a module with require in the target lane
Full userdata can be passed only if it's prepared using the deep userdata system, which handles its lifespan management
- in particular, lane handles cannot be passed between lanes
coroutines cannot be passed

Required of module makers

Most Lua extension modules should work unaltered with Lanes. If the module simply ties C side features to Lua, everything is fine without alterations. The luaopen_...() entry point will be called separately for each lane, where the module is require'd from.

If it, however, also does one-time C side initializations, these should be covered into a one-time-only construct such as below.

 int luaopen_module( lua_State *L )
 {
 static char been_here; /* 0 by ANSI C */
 /* Calls to 'require' serialized by Lanes; this is safe.  
 */
 if (!been_here) {
 been_here= 1;
 ... one time initializations ...
 }
 ... binding to Lua ...
 }

Deep userdata in your own apps

The mechanism Lanes uses for sharing Linda handles between separate Lua states can be used for custom userdata as well. Here's what to do.

Provide an identity function for your userdata, in C. This function is used for creation and deletion of your deep userdata (the shared resource), and for making metatables for the state-specific proxies for accessing it. Take a look at linda_id in lanes.c.
Create your userdata using luaG_deep_userdata(), which is a Lua-callable function. Given an idfunc, it sets up the support structures and returns a state-specific proxy userdata for accessing your data. This proxy can also be copied over to other lanes.
Accessing the deep userdata from your C code, use luaG_todeep() instead of the regular lua_touserdata().

Deep userdata management will take care of tying to __gc methods, and doing reference counting to see how many proxies are still there for accessing the data. Once there are none, the data will be freed through a call to the idfunc you provided.

NOTE: The lifespan of deep userdata may exceed that of the Lua state that created it. The allocation of the data storage should not be tied to the Lua state used. In other words, use malloc/free or similar memory handling mechanism.

Lane handles don't travel

Lane handles are not implemented as deep userdata, and cannot thus be copied across lanes. This is intentional; problems would occur at least when multiple lanes were to wait upon one to get ready. Also, it is a matter of design simplicity.

The same benefits can be achieved by having a single worker lane spawn all the sublanes, and keep track of them. Communications to and from this lane can be handled via a Linda.

Beware with print and file output

In multithreaded scenarios, giving multiple parameters to print() or file:write() may cause them to be overlapped in the output, something like this:

 A: print( 1, 2, 3, 4 )
 B: print( 'a', 'b', 'c', 'd' )
 1 a b 2 3 c d 4

Lanes does not protect you from this behaviour. The thing to do is either to concentrate your output to a certain lane per stream, or to concatenate output into a single string before you call the output function.

Performance considerations

Lanes is about making multithreading easy, and natural in the Lua state of mind. Expect performance not to be an issue, if your program is logically built. Here are some things one should consider, if best performance is vital:

Data passing (parameters, upvalues, Linda messages) is generally fast, doing two binary state-to-state copies (from source state to hidden state, hidden state to target state). Remember that not only the function you specify but also its upvalues, their upvalues, etc. etc. will get copied.
Lane startup is fast (1000's of lanes a second), depending on the number of standard libraries initialized. Initializing all standard libraries is about 3-4 times slower than having no standard libraries at all. If you throw in a lot of lanes per second, make sure you give them minimal necessary set of libraries.
Waiting Lindas are woken up (and execute some hidden Lua code) each time any key in the Lindas they are waiting for are changed. This may give essential slow-down (not measured, just a gut feeling) if a lot of Linda keys are used. Using separate Linda objects for logically separate issues will help (which is good practise anyhow).
Linda objects are light. The memory footprint is two OS-level signalling objects (HANDLE or pthread_cond_t) for each, plus one C pointer for the proxies per each Lua state using the Linda. Barely nothing.
Timers are light. You can probably expect timers up to 0.01 second resolution to be useful, but that is very system specific. All timers are merged into one main timer state (see timer.lua); no OS side timers are utilized.
Lindas are hashed to a fixed number of "keeper states", which are a locking entity. If you are using a lot of Linda objects, it may be useful to try having more of these keeper states. By default, only one is used (see KEEPER_STATES_N), but this is an implementation detail.

Cancelling cancel

Cancellation of lanes uses the Lua error mechanism with a special lightuserdata error sentinel. If you use pcall in code that needs to be cancellable from the outside, the special error might not get through to Lanes, thus preventing the Lane from being cleanly cancelled. You should throw any lightuserdata error further.

This system can actually be used by application to detect cancel, do your own cancellation duties, and pass on the error so Lanes will get it. If it does not get a clean cancellation from a lane in due time, it may forcefully kill the lane.

The sentinel is exposed as lanes.cancel_error, if you wish to use its actual value.

Change log

Jan-2009 (2.0.3):

Added 'finalizer' to lane options. (TBD: not implemented yet!)
Added call stack to errors coming from a lane.

Jul-2008 (2.0):

Too many changes to list (you'll need to re-read this manual)

For feedback, questions and suggestions: