Planet Smalltalk

April 18, 2014

Benoit St-Jean - James Robertson : 1961-2014

Une chanson que j’associerai à tout jamais à James… C’était l’intro de ses podcasts…

Deeply, de Bôa

Classé dans:musique, Smalltalk Tagged: Bôa, Deeply, musique, Smalltalk

Eliot Miranda - Primitives and the Partial Read Barrier

Aperitivo If you’ve read my post on the partial read barrier you’ll know that one of the issues in the new Spur VM is how primitives deal with forwarding objects amongst their arguments. Recall that become: is lazy and is implemented by copying objects and updating the originals to be forwarding pointers to the copies. […]

Benoit St-Jean - James Robertson

C’est avec une grande tristesse que nous apprenons le décès de James Robertson, figure bien connue de tous dans la communauté Smalltalk…

Nos pensées et condoléances à sa famille et ses amis.

Classé dans:Smalltalk Tagged: James Robertson, Smalltalk

Benoit St-Jean - SciSmalltalk 0.9

À tous ceux qui ont à faire du traitement numérique, des statistiques ou dont les programmes Smalltalk nécessitent des mathématiques non triviales, sachez que le code de l’excellent livre de Didier H. Besset, Object-Oriented Implementation of Numerical Methods: An Introduction with Java & Smalltalk, a été porté (et est maintenu) depuis un certain temps sur Pharo Smalltalk.

La version 0.9 du package SciSmalltalk vient de sortir tel qu’annoncé ici!

Les initiés reconnaîtront, ci-dessous, un attracteur de Lorentz. Le graphique a été généré par l’engin de visualisation Roassal!


Vous pouvez suivre le développement de SciSmalltalk sur Twitter (@SciSmalltalk), sur GitHub (ici), sur Google Groups (ici) ou sur la mailing list de Google Groups (

Le port original du code de Besset est disponible ici pour ceux qui oseraient s’aventurer à amener SciSmalltalk/DhbNumerical sur une autre plate-forme Smalltalk…

Classé dans:mathématiques, Packages, Pharo, Smalltalk, statistiques Tagged: attracteur de Lorentz, Didier H. Besset, mathématiques, Object-Oriented Implementation of Numerical Methods: An Introduction with Java & Smalltalk, Pharo, Roassal, SciSmalltalk, Smalltalk, statistiques, traitement numérique

Pharo News - [ANN] SciSmalltalk v0.9 is out

The SciSmalltalk project provides tools for scientific computation in Smalltalk.

more here:


Pharo Weekly - Commits: March 3










Benoit St-Jean - Sista

Pour ceux qui s’intéressent aux VM Smalltalk (Cog), aux compilateurs et à l’optimisation, un excellent article (en anglais) sur Sista par Clément Béra ici.

Classé dans:Cog, Machine virtuelle, Smalltalk, VM Tagged: Cog, compilateur, optimisation, Sista, Smalltalk, VM

Clément Béra - The Sista chronicles V: First optimized method running

Hey guys,

For the last 3 weeks, I have been in San Francisco to work on Sista with Eliot Miranda, especially on the communication between the in-image optimizer and the Cog runtime / Cog JIT. Last tuesday, for the first time, the VM detected a hot spot via a tripping counter, then called the in-image optimizer, which optimized the method, installed it, and at the next call the optimized method was used (Do not misunderstand me, on-the-fly stack replacement does not work yet). This is a big step. However, the VM right now does not implement all the additional sista bytecodes and inlined primitives so the optimized methods performance is quite low :-(.

With the current settings we (we = Eliot Miranda and I) have, while running the Pharo kernel tests, 5350 counters tripped, which in fact corresponds to 187 different tripping methods (so yeah, the assumption that the code usually run almost always the same methods is verified). Basically the optimizer was able to optimize 80 methods out of the 187. The optimizer failures were mostly due to:

  • (2750/5350 cases) send data information forbidding inlining (Right now the optimization is cancelled if some critical sends are not inlined):
    • 950 megamorphic inline cache on the inlining path
    • 600 polymorphic inline cache on the inlining path
    • 1150 missing send data on the inlining path. This could mean that the counter trip on an uncommon case, i.e., the tripping method was called 65000 times from a certain method, but the last time, when the counter tripped, it was called from another method rarely used. It can also be due to some unexpected machine code garbage collection
  • (1250/5350 cases) the VM not being stable enough right now in the mapping machine code pc -> bytecode pc (only the Sista VM is not stable)

Interestingly, some non local returns are very hard to optimize, but in the current model, I had only 17 optimizations out of 5350 that cancelled due to unoptimizable non local returns. Therefore I assume this is not a common case.

Another interesting point, out of over a thousand successful method optimizations, only a 5 of them exceeded 50 ms, which is the maximum pause a user can tolerate. As the Spur garbage collector of Pharo 4 will double Pharo’s performance, this case is therefore also uncommon (at least on my Macbook 2.5Ghz Intel Core i5 with lots of applications running aside).

There is a long way to go before an alpha version though. I noticed that the optimizer decompilation and the inlining optimizations passes are *very* stable (thanks Ronie Salgado for showing me a good design for the inliner), but the range optimizations are not finished, and the byte code generation is not very good and even sometimes leads to incorrect byte codes. My next move will be on the code generation part to stabilize the overall optimizer.

Recently I simplified a lot the algorithms to convert the control flow graph to SSA (single static assignment) and the liveness analysis by studying Crankshaft, javascript V8 optimizing compiler (See information about Crankshaft on this blog and the clean C++ source code here). The new algorithm’s performance is very similar for the SSA conversion (even though the algorithm is much simpler), and the liveness analysis is just *way* faster for big methods.

The V8′s design influence on sista is nice, as I was for a long time strongly influenced by LLVM’s design (information about LLVM here), whereas the implementation of static C compiler usually prefers slow algorithm finding the perfect solution for an issue instead of a fast algorithm finding a suboptimal but almost perfect solution. However, you can still feel the C compilers influence by looking at the control flow graph manipulations which are much more aggressive than V8′s ones. In some way we could say that Sista is an hybrid optimizer based on GCC’s, LLVM’s and V8′s design. This is partly because some of Sista’s features (postponing the optimization to the a background process when it takes more than 50 ms, incremental optimizations) allows to spend more time on the optimizer.

The byte code generation have a main issues right now: the number of temporaries in optimized method is huge. There are several reasons for that:

  • The deSSA (algorithm that removes phi nodes) performs poorly. This is because I took it from an old prototype, and it is supposed to merge multiple temporaries only if they were the same temporaries in their unoptimized method, forcing the optimizer to create extra temporaries at each inlining pass. The idea was to keep values for the deoptimization process. However, having now implemented the deoptimization and having it working, I can tell you that this is complete non sense. The Sista intermediate representation is completely independent of any temporary representation, in fact a temporary in the Sista IR just means a slot on the stack that have the SSA property. And I didn’t invent anything there, Crankshaft’s intermediate representation hydrogen is also temporary variable independent.
  • After inlining multiple methods, the number of temps of each method, and in addition their “hidden” temporaries due to loops or stack top values, are added.
  • Currently the deoptimization requires that each value needed in the deoptimization scopes to be put in a temporary variables in the byte code, whereas for most of them I could just use the top of the stack. This is not a real problem, as the solution mostly consists in spending time on improving it.

To improve this issue, I am completely rewriting the conversion from the SSA control flow graph to Opal’s IR, the bytecode generator input. The main difference is that the deSSA algorithm will now merge as many temps as possible thanks to Briggs-Chaitin graph coloring algorithm and use a V8-inspired algorithm for liveness analysis. This should greatly decrease the number of temporaries (especially after inlining).

[BEGIN graph coloring digression ...]

The general idea of Briggs-Chaitin graph coloring algorithm is that you want to color each node of a graph such as there can’t be 2 nodes with a relationship having the same color. In the real world, each node is a sista IR temporary (let’s say a stack location, it’s not really a temporary) before the algorithm, and after the algorithm each node having the same color can share the same temporary variable index in the byte code. This algorithm is usually used for register allocation, but sista doesn’t allocate the registers, this is left to the low level JIT. However, my issue of reducing the number of temporary variables is clearly similar to register allocation.

As I am able to visualize everything I do in a few minutes thanks to Roassal (It took me only 4 min 42 sec to program this visualization, Roassal is *so* helpful, thank you Roassal guys), here are some visualizations of the Graph Coloring algorithm:

10 nodes and 3 colors:
Screen Shot 2014-04-17 at 3.46.33 PM

12 nodes and 4 colors:
Screen Shot 2014-04-17 at 3.51.12 PM

30 nodes and 7 colors:
Screen Shot 2014-04-17 at 5.31.13 PM

Now I just need to make it work nicely with my deSSA and byte code generation passes :-). I will be in vacation for the next two weeks so it may take a while.

[END graph coloring digression ...]

By the way, I know I still need to write down what method I optimize when a counter trips based on the current execution stack and why (and I said I would do so, I’m sorry I haven’t done it yet). For the ones that have read Urs Hölzle phd, you know that the method to optimize is not necessarily the method that has the tripping counter. Therefore you need to add some logic to find the best method to optimize. That is done and working, but I need to spend time to write about it…

Anyway I still recommend to read Urs phd, my explanation will obviously be less good than Urs’ ones, as he has a better English style and as I discussed with Eliot Miranda recently, the way he explains things make the optimizer’s concepts looking like they are very simple whereas they are very complex, which shows how good he is.

Hope you enjoyed the post !

PS: If you are interested in a specific aspect of Sista that I do not mention in my blog posts, please say it in the comment, I will answer you or perhaps even write a dedicated post.

April 17, 2014

Pharo Weekly - Font Awesome for Seaside


FontAwesome is (as you may know) an iconic font designed for the user with Twitter Bootstrap.

I now created a "FontAwesome for Seaside" project - which is a small Seaside wrapper for the 
FontAwesome project (using the latest version 4.0.3). 

This is intended as an addition to the already available "Bootstrap for Seaside" [2] project 
that I wrote and both should allow you to give your Smalltalk based web application a nice 
stylish look.

The project is located on STHub, see [1] where you will also find the documentation.

A live demo can be found on

To try yourself load it from the Pharo config browser or read the docu to see what is required.



Pharo Weekly - Mapless

Sebastian Sastre juste released Mapless.

Mapless is a small framework for storing objects in a key->data fashion (i.e.: noSQL databases) without requiring any kind of object-data map. So far only MongoDB is supported. It can use Redis for reactivity (pub/sub) and cache.


I wanted to persist objects with extremely low friction and extremely low maintenanceand great scaling and availability capabilities so Mapless is totally biased towards that. This framework is what I came up with after incorporating my experience withAggregate.

There is no spoon…

There is no object-relational impedance…

There is no instVars…

only persistence :D
Code and instructions here:
All MIT, enjoy

Torsten Bergmann - Mapless

Mapless is a small framework for storing objects in a key->data fashion (i.e.: noSQL databases) without requiring any kind of object-data map. So far only MongoDB is supported. Read more here or on GitHub.

Torsten Bergmann - Aggregate

Aggregate is a small persistance framework with a clean API and full ACID features that uses OmniBase as backend and supports BTree-based indexing. Find more on GitHub

Torsten Bergmann - Versionner for Pharo 3.0

How to use the new Versionner tool in Pharo 3.0:

ESUG news - [ANN] FontAwesome for Seaside

Torsten Bergmann announce FontAwesome for Seaside:

FontAwesome is (as you may know) an iconic font designed for the user with Twitter Bootstrap.

I now created a "FontAwesome for Seaside" project - which is a small Seaside wrapper for the FontAwesome project (using the latest version 4.0.3).

This is intended as an addition to the already available "Bootstrap for Seaside" [2] project that I wrote and both should allow you to give your Smalltalk based web application a nice stylish look.

The project is located on STHub, see [1] where you will also find the documentation.

A live demo can be found on

To try yourself load it from the Pharo config browser or read the docu to see what is required.

Thx T.

Torsten Bergmann - FontAwesome for Seaside

FontAwesome is (as you may know) an iconic font designed for the user with Twitter Bootstrap.

I now created a "FontAwesome for Seaside" project - which is a small Seaside wrapper for the FontAwesome project (using the latest version 4.0.3).

This is intended as an addition to the already available "Bootstrap for Seaside" project that I wrote and both should allow you to give your Smalltalk based web application a nice stylish look.

The project is located on STHub, there you will also find the documentation. A live demo can be found on

To try yourself load it from the Pharo 3.0 configuration browser or read the docu to see what is required.

April 16, 2014

Cincom Smalltalk - Smalltalk Digest: April Edition

The April edition of The Cincom Smalltalk Digest is available now.

Smalltalk Jobs - Smalltalk Developer, UK, France, Belgium, Netherlands, Germany or Switzerland

Cincom’s ChannelStream project is looking for a developer.  ChannelStream uses VisualWorks and Seaside.  The candidate can be based in the UK, France, Belgium, Netherlands, Germany or Switzerland (and must obviously be OK with some travel and some remote working, screen-sharing and etc.).  Strong Smalltalk development skills are wanted.  The role includes occasional work onsite with customers.

Filed under: Employment

Cincom Smalltalk - Do You Want to Play “2048”?

By Arden Thomas, Cincom Smalltalk Product Manager.

Pharo News - AstroCloud - using Pharo and Roassal to visualise Molecular Clouds

AstroCloud is an application that uses Roassal engine (built on top of Pharo Smalltalk) to visualize astronomy images. Read more or have a look at the pictures here.

Code is on SmalltalkHub.


External Page: 1294375_617967844956483_8154602562247258197_o.png

JR's Smalltalk 4 You - ST 4U 558: Tabbed Browsing in CST

Instead of opening multiple browsers in VW or OS, you can simply open multiple views in a single browser

April 15, 2014

Smalltalk Jobs - Smalltalk Jobs – 4/15/14

  • Atlanta, GAJunior Gemstone Database Administrator at IntercontinentalExchange (ICE)
    • Required Skills:
      • Bachelor’s Degree in Computer Science or commensurate experience.
      • Basic Smalltalk
      • Strong communication skills
      • Must work well in a fast-paced, deadline driven environment
      • Organized, strong initiative and teamwork orientation are necessary
    • Wanted Skills:
      • VisualWorks
      • GemStone administration (but company will train for the DBA role)
      • UNIX experience and scripting
      • Knowledge of basic DB principles
Good luck with your job hunting,
James T. Savidge

View James T. Savidge's profile on LinkedIn

This blog’s RSS Feed

Filed under: Employment

Torsten Bergmann - AstroCloud - using Pharo Smalltalk and Roassal engine to visualize Stars and Molecular Clouds

AstroCloud is an application that uses Roassal engine (built on top of Pharo Smalltalk) to visualize astronomy images. Read more or have a look at the pictures here.

Code is on SmalltalkHub.

Cincom Smalltalk - The New Class ‒ PowerState

The Windows operating system provides a comprehensive, system-wide set of power-management features. This enables systems to extend battery life and save energy, reduce heat and noise and help ensure data reliability.

Arden Thomas - 2048


Afternoon app – 2048

You may or may have not heard about a popular new game called 2048.

We have some very exciting new work in our development builds, a new source code editor, which I wanted to exercise.

Around the same time, I saw the article linked above about the game 2048.  It is a simple four by four grid with numbers that you slide around, trying to add them up to …… 2048!


I downloaded a free version of 2048 onto my iPhone, and there is also browser based version here. I checked it out and it looked pretty interesting, so I decided to implement a simple version of it as an “afternoon app”.  An Afternoon app is a simple time-boxed implementation.  Since Smalltalk lets you do a lot in a short amount of time, it is a great choice.

I built it and published it on the public repository.  You can find it as TwentyFortyEight.

It is an MVC designed app with these three classes.




I reverse engineered it by observation.  I tried a couple simple ways to process the moves, then refined them.    My #process:  method takes an array of the four items in a row or column, and looks like this:

process: array

“process an array of the grid”

self compress: array.

self combine: array.

self compress: array.


#compress:         slides the elements down to the far end of the array.

#combine:           adds adjacent cells of the same number, one pass

The second compress is required in situations where there are two combines in the one pass.

Since you can move four directions, up, down, left, right, it gives #process the row or column for right, down respectively.  For left and up it reverses the array for #process, and then flips it back for integration into the grid.

I made an assumption which turned out to process incorrectly in some circumstances.  Rather than fix it, I left it as a challenge to find (not too hard, not to trivial) and change for those who want to tinker with the application.   There is another needed fix (hint: the new value that appears) which is a very easy fix.

How can the implementation be improved?  Other than the needed fix mentioned above, the application does not detect when the game is over, or allow you to restart it.  These are easy fixes.

Another relatively easy improvement would be to add the running score.  When numbers are combined, their value can be added to the score tally.

A big improvement that would take some more time would be to add animation.  This would probably require model restructuring or changes to communicate in terms of the moves in order for the view to animate them. Visually, adding animation would be a big improvement.  I welcome any developers to give it a try and share their results.  I may give it a shot myself.

What else could you do?  You could try using different algorithms for auto-playing the game, in order to discover or test heuristics in order to maximize your score.

If you have any more ideas or any questions, I would be happy to hear from you.  Reach me at the email below.

Best Regards and happy Smalltalking!

Arden Thomas

Cincom Smalltalk Product Manager

Pharo News - Visualizing Delphi with Moose

Speaker: Stephan Eggermont

Moose provides tools allowing the analysis, visualization and refactoring of Delphi source code.

Talk from Fosdem2014.

Alan Knight - Dubious Panegyric

Another one of Rixford Knight's articles from the Atlantic, and a couple more photographs. My father was particularly fond of the line "I'd walk a mile to kick a goat."

Dubious Panegyric

by Rixford Knight

The poor man's cow is a sensitive index to business conditions because, when things get tough, lots of people begin to think about getting a goat. Correlation is inverse, of course – the more goats the less business.

Possession of a goat may also be an indication of intellectuality. Even in dire circumstances, only a person of considerable originality would think of getting a goat. Several of my friends have goats and they are all originals.

We may have something here, and I hope so because I have been rationalizing for years trying to justify my keeping a goat. I have hitched her to a wagon, taught her tricks, eaten her progeny, and milked her, and have not yet found a reasonable basis on which to excuse my ownership of her.

Whatever it indicates, the goat population in Vermont is on the increase. People are becoming "goat conscious." I have this from the lady who sold me my goat and who wished me to understand that the expression meant that the demand for goats was increasing, and that her price was not unreasonable.

Another goat dealer whom I interviewed assured me that once I had kept a goat I would never be without one. I can appreciate this too, because once you have a goat it is very hard to get rid of her. There are sound economic reasons why it is hard to dispose of a goat, but there are other reasons also. It is hard to love a cow, who has no personality and does nothing but give milk, but a goat abounds in personality and becomes as much a part of the family as a mischievous dog – even more than a dog, because a goat's capacity for mischief far exceeds a dog's.

Legend has it that goats will eat tin cans, but this is not so. They will eat only the labels from tin cans, and even here they are fastidious and will not eat the labels from corned-beef cans. This is not out of sympathy for a sister ruminant, because they will eat dried beef scraps if the scraps are salted.

A goat likes apples but will not accept one from which you have first taken a bite. She will accept it, however, after the offending portion has been cut off with a knife, provided the knife has not previously been used to slice plug tobacco. A goat does not dislike tobacco, because she will eat cigarettes. I know one poor woman who denied herself a package of cigarettes a day in order to feed them to her goat who loved them so. The goat is dead and the poor woman is still alive – but this does not mean that cigarettes were bad for the goat that her deprivation of them was good for the woman because the woman is in better health now than when the goat was alive.

It is said that goat meat tastes like mutton, but this has not been our experience. We could not eat it, and neither would the dog or the cat. On the other hand, some mutton that we bought tasted just like goat and yet we ate it, and so did the cat, but not the dog. Once we were given meat which was said to be goat and tasted like beef. We ate it gratefully and said nothing to anyone about the bullet we found in it. The dog and cat watched reproachfully while we ate.

People who keep goats are even more interesting than the goats themselves. Not all peculiar people keep goats, but all people who keep goats are peculiar. This does not apply to foreigners, but it applies without fail to Americans not brought up in the goat tradition.

People keep goats believing them to be a cheap source of milk. They need a cheap source of milk because they do not fit into the present established American way of life, and they do not fit this life because they are peculiar. Yet they are peculiar only if the American way of life is unpeculiar, which is not established.

This makes a hard problem. If a person does not like singing commercials, baker's bread, working for a boss, or keeping up with the Joneses, he can avoid these things by moving to the country and keeping a goat. But keeping a goat involves many hardships. The person must decide whether he is inherently antipathetic to singing commercials and the like, or merely allergic to them from environmental causes. If the latter, it might be easier for him to train himself to like them – say in ten years – than to keep a goat and avoid them.

Of course it is possible to graduate from a goat to a cow. I have a cow but still retain the goat. A friend was more fortunate. He didn't like his goat and got rid of her. "You can't keep one behind a fence," he said. "So you have to tether her out – by a chain, not a rope. A goat is not like a pole bean that always runs anti-clockwise. And whichever way she runs she will stick to it and will immediately wind herself up on the stake and then tie a knot in the chain. She will stand still until you have got a knot nearly untied, and then will give you a jump and catch your thumb in the loop. I'd walk a mile to kick a goat."

But a cow is not the solution either. A cow gives lots of milk but she also eats lots. There is a saying that you should keep a number of cows or none at all.

The man with one or two cows can't afford expensive haying equipment and will have to work by hand. A cow lives on pasture for six months and on hay for six months. While on pasture, she operates by wrapping her tongue around a tuft of grass, pulling it into her mouth, seizing it between her lower teeth and upper lip, and tearing it off. She does this steadily and stows away grass at about the rate at which a subsistence farmer, armed with a scythe, can cut hay, turn it, and lug it into the barn.

Now get the mathematics of this: six months pasture, six months barn. Farmer cuts hay at same rate as cow in pasture eats grass. Of course the cow must rest at times, and when she does, the farmer may feel free to drop his scythe and relax with his wife. But he must watch out. The minute the cow gets up from her nap he must go back to his scythe or he will be short of hay next winter and will have to buy some – which is fatal to the man who has been hoping to get away from Wall Street and the American way of life.

My goat has kidded three times, the first time giving us one buck which, by a shrewd business maneuver, I succeeded in giving away. The next time she gave us two bucks, and the third time three bucks which didn't live for the christening. I am now holding my breath and wondering how the goat population manages to maintain itself, to say nothing of increasing.

Two lusty kids will take about all the milk an average doe can provide, and before being weaned may need more. It seems illogical to keep a goat in order to save the expense of a cow and then have to get a cow to feed the goats, but that is what one goat raiser had to do. He had two goats he wanted to sell. One, for $20, gave three quarts of milk a day. The other, for $40, gave three and one-half quarts a day. I detected an interesting, possibly formidable, mathematical progression here and, after several moments' thought, asked if a four-quart goat would cost $60 or $80. My efforts to discover the combination of his pricing system finally irritated the man and I left him and bought my goat from a woman who had only one for sale and whose pricing system therefore, as I have explained, was based on the immutable law of supply and goat consciousness.

My doe gives a little under three quarts a day when she freshens, but keeps this up for only a couple of months. For a family with children it is necessary to have two goats and stagger their breeding dates six months.

Five or six goats will eat as much as one cow but will not give as much milk unless they cost more than the cow. Supply catching up with demand is unlikely to alter this cost ration in a way favorable to the goats, because it is not too hard to raise one calf, but raising five or six kids will drive a person crazy.

Looked at in another way, it is not economical for me to keep a cow that gives fifteen to twenty quarts a day when all I need is four or five. Of course there is butter, but by the time the utensils are washed we have spent two hours making it, and even at a dollar a pound we could do better with our time by buying butter. As for cottage cheese – don't mention the word to me.

Here is an odd fact about dairy products. A given quantity of milk sells for a higher price than the cream taken from that milk. Butter made from that cream will bring less than the cream. In other words, the more work you put into it the less you get out of it. This is actually true if you can get retail milk customer who will pay up.

The same is true of goats as opposed to cows. In respect of time and the work involved, keeping a goat is an expensive way of getting one's milk. Yet the people who get their milk in this way seem healthier and happier than those who court efficiency and get their milk in a bottle. As I've said – people who keep goats are peculiar.

April 14, 2014

Benoit St-Jean - Bee Smalltalk

Un nouveau venu est apparu dans notre bien-aimée communauté Smalltalk  : Bee Smalltalk.

Plus d’infos dans le vidéo ici.

Sur un autre sujet, la version 4.5 de Squeak Smalltalk est maintenant disponible! Changement majeur, une espèce de namespace appelé "environments".  Bien que je n’ai pas encore eu le temps d’évaluer cette nouvelle fonctionnalité, disons seulement que c’est un apport bénéfique et attendu!

Autre nouveauté, la version 0.12.4 de Amber Smalltalk vient à peine de sortir!


Classé dans:Amber, Bee Smalltalk, Smalltalk, Squeak Tagged: Amber, Bee Smalltalk, Smalltalk, Squeak

Bee SmaRT - Working on multithreading

A Smalltalk runtime that is able to execute concurrent sequences of code would be very useful. Seeing the rise of multicore/multiprocessor hardware and the direction the industry is taking, increasing the amount of cores in processors each year, makes us think we should be prepared.

The simplest way of making use of multiple cores is to just avoid threads and directly work on multiple processes, connected by some kind of IPC. This is easier because different processes share nothing, except what you explicitly choose. Isolation helps program safety concurrency wise, as two fully separated processes cannot have race conditions between them.

Then you would ask, why not just use processes? The main reason is that threads can be lighter and, in some way, more straightforward (as different threads share all their memory and resources). But also the question can be stated conversely, why not threads? Giving the option of using threads would be nice, so why not? There are lots of VMs wandering around out there supporting a myriad of different languages, but not many allow multithreaded execution of code (some notable exceptions are JVM, Jikes and CLR). Being able to execute many threads can simplify some VM implementation details and complicate others. Then, in order to understand how we can make a multithreaded Bee, we should better deep into what are the inherent problems of concurrent code execution.

There's an important point to add here. If we want to explore if it is possible to support concurrent independent smalltalk threads, we should do it a this stage, and wait no longer. As we'll see here, threading support can affect many system design decisions, so adding it later can require lots of changes and make it too costly. Also, adding threading support to a minimal system is a lot easier than to a big one. Then, to facilitate development, we should strive to make it work very well from the beginning, and then analyze how it behaves with each addition to the system.

Basic implementation and testing

The basic problems that arise when writing multithreaded code are related to resource access. Usually, race conditions start popping, because of unsynchronized concurrent accesses to memory locations. Good programmers write code in a way that avoids most pitfalls that lead to race conditions, but there always are some corners to cut. Avoiding global variables is the starting point of thread-safe code. Unfortunately, Smalltalk is plagued by global variables. Symbols, classes, the Smalltalk global, all the meta-model is global in Smalltalk, so we have a bad starting point. But if we think a bit, many of these things are immutable and most rarely change.

The way we test the multithreading scenario is also interesting. For now, we decided to go for a simple test. Instead of implementing and testing some known multithreaded algorithm, we just take all of our tests and run them in parallel. That is,

  | selectors results |
  selectors := self allTests.
  results := selectors
    concurrentCollect: [:selector | self runTest: selector].
  ^results conform: [:result | result]

FixedSizeCollection>>#concurrentCollect: aBlock
  | answer size token |
size := self size.
token := Object new.
answer := self species new: size.
answer atAllPut: token.
  to: size
  do: [:index | [| x y |
    x := self at: index.
    y := aBlock evaluateWith: x.
    answer at: index put: y] forkThread].
  [answer includes: token] whileTrue: [].

We still have to think the API we are going to implement but this is a nice starting point. Finally, we should emphasize that are planning the 'do nothing' kind of thread support here. This means, we are not going to add synchronization locks all around the system. It is the developer who knows how threads are going to be used, should him be the one that configures the necessary locking. But there are some points were the system should know that there can be many threads out there, lets explore them.


A good example is object allocation. In bee, objects are allocated by aGCSpace. When receiving new, the class calculates the amount of space needed and sends #allocate: bytes to the global gcSpace. If many threads are allocating objects in the same gcSpace, they should better synchronize the allocation. Look at this naive implementation of allocate:

GCSpace>>#allocate: size
| answer next |
answer := nextFree.
next := answer + size.
nextFree := next.

Now suppose two threads fall into #allocate: at roughly the same time. It could happen that both threads read the same value of nextFree (the one before updating), returning a same address for two different objects. This will quickly lead to random memory corruption and undeterministic program crashes. There are two main ways of solving this: replacing the global gcSpace with many smaller spaces, or adding some kind of mutex. The figure shows both options, notice that while shared spaces need a mutex, split spaces get only half of the total address space for each one.

There's also the option to mix both things, each gcSpace having independent pages, wouldn't need a mutex to allocate individual objects withing the page. When the page is exhausted it could take more pages from a global space, accessed with a mutex. That would be the most efficient trade-off, but would complicate implementation. For now, lets check the simplest mutex solution:

GCSpace>>#allocate: size
    | answer next |
    mutex _busyWait.
    answer := nextFree.
    next := answer + size.
    nextFree := next.
    mutex _release.

where mutex is a 4-bytes byteArray initialized to -1, (-1 meaning lock is free).

low-level mutex
The figure shows the implementation of both #_busyWait and #_release. The array starts with -1. xchg swaps the values of [eax] and ecx, atomically. Thus, the value of ecx after xchg can be 0 or -1. If it is -1, it means the lock was free, so we are clear to go. If its value is 0, it means somebody else has already reserved the lock and we have to wait. Notice that in that case the xchg didn't make any change in memory, as both [eax] and ecx were 0. In this implementation there is no "wait", we continue reading the mutex until it becomes -1. For #_release, we have to use lock prefix to give dec instruction atomic semantics (xchg is always atomic thus doesn't require lock prefix). Also notice that just after acquiring the lock we read the value of next free. We have to be careful that processors don't reorder reads and writes in this sensitive code sections, or the lock would become useless. We know this won't be a problem because instructions with lock semantics assure this reordering will not happen.

This was a simple solution so we didn't need to split the global gcSpace. But there are other not thread-safe places. When looking at native code for example, there are some inherently unsafe points.

Call site patching

If you recall the post about lookup, after doing lookup and before method activation comes call site patching. Supose we are sending #yourself, before patching we have this assembly at call site:

[68 4-bytes address] push #_yourself
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]

which the call site patcher transforms to:

[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[90] nop
[E8 4-bytes displacement] call (Object>>#yourself) nativeCode

The transformation is done by overwriting the 11 bytes, with three consecutive 4-byte writes (one byte is written twice). A second thread could fall after the first patch of 4 bytes, finding this

[90] nop
[90] nop
[90] nop
[90] nop
[??] (#_yourself oop most significant byte)
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]

the are not many satisfying solutions to this problem. One trick could be to insert a back jump at first write, like this

[EB FE] jmp -2
[90] nop
[90] nop
[??] (#_yourself most significant byte)
[FF 15 4-bytes address] call [(Object>>#_lookupAndInvoke:) nativeCode]

but what happens if thread B was already decoding the push instruction when the write was issued. Is decoding atomic? This approach isn't satisfactory so we didn't apply it. Another way (we haven't implemented it yet), is to avoid instruction replacement in favor of instruction mutation, always issuing the same kind of call. We could directly do this from the beginning:

[E8 4-bytes address] call (Object>>#_lookupAndInvokeYourself:) nativeCode

notice that this (Object>>#_lookupAndInvokeYourself:) would indirectly call lookup:

(Object>>#_lookupAndInvokeYourself:) nativeCode
    [68 4-bytes address] push #_yourself
    [FF 15 4-bytes address] jmp [(Object>>#_lookupAndInvoke:) nativeCode]

this is almost exactly the same code we had at call site, just that we have extracted it into a separate native code, and changed the call for a jump (avoiding a double call). The patch of the call site will not replace the instruction -it will still be a direct call- but will only mutate the address:

[E8 4-bytes address] call (Object>>#yourself:) nativeCode

the problem here is that movs are not guaranteed to be atomic, unless aligned to the size you are moving. Then we should assure that the call site 4-byte displacement is correctly aligned (inserting nops if needed), like this:

0x100000 [90] nop
0x100001 [90] nop
0x100002 [90] nop
0x100003 [E8 4-bytes displacement] call (Object>>#yourself) nativeCode

and we are still assuming this won't break instruction decoding. We shall experiment and see what happens, but luckily AMD Application Programming manual (3.11 Cross-Modifying Code), seems to prove us right. We could also have a per-thread code cache, but I think it's a bit expensive and unnecesary. For now, we just disabled call-site patching in multithreading targets.

Global Lookup

Other problem with lookup is the global dispatch cache. This is not a thread-safe object and it will never be. Lookup is executed frequently and can be slow, so we cannot leave other threads blocked waiting for a lock. This cache contains an internal lookup table that is altered on frequently so accessing it without a lock corrupts it quickly. The solution we implemented is to have one dispatch cache per thread (you could also have one per core to optimize resources, the dispatch cache measures roughly 4kb). To allow this, we reified the Thread object. A thread object contains a dispatchCache. When created, a thread executes a block, first doing initialization:

Thread class>>#newOn: aBlock
    ^self new execute: aBlock

Thread>>#execute: aBlock
    | method entrypoint |
    method := self _lookup: #value:.
    entrypoint := [
        self setCurrent.
        aBlock value].
    method newThreadOn: entrypoint

Thread>>#value: aBlock 
    aBlock value

  KernelLibrary tlsSetValue: tlsThreadIndex with: self.
  KernelLibrary tlsSetValue: tlsGlobalDispatchCacheIndex with: globalLookup

The code is a bit complex because we have to call CreateThread function (done by newThreadOn:). We are not executing the incoming block directly but putting it inside entryPoint, which first initializes the thread and then evaluates it. You can see in #setCurrent that we implemented Thread Local Storage support. This allows us to access thread-specific information (like the currentThread object), though fs segment register indirection.

Thread class>>#current
  ^tlsThreadIndex _getThreadValue

Windows mantains the fs segment register to always point to the TIB (thread information block) segment. This TIB is a buffer that contains a lot of thread information, and has some free slots for storing user-specific thread data. We are making use of one of these slots to store the current thread object, from which we can obtain the dispatch cache. In the figure we can see the assembly code required.


A little race condition we found was related to FFI. Consider this method:

CLibrary>>#strlen: aString
<api: strlen ulong ulongReturn>

it makes a call to the native strlen function in C standard library, passing aString address and converting the result to an integer. At compile time, the compiler sees the api: pragma and insterts an extra literal to the generated compiled method, a byte array. This byte array encodes the address of strlen and the types of the arguments and return value. The function address makes the first 4 bytes of the byte array, and is calculated lazily: it is initialized to FF FF FF FF, and set the first time the method is invoked.

atomic vs. non atomic memory writes
With multithreaded applications, many methods may be activating the method at the same time. All of these activations should read FF FF FF FF or the correct address of strlen, and there shouldn't be any problem. But there is. If the read and write of the address are not atomic, an invocation may find that the address is something like FF FF xx xx, where the last 2 bytes have been written and the first 2 not. As unlikely you think this can be, it happened to us!

The simplest way to avoid this problem is two fold: first, take advantage of x86 architecture, which guarantees atomic 32-bit read/writes if the memory address read/written is 32-bit aligned; second, to issue the read/write as one 32-bits operation (and not to split it in 2 16-bits or 4 8-bits). The figure above shows two examples of not-atomic operations, and an atomic one. The first store is unaligned so the processor doesn't guarantee atomicity. The second is split in two stores. Each of them is 16-bit aligned and, as a consequence, 16-bit atomic, but not 32-bit. A second thread could read the memory just after first 16-bit write and before the second 16-bit write. The third is an aligned write. x86 arquitecture assures this will be issued atomically.

Going back to our problem, we know the function address is 32-bit aligned because it lays at the beginning of the byte array which, as any other object, is 32-bit aligned. But our uLongAtOffset:/uLongAtOffset:put: implementation was splitting the ulong in two, to avoid problems with big integers. The solution was to change our implementation of uLongAtOffset family assuring that reads and writes are done in one operation. As usual with race conditions, solving the problem took just minutes, finding it out, many hours.


Here we gave many implementation details of some of the problems we faced when entering this crazy multithreading world. We are still learning, so there might still be undetected problems, but we believe there aren't going to appear major road blocks. There are of course some unanswered questions. We are still thinking how we are going to face thread barriers, like how are we going to stop all threads for GC? We could just maintain a counter that is increased on activations and back-jumps, and peek for events from time to time. This might reduce performance, maybe we can avoid counting by doing some tricks with SuspendThread. In this latter case we should be careful as SuspendThread is not thought to be used within the same process but from the outside (mainly for debuggers).

To conclude this post, we attach a simple multithreaded test program. This just runs all tests in parallel and shows the results in console. To run,

$> wine bee.sll runAllConcurrent

to run on linux/mac; on windows, rename to bee.exe. There isn't much too see but the non-deterministic order of finalization of the tests. If you were hoping to see a command line interface that parses smalltalk, compiles, nativizes and finally executes it, stay tuned, we already have a working prototype and need to find time to write(no joke!). See you in next post!

Here is bee.sll

Cincom Smalltalk - Some Thoughts on the Discontinuation of Windows XP Support

As of April 8 2014, Microsoft® discontinued support for Windows® XP. Also, after April 8, Microsoft will no longer provide security updates or any of the various support-related options and updates for XP.