Planet Smalltalk

October 22, 2018

Cincom Smalltalk - Smalltalks 2018 Conference Schedule Announced

Smalltalks 2018, the 12th conference on Smalltalk-based technologies, research and industry applications, will be held from October 31 to November 2 at Universidad Nacional de Salta, in Salta, Argentina. And, we’re happy to say that the conference schedule has been announced.

The post Smalltalks 2018 Conference Schedule Announced appeared first on Cincom Smalltalk.

Hernán Morales Durand - Pharo Script of the Day: Unzip, the Smalltalk way

Hi everybody. Today, a simple but useful script to uncompress a ZIP file in the current image directory. Notice the #ensure: send: Smalltalk provides a very elegant way to evaluate a termination block:

| zipArchive fileRef |
zipArchive := ZipArchive new.
fileRef := 'myFile.zip' asFileReference.
[ zipArchive
readFrom: fileRef fullName;
extractAllTo: FileSystem workingDirectory ]
ensure: [ zipArchive close ].

October 21, 2018

Pharo Weekly - [Ann] Smacc Book V1.0

The book on Smacc, the compiler-compiler framework, is now available in PDF and HTML.

http://books.pharo.org/booklet-Smacc/

S. Ducasse

 

October 20, 2018

Pierce Ng - Glorp Mapping Existing Schema - Part 2

This is the second post in a short series on the topic. The last post looked at the tables GROUPS and TEAMS in the OpenFootball relational database schema. There is also the table GROUPS_TEAMS, usually known as a link table, which, ahem, "relates" the GROUPS and TEAMS tables. GROUPS_TEAMS has the following schema:

CREATE TABLE IF NOT EXISTS "groups_teams" (
  "id" integer PRIMARY KEY AUTOINCREMENT NOT NULL, 
  "group_id" integer NOT NULL, 
  "team_id" integer NOT NULL, 
  "created_at" datetime NOT NULL,
  "updated_at" datetime NOT NULL
);

A row in GROUPS_TEAMS with group_id of XXX and team_id of YYY means that the team represented by team_id YYY is in the group with group_id XXX.

Let's modify the Smalltalk class OFGroup to handle the linkage, by adding the inst-var 'teams' and creating accessors for it.

OFObject subclass: #OFGroup
  instanceVariableNames: 'eventId title pos teams'
  classVariableNames: ''
  package: 'OpenFootball'

Next, modify the mapping for OFGroup in OFDescriptorSystem:

classModelForOFGroup: aClassModel
  self virtualClassModelForOFObject: aClassModel.
  aClassModel newAttributeNamed: #eventId type: Integer.
  aClassModel newAttributeNamed: #title type: String.
  aClassModel newAttributeNamed: #pos type: Integer.
  "Next item is for linking OFGroup with OFTeam."
  aClassModel newAttributeNamed: #teams collectionOf: OFTeam.

descriptorForOFGroup: aDescriptor
  | t | 
  t := self tableNamed: 'GROUPS'.
  aDescriptor table: t.
  self virtualDescriptorForOFObject: aDescriptor with: t.
  (aDescriptor newMapping: DirectMapping)
    from: #eventId
    type: Integer
    to: (t fieldNamed: 'event_id').
  (aDescriptor newMapping: DirectMapping)
    from: #title
    type: String
    to: (t fieldNamed: 'title').
  (aDescriptor newMapping: DirectMapping)
    from: #pos
    type: Integer
    to: (t fieldNamed: 'pos').
  "Next item is for linking OFGroup with OFTeam."
  (aDescriptor newMapping: ManyToManyMapping)
    attributeName: #teams.

"No change to #tableForGROUPS:."

It is now necessary to add the table GROUPS_TEAMS to OFDescriptorSystem:

tableForGROUPS_TEAMS: aTable
  | gid tid |
  self virtualTableForOFObject: aTable.
  gid := aTable createFieldNamed: 'group_id' type: platform integer.
  aTable addForeignKeyFrom: gid to: ((self tableNamed: 'GROUPS') fieldNamed: 'id').
  tid := aTable createFieldNamed: 'team_id' type: platform integer.
  aTable addForeignKeyFrom: tid to: ((self tableNamed: 'TEAMS') fieldNamed: 'id').

Now let's fetch the OFGroup instances with their linked OFTeam instances.

| vh |
Transcript clear.
OFDatabase dbFileName: 'wc2018.db'
  evaluate: [ :db |
    db session accessor logging: true. "This shows the generated SQL."
    vh := String streamContents: [ :str | 
      (db session read: OFGroup) do: [ :ea | 
        str nextPutAll: ea title; nextPut: Character cr.
        ea teams do: [ :team | 
          str nextPutAll: '- ', team title; nextPut: Character cr ]]]].
vh

The above snippet produces the following output:

Group A
- Egypt
- Russia
- Saudi Arabia
- Uruguay
<some output omitted>
Group H
- Senegal
- Japan
- Poland
- Colombia

In the snippet, logging is enabled, and the SQL generated by Glorp is displayed in the Transcript (with whitespace inserted for readability). What we see is the infamous "N+1 selects problem" in action - the first SELECT fetches the GROUPS rows, then, for each group_id, there is a corresponding SELECT to fetch the TEAMS rows.

SELECT t1.id, t1.created_at, t1.updated_at, t1.event_id, t1.title, t1.pos
 FROM GROUPS t1  an OrderedCollection()

SELECT t1.id, t1.created_at, t1.updated_at, t1.key, t1.title
 FROM TEAMS t1, GROUPS_TEAMS t2
 WHERE ((t2.team_id = t1.id) AND (t2.group_id = ?))  an OrderedCollection(1)

SELECT t1.id, t1.created_at, t1.updated_at, t1.key, t1.title
 FROM TEAMS t1, GROUPS_TEAMS t2
 WHERE ((t2.team_id = t1.id) AND (t2.group_id = ?))  an OrderedCollection(2)

SELECT t1.id, t1.created_at, t1.updated_at, t1.key, t1.title
 FROM TEAMS t1, GROUPS_TEAMS t2
 WHERE ((t2.team_id = t1.id) AND (t2.group_id = ?))  an OrderedCollection(3)

SELECT t1.id, t1.created_at, t1.updated_at, t1.key, t1.title
 FROM TEAMS t1, GROUPS_TEAMS t2
 WHERE ((t2.team_id = t1.id) AND (t2.group_id = ?))  an OrderedCollection(4)

SELECT t1.id, t1.created_at, t1.updated_at, t1.key, t1.title
 FROM TEAMS t1, GROUPS_TEAMS t2
 WHERE ((t2.team_id = t1.id) AND (t2.group_id = ?))  an OrderedCollection(5)

SELECT t1.id, t1.created_at, t1.updated_at, t1.key, t1.title
 FROM TEAMS t1, GROUPS_TEAMS t2
 WHERE ((t2.team_id = t1.id) AND (t2.group_id = ?))  an OrderedCollection(6)

SELECT t1.id, t1.created_at, t1.updated_at, t1.key, t1.title
 FROM TEAMS t1, GROUPS_TEAMS t2
 WHERE ((t2.team_id = t1.id) AND (t2.group_id = ?))  an OrderedCollection(7)

SELECT t1.id, t1.created_at, t1.updated_at, t1.key, t1.title
 FROM TEAMS t1, GROUPS_TEAMS t2
 WHERE ((t2.team_id = t1.id) AND (t2.group_id = ?))  an OrderedCollection(8)

Fortunately Glorp is cleverer than this, and provides a way to avoid the N+1 problem, by using the message #alsoFetch:.

| vh |
Transcript clear.
OFDatabase dbFileName: 'wc2018.db'
  evaluate: [ :db |
    | query |
    db session accessor logging: true.
    query := Query read: OFGroup.
    query alsoFetch: [ :ea | ea teams ]. " <== See me. "
    vh := String streamContents: [ :str | 
      (db session execute: query) do: [ :ea | 
        str nextPutAll: ea title; nextPut: Character cr.
        ea teams do: [ :team | 
          str nextPutAll: '- ', team title; nextPut: Character cr ]]]].
vh

Same output as before, but this time the SQL (pretty-printed by hand for readability) is much shorter and properly takes advantage of the SQL language.

SELECT t1.id, t1.created_at, t1.updated_at, t1.event_id, t1.title, t1.pos, 
       t2.id, t2.created_at, t2.updated_at, t2.key, t2.title
FROM GROUPS t1 
INNER JOIN GROUPS_TEAMS t3 ON (t1.id = t3.group_id) 
INNER JOIN TEAMS t2 ON (t3.team_id = t2.id) 
ORDER BY t1.id  an OrderedCollection()
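
The difference between the two query shapes can also be reproduced outside Glorp. The following is a minimal sketch in Python (not Pharo) using the standard sqlite3 module and an in-memory database. The simplified tables, the sample rows and the helper names run, fetch_n_plus_one and fetch_with_join are assumptions made for illustration; they are not part of OpenFootball or Glorp.

```python
import sqlite3

# In-memory database with a cut-down version of the three tables.
# The rows are illustrative sample data, not the real wc2018.db.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE groups (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE teams (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE groups_teams (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        group_id INTEGER NOT NULL,
        team_id INTEGER NOT NULL
    );
    INSERT INTO groups (id, title) VALUES (1, 'Group A'), (2, 'Group B');
    INSERT INTO teams (id, title) VALUES
        (1, 'Egypt'), (2, 'Russia'), (3, 'Portugal'), (4, 'Spain');
    INSERT INTO groups_teams (group_id, team_id) VALUES
        (1, 1), (1, 2), (2, 3), (2, 4);
""")

statements = []  # every SQL statement issued through run()


def run(sql, params=()):
    """Execute one statement and record it, similar in spirit to an SQL log."""
    statements.append(sql)
    return conn.execute(sql, params).fetchall()


def fetch_n_plus_one():
    """One SELECT for the groups, then one more SELECT per group."""
    result = {}
    for group_id, title in run("SELECT id, title FROM groups ORDER BY id"):
        rows = run(
            "SELECT t.title FROM teams t, groups_teams gt "
            "WHERE gt.team_id = t.id AND gt.group_id = ? ORDER BY t.id",
            (group_id,),
        )
        result[title] = [row[0] for row in rows]
    return result


def fetch_with_join():
    """A single SELECT that joins through the link table."""
    result = {}
    rows = run(
        "SELECT g.title, t.title FROM groups g "
        "INNER JOIN groups_teams gt ON g.id = gt.group_id "
        "INNER JOIN teams t ON gt.team_id = t.id "
        "ORDER BY g.id, t.id"
    )
    for group_title, team_title in rows:
        result.setdefault(group_title, []).append(team_title)
    return result


statements.clear()
lazy = fetch_n_plus_one()
lazy_count = len(statements)

statements.clear()
eager = fetch_with_join()
eager_count = len(statements)

print(lazy_count, eager_count, lazy == eager)
```

With N groups the first function issues N + 1 statements while the second always issues one, and both build the same group-to-teams structure.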

Hernán Morales Durand - Pharo Script of the Day: Massive uncontrolled send and log of unary messages

Want to play and break your VM today? Try this useless Saturday script just for fun:

| outStream |
outStream := FileStream newFileNamed: 'unary_sends.txt'.
Smalltalk allClasses
reject: [ : cls | (cls basicCategory = #'Kernel-Processes') or: [ cls = HashedCollection ] ]
thenDo: [ : cls |
cls class methodDictionary
select: [: sel | sel selector isUnary ]
thenCollect: [ : cm |
| result |
result := [ cls perform: cm selector ]
on: Error
do: [ :ex | (ex messageText includes: 'overridden') ifTrue: [ ex pass ] ].
[ result asString ]
on: Error
do: [ : ex2 | result := ex2 messageText ].
outStream nextPutAll: cls asString;
nextPutAll: '>>';
nextPutAll: cm selector asString;
tab;
nextPutAll: result asString; cr. ] ] .
outStream close.

October 19, 2018

Hernán Morales Durand - Pharo Script of the Day: A quiz game script to test your Collection wisdom

I want to play a game :) The following script implements an "Is this Sequenceable?" kind of quiz. You are presented with a series of inspectors showing the source of methods in the image, without their class names. By looking only at the source code, you have to guess whether the method belongs to the SequenceableCollection hierarchy or not. If you miss, you can see the class and its class hierarchy. At the end of the game, you are presented with your score:

| hits n |
hits := 0.
n := 3.
n timesRepeat: [
| mth cls i |
cls := (Collection withAllSubclasses select: #hasMethods) atRandom.
mth := cls methodDict atRandom.
i := GTInspector openOn: mth sourceCode.
((self confirm: 'Method belongs to a Sequenceable Collection?') = (cls isKindOf: SequenceableCollection class))
ifTrue: [ UITheme builder message: 'Good!'. hits := hits + 1 ]
ifFalse: [ UITheme builder message: 'Method class is ' , cls asString , '. Class hierarchy: ' , (cls allSuperclassesExcluding: Object) asArray asString ].
i close ].
UITheme builder message: 'Your score: ' , hits asString , ' / ' , n asString.

What could be done to enhance the script? At first it would be really nice to add an option "Cannot determine with the displayed source"... (TBD) Actually there are a lot of possibilities, like asking if it has any Critics, or if it could be optimized, etc. Enjoy!

October 18, 2018

Pierce Ng - Glorp Mapping Existing Schema - Part 1

Using OpenFootball-Glorp for illustration, this post is the first in a series on mapping an existing normalized database schema and other fun Glorp stuff. As usual, I'm using SQLite for the database.

Consider the tables GROUPS and TEAMS.

CREATE TABLE IF NOT EXISTS "groups" (
  "id" integer PRIMARY KEY AUTOINCREMENT NOT NULL, 
  "event_id" integer NOT NULL, 
  "title" varchar NOT NULL, 
  "pos" integer NOT NULL, 
  "created_at" datetime NOT NULL, 
  "updated_at" datetime NOT NULL
);

CREATE TABLE IF NOT EXISTS "teams" (
  "id" integer PRIMARY KEY AUTOINCREMENT NOT NULL, 
  "key" varchar NOT NULL, 
  "title" varchar NOT NULL, 
-- many other columns omitted for now --
  "created_at" datetime NOT NULL, 
  "updated_at" datetime NOT NULL
);

As it happens, every table in OpenFootball has columns "id", "created_at" and "updated_at", where "id" is that table's primary key. Let's take advantage of Smalltalk's inheritance and class hierarchy to map these columns and tables:

Object subclass: #OFObject
  instanceVariableNames: 'pid createdAt updatedAt'
  classVariableNames: ''
  package: 'OpenFootball'

"Maps to GROUPS."
OFObject subclass: #OFGroup
  instanceVariableNames: 'eventId title pos'
  classVariableNames: ''
  package: 'OpenFootball'

"Maps to TEAMS."
OFObject subclass: #OFTeam
  instanceVariableNames: 'key title'
  classVariableNames: ''
  package: 'OpenFootball'

By convention, the Glorp mapping is encapsulated in the class OFDescriptorSystem, which has these supporting methods:

virtualClassModelForOFObject: aClassModel
  aClassModel newAttributeNamed: #pid type: Integer.
  aClassModel newAttributeNamed: #createdAt type: DateAndTime.
  aClassModel newAttributeNamed: #updatedAt type: DateAndTime.

virtualDescriptorForOFObject: aDescriptor with: aTable
  (aDescriptor newMapping: DirectMapping)
    from: #pid
    to: (aTable fieldNamed: 'id'). "This is the primary key mapping."
  (aDescriptor newMapping: DirectMapping)
    from: #createdAt
    type: DateAndTime
    to: (aTable fieldNamed: 'created_at').
  (aDescriptor newMapping: DirectMapping)
    from: #updatedAt
    type: DateAndTime
    to: (aTable fieldNamed: 'updated_at').

virtualTableForOFObject: aTable
  (aTable createFieldNamed: 'id' type: platform serial) bePrimaryKey.
  aTable createFieldNamed: 'created_at' type: platform datetime.
  aTable createFieldNamed: 'updated_at' type: platform datetime.

The mapping for OFGroup is as follows:

classModelForOFGroup: aClassModel
  self virtualClassModelForOFObject: aClassModel.
  aClassModel newAttributeNamed: #eventId type: Integer.
  aClassModel newAttributeNamed: #title type: String.
  aClassModel newAttributeNamed: #pos type: Integer.

descriptorForOFGroup: aDescriptor
  | t | 
  t := self tableNamed: 'GROUPS'.
  aDescriptor table: t.
  self virtualDescriptorForOFObject: aDescriptor with: t.
  (aDescriptor newMapping: DirectMapping)
    from: #eventId
    type: Integer
    to: (t fieldNamed: 'event_id').
  (aDescriptor newMapping: DirectMapping)
    from: #title
    type: String
    to: (t fieldNamed: 'title').
  (aDescriptor newMapping: DirectMapping)
    from: #pos
    type: Integer
    to: (t fieldNamed: 'pos').

tableForGROUPS: aTable
  self virtualTableForOFObject: aTable.
  aTable createFieldNamed: 'event_id' type: platform integer.
  aTable createFieldNamed: 'title' type: platform varchar.
  aTable createFieldNamed: 'pos' type: platform integer.

The mapping for OFTeam is similar and I've not shown it here for brevity.

To round out the scene setting, OFDatabase, the "database interface" class, has class-side convenience methods to run snippets like so:

OFDatabase 
  dbFileName: 'wc2018.db'
  evaluate: [ :db |
    db session read: OFGroup ]

To be continued...

Pharo Weekly - [ann] MemCached Pharo client

Hi,

I copied the (Pharo) Memcached client to https://github.com/svenvc/memcached where it lives in Tonel format with a Baseline and a working Travis CI build against an actual memcached server in the worker.

More about memcached

- https://en.wikipedia.org/wiki/Memcached
- http://memcached.org

Acknowledgements

The original project can be found at http://www.squeaksource.com/memcached.html

As far as I can see it was written by Philippe Marschall and Ramon Leon. I ported the codebase to Pharo. This repository is a recent copy with some cleanups.

I intend to maintain this as I need it myself. The codebase should still maintain its original portability (minus the meta info, possibly).

Sven

October 17, 2018

Pharo Weekly - Internship around 3D/Pharo

October 16, 2018

Hernán Morales Durand - Pharo Script of the Day: Find your IP address

I'm back :)

Today let's update the PSotD blog with a script to find your IP address using Zinc HTTP Components. Credits also to Sven Van Caekenberghe, who helped me figure out why Zn was getting a 403.

ZnClient new
systemPolicy;
beOneShot;
url: 'http://ifconfig.me/ip';
accept: ZnMimeType textPlain;
headerAt: 'User-Agent' put: 'curl/7.54.0';
timeout: 6000;
get.

Pharo Weekly - [Ann] Nano Pi Neo + MCP9808 :)

Nano Pi Neo running PharoThings and controlling the GPIOs to take the temperature out of the MCP9808 sensor and turn on/off a LED

 

Great job allex.

Pharo Weekly - PharoThings HC-SR04 ultrasonic sensor

October 15, 2018

Torsten Bergmann - Squeak 5.2. released

October 14, 2018

Adriaan van Os - Photos of the 2018 ESUG Conference in Cagliari

My set of photos of the 2018 ESUG Conference in Cagliari is now complete and available here.

26th ESUG Conference, Cagliari, 2018

Pharo Weekly - [Ann] New roassal video

October 13, 2018

October 12, 2018

Hernán Morales Durand - Pharo Script of the Day: Generate random strings

Today's script is just a one-liner to generate 10 random Strings:

(Generator on: [ : g | 10 timesRepeat: [ g yield: UUID new asString36 ] ]) upToEnd.

Pharo Weekly - [Call for testers] Pharo Launcher 1.4.5

Hi all,

We have been working with Christophe for the last two weeks or so on the stability of the launcher. We have prepared version 1.4.5, and we would like to have some feedback.

So, we would really LOVE it if somebody could play with this version and send us feedback. In particular, if the username on your machine contains characters that take more than 1 byte when encoded, we would really like your feedback. We have tested with Japanese characters, and others like î, ü, etc., but the more the better.

The main focus was on:

 – correct management of encodings (on all platforms)
   – of environment variables
   – of files and paths
   – of commands called through OSProcess
 – better error management in edge cases (like when we cannot determine the version of an image)

Just FYI: the major limitation of the launcher right now (and it has always been like that) is that we cannot call external processes with non-ASCII characters on Windows. This happens because ProcessWrapper uses the ASCII version of the Windows API to create a process.

With what we have learnt this week, we would like to push some of these fixes to Pharo 7 soon, too:

 – Correct encoding/decoding of environment variables on Linux/OS X
 – Ability to access the encoded version of environment variables on Linux/OS X to give users control over the encoding they want (or even access binary data)
 – Correct encoding/decoding of environment variables on Windows by using the correct Windows API (the current primitiveGetenv on Windows uses the ASCII version too…)

In the long term, we also need a solution to enhance or replace ProcessWrapper using the W (wide) version of the Windows API. But that is far more work…

October 11, 2018

Hernán Morales Durand - Pharo Script of the Day: Colorizing nucleotides

Some days ago I experimented a bit to colorize a random DNA sequence given an alphabet and the desired sequence size, with a little help of BioSmalltalk. This is what I've got:

| text attributes |
text := ((BioSequence forAlphabet: BioDNAAlphabet) randomLength: 6000) sequence asText.
attributes := Array new: text size.
1 to: text size do: [ : index |
attributes at: index put: {
(TextColor color: (BioDNAAlphabet colorMap at: (text at: index))) } ].
text runs: (RunArray newFrom: attributes).
text.

I built a color map for every nucleotide, based on the alphabet size. This is because in biological sequences (proteins, DNA, RNA) you have a different set of letters.

I should say I don't like the final result, especially the lack of column alignment:


This seems to persist even when trying other attributes:

| text attributes |
text := ((BioSequence forAlphabet: BioDNAAlphabet) randomLength: 6000) sequence asText.
attributes := Array new: text size.
1 to: text size do: [ : index |
attributes at: index put: {
(TextColor color: (BioDNAAlphabet colorMap at: (text at: index))) .
(TextKern kern: 4) } ].
text runs: (RunArray newFrom: attributes).
text.

Maybe the efforts in Bloc will make it easier to align text.





October 10, 2018

Pharo Weekly - MDL in Pharo at Google Dev Fest Brussels

Philippe Back from HighOctane is presenting MDL Seaside at Google DevFest Brussels. MDL is developed by Cyril Ferlicot and is available at

https://github.com/DuneSt/MaterialDesignLite

Well done Phil!

https://www.youtube.com/watch?v=JhmmoEtAq20

Stef

 

 

Hernán Morales Durand - Pharo Script of the Day: One minute frequency image saver

You can save the image every 60 seconds (or at any other frequency) to avoid losing changes to the image with the following script:

[ [ true ] whileTrue: [
(Delay forSeconds: 60) wait.
Smalltalk snapshot: true andQuit: false
] ] forkAt: Processor userInterruptPriority named: 'Image Saver '.

You can use the Process Browser under the World menu to terminate or pause the process.

Cincom Smalltalk - Are you fiscally responsible? If not…

Ways to Be Fiscally Responsible when Spending “Use It or Lose It” Budget for 2018 Julie Windsor of Talentia Software UK recently discussed the budget reform that’s happening around the globe. Over […]

The post Are you fiscally responsible? If not… appeared first on Cincom Smalltalk.

October 09, 2018

Hernán Morales Durand - Pharo Script of the Day: Create a directory tree at once

Suppose you want to create a directory tree at once. Let's assume subdirectories contain other directories and you don't want to use platform-specific delimiters. We can do it in Pharo using the almighty #inject:into: and the FileSystem API.

| rootPath |
rootPath := Path / FileSystem disk store currentDisk / 'App1'.
#(
#('Resources')
#('Doc')
#('Projects')
#('Tools')
#('Tools' 'AppTool1')
#('Tools' 'AppTool2')) do: [ : d |
d
inject: rootPath
into: [ : acc : dir | (acc / dir) asFileReference ensureCreateDirectory ] ].

Hope you liked it

October 08, 2018

Hernán Morales Durand - Pharo Script of the Day: Execute command in a MSYS2 MinGW64 context

For this to work, first ensure you have the MSYS2 bin directory added to the PATH environment variable. Just run the following from the command line and add "c:\msys64\usr\bin\" to the end of the PATH variable:


systempropertiesadvanced

We will use ProcessWrapper; although its features are limited, it works perfectly for simple tasks. And now you can run all those complex bash shell commands from Pharo :) For example, to get the CPU frequencies in GHz:

| process output answer cmd |

process := ProcessWrapper new.
cmd := '"{ echo scale=2; awk ''/cpu MHz/ {print $4 "" / 1000""}'' /proc/cpuinfo; } | bc"'.
output := process
useStdout;
useStderr;
startWithShellCommand: 'set CHERE_INVOKING=1 & set MSYSTEM=MINGW64 & set MSYS2_PATH_TYPE=inherit & "c:\msys64\usr\bin\bash.exe" -c ' , cmd;
upToEnd.
^ (answer := process errorUpToEnd) isEmpty not
ifTrue: [ answer ]
ifFalse: [ output ].

Pharo Weekly - [Ann] Release version v1.3.0 of MDL

https://github.com/DuneSt/MaterialDesignLite/releases/tag/v1.3.0

Add compatibility for Gemstone smalltalk (b83d742) and (622dbdb)
MDLCell should implement an offset feature (0ae17ef)
MDLCell should allow to reorder the cells depending on the layout (desktop/tablet/phone) (a8e77dd)

Gemstone

Add OrderedDictionary to Gemstone compatibility package (b83d742)
GemStone expects Blocks for ifNotNil: and friends. What does this code do? (b83d742)

Bug Fixes

Closing button of MDLDialogWidget should not be of submit type but of button type (9d54da1)
MDLMenuButtonWidget should use the ID system of MDLWidget (8ad61b9)
MDLCalendar should use the id system of MDLWidget instead of recreating one (01e1f61)
Month and year selection does not work on MDLCalendarWidget (dc915cd)
First snackbar demo is broken (9497c65)

Cleaning

Deprecate #mdlMultilineTextField since we already have #mdlTextArea which is the common name in HTML5 (ef1e0a6)
Deprecate MDLCheckboxWidget since it does not bring anything more than the brushes (0630493)
Typo in MDLProgressBarWidget, #hyde should be #hide (a362b33)
Remove dependency to Morphic (#detectIndex:) (ab02a1f)
MDLCardTitleText should not be able to respond to #borde or #expand (7f2e2cf)
Remove dependency to JQueryUI (9ed3a6f)
Remove duplication between MDLButton and MDLAnchorButton (99b3266)
MDLCardTag has unused variables (431d7d1)
MDLCardMenu should not be able to respond to #borde or #expand (b59094d)
Remove dependency to Seaside-Development (89fa553)
Deprecate useless MDLFooterLogo since we already have MDLLogo (fa7d7985)
Remove duplication between MDLIconToggleLabel and MDLIcon>>#toggle (fa7d798)

Infrastructure

Improve code coverage. This release increased the code coverage from 3% to 61%
Add tests. The number of tests increased from 8 to 485
Add Coverall to CI (5a37a85)
Add Demo about not raised colored buttons (7a55891)

Demo

Add demo on Elevation (f9a387c)
UX: Icons in list should be clickable (43e3187)
UX: Improve global UX of the demo (43e3187)
Add demo about MDLBadge>>noBackaground (f097d8f)
Add demo to explicit MDLBadge>>overlap option (f097d8f)
Add demo about MDLCell>>#hideDesktop/#hideTablet/#hidePhone (aabc92b)
Add demo about MDLCell>>#stretch/#bottom/#top/#middle (250a4b2)

October 07, 2018

Hernán Morales Durand - Pharo Script of the Day: k-shingles implementation

K-shingling is a technique used to find similar Strings, for example in record deduplication or near-duplicate document detection. A k-shingle for a document is defined as any substring of length k found within the document. I found implementations that assume you want to shingle words; others assume a "document" is just a sequence of Characters, without a notion of words. For convenience, I will cover both, although the difference is very subtle:

  • k is always a positive integer.
  • Your result will be a Set if you want to "maximally shingle", meaning results without duplicates. It could be an OrderedSet or just a Set, depending on whether you want the unique elements kept in order. Otherwise it will be an arrayed collection.
  • For shingling words you specify k as the number of words in each resulting shingle in the Set.
  • For shingling characters you specify k as the number of characters in each resulting shingle in the Set.
  • "k should be picked large enough that the probability of any given shingle appearing in any given document is low". From Jeffrey Ullman's book.
  • The Jaccard similarity coefficient (a.k.a. Tanimoto coefficient, a token-based similarity measure) uses k-shingles.

So for word shingling:

| k s |
k := 2.
s := 'a rose is a rose is a rose' findTokens: ' '.
(1 to: s size - k + 1) collect: [ : i | (s copyFrom: i to: i + k - 1) asArray ]

For different values of k we will have:

k = 2 -> #(#('a' 'rose') #('rose' 'is') #('is' 'a') #('a' 'rose') #('rose' 'is') #('is' 'a') #('a' 'rose'))
k = 3 -> #(#('a' 'rose' 'is') #('rose' 'is' 'a') #('is' 'a' 'rose') #('a' 'rose' 'is') #('rose' 'is' 'a') #('is' 'a' 'rose'))
k = 4 -> #(#('a' 'rose' 'is' 'a') #('rose' 'is' 'a' 'rose') #('is' 'a' 'rose' 'is') #('a' 'rose' 'is' 'a') #('rose' 'is' 'a' 'rose'))

For k = 4, the first two of these shingles each occur twice in the text, so the result is not "maximally shingled". Shingling a sequence of Characters uses pretty much the same implementation:

| k s |
k := 2.
s := 'abcdabd'.
(1 to: s size - k + 1)
collect: [ : i | s copyFrom: i to: i + k - 1 ]
as: OrderedSet.

And in this case we have:

k = 2 -> "an OrderedSet('ab' 'bc' 'cd' 'da' 'bd')"
k = 3 -> "an OrderedSet('abc' 'bcd' 'cda' 'dab' 'abd')"
k = 4 -> "an OrderedSet('abcd' 'bcda' 'cdab' 'dabd')"

You can find this implemented in the StringExtensions package. The famous quote "a rose is a rose is a rose", used for testing shingles in many implementations, is by Gertrude Stein.
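
To make the link between shingles and the Jaccard coefficient concrete, here is a small sketch in Python (not Pharo). It builds "maximally shingled" sets for words and for characters and then computes the Jaccard similarity as the size of the intersection divided by the size of the union. The function names and the second sample sentence are assumptions made for illustration; they do not come from the StringExtensions package.

```python
def char_shingles(text, k):
    """Return the set of all substrings of length k (no duplicates)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def word_shingles(text, k):
    """Return the set of all runs of k consecutive words, as tuples."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: |a intersect b| / |a union b|."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


first = word_shingles("a rose is a rose is a rose", 2)
second = word_shingles("a rose is a flower", 2)  # hypothetical second document
similarity = jaccard(first, second)

print(sorted(char_shingles("abcdabd", 2)))
print(similarity)
```

Because the shingles are collected into sets, repeated shingles such as ('a', 'rose') count only once, which is exactly the "maximally shingled" case described above.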

David A. Smith - arcos


The arcos platform embedded in my blog.