Ola niƱos. Hope you recover from the shock of seeing a new entry on this blog. Haven't been here in a while. Looks like this is all still up and running. I'm surprised I still remember my mt password.
So why this entry ? Well, I've done another thing I haven't done in a while: I wrote yet another note in LaTeX solving some simple problem in a pompous way. I used Scala to implement the solution. Scala is coming along nicely, it feels natural to me and it has some powerful abstractions and features. There's definitely a buzz about it. A book on it is now available, the Eclipse plugin behaves well and there are open source projects that use it.
Before I get completely rusty with LaTeX, here's another one in the series of absurdly trivial papers: this one derives a program that establishes if there are 4 points from a set of points on a circle that form a rectangle.
Got these from the scala mailing list: interesting functional programming algorithms (scala enthusiasts are porting them over from haskell)
Finger Trees: A Simple General-purpose Data Structure
This blog has just hired a new editor-in-chief Nadia.
She started Oct. 15th (7.2 lb and 20.5 inches) and has already had a positive impact on the site: the blog has been upgraded to mt 3.2, the photos slideshow page has a new thumbnail viewer on the right side when viewing albums, the papers section has a new entry and the blog links sidebar has been cleaned up and has a couple of new additions. But even Nadia admits in a recent interview with "Blog Editors-In-Chief Magazine" that unless she hires new staff writers not much content will come to this blog in the next couple of months.
So in the meantime here's another edition of "piles of cool stuff". A lot of time has passed since the last edition so many cool things accumulated on the observer desk but unfortunately a lot of things were also lost again because they were not recorded right away. What remains is:
With this theorem taken from "Elementary Number Theory in Nine Chapters" by James J. Tattersall (p. 204, Theorem 6.20) we can choose m and p such that f(x) = x^m mod(p) is bijective: choose m and p so gcd(m, p - 1) = 1.
To convince yourself that f is not always bijective try m = 5 and p = 11 for an example with lots of collisions.
An interesting side problem to the bridge crossing puzzle is to find and generate all the possible ways of crossing the bridge under the constraints of the puzzle.
Here's how it can be solved.
An interesting side problem to the bridge crossing puzzle is to find and generate all the possible ways of crossing the bridge under the constraints of the puzzle.
Here's how it can be solved.
Came across this puzzle:
Four people begin on the same side of a bridge. You must send them across to the other side in the fastest time possible. It is night. There is one flashlight. A maximum of two people can cross at a time. Any party who crosses, either one or two people, must have the flashlight to see. The flashlight must be walked back and forth, it cannot be thrown, etc. Each person walks at a different speed. A pair must walk together at the rate of the slower person’s pace, based on this information: Person 1 takes t1 = 1 minutes to cross, and the other persons take t2 = 2 minutes, t3 = 5 minutes, and t4 = 10 minutes to cross, respectively.
The obvious solution is to let the fastest guy cross with every other person and bring the flashlight back. But this wouldn't be a very good puzzle if you couldn't do better than that.
Here's a page that shows a general solution with n people and k subsets allowed to cross at the same time. It uses dynamic programming and has exponential running time.
This guy has collected a large set of emails and usenet articles about the puzzle.
But the coolest solution (I love this stuff :-) ) comes from here. Günter Rote presents in an elegant way how to map this problem to a subgraph in a graph problem (for the case of at most two people crossing). It has a greedy algorithm solution in polynomial time.
I dug up this excellent paper The binding roots of symbolic AI: a brief review of the Cyc project by Deniz Yuret. In the course of reviewing the Cyc project he manages to give a short history of symbolic A.I., discuss the origins of logic, frame the A.I. discussion and explain the types of arguments one can make, talk about knowledge representations and models of thinking and much more.
A system that relies solely on deductive reasoning has systemic limitations that prevent it from manifesting, well... intelligence.
The number of facts you can deduce from what you know, is not restricted to the deductive closure of everything in your symbolic memory. You add to this the analogue information you have in the memories of your other representational systems. You further add the ability of one system being able to set up experiments to be run in another.
If deductive reasoning is limiting what else can a system use ?
Here's another quote from the paper:
I criticized Cyc for having only an explicit representation, and relying on deductive inference as the main computational machinery. How else is it possible to represent things and do inference?
To illustrate the point, I will give an example from Herbert Simon, commenting on visual imagery to Gazzaniga [Gazzaniga, 1985]: ``Imagine a rectangle. Draw a line from the top right-hand corner to the bottom left corner. Now draw a line from the middle of the diagonal to the bottom right corner. Now approximately one third of the distance from the top right corner along the top line, drop a perpendicular line down to the lower edge. How many lines do you intersect?''
Introspection is typically not considered valid scientific evidence in cognitive science. However if you felt like you were drawing the rectangle in your mental sketchpad while reading the question, it did not mislead you this time. There are conclusive brain imaging studies that show that some visual regions of the brain that are active during perception are also active during thinking and imagination.
....
This illustrates a powerful inference engine. To find out whether a flying bird touches ground, you might have used this machinery for a blink of a second to see that it does not. Maybe the picture was drawn for you while you were hearing the sentence. To ``deduce'' that the Pisa tower is an unbalanced structure, you can imagine your body tilted at that angle, and the balance sensors in your ears will tell you it is not very stable.
....
The imagination-perception loop also illustrates the implicit representation. The fact that flying birds do not touch ground does not have to be represented anywhere in your knowledge. It is implicit in the procedures that convert the verbal description to a visual image, and the procedures which can look at the image and answer queries.
But wait,
In fact, as you might have suspected already, there is a bug with our scheme of inference using imagery. And this bug stems from the realization of the previous section, that we cannot substitute different representations for one another. When I started the description, ``imagine a rectangle...'', you were imagining a specific rectangle. Your image did not capture the concept of a rectangle in general. The answer to that particular puzzle happens to hold for any rectangle, so you were able to solve it. But it didn't have to be that way. The typical mistake of geometry students is to draw an equilateral triangle instead of ``any'' triangle and be misled by their own drawing into wrong conclusions. Good thing we brought this up. I was almost suggesting a constant time algorithm for the NP-hard problem of inference.
Whenever you pass information from one representational system to another, knowledge gets lost. You have to add defaults to visualize a concept, or you have to choose particular abstractions to verbalize a picture.
To conclusively prove a statement including abstract concepts like triangle using pictures, one would have to consider the picture of every possible triangle. Typically, however, a few carefully selected examples suffice to span the whole set and convince us of a proposition. How to select the right examples is a hard problem, because it partly depends on the question asked. This is the heuristic component of our ``constant time'' algorithm, and it may fail at times.
The fact that some mechanism of inference is not complete, does not render it useless. It may be better to have a system that works 99% of the time, and then try to focus on learning the exceptions.
This starts to look more like inductive inference rather than deductive inference. The difference from the standard idea of ``induction'' is that the mind has the ability to generate its own examples. Instead of going out to a field and try to find flying birds, you can just imagine them to infer they don't touch the ground.
The paper mentions G. Polya and his distinction between demonstrative (deductive) and plausible (inductive) reasoning. Polya taught plausible reasoning to his students and wrote a couple of books on this: How to Solve It and Mathematics and Plausible Reasoning.
The paper concludes with a description of "high-level" reasoning and with a list of principles that the author Deniz Yuret would use to build the next Cyc:
Symbols are not the only kind of data structures that should flow in a system. There is room for other modalities.
Deduction is not the only way to reach new facts. Using the appropriate modality to imagine situations and testing them by the efficient engines of the perceptual machinery can be the main inference engine.
Complementary representational systems can express what each one by itself cannot express.
Complementary representational systems can infer things that each one by itself could not infer efficiently.
Finally, higher level domains can be built as analogical layers on primitive domains, and inherit their inference efficiency.
For A.I. and machines it's not that easy to see how this will be accomplished; for us humans it's well known how we can improve our plausible (inductive) reasoning: practice, practice, practice. Solve many puzzles :-) ....
Got this link from a coworker.
Play it for a while and you learn to hate Juan. I realized quickly that it's not trivial to find a winning strategy.
I eventually stumbled upon the concept of game states, winning states and losing states. For one row or two rows it's straightforward to prove by induction which states are winning states. For one-row games you should modify the game slightly to make it more interesting: have a row of n stones and allow each player to take for example 1, 2 or 3 stones. What is the winning strategy ? What have all winning positions in common ? (Depends on n).
But even after my coworker gave me a hint saying "binary representation" I couldn't come up with a property that characterizes winning states for games with more than two rows.
Well, after another hint telling me that this game is also known as Nim I found the solution on the web: It falls into the domain of impartial combinatorial games and these two papers give a pretty good description of the theory behind it:
Thomas S. Ferguson: Game Theory
Harold Reiter: Games and Representations
This is part two of my weak attempts to solve ewd666 (go here to read part one).
This weekend I tried to attack the problem again and through some dumb luck I think I made some progress.
We left off in part one with the desire to find some property of S that will tell Sam that Paul cannot know the numbers. This time I went to hunt for facts of sums of two integers. The easiest way to hunt is of course with Google. After a few attempts with search terms that didn't yield anything interesting I tried it with "sum prime numbers" since prime numbers always play a big role in number theory problems.
Number one search result for "sum prime numbers" on Google is this page, a pretty interesting read. What caught my eye was Goldbach's Conjecture: every even integer greater than 2 can be written as the sum of two primes.
For our problem this means that S cannot be even. If S were even then Sam couldn't claim that he already knew that Paul cannot know the numbers because S = x + y with x and y prime so Paul could get the numbers from P = x * y.
It's a conjecture but hunting a little further reveals that it has been checked for all integers up to 6 x 10^16 so we're safe with our numbers in the range of [2..99].
This eliminates a lot of cases. I still feel there are way to many cases left so I'm not sure I made progress. It's hard to keep my curiosity in check and not go read the solution. I'm getting weaker, the temptation is great...
I recently got the Postscript Red Book (the reference manual) as a gift from a good friend. He probably couldn't think of any reason to keep it. Postscript has found its niche in the printer world and has long since faded from the computing mainstream. It once was the API for displaying things on screen on NeXTstep in the form of Display Postscript and it was a fairly widespread 2D vector graphic file format in the form of EPS.
This all made me think back to an experiment with my wife and a possible new use for a language like Postscript (My friends and coworkers will love this blog entry. They know I'm a Postscript aficionado and their worst suspicions about this alleged PS fixation will be confirmed with this entry.) About a year ago my wife wanted to learn a little programming. She is a graphic and web designer, knows her tools of the trade like Photoshop, Illustrator, Dreamweaver, Flash and the like. But since these things are becoming fairly scriptable with Javascript and Actionscript she wanted to start writing little programs.
This presented me with the task of how to explain to an absolute beginner what programming is all about. Believe it or not, after a few false starts with other languages I chose Postscript as the teaching language. Ok, stay with me and hear me out, it's not as crazy as it sounds.
Here's a rough outline of the steps involved:
In the first step we simplify the available computing model significantly. We get rid of almost all operators and operands and only keep literal numbers and arithmetic operators. Our student is then asked to play a game: the student becomes a computer and needs to execute the programs we write. We have to use non-programming terms when describing the computing model the student will pretend to be. Thus the rules of the game are
- the program is a list of words
- the student will read the list from left to right doing something after reading each word:
- if the word is a number set it aside on a pile of numbers on top of the numbers already seen
- if the word is not a number it is an instruction that involves taking two numbers from the top of the pile, doing the arithmetic operation mentioned by the word with the two numbers and throwing the result back on top of the pile.
- when the last word is read from the list the number on top of the pile is the result of the game
This is all done with pencil and paper or even better with little scraps of paper with numbers or arithmetic operations written on them. We designate a place on our desk as the pile where our student can start stacking up the numbers. We provide empty scraps of paper on which the student can write results to throw them on the pile.
We then write a few programs, in the beginning fairly short ones, and let the student execute them. We gradually increase the length of the programs and let the student get comfortable with the game.
In the next step we reverse the roles. Now we are the computer executing programs written by the student. Notice the change: in the first step the student only had to follow the rules of the game, now he needs to think how to make us do what he wants under the same rules of the game. We can start the game with a few numbers already on the pile and the task of adding the two top ones, subtracting the result from the one below etc. Our student is writing programs without even realizing it, we haven't departed from the comfortable game world, there's no talk about variables, statements, syntax, types etc., things that overwhelm beginners.
The next step enriches our game a little. First we introduce new instructions, this time instructions that change the order in our pile of numbers (let's call it out as stack from now on). We can use the simple stack manipulation instructions like dup, exch, pop, copy, count and more complicated ones like roll. We play the same game first with the student as the computer so that he gets comfortable with the new instructions and what they mean and do, and then with us as the computer executing programs the student writes. The desired end results need to be clearly defined so the student can write programs that achieve these results after giving them to us for execution.
We can continue this game, gradually enriching it. First we can introduce an additional literal: strings, i.e. pieces of text. Strings are like numbers that when read are put onto the stack and then we can introduce instructions that act on strings. This is a useful addition because for one it illustrates the need to resolve ambiguity. How are strings distinguished from instructions in the list of words? We as the dumb computer program executors cannot do it without an additional rule that says that strings are enclosed by round braces and instructions are not. The round braces are not part of the string but only help to detect them in the list of program words. In addition to that the introduction of strings makes stacks more interesting. The rules of the game need to clarify that it is not allowed to add a number and a string etc. Having two types of literal objects makes the game more interesting and the problems the student needs to solve with the programs he writes more challenging (example: initially the stack has a number and a string, write a program that ends with the stack having the sum of the length of the string added to the initial number on top of the stack).
We play this game until the student is absolutely comfortable solving problems in it. The advantage here is that we chose a stack-based language with very easy concepts and very easy syntax. We also can play with it using pencil and paper which eliminates all the upfront housekeeping noise like: start the interpreter, write this into a file, point the interpreter to the file, specify these file paths etc. With us pretending to be the computer we can give much more human error messages and react much more subtle to our student than a silicone computer would ever do.
The next step is dictionaries and literals that are referenced by name with the literal itself stored in the dictionary. Lots of examples and practice again, first with the student as the computer and then with the student writing programs solving little problems involving these new concepts.
We continue with control flow, first the ifelse. Branching is important and needs to be practiced a lot. New literals, the booleans are needed here too.
Once we introduce loops it gets really fun. If our student gets comfortable we can even start going to a real computer, use a postscript interpreter and introduce some instructions that actually draw on screen. This is where our student is completely hooked and we can let him start to play on his own and write his own programs.
We can also start associating back these concepts to the actual target language the student wanted to learn like Javascript. With examples he already knows and understands in Postscript we explain terms like syntax, flow, branching, loops and literals etc basically the elements of a programming language. We then show what those are in say Javascript.
And finally we give the student a good introductory Javascript tutorial book and retreat exhausted into the background...
If you dare, take a member of your family on this discovery trip over the Holidays, a member who's a complete rookie but always wondered what you do in your profession. Oh, and in case you wonder, yes, our marriage survived the experiment.
A good mathematical notation is in my opinion a necessity when solving mathematical problems. It allows to capture and communicate mathematical reasoning in a concise, efficient and unambiguous way. Anybody doubting this should just try to read old mathematical texts from before the time when modern mathematical notation was common. Mathematical historians usually quote the old masters using up a whole book page worth of text and then translate that for us in modern mathematical notation using an equation in one line or so.
Choosing the right notation for the job is also very important. Expressions in a notation should be easy to manipulate, expressions should stay small but the notation should not hide properties of the mathematical objects involved. My dad always gives this example of a bad notation choice: In some schools EE students learn to use + for "or" and "*" for "and" in boolean expressions. This is bad because in boolean algebra "or" distributes over "and" and "and" distributes over "or". But we the chosen notation one would write A (B + C) = A B + A C and A + (B C) = (A + B) (A + C). The second one looks unnatural because it doesn't work in integer algebra so students are reluctant to use it for boolean expressions even though it's perfectly fine there.
I tried to make all of this lead to this paper by E.W. Dijkstra where he explains the notational conventions that he used and why. He is one of the first to take notation very seriously and he pushed for corrections to the modern mathematical notation (For example he criticized the familiar use of the classical summation symbol as sloppy and misleading, he repeatedly argued for numbering starting from zero, ...).
I don't know how many parts there will be about ewd666. It's a wicked puzzle, tough to crack. I found it on the website that collects Edsger W. Dijkstra's notes (a wonderful collection btw).
Let's state the problem again:
Of two unknown integer numbers in the range [2..99] inclusive Paul is told the product P and Sam is told the sum S. When asked whether they know the numbers, the following dialogue takes place between Paul and Sam:
Paul: "I don't know them."
Sam: "I already knew that."
Paul: "Then I now know the numbers."
Sam: "Then I now know them too."
The problem is to determine the two numbers from the above data and to prove that the solution is unique.
Spoiler alert: don't continue to read if you want to find your own solution (also don't read more than page one from the pdf file above). I don't have a solution yet, these are pretty raw ideas. Like I said, it's a tough one for me.
Paul says he doesn't know the numbers. He knows the value of the product P = x * y and has two unknowns in x and y. His problem is cleary underdetermined. We should ask ourselves in what cases would Paul know the numbers ie in what cases the problem becomes determined.
We can use the integer factorization of P. If the integer factorization for example yields P = p * q, with p, q primes it's clear that Paul knows the numbers. Are there any other cases ? Here's an example: P = 318, integer factorization of P is in this case P = 318 = 2 * 3 * 53. The only solution that keeps the numbers in the range [2..99] is x = 2 * 3 and y = 53 (interchanging x and y values is considered the same solution).
Refining this idea we get that P = x * y is a unique solution if
forall e: e | x => e * y > 99
( e | x means that e is divisor of x).
The proof is done by contradiction: assume there exists an e | x with e * y <= 99. Then it is clear that x1 -> x / e and y1 -> y * e is another solution of P in contradiction with our assumption that P = x * y is unique.
Paul doesn't know the numbers. So there
exists e: e | x and e * y <= 99.
e is a true divisor of x, x is in the range [2..99] so e can only be in the range [2..49]. e * y <= 99 so y can only be in the range [2..49]. This means that the sum cannot be bigger than 99 + 49 if there's more than one solution to Paul's problem. The sum S <= 148 is therefore a necessary condition. It is not a sufficient condition though (consider x = 2, y = 3, S = 5 < 148 but P = x * y is unique and Paul would know the numbers).
So where does all this leave us ?
We have to find a property (*) of the sum value S so that
(*) => exists e: e | x and e * y <= 99.
...to be continued (hopefully)
We're off to Europe for three weeks (first Venice and then Germany) so there won't be any entries for a while.

In the meantime you can digest this notation-heavy solution to the
twelve coins problem.
Ciao, tutti.
I always believed in this. When given a more complex problem or a trickier algorithm it helps enormously if you can express it in mathematical notation and have some calculus at your disposal that manipulates these expressions. In Java and other iterative programming languages it makes sense to work with things like precondition and postcondition (ideally as mathematical expressions) and calculi like weakest precondition, loop invariants etc. I've written a small paper that illustrates the concepts and gives some pointers to literature. I will keep posting examples to solving problems this way in this blog. Also check out this paper for another nice example of deriving a program.
Ok, I promise, this is the last entry about skewtrees. I just wanted to point out that this problem is widely known and dealt with in the algorithms literature under the name "optimal binary search trees". I found that out by accident while googling for something else. Just search for "optimal binary search trees" and you'll see what I mean. There are even top-down approaches that provide almost optimal solutions.
I finally did play with LaTex and write up in more detail the skewtree solution.
I have to say LaTex is very cool. I'm still a beginner but I feel the power behind it. You can throw pretty complex mathematical expressions at it and it lays it out nicely, no sweat. You can find the paper here.
We just discovered that the minimal solution to our problem is formed out of minimal solutions to the subproblems.
Here's a question I should have explored from the beginning: How big is our solution space ? In other words, how many BST's can we construct from a sequence {1,..., n} ?
Well, the BST property (left nodes smaller than root, right nodes larger) makes constructing BST's an exercise in partitioning {1,.., n}. We pick an arbitrary number as root node and the numbers left to it will go into the left subtree which itself will be a BST and the numbers right of it will go into the right subtree which is also a BST. There are n possible root nodes and the rest we express recursively.
So if P(n) is the number of BST of n nodes we get:
P(n) = Sum(k = 1, k <= n of P(k) * P(n - k)).
Here it helps to know your classics. I'm not talking about Homer and The Illiad here. For me it's
A quick look in there reveals that P(n) is a Catalan number and O(4^n/n^(3/2)).
So our solution space is exponential and searching it brute-force will take too long.
We have (*) though so we could try to solve it recursively. Here's (*) repeated again:
(*) cost(a LS RS) = sum(probablities of all nodes in tree) + cost(LS) + cost(RS)
mincost = min (with a = 1 to n of sum(probablities of all nodes in tree) + mincost(LS) + mincost(RS))
Here's the other key observation:
the recursive solution won't be much faster. The recursive calls get called over and over again. That means solutions of bigger trees share the same solution to smaller trees.
Time to re-read the chapter on dynamic programming in the classic.
That did it, here's a small java program that solves it.
The runtime is O(n^3).
Would a greedy algorithm have done it too ? I don't think so. This problem looks too similar to the Matrix Chain Multiplication problem which is a classic dynamic programming problem. For greedy algorithms to work you need to prove that substituting a minimal local solution with a greedy local solution yields the same minimal solution to the whole problem. I couldn't prove that and in fact when looking at the example solutions given in the problem statement you can see that greedy local choices (choose the biggest probability as your root and proceed recursively) don't result in the same trees.
Stumbled upon this interview with this year's winner of the TopCoder contest. Fascinating. I admire people who can think under pressure. I never could do that. I looked at the problems and I decided to give the hardest one a try. I gave myself unlimited time and resources (I was allowed to hit the books).
Here's the problem.
Read on to find out about my vagaries and the eventual discovery of the solution.
In "Prime Obsession" John Derbyshire writes (page 52-53):
"The other factor, much more common among mathematicians in all ages, was perfectionism.... Restaurants illustrate the point. Dishes prepared in the clatter, breakage, and yelling of an overheated kitchen appear in the public area as flawless arrangements on spotless plates, delivered by dapper murmuring waiters....Published mathematical papers often have irritating assertions of the type: 'It now follows that...,' or: 'It is obvious that...,' when it doesn't follow, and it isn't obvious at all, unless you put in the six hours the author did to supply the missing steps and checking them. There is a story about the English mathematician G.H. Hardy, whom we shall meet later. In the middle of delivering a lecture, Hardy arrived at a point in his argument where he said, 'It is now obvious that....' Here he stopped, fell silent, and stood motionless with furrowed brow for a few seconds. Then he walked out of the lecture hall. Twenty minutes later he returned, smiling, and began, 'Yes, it is obvious that...,' "
I'm going to lead you down the paths I took, not too far but far enough to see a glimpse of the overheated kitchen with the clatter and breakage. The "flawless dish" (it's actually more a home-cooked meal) will come in a few days in the form of a LaTex paper. It will be a good exercise in LaTex for me, I haven't done it in a long while. I will give a Java implementation and maybe a Haskell implementation. I'm very much still a Haskell beginner so I don't know if I can do it.
Let's start.
I always try small things first and I also try to find a formalism that lets me explore properties of the entities involved with some form of calculus.
Well, here we have BST's as entities (binary search trees). Taking a small case that's still interesting we consider trees with three nodes. I will use an ascii notation for the trees because drawing pictures in this blog entry form is hard.
Notation:
(a LS RS) is a binary search tree with root a, left subtree LS and right subtree RS. Nodes are numbers and because of the BST property nodes in LS are all smaller than a and nodes in RS are all bigger than a. If a node doesn't have a child we put nil in there.
Example:
(3 (2 1 nil) 4) is a BST rooted at 3, with the left subtree rooted at 2 and right subtree the node 4.
So for the case of three nodes y < x < z because of the BST property we can only have these possible trees:
(x y z), (y nil (x nil z)), (y nil (z x nil)), (z (x y nil) nil), (z (y nil x) nil)
Now, which one of these is best for a given assignment of node access probabilities py, px and pz.
We calculate the cost functions and compare.
cost( (x y z) ) = px + 2 py + 2 pz
cost( (y nil (x nil z)) ) = py + 2 px + 3 pz
cost( (x y z) ) < cost( (y nil (x nil z)) ) means we prefer (x y z) over (y nil (x nil z)).
So
px + 2 py + 2 pz < py + 2 px + 3 pz
or
py < px + pz.
Two observations (not proven yet, but plausible to me at that moment):
- if all probabilities are equal then a balanced BST is the best solution
- we could start with the balanced BST and then do node modifications, bottom-up to reflect the non-uniformly distributed probabilities.
If py > px + pz then we transform our BST from (x y z) to (y nil (x nil z)) by pushing node y up and node x down and to the right.
We can always push nodes up or down and preserve the BST property, even for nodes that have non-trivial subtrees.
Example:
Starting with tree (a b (c LS RS)) we find out that having c as root would be better so we push c up and a down and to the left. c cannot have more than two children though so we hang c's old left subtree LS as the right subtree of a. The BST properties stay fullfilled and we get tree (c (a b LS) RS).
I went down that road. I managed to find tree transformations that improve the situation locally. Didn't get me anywhere. It wasn't clear if visiting each node once was enough. Small examples showed that it wasn't. It was hard to prove anything. Not all was lost though. I needed to express the cost of a node as an expression of the cost of its two subnodes and by looking at the small examples I found an expression which I could also prove by induction.
(*) cost(a LS RS) = sum(probablities of all nodes in tree) + cost(LS) + cost(RS)
Now assume we know (a LS RS) has minimal cost. Does LS have minimal cost too in the set of all possible BST of nodes smaller than a ? Well, if it doesn't that means there is another subtree LS' with smaller cost. We can just plug that one into the tree rooted at a and get a new tree (a LS' RS). According to (*) then (a LS' RS) has smaller cost than (a LS RS). This contradicts our assumption that (a LS RS) has minimal cost, so LS has to be minimal too. Analog RS has to be minimal.
This means the minimal solution to our problem is formed out of minimal solutions to the subproblems.
This is a big step forward and it is time for me to stop for now.
As they say in "Matrix Reloaded": to be concluded.