Yeah, I was thinking about that some years ago when I first discovered Sikuli. The problem was that I had to train Sikuli separately for different systems, and sometimes the time needed to write a new script was 5x higher than just clicking through the stuff manually for a month. So my automation with "manual learning" (writing scripts) needed 5 months to show results, and worse, it constantly needed improvements. So I abandoned Sikuli at that point; the pivot to Java was also a reason to abandon it, since I don't have time for prototyping in Java.

Over these years the idea stayed the same, but now I know a little bit more about the underlying technology that can enable it. I think what we need is http://caffe.berkeleyvision.org/ or http://deeplearning.net/software/theano/ plus a GPU speedup. But that's just the technology.

The overall scheme: Sikuli maintains a database (or better said, a matrix) of images known to it. Yes, that means a lot of data. Nobody says these should be stored as raw images like we do now. We don't store images in our brain; we store the effect of those images. So the images are just the source data.

A request to the database (I will call it the 'matrix', just to keep everyone confused) can answer these questions:

1/ is the current state of the screen known to it or not
2/ are there any known areas
3/ are there any known areas with unusual content
4/ are there any unknown areas
5/ is there any unknown state

So the idea is that the system SHOULD KNOW ABOUT ALL states. The non-important states can be generalized, but the point is that the system is ALWAYS AWARE of what's going on on the screen, and just chooses whether to react or not.

An 'area' is the thing that is specific to us. If we are operating with GUI windows, we need to train the concept of a window. This trained data can then be contained in some 'domain' (a specific part of the matrix), so that Sikuli can count windows without second-guessing their content.

So it is a layered concept model, much like in humans:

1. Generic screen - I see something vs. I don't
2. Windows - I see windows, I know what they look like
3. Context - I know windows, I track what is inside them, how many there are and where they are

Things that can help to understand what I mean:

- HMMs (hidden Markov models) and their parameters, as used in speech synthesis/recognition
- Predator, a self-learning algorithm for object recognition in video

So we need to choose how we train the system, to build a loop:

[observe] --> [detect] --> [action] --> [learn] --> (repeat)

[observe] is just watching the screen and matching it against the matrix.

[detect] is evaluating the condition of the system - is it 'worried' that there is something unknown or something is not a good match?

[action] is taken AUTOMATICALLY when the system is 99.9% sure of what's going on (e.g. it has done that in those conditions 1000 times and everything was ok), or, if the system is not sure, a person can be ASKED FOR ADVICE.

[learn] - depending on the outcome of the person's feedback, the new information about the image, the conditions and the desired outcome is saved (associated into the matrix).

Yep, I would do this in Python, because it doesn't require recompilation and is much, much faster for prototyping. I don't think we have such a chance with Java - I would spend months just trying to call one library from another.
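To make the loop concrete, here is a minimal Python sketch of how I imagine it. All names here (Matrix, observe_screen, ask_person, CONFIDENCE) are placeholders I made up for illustration, not existing Sikuli APIs, and the perception part (screenshot -> state fingerprint) is stubbed out - in reality it would come from a Caffe/Theano model:

```python
# Minimal sketch of the [observe] -> [detect] -> [action] -> [learn] loop.
# All names are hypothetical placeholders; perception is stubbed out.

import random

CONFIDENCE = 0.999  # act automatically only above this threshold


class Matrix:
    """The 'matrix': a store of known screen states and learned reactions."""

    def __init__(self):
        # state fingerprint -> {"action": ..., "seen": confirmation count}
        self.states = {}

    def match(self, fingerprint):
        """Return (entry, confidence) for a known state, or (None, 0.0)."""
        entry = self.states.get(fingerprint)
        if entry is None:
            return None, 0.0
        # crude confidence: grows with the number of confirmed repetitions
        return entry, entry["seen"] / (entry["seen"] + 1.0)

    def learn(self, fingerprint, action):
        """Associate a state with a desired action, or reinforce it."""
        entry = self.states.setdefault(fingerprint, {"action": action, "seen": 0})
        entry["seen"] += 1


def observe_screen():
    """Stub: in reality, a screenshot reduced to a state fingerprint
    (e.g. features from a deep net, not raw pixels)."""
    return random.choice(["login_window", "error_dialog", "desktop"])


def ask_person(fingerprint):
    """Stub: ask a human what to do in an unknown or uncertain state."""
    return input(f"Unknown state {fingerprint!r}, what should I do? ")


def perform(action):
    print("doing:", action)


def run(matrix, steps=10):
    for _ in range(steps):
        fingerprint = observe_screen()                  # [observe]
        entry, confidence = matrix.match(fingerprint)   # [detect]
        if entry and confidence >= CONFIDENCE:
            perform(entry["action"])                    # [action] automatic
            matrix.learn(fingerprint, entry["action"])  # [learn] reinforce
        else:
            action = ask_person(fingerprint)            # [action] advised
            matrix.learn(fingerprint, action)           # [learn] associate


if __name__ == "__main__":
    run(Matrix())
```

Note that the crude confidence measure seen/(seen+1) only crosses 99.9% after about a thousand confirmations, which matches the "did that 1000 times and everything was ok" rule above; a real implementation would also factor in the match quality from the matrix, not just repetition counts.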