I've been looking into CJK indexing in Xapian and the prospects are slightly dire... See http://trac.xapian.org/ticket/180
The upshot is that this will require some work. There are libs we can pull in for this (like http://code.google.com/p/cjk-tokenizer/), but we'll then have to manually write some glue code to wire it into the indexing and query-parsing subsystems for S-C and u-p-a.
Anyone with a simpler solution is more than welcome to chime in :-)
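
To give an idea of what the tokenizing half of that glue might involve, here is a rough sketch in plain Python. It assumes overlapping bigrams as the splitting strategy, which is one common way to handle text without word separators; I'm not claiming this is what cjk-tokenizer does internally. The names is_cjk and cjk_bigrams are made up for illustration, and the code range check is deliberately crude. The same function would have to be applied both when indexing and when parsing queries so the terms match.

```python
def is_cjk(ch):
    """Crude check (an assumption, not exhaustive): CJK Unified
    Ideographs, Hiragana/Katakana, and Hangul syllables."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def _bigrams(run):
    """Overlapping two-character terms for one run of CJK characters."""
    if not run:
        return []
    if len(run) == 1:
        return [run]  # a lone character becomes its own term
    return [run[i:i + 2] for i in range(len(run) - 1)]


def cjk_bigrams(text):
    """Return bigram terms for every CJK run in text.

    Non-CJK characters only act as separators here; they are left
    for the regular (whitespace-based) term generator to handle.
    """
    terms = []
    run = ""
    for ch in text:
        if is_cjk(ch):
            run += ch
        else:
            terms.extend(_bigrams(run))
            run = ""
    terms.extend(_bigrams(run))  # flush the trailing run
    return terms
```

The resulting terms would then be handed to whatever the indexer and query parser use to add terms; that part depends on the S-C and u-p-a code and is not shown here.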