Looking deeper into the linked Xapian bug, it seems we may be able to shoplift some code from the Pinot engine, which is based on cjk-tokenizer but ported to glib2 instead of libunicode. As described in the Xapian bug, it does depend on the Dijon namespace though, so we must remove the Dijon usage, as they've done for the Xapian patch based on the Pinot code.
As olly describes in the Xapian bug, this is slightly dangerous though: it may have unpredictable consequences if we ever see Unicode version mismatches between glib2 and Xapian, or if the two differ in their error handling (which they almost certainly do).
All of this still leaves open the question of how to handle this in S-C with Python, as it's crucial that S-C and u-p-a use the *exact* same method for tokenization. If there's a mismatch between the query parser in u-p-a and how the indexed terms are generated in the S-C index, we'll see no, weird, or random results.
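To make the mismatch risk concrete, here is a minimal Python sketch of the idea that the indexer and the query side go through one shared tokenizer. It is a simplified stand-in, not the cjk-tokenizer or Pinot code: the function names (`tokenize`, `index_terms`, `query_terms`), the rough code-point ranges, and the choice of overlapping bigrams are all assumptions made for illustration.

```python
import unicodedata
from itertools import groupby


def is_cjk(ch):
    """Rough check (assumed ranges): CJK ideographs, kana, Hangul syllables."""
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FFF
            or 0x3040 <= cp <= 0x30FF
            or 0xAC00 <= cp <= 0xD7A3)


def _kind(ch):
    # Classify each character so runs of the same kind can be grouped.
    if is_cjk(ch):
        return "cjk"
    if ch.isalnum():
        return "word"
    return "sep"


def tokenize(text, ngram=2):
    """Hypothetical shared tokenizer: words as-is, CJK runs as n-grams."""
    # Normalize once, in one place, so both sides see identical input.
    text = unicodedata.normalize("NFKC", text).lower()
    terms = []
    for kind, group in groupby(text, key=_kind):
        run = "".join(group)
        if kind == "word":
            terms.append(run)
        elif kind == "cjk":
            if len(run) <= ngram:
                terms.append(run)
            else:
                # Overlapping n-grams over the CJK run.
                terms.extend(run[i:i + ngram]
                             for i in range(len(run) - ngram + 1))
    return terms


def index_terms(text):
    # Indexing side (S-C in the discussion above).
    return tokenize(text)


def query_terms(text):
    # Query side (u-p-a in the discussion above); must match index_terms.
    return tokenize(text)
```

If either side used a different n-gram size or a different normalization, the terms produced for the same text would no longer line up and lookups would silently miss. With u-p-a not being Python, the same rules would have to be reimplemented there, which is exactly where the divergence could creep in.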