Hmm, it does indeed seem awfully late in the release process for some fairly major distro-specific patching of xapian-core. It's quite likely there will be a better solution before 11.10, and if not we can probably get the cjk-tokeniser approach in cleanly upstream by then.
My thought would be to package the cjk-tokeniser code in its own little C++ library (which can link to libxapian for the Unicode stuff, since that's a public API), and then knock up a simple Python wrapper around it (with SWIG or similar, or even by hand). Then you can use this for CJK locales and Xapian's own code for the others. That way any breakage won't affect other users of Xapian: it can only break S-C in CJK locales, where the search doesn't really work currently anyway.
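To make the split concrete, here is a rough Python sketch of just the locale dispatch, not of the tokeniser itself. The names `is_cjk_locale`, `pick_tokeniser`, `cjk_tokenise` and `default_tokenise` are all made up for illustration; the first callable stands in for the hypothetical wrapper around the cjk-tokeniser library and the second for whatever S-C already does with Xapian.

```python
# Language codes treated as CJK here; an assumption, adjust as needed.
CJK_LANGUAGES = {"zh", "ja", "ko"}


def is_cjk_locale(locale_name):
    """Return True if a locale name such as 'zh_CN.UTF-8' is Chinese, Japanese or Korean."""
    if not locale_name:
        return False
    # Strip territory, encoding and modifier to leave the bare language code.
    lang = locale_name.replace("-", "_").split("_")[0]
    lang = lang.split(".")[0].split("@")[0].lower()
    return lang in CJK_LANGUAGES


def pick_tokeniser(locale_name, cjk_tokenise, default_tokenise):
    """Choose the tokenising callable for this locale.

    cjk_tokenise is a placeholder for the (hypothetical) Python wrapper around
    the cjk-tokeniser library; default_tokenise is a placeholder for the
    existing Xapian-based code path, which stays untouched for everyone else.
    """
    if is_cjk_locale(locale_name):
        return cjk_tokenise
    return default_tokenise
```

In practice the locale name could come from something like `locale.getlocale()[0]` or the `LANG` environment variable. The point is only that non-CJK users never reach the new code.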