bad truncation of utf8 as column name

Bug #319796 reported by Andrew Hutchings
Affects: Drizzle
Status: Fix Released
Importance: Low
Assigned to: Padraig O'Sullivan

Bug Description

When running a query with a long UTF8 column name, the name is truncated halfway through a UTF8 character. For example:

drizzle> SELECT "☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃"\G
*************************** 1. row ***************************
☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃�: ☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃☃
1 row in set (0.00 sec)

Revision history for this message
Jay Pipes (jaypipes) wrote :

Confirmed on trunk.

Changed in drizzle:
importance: Undecided → Low
milestone: none → low-hanging-fruit
status: New → Confirmed
Revision history for this message
Jay Pipes (jaypipes) wrote :

It's only the column name which is truncated, not the column contents. Should be a fairly easy fix...so assigning to low-hanging-fruit.

Revision history for this message
Padraig O'Sullivan (posulliv) wrote :

Assigning to myself. Seems like it could be another nice small bug to work on.

Changed in drizzle:
assignee: nobody → posulliv
Revision history for this message
Padraig O'Sullivan (posulliv) wrote :

So I've been trying to track down this bug and what I'm seeing is that the field with the column name passed in to the mysql_select function is already truncated. That field comes from the parameter select_lex->item_list passed to mysql_select from the handle_select function. Tracing the execution flow, I get to the mysql_parse function and the only place I can see select_lex->item_list populated is in sql_yacc.cc.

I know that sql_yacc.cc is generated from the sql_yacc.yy grammar by bison. So that leads me to think there could be a problem in the sql_yacc.yy file with dealing with UTF8 column names greater than a certain length.

Does this sound like I'm on the right track? Or am I missing something really obvious here? If I'm on the right track, I'll keep at it but I didn't want to waste too much time going down a track which doesn't lead anywhere.

Revision history for this message
Jay Pipes (jaypipes) wrote : Re: [Bug 319796] Re: bad truncation of utf8 as column name

Padraig wrote:
> So I've been trying to track down this bug and what I'm seeing is that
> the field with the column name passed in to the mysql_select function is
> already truncated. That field comes from the parameter
> select_lex->item_list passed to mysql_select from the handle_select
> function. Tracing the execution flow, I get to the mysql_parse function
> and the only place I can see select_lex->item_list populated is in
> sql_yacc.cc.
>
> I know that sql_yacc.cc is generated from the sql_yacc.yy grammar by
> bison. So that leads me to think there could be a problem in the
> sql_yacc.yy file with dealing with UTF8 column names greater than a
> certain length.
>
> Does this sound like I'm on the right track? Or am I missing something
> really obvious here? If I'm on the right track, I'll keep at it but I
> didn't want to waste too much time going down a track which doesn't lead
> anywhere.

Yes, that sounds right. In DRIZZLEparse and DRIZZLELex, you'll notice
that the utf8_mb macros and functions are used (overly) extensively. I
wouldn't be surprised to find that somewhere in that thicket we are
truncating something...

Keep digging. You're on the right trail I believe.

-jay

Revision history for this message
Padraig O'Sullivan (posulliv) wrote :

Ok, so I managed to track this bug down to the following issue. The Item::set_name function contains this piece of code, which sets the column name:

name= sql_strmake(str, (name_length= cmin(length,(unsigned int)MAX_ALIAS_NAME)));

MAX_ALIAS_NAME is defined in unireg.h to be 256. After discussing this with Monty a little on IRC, he mentioned that he thinks MAX_ALIAS_NAME is intended to be a number of characters, but the implementation treats it as a number of bytes. This means that a column alias is truncated as soon as it is longer than 256 bytes, when it should only be truncated if it is longer than 256 characters.
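
To illustrate why a byte limit splits a character: the snowman (U+2603) takes three bytes in UTF8, and 256 = 85 * 3 + 1, so a 256-byte cut leaves 85 whole snowmen plus one stray lead byte. The following is only a minimal sketch of that arithmetic, not Drizzle code. The names kMaxAliasBytes, is_utf8_trail, truncate_bytes and dangling_tail_bytes are hypothetical, and the input is assumed to be otherwise valid UTF8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the 256 limit discussed above.
constexpr std::size_t kMaxAliasBytes = 256;

// True for UTF8 continuation (trail) bytes, which have the bit pattern 10xxxxxx.
inline bool is_utf8_trail(unsigned char c) { return (c & 0xC0) == 0x80; }

// Byte-based truncation: keeps at most max_bytes bytes, regardless of
// where character boundaries fall.
inline std::string truncate_bytes(const std::string& s, std::size_t max_bytes) {
  return s.substr(0, max_bytes);
}

// Returns how many bytes of an incomplete character are left dangling at
// the end of s, or 0 if s ends on a character boundary.
inline std::size_t dangling_tail_bytes(const std::string& s) {
  std::size_t i = s.size();
  std::size_t trail = 0;
  // Walk backwards over any trail bytes to find the last lead byte.
  while (i > 0 && is_utf8_trail(static_cast<unsigned char>(s[i - 1]))) {
    --i;
    ++trail;
  }
  if (i == 0) return trail;  // nothing but trail bytes (or empty string)
  const unsigned char lead = static_cast<unsigned char>(s[i - 1]);
  // Sequence length announced by the lead byte.
  const std::size_t expected = lead < 0x80          ? 1
                               : (lead >> 5) == 0x6 ? 2
                               : (lead >> 4) == 0xE ? 3
                                                    : 4;
  return (trail + 1 == expected) ? 0 : trail + 1;
}
```

Applied to a string of 100 snowmen, truncate_bytes with a limit of 256 should leave one dangling byte, which a terminal would typically render as a replacement character.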

Thus, based on Monty's advice I updated this small piece of code to work on utf8 characters rather than bytes. Basically, I get the length of the utf8 string in characters and use that length to decide whether the alias name should be truncated or not.

I wanted to use the U8_LENGTH macro for this but I was unable to get it to work for me. So the way I count the utf8 characters in the strings is as follows:

/* get the length of UTF8 str in characters */
uint32_t num_chars= 0;
for (uint32_t i= 0; str[i]; i++)
{
  if (!U8_IS_TRAIL(str[i]))
    num_chars++;
}

Does that look like an ok method for counting the utf8 characters in a utf8 string? I pushed the fix for this bug to the following branch:

lp:~posulliv/drizzle/bug-fixes

and I proposed it for merging. Let me know if there is a better way for counting utf8 characters.

-Padraig

Changed in drizzle:
status: Confirmed → Fix Committed
Changed in drizzle:
status: Fix Committed → Fix Released
Monty Taylor (mordred)
Changed in drizzle:
milestone: low-hanging-fruit → aloha