A character set is a set of symbols and encodings. Character data
types are represented as Unicode 2.0 sequences in Derby.
How collation works in Derby
Derby supports a wide range
of character sets and encodes all of the character sets by using the Unicode
support provided by the java.lang.Character class in the Java Virtual Machine
(JVM) in which the Derby database
runs. See the Java API documentation for the java.lang.Character class for
the exact level of Unicode Standard that is supported.
A collation is
a set of rules for comparing characters in a character set. In Derby the
collation rules affect comparisons of the CHAR and VARCHAR data types. Collation
rules also affect how the LIKE Boolean operator processes the CHAR, VARCHAR,
CLOB, and LONG VARCHAR data types.
The default Derby collation
rule is based on the binary Unicode values of the characters. So a character
is greater than (<), equal to (=), or less than (>) another character based
on the numeric comparison of the Unicode values. This rule allows for very
efficient comparisons of character strings.
Note: When LIKE comparisons
are used,
Derby compares
one character at a time for non-metacharacters. This is different than the
way
Derby processes = comparisons.
The comparisons with the = operator compare the entire character string on
left side of the = operator with the entire character string on the right
side of the = operator. See the
Differences between LIKE and equal (=) comparisons section
below.
Territory-based collation
Derby also
supports the ability to define collation rules that are appropriate to a territory,
and is referred to as territory-based collation. Derby supports the territories
that Java supports.
You can specifically set the territory of a database
when you create the database. If you do not specify a territory, Derby uses
the default territory of the JVM in which the database is created. Each JVM
can support many territories that are independent from the default territory
for the JVM. Collation support for these additional territories is provided
through the java.text.RuleBasedCollator class and the set
of rules for these territories. Refer to the JVM specification for details
of how these rules are used to provide territory specific collation. Derby currently
supports only running those rules that can be loaded dynamically from the
running JVM based on the territory attribute. Overrides to these rules by
the user are not supported.
The territory-based collation in Derby affects
how the CHAR and VARCHAR data types are compared. Specifying territory-based
collation also impacts how the LIKE Boolean operator processes CHAR, VARCHAR,
CLOB, and LONG VARCHAR data.
Territory-based collation does add extra
processing overhead to all character-based comparison operations.
Database attributes that control collation
When
you create a
Derby database,
the attributes that you set determine the collation that is used with all
of character data in the database. For example:
Table 1. The create database
attributes that control collation
| Example create commands |
Collation is driven by |
| jdbc:derby:abcDB;create=true |
Unicode codepoint collation (UCS_BASIC), which is the
default collation for Derby databases. |
| jdbc:derby:abcDB;create=true;territory=es_MX |
Unicode codepoint collation (UCS_BASIC). The collation attribute
is not set. |
| jdbc:derby:abcDB;create=true;collation=TERRITORY_BASED |
The territory of the JVM, since the territory attribute
is not set. Tip: To determine the territory of the JVM, run Locale.getDefault().
|
| jdbc:derby:abcDB;create=true;territory=es_MX;collation=TERRITORY_BASED |
The territory attribute. |
Collation examples
With Unicode codepoint collation
(UCS_BASIC), the numerical values of the Unicode encoding of the characters
are used directly for ordering. For example, the FRUIT table contains the NAME
column that uses the VARCHAR(20) data type. The contents of the NAME column
are:
- orange
- apple
- Banana
- Pineapple
- Grape
UCS_BASIC collation sorts all lower case letters before upper
case letters. The statement
SELECT * FROM FRUIT ORDER BY NAME returns:
- apple
- orange
- Banana
- Grape
- Pineapple
If the database is created with the territory attribute set to
en_US (English language, United States country code), and the collation attribute
set to TERRITORY_BASED, the results of the statement
SELECT * FROM
FRUIT ORDER BY NAME returns:
- apple
- Banana
- Grape
- orange
- Pineapple
The collation set for the database also impacts comparison operators
on character data types. For example, the statement
SELECT * FROM
FRUIT WHERE NAME > 'Banana' ORDER BY NAME returns:
UCS_BASIC collation Territory-based collation
Grape Grape
Pineapple orange
Pineapple
For information on creating case-insensitive databases, see
Creating a database with territory-based collation.
Differences between LIKE and equal (=)
comparisons
When you use territory-based collation, the comparisons
can return different results when you use the LIKE and equal
(=) operators. For example, suppose that the
Derby database
is set to use a territory where the character 'z' has same collation elements
as 'xy'. Consider the following two WHERE clauses:
- WHERE 'zcb' = 'xycb'
- WHERE 'zcb' LIKE 'xy_b'
For WHERE clause 1, Derby returns
TRUE because the collation elements for the entire string 'zcb' will match
the collation elements of the entire string 'xycb'.
For WHERE clause
2, Derby returns FALSE
because collation element for character 'z' does not match the collation element
for character 'x'. In addition, when metacharacter such as an underscore is
used with the LIKE operator, the metacharacter counts for one character in
the string value. A clause like WHERE 'xycb' LIKE '_cb' returns FALSE because
'x' is compared to the metacharacter _ and 'y' does not match 'c'.