Clustering columns
Restrict queries on clustering columns and find ranges across clustering segments.
Clustering columns order data within a partition. When a table has multiple clustering columns, the data is stored in nested sort order. The database uses the clustering information to identify where the data is within the partition. Use logical statements for clustering columns to identify the clustering segment and return slices of the data.
When a query contains no restrictions on clustering or index columns, all the data from the partition is returned.
How order impacts clustering restrictions
Because the database uses the clustering columns to determine the location of the data on the partition, you must identify the higher level clustering columns definitively using the equals (=) or IN operators. In a query, you can only restrict the lowest level using the range operators (>, >=, <, or <=).
How data is stored
CREATE TABLE numbers (
key int,
col_1 int,
col_2 int,
col_3 int,
col_4 int,
PRIMARY KEY ((key), col_1, col_2, col_3, col_4));
key | col_1 | col_2 | col_3 | col_4 -----+-------+-------+-------+------- 100 | 1 | 1 | 1 | 1 100 | 1 | 1 | 1 | 2 100 | 1 | 1 | 1 | 3 100 | 1 | 1 | 2 | 1 100 | 1 | 1 | 2 | 2 100 | 1 | 1 | 2 | 3 100 | 1 | 2 | 2 | 1 100 | 1 | 2 | 2 | 2 100 | 1 | 2 | 2 | 3 100 | 2 | 1 | 1 | 1 100 | 2 | 1 | 1 | 2 100 | 2 | 1 | 1 | 3 100 | 2 | 1 | 2 | 1 100 | 2 | 1 | 2 | 2 100 | 2 | 1 | 2 | 3 100 | 2 | 2 | 2 | 1 100 | 2 | 2 | 2 | 2 100 | 2 | 2 | 2 | 3 (18 rows)
The database stores and locates the data using a nested sort order. The data is stored in hierarchy that the query must traverse:
{ “key” : “100” { “col_1” : “1” { “col_2” : “1” { “col_3” : “1” { “col_4” : “1”, “col_4” : “2”, “col_4” : “3” }, “col_3” : “2” { “col_4” : “1”, “col_4” : “2”, “col_4” : “3” } }, “col_2” : “2” { “col_3” : “2” { “col_4” : “1”, “col_4” : “2”, “col_4” : “3” } } }, “col_1” : “2” { “col_2” : “1” { “col_3” : “1” …
To avoid full scans of the partition and to make queries more efficient, the database requires that the higher level columns in the sort order (col_1, col_2, and col_3) are identified using the equals or IN operators. Ranges are allowed on the last column (col_4).
Selecting data from a clustering segment
SELECT * FROM numbers
WHERE key = 100
AND col_1 = 1 AND col_2 = 1 AND col_3 = 1
AND col_4 <= 2;
key | col_1 | col_2 | col_3 | col_4 -----+-------+-------+-------+------- 100 | 1 | 1 | 1 | 1 100 | 1 | 1 | 1 | 2 (2 rows)
The IN operator can impact performance on medium-large datasets. When selecting multiple segments, the database loads and filters all the specified segments.
col_1
segments 1 and
2:SELECT * FROM numbers
WHERE key = 100
AND col_1 IN (1, 2)
AND col_2 = 1 AND col_3 = 1
AND col_4 <= 2;
The following visualization shows all the segments the database must load to filter multiple segments:
key | col_1 | col_2 | col_3 | col_4 -----+-------+-------+-------+------- 100 | 1 | 1 | 1 | 1 100 | 1 | 1 | 1 | 2 100 | 2 | 1 | 1 | 1 100 | 2 | 1 | 1 | 2 (4 rows)
Invalid restrictions
SELECT * FROM numbers
WHERE key = 100
AND col_4 <= 2;
InvalidRequest: Error from server: code=2200 [Invalid query] message="PRIMARY KEY column "col_4" cannot be restricted as preceding column "col_1" is not restricted"
Only restricting top level clustering columns
Unlike partition columns, a query can omit lower level clustering column in logical statements.
SELECT * FROM numbers
WHERE key = 100 AND col_1 = 1
AND col_2 > 1;
The
query returns the following
data:key | col_1 | col_2 | col_3 | col_4 -----+-------+-------+-------+------- 100 | 1 | 2 | 2 | 1 100 | 1 | 2 | 2 | 2 100 | 1 | 2 | 2 | 3 (3 rows)
Returning ranges that span clustering segments
Slicing provides a way to look at an entire clustering segment and find a row that matches values in multiple columns. The slice logical statement finds a single row location and allows you to return all the rows before, including, between, or after the row.
(clustering1, clustering2[, …]) range_operator (value1, value2[, …])
[AND (clustering1, clustering2[, …]) range_operator (value1, value2[, …])]
Slices across full partition
SELECT * FROM numbers
WHERE key = 100
AND (col_1, col_2, col_3, col_4) <= (2, 2, 2, 1);
The database locates the matching row and then returns every record before the identified row in the results set.
SELECT * FROM numbers
WHERE key = 100
AND (col_1, col_2, col_3, col_4) <= (2, 1, 1, 4);
The query finds where the row would be in the order if a row with those values existed and returns all rows before it:
Slices of clustering segments
The same rules apply to slice restrictions when finding a slice on a lower level segment; identify the higher level clustering segments using equals or IN and specify a range on the lower segments.
SELECT * FROM numbers
WHERE key = 100 AND col_1 = 1 AND col_2 = 1
AND (col_3, col_4) >= (1, 2)
AND (col_3, col_4) < (2, 3);
When finding a between range, the two slice statements must be on the same columns for lowest columns in the hierarchy.
Invalid queries
SELECT * FROM numbers
WHERE key = 100 AND col_1 = 1
AND (col_2, col_3, col_4) >= (1, 1, 2)
AND (col_3, col_4) < (2, 3);
InvalidRequest: Error from server: code=2200 [Invalid query] message="Column "col_3" cannot be restricted by two inequalities not starting with the same column"
Procedure
CREATE TABLE cycling.events (
Year int,
Start_Month int,
Start_Day int,
End_Month int,
End_Day int,
Race TEXT,
Discipline TEXT,
Location TEXT,
UCI_code TEXT,
PRIMARY KEY ((YEAR, Discipline), Start_Month, Start_Day, Race));
start_month
and start_day
for
the range using a slice:
SELECT start_month as month, start_day as day, race FROM cycling.events
WHERE year = 2017 AND discipline = 'Road'
AND (start_month, start_day) < (2, 14) AND (start_month, start_day) > (1, 15);
month | day | race -------+-----+----------------------------------------------------------------- 1 | 23 | Vuelta Ciclista a la Provincia de San Juan 1 | 26 | Cadel Evans Great Ocean Road Race - Towards Zero Race Melbourne 1 | 26 | Challenge Mallorca: Trofeo Porreres-Felanitx-Ses Salines-Campos 1 | 28 | Cadel Evans Great Ocean Road Race 1 | 28 | Challenge Mallorca: Trofeo Andratx-Mirador des Colomer 1 | 28 | Challenge Mallorca: Trofeo Serra de Tramuntana -2017 1 | 29 | Cadel Evans Great Ocean Road Race 1 | 29 | Grand Prix Cycliste la Marseillaise 1 | 29 | Mallorca Challenge: Trofeo Palma 1 | 31 | Ladies Tour of Qatar 2 | 1 | Etoile de Besseges 2 | 1 | Jayco Herald Sun Tour 2 | 1 | Volta a la Comunitat Valenciana 2 | 5 | G.P. Costa degli Etruschi 2 | 6 | Tour of Qatar 2 | 9 | South African Road Championships 2 | 11 | Trofeo Laigueglia 2 | 12 | Clasica de Almeria (18 rows)