The task of finding gaps and ranges in SQL comes up very often in practice. The basic premise is that you have a sequence of numbers or date and time values that should be spaced at a fixed interval, but some elements are missing. Finding gaps means identifying the ranges of values that are missing from the sequence, while finding ranges means identifying the continuous runs of values that exist. To demonstrate the techniques, I will use a table named T1 that holds a numeric sequence in column col1 with an interval of one, and a table named T2 that holds a sequence of dates in column col1 with an interval of one day. Here is the code that creates T1 and T2 and fills them with test data:

SET NOCOUNT ON;
USE TSQL2012;
-- dbo.T1 (numeric sequence with unique values, interval: 1)
IF OBJECT_ID('dbo.T1', 'U') IS NOT NULL DROP TABLE dbo.T1;
CREATE TABLE dbo.T1
(
  col1 INT NOT NULL CONSTRAINT PK_T1 PRIMARY KEY
);
GO
INSERT INTO dbo.T1(col1)
  VALUES(2),(3),(7),(8),(9),(11),(15),(16),(17),(28);
-- dbo.T2 (temporal sequence with unique values, interval: 1 day)
IF OBJECT_ID('dbo.T2', 'U') IS NOT NULL DROP TABLE dbo.T2;
CREATE TABLE dbo.T2
(
  col1 DATE NOT NULL CONSTRAINT PK_T2 PRIMARY KEY
);
GO
INSERT INTO dbo.T2(col1) VALUES
  ('20120202'), ('20120203'), ('20120207'), ('20120208'), ('20120209'),
  ('20120211'), ('20120215'), ('20120216'), ('20120217'), ('20120228');

Gaps

As discussed earlier, finding gaps means identifying the ranges of values that are missing from a sequence. For our test data, the required result for the numeric sequence in T1 is:

rangestart  rangeend
----------  --------
4           6
10          10
12          14
18          27

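And here is the desired result for the sequence of dates in T2:

rangestart  rangeend
----------  ----------
2012-02-04  2012-02-06
2012-02-10  2012-02-10
2012-02-12  2012-02-14
2012-02-18  2012-02-27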

In versions of SQL Server before SQL Server 2012, the techniques for finding gaps were fairly expensive and sometimes complex. With the introduction of the LAG and LEAD functions, it became possible to solve the problem simply and efficiently. Using LEAD, return for each value in col1 (call it cur) the next value in the sequence (call it nxt). Then filter only the pairs in which the difference between nxt and cur is greater than the interval. Finally, add one interval to cur and subtract one interval from nxt to get the boundaries of the gap. Here is the complete solution for the numeric sequence, along with its execution plan:
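A minimal sketch of the steps just described, assuming the dbo.T1 table created above. The CTE name C and the output column names rangestart and rangeend are illustrative choices:

```sql
-- Pair each value (cur) with the next value in the sequence (nxt),
-- then keep only the pairs that are more than one interval apart.
WITH C AS
(
  SELECT col1 AS cur,
    LEAD(col1) OVER(ORDER BY col1) AS nxt
  FROM dbo.T1
)
SELECT cur + 1 AS rangestart,  -- first missing value after cur
       nxt - 1 AS rangeend     -- last missing value before nxt
FROM C
WHERE nxt - cur > 1;
```

For the last value in the sequence, LEAD returns NULL, so the comparison is not true and the row is filtered out without any special handling.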

Note how efficient this plan is: it performs a single ordered scan of the index on column col1. To apply the same technique to the date sequence, use the DATEDIFF function to calculate the difference between cur and nxt, and the DATEADD function to add or subtract the interval:
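A sketch of the date variant under the same assumptions (CTE name C, output columns rangestart and rangeend), this time against dbo.T2:

```sql
-- Same idea as for the numeric sequence, but the difference is measured
-- with DATEDIFF and the interval is added or subtracted with DATEADD.
WITH C AS
(
  SELECT col1 AS cur,
    LEAD(col1) OVER(ORDER BY col1) AS nxt
  FROM dbo.T2
)
SELECT DATEADD(day,  1, cur) AS rangestart,  -- first missing date after cur
       DATEADD(day, -1, nxt) AS rangeend     -- last missing date before nxt
FROM C
WHERE DATEDIFF(day, cur, nxt) > 1;
```

For a different interval, such as hours, the date part passed to DATEDIFF and DATEADD would change accordingly.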

Ranges

Finding ranges means identifying the continuous runs of existing values. Here is the expected result for the numeric sequence:

start_range  end_range
-----------  ---------
2            3
7            9
11           11
15           17
28           28

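And here is the required result for the sequence of dates:

start_range  end_range
-----------  ----------
2012-02-02   2012-02-03
2012-02-07   2012-02-09
2012-02-11   2012-02-11
2012-02-15   2012-02-17
2012-02-28   2012-02-28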

One of the most efficient solutions to the range problem relies on ranking. Use the DENSE_RANK function to produce a sequence of integers ordered by col1, and calculate the difference between col1 and the dense rank (drnk), like this:

SELECT col1,
  DENSE_RANK() OVER(ORDER BY col1) AS drnk,
  col1 - DENSE_RANK() OVER(ORDER BY col1) AS diff
FROM dbo.T1;

Note that the difference is the same for all values within a range and unique to each range. Within a range, col1 and drnk both increase by one interval from row to row, so their difference stays constant. When moving to the next range, col1 increases by more than one interval while drnk still increases by just one, so the difference in each subsequent range is greater than in the previous one. Because the difference is constant within a range and unique to it, it can serve as a group identifier. All that remains is to group the rows by this difference and return the minimum and maximum col1 values in each group:

WITH C AS
(
  SELECT col1,
    col1 - DENSE_RANK() OVER(ORDER BY col1) AS grp
  FROM dbo.T1
)
SELECT MIN(col1) AS start_range, MAX(col1) AS end_range
FROM C
GROUP BY grp;

The execution plan for this solution is shown in the figure:

The plan is very efficient because the dense rank calculation relies on the ordering of the index on col1. You may be wondering why I use the DENSE_RANK function rather than ROW_NUMBER. The reason is that the uniqueness of the sequence values is not always guaranteed. With ROW_NUMBER, the technique works only if the values are unique (as they are in our test data) and fails if duplicates are allowed. With DENSE_RANK, the solution works for both unique and nonunique values, which is why I always prefer DENSE_RANK.
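To see the difference, here is a small sketch that uses a hypothetical table variable @T (not part of the test data) in which the value 3 appears twice:

```sql
-- Hypothetical sample with a duplicate: the values 2, 3, 3, 4 form one range.
DECLARE @T TABLE (col1 INT NOT NULL);
INSERT INTO @T(col1) VALUES (2),(3),(3),(4),(7),(8);

SELECT col1,
  col1 - DENSE_RANK() OVER(ORDER BY col1) AS grp_dense,   -- 1, 1, 1, 1, 3, 3
  col1 - ROW_NUMBER() OVER(ORDER BY col1) AS grp_rownum   -- 1, 1, 0, 0, 2, 2
FROM @T
ORDER BY col1;
```

With DENSE_RANK, the rows 2, 3, 3, 4 share one group identifier. With ROW_NUMBER, the duplicate shifts the numbering, so the same range is split into two groups (2 to 3 and 3 to 4).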

The same technique applies to date and time sequences, but the solution is less obvious. Recall that the technique produces a group identifier: a value that is the same for all members of one range and different from the values produced for other ranges. In a temporal sequence, col1 and the dense rank use different units: the former advances in days, the latter in integer steps. To make the technique work, subtract from col1 as many intervals as the dense rank indicates, using the DATEADD function. The result is a date that is the same for all members of one range and different from the dates produced for other ranges.

Here is the code for the completed solution:

WITH C AS
(
  SELECT col1,
    DATEADD(day, -1 * DENSE_RANK() OVER(ORDER BY col1), col1) AS grp
  FROM dbo.T2
)
SELECT MIN(col1) AS start_range, MAX(col1) AS end_range
FROM C
GROUP BY grp;

As you can see, instead of directly subtracting the result of the dense rank function from col1, we use DATEADD to subtract the dense rank multiplied by the interval, i.e. day, from col1.

There are many tasks that require range calculation techniques, including reports on availability, activity periods, and others. The same technique can be used to solve the classic problem of packing date intervals. Let's say there is a table like this with information about date intervals:

IF OBJECT_ID('dbo.Intervals', 'U') IS NOT NULL DROP TABLE dbo.Intervals;
CREATE TABLE dbo.Intervals
(
  id        INT  NOT NULL,
  startdate DATE NOT NULL,
  enddate   DATE NOT NULL
);
INSERT INTO dbo.Intervals(id, startdate, enddate) VALUES
  (1, '20120212', '20120220'),
  (2, '20120214', '20120312'),
  (3, '20120124', '20120201');

These intervals can represent periods of activity, validity, or any other kind of period. The task is to pack the intervals that fall within a given period (which starts at @from and ends at @to). In other words, you need to merge intervals that overlap or are immediately adjacent. Here is the expected result for the test data and the period from January 1, 2012 to December 31, 2012:

rangestart  rangeend
----------  ----------
2012-01-24  2012-02-01
2012-02-12  2012-03-12

In the solution below, the GetNums function, described in the article "Auxiliary virtual tables of numbers", is used to generate the sequence of dates that fall within the given period. The code defines a CTE named Dates that represents this set of dates. It then joins Dates (alias D) with the Intervals table (alias I), matching each date with the intervals that contain it, using the join predicate D.dt BETWEEN I.startdate AND I.enddate. Next, the code uses the technique described above to compute a group identifier (call it grp) that identifies the ranges; the query that does this defines a CTE named Groups. Finally, the outer query groups the rows by grp and returns the minimum and maximum dates of each range, which are the boundaries of the packed intervals. Here is the complete solution:

DECLARE @from AS DATE = '20120101', @to AS DATE = '20121231';
WITH Dates AS
(
  SELECT DATEADD(day, n-1, @from) AS dt
  FROM dbo.GetNums(1, DATEDIFF(day, @from, @to) + 1) AS Nums
),
Groups AS
(
  SELECT D.dt,
    DATEADD(day, -1 * DENSE_RANK() OVER(ORDER BY D.dt), D.dt) AS grp
  FROM dbo.Intervals AS I
    JOIN Dates AS D
      ON D.dt BETWEEN I.startdate AND I.enddate
)
SELECT MIN(dt) AS rangestart, MAX(dt) AS rangeend
FROM Groups
GROUP BY grp;
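The query depends on dbo.GetNums, which is defined in another article and is not shown here. As a hedged stand-in, the following sketch assumes the function takes a low and a high bound and returns the integers in that range in a column named n, which is how the query uses it. The internal CTE names are illustrative, and the function must be created in its own batch before the query is run:

```sql
-- Assumed stand-in for dbo.GetNums: returns the integers @low..@high in column n.
CREATE FUNCTION dbo.GetNums(@low AS BIGINT, @high AS BIGINT) RETURNS TABLE
AS
RETURN
  WITH
    L0   AS (SELECT c FROM (VALUES(1),(1)) AS D(c)),               -- 2 rows
    L1   AS (SELECT 1 AS c FROM L0 AS A CROSS JOIN L0 AS B),       -- 4 rows
    L2   AS (SELECT 1 AS c FROM L1 AS A CROSS JOIN L1 AS B),       -- 16 rows
    L3   AS (SELECT 1 AS c FROM L2 AS A CROSS JOIN L2 AS B),       -- 256 rows
    L4   AS (SELECT 1 AS c FROM L3 AS A CROSS JOIN L3 AS B),       -- 65,536 rows
    L5   AS (SELECT 1 AS c FROM L4 AS A CROSS JOIN L4 AS B),       -- 4,294,967,296 rows
    Nums AS (SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS rownum
             FROM L5)
  SELECT TOP(@high - @low + 1) @low + rownum - 1 AS n
  FROM Nums
  ORDER BY rownum;
GO
```

The TOP filter limits the output to the requested count, so for the 366 days of 2012 only 366 numbers are returned, not the full cross join.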

Note that this solution does not perform very well if the intervals span long periods of time. And this is understandable, because the solution unpacks each period into separate dates.

There are variations of the range problem that are considerably harder than the basic version. Suppose, for example, that gaps up to a certain size should be ignored; say, in the numeric sequence, we want to ignore gaps where adjacent values differ by 2 or less. Then the expected result is:

rangestart  rangeend
----------  --------
2           3
7           11
15          17
28          28

Note that the values 7, 8, 9, and 11 all belong to the same range, which starts at 7 and ends at 11. The gap between 9 and 11 is ignored because the difference between the two values does not exceed 2.

To solve this problem, you can use the LAG and LEAD functions. First, define a CTE named C1 in which a query against T1 computes two attributes: isstart and isend. The isstart attribute is a flag that equals one when the sequence value is the first in its range, and zero otherwise. A value is not the first in its range if the difference between col1 and the previous value (obtained with the LAG function) is less than or equal to 2; otherwise it is the first. Likewise, isend equals one when the value is the last in its range: a value is not the last in its range if the difference between the next value (obtained with the LEAD function) and col1 is less than or equal to 2; otherwise it is the last.

The code then defines a CTE named C2 that keeps only the rows in which the sequence value is the start or the end of a range. In those rows, the LEAD function pairs the start of each range with its end. This is achieved by using the expression 1 - isend as the offset of the LEAD function: if the current row, which represents the start of a range, also represents its end, the offset is zero; otherwise it is one. Finally, the outer query simply selects from C2 only the rows in which isstart equals one. Here is the complete solution.
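A sketch that follows the description of C1 and C2 above, assuming the dbo.T1 test table. The output column names rangestart and rangeend are illustrative:

```sql
WITH C1 AS
(
  -- Flag each value as the start and/or the end of a range,
  -- treating differences of 2 or less as part of the same range.
  SELECT col1,
    CASE WHEN col1 - LAG(col1) OVER(ORDER BY col1) <= 2
         THEN 0 ELSE 1 END AS isstart,
    CASE WHEN LEAD(col1) OVER(ORDER BY col1) - col1 <= 2
         THEN 0 ELSE 1 END AS isend
  FROM dbo.T1
),
C2 AS
(
  -- Keep only range boundaries and pair each start with its end.
  -- Offset 0 returns the current row (start and end coincide);
  -- offset 1 returns the next boundary row, which is the end.
  SELECT col1 AS rangestart,
    LEAD(col1, 1 - isend) OVER(ORDER BY col1) AS rangeend,
    isstart
  FROM C1
  WHERE isstart = 1 OR isend = 1
)
SELECT rangestart, rangeend
FROM C2
WHERE isstart = 1;
```

For the first value LAG returns NULL and for the last value LEAD returns NULL, so the comparisons are not true and the CASE expressions fall through to 1, which correctly marks them as a start and an end.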

When working with relational DBMSs, which store data in tables, users often need to select values that fall (or do not fall) within a certain range. SQL offers several ways to specify the set a value should (or should not) belong to: the IN operator, the LIKE operator, a combination of greater-than and less-than conditions, and the BETWEEN operator. The description and examples in this article focus on the last option.

The BETWEEN operator in SQL: syntax and restrictions

The name of the BETWEEN operator says what it does. It places a "from ... to" constraint on a field: if a value falls within the range, the predicate evaluates to true and the row is included in the result.

The syntax of the operator is extremely simple:

WHERE t1.n BETWEEN 0 AND 7

As you can see, the BETWEEN keyword is followed by the lower bound of the range, then AND, then the upper bound.

The BETWEEN operator can work with the following data types:

  1. Numbers, both whole and fractional.
  2. Dates.
  3. Text.

The BETWEEN operator has a few peculiarities:

  1. When working with numbers and dates, the lower and upper bounds themselves are included in the selection.
  2. The lower bound must not be greater than the upper bound; otherwise nothing is returned, because the condition can never be true. Be especially careful when the condition uses variables instead of literal values.

When working with text, a value that begins with the upper bound will not be included in the selection unless the bound is specified precisely enough. We will look at this peculiarity in more detail in the following sections.
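The point about the order of the bounds can be illustrated with a small T-SQL sketch. The variables @lo and @hi and the inline list of values are hypothetical, chosen only for the demonstration:

```sql
-- Hypothetical bounds; in practice they might come from parameters.
DECLARE @lo AS INT = 7, @hi AS INT = 0;

-- BETWEEN @lo AND @hi is equivalent to n >= @lo AND n <= @hi,
-- so with @lo > @hi no value can satisfy both comparisons.
SELECT n
FROM (VALUES (0), (3), (7), (9)) AS t1(n)
WHERE n BETWEEN @lo AND @hi;   -- returns no rows

-- With the bounds in the right order, the limits 0 and 7 are both included.
SELECT n
FROM (VALUES (0), (3), (7), (9)) AS t1(n)
WHERE n BETWEEN @hi AND @lo;   -- returns 0, 3, 7
```

When the bounds come from variables, a check such as IF @lo > @hi before running the query is one way to catch swapped values early.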

Selecting numbers and dates in a range

Let's prepare a table with data on managers working in the organization. The table will have the following structure:

Field name          Data type  Description
Code                           Unique employee identifier
Surname             Text       Employee's last name
Name                Text       Employee's first name
Patronymic          Text       Employee's middle name
Gender              Text       Employee gender (M/F)
Reception_date      Date/time  Date the employee was hired
Number_of_children  Numerical  Number of children the employee has

Let's fill the table with the following data:

Code  Surname  Name  Patronymic  Gender  Reception_date  Number_of_children

[Table rows not recoverable. Surviving name fragments: Alexandrova, Nikolaevna, Stepanovich, Vinogradov, Pavlovich, Alexander, Borisovich, Vishnyakov, Alexandrovich, Tropnikov, Sergeevich, Zhemchugov, Vasilievich, Konstantinovna, Nikolaevich]

Let's write a query with BETWEEN that selects all employees who have 2 or 3 children:

The result will be three rows with data on the employees Shumilin, Tropnikov and Avdeeva.
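A sketch of such a query, assuming the Managers table and the Number_of_children field described above:

```sql
-- Both bounds are inclusive, so employees with exactly 2 or exactly 3
-- children are returned.
SELECT Managers.*
FROM Managers
WHERE Managers.Number_of_children BETWEEN 2 AND 3;
```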

Now let's select the employees hired from January 1, 2005 to December 31, 2016. Note that different DBMSs expect dates in conditions to be written differently. In most cases, the date is simply written in day-month-year form (or whichever form is more convenient) and enclosed in single or double quotes. In MS Access, the date is enclosed in "#" signs. Let's run an example in that style:

SELECT Managers.*, Managers.Reception_date
FROM Managers
WHERE Managers.Reception_date BETWEEN #1/1/2005# AND #31/12/2016#

The result will be five employees hired during the specified period inclusive.

Using BETWEEN with strings

A very common task when working with employee surnames is selecting only those that begin with a certain letter. Let's try a query that selects the employees whose surnames begin with the letters A through V (А through В, the first three letters of the Russian alphabet, in the original data):
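A sketch of such a query against the Managers table. The bounds 'A' and 'V' are an assumption: the upper bound is the Cyrillic letter В in the original data, transliterated here as V to match surnames such as Vinogradov:

```sql
-- Naive attempt: surnames from 'A' up to 'V'.
-- 'Vinogradov' sorts after the one-character string 'V',
-- so it falls outside the upper bound.
SELECT Managers.*
FROM Managers
WHERE Managers.Surname BETWEEN 'A' AND 'V';
```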

The result is as follows:

As you can see, the two employees whose surnames begin with the letter V were not included in the list. Why? The reason lies in how the operator compares strings of unequal length. The string "V" is shorter than the string "Vinogradov", so it is padded with spaces for the comparison. A space sorts before any letter, so "V" comes before "Vinogradov" in alphabetical order; the surname lies beyond the upper bound and is not included in the selection. Different DBMSs offer different solutions to this problem, but often the easiest way to be safe is to specify the next letter of the alphabet as the upper bound:

The result of this query will be exactly what we want.
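A sketch of the adjusted query, assuming Latin transliteration of the surnames, where the letter after V is W:

```sql
-- Upper bound moved to the next letter, so every surname starting
-- with V now falls inside the range.
SELECT Managers.*
FROM Managers
WHERE Managers.Surname BETWEEN 'A' AND 'W';

-- Alternative with a half-open range: it also excludes a surname
-- that is exactly 'W', which BETWEEN would include.
SELECT Managers.*
FROM Managers
WHERE Managers.Surname >= 'A' AND Managers.Surname < 'W';
```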

This nuance exists only when working with character data, but it shows that you need to be careful when working even with such simple operators as between.

Any query written against a database simplifies access to the information you need. In a previous post I talked about the common conditional operators. In this post, I will talk about operators that let you build queries that return more specific information, which is not so easy to obtain with queries that use only the AND and OR operators.
One of these special operators is IN. It lets you list the set of values for which to display information. Let's return to the data on debtors.

Debtors

Num   Month     Year  Sname    City         Address Debt
0001  July      2012  Ivanov   Stavropol    Stavropolskaya, 150000
0002  December  2019  Kononov  Tatarka      Zagorodnaya, 254684068
0003  May       2013  Yamshin  Mikhailovsk  Selskaya, 48165840
0004  August    2012  Preni    Stavropol    Central, 1646580
...   ...       ...   ...      ...          ... ...
9564  March     2015  Ulieva   Demino       International, 156435089
9565  October   2012  Pavlova  Stavropol    Vokzalnaya, 3768059
9566  January   2012  Uryupa   Mikhailovsk  Fontannaya, 1951238
9567  November  2017  Valetov  Tatarka      Vyezdnaya, 65789654

Suppose you need to select all debtors from the cities of Stavropol or Tatarka. By analogy with the previous post, you would use the query
SELECT *
FROM Debtors
WHERE City = 'Stavropol'
OR City = 'Tatarka';

First of all, the resulting code is cumbersome. By using special operators, you can get more compact code.
SELECT *
FROM Debtors
WHERE City IN ('Stavropol', 'Tatarka');

The result will be

Let's follow the logic of the query. The keywords SELECT, FROM and WHERE are already familiar. Then the IN operator appears. It tells the DBMS what to do: examine the values in the City column and return the rows in which the value is "Stavropol" or "Tatarka".
Now consider an example in which you need to make a selection based on specific debt amounts.
SELECT *
FROM Debtors
WHERE Debt IN (435089, 789654, 684068);

The result will be the following

That is, the IN operator checks each row of the table against the listed values.
The situation is different with another special operator, BETWEEN. Whereas IN considers only the values explicitly listed, BETWEEN considers everything that lies within a given range. However, do not read the operator's English name too literally. If you specify BETWEEN 1 AND 5, this does not mean that only the numbers 2, 3 and 4 qualify: the bounds 1 and 5 are included as well, and so is any other value that lies between them. In an example it looks like this.
SELECT *
FROM Debtors
WHERE Debt BETWEEN 30000 AND 100000;

The result will be

That is, SQL interpreted BETWEEN as any value from 30000 to 100000 in the Debt column.
In addition to numeric ranges, you can specify alphabetical ranges, which return rows whose values begin with letters in the specified range. But there is one interesting point here. Let's write the following query
SELECT *
FROM Debtors
WHERE Sname BETWEEN 'I' AND 'P';

Then the following data will be displayed

A logical question: why are the debtors with the surnames Preni and Pavlova missing? After all, the first letters of their surnames fall within the specified range! The letters do, but the surnames do not. Strings are compared character by character, and a longer string that begins with "P" sorts after the one-character string "P" itself. The upper bound in the query is the single character "P", so "Preni" (five characters) and "Pavlova" (seven characters) lie beyond it. The surname "Ivanov", on the other hand, falls within the range, because it sorts after the lower bound "I", which is likewise one character long.
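The comparison can be checked directly with a small T-SQL sketch. The inline list of surnames and the column name verdict are hypothetical, and the outcome assumes an ordinary alphabetical collation:

```sql
-- Test a few surnames against the same bounds as in the query above.
SELECT s,
  CASE WHEN s BETWEEN 'I' AND 'P' THEN 'in range'
       ELSE 'out of range' END AS verdict
FROM (VALUES ('Ivanov'), ('Kononov'), ('P'), ('Pavlova'), ('Preni')) AS T(s);
-- Ivanov, Kononov and P are in range; Pavlova and Preni are out of range.

-- To include every surname that starts with P, one option is a
-- half-open range that ends just before the next letter.
SELECT *
FROM Debtors
WHERE Sname >= 'I' AND Sname < 'Q';
```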

