A note about regular expressions: why they are needed, where they are used, and how to use them. In other words, a note about searching in PHP.

A regular expression is a pattern that describes the text you want to find.

In PHP, regular expressions are used by functions that search and replace text.

For example, consider this function:

preg_replace("/(]*>)(.*?)()/i", "$1$3", $string);

The first argument of this function, with all its seemingly incomprehensible symbols, is called a regular expression (PHP RegEx). Regular expressions are used to search for specific data.

The pattern syntax originates from the Perl language.

Regular expressions are built from metacharacters and metacharacter modifiers.

Metacharacters define a group of regular characters. Modifiers specify how many of these characters to look for.

Regular expression metacharacters

The meanings of some metacharacters from the example above (they will also appear below):

^ - start of line
\ - treat the next character as a regular character (not a command)
. - any single character
() - grouping (subpattern)
[] - character class
$ - end of line
| - alternative (or)

Regular Expression Modifiers

* - repeat from 0 to infinity
? - repeat 0 or 1 time

More modifiers, not used in the current examples:

+ - repeat 1 or more times
{n} - exact number of times (replace n with a number)
{n,} - at least n times
{n,m} - not less than n, but not more than m

Any of the above modifiers can be followed by the modifier "?". It limits the search: by default all modifiers repeat greedily (they match as much text as possible), and "?" makes them match as little as possible.

For example:

(<.*>) - will find the entire string with all tags
(<.*?>) - will find only tags
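The difference is easy to check in a small experiment. The sketch below assumes a sample HTML string and the two patterns <.*> and <.*?>; both are illustrative choices, not taken from a real page.

```php
<?php
// Hypothetical sample input: a pair of tags around a word.
$html = '<b>bold</b> text';

// Greedy: .* grabs as much as possible, up to the LAST ">".
preg_match('/<.*>/', $html, $greedy);

// Lazy: .*? stops at the FIRST ">" it can reach.
preg_match('/<.*?>/', $html, $lazy);

echo $greedy[0], "\n"; // <b>bold</b>
echo $lazy[0], "\n";   // <b>
```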

Additional options

In addition to modifiers and metacharacters, there are options (not all are listed):

/i - case does not matter (lowercase and uppercase letters are treated the same)
/s - makes the dot (.) also match line breaks and carriage returns
/U - makes all quantifiers "non-greedy"

Options can be combined, for example /is.

Templates

There are also escape sequences, which act much like metacharacters. One of them:

\n - new line

This page does not list all regular expression options. This is deliberate, so as not to confuse beginners while still giving them the basic tools for searching. If you want to go into more detail later, you can find more detailed instructions on the Internet.

Experiments

You can experiment with regular expressions on this site. Enter a regular expression at the top, and at the bottom the HTML data you are searching in. If you choose the right regular expression, the section of code you need will be highlighted.

Regular expressions are special patterns for searching for substrings in text. With their help, you can solve the following problems in one line: “check whether a string contains numbers”, “find all email addresses in the text”, “replace several consecutive question marks with one”.

Let's start with one popular programming wisdom:

Some people, when faced with a problem, think: “Yeah, I'm smart, I'll solve it using regular expressions.” Now they have two problems.

Pattern examples

Let's start with a couple of simple examples. The first expression looks for a sequence of 3 letters, where the first letter is "k", the second is any Russian letter, and the third is "t", ignoring case (for example, the Russian word for "cat" fits this pattern in both lower and upper case). The second expression searches the text for a time in the format 12:34.

Any expression begins with a delimiter character. The symbol / is usually used for this, but you can also use other symbols that have no special purpose in regular expressions, for example ~, # or @. Alternative delimiters are used if the character / may appear in the expression. Then comes the pattern of the string we are looking for, followed by a second delimiter, and at the end there may be one or more flag letters. They specify additional options when searching for text. Here are examples of flags:

  • i - says that the search should be case insensitive (it is case sensitive by default)
  • u - says that the expression and the text being searched use the UTF-8 encoding, and not just Latin letters. Without it, the search for Russian (and any other non-Latin) characters may not work correctly, so you should always set it.

The pattern itself consists of ordinary characters and special constructions. For example, the letter "k" in a regular expression means itself, but the symbols [0-5] mean "any digit from 0 to 5 can be in this place".

Below we will analyze the meaning of each of these constructions (and also explain why the letter "ё" is listed separately in the first expression), but for now let's try to apply our regular expressions to the text and see what happens.

PHP has a special function preg_match($regexp, $text, $match) that takes a regular expression, a text and an empty array as input. It checks whether the text contains a substring that matches the given pattern and returns 0 if not, or 1 if it does. The first match found is placed in the passed array, in the element with index 0. Let's write a simple program that applies regular expressions to different strings:
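A minimal sketch of such a program might look like this. The two patterns, the sample strings and the helper name firstMatch() are assumptions made for illustration; in particular, Latin letters stand in for the Russian ones so that the example reads the same in any alphabet.

```php
<?php
// Illustrative patterns (assumptions, not the originals).
$patterns = [
    'word' => '/c[a-z]t/ui',   // "c", any Latin letter, "t"; i = ignore case, u = UTF-8
    'time' => '/\d\d:\d\d/u',  // two digits, a colon, two digits
];

$texts = [
    'The CAT sat on the mat',
    'Meet me at 12:34 tomorrow',
    'Nothing to see here',
];

// Hypothetical helper: returns the first match of $regexp in $text, or null.
function firstMatch(string $regexp, string $text): ?string
{
    $match = [];
    return preg_match($regexp, $text, $match) === 1 ? $match[0] : null;
}

foreach ($texts as $text) {
    foreach ($patterns as $name => $regexp) {
        $found = firstMatch($regexp, $text);
        echo "$name in \"$text\": ", $found ?? 'no match', "\n";
    }
}
```

Wrapping preg_match() in a small function keeps the "0 or 1" return value and the $match array in one place, so the loop only has to deal with "a string or nothing".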

After looking at the example, let's study regular expressions in more detail.

Brackets in regular expressions

Let's review what the different types of brackets mean:

  • Curly braces, as in a{1,5}, specify the number of repetitions of the previous character - in this example, the expression searches for 1 to 5 consecutive letters "a"
  • Square brackets, as in [abcx-z0-5], mean "any one of these characters", in this case the letters a, b, c, x, y, z or a digit from 0 to 5. Other special characters like | or * do not work inside square brackets - there they denote a regular character. If there is a ^ symbol at the beginning of the square brackets, then the meaning changes to the opposite: "any one character except those indicated" - for example [^a-c] means "any one character except a, b or c".
  • Parentheses group characters and expressions. For example, in the expression abc+, the plus sign refers only to the letter c and this expression searches for words like abc, abcc, abccc. And if you put parentheses, a(bc)+, then the plus quantifier refers to the sequence bc and the expression looks for the words abc, abcbc, abcbcbc

Note: you can specify ranges of characters in square brackets, but remember that the Russian letter ё lies outside the а-я range, so to write "any Russian letter" you need to write [а-яё].
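The three kinds of brackets can be tried out side by side. This is a small sketch with made-up test strings; the ^ and $ anchors are added only so that the whole string has to fit the pattern.

```php
<?php
// Curly braces: 1 to 5 consecutive letters "a".
preg_match('/a{1,5}/', 'baaad', $m);
echo $m[0], "\n";                               // aaa

// Square brackets: any one character from the set; ^ at the start negates it.
echo preg_match('/^[a-c]$/', 'b'), "\n";        // 1
echo preg_match('/^[^a-c]$/', 'b'), "\n";       // 0

// Parentheses: the plus applies to the whole group "bc".
echo preg_match('/^abc+$/', 'abccc'), "\n";     // 1
echo preg_match('/^a(bc)+$/', 'abcbc'), "\n";   // 1
echo preg_match('/^a(bc)+$/', 'abccc'), "\n";   // 0
```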

Backslashes

If you've looked at other textbooks on regular expressions, you've probably noticed that the backslash is written differently everywhere. Some write one backslash: \d, but here in the examples it is repeated twice: \\d.

Why? The regular expression language requires you to write the backslash once. However, in single-quoted and double-quoted PHP strings the backslash also has a special meaning: see the manual section on strings.

For example, if you write $x = "\$"; then PHP will treat this as a special combination and insert only the $ character into the string (and the regular expression engine will not know about the backslash before it). To insert the sequence \$ into a string, we must double the backslash and write the code as $x = "\\$";.

For this reason, in some cases (where the sequence of characters has a special meaning in PHP) we are required to double the backslash:

  • To write \$ in a regular expression, we write "\\$" in code
  • To write \\ in a regular expression, we double each backslash and write "\\\\"
  • To write a backslash and a number (\1) in a regular expression, we double the backslash: "\\1"

In other cases, one or two backslashes give the same result: "\\d" and "\d" both insert the two characters \d into the string - in the first case, the 2 backslashes are the sequence for inserting a backslash; in the second case there is no special sequence and the characters are inserted as is. You can check which characters end up in a string, and therefore what the regular expression engine will see, using echo: echo "\$";. Yes, it's difficult, but what can you do?
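The echo check can be run for every case discussed above. A short sketch, using double-quoted strings throughout:

```php
<?php
// What PHP puts into the string is what the regex engine will see.
echo "\$", "\n";    // $    (PHP consumed the backslash)
echo "\\$", "\n";   // \$
echo "\\\\", "\n";  // \\
echo "\\d", "\n";   // \d
echo "\d", "\n";    // \d   (no special sequence, inserted as is)

// Both spellings therefore produce the same pattern.
var_dump("/\\d/" === "/\d/");            // bool(true)
var_dump(preg_match("/\\d/", 'room 7')); // int(1)
```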
Special constructs in regular expressions

  • \d matches any one digit, \D - any one character except a digit
  • \w matches any one letter (of any alphabet), digit, or underscore _. \W matches any character except a letter, digit, or underscore.

There is also a convenient condition for indicating a word boundary: \b. This construction means that on one side of it there should be a character that is a letter/digit/underscore (\w), and on the other side there should not be one (either a different character or the edge of the text). For example, we want to find the word "cat" in the text. If we write the regular expression /cat/ui, it will find this sequence of letters anywhere - for example, inside the word "cattle". This is clearly not what we wanted. If we add a word boundary condition to the regular expression, /\bcat\b/ui, then only the stand-alone word "cat" will be found.
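To see the effect of \b, the sketch below runs both patterns on one made-up sentence and prints where each match starts (PREG_OFFSET_CAPTURE asks preg_match() to report the position as well as the text).

```php
<?php
$text = 'The cattle ignored the cat.';

// Without boundaries the letters are first found inside "cattle".
preg_match('/cat/ui', $text, $loose, PREG_OFFSET_CAPTURE);
echo $loose[0][1], "\n";   // 4

// With \b on both sides only the stand-alone word matches.
preg_match('/\bcat\b/ui', $text, $strict, PREG_OFFSET_CAPTURE);
echo $strict[0][1], "\n";  // 23
```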

Regular expression syntax in PHP, detailed description


Regular expressions are expressions written in a special language. Don't be alarmed, the language is quite easy to understand; all you need is experience and practice.

I think you have repeatedly encountered situations where you have a text (for example, in Microsoft Word) and you need to find something important in it. If you know exactly what you are looking for, everything is simple: call up the search dialog, enter the search word, press the button and voila - the text is found. But what will you do if you only know in advance the type of information you are looking for? For example, you are faced with the task of finding all email addresses in a document of a couple of hundred pages. Some will go through the document manually, some will type the at sign (@) into the search box and look for it. Agree - both options are backbreaking, thankless work.

This is where regular expressions come to the rescue. To some approximation, regular expressions can be compared to masks or templates that are superimposed on text: if the text matches the mask, then this is the desired fragment. But before we consider the use of regular expressions, we will become familiar with their syntax.

A regular expression is a text string composed according to certain laws and rules. A string consists of characters and groups of characters, metacharacters, quantifiers and modifiers.

In this case, characters mean any characters of any alphabet, and not only readable ones. You can easily insert an unreadable character into an expression; to do this, you just need to know its code in hexadecimal form. For example:

// readable characters
a
E
// unreadable characters and codes
\x41 - the same as the letter "A"
\x09 - tab character

A character group is several characters written sequentially:

Abvg ACZms

I would like to draw your attention right away - the space in regular expressions is also a significant character, so be careful when writing expressions. For example, these character groups are DIFFERENT expressions:

ABC WHERE
ABC  WHERE

The next element of the language is metacharacters. The prefix "meta" means that these characters describe other characters or groups of them. The table describes the main metacharacters of the regular expression language:

Metacharacters for specifying special characters
() Brackets. Define nested expressions.
| Selection (alternation) metacharacter
^ Start of line metacharacter
$ End of line metacharacter
\n Line feed character (hex code 0x0A)
\r Carriage return character (hex code 0x0D)
\t Tab character (hex code 0x09)
\xhh Inserts the character with hexadecimal code 0xhh; for example, \x42 inserts the Latin letter "B"
Metacharacters for specifying groups of characters
. Dot. Any character.
\d Digit (0-9)
\D Not a number (any character except characters 0-9)
\s Whitespace character (space, tab, line break)
\S Non-whitespace character (everything except the characters matched by \s)
\w A "word" character (a character that is used in words: typically all letters, all digits, and the underscore "_")
\W Everything except the characters matched by \w

The metacharacters from the second half of the table are very easy to remember: "d" - digit, "s" - space, "w" - word. If the letter is uppercase, add "NOT" to the group description.

Let’s take for example the text “The red jersey has the numbers 1812, and the green jersey has the numbers 2009.” Let's look at examples of the simplest regular expressions:

\d\d\d\d - will find 1812 and 2009
\D - will find all letters, spaces and punctuation marks
\s - will find all spaces in the text
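These three expressions can be run against the sample sentence with preg_match_all(), which collects every match and returns how many it found. A minimal sketch:

```php
<?php
$text = 'The red jersey has the numbers 1812, and the green jersey has the numbers 2009.';

// Four digits in a row: finds both years.
preg_match_all('/\d\d\d\d/', $text, $years);
print_r($years[0]);

// \s: counts every whitespace character (here, only spaces).
$spaces = preg_match_all('/\s/', $text);
echo "whitespace characters: $spaces\n";

// \D: counts everything that is not a digit.
$nonDigits = preg_match_all('/\D/', $text);
echo "non-digit characters: $nonDigits\n";
```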

But the year in our example can be written not in four, but in two digits, words can have other endings, etc. Subsets of characters, which are specified using square brackets, can help here:

[0-9] - means any digit (same as \d)
[02468] - means an even digit
[a-zA-Z0-9] - means any letter of the Latin alphabet (in any case) or a digit

For example, the expression \d\d\d[02468] in the test string will find only 1812, but not 2009. This expression should be read as "find all sequences of four digits where the last digit is 0, 2, 4, 6 or 8".
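The same check in code, as a small sketch on the sample sentence. The class [02468] is the set of even digits, and [0-9] is an explicit spelling of "any ASCII digit".

```php
<?php
$text = 'The red jersey has the numbers 1812, and the green jersey has the numbers 2009.';

// Three digits followed by an even digit: only 1812 qualifies.
preg_match_all('/\d\d\d[02468]/', $text, $even);
print_r($even[0]);

// Four characters from the range 0-9: both years qualify.
preg_match_all('/[0-9][0-9][0-9][0-9]/', $text, $all);
print_r($all[0]);
```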

All we have left to mention are quantifiers and modifiers.

A quantifier is a special construct that determines how many times a character or group of characters must occur. The quantifier is written in curly braces "{}". Two formats are possible: exact and range. The exact format is written like this:

{X}

Here X is the number of times the previous character or group must be repeated. For example, the expression \d{4} matches exactly four digits in a row.

The second format is a range. It is written as

{X,Y} // or {,Y} // or {X,}

where X is the minimum and Y is the maximum number of repetitions. For example:

\d{2,4}

reads as "two to four digits written in sequence." If one of the boundaries is not specified, then no limit is assumed on that side. For example:

\w{3,} - three or more letters.
\d{,5} - no digits at all, or some, but no more than five.

Note that the PCRE versions bundled with PHP before 8.4 do not treat {,Y} as a quantifier and match it as literal text, so {0,Y} is the portable spelling.
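Each quantifier form can be tried on a short string. The inputs below are made up for illustration, and the "up to Y" case is written as {0,5} with an explicit zero, which is understood by every PCRE version.

```php
<?php
// {X}: exactly X repetitions.
preg_match('/\d{4}/', 'Year 2009', $exact);
echo $exact[0], "\n";  // 2009

// {X,Y}: from X to Y repetitions (greedy, so it takes as many as it can).
preg_match('/\d{2,4}/', 'Code 123456', $range);
echo $range[0], "\n";  // 1234

// {X,}: X or more repetitions.
preg_match('/\w{3,}/', 'to be honest', $open);
echo $open[0], "\n";   // honest

// {0,Y}: up to Y repetitions.
preg_match('/^\d{0,5}$/', '123', $upTo);
echo $upTo[0], "\n";   // 123
```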

Quantifiers can be applied to either a single character or a group:

[А-Яа-я]{1,3}

This construction will select from the text all Russian words of one, two or three letters (for example, "or", "not", "I", "I go", etc.)

Besides curly braces, there are three more quantifier metacharacters: "*" (asterisk), "+" (plus) and "?" (question mark). They are used when the minimum and maximum number of repetitions is not known in advance. For example, when searching for email addresses, you can't tell in advance how many characters will be in the username (before the "@") and how many in the domain name (after the "@").

The metacharacter "*" reads as "any number of times, from zero up", i.e. the construction

\w*

matches any number of consecutive letters, including their complete absence.

The "+" symbol differs from the asterisk only in that it requires at least one character, i.e. the construction

\d+

matches any sequence of one or more digits.

The "?" symbol matches the absence or presence of a single character, i.e. the construction

\d\d?

matches any sequence of one or two digits.
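A quick way to feel the difference between the three symbols is to apply them to digits in a short string. The patterns \d*, \d+ and \d\d? below are illustrative choices for this sketch.

```php
<?php
// *: zero or more, so the pattern can match an empty string.
preg_match('/\d*/', 'abc', $star);
var_dump($star[0]);  // string(0) ""

// +: one or more.
preg_match('/\d+/', 'abc 2009', $plus);
var_dump($plus[0]);  // string(4) "2009"

// ?: zero or one, so \d\d? means one or two digits.
preg_match('/\d\d?/', 'abc 2009', $opt);
var_dump($opt[0]);   // string(2) "20"
```

Note that the asterisk version reports a successful match even though the string has no digits at its start: "zero digits" is a valid match.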

Here it is worth mentioning a feature of the quantifiers "*" and "+": greed. By default they match the longest possible sequence of characters. For example, for the Russian string "мама мыла раму" ("mom washed the frame") the expression

м.*а

will select "мама мыла ра", which is somewhat unexpected, because we expected to get "ма". To change this behavior, use the metacharacter "?" (question mark) written immediately after the quantifier. It limits the "appetite" of quantifiers by forcing them to return the shortest possible match rather than the longest. Now let's change the previous example:

м.*?а

and get the required match "ма".
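The greedy and non-greedy variants can be compared directly. This sketch assumes the Russian phrase "мама мыла раму" ("mom washed the frame") as input and the patterns м.*а and м.*?а; the u flag is needed because the text is UTF-8.

```php
<?php
$text = 'мама мыла раму';

// Greedy: runs to the last "а" in the string.
preg_match('/м.*а/u', $text, $greedy);

// Non-greedy: stops at the first "а" after "м".
preg_match('/м.*?а/u', $text, $lazy);

echo $greedy[0], "\n"; // мама мыла ра
echo $lazy[0], "\n";   // ма
```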

The last element of the language is modifiers. A modifier is a special character that sets "system" parameters for how regular expressions are processed. Four of them are described here; they can be used either individually or together:

i Enables case-insensitive mode, i.e. capital and small letters are not distinguished in the expression.
m Indicates that the text being searched should be treated as consisting of multiple lines. By default, the regular expression engine treats text as a single string, regardless of what it actually is. Accordingly, the metacharacters "^" and "$" indicate the beginning and end of the entire text. If this modifier is specified, then they will indicate the beginning and end of each line of text, respectively.
s By default the metacharacter "." does not match the newline character, i.e. for multiline text, the expression /.+/ will return only the first line, not the entire text as you might expect. Specifying this modifier removes this limitation.
U Makes all quantifiers "non-greedy" by default. PHP uses the letter "U" (for "ungreedy") for this; it has no "g" modifier, and the "g" flag found in some other languages means "global" (find all matches), not "greedy".
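Each of the four modifiers changes the outcome of a match in a way that is easy to observe. A minimal sketch on a made-up two-line text:

```php
<?php
$text = "first line\nSecond line";

// i: case-insensitive matching.
var_dump(preg_match('/second/', $text));    // int(0)
var_dump(preg_match('/second/i', $text));   // int(1)

// m: ^ and $ match at the start and end of every line.
var_dump(preg_match('/^Second/', $text));   // int(0)
var_dump(preg_match('/^Second/m', $text));  // int(1)

// s: the dot also matches the newline character.
preg_match('/.+/', $text, $oneLine);
preg_match('/.+/s', $text, $whole);
var_dump($oneLine[0]);                      // string(10) "first line"
var_dump($whole[0] === $text);              // bool(true)

// U: quantifiers become non-greedy by default.
preg_match('/<.+>/U', '<i>a</i>', $lazy);
var_dump($lazy[0]);                         // string(3) "<i>"
```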

The table shows the most popular and useful examples of regular expressions. Some of them may seem complicated and cumbersome, but with careful study you will undoubtedly understand them.

Regular expressions in PHP.

To work with regular expressions in PHP there are special functions; a list of them with short descriptions is given in the table:

int preg_match (string pattern, string subject [, array matches])

The function checks whether the content of subject matches the pattern. Returns 1 if a match is found, otherwise returns 0. If you specify the optional matches array parameter, the first match found is written to it: element 0 holds the text matched by the whole pattern, and the following elements hold the text matched by each parenthesized subpattern.

int preg_match_all (string pattern, string subject, array matches [, int order])

The function is identical to the previous one, with one difference - it searches the entire text and returns ALL matches found in the matches array.

mixed preg_replace (mixed pattern, mixed replacement, mixed subject [, int limit])

Like both of its predecessors, preg_replace searches for fragments of text that match a pattern. The function replaces all found fragments with the text specified in the parameters.

mixed preg_replace_callback (mixed pattern, mixed callback, mixed subject [, int limit])

The function is an extended version of the previous one. The main difference is that it receives, as a parameter, the name of a function that analyzes each match and generates the replacement text.

array preg_split (string pattern, string subject [, int limit [, int flags]])

This function is similar to the explode() function (and to split(), which was removed in PHP 7.0). Its peculiarity is that the separator is not a fixed string, but a regular expression. The function splits the source data into elements and places them in the output array.

array preg_grep (string pattern, array input)

The function is designed for regular expression search in arrays. A pattern and an array of input data are specified, and the function returns an array consisting only of the elements that match the pattern.
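A short tour of four of these functions, as a sketch with made-up inputs. The callback here is an anonymous function rather than a function name, which preg_replace_callback() also accepts.

```php
<?php
// preg_match_all: collect every match, not just the first.
preg_match_all('/\d+/', 'a1 b22 c333', $all);
print_r($all[0]);                 // 1, 22, 333

// preg_replace_callback: compute each replacement in a function.
$doubled = preg_replace_callback(
    '/\d+/',
    function (array $m): string {
        return (string) ((int) $m[0] * 2);
    },
    'a1 b22 c333'
);
echo $doubled, "\n";              // a2 b44 c666

// preg_split: split on a pattern instead of a fixed string.
$parts = preg_split('/\s*[,;]\s*/', 'red, green;blue');
print_r($parts);                  // red, green, blue

// preg_grep: keep only the array elements that match (keys are preserved).
$numbers = preg_grep('/^\d+$/', ['10', 'ten', '20']);
print_r($numbers);                // keys 0 and 2
```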

The list of functions considered is far from complete, but it is quite sufficient for a successful start with regular expressions. If you are interested in this topic, be sure to read additional literature (for example, Friedl's book "Regular Expressions"). In addition, for training purposes, I recommend installing one of the special programs for testing regular expressions (for example, "PCRE" or "RegEx Builder").

mixed preg_replace (mixed pattern, mixed replacement, mixed subject [, int limit])

Searches the string subject for matches of pattern and replaces them with replacement. If the limit parameter is specified, at most limit occurrences of the pattern will be replaced; if limit is omitted or equal to -1, all occurrences of the pattern will be replaced.

replacement can contain references of the form \\n or (since PHP 4.0.4) $n, with the latter being preferable. Each such reference will be replaced by the substring matched by the n-th parenthesized subpattern.

n can take values from 0 to 99, with the reference \\0 (or $0) referring to the text matched by the entire pattern.

Subpatterns are numbered from left to right, starting with one.

When a replacement uses subpattern references, a situation may arise where a reference is immediately followed by a digit.

In this case, notation like \\n leads to a mistake: a reference to the first subpattern followed by the digit 1 would be written as \\11, which is interpreted as a reference to the eleventh subpattern. The way around this is the form ${1}1, which separates the reference from the digit.
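The ambiguity and the ${1}1 form that avoids it can be seen in a small sketch. The subject string and pattern are made up for illustration; single-quoted strings are used so that PHP does not try to interpolate the $ references.

```php
<?php
$subject = 'item-7';
$pattern = '/(\w+)-(\d)/';

// Intended: first subpattern followed by a literal "1".
// The braces keep the reference and the digit apart.
$good = preg_replace($pattern, '${1}1-$2', $subject);
echo $good, "\n"; // item1-7

// '$11' is read as a reference to subpattern 11, which does not exist here,
// so it is replaced with an empty string.
$bad = preg_replace($pattern, '$11-$2', $subject);
echo $bad, "\n";  // -7
```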

The first three parameters of preg_replace() can be one-dimensional arrays. If an array uses keys, its elements are processed in the order in which they are stored in the array, not in key order. Specifying keys in the arrays for pattern and replacement is optional.

If you do decide to use indexes to match patterns with their replacement strings, call the ksort() function on each of the arrays.

Example 2: Using arrays with numeric indexes as arguments to preg_replace()
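A minimal sketch of such a call follows. The input sentence, the three patterns and the variable names are assumptions chosen for illustration; the point is only that the replacement array is filled in reverse key order, so ksort() is needed to pair each pattern with the replacement that has the same index.

```php
<?php
// Hypothetical input and patterns, chosen to illustrate key ordering.
$string = 'The quick brown fox jumped over the lazy dog.';

$patterns = [];
$patterns[0] = '/quick/';
$patterns[1] = '/brown/';
$patterns[2] = '/fox/';

// Same keys, but inserted in reverse order.
$replacements = [];
$replacements[2] = 'bear';
$replacements[1] = 'black';
$replacements[0] = 'slow';

// Without ksort() the pairs follow insertion order:
// quick => bear, brown => black, fox => slow.
$unsorted = preg_replace($patterns, $replacements, $string);

// After ksort() the pairs follow the keys.
ksort($patterns);
ksort($replacements);
echo preg_replace($patterns, $replacements, $string);
```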

Result:

The slow black bear jumped over the lazy dog.

If the subject parameter is an array, pattern search and replacement are performed for each of its elements. The returned result will also be an array.

If the pattern and replacement parameters are both arrays, preg_replace() takes a pair of elements from the two arrays in turn and uses them for the search and replace operation.

If the replacement array contains fewer elements than pattern, empty strings are used for the missing replacements.

If pattern is an array and replacement is a string, each element of the pattern array is searched for and replaced with replacement (the array elements serve as the pattern in turn, while the replacement string stays fixed).

The case where pattern is a string and replacement is an array does not make sense.

The /e modifier changes the behavior of preg_replace() so that the replacement parameter, after the necessary substitutions are made, is interpreted as PHP code and only then used for replacement. When using this modifier, be careful: the replacement parameter must contain valid PHP code, otherwise a syntax error will occur on the line containing the preg_replace() function call. Note that /e was deprecated in PHP 5.5 and removed in PHP 7.0; preg_replace_callback() is the replacement for it.

Example 3. Replacement using several patterns

This example will output:

Converts all HTML tags to uppercase
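One way to convert tag names to uppercase on current PHP versions is preg_replace_callback(), which needs no /e modifier. The pattern and the sample markup below are assumptions made for this sketch; it is not meant as a general HTML parser.

```php
<?php
$html = '<p>Hello <b>world</b></p>';

// Group 1: "<" or "</"; group 2: the tag name; group 3: the rest of the tag.
$upper = preg_replace_callback(
    '/(<\/?)(\w+)([^>]*>)/',
    function (array $m): string {
        return $m[1] . strtoupper($m[2]) . $m[3];
    },
    $html
);

echo $upper, "\n"; // <P>Hello <B>world</B></P>
```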

If the PREG_OFFSET_CAPTURE flag is specified, the position of each found substring in the source string is returned as well. It is important to remember that this flag changes the format of the returned data: each occurrence is returned as an array whose element 0 contains the found substring and whose element 1 contains its offset.

This flag is available in PHP 4.3.0 and higher. The additional flags parameter is available since PHP 4.3.0.
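The changed result format is easiest to see on a concrete call. A minimal sketch with preg_match() and the PREG_OFFSET_CAPTURE constant, on a made-up subject string:

```php
<?php
preg_match('/(foo)(bar)/', 'xxfoobar', $m, PREG_OFFSET_CAPTURE);

// Each entry is now a pair: [matched text, byte offset in the subject].
print_r($m[0]); // foobar at offset 2
print_r($m[2]); // bar at offset 5
```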

The search is carried out from left to right, from the beginning of the string. The optional offset parameter can be used to specify an alternative starting position for the search. The additional offset parameter is available since PHP 4.3.3.

Note: Using the offset parameter is not equivalent to passing substr($subject, $offset) to preg_match_all() in place of the subject string, since pattern can contain assertions such as ^, $ or (?<=x).
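The difference between the offset parameter and substr() shows up as soon as the pattern contains ^. A small sketch using preg_match(), whose offset parameter works the same way:

```php
<?php
$subject = 'abcdef';

// ^ still refers to the true start of $subject, so nothing matches from offset 3.
$withOffset = preg_match('/^def/', $subject, $m1, 0, 3);
var_dump($withOffset); // int(0)

// substr() creates a new string that starts with "d", so ^ matches.
$withSubstr = preg_match('/^def/', substr($subject, 3), $m2);
var_dump($withSubstr); // int(1)
```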



The preg_match() function returns the number of matches found: 0 if there is no match, or 1, because it stops searching after the first match.