Regular Expressions in Python with Examples | Set 1

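A Regular Expression (RegEx) is a special sequence of characters that uses a search pattern to find a string or a set of strings. It can detect the presence or absence of text by matching it against a particular pattern, and it can also split a pattern into one or more sub-patterns. Python provides the re module, which supports the use of regular expressions in Python. Its primary function is re.search(), which takes a regular expression and a string. It returns either the first match or None.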

Example:

Python3

import re

s = 'GeeksforGeeks: A computer science portal for geeks'

match = re.search(r'portal', s)

print('Start Index:', match.start())
print('End Index:', match.end())

Output

Start Index: 34
End Index: 40

The code above gives the starting index and the ending index of the string portal.

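Note: Here the r prefix (r'portal') stands for raw, not regex. A raw string is slightly different from a regular string: it does not interpret the \ character as an escape character. This matters because the regular expression engine uses the \ character for its own escaping purposes.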
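
To make the note concrete, here is a minimal sketch comparing the same pattern written as a regular string and as a raw string. The variable names and the sample text are chosen only for illustration.

```python
import re

# In a regular string '\b' is the backspace character (one character);
# in a raw string r'\b' is a backslash followed by 'b' (two characters).
regular = '\bgeeks'
raw = r'\bgeeks'

text = 'a portal for geeks'

# Only the raw version reaches the regex engine as the \b escape.
regular_match = re.search(regular, text)
raw_match = re.search(raw, text)

print(len(regular), len(raw))  # 6 7
print(regular_match)           # None
print(raw_match.group())       # geeks
```

The regular string fails to match because the pattern asks for a literal backspace character, which the text does not contain.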

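Before starting with the Python re module, let's see how to actually write a regex using metacharacters or special sequences.

MetaCharacters

Metacharacters are characters with a special meaning in a regular expression. They are important to understand because they are used throughout the functions of the re module. Below is the list of metacharacters.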

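Metacharacter	Description
\	Used to drop the special meaning of the character following it
[]	Represents a character class
^	Matches the beginning
$	Matches the end
.	Matches any character except newline
|	Means OR (matches either of the expressions separated by it)
?	Matches zero or one occurrence
*	Any number of occurrences (including 0 occurrences)
+	One or more occurrences
{}	Indicates the number of occurrences of the preceding regex to match
()	Encloses a group of regex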

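Let's discuss each of these metacharacters in detail.

\ – Backslash

The backslash (\) makes sure that the character is not treated in a special way. This can be considered a way of escaping metacharacters. For example, if you want to search for the dot (.) in a string, you will find that the dot is treated as a special character, since it is one of the metacharacters (as shown in the table above). In this case, we put a backslash (\) just before the dot (.) so that it loses its special meaning. See the example below for a better understanding.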

Example:

Python3

import re

s = 'geeks.forgeeks'

# without using a backslash
match = re.search(r'.', s)
print(match)

# using a backslash
match = re.search(r'\.', s)
print(match)

Output

<re.Match object; span=(0, 1), match='g'>
<re.Match object; span=(5, 6), match='.'>

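[] – Square Brackets

Square brackets ([]) represent a character class consisting of a set of characters that we wish to match. For example, the character class [abc] will match any single a, b, or c.

We can also specify a range of characters using - inside the square brackets. For example,

[0-3] is the same as [0123]
[a-c] is the same as [abc]

We can also invert the character class using the caret (^) symbol. For example,

[^0-3] means any character except 0, 1, 2, or 3
[^a-c] means any character except a, b, or c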
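
As a small illustration of these character classes, the sketch below runs each one through re.findall() on a made-up sample string (the string and variable names are assumptions for the example).

```python
import re

s = 'abc0123xyz789'

# [a-c] matches a single a, b or c
letters = re.findall(r'[a-c]', s)

# [0-3] matches a single 0, 1, 2 or 3
low_digits = re.findall(r'[0-3]', s)

# [^0-3] matches any single character that is NOT 0, 1, 2 or 3
not_low_digits = re.findall(r'[^0-3]', s)

print(letters)         # ['a', 'b', 'c']
print(low_digits)      # ['0', '1', '2', '3']
print(not_low_digits)  # ['a', 'b', 'c', 'x', 'y', 'z', '7', '8', '9']
```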

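^ – Caret

The caret (^) symbol matches the beginning of the string, i.e. it checks whether the string starts with the given character(s). For example:

^g will check whether the string starts with g, such as geeks, globe, girl, g, etc.
^ge will check whether the string starts with ge, such as geeks, geeksforgeeks, etc.

$ – Dollar

The dollar ($) symbol matches the end of the string, i.e. it checks whether the string ends with the given character(s). For example:

s$ will check for a string that ends with s, such as geeks, ends, s, etc.
ks$ will check for a string that ends with ks, such as geeks, geeksforgeeks, ks, etc.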
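
The two anchors can be sketched together as follows. The word list is invented for the example; re.search() is used so that only the anchor decides where the match may occur.

```python
import re

words = ['geeks', 'globe', 'ends', 'forge']

# ^g : the string must start with g
starts_with_g = [w for w in words if re.search(r'^g', w)]

# s$ : the string must end with s
ends_with_s = [w for w in words if re.search(r's$', w)]

print(starts_with_g)  # ['geeks', 'globe']
print(ends_with_s)    # ['geeks', 'ends']
```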

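. – Dot

The dot (.) symbol matches any single character except the newline character (\n). For example:

a.b will check for a string that contains any character in place of the dot, such as acb, acbd, abbb, etc.
.. will check whether the string contains at least 2 characters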
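
A short sketch of the dot in practice, using sample strings picked for illustration:

```python
import re

# a.b : 'a', then any one character except a newline, then 'b'
m1 = re.search(r'a.b', 'acb')
m2 = re.search(r'a.b', 'xxabbb')
m3 = re.search(r'a.b', 'a\nb')   # the dot does not match '\n'

# .. needs at least two characters (other than newlines)
m4 = re.search(r'..', 'g')

print(m1.group())  # acb
print(m2.group())  # abb
print(m3)          # None
print(m4)          # None
```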

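| – Or

The or symbol (|) works as the OR operator, meaning it checks whether the pattern before or after the symbol is present in the string. For example:

a|b will match any string that contains a or b, such as acd, bcd, abcd, etc.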
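
A minimal sketch of alternation, assuming an illustrative pattern cat|dog and a few made-up sentences:

```python
import re

pattern = r'cat|dog'
texts = ['my cat', 'hot dog', 'a bird']

# True when either alternative occurs somewhere in the text
results = [bool(re.search(pattern, t)) for t in texts]

# When both alternatives occur, the leftmost match in the string wins
first = re.search(pattern, 'dog chases cat').group()

print(results)  # [True, True, False]
print(first)    # dog
```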

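? – Question Mark

The question mark (?) checks whether the expression before it occurs zero or one time. For example:

ab?c will match the strings ac, acb, dabc, but will not match abbc because there are two b's. Similarly, it will not match abdc because b is not followed by c.

* – Star

The star (*) symbol matches zero or more occurrences of the regex preceding the * symbol. For example:

ab*c will match the strings ac, abc, abbbc, dabc, etc., but will not match abdc because b is not followed by c.

+ – Plus

The plus (+) symbol matches one or more occurrences of the regex preceding the + symbol. For example:

ab+c will match the strings abc, abbc, dabc, but will not match ac or abdc, because there is no b in ac and b is not followed by c in abdc.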
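
The three quantifiers are easiest to compare side by side. The sketch below applies each pattern to the same candidate strings; matching() is a hypothetical helper written for this example, not part of the re module.

```python
import re

candidates = ['ac', 'abc', 'abbc', 'abdc']

def matching(pattern):
    # Hypothetical helper: keep the candidates in which the pattern is found.
    return [s for s in candidates if re.search(pattern, s)]

zero_or_one = matching(r'ab?c')
zero_or_more = matching(r'ab*c')
one_or_more = matching(r'ab+c')

print(zero_or_one)   # ['ac', 'abc']
print(zero_or_more)  # ['ac', 'abc', 'abbc']
print(one_or_more)   # ['abc', 'abbc']
```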

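{m,n} – Braces

Braces match any number of repetitions of the preceding regex from m to n, both inclusive. For example:

a{2,4} will match the strings aaab, baaaac, gaad, but will not match abc or bc, because there is only one a or no a in those cases.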
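
The following sketch shows what a{2,4} actually captures in those strings. The last line, on a run of six a's, is an extra illustration added here to show that a single match takes at most four.

```python
import re

pattern = r'a{2,4}'

m1 = re.search(pattern, 'aaab').group()
m2 = re.search(pattern, 'baaaac').group()
m3 = re.search(pattern, 'gaad').group()
m4 = re.search(pattern, 'abc')

# A run of six a's is consumed as one match of four and one of two
runs = re.findall(pattern, 'aaaaaa')

print(m1, m2, m3)  # aaa aaaa aa
print(m4)          # None
print(runs)        # ['aaaa', 'aa']
```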

() – Group

The group symbol is used to group sub-patterns. For example:

(a|b)cd will match strings such as acd, abcd, gacd, etc.
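
To see why the parentheses matter, the sketch below compares (a|b)cd with the ungrouped a|bcd on the same input. The variable names are illustrative.

```python
import re

grouped = re.search(r'(a|b)cd', 'gacd')
grouped_b = re.search(r'(a|b)cd', 'abcd')

# Without the parentheses the alternatives are 'a' and 'bcd'
ungrouped = re.search(r'a|bcd', 'gacd')

print(grouped.group(), grouped.group(1))      # acd a
print(grouped_b.group(), grouped_b.group(1))  # bcd b
print(ungrouped.group())                      # a
```

Besides limiting the scope of |, the parentheses also capture the text they matched, which is what group(1) returns.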

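Special Sequences

Special sequences either mark a specific location in the search string where the match must occur (for example \A, \b, \Z) or stand for a predefined class of characters (for example \d, \s, \w). They make commonly used patterns easier to write.

List of special sequences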

Special Sequence	Description	Pattern	Example strings
\A	Matches if the string begins with the given character	\Afor	for geeks, for the world
\b	Matches if the word begins or ends with the given character. \b(string) checks for the beginning of the word and (string)\b checks for the end of the word.	\bge	geeks, get
\B	The opposite of \b, i.e. the string should not start or end with the given regex.	\Bge	together, forge
\d	Matches any decimal digit; this is equivalent to the set class [0-9]	\d	123, gee1
\D	Matches any non-digit character; this is equivalent to the set class [^0-9]	\D	geeks, geek1
\s	Matches any whitespace character.	\s	gee ks, a bc a
\S	Matches any non-whitespace character	\S	a bd, abcd
\w	Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].	\w	123, geeKs4
\W	Matches any non-alphanumeric character.	\W	>$, gee<>
\Z	Matches if the string ends with the given regex	ab\Z	abcdab, abababab
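
The sketch below exercises each special sequence once. The sample strings are short illustrative inputs, and the variable names are assumptions made for this example.

```python
import re

# \d and \D : digit / non-digit
digits = re.findall(r'\d', 'gee1')
non_digits = re.findall(r'\D', 'gee1')

# \s and \S : whitespace / non-whitespace
spaces = re.findall(r'\s', 'a bc a')
non_spaces = re.findall(r'\S', 'a bd')

# \w and \W : word character / non-word character
word_chars = re.findall(r'\w', 'gee<>')
non_word_chars = re.findall(r'\W', 'gee<>')

# \A and \Z : start / end of the whole string
at_start = re.search(r'\Afor', 'for geeks')
at_end = re.search(r'ab\Z', 'abcdab')

# \b and \B : at / not at a word boundary
on_boundary = re.search(r'\bge', 'together')
off_boundary = re.search(r'\Bge', 'together')

print(digits, non_digits)              # ['1'] ['g', 'e', 'e']
print(spaces, non_spaces)              # [' ', ' '] ['a', 'b', 'd']
print(word_chars, non_word_chars)      # ['g', 'e', 'e'] ['<', '>']
print(at_start.span(), at_end.span())  # (0, 3) (4, 6)
print(on_boundary)                     # None
print(off_boundary.span())             # (2, 4)
```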

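Regex Module in Python

Python has a module named re that is used for regular expressions in Python. We can import this module by using the import statement.

Example: Importing the re module in Python

Python3

import re

Let's see the various functions provided by this module for working with regex in Python.

re.findall()

Returns all non-overlapping matches of the pattern in the string, as a list of strings. The string is scanned left to right, and matches are returned in the order found.

Example: Finding all occurrences of a pattern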

Python3

# A Python program to demonstrate working of
# findall()
import re

# A sample text string where regular expression
# is searched.
string = """Hello my Number is 123456789 and
            my friend's number is 987654321"""

# A sample regular expression to find digits.
regex = r'\d+'

match = re.findall(regex, string)
print(match)

# This example is contributed by Ayush Saluja.

Output

['123456789', '987654321']

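re.compile()

Regular expressions are compiled into pattern objects, which have methods for various operations such as searching for pattern matches or performing string substitutions.

Example 1: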

python

# The regular expression module is imported
# with the import statement.
import re

# compile() creates a pattern object for the
# character class [a-e],
# which is equivalent to [abcde].
# The class [abcde] matches any single
# 'a', 'b', 'c', 'd' or 'e'.
p = re.compile('[a-e]')

# findall() searches for the regular expression
# and returns a list of all matches
print(p.findall("Aye, said Mr. Gibenson Stark"))

Output:

['e', 'a', 'd', 'b', 'e', 'a']

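Understanding the Output:

The first occurrence is 'e' in "Aye" and not 'A', because the match is case sensitive.
The next occurrence is 'a' in "said", then 'd' in "said", followed by 'b' and 'e' in "Gibenson"; the last 'a' matches in "Stark".
The metacharacter backslash '\' has a very important role, as it signals various sequences. If the backslash is to be used without its special meaning as a metacharacter, use '\\'.

Example 2: The set class [\s,.] will match any whitespace character, ',' or '.'.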

python

import re

# \d is equivalent to [0-9].
p = re.compile(r'\d')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

# \d+ will match a group of [0-9], a group
# of size one or greater
p = re.compile(r'\d+')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

Output:

['1', '1', '4', '1', '8', '8', '6']
['11', '4', '1886']
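
The caption of Example 2 mentions the set class [\s,.], which the code above does not exercise. A minimal sketch of that class, assuming a shortened sample sentence, could look like this:

```python
import re

# [\s,.] matches one whitespace character, one ',' or one '.'
p = re.compile(r'[\s,.]')
separators = p.findall('Aye, said Mr. Stark')

# Adding + treats a run of such characters as a single separator
words = re.split(r'[\s,.]+', 'Aye, said Mr. Stark')

print(separators)  # [',', ' ', ' ', '.', ' ']
print(words)       # ['Aye', 'said', 'Mr', 'Stark']
```

Inside square brackets the dot is an ordinary character, so it does not need a backslash here.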

Example 3:

python

import re

# \w is equivalent to [a-zA-Z0-9_].
p = re.compile(r'\w')
print(p.findall("He said * in some_lang."))

# \w+ matches a group of alphanumeric characters.
p = re.compile(r'\w+')
print(p.findall("I went to him at 11 A.M., he said *** in some_language."))

# \W matches non-alphanumeric characters.
p = re.compile(r'\W')
print(p.findall("he said *** in some_language."))

Output:

['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g']
['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']
[' ', ' ', '*', '*', '*', ' ', ' ', '.']

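Example 4:

python

import re

# '*' matches zero or more occurrences
# of the preceding character.
p = re.compile('ab*')
print(p.findall("ababbaabbb"))

Output:

['ab', 'abb', 'a', 'abbb']

Understanding the Output:

Our RE is ab*, which is 'a' followed by any number of 'b's, starting from 0.
The output 'ab' is valid because it is a single 'a' followed by a single 'b'.
The output 'abb' is valid because it is a single 'a' followed by 2 'b's.
The output 'a' is valid because it is a single 'a' followed by 0 'b's.
The output 'abbb' is valid because it is a single 'a' followed by 3 'b's.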
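
The examples above only call findall() on the compiled pattern. As the description of re.compile() notes, pattern objects also have methods for searching and substitution; the following is a small sketch of search(), sub() and split() on one compiled pattern, reusing the sentence from Example 2.

```python
import re

p = re.compile(r'\d+')
s = 'I went to him at 11 A.M. on 4th July 1886'

# search() returns a match object for the first match only
first = p.search(s)

# sub() replaces every match with the given text
masked = p.sub('#', s)

# split() cuts the string at every match
parts = p.split(s)

print(first.group(), first.span())  # 11 (17, 19)
print(masked)                       # I went to him at # A.M. on #th July #
print(parts)                        # ['I went to him at ', ' A.M. on ', 'th July ', '']
```

Compiling once and reusing the pattern object keeps the pattern in a single place when it is needed for several operations.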

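re.split()

Splits the string by the occurrences of a character or a pattern. Upon finding the pattern, the remaining characters of the string are returned as part of the resulting list.

Syntax:

re.split(pattern, string, maxsplit=0, flags=0)

The first parameter, pattern, denotes the regular expression. string is the given string in which the pattern is searched and in which the splitting occurs. If maxsplit is not provided it is taken to be zero ('0'), which means no limit; if a nonzero value is provided, at most that many splits occur. If maxsplit = 1, the string is split only once, resulting in a list of length 2. The flags are very useful and can help to shorten code; they are optional parameters. For example, with flags = re.IGNORECASE the case (lowercase or uppercase) is ignored in the split.

Example 1: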

python

from re import split

# '\W+' denotes non-alphanumeric characters
# or a group of such characters. Upon finding ','
# or whitespace ' ', split() splits the
# string at that point
print(split(r'\W+', 'Words, words , Words'))
print(split(r'\W+', "Word's words Words"))

# Here ':', ' ' and ',' are not alphanumeric, thus
# they are the points where splitting occurs
print(split(r'\W+', 'On 12th Jan 2016, at 11:02 AM'))

# '\d+' denotes numeric characters or a group of
# such characters. Splitting occurs at '12', '2016',
# '11', '02' only
print(split(r'\d+', 'On 12th Jan 2016, at 11:02 AM'))

Output:

['Words', 'words', 'Words']
['Word', 's', 'words', 'Words']
['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']
['On ', 'th Jan ', ', at ', ':', ' AM']

Example 2:

python

import re

# Splitting will occur only once, at
# '12'; the returned list will have length 2
print(re.split(r'\d+', 'On 12th Jan 2016, at 11:02 AM', 1))

# 'Boy' and 'boy' will be treated the same when
# flags = re.IGNORECASE
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here', flags=re.IGNORECASE))
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here'))

Output:

['On ', 'th Jan 2016, at 11:02 AM']
['', 'y, ', 'oy oh ', 'oy, ', 'om', ' h', 'r', '']
['A', 'y, Boy oh ', 'oy, ', 'om', ' h', 'r', '']

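re.sub()

The 'sub' in the function name stands for substitution. The given string (3rd parameter) is searched for the regular expression pattern, and each matching substring is replaced by repl (2nd parameter). count limits how many replacements are made; the default 0 replaces every occurrence.

Syntax:

 re.sub(pattern, repl, string, count=0, flags=0)

Example 1: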

python

import re

# The pattern 'ub' matches the string at "Subject"
# and "Uber". As case is ignored through the flag,
# 'ub' matches twice. Upon matching, 'ub' is
# replaced by '~*' in "Subject", and in "Uber",
# 'Ub' is replaced.
print(re.sub('ub', '~*', 'Subject has Uber booked already',
             flags=re.IGNORECASE))

# With case sensitivity, 'Ub' in "Uber"
# will not be replaced.
print(re.sub('ub', '~*', 'Subject has Uber booked already'))

# As count has been given the value 1, at most
# one replacement occurs
print(re.sub('ub', '~*', 'Subject has Uber booked already',
             count=1, flags=re.IGNORECASE))

# 'r' before the pattern denotes a raw string;
# \s matches the whitespace before and after 'AND'.
print(re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam',
             flags=re.IGNORECASE))

Output

S~*ject has ~*er booked already
S~*ject has Uber booked already
S~*ject has Uber booked already
Baked Beans & Spam

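re.subn()

subn() is similar to sub() in all ways, except in the way it provides its output. It returns a tuple containing the new string and the total number of replacements, rather than just the string.

Syntax:

 re.subn(pattern, repl, string, count=0, flags=0)

Example: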

python

import re

print(re.subn('ub', '~*', 'Subject has Uber booked already'))

t = re.subn('ub', '~*', 'Subject has Uber booked already',
            flags=re.IGNORECASE)
print(t)
print("Length of Tuple is: ", len(t))

# This will give the same output as sub() would have
print(t[0])

Output

('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)
Length of Tuple is:  2
S~*ject has ~*er booked already

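re.escape()

Returns the string with a backslash placed before every character that has a special meaning in a regular expression (before Python 3.7, before every non-alphanumeric character). This is useful if you want to match an arbitrary literal string that may contain regular expression metacharacters.

Syntax:

re.escape(string)

Example: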

python

import re

# escape() puts a backslash '\' before every character
# that has a special meaning in a regex (before Python 3.7,
# before every non-alphanumeric character).
# In the 1st case only ' ' is escaped.
# In the 2nd case ' ', '[', '-', ']', the tab and the
# caret '^' are escaped.
print(re.escape("This is Awesome even 1 AM"))
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))

Output

This\ is\ Awesome\ even\ 1\ AM
I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \	\ \^WoW
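
A typical use of re.escape() is searching for text that should be taken literally. The sketch below assumes a hypothetical search term, '$5.00', and shows how the result changes once it is escaped.

```python
import re

text = 'Price: $5.00 (approx.) and $5000'
literal = '$5.00'  # hypothetical text that should be found as-is

# Unescaped, '$' is the end-of-string anchor and '.' matches any
# character, so the pattern does not mean "the text $5.00".
unescaped = re.findall(literal, text)

# Escaped, every metacharacter is matched literally.
escaped = re.findall(re.escape(literal), text)

print(unescaped)  # []
print(escaped)    # ['$5.00']
```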

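re.search()

This method either returns None (if the pattern doesn't match) or a match object (re.Match) that contains information about the matching part of the string. The method stops after the first match, so it is better suited for testing a regular expression than for extracting data.

Example: Searching for an occurrence of the pattern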

Python3

# A Python program to demonstrate working of re.search().
import re

# Let's use a regular expression to match a date string
# in the form of month name followed by day number
regex = r"([a-zA-Z]+) (\d+)"

match = re.search(regex, "I was born on June 24")

if match is not None:

    # We reach here when the expression "([a-zA-Z]+) (\d+)"
    # matches the date string.

    # This will print "Match at index 14, 21", since the match
    # starts at index 14 and ends at 21, i.e. it spans [14, 21).
    print("Match at index %s, %s" % (match.start(), match.end()))

    # We use the group() method to get all the matches and
    # captured groups. The groups contain the matched values.
    # In particular:
    # match.group(0) always returns the fully matched string
    # match.group(1), match.group(2), ... return the capture
    # groups in order from left to right in the input string
    # match.group() is equivalent to match.group(0)

    # So this will print "June 24"
    print("Full match: %s" % (match.group(0)))

    # So this will print "June"
    print("Month: %s" % (match.group(1)))

    # So this will print "24"
    print("Day: %s" % (match.group(2)))

else:
    print("The regex pattern does not match.")

Output

Match at index 14, 21
Full match: June 24
Month: June
Day: 24

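Match Object

A Match object contains all the information about the search and the result. If no match is found, None is returned instead. Let's see some of the commonly used methods and attributes of the match object.

Getting the string and the regex

The match.re attribute returns the regular expression passed, and the match.string attribute returns the string passed.

Example: Getting the string and the regex of the matched object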

Python3

import re

s = "Welcome to GeeksForGeeks"

# here res is the match object
res = re.search(r"\bG", s)

print(res.re)
print(res.string)

Output

re.compile('\\bG')
Welcome to GeeksForGeeks

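Getting the index of the matched object

start() method returns the starting index of the matched substring
end() method returns the ending index of the matched substring
span() method returns a tuple containing the starting and the ending index of the matched substring

Example: Getting the index of the matched object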

Python3

import re

s = "Welcome to GeeksForGeeks"

# here res is the match object
res = re.search(r"\bGee", s)

print(res.start())
print(res.end())
print(res.span())

Output

11
14
(11, 14)

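Getting the matched substring

group() method returns the part of the string for which the pattern matches. See the example below for a better understanding.

Example: Getting the matched substring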

Python3

import re

s = "Welcome to GeeksForGeeks"

# here res is the match object
res = re.search(r"\D{2} t", s)

print(res.group())