Regular Expression Extracting Matches

Summary: in this tutorial, we will show you how to extract the parts of the string that match a regular expression.

Knowing whether a string matches a regular expression is useful, but it is often more useful to pull the matching text out of the string for further processing.

Perl makes it easy to extract the parts of a string that match: put parentheses () around the part of the regular expression whose match you want to keep. For each set of capturing parentheses, Perl stores the matched text in the special variables $1, $2, $3, and so on, numbered by the position of the opening parenthesis. Perl sets these special variables only when the match succeeds, so check that the match succeeded before you use them.
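As a quick illustration of the numbering, here is a minimal sketch with three capturing groups. The helper name parse_date and the sample date string are assumptions made up for this illustration; they are not part of the tutorial's own examples.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: split a YYYY-MM-DD date into its three parts.
sub parse_date {
    my ($date) = @_;

    # Three sets of parentheses fill $1, $2 and $3, in that order.
    if ($date =~ /(\d{4})-(\d{2})-(\d{2})/) {
        return ($1, $2, $3);
    }

    # No match: this attempt did not set $1..$3, so we do not read them.
    return;
}

my ($year, $month, $day) = parse_date('2013-06-13');
print "year=$year month=$month day=$day\n";   # prints: year=2013 month=06 day=13

my @parts = parse_date('not a date');
print "no match\n" unless @parts;             # prints: no match
```

Reading $1, $2 and $3 only inside the if block keeps the code from picking up values left over from an earlier successful match.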

Let’s take a look at the following example:

#!/usr/bin/perl
use warnings;
use strict;

my $time = localtime();
print $time, "\n";
print ("$1 \n") if($time =~ /(\d\d:\d\d:\d\d)/);

How it works.

  • First, we got the local time using the localtime() function.
  • Then, we used the regular expression /(\d\d:\d\d:\d\d)/ to capture time data in the format hh:mm:ss. We accessed the captured match using the special variable $1.

The output of the program looks like this:

Thu Jun 13 14:17:16 2013
14:17:16
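The same idea extends to capturing the hours, minutes, and seconds separately. The sketch below is a variation, not the program above: it uses a fixed timestamp string so the result is repeatable, and the helper name split_time is an assumption introduced for illustration. It also relies on the fact that a match evaluated in list context returns the captured groups directly.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: return (hours, minutes, seconds) from a timestamp string.
sub split_time {
    my ($stamp) = @_;

    # In list context a successful match returns the captured groups;
    # a failed match returns an empty list, leaving $h undefined.
    my ($h, $m, $s) = $stamp =~ /(\d\d):(\d\d):(\d\d)/;

    return defined $h ? ($h, $m, $s) : ();
}

my ($h, $m, $s) = split_time('Thu Jun 13 14:17:16 2013');
print "hours=$h minutes=$m seconds=$s\n";   # prints: hours=14 minutes=17 seconds=16
```

Assigning the match result to a list avoids touching $1, $2 and $3 at all, which can be convenient when several matches happen close together.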

The following example demonstrates how to extract data from text.

#!/usr/bin/perl
use warnings;
use strict;
use Data::Dumper;

my $text = <<END;
name: Antonio Vivaldi, period: 1678-1741
name: Andrea Zani,period: 1696-1757
name: Antonio Brioschi, period: 1725-1750
END

my %composers;

for my $line (split /\n/, $text){
    print $line, "\n";
    # $1 holds the name and $2 holds the period
    if($line =~ /name:\s+(\w+\s+\w+),\s+period:\s*(\d{4}\-\d{4})/){
        $composers{$1} = $2;
    }
}

print Dumper(\%composers);

The output of the program is as follows:

name: Antonio Vivaldi, period: 1678-1741
name: Andrea Zani,period: 1696-1757
name: Antonio Brioschi, period: 1725-1750
$VAR1 = {
          'Antonio Brioschi' => '1725-1750',
          'Antonio Vivaldi' => '1678-1741'
        };

In the example above, our goal is to extract composer names and their periods out of the string $text. The text data could come from a file or a web page. To make it simple, we used a string variable to store it.

  • First, we split the text into multiple lines by using the split() function that returns a list of lines.
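  • Second, we looped over the list of lines. For each line, we used the regular expression /name:\s+(\w+\s+\w+),\s+period:\s*(\d{4}\-\d{4})/ to capture the name and the period. The name is captured by (\w+\s+\w+) and ends up in $1; the period is captured by (\d{4}\-\d{4}) and ends up in $2. The captured data is stored in the hash variable %composers. Note that the line for Andrea Zani has no whitespace after the comma, so ,\s+ does not match it, and that composer is absent from the hash.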
  • Third, we displayed the %composers hash using Dumper.
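The pattern in the walkthrough requires at least one whitespace character after the comma, which is why the "Andrea Zani,period:" line is skipped. If you want that line to be captured too, one possible variation is to relax ,\s+ to ,\s*. The sketch below assumes that is the desired behavior; the helper name parse_composers is hypothetical and introduced only for this illustration.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: parse "name: ..., period: ..." lines into a hash.
sub parse_composers {
    my ($text) = @_;
    my %composers;

    for my $line (split /\n/, $text) {
        # ,\s* allows zero or more whitespace characters after the comma,
        # so "Zani,period:" matches as well as "Vivaldi, period:".
        if ($line =~ /name:\s+(\w+\s+\w+),\s*period:\s*(\d{4}-\d{4})/) {
            $composers{$1} = $2;
        }
    }

    return %composers;
}

my $text = <<'END';
name: Antonio Vivaldi, period: 1678-1741
name: Andrea Zani,period: 1696-1757
name: Antonio Brioschi, period: 1725-1750
END

my %composers = parse_composers($text);

# Sort the keys because the order of hash keys is not guaranteed.
print "$_ => $composers{$_}\n" for sort keys %composers;
```

With the relaxed pattern, all three composers end up in the hash. Whether to accept loosely formatted input like this depends on how strict you want the parser to be.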

In this tutorial, you have learned how to capture data from text using regular expressions.