MariaDB – Multiple results with REGEXP_SUBSTR
The Problem
On a WordPress site that uses a MariaDB database I need to extract file paths saved on the postmeta table. Problem is that they are saved in a format similar to this:
a:1:{i:0;s:137:"/var/www/vhosts/domain.com/httpdocs/wp-content/uploads/product-extras/8d68861b92064c8f9db3f0eb54c6af8e/2023/06/image1.jpg";}
Although I’m aware I could just parse this data easily on PHP side I wanted this data to be cleaned and returned by a stored procedure, so SQL side.
The Solution
REGEX_SUBSTR Function
Fortunately, MariaDB comes with the REGEXP_SUBSTR function. You can check it on the official documentation but I’ll go through it here.
REGEXP_SUBSTR takes 2 parameters:
- subject
- pattern
It returns the part of the string subject that matches the regular expression pattern, or na empty string if the pattern was not found.
Two quick notes:
- REGEXP_SUBSTR case sensivity depends on the DB collation.
- This function supports PCRE.
Regular Expression
I’ve built the following regular expression:
(wp-content)\/([a-zA-Z0-9_.-]\/*)*
- Captures a first group containing “wp-content”.
- Followed by a “/” character.
- Captures a second group which may contain any of the following characters:
- a-z: Any character in the range “a” to “z”
- A-Z: Any character in the range “A” to “Z”
- 0-9: Any character from “0” to “9”
- .-: Any of the following characters: “”, “-“, “.”
If you’re not familiar with regular expressions (RegEx) I would recommend to check this website and explore a little: https://regexr.com/
This RegEx applied to the string I get from the DB results in:
wp-content/uploads/product-extras/8d68861b92064c8f9db3f0eb54c6af8e/2023/06/image1.jpg
Which is perfect for what I need.
The 2nd Challenge
There is a relevant caveat with REGEXP_SUBSTR. It only returns the first result. However, I need them all.
Time for a stored procedure.
DROP PROCEDURE IF EXISTS usp_fetchUrls;
DELIMITER //
CREATE PROCEDURE usp_fetchUrls()
BEGIN
DECLARE to_parse MEDIUMTEXT;
DECLARE tmp MEDIUMTEXT;
CREATE TEMPORARY TABLE results (url MEDIUMTEXT NOT NULL);
SET to_parse = (SELECT PM.meta_value FROM `wp_postmeta` as PM WHERE PM.meta_id = '123456');
SET tmp = (SELECT REGEXP_SUBSTR(to_parse, '(wp-content)\/([A-z]*\/[a-zA-Z0-9_.-]*)*'));
WHILE LENGTH(tmp) > 0 DO
INSERT INTO results (url) VALUES (tmp);
SET to_parse = REPLACE(to_parse, tmp, '');
SET tmp = (SELECT REGEXP_SUBSTR(to_parse, '(wp-content)\/([A-z]*\/[a-zA-Z0-9_.-]*)*'));
END WHILE;
SELECT * FROM results;
DROP TEMPORARY TABLE IF EXISTS results;
END
//
DELIMITER ;
I won’t go too much in-depth about the stored procedure since it’s not the goal here. What’s happening here is:
- I declare the variables I’ll need, plus the temporary table that will hold my results.
DECLARE to_parse MEDIUMTEXT; -- Original string from postmeta
DECLARE tmp MEDIUMTEXT; -- Holds the result on each iteration
-- Temporary table to hold all the results.
CREATE TEMPORARY TABLE results (url MEDIUMTEXT NOT NULL);
- Fetch the string to parse from the postmeta table.
SET to_parse = (SELECT PM.meta_value FROM `wp_postmeta` as PM WHERE PM.meta_id = '123456');
- Parse the first result from the string using REGEXP_SUBSTR applied to the to_parse string with the regular expression built before.
SET to_parse = (SELECT PM.meta_value FROM `H093m_postmeta` as PM WHERE PM.meta_id = '1846763');
- Since REGEXP_SUBSTR returns an empty string if no results are found I can use that as a stopping condition for a WHILE loop.
WHILE LENGTH(tmp) > 0 DO
INSERT INTO results (url) VALUES (tmp);
SET to_parse = REPLACE(to_parse, tmp, '');
SET tmp = (SELECT REGEXP_SUBSTR(to_parse, '(wp-content)\/([A-z]*\/[a-zA-Z0-9_.-]*)*'));
END WHILE;
- While my result from REGEXP_SUBSTR is not an empty string I keep going.
- Insert the result into my temporary table.
- Replace the found result on the postmeta string with empty so next time we apply REGEXP_SUBSTR to this string I don’t get back the same result but the next (if any).
- Apply REGEXP_SUBSTR on the modified postmeta string.
- Finally I just return the results from the temporary table and DROP the temporary table as good practice.
SELECT * FROM results;
DROP TEMPORARY TABLE IF EXISTS results;
Applying this stored procedure to this postmeta result:
a:2:{i:0;s:121:"/var/www/vhosts/domain.com/httpdocs/wp-content/uploads/product-extras/b8f10481054349039776cf74be2dcc39/2023/06/image1.jpg";i:1;s:137:"/var/www/vhosts/domain.com/httpdocs/wp-content/uploads/product-extras/b8f10481054349039776cf74be2dcc39/2023/06/image2.jpg";}
Would result in:
wp-content/uploads/product-extras/b8f10481054349039776cf74be2dcc39/2023/06/image1.jpg
wp-content/uploads/product-extras/b8f10481054349039776cf74be2dcc39/2023/06/image2.jpg
The stored procedure could (and probably should) be improved to receive the postmeta ID by parameter (per example), instead of being hardcoded. That’s actually outside the scope of this post, but feel free to ask.
Would you approach this in a different way?