Strings and Regular Expressions in PHP, or "PCRE, POSIX, and Bears, Oh My!"
UPHPU Meeting, January 18, 2005
Mac Newbold
mac@macnewbold.com

Who am I?
- Full-time self-employed computer geek
- MNE, LLC (macnewbold.com, owner) and
- Digital Media Consulting, LLC (a.k.a. Dmedia, www.dmedia.ws, partner)
- Wide variety of PHP-driven web sites, mostly with MySQL and without Javascript and Flash
- Background: B.S. C.S. '01, M.S. C.S. '05 - University of Utah - Go Utes!

Campaign Promises
- Intro to Strings in PHP
  - (Feel free to tell me how fast or slow to go)
- Functions relating to HTML, SQL, etc.
- Regular Expressions
  - PCRE
  - POSIX
- Performance/Speed considerations
- Grab bag of cool string functions

Introducing: Strings in PHP
- Much like strings in any other language
- Major difference: the boundary between string, integer, float, and boolean is very blurred
  - Actually a benefit: if it's not a string, but should be, it will be
  - Though this can lead to some unexpected results
- Info in PHP Manual:
  - www.php.net/strings
  - www.php.net/manual/en/language.types.string.php

String Syntax
- Single quotes: 'a string'
  - No variable interpolation; \' and \\ are the only escape codes
- Double quotes: "a $better string\n"
  - Variables work, standard escape codes work
- "Here-doc" syntax: $foo = <<< [...]
- [...] $str{3} == "B" (character access by offset; PHP 8 accepts only $str[3])
- Concatenation: the dot operator
  - "This lets you join strings into ". "bigger ones"
  - Note: Avoiding embedded newlines "in strings that wrap onto multiple lines" is a good idea
- Concatenating Assignment: .=
  - $str = "My name is"; $str .= " Mac.\n";

Variables in Strings
- "Simple string with a $var in it\n"
- "You can use $an_array[$var] too\n"
- "Sometimes you need ${curl}ies to mark where the {$var}iable ends"
- "Curlies help on {$big['fancy'][$stuff]} too"
- "Where it's confusing to embed ". $big['ugly'][$var]. "iables, break it up as needed with concatenation."
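The quoting and interpolation rules above can be tried out with a short script. This is a minimal sketch rather than anything from the original slides: the variable names ($name, $user, $fruit) are made up for illustration, and it uses square-bracket character access because that is the form current PHP accepts.

```php
<?php
// Hypothetical sample data, used only for illustration.
$name  = 'Mac';
$user  = ['first' => 'Mac', 'langs' => ['PHP', 'Perl']];
$fruit = 'apple';

// Single quotes: no interpolation, and \n stays as two literal characters.
$single = 'Hello, $name\n';

// Double quotes: variables and escape codes are expanded.
$double = "Hello, $name\n";

// Curly braces mark where the variable name ends.
$plural = "Two {$fruit}s";

// Curly braces are also needed for nested or quoted array keys.
$nested = "First language: {$user['langs'][0]}";

// Here-doc: behaves like a double-quoted string that spans several lines.
$heredoc = <<<EOT
Name: $name
Likes: {$fruit}s
EOT;

// Concatenation (.) and concatenating assignment (.=).
$sentence = 'My name is';
$sentence .= ' ' . $name . '.';

// Character access by zero-based offset.
$third = $name[2];

echo $single, "\n", $double, $plural, "\n", $nested, "\n";
echo $heredoc, "\n", $sentence, "\n", $third, "\n";
```

When the embedded expression gets hard to read, breaking the string up with the dot operator, as the last bullet suggests, is usually the clearer choice.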
Must-Have String Functions
- www.php.net/strings
- echo/print
  - (print $foo)==1, echo "can", $take,"more than one","argument";
  - Echo shortcut: <?= $foo ?>
- trim, ltrim, rtrim/chop - remove whitespace
- explode, implode/join
  - $arr = explode(" ", "List of words");
  - $str = implode(",",$arr);

Obligatory C-like Functions
- All your old favorites are in there:
  - printf, sprintf, sscanf, fprintf
  - strcmp, strlen, strpos, strtok
- They all do just what you expect, though many of them have easier alternatives
- Gotcha: Some of them (like strpos and friends) return boolean false when nothing is found, because 0 is a valid position. Always compare with "=== false".

Basic String Manipulation
- Any of this can be done with regular expressions as well...
  - and in more complex cases, can only be done with regular expressions
  - But regular expressions are slower (more later)
- str_replace("bar","baz","foobar");
- str_repeat("1234567890",8);

Formatting functions
- strtolower, strtoupper
- ucfirst, ucwords - uppercase first char, or first char of each word
- wordwrap - wrap text to a given width
- str_pad("tooshort",15," ");
- vprintf, vfprintf, vsprintf - formatted output
- number_format - add thousands grouping
- money_format - format as currency

Special-Purpose Functions
- One of PHP's strengths is the way it caters to the common things people need
- Many string functions are specifically for use with things like dates/times, URLs, HTML, and SQL databases
- Advice: When you need them, use them. "Rolling your own" doesn't usually work out the way you plan it.
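The strpos gotcha and a few of the helpers named above can be seen in a short script. This is a minimal sketch: the contains() helper and the sample values are hypothetical, and the type hints assume PHP 7 or later.

```php
<?php
// strpos returns the offset of the match, or boolean false if there is none.
// Offset 0 is a valid match, so a loose comparison (==) would misreport it.
function contains(string $haystack, string $needle): bool
{
    return strpos($haystack, $needle) !== false;
}

$atStart = strpos('foobar', 'foo');   // int 0: found at the very beginning
$missing = strpos('foobar', 'qux');   // false: not found

// explode/implode convert between strings and arrays.
$words = explode(' ', 'List of words');
$csv   = implode(',', $words);

// A few of the manipulation and formatting helpers.
$trimmed = trim("  padded \n");
$padded  = str_pad('tooshort', 15, ' ');
$grouped = number_format(1234567.891, 2);
$titled  = ucwords('hello php world');
$swapped = str_replace('bar', 'baz', 'foobar');

var_dump($atStart, $missing, contains('foobar', 'foo'));
echo $csv, "\n", $grouped, "\n", $titled, "\n", $swapped, "\n";
```

Wrapping the strict comparison in one small function keeps the "=== false" rule in a single place instead of at every call site.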
Date and Time Functions
- www.php.net/datetime
- A variety of functions to not only do calculations with dates, but to convert dates to strings
  - date(), strftime()
- And more importantly, to convert strings to dates
  - strtotime(), strptime()
- Great example of why not to "roll your own", even if it doesn't seem that complex at first

URL Functions
- www.php.net/url
- urlencode, urldecode
  - Turn non-alphanumerics to %[hex] and ' '->'+'
  - rawurl{en,de}code do the same except for '+'
- parse_url - break into host, path, query, etc.
- http_build_query - turn array to URL query
- base64_{en,de}code - base64 conversions for use with MIME, etc.

HTML Functions
- htmlspecialchars - encode &, ", <, and > as &amp;, &quot;, &lt;, and &gt;
  - htmlentities is the same but for every character that has an entity
  - html_entity_decode is the reverse
- nl2br - insert <br /> tags before newlines (\n)
- parse_str - parse GET query into variables or an array (see also: extract)
- strip_tags - strip html tags [selectively]

SQL Functions
- "Magic Quotes" - on by default (removed entirely in PHP 5.4)
  - Misnamed - adds magic slashes, not quotes
- addslashes, stripslashes - escape ', ", and \
- Advice: do db queries first, then use $var = htmlspecialchars(stripslashes($input)) for use in HTML tags
- quotemeta - escape . \ + * ? [ ^ ] ( $ )
  - For commands (system() and `backticks`), reach for escapeshellarg/escapeshellcmd; quotemeta does not escape quotes, spaces, ; or |

Now for the fun stuff...
- Intro to Strings in PHP
  - (Feel free to tell me how fast or slow to go)
- Functions relating to HTML, SQL, etc.
- Regular Expressions
  - PCRE
  - POSIX
- Performance/Speed considerations
- Grab bag of cool string functions

Regular Expressions
- Extremely powerful tool for pattern matching
  - the same technique compilers and interpreters use to tokenize your programs
- Two flavors in PHP:
  - PCRE - Perl-Compatible Regular Expressions
  - POSIX Extended
- I favor PCRE - multiple languages, more features, faster, and binary-safe

Basics of RE's
- They match patterns - the magic is in the pattern you tell them to match
- They have to be precise, including and excluding exactly what you want
- People get scared of them because the details can be tricky
- But they're one of the best tools you have for doing some pretty fancy string stuff

RE Patterns
- Start with strings and grouping: "abc(def)"
- Add alternative branches: "abc(def|123)"
- Wildcard: . matches any char but \n
- Quantifiers/Repeating:
  - * = "0 or more", + = "1 or more", ? = "0 or 1"
  - {n} = "n times", {n,m} = "n to m times"
- "(abc)+(def|123)*(.{2})*"
  - At least one abc, maybe some triplets, then an even number of characters

Character Classes and Types
- [] makes character classes
  - List of characters and ranges: [a-zA-Z0-9]
  - If you want to use -, put it at the beginning
  - Escape any special chars with \ as usual
  - If first char is ^, class is negated
- \d = [0-9], \D = [^0-9]
- \s = whitespace, \S = non-whitespace
- \w = [a-zA-Z0-9_], \W = [^a-zA-Z0-9_]
- \b = word boundary - "zero-width assertion"

Anchors
- What if you want to force it to match only at the beginning of the string? Or to match the entire string?
- Use an anchor!
  - ^ as the first char anchors the beginning
  - $ as the last char anchors the end
  - (Varies slightly in multi-line mode)

Greediness and Modifiers
- Regular Expressions are Greedy
  - They'll keep eating characters as long as they can keep matching.
  - Consider: "<.*>" vs. "<[^>]*>" when matching against text wrapped in tags, e.g. "<b>Hi</b>"
- PCRE has modifiers, placed after the closing delimiter: /pattern/modifiers
  - /i = case insensitive
  - /U = un-greedy
  - /m = multi-line

Back References
- Most commonly used in replace operations, but can be used in match patterns as well
- Parentheses not only group, but capture too
- Use \ followed by the number of the capture
- "ab(.)\1(.)\2" will match abccdd or abxxyy, but not abcccd or abdcdc
- Can get tricky to count which backref goes where with nested parentheses

Modifiers for Parentheses
- PCRE Only - makes some things possible that otherwise couldn't be done
- Non-capturing grouping: (?: )
  - Can simplify back-reference counting
- Look-ahead Assertions:
  - They don't advance the matching position
  - Positive: (?= ), or Negative: (?! )
- Very powerful, but not always easy to understand. Trial and error can be your friend!
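In the spirit of trial and error, the greediness, anchor, back-reference, and look-ahead rules above can be checked with a few preg_* calls. This is a minimal sketch: the '<b>Hi</b>' input and the foobar/foobaz strings are assumptions chosen for illustration, not taken from the slides.

```php
<?php
// Hypothetical input string for the greediness comparison.
$html = '<b>Hi</b>';

// Greedy: .* runs to the last > it can reach, so the whole string matches.
preg_match('/<.*>/', $html, $greedy);

// A negated character class stops at the first >.
preg_match('/<[^>]*>/', $html, $classy);

// The U modifier flips quantifiers to un-greedy.
preg_match('/<.*>/U', $html, $ungreedy);

// Anchors force the pattern to cover the entire string.
$isNumber  = preg_match('/^\d+$/', '2005') === 1;
$notNumber = preg_match('/^\d+$/', 'year 2005') === 1;

// Back references: \1 and \2 must repeat what the groups captured.
$pairs = [];
foreach (['abccdd', 'abxxyy', 'abcccd', 'abdcdc'] as $s) {
    $pairs[$s] = preg_match('/ab(.)\1(.)\2/', $s) === 1;
}

// Non-capturing group: groups without adding an entry to $m.
preg_match('/(?:abc)+(def|123)/', 'abcabc123', $m);

// Look-ahead: assert what follows without consuming it.
$positive = preg_replace('/foo(?=bar)/', 'X', 'foobar foobaz');
$negative = preg_replace('/foo(?!bar)/', 'X', 'foobar foobaz');

echo $greedy[0], "\n", $classy[0], "\n", $ungreedy[0], "\n";
var_dump($isNumber, $notNumber);
print_r($pairs);
print_r($m);
echo $positive, "\n", $negative, "\n";
```

The patterns are written in single-quoted strings so that backslashes such as \d and \1 reach the regex engine untouched.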
PCRE Specifics
- www.php.net/pcre
- preg_match, preg_match_all, preg_replace, preg_split, preg_grep (filter an array)
- Perl RE's have a delimiter, usually /, but can be anything:
  - preg_match("/foo/",$bar);
  - preg_match("%/usr/local/bin/%",$path);

POSIX Specifics
- www.php.net/regex
- ereg, ereg_replace, split, eregi, spliti, etc.
- [Only] Advantage over PCRE: It doesn't require the PCRE library to be installed, so it's always there in any PHP installation
  - No longer true: the ereg family was deprecated in PHP 5.3 and removed in PHP 7.0
- Other regex engines support this specification, though the Perl style seems to be more popular.

Almost there...
- Intro to Strings in PHP
  - (Feel free to tell me how fast or slow to go)
- Functions relating to HTML, SQL, etc.
- Regular Expressions
  - PCRE
  - POSIX
- Performance/Speed considerations
- Grab bag of cool string functions

Performance/Speed
- Rule of thumb: use the simplest function that will get the job done right
  - strpos instead of substr
  - str_replace instead of preg_replace
  - And so forth...
- The PHP manual online usually includes notes about speed differences
- PCRE is faster than POSIX Regex

Grab Bag
- md5, md5_file - Calculate md5 hashes
  - Handy for checksums; no longer suitable for passwords in databases (use password_hash(), PHP 5.5+)
- levenshtein, similar_text - calculate the "similarity" of two strings
- metaphone, soundex - calculate how similar two strings sound when spoken out loud
- str_rot13 - Encryption algorithm - Protected by the DMCA

Grab Bag 2
- str_shuffle - words are much more fun once they've been randomized
- count_chars, str_word_count - statistics about your strings
- strrev - if it doesn't make sense forward, try it backwards

Grand Finale
- Any questions?

Group Practice
- 8.3 filenames - anything but zip files
  - /^.{0,8}(\.[^z][^i]?[^p]?)?$/i - fails filename.ftp
  - /^.{0,8}\.(?!zip$).{0,3}$/i - PCRE only
  - Sometimes easier to match rejects rather than keepers
- Apache access log example:
  - 4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] "GET /robots.txt HTTP/1.0" 404 337 "-" "Holmes/1.0"
  - preg_match("/^(\d{1,3}(?:\.\d{1,3}){3}) ".  #IP
        "- - \[(.+)\] \"\w+ (\S+) (\S+)\" (\d+) (\d+) ".
        "\"-\" \"([^\"]*)\"$/",$row,$matches);
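Both practice problems can be run end to end. This is a minimal sketch, not the presenter's solution: the function names (isAllowedFilename, parseLogLine), the returned array keys, and the sample filenames are hypothetical. The filename pattern is one possible reading of the "negative look-ahead" idea, and the nullable return type assumes PHP 7.1 or later.

```php
<?php
// "8.3 filename, but not .zip" using a negative look-ahead.
// The exact pattern is an assumption about what the exercise intends.
function isAllowedFilename(string $name): bool
{
    return preg_match('/^.{0,8}\.(?!zip$).{0,3}$/i', $name) === 1;
}

// Parse one line of an Apache access log in the format shown above.
// Returns null when the line does not match.
function parseLogLine(string $row): ?array
{
    $pattern = '/^(\d{1,3}(?:\.\d{1,3}){3}) '                    // IP address
             . '- - \[(.+)\] "\w+ (\S+) (\S+)" (\d+) (\d+) '     // time, path, protocol, status, bytes
             . '"-" "([^"]*)"$/';                                // referrer "-", user agent
    if (preg_match($pattern, $row, $m) !== 1) {
        return null;
    }
    return [
        'ip'       => $m[1],
        'time'     => $m[2],
        'path'     => $m[3],
        'protocol' => $m[4],
        'status'   => (int) $m[5],
        'bytes'    => (int) $m[6],
        'agent'    => $m[7],
    ];
}

// Hypothetical filenames to filter.
$files = ['filename.ftp', 'report.txt', 'archive.zip', 'ARCHIVE.ZIP', 'toolongfilename.txt'];
$kept  = array_values(array_filter($files, 'isAllowedFilename'));

$row = '4.79.40.166 - - [07/Jan/2005:04:35:42 -0700] "GET /robots.txt HTTP/1.0" 404 337 "-" "Holmes/1.0"';
$hit = parseLogLine($row);

print_r($kept);
print_r($hit);
```

Like the pattern on the slide, this log pattern only accepts lines whose referrer field is "-"; a more general parser would capture that field as well.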