Unicode: flag "u" and class \p{...}

JavaScript uses Unicode encoding (UTF-16) for strings. Most characters are encoded with 2 bytes, but 2 bytes can represent at most 65536 characters.

That range is not big enough to encode all possible characters. That’s why some rare characters are encoded with 4 bytes, for instance 𝒳 (mathematical X) or 😄 (a smile), some hieroglyphs and so on.

Here are the Unicode values of some characters:

Character	Unicode	Bytes count in Unicode
a	`0x0061`	2
≈	`0x2248`	2
𝒳	`0x1d4b3`	4
𝒴	`0x1d4b4`	4
😄	`0x1f604`	4

So characters like a and ≈ occupy 2 bytes, while codes for 𝒳, 𝒴 and 😄 are longer: they take 4 bytes.

A long time ago, when the JavaScript language was created, Unicode encoding was simpler: there were no 4-byte characters. So, some language features still handle them incorrectly.

For instance, length thinks that there are two characters here:

alert('😄'.length); // 2
alert('𝒳'.length); // 2

…But we can see that there’s only one, right? The point is that length treats 4 bytes as two 2-byte characters. That’s incorrect, because they must be considered only together (a so-called “surrogate pair”; you can read about them in the article Strings).
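
To make the mismatch visible, here is a small sketch that compares the number of 2-byte units with the number of actual characters. It uses console.log instead of alert so that it also runs outside the browser, and the variable names are chosen only for illustration.

```javascript
const smile = '😄';

// length counts 2-byte code units, so the surrogate pair counts as 2
const unitCount = smile.length;

// string iteration (used by the spread syntax) walks by whole characters
const charCount = [...smile].length;

// the two halves of the surrogate pair, and the full Unicode value
const highSurrogate = smile.charCodeAt(0).toString(16);
const lowSurrogate = smile.charCodeAt(1).toString(16);
const codePoint = smile.codePointAt(0).toString(16);

console.log(unitCount, charCount); // 2 1
console.log(highSurrogate, lowSurrogate, codePoint); // d83d de04 1f604
```

Neither half of the pair means anything on its own; only together do they give 0x1f604 from the table above.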

By default, regular expressions also treat 4-byte “long characters” as a pair of 2-byte ones. And, as it happens with strings, that may lead to odd results. We’ll see that a bit later, in the article Sets and ranges [...].

Unlike strings, regular expressions have the flag u that fixes such problems. With this flag, a regexp handles 4-byte characters correctly. Unicode property search also becomes available; we’ll get to it next.
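
As a quick illustration of the difference, the sketch below runs the same pattern over a one-character string with and without the flag. The pattern . (a dot) matches any single character except a newline; the variable names are made up for this example.

```javascript
const longChar = '𝒳';

// without "u" the dot sees two separate 2-byte halves
const matchesWithoutU = longChar.match(/./g).length;

// with "u" the dot sees one 4-byte character
const matchesWithU = longChar.match(/./gu).length;

console.log(matchesWithoutU); // 2
console.log(matchesWithU);    // 1
```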

Unicode properties \p{…}

Every character in Unicode has a lot of properties. They describe what “category” the character belongs to and contain miscellaneous information about it.

For instance, if a character has the Letter property, it means that the character belongs to an alphabet (of any language). And the Number property means that it’s a digit: maybe Arabic or Chinese, and so on.

We can search for characters with a property, written as \p{…}. To use \p{…}, a regular expression must have flag u.

For instance, \p{Letter} denotes a letter in any language. We can also use \p{L}, as L is an alias of Letter. There are shorter aliases for almost every property.

In the example below three kinds of letters will be found: English, Georgian and Korean.

let str = "A ბ ㄱ";

alert( str.match(/\p{L}/gu) ); // A,ბ,ㄱ
alert( str.match(/\p{L}/g) ); // null (no matches, \p doesn't work without the flag "u")
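
Why null rather than an error? In browsers and Node.js the pattern without the flag is still valid: as far as the legacy syntax goes, \p is read as an escaped letter p and {L} as plain text. The sketch below, with a made-up test string, shows the consequence.

```javascript
const sample = "p{L}";

// no "u": the pattern matches the literal text "p{L}"
const legacyResult = sample.match(/\p{L}/g);

// with "u": the pattern matches any letter
const unicodeResult = sample.match(/\p{L}/gu);

console.log(legacyResult);  // [ 'p{L}' ]
console.log(unicodeResult); // [ 'p', 'L' ]
```

So forgetting the flag does not fail loudly; the regexp silently looks for something else.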

Here are the main character categories and their subcategories:

Letter L:
- lowercase Ll,
- modifier Lm,
- titlecase Lt,
- uppercase Lu,
- other Lo.
Number N:
- decimal digit Nd,
- letter number Nl,
- other No.
Punctuation P:
- connector Pc,
- dash Pd,
- initial quote Pi,
- final quote Pf,
- open Ps,
- close Pe,
- other Po.
Mark M (accents etc):
- spacing combining Mc,
- enclosing Me,
- non-spacing Mn.
Symbol S:
- currency Sc,
- modifier Sk,
- math Sm,
- other So.
Separator Z:
- line Zl,
- paragraph Zp,
- space Zs.
Other C:
- control Cc,
- format Cf,
- not assigned Cn,
- private use Co,
- surrogate Cs.

So, e.g. if we need lowercase letters, we can write \p{Ll}; for punctuation signs, \p{P}; and so on.

There are also other derived categories, like:

Alphabetic (Alpha), includes Letters L, plus letter numbers Nl (e.g. Ⅻ, a character for the Roman numeral 12), plus some other symbols Other_Alphabetic (OAlpha).
Hex_Digit includes hexadecimal digits: 0-9, a-f, A-F.
…And so on.
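
To see how a derived category differs from a main one, this small sketch tests the Roman numeral character against both Letter and Alphabetic. The character is written as the escape \u216B to avoid copy-paste problems; the variable names are only for illustration.

```javascript
const roman12 = '\u216B'; // Ⅻ, a "letter number" (Nl)

const isLetter = /\p{L}/u.test(roman12);              // not in Letter
const isAlphabetic = /\p{Alphabetic}/u.test(roman12); // but in Alphabetic
const isAlphaAlias = /\p{Alpha}/u.test(roman12);      // the alias works the same way

console.log(isLetter, isAlphabetic, isAlphaAlias); // false true true
```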

Unicode supports many different properties, and their full list would require a lot of space, so here are the references:

List all properties by a character: https://unicode.org/cldr/utility/character.jsp.
List all characters by a property: https://unicode.org/cldr/utility/list-unicodeset.jsp.
Short aliases for properties: https://www.unicode.org/Public/UCD/latest/ucd/PropertyValueAliases.txt.
A full base of Unicode characters in text format, with all properties, is here: https://www.unicode.org/Public/UCD/latest/ucd/.

Example: hexadecimal numbers

For instance, let’s look for hexadecimal numbers, written as xFF, where F is a hex digit (0…9 or A…F).

A hex digit can be denoted as \p{Hex_Digit}:

let regexp = /x\p{Hex_Digit}\p{Hex_Digit}/u;

alert("number: xAF".match(regexp)); // xAF
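
A slightly extended sketch: with the g flag the same pattern collects every match, and anything that is not a hex digit (such as Z) is skipped. One detail worth knowing is that Hex_Digit also covers fullwidth forms such as Ａ (U+FF21); if only plain ASCII digits and letters should match, the ASCII_Hex_Digit property can be used instead. The sample strings are made up for illustration.

```javascript
const hexByte = /x\p{Hex_Digit}\p{Hex_Digit}/gu;

const found = "bytes: xAF, x0c, xZZ".match(hexByte);
console.log(found); // [ 'xAF', 'x0c' ]

// fullwidth "A" is a Hex_Digit, but not an ASCII_Hex_Digit
const fullwidthA = '\uFF21';
const inHexDigit = /\p{Hex_Digit}/u.test(fullwidthA);
const inAsciiHexDigit = /\p{ASCII_Hex_Digit}/u.test(fullwidthA);

console.log(inHexDigit, inAsciiHexDigit); // true false
```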

Example: Chinese hieroglyphs

Let’s look for Chinese hieroglyphs.

There’s a Unicode property Script (a writing system) that may have a value: Cyrillic, Greek, Arabic, Han (Chinese) and so on; here’s the full list.

To look for characters in a given writing system we should use Script=<value>, e.g. for Cyrillic letters: \p{sc=Cyrillic}, for Chinese hieroglyphs: \p{sc=Han}, and so on:

let regexp = /\p{sc=Han}/gu; // returns Chinese hieroglyphs

let str = `Hello Привет 你好 123_456`;

alert( str.match(regexp) ); // 你,好
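
The same string can be searched for another writing system just by changing the value. The sketch below also shows that the long form Script= works the same way as the short sc=; the variable names are illustrative.

```javascript
const greeting = `Hello Привет 你好 123_456`;

// short form of the property name
const cyrillic = greeting.match(/\p{sc=Cyrillic}/gu);

// long form of the property name
const han = greeting.match(/\p{Script=Han}/gu);

console.log(cyrillic.join('')); // Привет
console.log(han.join(''));      // 你好
```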

Example: currency

Characters that denote a currency, such as $, €, ¥, have the Unicode property \p{Currency_Symbol}, with the short alias \p{Sc}.

Let’s use it to look for prices in the format “currency, followed by a digit”:

let regexp = /\p{Sc}\d/gu;

let str = `Prices: $2, €1, ¥9`;

alert( str.match(regexp) ); // $2,€1,¥9
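
The property is not limited to the three symbols above. As a rough sketch with a made-up price list, other currency signs match too, while a plain punctuation sign like # or a currency written in letters does not:

```javascript
const priceRegexp = /\p{Sc}\d/gu;

const priceList = `Other prices: ₹7, ₽3, #4, 5 USD`;
const prices = priceList.match(priceRegexp);

console.log(prices); // [ '₹7', '₽3' ]
```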

Later, in the article Quantifiers +, *, ? and {n} we’ll see how to look for numbers that contain many digits.

Summary

Flag u enables the support of Unicode in regular expressions.

That means two things:

Characters of 4 bytes are handled correctly: as a single character, not two 2-byte characters.
Unicode properties can be used in the search: \p{…}.

With Unicode properties we can look for words in given languages, special characters (quotes, currencies) and so on.
