Announcement of Unicode version 14


The venerable Unicode standard is undergoing an update. We report on the news and go behind the scenes with a brief overview of the philosophy and practical use of the standard.

Most people only stop to think about Unicode when new emoji characters are introduced. However, the primary purpose of the Unicode standard is not just to provide expressive characters for fun in mobile applications; it also facilitates communication in all human-readable languages and supports science and research with its scientific symbols and the scripts of ancient languages.

In the words of the Unicode Consortium:

The Unicode standard is the basis of all modern software and communications in the world, including all modern operating systems, browsers, laptops and smartphones, as well as the Internet and the Web (URL, HTML, XML, CSS, JSON, etc.).

That said, Unicode v14 added 838 characters, including five new scripts and 37 new emoji characters.

The scripts are:

  • Toto, used to write the Toto language in North East India
  • Cypro-Minoan, an undeciphered historical script mainly used on the island of Cyprus and its surroundings during the late Bronze Age (circa 1550-1050 BCE)
  • Vithkuqi, a historical script used to write Albanian and now undergoing a modern revival
  • Old Uyghur, a historical script used in Central Asia and elsewhere to write Turkic, Chinese, Mongolian, Tibetan, and Arabic languages
  • Tangsa, a modern script used to write the Tangsa language, which is spoken in India and Myanmar

This shows that Unicode is not only useful for communication in the modern world, but is also the guardian that protects the memory of niche or extinct cultures.

To elaborate, a Unicode script is, technically (according to Wikipedia):

A set of letters and other written signs used to represent textual information in one or more writing systems. Some scripts support one and only one writing system and one language, for example Armenian.

Other scripts support many different writing systems; for example, the Latin script supports English, French, German, Italian, Vietnamese, Latin itself, and several other languages.

In regular expressions, you will usually find scripts denoted with \p{..}, as in \p{Latin}.
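To make the idea of a script concrete, here is a minimal Python sketch. Python's built-in re module does not accept script properties in \p{..} form (the third-party regex module does), so, purely as an assumption for illustration, the sketch uses the first word of each character's official Unicode name as a rough stand-in for its script. The helper name rough_script is hypothetical, and the shortcut is no substitute for the real Script property.

```python
import unicodedata


def rough_script(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name.

    Heuristic only: the authoritative Script property is defined in the
    Unicode Character Database, which the standard library does not expose.
    """
    name = unicodedata.name(ch, "")  # unnamed characters (e.g. controls) give ""
    return name.split(" ")[0] if name else "UNKNOWN"


# 'a' and 'é' both belong to the Latin script; 'ध' belongs to Devanagari.
for ch in ("a", "\u00e9", "\u0927"):
    print(f"U+{ord(ch):04X} {rough_script(ch)}")
```

Note that the heuristic breaks down for characters such as digits and punctuation, whose names do not start with a script name.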

As for the fun aspect, v14 also added the following 37 emoji characters:

  • Melting face
  • Face with open eyes and hand over mouth
  • Face with peeking eye
  • Saluting face
  • Dotted line face
  • Face with diagonal mouth
  • Face holding back tears
  • Rightwards hand
  • Leftwards hand
  • Palm down hand
  • Palm up hand
  • Hand with index finger and thumb crossed
  • Index pointing at the viewer
  • Heart Hands
  • Biting lip
  • Person with crown
  • Pregnant man
  • Pregnant person
  • Troll
  • Coral
  • Lotus
  • Empty nest
  • Nest with eggs
  • Beans
  • Pouring liquid
  • Jar
  • Playground slide
  • Wheel
  • Ring Buoy
  • Hamsa
  • Mirror ball
  • Low battery
  • Crutch
  • X-ray
  • Bubbles
  • ID card
  • Heavy equals sign

At I Programmer, we have covered the emoji world extensively. See Emoji Subcommittee Reopens Submission Process and World Emoji Day Picks Syringe To Sum Up 2021 for the latest.

Other minor additions have also found their way in, including:

  • Many Latin additions for an extended IPA
  • Arabic script additions used to write languages across Africa and in Iran, Pakistan, Malaysia, Indonesia, Java and Bosnia, and to write honorifics, plus additions for Quranic use
  • Character additions to support languages of North America, the Philippines, India and Mongolia

That is all well and good, but to get your hands on the new characters, you'll have to wait until your favorite apps and fonts are upgraded to support the new standard. The same delay applies to programming language support. Perl is still the fastest to adopt the latest Unicode standards. For example, Unicode 10 support came with Perl 5.28 in 2018, while Perl 5.32.0 came with Unicode 13. The latest version of Perl is 5.34.0, released in May 2021, and as such it doesn't incorporate the latest standard, but I guess the next one will.
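The same question, which Unicode version does my language runtime actually know about, can be answered programmatically. As a hedged aside for readers who work in Python rather than Perl, the standard unicodedata module reports the version of the Unicode Character Database it was built with. The helper names unicode_version and supports are hypothetical, chosen for this sketch.

```python
import unicodedata


def unicode_version() -> tuple:
    """Return the runtime's Unicode database version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


def supports(major: int, minor: int = 0) -> bool:
    """True if the runtime's Unicode data is at least version major.minor."""
    return unicode_version()[:2] >= (major, minor)


print("Unicode data version:", unicodedata.unidata_version)
print("Unicode 14 available:", supports(14))
```

The output depends on the interpreter in use, which is exactly the adoption delay the paragraph above describes.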

And what can you do with scripts programmatically? Use them to manipulate text, as in regular expressions. This is described in Advanced Perl Regular Expressions – Extended Constructs, where I have a file:

myimageऄwithधDevanagariमcharsफ.png

in which Hindi (Devanagari) characters are mixed with Latin ones. The file is to be distributed across multiple platforms and operating systems, which might not be Unicode-compatible. Thus, its file name should be portable and compatible with the file systems of different operating systems.

What's the best way to do this? Rename the file so that it contains only characters from the universally recognized ASCII character set, which means we need to strip all non-ASCII characters from it. But in order to do this, we must first introduce blocks in addition to scripts. According to perlunicode:

Unicode also defines blocks of characters. The difference between scripts and blocks is that the concept of scripts is closer to natural languages, while the concept of blocks is more of an artificial grouping based on groups of Unicode characters with consecutive ordinal values. For example, the “Basic Latin” block is the set of characters whose ordinals are between 0 and 127 inclusive; in other words, the ASCII characters. The “Latin” script contains letters from this as well as several other blocks, like “Latin-1 Supplement”, “Latin Extended-A”, etc., but it does not contain all the characters from those blocks.
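The distinction between a block and a script can be illustrated with a small Python sketch. Since a block is just a range of consecutive ordinals, membership in Basic Latin reduces to a numeric comparison; the function name in_basic_latin is hypothetical.

```python
def in_basic_latin(ch: str) -> bool:
    """True if ch lies in the Basic Latin block, U+0000 to U+007F (ASCII)."""
    return ord(ch) <= 0x7F


# 'e' and 'é' are both letters of the Latin script, but only 'e' is in the
# Basic Latin block; 'é' (U+00E9) sits in the Latin-1 Supplement block.
# '7' is in the Basic Latin block without being a Latin-script letter.
for ch in ("e", "\u00e9", "7"):
    print(f"U+{ord(ch):04X} in Basic Latin: {in_basic_latin(ch)}")
```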

Armed with this knowledge, we can proceed to resolve the portability issue. There is the [[:ascii:]] POSIX class and/or the Unicode block \p{InBasicLatin}, both of which match all ASCII characters, so by negating them, with [^[:ascii:]] or \P{InBasicLatin}, we arrive at all non-ASCII characters. As with everything in Perl, TMTOWTDI (there's more than one way to do it), and this example can serve as a basis for more elaborate use cases later.

But what do we really mean by ASCII?

We mean characters with ordinal values below 128 (in other words, those of US English only), so we have to remove those above 127, which leads us to the condition “remove all characters whose ordinal value is > 127” to be used in constructing the regular expression.

For the solution, see the rest of the article, but the point is, the Unicode standard organizes concepts into concrete blocks so that you can work with them intuitively.
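The condition "remove all characters whose ordinal value is > 127" can be sketched in a few lines of Python. This is not the Perl solution from the referenced article, only an illustration of the same idea under stated assumptions: the function name to_ascii_filename is hypothetical, and the file name is an illustrative one modeled on the example in the text.

```python
import re

# Any character outside the ASCII range U+0000 to U+007F.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def to_ascii_filename(name: str) -> str:
    """Drop every character whose ordinal value is greater than 127."""
    return NON_ASCII.sub("", name)


# Latin text interleaved with four Devanagari letters (written as escapes).
original = "myimage\u0904with\u0927Devanagari\u092Echars\u092B.png"
print(to_ascii_filename(original))  # myimagewithDevanagarichars.png

# There is more than one way to do it: str.isascii gives the same result.
assert "".join(c for c in original if c.isascii()) == to_ascii_filename(original)
```

Simply deleting characters can produce colliding or empty names, so a real renaming tool would need a policy for those cases.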

All information about scripts, blocks and the rest can be found in the standard's clear documentation, and you can find all the new emoji additions at Recently Added Emoji.

More information

Announcement of the Unicode® Standard, version 14.0

Related Articles

Advanced Perl Regular Expressions – Extended Constructs

Advanced Perl Regular Expressions – The Pattern Code Expression

Query Unicode from the command line

Getting to grips with regular expressions

Automatically generate regular expressions with genetic programming
