Unicode resources

Nov 2010 RFC 5895: Mapping Characters for Internationalized Domain Names in Applications (IDNA) 2008

In the original version of the Internationalized Domain Names in Applications (IDNA) protocol, any Unicode code points taken from user input were mapped into a set of Unicode code points that "made sense", and then encoded and passed to the domain name system (DNS). The IDNA2008 protocol (described in RFCs 5890, 5891, 5892, and 5893) presumes that the input to the protocol comes from a set of "permitted" code points, which it then encodes and passes to the DNS, but does not specify what to do with the result of user input. This document describes the actions that can be taken by an implementation between receiving user input and passing permitted code points to the new IDNA protocol.
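The mapping stage described above can be sketched in a few lines of Python. This is a minimal illustration, not a conforming implementation: it assumes the four mapping steps listed in RFC 5895 (lowercasing, replacing fullwidth and halfwidth characters with their decomposition mappings, NFC normalization, and treating U+3002 IDEOGRAPHIC FULL STOP as a label separator). The function names are hypothetical, and Python's built-in `punycode` codec stands in for the IDNA2008 protocol itself, which additionally checks that every code point is permitted.

```python
import unicodedata


def map_width(ch: str) -> str:
    """Replace a fullwidth/halfwidth character with its decomposition mapping."""
    decomp = unicodedata.decomposition(ch)  # e.g. "<wide> 0062" for U+FF42
    if decomp.startswith(("<wide>", "<narrow>")):
        return "".join(chr(int(cp, 16)) for cp in decomp.split()[1:])
    return ch


def map_label(label: str) -> str:
    """Apply the per-label mapping steps (hypothetical helper)."""
    lowered = label.lower()                                # step 1: case mapping
    unwidened = "".join(map_width(ch) for ch in lowered)   # step 2: width mapping
    return unicodedata.normalize("NFC", unwidened)         # step 3: NFC


def to_ascii_label(label: str) -> str:
    """Stand-in for the IDNA2008 encoding step; performs no validity checks."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def map_domain(user_input: str) -> str:
    """Split user input into labels, map each one, then encode it."""
    # Step 4: treat IDEOGRAPHIC FULL STOP as a label separator before splitting.
    labels = user_input.replace("\u3002", ".").split(".")
    return ".".join(to_ascii_label(map_label(label)) for label in labels)


print(map_domain("\uff22\u00fccher\u3002Example"))  # xn--bcher-kva.example
```

The input in the usage line combines a fullwidth "B", a precomposed "ü", and an ideographic full stop, so each mapping step changes something before the label is encoded.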

13 Sep 2022 Unicode 15.0.0

Unicode 15.0 adds 4,489 characters, for a total of 149,186 characters. These additions include 2 new scripts, for a total of 161 scripts, along with 20 new emoji characters, and 4,193 CJK (Chinese, Japanese, and Korean) ideographs.

The new scripts and characters in Version 15.0 add support for lesser-used languages and unique writing requirements worldwide, including numerous symbol additions. Funds from the Adopt-a-Character program provided support for some of these additions. The new scripts and characters include:

  • Nag Mundari, a modern script used to write Mundari, a language spoken in India
  • A Kannada character used to write Konkani, Awadhi, and Havyaka Kannada in India
  • Kaktovik numerals, devised by speakers of Iñupiaq in Kaktovik, Alaska for the counting systems of the Inuit and Yupik languages

Popular symbol additions:

  • 20 emoji characters, including hair pick, maracas, jellyfish, khanda, and pink heart. For complete statistics regarding all emoji as of Unicode 15.0, see Emoji Counts. For more information about emoji additions in version 15.0, including new emoji ZWJ sequences and emoji modifier sequences, see Emoji Recently Added, v15.0.

Other symbol and notational additions include:

  • The nine pointed white star symbol, used by members of the Bahá’í faith
  • Eight symbols for celestial bodies, used by astronomers and astrologers
  • Twenty-nine additional Egyptian hieroglyph format controls, which will enable Egyptologists to better represent texts

Support for other languages and scholarly work worldwide includes:

  • Kawi, a historical script found in Southeast Asia, used to write Old Javanese and other languages
  • Three additional characters for the Arabic script to support Quranic marks used in Turkey
  • One new Lao sign used to write Lao Pali
  • Three Khojki characters found in handwritten and printed documents
  • Ten Devanagari characters used to represent auspicious signs found in inscriptions and manuscripts
  • Six Latin letters used in Malayalam transliteration
  • Sixty-three Cyrillic modifier letters used in phonetic transcription
  • One additional Egyptian hieroglyph

Updates to the CJK blocks add:

  • 4,192 ideographs in the new CJK Unified Ideographs Extension H block
  • One ideograph in the CJK Unified Ideographs Extension C block

Support for CJK unified ideographs was enhanced in Version 15.0 by significant corrections and improvements to the Unihan database. Changes to the Unihan database include updated source lists, regular expressions, and new and updated fields. See UAX #38, Unicode Han Database (Unihan) for more information on the updates.

Important chart font updates include:

  • A set of updated glyphs for Egyptian hieroglyphs, in addition to standardized variation sequences to support rotated glyphs found in texts
  • Improved glyphs for Unified Canadian Aboriginal Syllabics, which provide better support for Carrier and other languages
  • A new Wancho font, with improved and simplified shapes

12 Sep 2023 Unicode 15.1.0

Unicode 15.1 adds 627 characters, for a total of 149,813 characters.

There are several significant themes for this release of the Unicode Standard.

  • The repertoire addition consists almost entirely of urgently needed CJK ideographs, synchronized with planned additions to the Chinese national standard, GB 18030. The remaining additions to the repertoire extend the set of ideographic description characters, to better enable description of unusual CJK ideographs.
  • Major updates were made to UAX #9, Unicode Bidirectional Algorithm, UAX #31, Unicode Identifiers and Syntax, and UTS #39, Unicode Security Mechanisms, to coordinate with the publication of an important new Unicode Technical Standard: UTS #55, Unicode Source Code Handling.
  • Segmentation rule changes, most notably:
    1. Support was added to line breaking (UAX #14, Unicode Line Breaking Algorithm) for orthographic syllables in a number of South and Southeast Asian writing systems.
    2. Grapheme cluster breaking (UAX #29, Unicode Text Segmentation) has adopted the aksara cluster behavior for six scripts. That cluster breaking behavior had previously been widely available via CLDR and ICU.
    3. These changes involved significant character property updates.

Supersedes: Unicode 15.0.0

10 Oct 2024 Unicode 16.0.0

Unicode 16.0 adds 5,185 characters, for a total of 154,998 characters. The new additions include seven new scripts:

  • Garay is a modern-use script from West Africa.
  • Gurung Khema, Kirat Rai, Ol Onal and Sunuwar are four modern-use scripts from Northeast India and Nepal.
  • Todhri is an historic script used for Albanian.
  • Tulu-Tigalari is an historic script from Southwest India.

Other character additions include seven new emoji characters plus 3,995 additional Egyptian Hieroglyphs and over 700 symbols from legacy computing environments.

In addition to new characters, new “Moji Jōhō Kiban” (文字情報基盤) Japanese source references have been added for over 36,000 CJK unified ideographs. These are reflected in the code charts for virtually all CJK unified ideograph blocks by additional representative glyphs in the “J” column.

Supersedes: Unicode 15.1.0

09 Sep 2025 Unicode 17.0.0

Unicode 17.0 adds 4,803 characters, for a total of 159,801 characters. The new additions include four new scripts:

  • Sidetic
  • Tolong Siki
  • Beria Erfe
  • Tai Yo

Supersedes: Unicode 16.0.0
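The character counts quoted in the release entries above can be cross-checked against each other: each total should equal the previous total plus the number of characters added. The short sketch below does that arithmetic using only the figures stated in this listing; the `RELEASES` table and `find_mismatches` helper are hypothetical names introduced for illustration.

```python
# (version, characters added, total after release), as quoted in this listing.
RELEASES = [
    ("15.0", 4_489, 149_186),
    ("15.1", 627, 149_813),
    ("16.0", 5_185, 154_998),
    ("17.0", 4_803, 159_801),
]


def find_mismatches(releases):
    """Return versions whose total differs from previous total + added."""
    mismatches = []
    # Pair each release with the one that follows it.
    for (_, _, prev_total), (version, added, total) in zip(releases, releases[1:]):
        if prev_total + added != total:
            mismatches.append(version)
    return mismatches


print(find_mismatches(RELEASES))  # [] means the quoted figures are consistent
```

The first entry serves only as the starting total, since this listing does not quote the total for the release that preceded Unicode 15.0.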

Unicode Consortium

What began in 1988 as a standard for character encoding has grown into a powerful portfolio of open source standards, code, tools, libraries, and products that ensure global language support, interoperability, and resiliency across billions of devices.

Behind what most take for granted on screens today is the Unicode Consortium, the nonprofit, open source, open standards body for the internationalization of software and services. Unicode is embedded in every major operating system and used on more than 20 billion devices worldwide. It may be the most widely deployed technology ever.

11 Nov 2008 Unicode Technical Report #17: Character Encoding Model

This document clarifies a number of the terms used to describe character encodings. It elaborates the Internet Architecture Board (IAB) three-layer “text stream” definitions from RFC 2130 into a four-layer structure more appropriate for explanation of the Unicode Standard.
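As a rough illustration of the layered view, the sketch below traces one character through the levels the report distinguishes: a member of the abstract repertoire, its code point in the coded character set, its code units under two encoding forms (UTF-8 and UTF-16), and its serialized bytes under two encoding schemes (UTF-16BE and UTF-16LE). The `describe` helper and its dictionary keys are hypothetical, chosen only for this example.

```python
import struct


def describe(ch: str) -> dict:
    """Show one character at several levels of the encoding model."""
    utf16_be = ch.encode("utf-16-be")
    return {
        # Coded character set: the abstract character mapped to an integer.
        "code_point": ord(ch),
        # Encoding forms: the code point expressed as fixed-width code units.
        "utf8_units": list(ch.encode("utf-8")),
        "utf16_units": list(struct.unpack(f">{len(utf16_be) // 2}H", utf16_be)),
        # Encoding schemes: code units serialized to bytes in a given order.
        "utf16be_bytes": utf16_be,
        "utf16le_bytes": ch.encode("utf-16-le"),
    }


info = describe("\U00010437")
print(hex(info["code_point"]))                  # 0x10437
print([hex(u) for u in info["utf16_units"]])    # ['0xd801', '0xdc37']
print(info["utf16le_bytes"].hex())              # 01d837dc
```

A supplementary-plane character is used because it makes the levels visibly different: one code point becomes two UTF-16 code units, and those units serialize to different byte sequences depending on byte order.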