C library for processing UTF-8 and UTF-32 data

libutf8proc

The libutf8proc package is a lightly updated fork of the utf8proc library from Jan Behrens and the rest of the Public Software Group, who deserve nearly all of the credit for this package: a small, clean C library that provides Unicode normalization, case-folding, and other operations for data in the UTF-8 encoding.

The reason for this fork is that utf8proc is used for basic Unicode support in the Julia language and the Julia developers wanted Unicode 7 support and other features, but the Public Software Group currently does not seem to have the resources necessary to update utf8proc. We hope that the fork can be merged back into the mainline utf8proc package before too long.

(The original utf8proc package also includes Ruby and PostgreSQL plug-ins. We removed those from libutf8proc in order to focus exclusively on the C library for the time being. We will strive to keep API changes to a minimum, so libutf8proc should still be usable with the old plug-in code.)

Like utf8proc, the libutf8proc package is licensed under the free/open-source MIT "expat" license (plus certain Unicode data governed by the similarly permissive Unicode data license); please see the included LICENSE.md file for more detailed information.

Quick Start

To compile the C library, run make.

General Information

After a successful build, the C library is found in this directory as libutf8proc.a (the static library) and libutf8proc.so (the dynamic library).

The supported Unicode version is 5.0.0. Note: version 4.1.0 of Unicode Standard Annex #29 was used, because version 5.0.0 was not yet available at the time of implementation.

For Unicode normalizations, the following options are used:

  • Normalization Form C: STABLE, COMPOSE
  • Normalization Form D: STABLE, DECOMPOSE
  • Normalization Form KC: STABLE, COMPOSE, COMPAT
  • Normalization Form KD: STABLE, DECOMPOSE, COMPAT
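
As a rough illustration of the list above, the four forms can be written as a small lookup table of option masks. This is only a sketch: the OPT_* flags and the options_for_form helper are hypothetical names with stand-in values, chosen so the example compiles without the library. Real code would combine the UTF8PROC_STABLE, UTF8PROC_COMPOSE, UTF8PROC_DECOMPOSE, and UTF8PROC_COMPAT constants from utf8proc.h instead.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in flags for this sketch only (hypothetical names and values).
   Real code should include utf8proc.h and use its UTF8PROC_STABLE,
   UTF8PROC_COMPAT, UTF8PROC_COMPOSE and UTF8PROC_DECOMPOSE constants. */
enum {
    OPT_STABLE    = 1 << 0,
    OPT_COMPAT    = 1 << 1,
    OPT_COMPOSE   = 1 << 2,
    OPT_DECOMPOSE = 1 << 3
};

/* One row per normalization form: short name plus its option mask. */
struct form_options {
    const char *form;
    int options;
};

static const struct form_options FORM_TABLE[] = {
    { "NFC",  OPT_STABLE | OPT_COMPOSE },
    { "NFD",  OPT_STABLE | OPT_DECOMPOSE },
    { "NFKC", OPT_STABLE | OPT_COMPOSE | OPT_COMPAT },
    { "NFKD", OPT_STABLE | OPT_DECOMPOSE | OPT_COMPAT }
};

/* Returns the option mask for a form name, or -1 if the name is unknown. */
int options_for_form(const char *form)
{
    size_t i;
    for (i = 0; i < sizeof FORM_TABLE / sizeof FORM_TABLE[0]; i++) {
        if (strcmp(FORM_TABLE[i].form, form) == 0)
            return FORM_TABLE[i].options;
    }
    return -1;
}
```

The compatibility forms (KC and KD) differ from C and D only by the extra COMPAT flag, so the table makes that relationship easy to see at a glance.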

C Library

The documentation for the C library is found in the utf8proc.h header file. utf8proc_map is the function you will most likely use for mapping UTF-8 strings, unless you want to allocate memory yourself.

To Do

  • detect stable code points and process segments independently in order to save memory
  • do a quick check before normalizing strings to optimize speed
  • support stream processing

Contact

Bug reports, feature requests, and other queries can be filed at the libutf8proc page on GitHub.