C library for processing UTF-8 and UTF-32 data
Steven G. Johnson 4f70bbe780 Merge pull request #20 from JuliaLang/graphemes
Update graphemes for Unicode 7
2014-12-14 08:47:06 -05:00
bench added GNU libunistring benchmark 2014-07-19 14:55:25 -04:00
.gitignore update graphemes for Unicode 7, add utf8proc_grapheme_break function 2014-12-12 16:30:31 -05:00
.travis.yml Add travis file for testing 2014-08-07 17:04:03 -04:00
data_generator.rb update graphemes for Unicode 7, add utf8proc_grapheme_break function 2014-12-12 16:30:31 -05:00
graphemetest.c update graphemes for Unicode 7, add utf8proc_grapheme_break function 2014-12-12 16:30:31 -05:00
LICENSE.md README updates 2014-12-07 21:29:34 -05:00
lump.txt import of utf8proc-v1.1.6 2014-07-15 15:29:52 -04:00
Makefile update graphemes for Unicode 7, add utf8proc_grapheme_break function 2014-12-12 16:30:31 -05:00
mojibake.h Merge pull request #20 from JuliaLang/graphemes 2014-12-14 08:47:06 -05:00
NEWS.md a few typofixes 2014-08-12 21:30:59 +01:00
normtest.c grapheme test for UAX#29 2014-12-12 16:29:29 -05:00
printproperty.c update graphemes for Unicode 7, add utf8proc_grapheme_break function 2014-12-12 16:30:31 -05:00
README.md README updates 2014-12-07 21:29:34 -05:00
tests.h grapheme test for UAX#29 2014-12-12 16:29:29 -05:00
utf8proc_data.c update graphemes for Unicode 7, add utf8proc_grapheme_break function 2014-12-12 16:30:31 -05:00
utf8proc.c update graphemes for Unicode 7, add utf8proc_grapheme_break function 2014-12-12 16:30:31 -05:00

libmojibake

libmojibake is a development fork of the utf8proc library from Jan Behrens and the rest of the Public Software Group, who deserve nearly all of the credit for this package: a small, clean C library that provides Unicode normalization, case-folding, and other operations for data in the UTF-8 encoding. The main difference from utf8proc is that the Unicode support in libmojibake is more up-to-date (Unicode 7 vs. Unicode 5).

The reason for this fork is that utf8proc is used for basic Unicode support in the Julia language and the Julia developers wanted Unicode 7 support and other features, but the Public Software Group is currently occupied with other projects. As we implement and test new features in libmojibake, we are contributing patches back to utf8proc with the hope that they can be merged upstream.

(The original utf8proc package also includes Ruby and PostgreSQL plug-ins. We removed those from libmojibake in order to focus exclusively on the C library for the time being. We will strive to keep API changes to a minimum, so libmojibake should still be usable with the old plug-in code.)

Like utf8proc, the libmojibake package is licensed under the free/open-source MIT "expat" license (plus certain Unicode data governed by the similarly permissive Unicode data license); please see the included LICENSE.md file for more detailed information.

Quick Start

To compile the C library, run make.

General Information

The C library is found in this directory after successful compilation and is named libmojibake.a (for the static library) and libmojibake.so (for the dynamic library).

The Unicode version being supported is 7.0.0. (Grapheme segmentation is currently based on version 4.1.0 of Unicode Standard Annex #29, but we hope to update this soon.)

For Unicode normalizations, the following options are used:

  • Normalization Form C: STABLE, COMPOSE
  • Normalization Form D: STABLE, DECOMPOSE
  • Normalization Form KC: STABLE, COMPOSE, COMPAT
  • Normalization Form KD: STABLE, DECOMPOSE, COMPAT
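The mapping from normalization form to option flags listed above can be sketched as a small lookup helper. This is only an illustration: `normalization_options` and the `OPT_*` names are hypothetical stand-ins so that the snippet compiles without the library, and their bit values are illustrative rather than copied from the header. Real code would include the library header and use the `UTF8PROC_STABLE`, `UTF8PROC_COMPAT`, `UTF8PROC_COMPOSE`, and `UTF8PROC_DECOMPOSE` constants instead.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in flags (hypothetical names, illustrative values) so this sketch
 * compiles on its own. With the library available, use the UTF8PROC_*
 * constants from its header instead of these. */
enum {
    OPT_STABLE    = 1 << 0,
    OPT_COMPAT    = 1 << 1,
    OPT_COMPOSE   = 1 << 2,
    OPT_DECOMPOSE = 1 << 3
};

/* Return the option mask for a normalization form name
 * ("NFC", "NFD", "NFKC", "NFKD"), or -1 if the name is not recognized. */
int normalization_options(const char *form)
{
    if (form == NULL) {
        return -1;
    }
    if (strcmp(form, "NFC") == 0) {
        return OPT_STABLE | OPT_COMPOSE;
    }
    if (strcmp(form, "NFD") == 0) {
        return OPT_STABLE | OPT_DECOMPOSE;
    }
    if (strcmp(form, "NFKC") == 0) {
        return OPT_STABLE | OPT_COMPOSE | OPT_COMPAT;
    }
    if (strcmp(form, "NFKD") == 0) {
        return OPT_STABLE | OPT_DECOMPOSE | OPT_COMPAT;
    }
    return -1;
}
```

With the real constants substituted, the resulting mask is what a caller would pass as the options argument of utf8proc_map. Note that the composing and decomposing forms differ only in one flag, and the compatibility forms (KC, KD) simply add COMPAT to their canonical counterparts.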

C Library

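The documentation for the C library is found in the library's header file (mojibake.h in this repository; utf8proc.h in upstream utf8proc). utf8proc_map, which allocates the output buffer for you, is the function you will most likely use for mapping UTF-8 strings, unless you want to allocate memory yourself.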

To Do

See the GitHub issues list.

Contact

Bug reports, feature requests, and other queries can be filed at the libmojibake issues page on GitHub.