Commit Graph

31 Commits

Author SHA1 Message Date
Steven G. Johnson
8da37e2892 silence MSVC warning about conversion to uint8 (fix #86) 2016-11-30 10:09:18 -05:00
Michael Drake
70bbed8626 Tlsa/ucs4 normalize (#88)
* Split codepoint sequence normalisation out into separate function.

This creates utf8proc_normalize_utf32() which takes and returns
a UTF-32 string, applying the following options:

- UTF8PROC_NLF2LS
- UTF8PROC_NLF2PS
- UTF8PROC_NLF2LF
- UTF8PROC_STRIPCC
- UTF8PROC_COMPOSE
- UTF8PROC_STABLE

The utf8proc_reencode() function has been updated to call the
new utf8proc_normalize_utf32().

* Update code documentation: utf8proc_reencode handles UTF8PROC_CHARBOUND.
2016-11-21 09:22:39 -05:00
Keno Fischer
289ce5e041 Fix incorrect use of lbc instead of lbc_override (#77) 2016-07-13 12:33:50 -04:00
Keno Fischer
c0a1ff81fc Walk back ABI breaking changes (#76) 2016-07-13 10:41:13 -04:00
Benito van der Zander
eeebf70bcf Smaller tables (#68)
* convert sequences to utf-16 (saves 25kb)

* store sequence length in properties instead of using -1 termination (saves 10kb)

* cache index for slightly faster data creation

* store lower/upper/title mapping in sequence array (saves 25kb). Add utf8proc_totitle, as title_mapping cannot be used to get the title codepoint anymore. Rename xxx_mapping to xxx_seqindex, so programs assuming a value with the old meaning fail at compile time

* change combination array data type to uint16 (saves 40kb)

* merge 1st and 2nd comb index (saves 50kb)

* kill empty prefix/suffix in combination array (saves 50kb)

* there was no need to have a separate combination start array, it can be merged in a single array

* some fixes

* mark the table as const again

* and regen
2016-07-12 11:51:50 -04:00
Keno Fischer
41c6b23aab Unicode 9 updates (#70)
* Updates for Unicode 9.0.0 TR29 Changes

- New rules GB10/(12/13) are used to combine emoji-zwj sequences/
  (force grapheme breaks every two RI codepoints). Unfortunately this
  breaks statelessness of grapheme-boundary determination. Deal with
  this by ignoring the problem in utf8proc_grapheme_break, and by
  hacking in a special case in decompose

- ZWJ moved to its own boundclass, update what is now GB9 accordingly.

- Add comments to indicate which rule a given case implements

- The number of bound classes now exceeds 4 bits; expand to 8 and
  reorganize fields

* Import Unicode 9 data

* Update Grapheme break API to expose state override

* Bump MAJOR version
2016-06-28 16:04:25 -04:00
Michaël Meyer
1f17487aa9 Fix overrun 2016-02-04 04:06:28 +01:00
Michaël Meyer
26436c9775 Reduce the size of the binary.
Use integers instead of pointers in Unicode tables. Saves 226 kb / 716 kb in the
compiled library.
2015-12-09 19:55:48 +01:00
Federico G. Schwindt
4fc2d8234d Silence warning with -Wextra
Fixes #60.
2015-11-24 20:09:10 +00:00
Steven G. Johnson
fd20b184dd update copyright statements to list recent contributors and year 2015-11-01 08:34:01 -05:00
Peter Colberg
09360de186 Do not export internal unsafe_encode_char() 2015-10-29 00:45:39 -04:00
Steven G. Johnson
a8fb4b1772 add toupper/tolower functions (for JuliaLang/julia#11471) 2015-05-29 22:00:30 -04:00
Scott Paul Jones
6249e6b8b1 Fix #34 handle 66 Unicode non-characters, also improve performance and surrogate handling 2015-05-29 19:50:03 +02:00
Tony Kelman
0a818c7003 Prefix other C99 typedefs with utf8proc_ 2015-04-06 22:36:33 -07:00
Tony Kelman
ad27722923 Use a new typedef utf8proc_ssize_t to avoid define collisions
with MSVC
2015-04-05 20:06:13 -07:00
Steven G. Johnson
a1c429a45b rename DLLEXPORT to UTF8PROC_DLLEXPORT to prevent conflicts with other header files that define DLLEXPORT 2015-03-30 11:05:51 -04:00
Steven G. Johnson
e1fdad0ca9 updated NEWS etc. for 1.2 release 2015-03-28 09:10:00 -04:00
Steven G. Johnson
11d2ece545 indentation consistency 2015-03-27 12:49:16 -04:00
Steven G. Johnson
c851c67888 put the API version as #defines in the header file (as discussed in #30) 2015-03-27 12:35:41 -04:00
Jonas Fonseca
03a4e8854a Fix #26: use doxygen for generating API docs 2015-03-21 21:23:02 -04:00
Steven G. Johnson
3822984606 remove requirement that get_property and decompose_char argument be in range 0x0 to 0x10ffff 2015-03-12 14:17:27 -04:00
Steven G. Johnson
a4c84d2063 fix #2: add charwidth function 2015-03-12 12:10:19 -04:00
Steven G. Johnson
402883c78e rename back to utf8proc now that we are taking over maintenance 2015-03-06 12:43:37 -05:00
Steven G. Johnson
397a1eabea update graphemes for Unicode 7, add utf8proc_grapheme_break function 2014-12-12 16:30:31 -05:00
Steven G. Johnson
1b3992ebe5 utf8proc_version should return a different version string than utf8proc 2014-12-12 14:20:53 -05:00
Steven G. Johnson
df71da45df Merge pull request #17 from JuliaLang/tk/dllexport
RFC: add DLLEXPORT to utf8proc_get_property
2014-09-24 14:26:13 -04:00
Tony Kelman
a840e5dae1 add DLLEXPORT to all functions in mojibake.h 2014-09-22 09:53:55 -07:00
Veres Lajos
83714e458e a few typofixes 2014-08-12 21:30:59 +01:00
Steven G. Johnson
2c4e520a17 utf8proc.h -> mojibake.h (closes #10) 2014-07-18 14:28:17 -04:00
Steven G. Johnson
dd35a8530d C++/MSVC compatibility, indenting, for #4 2014-07-18 11:47:24 -04:00
Steven G. Johnson
ab9520d188 import of utf8proc-v1.1.6 2014-07-15 15:29:52 -04:00