31 Commits

Author SHA1 Message Date
Steven G. Johnson
5622a0a51b
add islower/isupper functions (#196)
* add islower/isupper functions

* added test

* more tests + bugfix

* Makefile fix

* rm iscase test on make clean
2020-08-25 16:42:59 -04:00
Steven G. Johnson
0890a538bf new emoji-data.txt location (fixes #181) 2020-03-27 20:36:18 -04:00
Steven G. Johnson
0ff48bfbfd update 2020-03-27 18:38:44 -04:00
Steven G. Johnson
1ee551c85b whoops, generated from old tables 2020-03-27 18:35:20 -04:00
Steven G. Johnson
b48f5d074f
Unicode 13 support (#179)
* exclude Sk from zero-width chars (closes #167)

* update for Unicode 13
2020-03-27 17:06:06 -04:00
GOTOH Shunsuke
7b28b9e60c update for unicode 12.1 (#156) 2019-05-10 21:12:45 -04:00
Steven G. Johnson
fd4d8a3454
give up on Unifont for charwidth data (#150)
* fix CHARBOUND option for non-characters

* give up on unifont for charwidth computation
2019-03-30 16:05:50 -04:00
Steven G. Johnson
e76cebb784
update for unicode 12 (#148) 2019-03-30 13:46:01 -04:00
Steven G. Johnson
d4a58cfec5
update data and algorithms for Unicode 11 (#140) 2018-07-24 13:18:48 -04:00
Steven G. Johnson
02f4e1890c
charwidth=1 for soft hyphen and unassigned codepoints (#135)
* use width=1 for soft hyphen and for unassigned/PUA codepoints

* don't count unassigned codepoints when comparing with system wcwidth

* more tests

* indentation fixes

* NEWS for 135

* remove special-casing for arabic control characters affecting a span of numbers, which are sometimes zero-width and sometimes not

* regenerate
2018-07-24 10:45:02 -04:00
Steven G. Johnson
d81308faba
uppercase mapping ß (U+00df) to ẞ (U+1E9E) (#134)
* uppercase(0x00df) = 0x1e9e

* tests for titlecase and u+00df uppercase

* NEWS, another test
2018-05-02 14:18:26 -04:00
Steven G. Johnson
bdc8b9e4b2
Case folding fixes (#133)
* Fixes allowing for “Full” folding and NFKC_CaseFold compliance.

* Only include C (Common) and F (Full) foldings from CaseFolding.txt. Removed S (Simple) since F & S are specified to be exclusive.
* Extend UTF8PROC_IGNORE to also ignore unassigned codepoints (such as \u2065) which are specified as being discarded by NFKC_CF.

* Document the changes to UTF8PROC_IGNORE in header.

* Add NFKC_CF helper function with documentation.

* restore old IGNORE behavior, add UTF8PROC_STRIPNA, rename to utf8proc_NFKC_Casefold, add a test

* success message

* test that IGNORE does not strip NA

* data update

* NFKC_Casefold shouldn't strip NA
2018-05-02 08:15:02 -04:00
Steven G. Johnson
d736adeff1
update to unicode 10 (#132) 2018-04-27 12:50:19 -04:00
Paul Smith
95fc75b839 Ensure generated const data tables are hidden via "static" (#100) 2017-02-19 17:33:25 -05:00
Steven G. Johnson
15e1819cdd update to unifont 9.0.04 2016-12-11 16:35:27 -05:00
Steven G. Johnson
8da37e2892 silence MSVC warning about conversion to uint8 (fix #86) 2016-11-30 10:09:18 -05:00
Steven G. Johnson
c02ebd5a83 update to Unifont 9 (for Unicode 9 charwidths) (#75) 2016-07-12 16:30:05 -04:00
Benito van der Zander
eeebf70bcf Smaller tables (#68)
* convert sequences to utf-16 (saves 25kb)

* store sequence length in properties instead of using -1 termination (saves 10kb)

* cache index for slightly faster data creation

* store lower/upper/title mapping in sequence array (saves 25kb). Add utf8proc_totitle, as title_mapping cannot be used to get the title codepoint anymore. Rename xxx_mapping to xxx_seqindex, so programs assuming a value with the old meaning fail at compile time

* change combination array data type to uint16 (saves 40kb)

* merge 1st and 2nd comb index (saves 50kb)

* kill empty prefix/suffix in combination array (saves 50kb)

* there was no need to have a separate combination start array, it can be merged in a single array

* some fixes

* mark the table as const again

* and regen
2016-07-12 11:51:50 -04:00
Keno Fischer
41c6b23aab Unicode 9 updates (#70)
* Updates for Unicode 9.0.0 TR29 Changes

- New rules GB10/(12/13) are used to combine emoji-zwj sequences/
  (force grapheme breaks every two RI codepoints). Unfortunately this
  breaks statelessness of grapheme-boundary determination. Deal with
  this by ignoring the problem in utf8proc_grapheme_break, and by
  hacking in a special case in decompose

- ZWJ moved to its own boundclass, update what is now GB9 accordingly.

- Add comments to indicate which rule a given case implements

- The number of bound classes now exceeds 4 bits, expand to 8 and
  reorganize fields

* Import Unicode 9 data

* Update Grapheme break API to expose state override

* Bump MAJOR version
2016-06-28 16:04:25 -04:00
Michaël Meyer
26436c9775 Reduce the size of the binary.
Use integers instead of pointers in Unicode tables. Saves 226 kb / 716 kb in the
compiled library.
2015-12-09 19:55:48 +01:00
Peter Colberg
9b7184ec56 Update Unicode data
Fixes Travis builds on Ubuntu 12.04 LTS with Ruby 1.9.3-p551.
2015-10-29 19:41:16 -04:00
Jiahao Chen
cfa7c96003 Update Unicode data 2015-06-29 16:43:07 -04:00
Jiahao Chen (陈家豪)
1cc58b2bc9 Updated Unicode 8 data - now sorted internally by data generator 2015-06-26 12:12:13 -04:00
Jiahao Chen
b14ca2be57 Update Unicode data 2015-06-26 12:01:27 -04:00
Steven G. Johnson
6a7f92da64 fix #46 (make sure symbol-like codepoints have nonzero width even if they aren't in Unifont) 2015-06-24 14:07:15 -04:00
Jiahao Chen
92bc19fbe0 Updated data file to Unicode 8.0.0 2015-06-23 16:18:35 -04:00
Tony Kelman
0a818c7003 Prefix other C99 typedefs with utf8proc_ 2015-04-06 22:36:33 -07:00
Steven G. Johnson
a4c84d2063 fix #2: add charwidth function 2015-03-12 12:10:19 -04:00
Steven G. Johnson
397a1eabea update graphemes for Unicode 7, add utf8proc_grapheme_break function 2014-12-12 16:30:31 -05:00
Jiahao Chen
b81326e82f Update utf8proc_data.c (generated by data_generator.rb) 2014-07-18 10:46:11 -04:00
Steven G. Johnson
ab9520d188 import of utf8proc-v1.1.6 2014-07-15 15:29:52 -04:00