utf8proc

Author	SHA1	Message	Date
Steven G. Johnson	5622a0a51b	add islower/isupper functions (#196) * add islower/isupper functions * added test * more tests + bugfix * Makefile fix * rm iscase test on make clean	2020-08-25 16:42:59 -04:00
Steven G. Johnson	0890a538bf	new emoji-data.txt location (fixes #181)	2020-03-27 20:36:18 -04:00
Steven G. Johnson	0ff48bfbfd	update	2020-03-27 18:38:44 -04:00
Steven G. Johnson	1ee551c85b	whoops, generated from old tables	2020-03-27 18:35:20 -04:00
Steven G. Johnson	b48f5d074f	Unicode 13 support (#179) * exclude Sk from zero-width chars (closes #167) * update for Unicode 13	2020-03-27 17:06:06 -04:00
GOTOH Shunsuke	7b28b9e60c	update for unicode 12.1 (#156)	2019-05-10 21:12:45 -04:00
Steven G. Johnson	fd4d8a3454	give up on Unifont for charwidth data (#150) * fix CHARBOUND option for non-characters * give up on unifont for charwidth computation	2019-03-30 16:05:50 -04:00
Steven G. Johnson	e76cebb784	update for unicode 12 (#148)	2019-03-30 13:46:01 -04:00
Steven G. Johnson	d4a58cfec5	update data and algorithms for Unicode 11 (#140)	2018-07-24 13:18:48 -04:00
Steven G. Johnson	02f4e1890c	charwidth=1 for soft hyphen and unassigned codepoints (#135) * use width=1 for soft hyphen and for unassigned/PUA codepoints * don't count unassigned codepoints when comparing with system wcwidth * more tests * indentation fixes * NEWS for 135 * remove special-casing for arabic control characters affecting a span of numbers, which are sometimes zero-width and sometimes not * regenerate	2018-07-24 10:45:02 -04:00
Steven G. Johnson	d81308faba	uppercase mapping ß (U+00df) to ẞ (U+1E9E) (#134) * uppercase(0x00df) = 0x1e9e * tests for titlecase and u+00df uppercase * NEWS, another test	2018-05-02 14:18:26 -04:00
Steven G. Johnson	bdc8b9e4b2	Case folding fixes (#133) * Fixes allowing for “Full” folding and NFKC_CaseFold compliance. * Only include C (Common) and F (Full) foldings from CaseFolding.txt. Removed S (Simple) since F & S are specified to be exclusive. * Extend UTF8PROC_IGNORE to also ignore unassigned codepoints (such as \u2065) which are specified as being discarded by NFKC_CF. * Document the changes to UTF8PROC_IGNORE in header. * Add NFKC_CF helper function with documentation. * restore old IGNORE behavior, add UTF8PROC_STRIPNA, rename to utf8proc_NFKC_Casefold, add a test * success message * test that IGNORE does not strip NA * data update * NFKC_Casefold shouldn't strip NA	2018-05-02 08:15:02 -04:00
Steven G. Johnson	d736adeff1	update to unicode 10 (#132)	2018-04-27 12:50:19 -04:00
Paul Smith	95fc75b839	Ensure generated const data tables are hidden via "static" (#100)	2017-02-19 17:33:25 -05:00
Steven G. Johnson	15e1819cdd	update to unifont 9.0.04	2016-12-11 16:35:27 -05:00
Steven G. Johnson	8da37e2892	silence MSVC warning about conversion to uint8 (fix #86)	2016-11-30 10:09:18 -05:00
Steven G. Johnson	c02ebd5a83	update to Unifont 9 (for Unicode 9 charwidths) (#75)	2016-07-12 16:30:05 -04:00
Benito van der Zander	eeebf70bcf	Smaller tables (#68) * convert sequences to utf-16 (saves 25kb) * store sequence length in properties instead using -1 termination (saves 10kb) * cache index for slightly faster data creation * store lower/upper/title mapping in sequence array (saves 25kb). Add utf8proc_totitle, as title_mapping cannot be used to get the title codepoint anymore. Rename xxx_mapping to xxx_seqindex, so programs assuming a value with the old meaning fail at compile time * change combination array data type to uint16 (saves 40kb) * merge 1st and 2nd comb index (saves 50kb) * kill empty prefix/suffix in combination array (saves 50kb) * there was no need to have a separate combination start array, it can be merged in a single array * some fixes * mark the table as const again * and regen	2016-07-12 11:51:50 -04:00
Keno Fischer	41c6b23aab	Unicode 9 updates (#70) * Updates for Unicode 9.0.0 TR29 Changes - New rules GB10/(12/13) are used to combine emoji-zwj sequences/ (force grapheme breaks every two RI codepoints). Unfortunately this breaks statelessness of grapheme-boundary determination. Deal with this by ignoring the problem in utf8proc_grapheme_break, and by hacking in a special case in decompose - ZWJ moved to its own boundclass, update what is now GB9 accordingly. - Add comments to indicate which rule a given case implements - The Number of bound classes Now exceeds 4 bits, expand to 8 and reorganize fields * Import Unicode 9 data * Update Grapheme break API to expose state override * Bump MAJOR version	2016-06-28 16:04:25 -04:00
Michaël Meyer	26436c9775	Reduce the size of the binary. Use integers instead of pointers in Unicode tables. Saves 226 kb / 716 kb in the compiled library.	2015-12-09 19:55:48 +01:00
Peter Colberg	9b7184ec56	Update Unicode data Fixes Travis builds on Ubuntu 12.04 LTS with Ruby 1.9.3-p551.	2015-10-29 19:41:16 -04:00
Jiahao Chen	cfa7c96003	Update Unicode data	2015-06-29 16:43:07 -04:00
Jiahao Chen (陈家豪)	1cc58b2bc9	Updated Unicode 8 data - now sorted internally by data generator	2015-06-26 12:12:13 -04:00
Jiahao Chen	b14ca2be57	Update Unicode data	2015-06-26 12:01:27 -04:00
Steven G. Johnson	6a7f92da64	fix #46 (make sure symbol-like codepoints have nonzero width even if they aren't in Unifont)	2015-06-24 14:07:15 -04:00
Jiahao Chen	92bc19fbe0	Updated data file to Unicode 8.0.0	2015-06-23 16:18:35 -04:00
Tony Kelman	0a818c7003	Prefix other C99 typedefs with utf8proc_	2015-04-06 22:36:33 -07:00
Steven G. Johnson	a4c84d2063	fix #2: add charwidth function	2015-03-12 12:10:19 -04:00
Steven G. Johnson	397a1eabea	update graphemes for Unicode 7, add utf8proc_grapheme_break function	2014-12-12 16:30:31 -05:00
Jiahao Chen	b81326e82f	Update utf8proc_data.c (generated by data_generator.rb)	2014-07-18 10:46:11 -04:00
Steven G. Johnson	ab9520d188	import of utf8proc-v1.1.6	2014-07-15 15:29:52 -04:00

31 Commits