utf8proc

Author	SHA1	Message	Date
Benito van der Zander	8a4cd4c903	reduce lenencode bits (#232)	2021-12-16 20:30:27 -05:00
Mike Glorioso	610730f231	Fix Sign-Conversion warnings in library and test code (#214) * JuliaStrings#169 turn on sign-conversion warnings Signed-off-by: Mike Glorioso <mike.glorioso@gmail.com> * JuliaStrings#169 fix sign-conversion warnings for utf8proc.c fix sign-conversion warnings for utf8proc_iterate uc requires at most 21 bits to identify a unicode codepoint, so there is no need for it to be unsigned multiple locations use, modify, or store uc with a signed value the only exception is line 137 where uc is compared with an unsigned value fix sign-conversion warnings for utf8proc_tolower, utf8proc_toupper, utf8proc_totitle all three methods have sign conversion warnings when calling seqindex_decode_index seqindex_decode_index uses the passed value as an index to an array utf8proc_sequences as utf8proc_sequences is hard-coded and smaller than 2^31 - 1 we can safely cast to unsigned fix sign-conversion warnings for utf8proc_decompose_char lines with this warning use the defined function utf8proc_decompose_lump in the function, a hardcoded unsigned value (1<<12) is complemented then cast as a signed value as the intent is to remove the 12th bit flag from options, a signed value, an explicit cast is safe fix sign-conversion warnings for utf8proc_map_custom result is declared as signed, but is only expected to contain values between 0 and 4 sizeof returns an unsigned value. result must be cast to unsigned Signed-off-by: Mike Glorioso <mike.glorioso@gmail.com> * JuliaStrings#169 fix sign-conversion warnings for test/* fix sign-conversion warnings for test/tests.c encode change type for d to match return value of utf8proc_encode_char fix sign-conversion warnings for test/graphemetest.c checkline si, i, and j are unsigned size types, utf8proc_map and utf8proc_iterate accept and return signed size types utf8proc_map treats negative strlen values as 0. the strlen used by the test must be similarly limited utf8proc_iterate treats negative strlen values as 4 which will be less than the unsigned size fix unused-but-set-variable warning by checking the glen value fix sign-conversion warnings for test/case.c main the if block ensures that tested codepoint fits in wint_t, but needs to include u and l as well c, u, and l can be safely cast to wint_t fix sign-conversion warnings for test/iterate.c all values used for len are below 8, so an explicit cast is safe updated types for more portable test code fix sign-conversion warnings for test/printproperty.c main change type of c to signed to resolve all sign-conversion warnings. replace sscanf(... &c) with sscanf(... &x) followed by explicit sign conversion Signed-off-by: Mike Glorioso <mike.glorioso@gmail.com>	2021-01-14 12:59:49 -05:00
Steven G. Johnson	8239639e3f	fix NULL args in grapheme_break_stateful	2020-12-15 15:26:56 -05:00
Steven G. Johnson	0643a64479	Fix grapheme breaks on string-initial (#205) * Fix extended emoji + zwj combo * Patch initial repeated regional flags and extended+zwj emoji * Merge conditions for setting breaks bt region * updated fix * perform tests for both utf8proc_map and manual calls to utf8proc_grapheme_break_stateful * consolidate tests Co-authored-by: Thomas Marks <marksta@umich.edu>	2020-11-23 14:10:29 -05:00
Steven G. Johnson	5622a0a51b	add islower/isupper functions (#196) * add islower/isupper functions * added test * more tests + bugfix * Makefile fix * rm iscase test on make clean	2020-08-25 16:42:59 -04:00
xkszltl	08f9999a06	Switch to HTTPS for referencing `www.unicode.org`. (#193) Resolve https://github.com/JuliaStrings/utf8proc/issues/192	2020-05-25 10:20:08 -04:00
Steven G. Johnson	b48f5d074f	Unicode 13 support (#179) * exclude Sk from zero-width chars (closes #167) * update for Unicode 13	2020-03-27 17:06:06 -04:00
Steven G. Johnson	e6fba4aa8c	update header file comments (closes #157)	2019-05-14 10:53:55 -04:00
GOTOH Shunsuke	7b28b9e60c	update for unicode 12.1 (#156)	2019-05-10 21:12:45 -04:00
Steven G. Johnson	abf81603ba	add utf8proc_unicode_version (#151)	2019-03-30 16:31:02 -04:00
Steven G. Johnson	4603e00cfc	fix CHARBOUND option for non-characters (#149)	2019-03-30 15:22:25 -04:00
Steven G. Johnson	6a659a5843	doc fixes, don't export stdint and limits.h values UINT16_MAX and SSIZE_MAX	2018-07-24 13:32:42 -04:00
Steven G. Johnson	e0295be467	Merge branch 'master' of https://github.com/JuliaLang/utf8proc	2018-07-24 13:25:51 -04:00
Steven G. Johnson	60a2398184	copyright year updates	2018-07-24 13:20:49 -04:00
Steven G. Johnson	d4a58cfec5	update data and algorithms for Unicode 11 (#140)	2018-07-24 13:18:48 -04:00
Steven G. Johnson	bdc8b9e4b2	Case folding fixes (#133) * Fixes allowing for “Full” folding and NFKC_CaseFold compliance. * Only include C (Common) and F (Full) foldings from CaseFolding.txt. Removed S (Simple) since F & S are specified to be exclusive. * Extend UTF8PROC_IGNORE to also ignore unassigned codepoints (such as \u2065) which are specified as being discarded by NFKC_CF. * Document the changes to UTF8PROC_IGNORE in header. * Add NFKC_CF helper function with documentation. * restore old IGNORE behavior, add UTF8PROC_STRIPNA, rename to utf8proc_NFKC_Casefold, add a test * success message * test that IGNORE does not strip NA * data update * NFKC_Casefold shouldn't strip NA	2018-05-02 08:15:02 -04:00
Benito van der Zander	acc204f1f1	possible fix for #128 (#129) Does this help? I do not really remember what I wrote back then	2018-04-27 08:06:14 -04:00
Branko Čibej	3a10df6013	Fix declaration-after-statement warning when compiling in strict C90 mode. (#113)	2017-09-21 12:27:24 -04:00
Steven G. Johnson	b4621f43c3	new utf8proc_map_custom for hooking in user-defined custom mappings (#89) * new utf8proc_map_custom for hooking in user-defined custom mappings * whoops, add test program * NEWS, version bump for 2.1 * change test functions to static so that gcc doesn't complain about missing prototypes	2016-11-30 10:40:26 -05:00
Steven G. Johnson	8da37e2892	silence MSVC warning about conversion to uint8 (fix #86)	2016-11-30 10:09:18 -05:00
Michael Drake	70bbed8626	Tlsa/ucs4 normalize (#88) * Split codepoint sequence normalisation out into separate function. This creates utf8proc_normalize_utf32() which takes and returns a UTF-32 string, applying the following options: - UTF8PROC_NLF2LS - UTF8PROC_NLF2PS - UTF8PROC_NLF2LF - UTF8PROC_STRIPCC - UTF8PROC_COMPOSE - UTF8PROC_STABLE The utf8proc_reencode() function has been updated to call the new utf8proc_normalize_utf32(). * Update code documentation: utf8proc_reencode handles UTF8PROC_CHARBOUND.	2016-11-21 09:22:39 -05:00
Keno Fischer	289ce5e041	Fix incorrect use of `lbc` instead of `lbc_override` (#77)	2016-07-13 12:33:50 -04:00
Keno Fischer	c0a1ff81fc	Walk back ABI breaking changes (#76)	2016-07-13 10:41:13 -04:00
Benito van der Zander	eeebf70bcf	Smaller tables (#68) * convert sequences to utf-16 (saves 25kb) * store sequence length in properties instead using -1 termination (saves 10kb) * cache index for slightly faster data creation * store lower/upper/title mapping in sequence array (saves 25kb). Add utf8proc_totitle, as title_mapping cannot be used to get the title codepoint anymore. Rename xxx_mapping to xxx_seqindex, so programs assuming a value with the old meaning fail at compile time * change combination array data type to uint16 (saves 40kb) * merge 1st and 2nd comb index (saves 50kb) * kill empty prefix/suffix in combination array (saves 50kb) * there was no need to have a separate combination start array, it can be merged in a single array * some fixes * mark the table as const again * and regen	2016-07-12 11:51:50 -04:00
Keno Fischer	41c6b23aab	Unicode 9 updates (#70) * Updates for Unicode 9.0.0 TR29 Changes - New rules GB10/(12/13) are used to combine emoji-zwj sequences/ (force grapheme breaks every two RI codepoints). Unfortunately this breaks statelessness of grapheme-boundary determination. Deal with this by ignoring the problem in utf8proc_grapheme_break, and by hacking in a special case in decompose - ZWJ moved to its own boundclass, update what is now GB9 accordingly. - Add comments to indicate which rule a given case implements - The Number of bound classes Now exceeds 4 bits, expand to 8 and reorganize fields * Import Unicode 9 data * Update Grapheme break API to expose state override * Bump MAJOR version	2016-06-28 16:04:25 -04:00
Michaël Meyer	1f17487aa9	Fix overrun	2016-02-04 04:06:28 +01:00
Michaël Meyer	26436c9775	Reduce the size of the binary. Use integers instead of pointers in Unicode tables. Saves 226 kb / 716 kb in the compiled library.	2015-12-09 19:55:48 +01:00
Federico G. Schwindt	4fc2d8234d	Silence warning with -Wextra Fixes #60.	2015-11-24 20:09:10 +00:00
Steven G. Johnson	fd20b184dd	update copyright statements to list recent contributors and year	2015-11-01 08:34:01 -05:00
Peter Colberg	09360de186	Do not export internal unsafe_encode_char()	2015-10-29 00:45:39 -04:00
Steven G. Johnson	a8fb4b1772	add toupper/tolower functions (for JuliaLang/julia#11471)	2015-05-29 22:00:30 -04:00
Scott Paul Jones	6249e6b8b1	Fix #34 handle 66 Unicode non-characters, also improve performance and surrogate handling	2015-05-29 19:50:03 +02:00
Tony Kelman	0a818c7003	Prefix other C99 typedefs with utf8proc_	2015-04-06 22:36:33 -07:00
Tony Kelman	ad27722923	Use a new typedef utf8proc_ssize_t to avoid define collisions with MSVC	2015-04-05 20:06:13 -07:00
Steven G. Johnson	a1c429a45b	rename DLLEXPORT to UTF8PROC_DLLEXPORT to prevent conflicts with other header files that define DLLEXPORT	2015-03-30 11:05:51 -04:00
Steven G. Johnson	e1fdad0ca9	updated NEWS etc. for 1.2 release	2015-03-28 09:10:00 -04:00
Steven G. Johnson	11d2ece545	indentation consistency	2015-03-27 12:49:16 -04:00
Steven G. Johnson	c851c67888	put the API version as #defines in the header file (as discussed in #30)	2015-03-27 12:35:41 -04:00
Jonas Fonseca	03a4e8854a	Fix #26: use doxygen for generating API docs	2015-03-21 21:23:02 -04:00
Steven G. Johnson	3822984606	remove requirement that get_property and decompose_char argument be in range 0x0 to 0x10ffff	2015-03-12 14:17:27 -04:00
Steven G. Johnson	a4c84d2063	fix #2: add charwidth function	2015-03-12 12:10:19 -04:00
Steven G. Johnson	402883c78e	rename back to utf8proc now that we are taking over maintenance	2015-03-06 12:43:37 -05:00
Steven G. Johnson	397a1eabea	update graphemes for Unicode 7, add utf8proc_grapheme_break function	2014-12-12 16:30:31 -05:00
Steven G. Johnson	1b3992ebe5	utf8proc_version should return a different version string than utf8proc	2014-12-12 14:20:53 -05:00
Steven G. Johnson	df71da45df	Merge pull request #17 from JuliaLang/tk/dllexport RFC: add DLLEXPORT to utf8proc_get_property	2014-09-24 14:26:13 -04:00
Tony Kelman	a840e5dae1	add DLLEXPORT to all functions in mojibake.h	2014-09-22 09:53:55 -07:00
Veres Lajos	83714e458e	a few typofixes	2014-08-12 21:30:59 +01:00
Steven G. Johnson	2c4e520a17	utf8proc.h -> mojibake.h (closes #10 )	2014-07-18 14:28:17 -04:00
Steven G. Johnson	dd35a8530d	C++/MSVC compatibility, indenting, for #4	2014-07-18 11:47:24 -04:00
Steven G. Johnson	ab9520d188	import of utf8proc-v1.1.6	2014-07-15 15:29:52 -04:00

50 Commits