Commit Graph

50 Commits

Author SHA1 Message Date
Benito van der Zander
8a4cd4c903
reduce lenencode bits (#232) 2021-12-16 20:30:27 -05:00
Mike Glorioso
610730f231
Fix Sign-Conversion warnings in library and test code (#214)
* JuliaStrings#169 turn on sign-conversion warnings

Signed-off-by: Mike Glorioso <mike.glorioso@gmail.com>

* JuliaStrings#169 fix sign-conversion warnings for utf8proc.c

fix sign-conversion warnings for utf8proc_iterate
uc requires at most 21 bits to identify a unicode codepoint, so there is no need for it to be unsigned
multiple locations use, modify, or store uc with a signed value
the only exception is line 137 where uc is compared with an unsigned value

fix sign-conversion warnings for utf8proc_tolower, utf8proc_toupper, utf8proc_totitle
all three methods have sign conversion warnings when calling seqindex_decode_index
seqindex_decode_index uses the passed value as an index to an array utf8proc_sequences
as utf8proc_sequences is hard-coded and smaller than 2^31 - 1 we can safely cast to unsigned

fix sign-conversion warnings for utf8proc_decompose_char
lines with this warning use the defined function utf8proc_decompose_lump
in the function, a hardcoded unsigned value (1<<12) is complemented then cast as a signed value
as the intent is to remove the 12th bit flag from options, a signed value, an explicit cast is safe

fix sign-conversion warnings for utf8proc_map_custom
result is declared as signed, but is only expected to contain values between 0 and 4
sizeof returns an unsigned value. result must be cast to unsigned

Signed-off-by: Mike Glorioso <mike.glorioso@gmail.com>

* JuliaStrings#169 fix sign-conversion warnings for test/*

fix sign-conversion warnings for test/tests.c encode
change type for d to match return value of utf8proc_encode_char

fix sign-conversion warnings for test/graphemetest.c checkline
si, i, and j are unsigned size types, utf8proc_map and utf8proc_iterate accept and return signed size types
utf8proc_map treats negative strlen values as 0. the strlen used by the test must be similarly limited
utf8proc_iterate treats negative strlen values as 4 which will be less than the unsigned size
fix unused-but-set-variable warning by checking the glen value

fix sign-conversion warnings for test/case.c main
the if block ensures that tested codepoint fits in wint_t, but needs to include u and l as well
c, u, and l can be safely cast to wint_t

fix sign-conversion warnings for test/iterate.c
all values used for len are below 8, so an explicit cast is safe
updated types for more portable test code

fix sign-conversion warnings for test/printproperty.c main
change type of c to signed to resolve all sign-conversion warnings.
replace sscanf(... &c) with sscanf(... &x) followed by explicit sign conversion

Signed-off-by: Mike Glorioso <mike.glorioso@gmail.com>
2021-01-14 12:59:49 -05:00
Steven G. Johnson
8239639e3f fix NULL args in grapheme_break_stateful 2020-12-15 15:26:56 -05:00
Steven G. Johnson
0643a64479
Fix grapheme breaks on string-initial (#205)
* Fix extended emoji + zwj combo

* Patch initial repeated regional flags and extended+zwj emoji

* Merge conditions for setting breaks bt region

* updated fix

* perform tests for both utf8proc_map and manual calls to utf8proc_grapheme_break_stateful

* consolidate tests

Co-authored-by: Thomas Marks <marksta@umich.edu>
2020-11-23 14:10:29 -05:00
Steven G. Johnson
5622a0a51b
add islower/isupper functions (#196)
* add islower/isupper functions

* added test

* more tests + bugfix

* Makefile fix

* rm iscase test on make clean
2020-08-25 16:42:59 -04:00
xkszltl
08f9999a06
Switch to HTTPS for referencing www.unicode.org. (#193)
Resolve https://github.com/JuliaStrings/utf8proc/issues/192
2020-05-25 10:20:08 -04:00
Steven G. Johnson
b48f5d074f
Unicode 13 support (#179)
* exclude Sk from zero-width chars (closes #167)

* update for Unicode 13
2020-03-27 17:06:06 -04:00
Steven G. Johnson
e6fba4aa8c update header file comments (closes #157) 2019-05-14 10:53:55 -04:00
GOTOH Shunsuke
7b28b9e60c update for unicode 12.1 (#156) 2019-05-10 21:12:45 -04:00
Steven G. Johnson
abf81603ba
add utf8proc_unicode_version (#151) 2019-03-30 16:31:02 -04:00
Steven G. Johnson
4603e00cfc
fix CHARBOUND option for non-characters (#149) 2019-03-30 15:22:25 -04:00
Steven G. Johnson
6a659a5843 doc fixes, don't export stdint and limits.h values UINT16_MAX and SSIZE_MAX 2018-07-24 13:32:42 -04:00
Steven G. Johnson
e0295be467 Merge branch 'master' of https://github.com/JuliaLang/utf8proc 2018-07-24 13:25:51 -04:00
Steven G. Johnson
60a2398184 copyright year updates 2018-07-24 13:20:49 -04:00
Steven G. Johnson
d4a58cfec5
update data and algorithms for Unicode 11 (#140) 2018-07-24 13:18:48 -04:00
Steven G. Johnson
bdc8b9e4b2
Case folding fixes (#133)
* Fixes allowing for “Full” folding and NFKC_CaseFold compliance.

* Only include C (Common) and F (Full) foldings from CaseFolding.txt. Removed S (Simple) since F & S are specified to be exclusive.
* Extend UTF8PROC_IGNORE to also ignore unassigned codepoints (such as \u2065) which are specified as being discarded by NFKC_CF.

* Document the changes to UTF8PROC_IGNORE in header.

* Add NFKC_CF helper function with documentation.

* restore old IGNORE behavior, add UTF8PROC_STRIPNA, rename to utf8proc_NFKC_Casefold, add a test

* success message

* test that IGNORE does not strip NA

* data update

* NFKC_Casefold shouldn't strip NA
2018-05-02 08:15:02 -04:00
Benito van der Zander
acc204f1f1 possible fix for #128 (#129)
Does this help? I do not really remember what I wrote back then
2018-04-27 08:06:14 -04:00
Branko Čibej
3a10df6013 Fix declaration-after-statement warning when compiling in strict C90 mode. (#113) 2017-09-21 12:27:24 -04:00
Steven G. Johnson
b4621f43c3 new utf8proc_map_custom for hooking in user-defined custom mappings (#89)
* new utf8proc_map_custom for hooking in user-defined custom mappings

* whoops, add test program

* NEWS, version bump for 2.1

* change test functions to static so that gcc doesn't complain about missing prototypes
2016-11-30 10:40:26 -05:00
Steven G. Johnson
8da37e2892 silence MSVC warning about conversion to uint8 (fix #86) 2016-11-30 10:09:18 -05:00
Michael Drake
70bbed8626 Tlsa/ucs4 normalize (#88)
* Split codepoint sequence normalisation out into separate function.

This creates utf8proc_normalize_utf32() which takes and returns
a UTF-32 string, applying the following options:

- UTF8PROC_NLF2LS
- UTF8PROC_NLF2PS
- UTF8PROC_NLF2LF
- UTF8PROC_STRIPCC
- UTF8PROC_COMPOSE
- UTF8PROC_STABLE

The utf8proc_reencode() function has been updated to call the
new utf8proc_normalize_utf32().

* Update code documentation: utf8proc_reencode handles UTF8PROC_CHARBOUND.
2016-11-21 09:22:39 -05:00
Keno Fischer
289ce5e041 Fix incorrect use of lbc instead of lbc_override (#77) 2016-07-13 12:33:50 -04:00
Keno Fischer
c0a1ff81fc Walk back ABI breaking changes (#76) 2016-07-13 10:41:13 -04:00
Benito van der Zander
eeebf70bcf Smaller tables (#68)
* convert sequences to utf-16 (saves 25kb)

* store sequence length in properties instead using -1 termination (saves 10kb)

* cache index for slightly faster data creation

* store lower/upper/title mapping in sequence array (saves 25kb). Add utf8proc_totitle, as title_mapping cannot be used to get the title codepoint anymore. Rename xxx_mapping to xxx_seqindex, so programs assuming a value with the old meaning fail at compile time

* change combination array data type to uint16 (saves 40kb)

* merge 1st and 2nd comb index (saves 50kb)

* kill empty prefix/suffix in combination array (saves 50kb)

* there was no need to have a separate combination start array, it can be merged in a single array

* some fixes

* mark the table as const again

* and regen
2016-07-12 11:51:50 -04:00
Keno Fischer
41c6b23aab Unicode 9 updates (#70)
* Updates for Unicode 9.0.0 TR29 Changes

- New rules GB10/(12/13) are used to combine emoji-zwj sequences/
  (force grapheme breaks every two RI codepoints). Unfortunately this
  breaks statelessness of grapheme-boundary determination. Deal with
  this by ignoring the problem in utf8proc_grapheme_break, and by
  hacking in a special case in decompose

- ZWJ moved to its own boundclass, update what is now GB9 accordingly.

- Add comments to indicate which rule a given case implements

- The number of bound classes now exceeds 4 bits, expand to 8 and
  reorganize fields

* Import Unicode 9 data

* Update Grapheme break API to expose state override

* Bump MAJOR version
2016-06-28 16:04:25 -04:00
Michaël Meyer
1f17487aa9 Fix overrun 2016-02-04 04:06:28 +01:00
Michaël Meyer
26436c9775 Reduce the size of the binary.
Use integers instead of pointers in Unicode tables. Saves 226 kb / 716 kb in the
compiled library.
2015-12-09 19:55:48 +01:00
Federico G. Schwindt
4fc2d8234d Silence warning with -Wextra
Fixes #60.
2015-11-24 20:09:10 +00:00
Steven G. Johnson
fd20b184dd update copyright statements to list recent contributors and year 2015-11-01 08:34:01 -05:00
Peter Colberg
09360de186 Do not export internal unsafe_encode_char() 2015-10-29 00:45:39 -04:00
Steven G. Johnson
a8fb4b1772 add toupper/tolower functions (for JuliaLang/julia#11471) 2015-05-29 22:00:30 -04:00
Scott Paul Jones
6249e6b8b1 Fix #34 handle 66 Unicode non-characters, also improve performance and surrogate handling 2015-05-29 19:50:03 +02:00
Tony Kelman
0a818c7003 Prefix other C99 typedefs with utf8proc_ 2015-04-06 22:36:33 -07:00
Tony Kelman
ad27722923 Use a new typedef utf8proc_ssize_t to avoid define collisions
with MSVC
2015-04-05 20:06:13 -07:00
Steven G. Johnson
a1c429a45b rename DLLEXPORT to UTF8PROC_DLLEXPORT to prevent conflicts with other header files that define DLLEXPORT 2015-03-30 11:05:51 -04:00
Steven G. Johnson
e1fdad0ca9 updated NEWS etc. for 1.2 release 2015-03-28 09:10:00 -04:00
Steven G. Johnson
11d2ece545 indentation consistency 2015-03-27 12:49:16 -04:00
Steven G. Johnson
c851c67888 put the API version as #defines in the header file (as discussed in #30) 2015-03-27 12:35:41 -04:00
Jonas Fonseca
03a4e8854a Fix #26: use doxygen for generating API docs 2015-03-21 21:23:02 -04:00
Steven G. Johnson
3822984606 remove requirement that get_property and decompose_char argument be in range 0x0 to 0x10ffff 2015-03-12 14:17:27 -04:00
Steven G. Johnson
a4c84d2063 fix #2: add charwidth function 2015-03-12 12:10:19 -04:00
Steven G. Johnson
402883c78e rename back to utf8proc now that we are taking over maintenance 2015-03-06 12:43:37 -05:00
Steven G. Johnson
397a1eabea update graphemes for Unicode 7, add utf8proc_grapheme_break function 2014-12-12 16:30:31 -05:00
Steven G. Johnson
1b3992ebe5 utf8proc_version should return a different version string than utf8proc 2014-12-12 14:20:53 -05:00
Steven G. Johnson
df71da45df Merge pull request #17 from JuliaLang/tk/dllexport
RFC: add DLLEXPORT to utf8proc_get_property
2014-09-24 14:26:13 -04:00
Tony Kelman
a840e5dae1 add DLLEXPORT to all functions in mojibake.h 2014-09-22 09:53:55 -07:00
Veres Lajos
83714e458e a few typofixes 2014-08-12 21:30:59 +01:00
Steven G. Johnson
2c4e520a17 utf8proc.h -> mojibake.h (closes #10) 2014-07-18 14:28:17 -04:00
Steven G. Johnson
dd35a8530d C++/MSVC compatibility, indenting, for #4 2014-07-18 11:47:24 -04:00
Steven G. Johnson
ab9520d188 import of utf8proc-v1.1.6 2014-07-15 15:29:52 -04:00