Commit Graph

24 Commits

Author SHA1 Message Date
Steven G. Johnson
6a899e100a update copyright statement for data_generator 2018-07-24 13:24:56 -04:00
Steven G. Johnson
02f4e1890c
charwidth=1 for soft hyphen and unassigned codepoints (#135)
* use width=1 for soft hyphen and for unassigned/PUA codepoints

* don't count unassigned codepoints when comparing with system wcwidth

* more tests

* indentation fixes

* NEWS for 135

* remove special-casing for Arabic control characters affecting a span of numbers, which are sometimes zero-width and sometimes not

* regenerate
2018-07-24 10:45:02 -04:00
Steven G. Johnson
d81308faba
uppercase mapping ß (U+00df) to ẞ (U+1E9E) (#134)
* uppercase(0x00df) = 0x1e9e

* tests for titlecase and u+00df uppercase

* NEWS, another test
2018-05-02 14:18:26 -04:00
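As an illustration of the one-to-one mapping this commit describes, here is a minimal Python sketch. It is not the utf8proc implementation: `simple_upper` and `OVERRIDES` are hypothetical names, and Python's own `str.upper()` is used only as a stand-in source of case data. Python applies the full mapping (ß becomes "SS"), so a single-codepoint mapper needs an explicit entry to return U+1E9E instead.

```python
# Hypothetical override table: the one-to-one choice named in the commit.
OVERRIDES = {0x00DF: 0x1E9E}  # ß -> ẞ

def simple_upper(cp):
    """Return a single uppercase codepoint for cp (illustrative only)."""
    if cp in OVERRIDES:
        return OVERRIDES[cp]
    upper = chr(cp).upper()
    # If the full mapping expands to several codepoints, keep cp unchanged.
    return ord(upper) if len(upper) == 1 else cp

FULL_UPPER_OF_SHARP_S = "\u00df".upper()  # Python's full mapping: "SS"
```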
Steven G. Johnson
bdc8b9e4b2
Case folding fixes (#133)
* Fixes allowing for “Full” folding and NFKC_CaseFold compliance.

* Only include C (Common) and F (Full) foldings from CaseFolding.txt. Removed S (Simple) since F & S are specified to be exclusive.
* Extend UTF8PROC_IGNORE to also ignore unassigned codepoints (such as \u2065) which are specified as being discarded by NFKC_CF.

* Document the changes to UTF8PROC_IGNORE in header.

* Add NFKC_CF helper function with documentation.

* restore old IGNORE behavior, add UTF8PROC_STRIPNA, rename to utf8proc_NFKC_Casefold, add a test

* success message

* test that IGNORE does not strip NA

* data update

* NFKC_Casefold shouldn't strip NA
2018-05-02 08:15:02 -04:00
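The C-plus-F selection described above can be sketched in Python as follows. This is an assumption-laden illustration, not the project's data generator: `load_foldings` and `SAMPLE` are hypothetical names, and `SAMPLE` is a small excerpt in the CaseFolding.txt line format (code; status; mapping; comment) rather than the full file.

```python
# A few lines in the CaseFolding.txt format, used as sample input.
SAMPLE = """\
0041; C; 0061; # LATIN CAPITAL LETTER A
00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
1E9E; F; 0073 0073; # LATIN CAPITAL LETTER SHARP S
1E9E; S; 00DF; # LATIN CAPITAL LETTER SHARP S
"""

def load_foldings(text, statuses=("C", "F")):
    """Keep only mappings whose status is in `statuses` (default: C and F)."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not line:
            continue
        code, status, mapping = [f.strip() for f in line.split(";")[:3]]
        if status in statuses:
            table[int(code, 16)] = [int(cp, 16) for cp in mapping.split()]
    return table

FOLDINGS = load_foldings(SAMPLE)
```

Because a codepoint carries either an F or an S entry for a given purpose, filtering on C and F leaves at most one mapping per codepoint, so no merge logic is needed in this sketch.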
Steven G. Johnson
d736adeff1
update to unicode 10 (#132) 2018-04-27 12:50:19 -04:00
Paul Smith
95fc75b839 Ensure generated const data tables are hidden via "static" (#100) 2017-02-19 17:33:25 -05:00
Michael Hatherly
eab97d16fb Don't use cached version of UnicodeData.txt (#92)
Ref: https://github.com/JuliaLang/julia/pull/19725, UnicodeData.txt is
now being cached in JuliaLang/julia's build.
2017-01-03 16:44:23 -08:00
Steven G. Johnson
15e1819cdd update to unifont 9.0.04 2016-12-11 16:35:27 -05:00
petercolberg
11b84e2de1 Use versioned Unicode data URLs (#78)
This ensures the tests keep working when a new Unicode version is released.
2016-07-13 12:40:59 -04:00
Steven G. Johnson
c02ebd5a83 update to Unifont 9 (for Unicode 9 charwidths) (#75) 2016-07-12 16:30:05 -04:00
Benito van der Zander
eeebf70bcf Smaller tables (#68)
* convert sequences to utf-16 (saves 25kb)

* store sequence length in properties instead of using -1 termination (saves 10kb)

* cache index for slightly faster data creation

* store lower/upper/title mapping in sequence array (saves 25kb). Add utf8proc_totitle, as title_mapping cannot be used to get the title codepoint anymore. Rename xxx_mapping to xxx_seqindex, so programs assuming a value with the old meaning fail at compile time

* change combination array data type to uint16 (saves 40kb)

* merge 1st and 2nd comb index (saves 50kb)

* kill empty prefix/suffix in combination array (saves 50kb)

* there was no need to have a separate combination start array, it can be merged in a single array

* some fixes

* mark the table as const again

* and regen
2016-07-12 11:51:50 -04:00
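Two of the ideas in this commit, storing sequences as UTF-16 code units and recording an explicit length instead of a -1 terminator, can be sketched in Python. This is only an illustration under assumed names (`pack_sequences`, `unpack`, and the `(start, length)` entry layout are hypothetical) and does not reproduce the actual table layout in utf8proc.

```python
from array import array

def to_utf16_units(codepoints):
    """Encode codepoints as UTF-16 code units (surrogate pairs above U+FFFF)."""
    units = []
    for cp in codepoints:
        if cp < 0x10000:
            units.append(cp)
        else:
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))
            units.append(0xDC00 | (cp & 0x3FF))
    return units

def pack_sequences(sequences):
    """Pack all sequences into one 16-bit array; return (table, entries).

    Each entry is (start, length), so no terminator value is stored.
    """
    table = array("H")
    entries = []
    for seq in sequences:
        units = to_utf16_units(seq)
        entries.append((len(table), len(units)))
        table.extend(units)
    return table, entries

def unpack(table, entry):
    """Decode one packed sequence back into a list of codepoints."""
    start, length = entry
    units = table[start:start + length]
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u < 0xDC00:  # high surrogate: combine with the next unit
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

SEQUENCES = [[0x0053, 0x0053], [0x1D157, 0x1D165]]
TABLE, ENTRIES = pack_sequences(SEQUENCES)
```

Most mapped sequences lie in the Basic Multilingual Plane, so a 16-bit unit per codepoint is usually enough; only supplementary codepoints cost two units.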
Keno Fischer
41c6b23aab Unicode 9 updates (#70)
* Updates for Unicode 9.0.0 TR29 Changes

- New rules GB10 and GB12/GB13 are used to combine emoji-zwj sequences and
  to force grapheme breaks every two RI codepoints, respectively.
  Unfortunately this breaks statelessness of grapheme-boundary
  determination. Deal with this by ignoring the problem in
  utf8proc_grapheme_break, and by hacking in a special case in decompose.

- ZWJ moved to its own boundclass, update what is now GB9 accordingly.

- Add comments to indicate which rule a given case implements

- The number of bound classes no longer fits in 4 bits; expand the field
  to 8 bits and reorganize fields

* Import Unicode 9 data

* Update Grapheme break API to expose state override

* Bump MAJOR version
2016-06-28 16:04:25 -04:00
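To see why pairing regional indicator (RI) codepoints breaks statelessness, consider this simplified Python sketch. It is not utf8proc's algorithm: `split_flags` is a hypothetical helper that handles only the RI pairing rule and treats every other codepoint as its own cluster. The point is that whether to break between two adjacent RIs depends on how many RIs came before, which a check on the two neighbouring codepoints alone cannot know.

```python
def is_regional_indicator(cp):
    """Regional indicator symbols occupy U+1F1E6..U+1F1FF."""
    return 0x1F1E6 <= cp <= 0x1F1FF

def split_flags(codepoints):
    """Group RIs into pairs; every non-RI codepoint forms its own cluster."""
    clusters = []
    ri_run = 0  # state: number of consecutive RIs seen so far
    for cp in codepoints:
        if is_regional_indicator(cp):
            if ri_run % 2 == 1:
                clusters[-1].append(cp)  # completes the pending pair
            else:
                clusters.append([cp])    # starts a new pair
            ri_run += 1
        else:
            clusters.append([cp])
            ri_run = 0
    return clusters

# U+1F1E9 U+1F1EA (DE) followed by U+1F1EB U+1F1F7 (FR)
TWO_FLAGS = split_flags([0x1F1E9, 0x1F1EA, 0x1F1EB, 0x1F1F7])
```

The break between the second and third codepoints above is decided by `ri_run`, not by the codepoints themselves, which is the kind of state the commit exposes through the updated grapheme break API.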
Michaël Meyer
26436c9775 Reduce the size of the binary.
Use integers instead of pointers in Unicode tables. Saves 226 kb / 716 kb in the
compiled library.
2015-12-09 19:55:48 +01:00
Peter Colberg
b10b64dc10 Fix deprecated warnings with Julia 0.4 2015-10-31 13:59:38 -04:00
Peter Colberg
8f522ad8e7 Add missing files to make clean 2015-10-30 14:56:03 -04:00
Peter Colberg
0a20307c39 Set URLCACHE to JuliaLang cache server for Travis builds
Download Unicode data from upstream server by default.

Download GNU Unifont from reliable GNU mirror by default.
2015-10-29 20:07:35 -04:00
Peter Colberg
f35e18e4b5 Generate fontforge font files in makefile
Revise the script to directly read fontforge font files, which are
generated in the makefile. This permits overriding the fontforge path
during the build, and executing fontforge in parallel with make -j.

Avoid duplicating download URLs in the script, which ensures that the
script itself works without network access, e.g., when downloading the
data files on a developer machine with network access and executing the
script on a build machine without network access.
2015-10-29 19:48:49 -04:00
Jiahao Chen
f0675f26f4 Update Unifont to 8.0.01 2015-06-29 16:42:34 -04:00
Steven G. Johnson
eefdaed218 sort keys to try to eliminate data dependence on Ruby version 2015-06-25 19:15:57 -04:00
Steven G. Johnson
6a7f92da64 fix #46 (make sure symbol-like codepoints have nonzero width even if they aren't in Unifont) 2015-06-24 14:07:15 -04:00
Jiahao Chen
d18963cc46 Minor fixes to work with Unicode 8.0.0 data 2015-06-20 08:03:40 -04:00
Tony Kelman
0a818c7003 Prefix other C99 typedefs with utf8proc_ 2015-04-06 22:36:33 -07:00
Steven G. Johnson
a4c84d2063 fix #2: add charwidth function 2015-03-12 12:10:19 -04:00
Steven G. Johnson
90721f2d39 directory cleanup: move tests and data into subdirectories 2015-03-06 17:36:08 -05:00