Specify all presentation sequences #114

MDLC01 · 2025-07-16T23:25:23Z

This PR fixes #21, fixes #23, and closes #25, by adding the appropriate variation selectors to symbols that exist in both sym and emoji. I verified that all the variation sequences are defined by Unicode.¹²

I have marked this PR as a draft because the next step is to add variation selectors to all symbols that allow it, whether present in both sym and emoji or not, to prevent ambiguity.

This made me realize that some emojis are poorly named, but improving this is a task for a separate PR.

Related: typst/typst#6489 (comment).

Enivex · 2025-07-17T07:18:12Z

> This made me realize that some emojis are poorly named,

It's a lot of them. My recent-ish PR only covered a small part.

T0mstone · 2025-07-17T07:48:09Z

I'll be the first to bring up automation here:
Whether we want to automate this in the future or not, I think we should definitely have all of these hard-coded in the source code, since such automation would probably take a lot of time that we shouldn't have to pay on each build.
The automation would then only be an optional verification step like in #9. (And maybe we always enable it in CI)

That means any concerns about such automation should not block this PR.

MDLC01 · 2025-07-24T21:40:12Z

I had to find a way to be confident in the content of this PR anyway, so I figured I could make tests so that the automation part is already done. I left a commit with the tests failing, so that you can see the output in CI. Notably, there was one mistake introduced in ae98eb0: I added the text presentation selector to sym.dash.wave instead of sym.dash.wave.double.

For now, I query the list of presentation sequences defined by Unicode from the internet every time the tests are run. Feel free to suggest a better way.
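As an illustration of what such a check might look like, here is a minimal sketch, not the code from this PR. It assumes the data follows the usual Unicode data-file layout (hex code points, then `;`-separated fields, with `#` starting a comment), and the names `parse_variation_sequences` and `is_missing_selector` are hypothetical.

```rust
use std::collections::BTreeSet;

/// Parses text in the Unicode data-file layout, e.g.
/// `0023 FE0E  ; text style;  # (1.1) NUMBER SIGN`,
/// and returns every `(base, selector)` pair it lists.
/// (Hypothetical helper; the layout is an assumption stated above.)
fn parse_variation_sequences(data: &str) -> BTreeSet<(char, char)> {
    let mut sequences = BTreeSet::new();
    for line in data.lines() {
        // Everything after `#` is a comment.
        let line = line.split('#').next().unwrap_or("");
        // The first `;`-separated field holds the code points in hex.
        let field = line.split(';').next().unwrap_or("");
        let mut points = field
            .split_whitespace()
            .filter_map(|hex| u32::from_str_radix(hex, 16).ok())
            .filter_map(char::from_u32);
        if let (Some(base), Some(selector)) = (points.next(), points.next()) {
            sequences.insert((base, selector));
        }
    }
    sequences
}

/// Returns `true` if `symbol` is a single character that appears as the base
/// of some listed sequence, i.e. a selector could be added but is absent.
fn is_missing_selector(symbol: &str, sequences: &BTreeSet<(char, char)>) -> bool {
    let mut chars = symbol.chars();
    match (chars.next(), chars.next()) {
        (Some(base), None) => sequences.iter().any(|&(b, _)| b == base),
        _ => false,
    }
}
```

A test could then iterate over every symbol and fail on each one for which `is_missing_selector` returns `true`.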

MDLC01 · 2025-07-24T22:34:41Z

This PR now contains meta changes.

T0mstone · 2025-08-03T17:47:17Z

> For now, I query the list of presentation sequences defined by Unicode from the internet every time the tests are run. Feel free to suggest a better way.

How about caching it in a file cache/presentation-sequences.txt and adding /cache/presentation-sequences.txt to the .gitignore?

Also, I'd like the web-request part of this test to be opt-in to begin with (the test can still run without it if the cache file already exists). IMO running cargo test shouldn't perform any web requests without the user's consent. A crate feature for this would also have the advantage of making the reqwest dependency optional.
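A rough sketch of that idea, under some assumptions: the function name `load_cached_or_fetch` is made up, the download itself is passed in as a closure so the sketch does not depend on any HTTP crate, and `allow_download` stands in for whatever opt-in mechanism is chosen (for example `cfg!(feature = "...")`).

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Returns the cached file contents if the cache exists. Otherwise, downloads
/// via `fetch` only when `allow_download` is set, and stores the result.
/// Returns `Ok(None)` when there is no cache and downloading is not allowed,
/// so the caller can skip the check instead of failing.
fn load_cached_or_fetch(
    cache: &Path,
    allow_download: bool,
    fetch: impl FnOnce() -> io::Result<String>,
) -> io::Result<Option<String>> {
    if cache.exists() {
        return fs::read_to_string(cache).map(Some);
    }
    if !allow_download {
        return Ok(None);
    }
    let text = fetch()?;
    // Create `cache/` (or any parent directory) on first use.
    if let Some(dir) = cache.parent() {
        fs::create_dir_all(dir)?;
    }
    fs::write(cache, &text)?;
    Ok(Some(text))
}
```

With this shape, a plain `cargo test` never touches the network unless the user opted in, and a second run reuses the cached file.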

MDLC01 · 2025-08-03T18:16:27Z

Initially I thought it may be better to download the file as part of a build script for tests only (which would cache it for as long as the source code is not modified), but I'm not sure how to run something only for building tests? Also, maybe this is a bad idea for some other reason. I think your solution is probably better anyway. Maybe the file could even be part of the repo so that it doesn't need to be downloaded every time, but somehow that doesn't feel right to me. Also, that might have some licensing issues (we would probably need to include https://www.unicode.org/license.txt as well).

Additionally, I want to clarify that the dependency on reqwest is for tests only.

T0mstone · 2025-08-03T20:42:49Z

> Additionally, I want to clarify that the dependency on reqwest is for tests only.

Yeah I got that, but it's still a huge dep tree that not everyone may want to have to download before running [the rest of] the tests.

laurmaedje · 2025-08-04T08:32:30Z

If we're gonna go with the downloading thing, then ureq would already be a lot smaller than reqwest.

MDLC01 · 2025-08-05T11:12:08Z

I switched to ureq. Additionally, the file is now pinned to Unicode 16.0.0 so as to prevent sudden breakage when a new Unicode version releases. The file is now downloaded in build.rs. To prevent always having ureq as a build dependency, I hid the tests that require it behind a non-default _test-unicode-conformance feature and added it to CI.
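For readers unfamiliar with the pattern, a small sketch of the two building blocks a build script could use, not the actual build.rs of this PR. It relies on Cargo exposing each enabled feature to build scripts as a `CARGO_FEATURE_<NAME>` environment variable (name uppercased, `-` replaced by `_`). The helper names and the relative file path passed to `pinned_url` are assumptions for illustration.

```rust
use std::env;

/// Builds a URL under a fixed Unicode version directory, so that a new
/// Unicode release cannot change the downloaded data.
/// (`relative_path` is supplied by the caller; hypothetical helper.)
fn pinned_url(version: &str, relative_path: &str) -> String {
    format!("https://www.unicode.org/Public/{version}/{relative_path}")
}

/// Name of the environment variable Cargo sets for build scripts when the
/// given feature is enabled.
fn feature_env_var(feature: &str) -> String {
    format!("CARGO_FEATURE_{}", feature.to_uppercase().replace('-', "_"))
}

/// Whether the opt-in conformance feature is enabled for this build.
fn conformance_enabled() -> bool {
    env::var_os(feature_env_var("_test-unicode-conformance")).is_some()
}
```

A build script would download `pinned_url("16.0.0", ...)` only when `conformance_enabled()` returns `true`, which keeps default builds free of network access.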

MDLC01 · 2025-08-05T11:15:06Z

According to The Cargo Book, it is not possible to have a build dependency for tests only, so this feature trick is necessary:

> The same applies to cfg(debug_assertions), cfg(test) and cfg(proc_macro). [...] There is currently no way to add dependencies based on these configuration values.

build.rs

src/lib.rs

Enivex · 2025-08-12T19:36:57Z

Would it be possible to have some shorthand that specifies text vs emoji form? \u{FE0E} isn't particularly readable

MDLC01 · 2025-08-13T10:18:11Z

Could be \vs1–\vs16, or alternatively \evs and \tvs for "emoji variation selector" and "text variation selector" only.¹

¹ https://www.unicode.org/charts/PDF/UFE00.pdf

MDLC01 · 2025-08-13T10:18:50Z

Or \emoji and \text

MDLC01 · 2025-08-13T15:39:39Z

An alternative is to simply paste the corresponding codepoints in the symbol lists

T0mstone · 2025-08-14T20:51:04Z

> An alternative is to simply paste the corresponding codepoints in the symbol lists

That defeats the point of having escape sequences to begin with. We could also paste in the zwj and friends verbatim, but that's terrible for editing, diffing, etc.

T0mstone · 2025-08-14T20:52:51Z

> Or \emoji and \text

This (only with ! instead of \) was also what I originally suggested in #92. Using a backslash is way better tho, so I actually like this more than my original idea.

Enivex · 2025-08-14T20:55:58Z

> Or \emoji and \text

I think this would be good

Enivex · 2025-08-14T21:55:40Z

The other variation selectors have multiple roles, and therefore aren't easy to name. Maybe \vs1, \vs2, ... \vs15 and \vs16 with \text and \emoji possibly being alternative names for the final two? Even just \vs1 - \vs16 would be more readable than \u{fe00} - \u{fe0f}

There's a couple dozen mathematical symbol variants involving VS1

https://www.unicode.org/Public/16.0.0/ucd/StandardizedVariants.txt

Edit: Only selectors 1-4, 7, 15 and 16 are currently in use.

MDLC01 · 2025-08-15T12:20:10Z

I implemented \vs{1}--\vs{15}, as well as \vs{text} and \vs{emoji}.
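To make the mapping concrete, a minimal sketch of how the argument of such an escape could be resolved, assuming only what is stated above plus the Unicode facts that VS1 through VS16 occupy U+FE00 through U+FE0F, with VS15 selecting text presentation and VS16 emoji presentation. The function name `variation_selector` is hypothetical. Whether a numeric `\vs{16}` is accepted is not stated here, so the sketch rejects it.

```rust
/// Resolves the argument of a `\vs{...}` escape to a variation selector.
/// Accepts `1` through `15`, plus the names `text` (VS15) and `emoji` (VS16).
fn variation_selector(arg: &str) -> Option<char> {
    let index: u32 = match arg {
        "text" => 15,
        "emoji" => 16,
        _ => match arg.parse::<u32>() {
            Ok(n @ 1..=15) => n,
            _ => return None,
        },
    };
    // VS1 is U+FE00, so VSn is U+FE00 + (n - 1).
    char::from_u32(0xFE00 + index - 1)
}
```

Under this mapping, `\vs{text}` and `\vs{15}` both yield U+FE0E, and `\vs{emoji}` yields U+FE0F.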

MDLC01 · 2025-08-22T12:17:13Z

I think all the concerns were addressed and so this is ready for review

Add variation selectors to symbols that exist in both sym and emoji

ae98eb0

MDLC01 added 3 commits July 24, 2025 22:33

Specify 'static lifetime when possible

fabc977

Add tests for variation sequences

2062af4

Fix variation sequences

bc90113

MDLC01 marked this pull request as ready for review July 24, 2025 21:40

MDLC01 changed the title ~~Add variation selectors to symbols that exist in both sym and emoji~~ Specify all presentation sequences Jul 24, 2025

MDLC01 added the meta label (Discussion about the structure of this repo) Jul 24, 2025

Download file on build and add _test-unicode-conformance feature

84f7d3e

MDLC01 force-pushed the vs branch from ed9862e to 84f7d3e Compare August 5, 2025 11:09

T0mstone reviewed Aug 7, 2025

build.rs (outdated)

Only download file if not cached

a94ed5b

T0mstone reviewed Aug 10, 2025

src/lib.rs (outdated)

MDLC01 added 2 commits August 11, 2025 12:45

Improve code clarity and future-proofness

2fb667e

Add clarification comment

0258d05

MDLC01 force-pushed the vs branch from 8b20945 to 0258d05 Compare August 11, 2025 14:21

T0mstone reviewed Aug 11, 2025

src/lib.rs (outdated)

"presentation sequences" -> "emoji variation sequences"

3fe4ebb

Add and use \vs{15} syntax

efadf1d
