Reorganize this repo: distribute the dirs over categories #2872

KOLANICH · 2022-09-16T07:28:58Z

KOLANICH
Sep 16, 2022

It is a bit hard to navigate this repo when all the dirs are piled into the main dir.

I propose reorganizing it by introducing dirs with semantic names and moving the parsers' dirs into them.

The proposed dir hierarchy:

config - config files and records.
grammar - DSLs describing other grammars.
- text - grammars like the ones for tools like ANTLR
- ddl - DSLs for describing binary grammars, like protobuf, flatbuffers, capnproto, FlexT and so on
programming - programming and scripting languages, like C++ or bash.
programs - for parsing the output of software, when it is infeasible to use a machine-readable interface.
protocols - for interfacing servers or devices, single command per line, such as SCPI, AT, JTAG consoles, SMTP, stuff like this.
serialization - serialization languages, like JSON, YAML, protobuf and CSV.
embedded - grammars used as parts of other formats, that don't belong to anywhere else
identifiers - various identifiers, like SSNs, phone numbers, VIN-codes, UUID and so on
- network - network addresses: IPv4, IPv6, MAC, IMEI
- products - product numbers, like HTE721010A9E630

The rest of the identifiers should stay in the root until it is decided where they are to be moved.

KvanTTT · 2022-09-16T14:01:53Z

KvanTTT
Sep 16, 2022
Collaborator

I like the suggestion. Also, a similar topic was raised some time ago: #941. But it looks like your structure is more thought out.

0 replies

KvanTTT · 2022-09-16T14:06:04Z

KvanTTT
Sep 16, 2022
Collaborator

Could you please describe the detailed transform for all grammars in the repository? I'll suggest fixes if it's required.

0 replies

teverett · 2022-09-16T16:16:12Z

teverett
Sep 16, 2022
Collaborator

@KOLANICH I've resisted changes like this for quite a while. However, with the number of grammars there are now, I think it might be time. I like the structure you've proposed.

One of the reasons I've been hesitant to accept a change like this is that I am worried it will be a barrier to people finding the grammar they're looking for. Could an index of grammars be generated and published as part of this?

0 replies

KOLANICH · 2022-09-16T17:17:24Z

KOLANICH
Sep 16, 2022
Author

The problem with any index is that it has to be updated. It can be automated, though.

0 replies

kaby76 · 2022-09-17T16:36:47Z

kaby76
Sep 17, 2022

I think the first thing to do is to propose the new directory structure and where each grammar that currently resides here would be moved to. We don't have a "C++" directory but "cpp", and we don't have a Bash grammar at all.

I'm sure there will be several grammars that fit into multiple categories. For example, I have grammars for many parser generator systems, including tree-sitter, which is a JSON structured-document that represents a context-free grammar. What would these all fall under?

I worry that unless there is an index, I won't be able to find a grammar. As @teverett suggests, perhaps what we should have is a generated index page where one would enter search terms. And if I'm working on a particular grammar, I can set up an alias to combine a find and cd to navigate to it at a Bash shell depending on how deep the directory structure is.

Note, the only other realistic grammar database that I know of is Grammar Zoo (index page for the repo). The GitHub repository for this website is https://github.com/slebok/zoo. You can peruse that repo and see how Zaytsev (https://grammarware.net/) organized it. Note, each grammar is described by a meta file (zoo.xml) containing the author, date written, how it was written (e.g., "scraped"), source, DOI for papers, etc. See this example: https://github.com/slebok/zoo/blob/master/zoo/ada/ada83/ichbiah/zoo.xml. The metadata could contain searchable terms, which would be a way of generating the indexing page.

0 replies

KOLANICH · 2022-09-17T18:07:40Z

KOLANICH
Sep 17, 2022
Author

Grammar Zoo

Thanks a lot for letting me know about this project. In fact, I didn't know about Zaytsev's work and have created (well, not really "created"; it is very immature) something similar: my own DSL, intended to be transpiled (and to work after transpilation) into the DSLs of as many different parser generators as possible (a wrapper is also generated so that the built AST can be used uniformly). My main motivation for this proposal was to have the grammars structured, so that I don't go mad when porting your grammars into my DSL.

You can peruse that repo and see how Zaytsev (https://grammarware.net/) organized it.

It seems that the organization relies more on XML files than on directory structure, at least https://github.com/slebok/zoo/tree/master/zoo looks like a pile similar to the one we see in this repo.

The hierarchy I propose for this repo is more influenced by the one we (I'm a contributor of that repo) use in https://github.com/kaitai-io/kaitai_struct_formats/ .

perhaps what we should have is a generated index page where one would enter search terms.

Fortunately, one can enter search terms into GitHub search, and it works without JavaScript, but to be honest, I dislike the ranking: https://github.com/antlr/grammars-v4/search?q=json&type=code&l=ANTLR doesn't have the JSON grammars on the first lines.

0 replies

KOLANICH · 2022-09-17T18:14:58Z

KOLANICH
Sep 17, 2022
Author

<source>
		<author>Jean D. Ichbiah</author>
		<title>Preliminary Ada reference manual; Syntax Summary</title>
		<subtitle>ACM SIGPLAN Notices, Volume 14 Issue 6a</subtitle>
		<date>June 1979</date>
		<specific>pages E-1 to E-5 (142-146)</specific>
		<link>
			<doi>10.1145/956650.956651</doi>
		</link>
</source>

In UG and KS we inline this kind of metadata into grammars themselves under a meta key. In ANTLR it is not possible without the DSL extension, I guess, but we can probably rely on a convention to embed a comment with YAML/NEON/TOML/JSON/HCL2 or any other text language for serialization

0 replies

kaby76 · 2022-09-17T21:40:33Z

kaby76
Sep 17, 2022

In UG and KS we inline this kind of metadata into grammars themselves under a meta key. In ANTLR it is not possible without the DSL extension, I guess, but we can probably rely on a convention to embed a comment with YAML/NEON/TOML/JSON/HCL2 or any other text language for serialization

The .g4 files can have comments (block /* ... */, or line //), so we could embed metadata in a comment. The main problem I have with having this information in the grammar file is that I usually don't want to see all that every time I edit the grammar. I prefer to just see the context-free grammar, nothing else. That said, an IDE can hide all that when editing.
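
As a rough illustration of the comment-embedding idea, here is a minimal sketch. It assumes a hypothetical convention in which a block comment starting with an `@meta` marker holds a JSON object; neither the marker nor the field names are an existing ANTLR or grammars-v4 convention.

```python
import json
import re

# Hypothetical convention: a block comment that starts with "@meta" holds a JSON object.
META_COMMENT = re.compile(r"/\*\s*@meta\s*(\{.*?\})\s*\*/", re.DOTALL)


def extract_meta(grammar_text):
    """Return the metadata dict embedded in a .g4 source, or None if absent."""
    match = META_COMMENT.search(grammar_text)
    if match is None:
        return None
    return json.loads(match.group(1))


# Made-up grammar used only to exercise the function.
SAMPLE_G4 = """\
/* @meta
{"title": "Example grammar", "tags": ["serialization"], "extensions": [".json"]}
*/
grammar Example;
value : 'x' ;
"""

if __name__ == "__main__":
    print(extract_meta(SAMPLE_G4))
```

Because the metadata sits in an ordinary comment, ANTLR itself would ignore it; only an indexing script would need to know the convention.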

0 replies

KOLANICH · 2022-09-17T22:41:46Z

KOLANICH
Sep 17, 2022
Author

Could you please describe the detailed transform for all grammars in the repository?

#2830

0 replies

RossPatterson · 2022-09-27T01:35:36Z

RossPatterson
Sep 27, 2022

I have to say, as a retired long-time programmer, I don't find the proposed organization any better than the flat model we currently have. One person's obvious hierarchy is another person's chaos.

I think we'd be far better off with a structured metadata file in each grammar's root directory, and an automatically-recreated index file in the repo's root based on those files.
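
A minimal sketch of how such an index could be regenerated, assuming each grammar directory carries a hypothetical `meta.json` file (the file name, the `index.json` output name, and the record fields are illustrative, not something this repo defines):

```python
import json
from pathlib import Path

META_NAME = "meta.json"  # hypothetical file name, not an existing repo convention


def build_index(repo_root):
    """Collect every grammar directory's metadata into one sorted list."""
    root = Path(repo_root)
    entries = []
    for meta_path in root.rglob(META_NAME):
        record = json.loads(meta_path.read_text(encoding="utf-8"))
        # Store the directory relative to the repo root so the index is portable.
        record["dir"] = meta_path.parent.relative_to(root).as_posix()
        entries.append(record)
    return sorted(entries, key=lambda r: r["dir"])


def write_index(repo_root, out_name="index.json"):
    """Regenerate the index file in the repo root; returns the entries written."""
    entries = build_index(repo_root)
    out_path = Path(repo_root) / out_name
    out_path.write_text(json.dumps(entries, indent=2) + "\n", encoding="utf-8")
    return entries
```

Running `write_index(".")` from the repo root on every push would keep the index in step with the per-grammar files, which addresses the concern that an index has to be updated by hand.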

0 replies

kaby76 · 2022-10-09T23:48:07Z

kaby76
Oct 9, 2022

I have a preliminary PR for GitHub Actions to generate an index file (grammars.json for now) that calls @parrt's _script/mkindex.py script. It's only kicked off when there is a push into "master", not PRs. The file contains everything needed for lab.antlr.org for selecting a grammar and input file.

Right now, it's a control that offers a flat view of all the grammars. You can try it out here. If we offer a structured view, we're going to need to be able to find a grammar easily with a search term, like "cpp" or "c++", as "one person's ... hierarchy is another person's chaos." I couldn't find anything without setting up some scripts. If we change the structure of the repo, the select control should also probably be redesigned to reflect the file system organization.

2 replies

RossPatterson Oct 10, 2022

About 2/3 of the grammar directories (228 of 321) contain some sort of readme.* file (220 Markdown, 8 plaintext)[1]. Piping those through a Markdown-to-text converter as necessary (e.g., kostyachum/python-markdown-plain-text) and a noise-word remover (e.g., via NLTK), might produce a tolerable corpus for a full-text search mechanism. That's not a lot of additional data - almost all of them are less than 1,000 characters, and most are much less. About 125KB total, before removing noise and formatting, on top of the 555KB for the current index data.

[1] tnsnames also has a README.html file, but its content is identical to the README.md file.
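
The pipeline above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: a crude regex stands in for a real Markdown-to-text converter, a tiny hand-written stop-word set stands in for NLTK's list, and the sample readme texts are made up.

```python
import re
from collections import defaultdict

# Tiny stand-in for a real stop-word list such as NLTK's.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "this", "to"}


def markdown_to_words(markdown):
    """Crudely strip Markdown syntax and return lowercase non-noise words."""
    text = re.sub(r"`{3}.*?`{3}", " ", markdown, flags=re.DOTALL)  # fenced code
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)  # keep link text only
    # Words start with a letter or digit, so "c++" and "c#" survive but "#" does not.
    words = re.findall(r"[a-z0-9][a-z0-9+#]*", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_inverted_index(readmes):
    """Map each word to the set of grammar directories whose readme uses it."""
    index = defaultdict(set)
    for directory, markdown in readmes.items():
        for word in markdown_to_words(markdown):
            index[word].add(directory)
    return index


def search(index, query):
    """Return directories whose readme contains every word of the query."""
    words = markdown_to_words(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)


# Made-up readme contents keyed by grammar directory.
SAMPLE_READMES = {
    "cpp": "# C++ grammar\nParses [ISO C++](https://example.org/cpp) sources.",
    "json": "# JSON\nA grammar for the JSON serialization format.",
}

if __name__ == "__main__":
    idx = build_inverted_index(SAMPLE_READMES)
    print(search(idx, "c++"))
```

With a corpus this small, an in-memory inverted index like this is likely enough; a real search library would only matter if ranking or fuzzy matching were wanted.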

kaby76 Oct 10, 2022

We might need to update the readmes to provide information that's not in them yet. But, if we need a structured doc, we probably could just add it to the pom.xml, as long as we come up with some standardized elements for doc purposes. I don't think Maven will care about elements that it doesn't know the meaning of.

teverett · 2022-10-10T16:24:22Z

teverett
Oct 10, 2022
Collaborator

My thought was to merge the generated markdown into the main repo readme.md, and then link to the appropriate readme files where they exist.

0 replies

RossPatterson · 2022-10-10T19:19:27Z

RossPatterson
Oct 10, 2022

I just looked at all 228 README files. 200 (88%) of them are for grammars where the directory name in this repository is the proper noun for the thing being parsed. That's pretty good - it means that 62% of the 321 grammars in this repository are verifiably stored in the most likely place someone seeking them would look. Here is the list of all 28 grammars with READMEs that aren't in proper-noun directories. Some of them are still pretty obvious.

Do we actually have a problem that is worth discussing and trying to solve?

agc
arithmetic
asm/asm8086
asm/asmMASM
asm/asmZ80
asm/pdp7
calculator
dice
evm-bytecode
fen
fol
jpa
ltl
mckeeman-form
metric
molecule
p
parkingsign
postalcode
propcalc
romannumerals
smiles
tcpheader
telephone
tnt
unreal_angelscript
wat
wln

1 reply

kaby76 Oct 10, 2022

Do we actually have a problem that is worth discussing and trying to solve?

There are other ways to find an appropriate grammar. E.g.

Given a file suffix/extension, what grammars apply?
Which grammars are "optimized"?
Which grammars are good introductions for a novice?

Etc.
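
Queries like these could be answered from per-grammar metadata. A minimal sketch, assuming hypothetical `extensions` and `tags` fields; the sample records below are illustrative only and are not an assessment of the actual grammars:

```python
def _normalize_ext(extension):
    """Compare extensions case-insensitively and without the leading dot."""
    return extension.lower().lstrip(".")


def grammars_for_extension(index, extension):
    """Return the directories of grammars that declare the given file extension."""
    wanted = _normalize_ext(extension)
    return sorted(
        entry["dir"]
        for entry in index
        if wanted in {_normalize_ext(e) for e in entry.get("extensions", [])}
    )


def grammars_with_tag(index, tag):
    """Return the directories of grammars carrying the given classification tag."""
    return sorted(entry["dir"] for entry in index if tag in entry.get("tags", []))


# Illustrative records; the field values are made up for the example.
SAMPLE_INDEX = [
    {"dir": "cpp", "extensions": [".cpp", ".hpp"], "tags": []},
    {"dir": "arithmetic", "extensions": [], "tags": ["novice"]},
    {"dir": "calculator", "extensions": [], "tags": ["novice"]},
]

if __name__ == "__main__":
    print(grammars_for_extension(SAMPLE_INDEX, "CPP"))
    print(grammars_with_tag(SAMPLE_INDEX, "novice"))
```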

kaby76 · 2022-10-12T22:26:42Z

kaby76
Oct 12, 2022

I think the best way to implement this is to use GitHub Pages. A nice website can be implemented for the repo on a branch, say "gh-pages". Then when the repo is updated with new grammars, the gh-pages branch is updated with new information on the grammars, which is "deployed" using GitHub Actions. The basic github.com/antlr/grammars-v4 view would still be what it is, but you can have a UI at https://antlr.github.io/grammars-v4/ that presents the grammars the way you want to organize them. Markdown itself doesn't have tables that can be sorted by column selection. You need JavaScript for that, but GitHub Pages allows you to do that.

1 reply

parrt Oct 13, 2022
Maintainer

An interesting idea. On the other hand, it might make sense simply to fold the index into the existing README... Maybe there is an include mechanism that works in Markdown for GitHub.

teverett · 2022-10-13T17:04:28Z

teverett
Oct 13, 2022
Collaborator

@parrt that was my perspective too; fold it into the readme.

1 reply

parrt Oct 13, 2022
Maintainer

That's certainly the first thing people will notice...

parrt · 2022-10-13T17:04:29Z

parrt
Oct 13, 2022
Maintainer

I'm liking Ken's idea to tag grammars with various classifications, but echoing @RossPatterson, maybe we don't actually have a real problem here. Have we gotten any feedback that suggests people can't find what they need by just digging around in the subdirectories?

0 replies

teverett · 2022-10-13T17:06:12Z

teverett
Oct 13, 2022
Collaborator

Currently, no problem other than that some grammars are hidden under subdirectories such as /asm and /esolang. However, if we did refactor into numerous subdirectories, I'd rather provide an index than ask people to dig through the source tree.

4 replies

parrt Oct 13, 2022
Maintainer

Yeah, I'm thinking a bunch of clicking like a phone tree would be annoying... I actually got mildly annoyed when I looked at the assembly code directory and had to jump inside that haha. Perhaps it's better to create an index that is organized the way we want, or even multiple indexes, rather than change the structure of the grammars. One could argue that any grammar should be accessible with a consistent URL for people to pull from, whether it's a tool like antlr lab or otherwise. For example, it should always be something like languagename/grammarfilename.g4:

https://raw.githubusercontent.com/antlr/grammars-v4/master/algol60/algol60.g4
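
A small sketch of what a consistent URL scheme buys a tool: the raw URL can be computed instead of looked up. The default file name (`<language>.g4`) is an assumption inferred from the single algol60 example above; grammars split into several files would need the file name passed explicitly.

```python
from urllib.parse import quote

RAW_BASE = "https://raw.githubusercontent.com/antlr/grammars-v4/master"


def raw_grammar_url(language, grammar_file=None):
    """Build the raw URL for languagename/grammarfilename.g4."""
    if grammar_file is None:
        # Assumption: a single-file grammar is named after its directory.
        grammar_file = f"{language}.g4"
    return f"{RAW_BASE}/{quote(language)}/{quote(grammar_file)}"


if __name__ == "__main__":
    print(raw_grammar_url("algol60"))
```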

teverett Oct 13, 2022
Collaborator

Exactly :)

kaby76 Oct 13, 2022

An index in the form of a table that can be dynamically sorted by a column can't really be done with GitHub Markdown. And GitHub Markdown has no include for more Markdown. I tried inserting raw HTML into the Markdown files and it's pretty limited; it definitely cannot include <script>, which would be necessary to dynamically sort by column.

We can create static indices for various sorts by these columns by generating a different readme for each, e.g., alphabetic by grammar name, alphabetic by directory name, date created in repo, date last modified, "target agnostic"?, "is java target available?", "is c# target available?", "contains no target code?", etc.
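
A minimal sketch of the static-index idea: one pre-sorted Markdown table per column, so no JavaScript is needed. The column names, the output file names, and the sample rows (including the dates) are made up for illustration.

```python
def markdown_table(rows, columns, sort_by):
    """Render rows as a Markdown table sorted by one column (case-insensitive)."""
    ordered = sorted(rows, key=lambda r: str(r[sort_by]).lower())
    lines = [
        "| " + " | ".join(columns) + " |",
        "| " + " | ".join("---" for _ in columns) + " |",
    ]
    for row in ordered:
        lines.append("| " + " | ".join(str(row[c]) for c in columns) + " |")
    return "\n".join(lines)


def render_static_indices(rows, columns):
    """Return {file name: Markdown text}, one pre-sorted table per column."""
    return {f"index-by-{c}.md": markdown_table(rows, columns, c) for c in columns}


# Illustrative rows; names and dates are invented for the example.
COLUMNS = ["name", "dir", "modified"]
SAMPLE_ROWS = [
    {"name": "Z80 assembler", "dir": "asm/asmZ80", "modified": "2022-01-01"},
    {"name": "Arithmetic", "dir": "arithmetic", "modified": "2022-03-01"},
]

if __name__ == "__main__":
    for file_name, text in render_static_indices(SAMPLE_ROWS, COLUMNS).items():
        print(file_name)
        print(text)
```

Each generated file could link to the others in its header row, which gives a click-to-sort feel while staying within plain Markdown.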

In the meanwhile, I'm going to just play around with Github Pages, because that allows full html/js website code.

KvanTTT Oct 13, 2022
Collaborator

definitely cannot include <script>

I'm afraid it's unsafe.

teverett · 2022-10-13T20:11:31Z

teverett
Oct 13, 2022
Collaborator

I could be ok with an alphabetical table. While I don't support asking people to recurse through directories, a Ctrl-F search on readme.md seems reasonable?

0 replies
