Add language support for Simula #7025

Open: eirslett wants to merge 8 commits into base: main

@eirslett eirslett commented Sep 2, 2024

Simula is a language of historic and academic interest. It sprang from ALGOL, and went on to inspire next-generation languages like Java, C++ and C#.

Simula is considered the first object-oriented programming language. As its name suggests, the first Simula version, by 1962, was designed for doing simulations; Simula 67, though, was designed to be a general-purpose programming language and provided the framework for many of the features of object-oriented languages today.

It was invented by Ole-Johan Dahl (1931-2002) and Kristen Nygaard (1926-2002) at the Norwegian Computing Center. Both received the Turing Award "for ideas fundamental to the emergence of object oriented programming". It's considered one of the most impactful Norwegian contributions to the field of computer science.

Personally, I think these historical languages are incredibly exciting, and learning about their design decisions teaches us a lot about the design of newer programming languages! Modern Simula tools, like syntax highlighters and compilers for recent hardware, enable new generations of software developers to take a first-hand look at how programming languages worked in the past.

Checklist:

@eirslett eirslett requested a review from a team as a code owner September 2, 2024 19:17
Member

@lildude lildude left a comment

🤔 your samples aren't real world examples… they're the kind of thing you write when learning the language. Your search query also wouldn't find them which brings me to my next point… there are a lot of .sim files on GitHub and I don't think they're all Simula. Merging this PR as-is will result in all .sim files being classified as Simula.

To prevent this, please identify at least one other language that uses this extension and add support for it as part of this PR. A heuristic should also be implemented to differentiate the two.

Author

eirslett commented Sep 4, 2024

Sorry about my long answer - I tried to shorten it as much as I could!

Code examples

🤔 your samples aren't real world examples… they're the kind of thing you write when learning the language.

Yup, you're absolutely right! I followed the example of COBOL, Fortran and (to an extent) Scala, which also have toy examples.

The issue is licensing - Simula companies have ceased to exist, and code authors may already have passed away. And some of the code is copylefted. I wrote the sample code for the purpose of this PR, to make sure to have something MIT-licensed.
If you would like, I could write a couple more "complex" examples? Or I could reach out to some people and ask whether it's possible to get a permissive license.

Existing .sim files on GitHub

Your search query also wouldn't find them which brings me to my next point… there are a lot of .sim files on GitHub and I don't think they're all Simula. Merging this PR as-is will result in all .sim files being classified as Simula.

I had a look, it seems like the stuff people put in .sim files is very diverse:

To prevent this, please identify at least one other language that uses this extension and add support for it as part of this PR.

I couldn't really find any sort of language here that is more prevalent than the others, is there any in particular? My suggestion is to detect Simula with a heuristic - otherwise just leave the files as plaintext like today?

Heuristic

A heuristic should also be implemented to differentiate the two.

There are a couple of heuristics we could choose, what would be the better option?

1) Simple begin and end match

All Simula programs have to include BEGIN and END (case-insensitive). It's a simple heuristic, but one could get a couple of false positives.

Running a GitHub code search with this heuristic, 99 % of the results are Simula, and the few that are not are similar enough that the Simula TextMate grammar will look OK.
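
As a rough illustration of option 1, here is a minimal Python sketch of a begin/end check. The function name `looks_like_simula` and the sample strings are made up for this example, and the pattern is only an approximation of what a real heuristic might use. The last sample shows the kind of false positive mentioned above.

```python
import re

# Hypothetical pattern for illustration: the keywords BEGIN and END as whole
# words, in that order, in any letter case, possibly on different lines.
BEGIN_END = re.compile(r"\bbegin\b.*?\bend\b", re.IGNORECASE | re.DOTALL)


def looks_like_simula(source: str) -> bool:
    """Return True if the text contains BEGIN followed later by END."""
    return BEGIN_END.search(source) is not None


# Made-up sample inputs.
simula = 'Begin\n   OutText("Hello");\n   OutImage;\nEnd;\n'
json_like = '{"name": "circuit", "nodes": [1, 2, 3]}'
prose = "We begin the run here and end it there."

print(looks_like_simula(simula))     # True
print(looks_like_simula(json_like))  # False
print(looks_like_simula(prose))      # True, a false positive
```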

2) Crazy regex

All Simula programs have to start with this pattern:

/^(\s*?\n)*?((%.*?\n|\s*!.*?;|\s*comment\b.*?;|\s*--.*?\n)|(\s*?\n)*?)*\s*(((\b\p{L}[\p{L}0-9_]*\b((\s+)|(\(.*?\)\s*)))?\bbegin\b.*?\bend\b)|(?:(\bexternal\b)|(\bclass\b)|(\bprocedure\b)|(\bref\b)|(\bboolean\b)|(\bcharacter\b)|(\bshort\b)|(\binteger\b)|(\blong\b)|(\breal\b)|(\btext\b)|(\barray\b)|(\b\p{L}[\p{L}0-9_]*\b\s*\bclass\b)).*?\bbegin\b.*?\bend\b)/is

It's based on the official Simula grammar. I wasn't able to run the GitHub code search with it - it's probably too complex. PCRE also needs to run with the i flag (case-insensitive match) and the s flag (so that . matches newline characters).
Another drawback is that it might not detect Simula programs with spelling mistakes in them.
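
To show why both flags matter, here is a small Python sketch. It does not use the full pattern above (Python's built-in `re` module does not support `\p{L}`); `SIMPLIFIED` is a deliberately reduced, hypothetical stand-in that only allows leading `! ... ;` or `comment ... ;` comments before `begin ... end`.

```python
import re

# Deliberately simplified stand-in for the full pattern (an assumption made
# for this example): optional leading comments, then BEGIN ... END.
SIMPLIFIED = r"^(?:\s*(?:!|comment\b)[^;]*;)*\s*begin\b.*?\bend\b"

# Made-up sample program with upper-case keywords spread over several lines.
program = "! A tiny program;\nBEGIN\n   OutImage;\nEND;\n"

# Without the i flag, "begin" does not match "BEGIN".
no_flags = re.search(SIMPLIFIED, program)
# With only the i flag, "." cannot cross the newline between BEGIN and END.
only_i = re.search(SIMPLIFIED, program, re.IGNORECASE)
# With both i and s, the whole program is matched.
both = re.search(SIMPLIFIED, program, re.IGNORECASE | re.DOTALL)

print(no_flags is None)  # True
print(only_i is None)    # True
print(both is not None)  # True
```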

3) Use tree-sitter (maybe unrealistic for now)

One could run the tree-sitter parser on the code, and see what percentage of the tokens are valid Simula code and what percentage are syntax errors. I think that would be a better heuristic. But given that Linguist doesn't support tree-sitter directly yet, maybe it's something to think of for the future?
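
The error-percentage idea could look roughly like the Python sketch below. It does not call tree-sitter at all: `Token` is a hypothetical stand-in for a leaf node of a parse tree, and the 0.2 threshold is an arbitrary assumption chosen for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Stand-in for one leaf node of a parse tree (hypothetical type)."""
    text: str
    is_error: bool


def error_ratio(tokens: List[Token]) -> float:
    """Fraction of tokens the parser flagged as syntax errors."""
    if not tokens:
        return 1.0  # nothing parsed: treat the file as not recognisable
    return sum(1 for t in tokens if t.is_error) / len(tokens)


def probably_simula(tokens: List[Token], max_error_ratio: float = 0.2) -> bool:
    """Accept the file as Simula when few enough tokens are errors."""
    return error_ratio(tokens) <= max_error_ratio


# Made-up parse results: a clean Simula program and a JSON-like file.
clean = [Token("Begin", False), Token("OutImage", False),
         Token(";", False), Token("End", False)]
noisy = [Token("{", True), Token('"a"', True), Token(":", True),
         Token("1", False), Token("}", True)]

print(probably_simula(clean))  # True
print(probably_simula(noisy))  # False
```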


Again, sorry for the long response! Please let me know which option I should go with!

Member

lildude commented Sep 4, 2024

And some of the code is copylefted

We accept copyleft licensed samples (we have plenty of GPL samples for example) as we don't ship them as-is... we ship them as tokens which are used to train the classifier.

I couldn't really find any sort of language here that is more prevalent than the others, is there any in particular? My suggestion is to detect Simula with a heuristic - otherwise just leave the files as plaintext like today?

The only preference is for the other language to also meet our usage requirements, so go for the other language with the most files. We can't simply implement a heuristic on its own, as a heuristic needs two or more languages to apply to, hence the need to identify at least one other language.

There are a couple of heuristics we could choose, what would be the better option?

Let's start simple. It's good enough. It can be refined as people add support for the other languages.

Author

eirslett commented Sep 4, 2024

Ok, I added the simple begin/end heuristic.
And added special cases for Markdown.sim and Dockerfile.sim, there seemed to be a few of those.
It looks like a lot of the other .sim files are some sort of DSL that are JSON/YAML-like. I tried using YAML as a fallback - the YAML grammar is quite ok for simple general-purpose data formats.
What do you think?
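
The order of decisions described in this comment could be sketched in Python as follows. This is not how Linguist is implemented; `classify_sim_file`, the filename table, and the language names it returns are assumptions made for illustration only.

```python
import re
from pathlib import PurePosixPath

# Assumed mapping for the two special-cased filenames mentioned above.
SPECIAL_FILENAMES = {
    "Markdown.sim": "Markdown",
    "Dockerfile.sim": "Dockerfile",
}

# Approximation of the simple begin/end heuristic.
BEGIN_END = re.compile(r"\bbegin\b.*?\bend\b", re.IGNORECASE | re.DOTALL)


def classify_sim_file(path: str, source: str) -> str:
    """Pick a language for a .sim file: special names, then Simula, then YAML."""
    name = PurePosixPath(path).name
    if name in SPECIAL_FILENAMES:
        return SPECIAL_FILENAMES[name]
    if BEGIN_END.search(source):
        return "Simula"
    return "YAML"  # fallback for everything else


print(classify_sim_file("docs/Markdown.sim", "# Title"))    # Markdown
print(classify_sim_file("src/hello.sim", "Begin\nEnd;"))    # Simula
print(classify_sim_file("cfg/model.sim", "key: value\n"))   # YAML
```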

Member

@lildude lildude left a comment

See inline comment. We also need a test for your heuristic.

@@ -8245,6 +8259,7 @@ YAML:
   - ".mir"
   - ".reek"
   - ".rviz"
+  - ".sim"
Member

We need a sample for this language if you're going to add this extension here.

Author

We need a sample for this language if you're going to add this extension here.

I'm not 100 % sure what you meant - was this about the Simula language, or about the YAML fallback?

I made 2 updates to the pull request:

  1. Pushed 5 more Simula language code samples
  2. Added an example of an existing .sim file on GitHub which would make sense to code-highlight as YAML (it's JSON, but all JSON is valid YAML).

It would be nice if Linguist supported a "noop fallback" configuration, where - if we can't positively identify a .sim file as valid Simula code - it would just be treated as an unknown text format.

Member

I was referring to the YAML entry. You added it but didn't add a sample.

I see you've now added samples but you've now also added samples that are too big. Please remove any that are suppressed in the diff and any that aren't real world uses.

Author

Ok, I removed the samples!

lib/linguist/languages.yml (outdated, resolved)
Member

lildude commented Nov 26, 2024

Oh yes and we need a link to the source of each sample and the licence for each in the initial PR template.

- language: Simula
  pattern: '(?i)\bbegin\b.*?\bend\b'
- language: YAML
  pattern: '^[\{\[]'
Member

We also need a test for this heuristic
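
As a stand-alone illustration of what such a test might check, the Python sketch below applies the two patterns from the diff in order and returns the first language whose pattern matches. It is not Linguist's test harness; `disambiguate` and the sample inputs are hypothetical, and Python's `re` may behave differently from the regex engine Linguist uses.

```python
import re
from typing import Optional

# The two rules from the diff above, tried in order (first match wins here,
# which is an assumption of this sketch).
RULES = [
    ("Simula", re.compile(r"(?i)\bbegin\b.*?\bend\b")),
    ("YAML", re.compile(r"^[\{\[]")),
]


def disambiguate(source: str) -> Optional[str]:
    """Return the first language whose pattern matches, or None."""
    for language, pattern in RULES:
        if pattern.search(source):
            return language
    return None


print(disambiguate("begin OutImage; end"))           # Simula
print(disambiguate('{"a": 1}'))                      # YAML
print(disambiguate("key: value"))                    # None
print(disambiguate("Begin\n   OutImage;\nEnd;\n"))   # None
```

The last case returns `None` because, in Python's `re`, `.` does not match a newline unless the `s` (DOTALL) flag is set, so the Simula pattern as written only matches when `begin` and `end` are on the same line. Whether the same holds inside Linguist depends on how it compiles the pattern, which this sketch does not model.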
