Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Licensee and Licensed gems #3982

Merged
merged 19 commits into from
Apr 3, 2018
Merged

Update Licensee and Licensed gems #3982

merged 19 commits into from
Apr 3, 2018

Conversation

lildude
Copy link
Member

@lildude lildude commented Jan 10, 2018

We're lagging a low way behind on these two gems and regularly encounter problems with the incorrect license being detected on new grammars added.

This PR updates the Licensee and Licensed gems to newer versions and updates our usage.

@lildude
Copy link
Member Author

lildude commented Jan 10, 2018

@jonabc & @benbalter: if either of you have a few spare cycles, I could really do with a little bit of your expert knowledge on optimising our usage and getting our tests passing.

At the moment, the tests are failing because Licensee isn't correctly detecting the license of quite a few grammars. Most of these grammars have BSD 3 clause licenses.

If I remove the cached license and then re-add it using our script/licensed script (it needs improving as suggested by @jonabc), the license is correctly detected.

Any pointers you can offer would be greatly appreciated. 🙇

@lildude
Copy link
Member Author

lildude commented Jan 10, 2018

I've been experimenting using the go-tmbundle grammar as it comes up in our failing tests.

The latest Licensee run from the command line reports:

$ bundle exec bin/licensee ~/github/linguist/vendor/grammars/go-tmbundle
License: Other
Matched files: ["LICENSE"]
LICENSE:
  Content hash: 4cfc7ce12de920ccc836bbab2d748151d5ba7e38
  Attribution: Copyright (c) 2009, Jim Dovey
  License: Other
  Closest licenses:
    * NCSA similarity: 63.33%
$

... but if you look at the license file, you'll see it's a BSD 3 Clause license and this has been detected correctly on GitHub.com

@jonabc
Copy link

jonabc commented Jan 10, 2018

@lildude 👋 sure I'm happy to help.

If I remove the cached license and then re-add it using our script/licensed script (it needs improving as suggested by @jonabc), the license is correctly detected.

This part is a little weird to me, running licensed cache should have equivalent cached file output regardless of whether the files already exist or not.

Could you run me through the workflow that is failing and the workflow for updating cached license metadata?

@lildude
Copy link
Member Author

lildude commented Jan 10, 2018

Could you run me through the workflow that is failing and the workflow for updating cached license metadata?

Sure, but this isn't really about workflow at the moment... this is about our tests verifying what we've already got.

Note: all links and output are for my branch in this PR.

I'm going to concentrate on just one test for the moment as I suspect fixing this will resolve one of the other failing tests too.

In our tests in test/test_grammars.rb, we verify the grammars include a license and that they're valid, etc. With bumping Licensee and Licensed, the test_submodules_have_approved_licenses test

https://github.com/github/linguist/blob/96f87042ab358eddb5e2cfd2c062e3fcadbf6ecc/test/test_grammars.rb#L118-L126

... is now finding several grammars no longer have approved licenses as they're all now detected as "other" as you can see in the test failure at https://travis-ci.org/github/linguist/jobs/327212197#L1462

Sidenote: I needed to do a little tweaking of submodule_license() in this PR to take into account the changes introduced in Licensee.

If I manually run Licensee against that same directory, it too reports "other", as I showed earlier.

I see the same thing if I do it manually in irb:

irb(main):001:0> require 'licensee'
=> true
irb(main):002:0> Licensee::VERSION
=> "9.7.0"
irb(main):003:0> Licensee.license "vendor/grammars/go-tmbundle/"
=> #<Licensee::License key=other>
irb(main):004:0>

So to me this says Licensee has changed. So I thought I'd check what Licensed does, as I know it uses Licensee.

I removed the cached license .txt file, re-cached it, and checked it, and found the correct license has been detected and added to the frontmatter of the cached file:

$ head -5 vendor/licenses/grammar/go-tmbundle.txt
---
type: grammar
name: go-tmbundle
license: bsd-3-clause
---
$ rm vendor/licenses/grammar/go-tmbundle.txt
$ ls -l vendor/licenses/grammar/go-tmbundle.txt
ls: vendor/licenses/grammar/go-tmbundle.txt: No such file or directory
$ bundle exec script/licensed -m /Users/lildude/github/linguist/vendor/grammars/go-tmbundle
Caching licenses for grammar dependencies:
caching /Users/lildude/github/linguist/vendor/grammars/go-tmbundle
  Caching go-tmbundle ()
caching /Users/lildude/github/linguist/vendor/grammars/go-tmbundle
License caching complete!
caching /Users/lildude/github/linguist/vendor/grammars/go-tmbundle
* grammar dependencies: 1
$ bundle exec script/licensed -m /Users/lildude/github/linguist/vendor/grammars/go-tmbundle verify
caching /Users/lildude/github/linguist/vendor/grammars/go-tmbundle
Verifying licenses for 1 dependencies
.
1 dependencies checked, 0 warnings found.
$ head -5 vendor/licenses/grammar/go-tmbundle.txt
---
type: grammar
name: go-tmbundle
license: bsd-3-clause
---
$

So for some reason, calling Licensee directory to determine the license results in "other" whilst using Licensed results in the expected BSD 3 Clause license and I can't work out why.

@jonabc
Copy link

jonabc commented Jan 10, 2018

@lildude ah, I didn't realize the context was in using licensee directly. thanks for the clarification!

taking a look

script/licensed Outdated
@@ -40,7 +40,7 @@ OptionParser.new do |opts|
end
end.parse!

source = Licensed::Source::Filesystem.new(module_path || "vendor/grammars/*/", type: "grammar")
source = Licensed::Source::Filesystem.new(module_path || "#{File.expand_path("../", File.dirname(__FILE__))}/vendor/grammars/*/", type: "grammar")
Copy link

@jonabc jonabc Jan 10, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI in local testing, I had to remove the trailing "/" from the file path to properly find license data 🤷‍♂️. Otherwise my cached files looked like

---
name: <name>
type: grammars
license: none
---

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Once we sort out the Licensee part of things, I'd like to steal a bit more of your time and take you up on your suggestion of switching this to use Licensed::Source::Manifest, but only when you've got the time - I still need to give it a go too 😄

@jonabc
Copy link

jonabc commented Jan 10, 2018

a few things going on here:

Using the -m flag on script/licensed runs git checkout -- vendor/licenses/grammar/ at the end. this is undoing whatever changes licensed made locally and restoring the original (correct) file from git.

If I run the same command without -m then it finds the license as other 👍 . I made a review comment for an additional change I needed locally to find data.

➜  linguist git:(lildude/update-licensee-d) ✗ bundle exec script/licensed
Caching licenses for grammar dependencies:
...
caching /Users/jonabc/github/linguist/vendor/grammars/go-tmbundle
...
  Caching go-tmbundle ()
...
License caching complete!
* grammar dependencies: 1
➜  linguist git:(lildude/update-licensee-d) ✗ head -6 vendor/licenses/grammar/go-tmbundle.txt
---
type: grammar
name: go-tmbundle
license: other
---
Copyright (c) 2009, Jim Dovey

Second, the reason that the file isn't being detected as bsd-3-clause is because it's got a trailing Portions based on Allan Odgaard's C bundle for TextMate.. That additional content is causing the matcher to detect that LICENSE is different from the standard bsd-3-clause license by more than formatting.

@benbalter this has turned into a question of licensee not detecting the right license. Do you have any ideas on how to improve this scenario?

@lildude I have some ideas for how to make your situation easier but don't have more time at the moment to think through everything. will get back to you ASAP on that

@lildude
Copy link
Member Author

lildude commented Jan 10, 2018

Using the -m flag on script/licensed runs git checkout -- vendor/licenses/grammar/ at the end. this is undoing whatever changes licensed made locally and restoring the original (correct) file from git.

Doh!! Of course. 😊

@lildude I have some ideas for how to make your situation easier but don't have more time at the moment to think through everything. will get back to you ASAP on that

NP. Thanks for taking the time today.

@benbalter
Copy link
Contributor

benbalter commented Jan 18, 2018

@benbalter this has turned into a question of licensee not detecting the right license. Do you have any ideas on how to improve this scenario?

👋 sorry for missing this @mention.

As @jonabc mentioned:

Second, the reason that the file isn't being detected as bsd-3-clause is because it's got a trailing Portions based on Allan Odgaard's C bundle for TextMate.. That additional content is causing the matcher to detect that LICENSE is different from the standard bsd-3-clause license by more than formatting.

That is an intended behavior. Computers aren't smart enough (yet) to parse that trailing statement and know if it's legally significant or not, so it says it's not confidence enough to call the license BSD-3-clause. The idea being, that a human can review it, and if the content is not legally significant, you can whitelist the license via licensed's config.

@lildude
Copy link
Member Author

lildude commented Jan 19, 2018

That is an intended behavior. Computers aren't smart enough (yet) to parse that trailing statement and know if it's legally significant or not, so it says it's not confidence enough to call the license BSD-3-clause. The idea being, that a human can review it, and if the content is not legally significant, you can whitelist the license via licensed's config.

Thanks for the explanation @benbalter. My main issue is this appears to be quite a significant change from the behaviour seen with the antiquated version Linguist is using at the moment, but your explanation certainly makes sense.

I'll go the route of whitelisting those that are flagged. Thanks.

@Alhadis
Copy link
Collaborator

Alhadis commented Jan 19, 2018

Computers aren't smart enough (yet) to parse that trailing statement

We don't need smarter computers, only dumber lawyers.

In fact, the entire justice system would make more sense if "Law" was a programming language that compiled down to ones and zeroes dictating if somebody receives a legal arse-kick.

@benbalter
Copy link
Contributor

this appears to be quite a significant change from the behaviour seen with the antiquated version Linguist is using at the moment, but your explanation certainly makes sense.

Ahh. Didn't realize you were coming from such an old version. Yes, the underlying mechanism to match licenses has improved a lot, in part, to catch, for example, if in the above example the line said "Parts of this project licensed under GPL-v3". What you're describing is an intended behavior, even if a (breaking) change from previous versions.

@pchaigno
Copy link
Contributor

My main issue is this appears to be quite a significant change from the behaviour seen with the antiquated version Linguist is using at the moment, but your explanation certainly makes sense.
I'll go the route of whitelisting those that are flagged. Thanks.

Don't we already have a whitelist for these licenses? I'd much prefer to whitelist a hash of the current license rather than a whole repository. At least, if a repository starts using a copyleft license, we'll see it.

@lildude
Copy link
Member Author

lildude commented Jan 19, 2018

Don't we already have a whitelist for these licenses?

Yup, but only for some of the licenses caught by the old behaviour. The issue I've found is the newer Licensee is catching issues with licenses that earlier versions didn't mainly because peeps have deviated from the standard license text. For example, the go-tmbundle grammar I've referenced in my comments has this teeny weeny modification at the end of the license which previously wasn't detected:

Portions based on Allan Odgaard's C bundle for TextMate.

Others flagged by the test have similar deviations from the standard license text.

I'd much prefer to whitelist a hash of the current license rather than a whole repository.

I don't think this aspect changes. AFAICS I'll need to add the hashes for the newly flagged licenses and we may need to do this more often in future if peeps insist on adding clauses or comments that steer them away from the standard license text.

"966085b715baa0b0b67b40924123f92f90acd0ba", # sublime-shen
"3df4ef028c6384b64bc59b8861d6c52093b2116d", # sublime-text-ox
"fd47e09f1fbdb3c26e2960d0aa2b8535bbc31188", # sublimetext-cuda-cpp
"93360925b1805be2b3f0a18e207649fcb524b991", # Std license in README.md of many TextMate grammars like abap.tmbundle
Copy link
Member Author

@lildude lildude Jan 20, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these changes are because Licensee now strips out "All rights reserved" text and markup before calculating the hash. This results in better license detection and also detection of identical licenses in the README of a lot of the TextMate grammars like that in abap.tmbundle. It also means that some of the old entries that remain now have a new hash.

@benbalter
Copy link
Contributor

@lildude any feedback/changes/things we can do within Licensee to make your life easier, please let me know. One thing to note, if you clone Licensee locally and script/bootstrap, you can use script/git-repo <GITHUB URL> to get some diagnostic information about where the license may deviate. If you append, e.g., --license mit to the command, it will also diff the in-repo license to the version it expects (or you can use script/diff directly). e.g.:

➜  licensee git:(master) script/git-repo AlanQuatermain/go-tmbundle --license bsd-3-clause
License: Other
Matched files: ["LICENSE"]
LICENSE:
  Content hash: 4cfc7ce12de920ccc836bbab2d748151d5ba7e38
  Attribution: Copyright (c) 2009, Jim Dovey
  License: Other
  Closest licenses:
    * NCSA similarity: 63.33%
Comparing to BSD 3-Clause "New" or "Revised" License:
  Input length: 1484
  License length: 1410
  Similarity: 96.23%
diff --git a/LICENSE b/LICENSE
index 73e3883..beafdb0 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,19 +1,20 @@
redistribution and use in source and binary forms, with or without modification,
are permitted provided that the following conditions are met: {+-+} redistributions
of source code must retain the above copyright notice, this list of conditions
and the following disclaimer. {+-+} redistributions in binary form must reproduce
the above copyright notice, this list of conditions and the following disclaimer
in the documentation and/or other materials provided with the distribution. {+-+}
neither the name of the copyright holder nor the names of [-its-]{+the software's+}
contributors may be used to endorse or promote products derived from this
software without specific prior written permission. this software is provided by
the copyright holders and contributors "as is" and any express or implied
warranties, including, but not limited to, the implied warranties of
merchantability and fitness for a particular purpose are disclaimed. in no event
shall the copyright holder or contributors be liable for any direct, indirect,
incidental, special, exemplary, or consequential damages including, but not
limited to, procurement of substitute goods or services; loss of use, data, or
profits; or business interruption however caused and on any theory of liability,
whether in contract, strict liability, or tort including negligence or otherwise
arising in any way out of the use of this software, even if advised of the
possibility of such damage. {+portions based on allan odgaard's c bundle for+}
{+textmate.+}

@lildude
Copy link
Member Author

lildude commented Jan 22, 2018

Thanks @benbalter. I'm still tinkering locally with Licensee to see what I can do with it. Thanks for these new deets... I hadn't found these yet.

Your most recent update does bring up a question though:

Why is "Closest licenses" returning a license with a much lower similarity than the license you've explicitly asked to check?

I've not dug into how this is done in Licensee yet, but on the face of it, I'd have expected "BSD 3-Clause" to more likely have been identified as closest than "NCSA".

@benbalter
Copy link
Contributor

Why is "Closest licenses" returning a license with a much lower similarity than the license you've explicitly asked to check?

We used to use the much more expensive levenshtein distance to match licenses. To speed things up, we compared license lengths and found the closest N by number of characters and only compared those licenses. We now are using Dice which is much faster, but we are still limiting potential matches due to limitations with how wordset matching may allow for false positives. That particular debug output should probably match all licenses and output the best match, not just those that are similar in length.

default for licensed v1.0 changed from `vendor/licenses` to `.licenses`
default configuration file location changed from `vendor/licenses/config.yml` to `.licensed.yml`
@jonabc
Copy link

jonabc commented Mar 7, 2018

@lildude I've updated the licensed dependency to 1.0.0 with minimum changes needed to keep things working

v1 also brings some other changes that I haven't made here

  • default/recommended license file directory is now <repo root>/.licenses
  • default/recommended licensed config file is now <repo root>/.licensed.yml

Would you like me to push updates for these changes as well, or leave it as-is?

@lildude
Copy link
Member Author

lildude commented Mar 10, 2018

@lildude I've updated the licensed dependency to 1.0.0 with minimum changes needed to keep things working

🙇 Really appreciate it.

v1 also brings some other changes that I haven't made here

  • default/recommended license file directory is now /.licenses
  • default/recommended licensed config file is now /.licensed.yml

Would you like me to push updates for these changes as well, or leave it as-is?

On the one hand I think leave as-is, but on the other hand, we probably should be one of the leaders in the usage of Licensed now it's public.

@Alhadis @pchaigno what are your thoughts on this idea?

@lildude lildude changed the title WIP: Update Licensee and Licensed gems Update Licensee and Licensed gems Mar 13, 2018
@lildude lildude mentioned this pull request Mar 13, 2018
15 tasks
@lildude
Copy link
Member Author

lildude commented Mar 28, 2018

Ping @Alhadis @pchaigno what are your thoughts on this idea? ☝️

@lildude
Copy link
Member Author

lildude commented Apr 3, 2018

Would you like me to push updates for these changes as well, or leave it as-is?

On the one hand I think leave as-is, but on the other hand, we probably should be one of the leaders in the usage of Licensed now it's public.

I'm going to throw this out there and say, lets keep things as-is for the moment. We can address this in the future if we decide that's the route we want to take. For the moment, I'd like to get this merged so I can make a new release 🔜

Copy link

@jonabc jonabc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@lildude lildude merged commit f452612 into master Apr 3, 2018
@lildude lildude deleted the lildude/update-licensee-d branch April 3, 2018 15:35
@lildude lildude mentioned this pull request Apr 12, 2018
2 tasks
@github-linguist github-linguist locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants