Update pytorch scraper, include various 2.x versions #2513


Merged: 2 commits, Jun 1, 2025

blahgeek
Contributor

  1. Various 2.x versions are included separately. PyTorch versions are not backward compatible with each other and differ in which CUDA versions (among other things) they support, so people may stay on a specific version for an extended period of time.

  2. Removed the type replacement table for get_type. Instead, the type is taken directly from the breadcrumbs. IMO this produces better results that match the index on the original website (the left-side menu on docs.pytorch.org). Also, the TYPE_REPLACEMENT table was opinionated and hard to maintain across versions.

  3. Always include the default entry (removed the include_default_entry? function). I don't see a downside to this. Previously some pages were missing because of it (e.g. torchrun: https://docs.pytorch.org/docs/1.13/elastic/run.html).
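
To illustrate point 2, here is a minimal sketch of deriving a type from a breadcrumb trail rather than from a replacement table. It is not the code from this PR: the method name `type_from_breadcrumbs`, the `'Docs'` root label, the `'Miscellaneous'` fallback, and the breadcrumb layout are all assumptions made for illustration.

```ruby
# Hypothetical sketch, not the scraper's actual code: pick an entry's type
# from its breadcrumb labels instead of a hand-maintained lookup table.
#
# `breadcrumbs` is assumed to be an array of labels, outermost first,
# ending with the page's own title, e.g. ['Docs', 'torch.nn', 'Linear'].
def type_from_breadcrumbs(breadcrumbs, fallback: 'Miscellaneous')
  labels = breadcrumbs.map(&:strip).reject(&:empty?)

  # Drop the last label (the page itself); the rest is its section path.
  sections = labels[0...-1]

  # Drop a generic root label if present ('Docs' is an assumed name).
  sections = sections.drop(1) if sections.first == 'Docs'

  # Use the top-level section as the type, or the fallback for
  # pages that sit directly under the root.
  sections.first || fallback
end

puts type_from_breadcrumbs(['Docs', 'torch.nn', 'Linear'])
puts type_from_breadcrumbs(['Docs', 'torchrun (Elastic Launch)'])
```

With this approach the type follows whatever grouping the upstream site uses for a given version, so nothing needs to be updated per release. The trade-off is that the result depends on the breadcrumb markup staying consistent.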

  • Updated the versions and releases in the scraper file
  • Ensured the license is up-to-date
  • Ensured the icons and the SOURCE file in public/icons/your_scraper_name/ are up-to-date if the documentation has a custom icon
  • Ensured self.links contains up-to-date urls if self.links is defined
  • Tested the changes locally to ensure:
    • The scraper still works without errors
    • The scraped documentation still looks consistent with the rest of DevDocs
    • The categorization of entries is still good
@blahgeek blahgeek requested a review from a team as a code owner May 31, 2025 03:23
@blahgeek
Contributor Author

cc @ArkciaTheDragon for modifying some code from #2137

@ArkciaTheDragon
Contributor

Tested on PyTorch 2.7 — looks cool!

Also +1 to the suggestion of using separate PyTorch versions. For context: PyTorch 2.7 generates only 16.2MB of offline data, which is actually a bit smaller than Python 3.12’s ~18MB. I hope hosting the versions separately won’t be an issue in terms of space.

Contributor

@simon04 simon04 left a comment

Thank you!

@simon04 simon04 merged commit fafde1b into freeCodeCamp:main Jun 1, 2025
2 checks passed

3 participants