Implement Zsh scraper #2519

Merged
merged 4 commits into from
Jun 27, 2025
20 changes: 20 additions & 0 deletions lib/docs/filters/zsh/clean_html.rb
@@ -0,0 +1,20 @@
module Docs
  class Zsh
    class CleanHtmlFilter < Filter
      def call
        css('table.header', 'table.menu', 'hr').remove

        # Remove indices from headers.
        css('h1', 'h2', 'h3').each do |node|
          node.content = node.content.match(/^[\d\.]* (.*)$/)&.captures&.first
        end

        css('h2.section ~ a').each do |node|
          node.next_element['id'] = node['name']
        end

        doc
      end
    end
  end
end
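For readers following the header clean-up, here is a minimal standalone sketch of what the index-stripping pattern does to a header string. It runs without Nokogiri; the helper name `strip_index`, the constant `HEADER_INDEX`, and the sample header strings are assumptions made for illustration and are not part of the PR.

```ruby
# Same pattern as in the filter: an optional run of digits and dots,
# one space, then the remaining header text.
HEADER_INDEX = /^[\d\.]* (.*)$/

# Hypothetical helper: returns the header text without its leading index,
# or nil when the pattern does not match.
def strip_index(text)
  text.match(HEADER_INDEX)&.captures&.first
end

p strip_index('22.7 The zsh/zftp Module') # prints "The zsh/zftp Module"
p strip_index('9 Functions')              # prints "Functions"
p strip_index('Description')              # prints nil (no leading index and space)
```

The last call suggests that a header with no leading index would yield `nil`, so the filter appears to rely on every `h1`, `h2`, and `h3` in the scraped pages carrying an index.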
41 changes: 41 additions & 0 deletions lib/docs/filters/zsh/entries.rb
@@ -0,0 +1,41 @@
module Docs
  class Zsh
    class EntriesFilter < Docs::EntriesFilter
      def get_name
        extract_header_text(at_css('h1.chapter').content)
      end

      def additional_entries
Contributor

Please also add all (relevant) functions and (math) operators. Searching for zfopen currently yields no result.

Contributor Author

Fixed locally.

        entries = []

        css('h2.section').each do |node|
          type = get_type

          # Linkable anchor sits above <h2>.
          a = node.xpath('preceding-sibling::a').last
          header_text = extract_header_text(node.content)

          if type == 'Zsh Modules'
            module_name = header_text.match(/The (zsh\/.*) Module/)&.captures&.first
            header_text = module_name if module_name.present?
          end

          entries << [header_text, a['name'], type] if header_text != 'Description'
Contributor

Some entries lack context; for instance, "utility" refers to different topics. Suggestion: add some helpful context in parentheses after the entry name, such as "Utility functions (calendar)".

[Screenshot: DevDocs "Zsh / Utilities" page, captured 2025-06-05]

Contributor Author

Thanks for the suggestion. What is a good example of a previous scraper adding context to ambiguously named entries?
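One possible shape for the suggested disambiguation, as a rough sketch only: append a context string in parentheses when the same entry name occurs more than once. The helper `disambiguate`, the fourth `context` element of each tuple, and all sample names and anchors are hypothetical; the PR's entries are three-element arrays, and this is not taken from an existing scraper. `Enumerable#tally` needs Ruby 2.7 or later.

```ruby
# Hypothetical helper: each input entry is [name, anchor, type, context].
# The context element is an assumption added for this sketch.
def disambiguate(entries)
  counts = entries.map(&:first).tally
  entries.map do |name, anchor, type, context|
    # Only decorate names that are ambiguous and have a context available.
    label = counts[name] > 1 && context ? "#{name} (#{context})" : name
    [label, anchor, type]
  end
end

# Sample data invented for illustration.
entries = [
  ['Utility functions', 'calendar-utils', 'User Contributions', 'calendar'],
  ['Utility functions', 'prompt-utils', 'User Contributions', 'prompt'],
  ['Parameters', 'params', 'User Contributions', nil]
]

disambiguate(entries).each { |entry| p entry }
```

Unique names pass through unchanged, so only the colliding entries grow longer in the search index.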

        end

        entries
      end

      def get_type
        extract_header_text(at_css('h1.chapter').content)
      end

      private

      # Extracts text from a string, dropping indices preceding it.
      def extract_header_text(str)
        str.match(/^[\d\.]* (.*)$/)&.captures&.first
      end
    end
  end
end
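The module-name shortening in `additional_entries` can be tried in isolation with the sketch below. It uses plain Ruby instead of ActiveSupport's `present?`, and it leaves out the filter's check that the page type is 'Zsh Modules'. The helper name `entry_name`, the constant `MODULE_HEADER`, and the sample headers are assumptions for illustration.

```ruby
# Same pattern as in the filter: captures the "zsh/..." part of a module header.
MODULE_HEADER = /The (zsh\/.*) Module/

# Hypothetical helper: returns the bare module name when the header matches,
# otherwise the header text unchanged.
def entry_name(header_text)
  module_name = header_text.match(MODULE_HEADER)&.captures&.first
  module_name.nil? || module_name.empty? ? header_text : module_name
end

p entry_name('The zsh/zftp Module') # prints "zsh/zftp"
p entry_name('Description')         # prints "Description"
```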
33 changes: 33 additions & 0 deletions lib/docs/scrapers/zsh.rb
@@ -0,0 +1,33 @@
module Docs
  class Zsh < UrlScraper
    self.type = 'zsh'
    self.release = '5.9.0'
    self.base_url = 'https://zsh.sourceforge.io/Doc/Release/'
    self.root_path = 'index.html'
    self.links = {
      home: 'https://zsh.sourceforge.io/',
      code: 'https://sourceforge.net/p/zsh/web/ci/master/tree/',
    }

    options[:skip] = %w(
      zsh_toc.html
      zsh_abt.html
      The-Z-Shell-Manual.html
      Introduction.html
    )
    options[:skip_patterns] = [/-Index.html/]

    html_filters.push 'zsh/entries', 'zsh/clean_html'

    options[:attribution] = <<-HTML
      The Z Shell is copyright &copy; 1992&ndash;2017 Paul Falstad, Richard Coleman,
      Zoltán Hidvégi, Andrew Main, Peter Stephenson, Sven Wischnowsky, and others.<br />
      Licensed under the MIT License.
    HTML

    def get_latest_version(opts)
      body = fetch('https://zsh.sourceforge.io/Doc/Release', opts)
      body.scan(/, Zsh version ([0-9.]+)/)[0][0][0...-1]
    end
  end
end
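To show how the version extraction in `get_latest_version` behaves, the sketch below applies the same `scan` and slice to sample strings. The sample page text is hypothetical, since the real manual page wording is not shown in this PR, and the names `latest_version` and `latest_version_tolerant` are invented for the example. The slice `[0...-1]` drops the last captured character, which presumably is a trailing period picked up by `[0-9.]+`.

```ruby
VERSION_PATTERN = /, Zsh version ([0-9.]+)/

# Mirrors the PR's logic: take the first capture and drop its last character.
def latest_version(body)
  body.scan(VERSION_PATTERN)[0][0][0...-1]
end

# Hypothetical variant: only drops a trailing period if one is present.
def latest_version_tolerant(body)
  body.scan(VERSION_PATTERN)[0][0].delete_suffix('.')
end

with_period = 'The Z Shell Manual, Zsh version 5.9.'    # hypothetical page text
without_period = 'The Z Shell Manual, Zsh version 5.9 ' # hypothetical page text

p latest_version(with_period)             # prints "5.9"
p latest_version(without_period)          # prints "5."
p latest_version_tolerant(without_period) # prints "5.9"
```

The second call suggests the slice depends on the version being followed directly by a period; `String#delete_suffix` (Ruby 2.5+) is one way to relax that assumption.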
Binary file added public/icons/docs/zsh/16.png
Binary file added public/icons/docs/zsh/16@2x.png
2 changes: 2 additions & 0 deletions public/icons/docs/zsh/SOURCE
@@ -0,0 +1,2 @@
https://sourceforge.net/p/zsh/web/ci/master/tree/favicon.png