Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup XML input before processing #2190

Merged
merged 2 commits into from
Dec 7, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions inputters/xml.lua
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,12 @@ inputter._name = "xml"
inputter.order = 2

local function startcommand (parser, command, options)
-- Discard list values (non-key/value), stuffed by LXP/expat to make it possible to deduce the order of keys in
-- the source. We're not using it, so we don't care and it is clutter in the AST that makes it different from
-- ASTs generated from SIL inputs.
for i = 1, #options do
options[i] = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cost/benefit triggering key removals on all attributes in a big XML file? (triggering table realloc and GC) vs. just keeping these entries?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cost is less than nil. The info is going to be garbage collected anyway at the end of the SILE run, and doing it sooner rather than later actually reduces overall memory usage. The table realloc is hard to measure; this abstraction for dropping the list-like keys is pretty efficient. In fact most of my test runs have it coming in faster, possible because of the way we copy the options table around later in processing.

On a sample XML stress test file (99,999 commands with 9 args each) the difference is in the noise, but with the "cleaned" version coming in faster on most runs (the error bars are affected by other things I'm doing on my desktop while the test runs):

$ hyperfine -r 10 -L sile sile-master,sile-clean-xml "./{sile} -b dummy foo.xml"
Benchmark 1: ./sile-master -b dummy foo.xml
  Time (mean ± σ):      9.631 s ±  0.404 s    [User: 9.192 s, System: 0.400 s]
  Range (min … max):    9.267 s … 10.563 s    10 runs

Benchmark 2: ./sile-clean-xml -b dummy foo.xml
  Time (mean ± σ):      9.567 s ±  0.190 s    [User: 9.113 s, System: 0.418 s]
  Range (min … max):    9.317 s …  9.808 s    10 runs

Summary
  ./sile-clean-xml -b dummy foo.xml ran
    1.01 ± 0.05 times faster than ./sile-master -b dummy foo.xml

Benefit: Nothing to get confused in the debug output, the debug output matches what we're actually looking at in the data, and the SIL vs. XML ASTs match.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still, would be neat to propose this lunarmodules/luaexpat#41

And then use local parser = lxp.new(content, nil, nil, false) in our XML inputter.

end
local stack = parser:getcallbacks().stack
local lno, col, pos = parser:pos()
local position = { lno = lno, col = col, pos = pos }
Expand Down
5 changes: 5 additions & 0 deletions inputters/xml_spec.lua
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,11 @@ describe("#XML #inputter", function ()
assert.is.equal("bar", t[1])
end)

it("commands should parse only named arguments", function ()
local t = inputter:parse([[<foo baz="qiz">bar</foo>]])[1][1]
assert.is.equal(0, #t.options)
end)

-- it("commands with quoted arg with escape", function()
-- local t = inputter:parse([[<foo baz="qiz \"qiz\"">bar</bar>]])
-- assert.is.equal("foo", t.command)
Expand Down
Loading