BOF: Expanding the Impact of Scientific Software Engineering in HPC

Francesco Rizzi

(NexGen Analytics)

Follow along: https://fnrizzi.github.io/sc22_bof_slides

(use the space bar to navigate slides)

From the BoF abstract:

...scientific and research codes would benefit from the same attention, especially in terms of making codes accessible, interoperable, and reliable.

This BoF will engage a set of expert panelists and the audience in understanding how we can bring best practices for software engineering to the wider audience of scientific software developers.

My contribution today

High-level thoughts, discussion points, and lessons learned over years of working in scientific computing
- both technical and social aspects
This is my opinion, my experience
- I have been fortunate enough to work with both experienced professionals and students, and
  both on research and production codes
Some solutions might seem obvious or familiar, but the take-home point is that they are still not applied regularly

Accessibility

Libraries' building process

Problem: just building a library can be a barrier
- Some users shy away from even trying/using a code
- Sometimes inevitably hard:
  - legacy codes would require a huge investment to revise
  - Is it worth it and/or feasible?
- "Magical scripts": known to work, used by lots of people.
  Often, how they work is hidden, and they are brittle
Solutions: CMake, Modules, Spack
Aiming for: pip install <package-name>
- Can we have something similar in HPC/scientific computing?
- How do we achieve and maintain that? Is it even feasible?
It could be that the underlying code needs to be fixed

Large codes can be scary

Problem: in several cases, junior researchers and postdocs
avoid working with/contributing to large codes
- This mostly occurs among people without a CS background
- Causes: a mental barrier, lack of documentation, and the perception that working on large codes is a diversion from their goal
- Separate codes are developed for research purposes instead,
  leading to many "one-off" codes
Solutions:
- Improve documentation to make it more accessible
- Support new developers using the code for research
- Value development as part of the research: development as part of the deliverable

Documentation not a priority

Problem:
- Documentation is sacrificed to meet deliverables
- Lags behind development
- Written after the development is complete
- Documenting the API might not be enough: it can still take many mental jumps to really understand how to use it, especially for new or inexperienced users
- Proper, practical "usability documentation" is lacking
Solutions:
- Write doc while developing...
- ...but where to find the resources?

Examples do help

template< class ExecutionPolicy, class RandomIt, class Compare >
void sort( ExecutionPolicy&& policy, 
           RandomIt first, RandomIt last, Compare comp );

- policy: execution policy

- first, last: random access iterators to the range of elements

- comp: comparison functor

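#include <algorithm>
#include <array>
#include <iostream>
#include <string_view>

int main()
{
  std::array<int, 10> s = {5, 7, 4, 2, 8, 6, 1, 9, 0, 3};

  // helper: print the current contents of s followed by a remark
  auto print = [&s](std::string_view remark) {
    for (int a : s) std::cout << a << ' ';
    std::cout << ": " << remark << '\n';
  };

  std::sort(s.begin(), s.end());
  print("sorted with the default operator<");

  struct {
    bool operator()(int a, int b) const { return a < b; }
  } customLess;
  std::sort(s.begin(), s.end(), customLess);
  print("sorted with a custom function object");

  std::sort(s.begin(), s.end(), [](int a, int b) { return a > b; });
  print("sorted with a lambda expression");
}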

(source: C++ doc)

Reliability

Modularity and public APIs are critical

This impacts both accessibility and reliability
Problem:
- For convenience, sometimes "things" are lumped all together
- Bad shortcuts: using implementation details from different packages leads to all sorts of problems
Solution:
- Large codes should have appropriate modularity
- Modules/packages should interoperate via their public APIs
- Should be able to reason about modules independently
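A minimal sketch of what "interoperate via public APIs" can look like in C++. The package names (mesh, solver) and every function here are hypothetical, chosen only for illustration: the solver code uses the accessors that mesh exposes and never reaches into mesh::detail, so the storage layout can change without breaking it.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical "mesh" package: only the Mesh class is meant to be public.
namespace mesh {

namespace detail {
// Implementation detail: the storage layout may change without notice,
// so other packages should not depend on it.
struct Storage {
  std::vector<double> cellVolumes;
};
} // namespace detail

class Mesh {
public:
  explicit Mesh(std::vector<double> cellVolumes)
    : storage_{std::move(cellVolumes)} {}

  // Public API: the only supported way to query the mesh.
  std::size_t numCells() const { return storage_.cellVolumes.size(); }

  double cellVolume(std::size_t i) const {
    assert(i < numCells());
    return storage_.cellVolumes[i];
  }

private:
  detail::Storage storage_;
};

} // namespace mesh

// Hypothetical "solver" package: depends on mesh only through its public API.
namespace solver {

inline double totalVolume(const mesh::Mesh& m) {
  double sum = 0.0;
  for (std::size_t i = 0; i < m.numCells(); ++i) {
    sum += m.cellVolume(i);
  }
  return sum;
}

} // namespace solver
```

With this split, each package can be tested and reasoned about on its own: a test for solver::totalVolume only needs a Mesh built through its constructor.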

Tests are a "burden"

Problem:
- Some research code efforts still consider testing (both functional and performance) a "burden" rather than added value or a "must-have"
- Resources: tests are typically not considered deliverables
- A lag develops and it then becomes harder to catch up
- Reliability is thus sacrificed for "progress" and deliverables
Solution:
- Tests should be deemed invaluable in research codes
- A well-tested research code more easily becomes a reliable production code

CI can be a bottleneck

Problem:
- Choosing between cloud-based and local testing servers
  - sometimes local ones are necessary (private data, etc)
- Increased queue times
- Test granularity, many pipelines tested, variety of compilers
Solution:
- Sometimes there is nothing you can do: limited resources
  (e.g. one GPU available)
- Mix/use different CI tools
- Refactoring/revising tests (continued on next slide)

Test granularity matters

Problem:
- Developers are stuck with a given test suite/framework that they must use and keep extending
- Developers refrain from adding tests because it increases build time:
  I know of a production code where developers have a hard limit on the number of tests they can add for a given new feature
Solution:
- Separate functional from performance tests
- Sometimes refactoring the test suite is needed
- Revise the granularity of functional tests
  - Hierarchical tests: if one "level" fails, do not run the others
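One possible way to sketch the hierarchical idea in C++, without assuming any particular test framework. The names TestLevel, Report, and runHierarchy are hypothetical: levels run in order, and once any test in a level fails, the remaining (typically more expensive) levels are skipped.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical types for illustration only; not taken from any framework.
struct TestLevel {
  std::string name;
  std::vector<std::function<bool()>> tests; // each returns true on success
};

struct Report {
  int levelsRun = 0;
  int failures = 0;
  std::string stoppedAt; // name of the failing level, empty if all passed
};

// Run the levels in order. All tests within a level are executed, but if any
// of them fails, the later levels are not run at all.
inline Report runHierarchy(const std::vector<TestLevel>& levels) {
  Report report;
  for (const auto& level : levels) {
    ++report.levelsRun;
    int failedHere = 0;
    for (const auto& test : level.tests) {
      assert(test && "test callable must not be empty");
      if (!test()) {
        ++failedHere;
      }
    }
    report.failures += failedHere;
    if (failedHere > 0) {
      report.stoppedAt = level.name;
      break; // skip the remaining levels
    }
  }
  return report;
}
```

A typical ordering under this scheme would put cheap unit-level checks first and costly integration or performance runs last, so that a basic failure does not consume queue time on the expensive levels.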

Predict "unexpected usage"

Problem:
- Tests are by construction biased:
  - written by human beings
  - check for "expected" things (according to whoever wrote them)
- Discrepancy between the "intended" use of software
  and what general users actually do in practice
  - users find all sorts of unique ways to use/break your code
Solutions:
- Code reviews
- Somehow mimic this "random/human" aspect in tests
- Hard problem
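One partial way to mimic the "random" aspect is a randomized (property-style) test: rather than a few hand-picked inputs, generate many inputs and check properties that must hold for all of them. The sketch below is only an illustration under stated assumptions: sortedUnique is a hypothetical function under test, and countPropertyViolations is a hypothetical helper. It does not capture the truly "human" side of unexpected usage.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

// Hypothetical function under test: returns the sorted, de-duplicated values.
inline std::vector<int> sortedUnique(std::vector<int> v) {
  std::sort(v.begin(), v.end());
  v.erase(std::unique(v.begin(), v.end()), v.end());
  return v;
}

// Generate random inputs (including empty and duplicate-heavy ones) and count
// how many of them violate properties that must hold for any input.
// A fixed seed keeps any failure reproducible.
inline int countPropertyViolations(unsigned seed, int numTrials) {
  assert(numTrials >= 0);
  std::mt19937 gen(seed);
  std::uniform_int_distribution<std::size_t> sizeDist(0, 20);
  std::uniform_int_distribution<int> valueDist(-5, 5); // narrow range forces duplicates

  int violations = 0;
  for (int trial = 0; trial < numTrials; ++trial) {
    std::vector<int> input(sizeDist(gen));
    for (auto& x : input) x = valueDist(gen);

    const std::vector<int> out = sortedUnique(input);

    // Property 1: the output is strictly increasing (sorted, no duplicates).
    const bool strictlyIncreasing =
      std::adjacent_find(out.begin(), out.end(), std::greater_equal<int>()) == out.end();

    // Property 2: every input value appears in the output.
    const bool coversInput = std::all_of(input.begin(), input.end(), [&](int x) {
      return std::binary_search(out.begin(), out.end(), x);
    });

    // Property 3: the output contains no value that was not in the input.
    const bool noNewValues = std::all_of(out.begin(), out.end(), [&](int x) {
      return std::find(input.begin(), input.end(), x) != input.end();
    });

    if (!(strictlyIncreasing && coversInput && noNewValues)) ++violations;
  }
  return violations;
}
```

The properties are stated independently of how the author expects the function to be called, which reduces (but does not remove) the bias of tests that only check "expected" things.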

Thanks for listening!

francesco.rizzi@ng-analytics.com

https://www.ng-analytics.com