Participants:
-------------
Alberto Aimar - EP-SFT
Alastair Bland - AB
Marc Dobson - ATLAS-online, representing Bruce Barnett
Christoph Schwick - CMS-online, representing Eric Cano
Marco Cattaneo - LHCb
Nicolas De Metz-Noblat - AB
Benigno Gobbo - non-LHC
Jan Iven - IT
Steve O'Neale - ATLAS-offline
Bernd Panzer-Steindel - IT
Jarek Polok - IT
Fons Rademakers - ALICE
Tim Smith - IT
Stephan Wynhoff - CMS-offline
Thorsten Kleinwort - IT
Helge Meinhard - IT /CLUG

original agenda:
----------------
* discussion on Red Hat support policy changes (RH1-->Fedora) and their
  effect on the ongoing certification.
* decision to continue the certification or decision what to certify
  when.
* Status updates for the ongoing certification (if still relevant).
* AOB

----------------------------------------------------------------------

Executive Summary:
==================
The experiments reiterated the need for 12-month stability of the
compiler-middleware toolchain. Online groups are (currently) still
eager to go to "fresh" releases. Desktops and farm machines also need
"recent" hardware support. The preference for a HEP-wide solution was
stressed.

* Fedora within a HEP-wide collaboration and Red Hat Enterprise (again,
  with a HEP license scheme) are the current favourite alternatives.

IT should see whether support for Red Hat Enterprise for
special-purpose machines can be expanded, to decouple them from the
upgrade cycle and to make it easier to deploy commercial products.

Minutes:
========
(round-the-table, stressing constraints in order to narrow down
possible solutions. Some intermittent discussions folded in. No AOB)

Benigno:
* need ABI (compiler/base libraries) stability
* all alternatives are bad, but Fedora acceptable provided that the
  commercial product problem (ORACLE client) can be solved
* suggest HEPCC to standardize HEP-wide

discussion:
* for-pay product may be unacceptable for HEP collaborations.
  5-year support is overkill, need 2 at most
* for LHC experiments, LCG has a major say in this (but this group can
  influence their choices)

Stephan:
* need ABI stability for 12 months (C++ compiler, LCG layer, CMS layer)
* not seeing "unifying" influence from GRID yet (too little experience,
  for now different sites have different versions).

discussion:
* with collaborating institutes running something different, we
  frequently need to re-certify software anyway -> multiple versions,
  nightly regression tests, on-disk data has to stay readable through
  all versions (less so if handled by dedicated servers). Still need to
  avoid changes during "production" periods
* consensus that the "desktop == farm ( == diskless DAQ)" paradigm is
  still useful, and cannot easily be replaced e.g. by dedicated build
  machines (compile-test cycle, debugging). Dedicated servers can run
  something different as long as the interface to clients does not
  change.
* can we compromise on security in favour of stability? Generally no,
  private networks and firewalls were not sufficient in the past.
  Isolated groups of machines could take the risk, if they cannot
  affect CERN or the outside world and are not offering critical
  services. Everything else needs regular updates. Can minimize number
  of changes by minimizing the installed software (-> dependency
  analysis).

Helge: (only listening)

Marc:
* stability:
  - 12-month cycle is fine, too slow (24 months) would be bad (need
    latest kernel during development)
  - security updates are no problem
  - prefer changes in winter, to have stable online software release
    before beams come up in spring. Need ~3 months for porting and
    tests.
* commercial tools: PVSS and Java. Would like to stay with gcc-3.2|3,
  not move back to 2.96 due to PVSS since ATLAS-online dropped gcc-2
  support already.
* own kernel drivers, need some support from "stock" kernel

discussion:
* PVSS is widely used (mostly servers). Need to coordinate discussions
  with ETM to have maximum leverage.

Jarek:
* need explicit dependencies, not only for certification but also for
  intermediate updates

discussion:
* suggestion to use regression tests wherever possible (IT/LCG should
  help?)
* tests will not find everything. Announce changes to give time to do
  additional tests before automatic deployment.

Thorsten/Tim:
* own tools need time (months) to port to new release, would like to do
  this as part of the certification in the future
* FIO tools should be able to handle intermediate updates and allow a
  quick move to new releases
* users/usage decide eventually what runs on farms
* need ORACLE client and LSF (responsive in the past, but dependency on
  kernel+compiler)

Fons:
* different versions on sites anyway -> automatic testing in place for
  ALICE software
* for CERN a switch away from Red Hat would be expensive (some
  dependencies on init scripts/configuration)
* keep 64bit distributions in mind, would like common infrastructure
  (version?) on IA32, IA64, Opteron

Nicolas/Alastair:
* need total stability (i.e. no intentional change) between March and
  November, this includes the operators' workstations and diskless
  nodes
* PVSS/ORACLE/Java app server machines could/should move to Red Hat
  Enterprise (worries about compatibility with applications compiled on
  desktops, no tight control over AB compilation environment). Suggest
  RH9 on workstations to have same compiler, or cross-compiling
  environment on desktops. RPMs are not used, same binary sometimes has
  to work on multiple platforms (NFS-mount)
* depends on Mathematica and ORACLE, don't care about desktop tool rate
  of change

Christoph:
* fairly decoupled from CERN network
* need recent compiler (>= gcc3.2), portable code allows switching
  compilers quickly (days), would like latest in glibc (threads) and
  kernel (networking: NAPI, GigE, threads)
* 12-month cycle is ok, but would like a "fresh" version then.
* dependencies: SUN Java 1.4, ORACLE client, XERXES, device drivers
  (own and commercial)
* interested in cluster management tools

Marco:
* compiler stability for 12 months: C++, python, Java?
* kernel is less critical (still needs testing, but few direct
  dependencies)
* try to avoid CERN special solutions (bad experience), HEP-wide is OK,
  "standard" is best.
* need > 1.5 months for certification (=expensive, don't repeat every
  3 months..)

Steve:
* we expect soon to be on a 1-year cycle based on ensuring stability
  during the operation of LHC and the CERN accelerators (i.e. fixes the
  time of changes)
* the "filter-farm" computing challenges in early 2004 may add to the
  test-beam and data challenge constraints on when we might require new
  systems (hardware) unable to run RH7.
* agreed with need for explicit dependencies; in addition to products
  already mentioned, Fortran 90, AFS and the Kerberos 4 to Kerberos 5
  transition come to mind.
* ATLAS offline and LCG/SPI are starting to introduce tools for
  validation and regression testing and this is an area where we may
  require/apply additional manpower, particularly if we finally adopt a
  solution such as alternate Fedora releases, and no longer have the
  cernlib tests.
* Debian has been suggested by some of my colleagues, is such a big
  change (and loss of expertise/products) viable?
* What plans have you made to work with other institutes to maybe
  negotiate with Red Hat, SuSE etc. in the HEPiX meeting? (-> action)

Alberto:
* too early certification is bad (e.g. on beta), external tool
  dependencies need time to port (typically wasted, tool maintainers
  will do this eventually themselves)
* "stay standard".

Summary:
--------
* no decision for one of the alternatives yet.
* need feedback from HEPiX
* (most likely:) stay on Red Hat, either Fedora or Enterprise 3.0.
  Both would need to be acceptable to the HEP community (HEP license or
  HEP collaboration)
* continue to certify "Fedora beta" - closer to Red Hat Enterprise than
  7.3.3, will become actual "Fedora"

Actions:
========
Jan:   minutes/summarize, get approved, publish
All:   review dependencies and product status under
       http://cern.ch/linux/redhat10/certification
Jarek: provide installable beta version
All:   continue with current certification
Jan:   report back on meeting with Red Hat
Jan:   report back on HEPiX meeting