Minutes draft2

Minutes of the LXCERT meeting 29.08.2007 - DRAFT

Invited/Attendance

Name                 Affiliation               Present?
Bruce M Barnett      ATLAS-online
Alastair Bland       AB/CO-admin
Eric Cano            CMS-online
Joel Closier         LHCb-offline              (absent)
Benigno Gobbo        non-LHC experiments       (excused)
Louis Poncet         Grid middleware           (replaced by Gergeli Debreczeni)
Jan Iven             IT-services catchall      (chair)
Niko Neufeld         LHCb-online               (absent)
Emil Obreshkov       ATLAS-offline             (absent)
Stefan Roiser        LCG Applications Area     (accidentally not invited)
Jarek Polok          general Desktops          (secretary)
Fons Rademakers      ALICE-offline             (absent)
N.N.                 IT PLUS/BATCH service     (replaced by Ulrich Schwickerath)
Klaus Schossmaier    ALICE-online              (absent)
Vincenzo Innocente   CMS-offline
Peter Kelemen        Linux Support

Agenda

  1. Presentation: Overview of possible options for SLC5 by Linux Support (slides)
  2. Discussion

JP: Overview of possible options for SLC5

The meeting started with JP giving an overview of the situation of Linux at CERN:

Despite having been certified more than a year ago, SLC4 has only just overtaken SLC3 in terms of deployed boxes at CERN, and some services are still not ready for SLC4.

A possible SLC5 certification (as per the usual schedule of every 1.5-2 years) would take at least 6 months, and would therefore extend into the hectic period where everybody is busy with LHC startup. At the same time, no user group has been actively pushing for a newer compiler or a general software update (VI: "no incentive", unlike SLC3->4); it is therefore unlikely that the required resources for a full certification of experiment software could be found (i.e. more delay).

Without such a certification, a future SLC5 could not be rolled out on the most permanent (i.e. directly user-visible) services, nor on the desktop (assuming the need to have a "certified" desktop for development or analysis). Without a wide roll-out pending, there would be no incentive for other services to certify or migrate. Supporting SLC4+SLC5 without agreed end dates is too expensive for Linux Support.

However, it is becoming clear that SLC4 is near the end of its lifetime with respect to hardware support: compatible laptops are no longer being sold on the market, and compatible desktops are already difficult to come by (server or batch machines pose no major issue yet, but this is expected to change in 2008).

The pressure to migrate off SLC4 therefore comes from hardware-purchasing groups (IT, AB/CO to some degree, perhaps experiment "online" groups).

JP presented the available options (full certification of SLC5, CERN-specific kernel/X11 updates, non-certified SLC5, "physics" environment decoupled via AFS, ...), none of which would resolve the disparity between the need for new hardware and the time and resource constraints of software/service providers.

As a special case, the "Linux laptop" hardware support problems were highlighted. This is an area where the current stability-oriented certification model clearly does not work, due to short model lifetimes. The current utilization of SLC on laptops appears to be low both in relative numbers (~20% of "Linux laptops" as per LanDB, <30% on the exemplary HP NC6400) and in absolute numbers (~40 non-NICE HP NC6400; 150 DHCP updaters on linuxsoft). The effort required to properly support such hardware is not justified by this utilization. The proposal would be to withdraw formal laptop hardware support (no "compatible" model in the stores, no upfront tests) and let individual users pick their favorite models based on community experiences, possibly with a "certified" OS in a virtual environment, and some guidance from IT to make things work in the CERN environment. If required, such a model could be extended to Linux desktops later.

This will need wider discussion (e.g. at DTF), but the participants of LXCERT were asked to provide feedback from their user communities.

Discussion

VI asked whether "hidden" services (web/disk servers etc.) would be in a position to migrate to SLC5 even without a fully-certified version. However, the incentive for such early migrations (except for hardware support) is currently low: no major performance/stability gains are expected from SLC5.

VI asked for a cost-based approach: cost for SLC5 certification+rollout (e.g. to the experiments) vs SLC4 penalty for IT/other hardware purchasers/users. While certification costs can be estimated based on previous experience, the costs caused by "legacy-compatible" hardware are unknown (i.e. no offers received; vague guesses for the cost of backporting hardware support), as are those caused by lost productivity. The disparity of SLC4 vs SLC5 costs was clear: one is largely on IT/Linux Support, the other on everybody.

AB was worried about the implications of staying on SLC4 for future (2008) purchases: AB/CO is already experiencing the same problems as IT in finding compatible "desktop" models (e.g. for the CCC consoles). They would be willing to move to SLC5 if it provided better hardware support and were available in "IT-supported" form before December 2007, including some "vital" packages (ORACLE-instantclient, YUM+DAG, LabView, PVSS, perhaps JVM; CERNLIB et al. preferred).

The changes between SLC4 and a potential SLC5 were discussed. The compiler change (gcc 3.4 -> 4.1) was expected to be at most as difficult as the 3.2->3.4 change (some more restrictions on C++, but ABI-compatible; several groups have already started to test it), but gfortran could be a problem (g77-gcc-3.4 is still available). Several short/informal tests (AB/CO, CMS, EGEE, IT) with SL(no-C)5 have been done and hint that most things should just work.

The topic of results from non-certified SW stacks was raised: VI indicated that in principle e.g. analysis results from an Ubuntu machine (uncertified stack) could not be accepted by an experiment. The question was posed whether virtual machines (Xen) would need explicit re-certification (since kernel and glibc are different), but the agreement was that this would a priori not be necessary.

There was some agreement that IT would need to provide some minimal testable SLC5 environment if anything else was to happen (with AB/CO eventually using this for production). JI requested a strong/formal demand for this from the user groups to justify the cost (this would imply a willingness on the users' side to spend effort on testing, and eventually permit deployment). BB pointed out that even a non-certified-but-tested SLC5 would ease the transition to a later SLC6.

GD would like to have an SLC5 test platform (they already have SL5) as soon as possible (given the long delays in certifying/porting to a new platform, see the SLC4 saga). As with Linux Support, some user demand/pressure would be required to justify spending the porting effort.

JP opened the wider discussion of whether the current CERN certification model was still working or was still required. VI pointed out that some kind of certification would be required even within each experiment to be able to accept results (32/64bit compatibility issues across sites were mentioned); the current wide RHEL/SL/SLC compatibility was appreciated since it limits effort. The need to coordinate changes to the OS goes beyond CERN (LCG Architect Forum for compiler choice and HEP sites for "Grid capacity rollout" were mentioned); the current CERN model is not ideal but is part of the change coordination infrastructure. Several "chicken-and-egg" issues exist and threaten to lock us into SLC4 until the hardware issue becomes overwhelming even for software providers.

Summary:

  • SLC5 is not perceived to be a burning issue by users not directly charged with hardware purchases.
  • Linux Support would need a clear request from LXCERT to open up a (supported) SLC5, even as "just" a test service. IT-GD needs a similarly clear mandate for porting Grid software.
  • LXCERT members need to provide feedback on if and when they would be willing to look at SLC5 for certification and eventual deployment (time constraints need to be made explicit).
  • LXCERT members need to provide feedback on the proposal to remove "SLC-compatible" Linux laptops.