Minutes of the LXCERT meeting 04.10.2005

Invited / Attendance

Abbrev  Name                    Affiliation                   Present
AA      Alberto Aimar           LCG Application Area
AB      Alastair Bland          AB/CO-admin
AD      Andre David             non-LHC experiments
AP      Andreas Pfeiffer        LCG Application Area
BB      Bruce M Barnett         ATLAS-online                  excused → MD
BG      Benigno Gobbo           non-LHC experiments           excused → AD
EC      Eric Cano               CMS-online
EO      Emil Obreshkov          ATLAS-offline
FR      Fons Rademakers         ALICE-offline
HD      Hubert Degaudenzi       LCG Application Area
JC      Joël Closier            LHCb-offline
JI      Jan Iven                IT-services catchall (chair)
JP      Jaroslaw Polok          general Desktops (secretary)
KS      Klaus Schossmaier       ALICE-online
MD      Marc Dobson             ATLAS-online
NMN     Nicolas De Metz-Noblat  AB/CO-development
NN      Niko Neufeld            LHCb-online
SW      Stephan Wynhoff         CMS-offline
TK      Thorsten Kleinwort      IT PLUS/BATCH service         missing

Agenda

  • status of the "compiler" certification:
    • gcc version issue
    • status of major packages (dependencies status for the experiments)
  • proposed changes for SLC4
  • planning (foreseen delays to certify experiment SW after dependencies have been met)
  • AOB

Issue1: status of the "compiler" certification

AP explained that the LCG architects forum (LCG AF) decided in a recent meeting to go back to the system gcc-3.4.3 instead of gcc-3.4.4. All "external" packages had already been compiled with 3.4.4, and no problems are expected in recompiling them with 3.4.3. This is somewhat automated, so all the "external" packages for gcc-3.4.3 should be available within the next few days.

On the other hand, for some of the internal packages like POOL the port to gcc-3.4.4 (with stricter standards compliance than on gcc-3.2) has not finished yet. These packages require expert intervention, and the port may take O(weeks).

A short discussion ensued on whether the timeline for SLC4 (to be certified by the end of the year) was still within reach, given that all LHC experiments require LCG Application Area packages. Unless major obstacles occur, the porting effort should be finished soon enough for the experiments to test their software in time. (Side note from SW: an initial report from a CMS collaborator indicates that parts may already work on SLC4; no details are available on the software or versions used.)

It was also decided to take the LCG Application Area packages out of the certification dependency tracking (since the LCG AF does its own tracking), and to leave such packages in only if they are used/required by non-LHC users (who may not be represented via the LCG AF), and/or if the package is also shipped with the distribution (e.g. CERNLIB-as-RPM).

Issue2: proposed changes for SLC4

JI presented a short summary of announced changes for SLC4, both from CERN and upstream. He requested that further change requests be brought forward now, so that they can be properly announced via the DTF etc.

MD pointed out that porting drivers from 2.4 to 2.6 kernels takes a long time (development and full testing cycle: O(1 year)). It was agreed that recurring issues (and their solutions) should be shared via the Linux-certification mailing list. AB and NMN reported that AB/CO have successfully ported their drivers to 2.6, as have LHCb-online (NN) and ALICE-online (KS). However, nobody has drivers for SLC4 yet.

Given that OpenAFS still hasn't reached the same level of stability on 2.6 as on 2.4, even 2 years after 2.6 was released, FR asked whether this was a dying project (and whether CERN should drop it), or whether labs like CERN should get more involved with OpenAFS development. It was explained that the project is alive, that other labs are using OpenAFS as well, and that the current round of development (1.2 → 1.4) centered on the Windows client. However, CERN has less than 2 FTE on AFS support and is not in a position to do significant development in this area.

NMN and AB listed several problem areas with the "general version update" in SLC4 packages, such as

  • xorg-x11 (now runs without an open TCP port by default; several other members welcomed this, since unencrypted X11 connections are a security risk), xorg-x11 Xnest ([update: Xnest has problems with backing-store, i.e. applications displayed on the screen are not refreshed properly when temporarily hidden by another application])
  • NFSv3 now is the default (which has caused trouble in mixed environments, especially with automounting). MD pointed out that NFSv3 in their environment is actually behaving better than v2.
  • GNOME-2.8 is still slow. However, it was clarified that the "slow gnome-terminal scrolling" bug (vte) from SLC3 has been fixed.

    [addendum: gnome-2.6 should have had several other performance improvements via GTK+, which would then get carried over to gnome-2.8.
    GNOME-2.4 release notes
    GNOME-2.6 release notes ]

  • Kerberos/AFS: NMN suggested restricting AFS clients to the CERN AFS cell, since otherwise long delays may come from autocompletion (dynroot might help as well). This turned out to be a per-machine configuration issue.
    A quick discussion followed, during which NN pointed out that LHCb-online is considering setting up their own AFS servers inside their CNIC domain. Other experiments don't plan to do that, and may have to configure access to the CERN-wide Kerberos KDC from within their domains. Following a question by AB, it was clarified that the SLC4 ssh would forward AFS credentials (via the SSH-1 protocol) or get them via forwarded Kerberos5 TGTs (SSH-2).
  • ORACLE: the ORACLE tools situation is still not clarified (first raised in SLC3; RPMs initially promised but then retracted): AB/CO needs to have locally-installable versions of the ORACLE libraries and sqlplus, properly configured for CERN via tnsnames.ora, preferably without relying on LD_LIBRARY_PATH being set in the user environment. AB/CO also relies heavily on ProC.
    Having locally-installed client tools could be more important to AB/CO than central support from IT-DES. NMN stressed that these tools need to be available before any meaningful validation inside AB/CO can even start.
  • AB/CO needs to have recent enough versions of eclipse (3.1 or more recent) and ant (1.6.3 or more recent)
  • SLC4 seems to drop support for certain processors [Pentium2?, HP6000 notebook, need to clarify]. This would need to be widely announced, since the installation will actually start, but leave the user with a non-booting system. It also remains to be seen whether this affects any of the "online" VMEbus systems.
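The minimum-version requirements recorded above for eclipse (3.1) and ant (1.6.3) could be checked mechanically during validation. The following is only a minimal sketch: the names `MINIMUM`, `parse_version` and `meets_minimum` are hypothetical, and it assumes purely numeric dotted version strings (it would not handle strings such as 1.4.2_06).

```python
# Minimum versions as recorded in the minutes (eclipse 3.1, ant 1.6.3).
MINIMUM = {"eclipse": "3.1", "ant": "1.6.3"}


def parse_version(text):
    """Split a numeric dotted version string such as '1.6.3' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(package, installed):
    """Return True if the installed version is at least the recorded minimum."""
    return parse_version(installed) >= parse_version(MINIMUM[package])


if __name__ == "__main__":
    # Example versions below are made up for illustration.
    for package, installed in [("eclipse", "3.0.2"), ("ant", "1.6.5")]:
        status = "ok" if meets_minimum(package, installed) else "too old"
        print(package, installed, status)
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where "1.10" would sort before "1.6".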

NN stressed the need for full support for x86_64. This support is available at the OS level, and it is up to the individual package providers to decide whether x86_64 will be supported (for 64bit mode). Notably, the LCG AF needs to decide when it wants to have x86_64 as a supported platform. Most newly-arriving hardware can run in 64bit mode; it will be up to the experiments and IT-FIO to schedule a transition to 64bit mode for batch capacity.

NN and FR also explained that support for ia64 is not required for their environments and should be the first to be discontinued to limit the number of platforms. ia64 should also not be a cause for any certification delay. [addendum: NN pointed out that while em64t and x86_64 are very similar, Opterons benefit strongly from a NUMA-aware kernel, which may require different kernels]

NMN and EC expressed a strong need for a Java JDK that works with NPTL/glibc-2.3, in a readily-installable version. This would also make the RPMs at jpackage.org (such as Jakarta or newer (NMN: usable) eclipse versions) immediately accessible. JI explained that this is a political issue (and has been for the past few years): IT does not have formal Java support (AB: may actually not even be required, package-level support may be enough). Also, the license is still very restrictive, and the CERN user community would need to agree on a single version of the JDK (possibly 1.5, since older versions do not support x86_64 well).

[update: NMN added the following via mail:
For JDK, the problem is really complex:

  • nobody wants to become a beta-tester for any version, but we finally had to urgently upgrade under SLC3 to at least 1.4.2_06 for javaws to accept properly signed jars.
  • the deployed version has to stay coherent with gcc and glibc for proper usage of threads and to have JNI capability.
  • users did not want to change to 1.5 (fear of language incompatibilities)...
  • proper choice of jdk/jre for stand-alone applications might be independent from jre required by firefox, mozilla and other web-browsers to ensure that web applications like EDH still work in a safe environment.
]

JI presented the current rumors (no confirmation from RH) about a possible delay of RHEL5 to the end of 2006, which would rule out this release for general LHC startup. This was followed by a discussion on release lifetimes, with a requirement from MD that SLC3 needs to remain available until the 2.6 kernel drivers have been fully validated. JP pointed out that the support overlap between the current and the "LHC" production release will be on the order of 1.5 years.

The option to lock only certain environments (like online farms) onto the "LHC startup release" in October 2006 was discussed, with other services moving to SLC5 sometime after that date (little issue for services hidden behind a networked API, and e.g. little trouble expected for CPU nodes, especially if the physics compiler is decoupled from the OS). In this case an older release could be obsoleted for general CERN usage while still being available to certain (restricted, e.g. via CNIC network domains) environments.

Issue3: planning

The original certification planning still holds: try to certify by the end of 2005. Currently the LCG application area software (blocking LHC experiments) and the ORACLE client (blocking AB/CO) are on the critical path.

Issues encountered should be submitted via the "linux.support" REMEDY mailfeed.

AOB

  • It was pointed out that the current apt/yum-autoupdate tool is not a recommended solution for service-critical machines, as changes may lead to machines becoming unusable (example: recent XFree86 updates for AB/CO Controls room machines). This is a design decision: (rare) unavailability on selected desktops is preferred over (frequently) non-patched desktop machines.
  • This led to a discussion on CNIC/LinuxFC. Criticism was expressed of the fact that the technical people are only now being contacted to review the requirements and test the proposed (Quattor-based) solution. This turned out to be partly an internal communication issue, since all concerned user groups have formal representation inside CNIC, and have been asked to review the current list of requirements. It was pointed out that the current round of consulting the "technical" contacts was an initiative by IT, which needs some commitment to use LinuxFC in order for the project to continue.

    A quick poll showed general interest in CNIC, but already now for some environments (e.g. ATLAS-online diskless farms) it may be too late to switch management to LinuxFC. CNIC/LinuxFC will be presented further via a DTF presentation.

JI asked whether a date for the next meeting needed to be fixed (2 weeks' lead time is needed for external participants). Consensus: not required now, no real issues yet.