Minutes

Minutes of the LXCERT meeting 07.02.2006

Invited / Attendance

Abbrev Name Affiliation Present
AB Alastair Bland AB/CO-admin (missing)
AP Andreas Pfeiffer LCG Application Area
BB Bruce M Barnett ATLAS-online
BG Benigno Gobbo non-LHC experiments (only beginning, VRVS/phone)
EC Eric Cano CMS-online
EO Emil Obreshkov ATLAS-offline
FR Fons Rademakers ALICE-offline (excused)
JC Joël Closier LHCb-offline
JI Jan Iven IT-services catchall (chair)
JP Jaroslaw Polok general Desktops (secretary)
KS Klaus Schossmaier ALICE-online
NMN Nicolas De Metz-Noblat AB/CO-development (missing)
NN Niko Neufeld LHCb-online
SW Stephan Wynhoff CMS-offline (missing)
TK Thorsten Kleinwort IT PLUS/BATCH service

Date: Tuesday 07.02.2006, 10:00 Place: 40-R-D10 VRVS: "Sky" room

Agenda

  • SLC5 delay impact - discussion, new plan.
  • SLC4 certification status
  • initial SLC4 roll-out timeline (to be coordinated with GDB)
  • SLC3 support lifetime (per architecture)
  • AOB

Due to technical problems with VRVS, BG was only able to make a general statement about the concerns of the non-LHC experiments: these have little or no manpower for OS migrations and would like such changes to be kept to a minimum.

The slides of JI's presentation are available here (PDF, PPT).

Summary

The proposed certification schedule with a deadline at the end of March 2006 was accepted, and it was agreed that no further slips should occur. SLC4 certification will be delayed again only if major issues are actually found.

It was also agreed that the Grid Deployment Board (GDB) needs to be tightly involved in the actual SLC4 rollout coordination.

Target date for switching the current "LXPLUS" alias to SLC4: October 2006.

The proposed end-of-life timelines for SLC3 were accepted.

RHEL5/SLC5 delay

CERN and Fermi now expect that Red Hat Enterprise Linux 5 (RHEL5) will not be available before the end of 2006; a CERNified version (SLC5) would only be available in 1Q2007 (i.e. too late for LHC startup). This means the previous plan of just certifying SLC4 but not rolling it out widely is no longer viable.

The current proposal is to use SLC4 as the "LHC startup release", and to roll out SLC4 in production mode (i.e. switch the "LXPLUS" alias to SLC4) by autumn, with the GDB being strongly involved. Batch capacity migration will be done in stages according to the experiments' requirements. This proposal has been accepted.

[update from BG after the meeting:

  • non-LHC experiments "suffer" from any system/compiler migration due to lack of manpower,
  • but they understand the need of upgrades and consider a frequency of about 2 years to be reasonable.
  • whether to go from SLC3 to SLC4 or to SLC5 does not affect these considerations.
]

SLC4 status: IT

Linux Support proposes to fold in the pending changes from RHEL4U3 (currently in beta). Several tools are not fully certified yet, but we have encountered no major stumbling blocks. TK: SLC4 PLUS/Batch test machines need a few more weeks to set up. Kerberos5+ssh interoperability still needs more work. The focus of the IT tests has been on i386; wider tests on EM64T/AMD64 are still required.

JP announced that the default package updating tool on i386 will go from the current "apt" (for RPM) to "yum", as is already the case for the 64bit platforms.

SLC4: non-IT status updates

NN pointed out that CASTOR client libraries are now required in order to go forward with the certification. JC asked whether these are CASTOR-1 or CASTOR-2 [they are CASTOR-2], and whether they would allow connecting to the current production services that run CASTOR-1 [which they should].

EC asked whether SLC4 will stay with the current 2.6.9-* kernel, as some external projects (iSCSI) have patches available only for newer releases. 2.6.9 is very likely to stay the default kernel in SLC4; this has the positive side effect that online drivers don't need to be ported to newer kernel releases continuously [for the specific case of iSCSI, the Red Hat kernel (and hence SLC4) should have a usable backported version]. EC also asked for the status of Quattor/SPMA for multiarch systems (like EM64T/AMD64). While this is currently only being tested, no major issues are expected, given that SPMA is rather close to the RPM libraries and does not do dependency resolution (a possible source of problems) anyway.

AP sent the following summary for LCG AA and CERNLIB:

For the 3.4.4 compiler on SLC3, the compiler installation in AFS which was used for the LCG AA packages is located at: /afs/cern.ch/sw/lcg/contrib/gcc/3.4.4/slc3_ia32_gcc344 (and, for the AMD 64 architecture: /afs/cern.ch/sw/lcg/contrib/gcc/3.4.4/slc3_amd64_gcc344). To set up the environment, please follow the instructions at the end of: https://twiki.cern.ch/twiki/bin/view/SPI/FindBuildServers
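As a rough illustration of how a build script might choose between the two AFS installations quoted above, here is a minimal Python sketch. The helper name `gcc344_prefix` and the mapping from machine names to the `ia32`/`amd64` labels are assumptions made for this example and are not part of the LCG tooling; the authoritative environment setup remains the one described on the twiki page.

```python
import platform

# Install prefixes exactly as quoted in the minutes (gcc 3.4.4 for SLC3, in AFS).
GCC344_PREFIXES = {
    "ia32": "/afs/cern.ch/sw/lcg/contrib/gcc/3.4.4/slc3_ia32_gcc344",
    "amd64": "/afs/cern.ch/sw/lcg/contrib/gcc/3.4.4/slc3_amd64_gcc344",
}

# Assumption: `uname -m` style machine names mapped onto the two labels above.
MACHINE_TO_ARCH = {
    "i386": "ia32",
    "i486": "ia32",
    "i586": "ia32",
    "i686": "ia32",
    "x86_64": "amd64",
    "amd64": "amd64",
}


def gcc344_prefix(machine=None):
    """Return the AFS install prefix for a machine type (hypothetical helper).

    If no machine name is given, the local one is used.
    """
    machine = (machine or platform.machine()).lower()
    try:
        arch = MACHINE_TO_ARCH[machine]
    except KeyError:
        # Any other architecture has no installation listed in the minutes.
        raise ValueError(f"no gcc 3.4.4 installation listed for {machine!r}")
    return GCC344_PREFIXES[arch]
```

Calling `gcc344_prefix("i686")` returns the ia32 path and `gcc344_prefix("x86_64")` the amd64 one; any other machine type raises `ValueError` rather than guessing a path.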

cernlib (2005 source with security patches from Debian folks):

  • built and tested ok on slc3 3.4.4 (using libshift from 3.2.3)
  • tests on slc4 ran ok with slc3/344 build
  • will be installed once libshift.so is available (few days)

LCG-AA

  • port to gcc 3.4.4 on slc3 done (ia32 only so far; amd64 still has problems)
  • don't expect problems for slc-4 on 32 bit arch
  • do expect priority to rise soon (experiments' interest)
  • guesstimate: builds will start to be done after CHEP (end Feb)

There was a quick discussion on whether the AF compiler (SLC3+gcc-3.4.4) would need to be deployed locally for performance reasons; agreement was that for the upcoming tests the distribution via AFS is expected to be fine, given that this is an interim test until the 'natively' recompiled LCG environment becomes available. This could be reviewed if this 'native' version is delayed or if actual performance severely impacts the tests.

EO summarized the status for ATLAS-offline: they require GAUDI, for which tests with the SLC3+gcc-3.4.4 compiler are under way. There are some concerns about the current priority of, and the little manpower available for, these tests.

JC confirmed that LHCb-offline were just waiting for the LCG software and the CASTOR client; they are ready to start tests.

NN explained the LHCb-online strategy: the embedded machines moved from 7.3 to SLC4 in one go. Non-visible machines and infrastructure have also gone to SLC4 already, except for Control/Testbeam (dependency on PVSS and CASTOR-client) and filter farms that need the offline code and will follow LHCb offline plans (these machines are running diskless and can be migrated very quickly if needed). The current "Quattorification" (through LinuxForControls/CNIC) of the online nodes is stuck on configuring diskless clients.

EC: CMS-online has only 4 test machines with SLC4 but is building a new test cluster. The kernel drivers haven't been ported yet to 2.6. Ongoing physics activity (cosmics) means that little may happen in this area until May.

[a short breakout discussion ensued on the requirement to have the "bigphysarea" patch in SLC4 (used by {ATLAS,CMS,ALICE}-online) - this patch is currently in the SLC4 kernel, but its usage should be reviewed; it is unclear whether it can be ported to future versions]

KS: ALICE-online has done some SLC4 tests, no major obstacles seen. Their drivers have been ported to 2.6.9, and both 32bit and 64bit versions are OK. They will re-test after the latest changes have been included. Not a formal statement (need FR), but ALICE-offline doesn't seem to have struck major problems either.

BB explained that ATLAS-online has done an initial port of its custom drivers to 32bit-SLC4, kernel 2.6, and that the higher-level code has been compiled against SLC3+gcc-3.4.4 LCG libraries (some issues found). They plan to have SLC3+gcc-3.4.4 as part of their code release in March 2006. When LCG on SLC4 is ready, SLC4+gcc3.4.4 will be included in the nightly builds, and should be available in a formal release by about May (tentative). Their goal is that SLC4 will be the standard platform for the September ATLAS-offline milestone. Detailed scheduling of SLC4 deployment during 2006 at point 1 will depend on requirements imposed by testing and cosmic running.

SLC3 lifetimes

The proposed timelines were agreed on:

  • support for SLC3 on ia64 and amd64 should stop at the end of 2006
  • support for SLC3 on i386 will stop end of October 2007 (current SL3 end of life)

AOB

  • "does CVS work with Kerberos?": yes; an incompatibility was found and should be resolved.
  • "will the HEPiX scripts go away?": not foreseen for now, they have been ported and are currently maintained. Documentation still needs to be provided for LHCb: both a high-level design document and a man page will be required. [the (ancient) documentation at http://cern.ch/wwwhepix/wg/scripts/www/shells/ is still largely valid, and Peter Kelemen's presentation to HEPiX 2003 has a high-level overview. The need for a man page has been acknowledged.]
  • next meeting: target is 3rd week of March