Minutes draft2

DRAFT-2 Minutes of the LXCERT meeting 21.03.2006

Invited / Attendance

Abbrev  Name                     Affiliation             Present
AB      Alastair Bland           AB/CO-admin
AP      Andreas Pfeiffer         LCG Application Area    (excused)
BB      Bruce M Barnett          ATLAS-online            (replaced by MD, Marc Dobson)
BG      Benigno Gobbo            non-LHC experiments
EC      Eric Cano                CMS-online              (excused, represented by VI, Vincenzo Innocente)
EO      Emil Obreshkov           ATLAS-offline
FR      Fons Rademakers          ALICE-offline
JC      Joël Closier             LHCb-offline
JI      Jan Iven                 IT-services catchall    (chair)
JP      Jaroslaw Polok           general Desktops        (secretary)
KS      Klaus Schossmaier        ALICE-online
NMN     Nicolas De Metz-Noblat   AB/CO-development       (excused)
NN      Niko Neufeld             LHCb-online
VI      Vincenzo Innocente       CMS-offline
TK      Thorsten Kleinwort       IT PLUS/BATCH service

Date: Tuesday 21.03.2006, 16:00
Place: 40-R-D10

Agenda

  • SLC4 certification status
  • SLC4 certification decision
  • AOB

Summary

No major obstacles for SLC4 were reported; it is still foreseen to certify SLC4 at the end of March 2006.

SLC4 status - Tour de table:

Several updates were received before the meeting, including one from AP:
LCG AppArea software:
- all recent external packages (as for configuration LCG_42) have been built successfully in local disk space and will be installed soon in AFS using the system compiler (now 3.4.5). Building of projects will start afterwards. This is for both h/w architectures: ia32 and amd64

CERNLIB
- after some problems with libshift were cured, cernlib will be installed into AFS (/afs/cern.ch/sw/lcg/external/cernlib) in the next few days (again for the two h/w architectures as mentioned above).
There is still a warning coming from the makedepend step about a missing stddef.h which should be understood (does anyone have any idea?), but this should not delay the installation. (stddef.h seems to be in the compiler's lib/include directory for SLC3 but not for SLC4.)


VI: CMS-online has all their dependencies met, but will not port kernel drivers until after the magnet test is done (~June). OK to certify.
CMS-offline software compiled on SLC3 with gcc-3.2.3 and running on SLC4 (both on i386 and x86_64 in 32/64bit-mode) has passed the internal validation tests. This is good enough to allow CMS to run, OK to certify. "Native" tests with SLC4/gcc-3.4.5 are still pending on LCG/SPI software availability.
Interesting data points:

  • Fermi has deployed SLF4 for "all services (including Grid), except for the worker nodes" and has not reported major issues.
  • 32bit software compiled on SLC4 with gcc32 runs fine on SLC3.

TK: test SLC4 LXPLUS/LXBATCH machines still expected to be available at the end of the month.

SLC4 LXPLUS/LXBATCH machines are not expected to run NFS clients, so /hebdb and /fatmen will not be available there.

Expect some iterations on the package sets in the beginning; standard policy applies: supported packages from the "linuxsoft" repository can usually get added quickly (except in case of security policy violations), anything else needs discussion. No real showstoppers, OK to certify.

TK raised the question whether the "default" should eventually point to the 32bit or 64bit variant. Unclear: JI and JP prefer the 64bit version to get remaining 64bit issues discovered quickly, JC (and the majority of the participants) preferred 32bit to ease the transition. This still needs to be confirmed before the actual alias switch, based on user feedback after the LXPLUS machines become available.
AB asked whether the LEMON versions were substantially different between SLC3 and SLC4. This appears not to be the case, except for modules that parse /proc (which shows differences between 2.4 and 2.6 kernels) and are being rewritten. 
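As an illustration of the kind of /proc difference mentioned above: the aggregate "cpu" line in /proc/stat carries four counters on 2.4 kernels and additional ones on 2.6 kernels. The following is a minimal sketch of a parser that tolerates both layouts. It is not LEMON code; the function name parse_cpu_line and the choice of /proc/stat as the example are assumptions made for illustration only.

```python
# Counter names in the aggregate "cpu" line of /proc/stat.
# 2.4 kernels report only the first four; 2.6 kernels append more.
CPU_FIELDS = ("user", "nice", "system", "idle",
              "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return a dict of CPU counters from the text of /proc/stat.

    Counters that the running kernel does not report are set to None,
    so callers can distinguish "absent" from "zero".
    (Hypothetical helper, not part of any monitoring package.)
    """
    for line in stat_text.splitlines():
        parts = line.split()
        # The aggregate line is labelled exactly "cpu"; per-CPU lines
        # are labelled "cpu0", "cpu1", ... and are skipped here.
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:len(CPU_FIELDS) + 1]]
            padded = values + [None] * (len(CPU_FIELDS) - len(values))
            return dict(zip(CPU_FIELDS, padded))
    raise ValueError("no aggregate 'cpu' line found")
```

On a live machine the function would be called with the contents of /proc/stat, e.g. parse_cpu_line(open("/proc/stat").read()); a caller that needs iowait can then check for None instead of failing on the shorter 2.4 layout.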

BG gave a quick summary for the non-LHC experiments - not all have replied to his questions. The assumption is that non-replying experiments are either happy or have too little manpower to do proactive tests or have largely disappeared. NA60 tested in 32bit mode and is fine with SLC4, OPAL waits for CERNLIB (but programs recompile), COMPASS has also rebuilt their code and is also waiting for CERNLIB. COMPASS had a look at 64bit mode (on SLC3), analysis software appears OK, reconstruction had some 64bit issues.
In summary, the non-LHC experiments haven't found serious issues with SLC4, OK to certify.

KS reported that ALICE-online "looks good" after a period of intensive testing over the last days (some minor issues found and addressed), also tried it on a few desktops. Currently having trouble with /dev device files disappearing, will look into "udev" configuration (but in general everybody likes the consistency that udev brings).

MD listed the outstanding issues for ATLAS-online:

  • tg3 or bcm5700 driver - needs tests, but AB reported that tg3 is good enough for them on SLC4 [JP asked whether anybody was using the bcm5700 on SLC3 - turns out to be not the case, both AB and MD have their own versions on SLC3]
  • bigphysarea - longer-term issue, the driver is in SLC4 for now
Currently, SLC3+gcc-3.4.4 is part of the nightly recompilation cycle, some issues still need fixing there. MD hopes that this can be done by mid-April, then switch over to SLC4+gcc-3.4.5 and fix any new issues by the end of April. None of this will block the certification.


AB said that the most vital AB/CO servers (NFS file servers) are already running on SLC4, since the new LVM features are very interesting. They have seen several issues (nfsd crashes, kernel crashes, ext3 shrinking only while offline), but luckily started with low expectations ("..as long as not all servers crash at the same time..."). They noticed that
  • sshd/login speed is acceptable (no longer the 4sec delay they see with SLC3),
  • the proprietary ATI drivers seem OK,
  • NVidia drivers cause a crash after a few X11 sessions [JP remarked on general (lack of) stability of this driver on SMP kernels], including one that gets triggered via the new graphical boot screen ("rhgb") after reinstallations,
  • the default runlevel after Kickstart installs seems to be "3" even if X11 has been configured properly
  • LabView crashes (under investigation with IT/CO, worked fine under SLC3)
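The default runlevel mentioned in the list above is set by the "initdefault" entry in /etc/inittab (on Red Hat-family systems, runlevel 3 is the text console and 5 the graphical login). A minimal sketch for checking it on a freshly installed machine follows; the function name default_runlevel is hypothetical and the script is only an illustration, not a tool discussed at the meeting.

```python
def default_runlevel(inittab_text):
    """Return the default runlevel from the text of /etc/inittab.

    inittab entries have the form  id:runlevels:action:process ;
    the entry whose action is "initdefault" names the default runlevel.
    Returns None if no such entry exists.
    """
    for raw in inittab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        if len(fields) >= 3 and fields[2] == "initdefault":
            return fields[1]
    return None
```

Called as default_runlevel(open("/etc/inittab").read()), a result of "3" on a machine where X11 was configured would reproduce the observation reported above.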
Java WebStart (widely used in AB/CO) still needs testing, and they need the ORACLE instantclient and TNSNAMES.ORA updating mechanism [JP explained the new "cernonly" RPM repository]. AB also reported on some "diskless client" SLC4 setups, which triggered a short discussion since ATLAS-online and LHCb-online are doing similar things, LHCb in the scope of LinuxForControls. Some conceptual differences as to whether to use NFS-root vs RAMdisk images, and how to best update the image file, but all are invited to submit their use cases/specifications to LinuxForControls/Matthias Schröder.

AB inquired about the previous "go to SLC5" proposal (was changed after RHEL5 delays to "go to SLC4", see minutes of the previous meeting), and expressed interest in looking at RHEL5 betas (but timelines may not match the AB/CO "change window" in Dec/January). In conclusion, AB hoped that SLC4 would get certified ASAP.

ALICE-offline / FR certified their software in 32bit and 64bit mode, no major issues seen.

JC explained that LHCb-offline had had no time for significant tests (due to the overlap with DataChallenges). They will use SLC3 for the next DCs as well (no immediate need for SLC4), but have no reason to block the SLC4 certification.
 
EO: ATLAS-offline has started to include SLC3+gcc-3.4.4 into their nightly rebuild exercise, which has identified a few issues (currently being fixed, but at low priority -- several other issues due to a change in release policy). The existing SLC3 version has passed all validation tests under SLC4. Two "desktop" installation tests failed since both were done on machines with badly-supported hardware (Vobis800+Intel EtherExpress "AUI" [JP: workaround is known for this]). OK to certify.

NN confirmed the statements from the last meeting - they are ready to switch the remaining LHCb-online nodes to SLC4, servers already run it.  Currently looking into PVSS issues [update from W.Salter after the meeting: these are now resolved]. OK to certify.

For Linux support, JP explained that an "integrated" SLC43 release is currently under way (last chance to get updates and new products into the base installation) and should be available before the end of March. Any later updates will be released via the usual updates mechanism. JI hoped that an initial Kerberos5-aware login environment and SSH would be available, but this has already been delayed considerably.

Certification decision

All agree that SLC4 can be certified once the integrated SLC43 is available. The deadline of "end of March" will most likely be met.

Next Steps

JI asked the experiments to provide input for the transition from SLC3 to SLC4, by providing suitable time windows for migration and listing constraints (periods of stability). "Hidden" services (i.e. those not running user code) will be migrated over the summer, with short service interruptions scheduled as usual.

AOBs:

  • There was a discussion on whether new kernels should be made the "default" for the next reboot; this is a new update policy to minimize the number of desktops running old and insecure kernels. AB expressed concern that this would introduce an unforeseen change after a machine crashed (e.g. an AB machine with a hardware watchdog); some buggy instance of yum/apt had actually removed drivers for the old kernel, and the result was unusable machines.
    JP explained that some bugs in the updating scripts had been fixed, and advocated an even stricter policy (to reboot directly after a kernel update that got applied at boot time), so as to minimize the time the new kernel sits dormant on a machine.
    Action: JP and AB to come up with an example configuration that prevents kernel updates (and other problematic packages, such as the binary NVidia drivers) from going automatically to the AB/CO machines. LinuxForControls got mentioned as usual.
  • AB would like a mixed 8bit/24bit mode in X11 to support legacy (HP) applications.
  • AB asked whether the default AFS installation should limit AFS to use only the CERN cell, after several problems seen with file browsers that tried to enumerate all known AFS cells. Some workarounds for this problem exist (e.g. AFS dynroot, but this breaks some working scripts that access /afs/usr/local/).
  • JP reminded the participants of proposed SLC3 lifetimes - no objection, already agreed last time.
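
The action item on keeping kernel and NVidia driver updates away from the AB/CO machines amounts to matching pending updates against a list of exclusion patterns (yum, for instance, accepts shell-style wildcards in the "exclude" option of yum.conf). The sketch below only illustrates that matching logic; it is not the configuration JP and AB were asked to produce, and the pattern list, the package names and the function name split_updates are assumptions.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion list: shell-style patterns for packages that
# should never be updated automatically on a controls machine.
EXCLUDE_PATTERNS = ["kernel*", "nvidia*"]


def split_updates(candidates, patterns=EXCLUDE_PATTERNS):
    """Split pending package names into (allowed, held_back).

    A package is held back if its name matches any exclusion pattern;
    everything else may be updated automatically.
    """
    allowed, held_back = [], []
    for name in candidates:
        if any(fnmatchcase(name, pattern) for pattern in patterns):
            held_back.append(name)
        else:
            allowed.append(name)
    return allowed, held_back
```

With a pending list such as ["kernel-smp", "openssh", "glibc"], only the kernel package would be held back for a manually scheduled update, while the others go through the usual automatic mechanism.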