Recovering from the Faulty CrowdStrike Update: Steps, Strategies, and the Importance of Contingency Planning
Impact and Real-World Examples
The recent faulty update from CrowdStrike caused significant disruptions for numerous organizations, including American Airlines and a range of financial and healthcare institutions.
American Airlines
American Airlines faced substantial operational disruptions due to the update. Bugcheck (blue screen) errors linked to the faulty Falcon Sensor update led to flight delays and cancellations affecting thousands of passengers. The airline had to implement emergency protocols quickly to manage the situation and ensure passenger safety while the technical issues were resolved.
Financial Institutions
A large banking corporation experienced widespread system crashes, resulting in temporary service outages and customer dissatisfaction. This incident underscored the critical need for robust contingency plans and reliable IT support to swiftly address such issues and maintain trust.
Healthcare Providers
A major healthcare network faced disruptions in their patient management systems due to the update. This situation not only threatened data security but also impacted patient care services. The healthcare network had to implement emergency protocols to ensure continuity of care while addressing the technical issues.
How It Happened
The issue stemmed from a faulty channel file update for CrowdStrike’s Falcon Sensor on Windows hosts, released on July 19, 2024. The problematic file, matching “C-00000291*.sys,” carried a timestamp of 0409 UTC, and hosts running this version experienced crashes and blue screen errors. The corrected version, timestamped 0527 UTC or later, resolved these issues, but not before many systems were impacted.
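For hosts that are still reachable, a quick way to check which version is present is to inspect the file’s timestamp. The following is a minimal sketch from a standard command prompt, not CrowdStrike-supplied tooling; note that dir displays local time, so convert before comparing against the UTC times above:

```
REM List any channel file 291 versions and their last-modified times.
REM dir shows local time; 0409 UTC corresponds to the faulty file and
REM 0527 UTC or later to the reverted (good) file.
dir "%WINDIR%\System32\drivers\CrowdStrike\C-00000291*.sys"
```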
Immediate Action Steps
For Windows Hosts Still Crashing:
- Reboot the Host:
- Restart the host to allow it to download the reverted (good) channel file.
- If the host crashes again, proceed to the next steps.
- Safe Mode or Windows Recovery Environment:
- Boot Windows into Safe Mode or the Windows Recovery Environment.
- Navigate to %WINDIR%\System32\drivers\CrowdStrike.
- Locate and delete the file matching “C-00000291*.sys” (example commands are sketched after this list).
- Reboot the host normally.
- Note: BitLocker-encrypted hosts may require the recovery key to enter Safe Mode or the Recovery Environment.
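For reference, the deletion step above can be carried out from a command prompt once the host is in Safe Mode or the Recovery Environment. This is a minimal sketch of that step; in the Recovery Environment, %WINDIR% points to the recovery RAM disk, so substitute the drive letter of the installed Windows volume instead.

```
REM Run from a command prompt in Safe Mode (adjust the path to the installed
REM Windows volume if you are in the Recovery Environment instead).
cd /d "%WINDIR%\System32\drivers\CrowdStrike"
dir C-00000291*.sys
del C-00000291*.sys
REM Then reboot the host normally.
```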
For Public Cloud or Similar Environments:
Option 1:
- Detach the operating system disk volume from the impacted virtual server.
- Create a snapshot or backup of the disk volume as a precaution.
- Attach/mount the volume to a new virtual server.
- Navigate to %WINDIR%\System32\drivers\CrowdStrike.
- Locate and delete the file matching “C-00000291*.sys”.
- Detach the volume from the new virtual server.
- Reattach the fixed volume to the impacted virtual server (an illustrative command sequence follows these options).
Option 2:
- Roll back to a snapshot taken before 0409 UTC on July 19, 2024.
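As one concrete illustration of Option 1, the sketch below uses the Azure CLI; the resource group, VM, and disk names are placeholders, and other cloud providers have equivalent operations. Rather than detaching the original OS disk, this variant repairs a copy created from the precautionary snapshot and swaps the repaired copy in as the impacted VM’s OS disk, which reaches the same end state.

```
REM Placeholders: example-rg, impacted-vm, impacted-os-disk, rescue-vm.
az vm deallocate --resource-group example-rg --name impacted-vm

REM Snapshot the impacted VM's OS disk, then create a working copy from it.
az snapshot create --resource-group example-rg --name impacted-os-snap --source impacted-os-disk
az disk create --resource-group example-rg --name impacted-os-fixed --source impacted-os-snap

REM Attach the copy to a healthy rescue VM; inside that VM, delete the file
REM matching Windows\System32\drivers\CrowdStrike\C-00000291*.sys on the
REM mounted volume, then detach the disk again.
az vm disk attach --resource-group example-rg --vm-name rescue-vm --name impacted-os-fixed
az vm disk detach --resource-group example-rg --vm-name rescue-vm --name impacted-os-fixed

REM Swap the repaired disk in as the impacted VM's OS disk and start it.
az vm update --resource-group example-rg --name impacted-vm --os-disk impacted-os-fixed
az vm start --resource-group example-rg --name impacted-vm
```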
For Azure Environments via Serial Console:
- Access Serial Console:
- Login to the Azure console.
- Navigate to Virtual Machines and select the affected VM.
- Click “Connect” > “More ways to Connect” > “Serial Console”.
- Command Execution:
- Once the Special Administration Console (SAC) has loaded, type cmd and press Enter.
- Enter ch -si 1 to switch to the newly created cmd channel.
- Press any key (e.g., space bar) and enter Administrator credentials.
- Safe Mode Configuration:
- Execute bcdedit /set {current} safeboot minimal, or bcdedit /set {current} safeboot network if Safe Mode with Networking is needed (only one safeboot value is active at a time).
- Restart the VM.
- Once the VM is in Safe Mode, delete the file matching “C-00000291*.sys” as described above, then run bcdedit /deletevalue {current} safeboot and restart so the VM boots normally again.
- Optional Verification:
- To confirm the boot state, run: wmic COMPUTERSYSTEM GET BootupState (the full command sequence is sketched below).
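Putting the serial console steps together, the sequence inside the SAC cmd channel looks roughly like the following. It is a sketch of the steps above rather than an official script; the faulty channel file still has to be deleted once the VM is in Safe Mode, and the safeboot flag cleared afterwards.

```
REM In the SAC cmd channel: configure Safe Mode and restart.
bcdedit /set {current} safeboot minimal
REM (or: bcdedit /set {current} safeboot network  for Safe Mode with Networking)
shutdown /r /t 0

REM After the VM restarts into Safe Mode, remove the faulty channel file.
del "%WINDIR%\System32\drivers\CrowdStrike\C-00000291*.sys"

REM Clear the safeboot flag and restart so the VM returns to a normal boot.
bcdedit /deletevalue {current} safeboot
shutdown /r /t 0

REM Optional: confirm the boot state.
wmic COMPUTERSYSTEM GET BootupState
```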
Importance of Contingency Plans
Effective recovery from incidents like this hinges on robust contingency planning. At a minimum, that planning should include:
- Regular Backups:
- Ensure frequent backups of critical systems and data. Regularly test these backups to confirm they can be restored efficiently.
- Testing Environment:
- Maintain a dedicated testing environment that mirrors your production setup. Test all updates here before deploying them live to catch potential issues early.
- Clear Communication Protocols:
- Establish clear communication channels to inform stakeholders promptly about any issues and the steps being taken. This transparency helps manage expectations and maintains trust.
- Incident Response Plan:
- Develop and regularly update an incident response plan outlining specific actions to take in the event of faulty updates or other cybersecurity incidents.
Choosing an IT Provider that Tests Updates Rigorously
The right IT provider can significantly reduce the risk of disruptions due to faulty updates. When selecting an IT provider, consider the following:
- Proven Track Record:
- Look for providers with a history of reliable and successful update deployments. Client testimonials and case studies can provide valuable insights into their performance.
- Robust Testing Procedures:
- Inquire about their update testing procedures. Providers should have a dedicated testing environment and a systematic approach to verifying updates before they reach your systems.
- Proactive Support:
- Choose a provider that offers proactive support and monitoring, capable of detecting and addressing potential issues before they escalate.
- Transparent Communication:
- Ensure the provider maintains transparent communication about update schedules, potential risks, and mitigation strategies.
Conclusion
Recovering from a faulty CrowdStrike update requires swift action, clear communication, and effective contingency plans. By understanding the specific recovery steps and the importance of having a reliable IT provider that rigorously tests updates, businesses can minimize disruptions and maintain robust security. Investing in comprehensive contingency planning and choosing a diligent IT partner can make all the difference in navigating the complexities of software updates in today’s digital landscape.
At Attentus Technologies, we thoroughly test all software we deploy for our customers to ensure maximum reliability. While we cannot guarantee 100% safety, our rigorous testing procedures and robust contingency plans are designed to prevent most issues and enable swift recovery when needed.