Hi all, I recently (about a week ago) bought a new 2016 XPS 13. The specs are: i7, 16GB, 512GB SSD, QHD. I am dual booting Windows and Fedora Linux, and aside from the coil whine issue, which is pretty irritating (worse under Windows), the day before yesterday it started crashing under load. The first time was while playing a video (720p) in Linux, then while trying to install some software in Linux. Yesterday I reproduced the problem in Linux by playing a video again, then rebooted into Windows and got a crash while trying to install League of Legends.
When booting Linux back up, the system log says a Machine Check Exception - i.e. a hardware error - was detected. The only useful-looking line says, "Generic CACHE Level-2 Generic Error" which I believe indicates the CPU detected that the L2 Cache has become corrupted. Under Windows I got BSODs both times - the code however was not MACHINE_CHECK_EXCEPTION but, the first time, IRQL_NOT_LESS_OR_EQUAL and the second time DRIVER_IRQL_NOT_LESS_OR_EQUAL.
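For anyone wanting to check their own logs for the same thing, here is a minimal sketch of how machine check reports could be pulled out of a saved kernel log. The match patterns are an assumption (the exact strings the kernel prints vary between versions) and the function name is made up for illustration.

```python
import re

# Case-insensitive markers that usually accompany machine check reports.
# These patterns are an assumption, not an exhaustive or official list.
MCE_PATTERN = re.compile(r"machine check|hardware error|\bmce\b", re.IGNORECASE)


def find_mce_lines(log_lines):
    """Return the lines of a kernel log that look like machine check reports."""
    return [line.rstrip("\n") for line in log_lines if MCE_PATTERN.search(line)]
```

One way to use it would be to save the previous boot's kernel messages with `journalctl -k -b -1 > kernel.log` (this needs a persistent journal) and then call `find_mce_lines(open("kernel.log"))`.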
I am running in AHCI mode rather than RAID (this was necessary to dual boot Linux. To get Windows working I had to reboot it into safe mode to get it to enable the AHCI driver.) Apparently on older XPS models this caused BSODs, but with a different (more specific) error code - so I doubt it's that. It's worth noting that before Monday I was able to install Linux, a bunch of other software and watch a different 720p video without anything going wrong - I am fairly sure in this time I did not install anything which could affect this (for example, graphics drivers or OS updates.) This could be good luck, could be me forgetting something, or could mean something's broken internally.
I'm getting somewhat high temperatures while doing this. In Linux the system log sometimes has other machine check exceptions about the processor being throttled due to high heat. I see the cores get 85 degrees using the Linux temperature sensor tools, with the CPU overclocking to 2.9GHz, but when I checked temps until a crash, it crashed while the temperature had dipped slightly, and was sitting around 78. In Windows temperatures got a bit higher (or maybe the software is just different - one never knows I guess) measuring up to 90 degrees in RealTemp. These seem a bit hot for having been at load for, I guess, 15 minutes, but I don't know that for sure. At any rate, such a temperature alone sure doesn't explain a crash.
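A minimal sketch of how temperatures could be logged right up to the moment of a crash on Linux, so the last reading survives the reboot. It assumes the kernel exposes thermal zones under `/sys/class/thermal/thermal_zone*/`, where `temp` is reported in millidegrees Celsius; the function names and the `temps.log` file name are hypothetical.

```python
import glob
import os
import time


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"zone_dir:zone_type": degrees_C} for every readable thermal zone."""
    readings = {}
    for zone in sorted(glob.glob(os.path.join(base, "thermal_zone*"))):
        try:
            with open(os.path.join(zone, "temp")) as f:
                millideg = int(f.read().strip())
            with open(os.path.join(zone, "type")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue  # zone unreadable or not reporting a number; skip it
        # Include the directory name so two zones of the same type don't collide.
        readings[f"{os.path.basename(zone)}:{name}"] = millideg / 1000.0
    return readings


def log_until_crash(path="temps.log", interval=1.0, base="/sys/class/thermal"):
    """Append one timestamped line per interval; runs until interrupted."""
    with open(path, "a") as out:
        while True:
            temps = read_thermal_zones(base)
            line = " ".join(f"{k}={v:.1f}" for k, v in temps.items())
            out.write(f"{time.strftime('%H:%M:%S')} {line}\n")
            out.flush()
            os.fsync(out.fileno())  # push the line to disk before a possible hard crash
            time.sleep(interval)
```

Calling `log_until_crash()` in a terminal while the load test runs leaves a file whose last line shows roughly what the sensors read just before the machine went down.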
If anyone has any advice that'd be nice, although I don't think the standard stuff like "update drivers" and so on is going to be of any use since the same problem (it seems) occurs in two different OSs. I'm already on the currently-latest 1.0.7 BIOS. If anyone can say if they have the same problem or perform similar tests (install League of Legends from scratch in Windows, play video in Linux, or just any kind of load testing really - I will try in prime95 later) that'd be useful. Also if you could check temperatures under load to see if mine are unusual that'd be handy. I'm expecting I'll have to send it back though. If I do, does anyone know whether they swap out the SSD when they replace the motherboard (or whole unit)? Setting up Fedora has been quite some effort so it would be nice to know if I'll have to do a full backup to avoid doing it again. I just hope if it comes to that that they don't make the coil whine worse!
-
John Ratsey Moderately inquisitive Super Moderator
Do Dell's own diagnostics and the logs in the BIOS reveal anything?
If not, leave Memtest86 running overnight to give the RAM a thorough testing (much more thorough than the diagnostics). A couple of years ago I had some erratic BSODs and Memtest86 revealed a bad RAM module.
John -
Maybe take a read through the 9350 threads a bit as that is a (very) similar machine to your 9360.
The 9550 (similar but with 15" screen) threads show that the BIOS and some drivers took Dell more than six months to get generally sorted. As far as I am concerned, it was a disaster out of the gate. Even today, some with 4k screens are running older BIOS & drivers based on which "features" they need vs. "problems" they can live with. That said, once Dell got the BIOS and drivers generally sorted, I think it runs quite well.
The 9550 really starts to throttle around 80°C so there is a marginally useful datapoint. There were issues with lousy thermal paste application from the factory.
User Eason has a nice guide on undervolting - check that out on his signature. Note that Intel XTU has some bugs that CAUSE power throttling on some new Dell machines. Perhaps ThrottleStop is an option but not sure it works with Kaby Lake.
FYI - I dropped thermals on the similar 9550 some 17°C with ThrottleStop, CPU-GPU repaste, and replacement of VRAM thermal pads. Those are typical results for other users...
ThrottleStop can enable Intel SpeedStep which might help with thermals as well. Dell hasn't bothered to enable that nice feature. I have a thread here that discusses some of the basics of SpeedStep -
It could happen if the CPU is overheating. Repaste before doing anything drastic like letting a Dell butcher loose on it.
-
When you call Dell Tech Support, you'll need to jump through the hoops. You won't get much luck bypassing all of that by saying, "Look, I know what I'm doing; this is a hardware defect. Just put me through to Tier 83 tech support, and give me a new motherboard." My best advice is to just play dumb, and play along. They might have you run through some hardware diagnostics, sure. But when they get to the part where they tell you to do a data-destructive System Restore, just tell them that was the first thing you tried (since you went online to read help articles), and it didn't help.
If you're mailing your machine into a Dell service center, then definitely capture a drive image of your SSD. It's highly likely that the service center will do a System Restore to run diagnostics against a known-good software configuration (the factory image), as that is the only reliable way it can validate that the problem was fixed. -
Hi thanks for the quick replies. When I get home I will check Dell's diagnostics (do you mean the thing you get here?), BIOS logs (I didn't realise there were any...), memtest86 and ThrottleStop. I did look for some 9350 threads and didn't find anything that looked that relevant. Mostly stuff to do with AHCI mode or stuff fixed by BIOS updates.
I didn't get the on-site repair so I'll investigate repasting. It would require a little investment since I don't actually have any thermal paste or Torx screwdrivers(!) However, can I get a bit more info on why you suggest that bad cooling would cause crashes? My impression is that the throttling was good enough nowadays that CPUs basically never crashed or damaged themselves due to overheating. (Anecdote: a couple of years ago my desktop was routinely throttling itself while gaming and all that happened was my framerate would tank - I'd carry on playing for an hour with no other effects!) Maybe this is overly optimistic but it would be useful to know more - in particular, people discussing other XPS laptops *seem* to be talking about improving performance rather than fixing crashes.
The advice for dealing with Tech Support is very much appreciated. To save space I intend to take a file-level backup rather than a full image (and because I know how to do it reliably - I feel dd'ing the drive or using cloning software is a bit fragile especially when fiddling with drive modes, UEFI/GPT (technologies I don't know), multiple OSes and so on.)
I'll check in later - thanks again. -
Just as an FYI... repasting your CPU would most likely give Dell valid grounds for denying a motherboard replacement request under warranty. In the US, a vendor can deny a warranty request if they can show that actions / modifications performed by the consumer caused the damage being claimed. And it's not much of a stretch to claim that a customer screwing around with the CPU is related to messages about random low-level hardware failures and L2 cache errors.
Now, you're not likely to get caught. A lot of things need to go wrong for that to happen. But if you're sending a laptop in to a service center for a job that will require them to remove a heatsink and deal with re-applying thermal paste at some point in the repair process, it wouldn't be too hard for a keen-eyed repair tech to notice that the existing thermal paste is a different color than what they normally see and report it up to their supervisor, and then who knows what happens from there. -
Good tips GonZo and Kent,
There is a 600+ thread relating to repasting the similar 9550 that will be worth a view.
http://forum.notebookreview.com/thr...rature-observations-undervolt-repaste.785963/ -
Makes a refreshing change as most other companies would call foul and revoke the warranty. This is also why Dell have no warranty void stickers. -
Thanks for the comments on repasting.
OK so: I checked the BIOS Logs (nowt), ran memtest for one test (nowt), and ran the Dell diagnostics. This found a problem with the memory - seems weird given that it tests for a fraction of the time of even one memtest, but maybe they know something I don't? What do you think?
I got a new BSOD error code today: KMODE_EXCEPTION_NOT_HANDLED in NTFS.sys. This was again under load, although after quite a while at load and not at its hottest. I will re-run memtest overnight to see what it turns up. Dell will send me a new memory module, I guess for me to install here, under warranty, which sounds like it might just be a no-brainer. I'd just need to get the right screwdrivers and be away. I could apply a sensible quantity of thermal paste at the same time in case that's the issue.
Thoughts?
EDIT: Well I assumed since the page was offering to send out memory that the memory wasn't soldered on... everything I read online though suggests it is, so uh... wat? :S
EDIT2: OK memtest86+ ran overnight for 8 hours without any errors. However before that I re-ran Dell Diagnostics to check the exact error and confirm they are running a memtest-like test. Prime95 also fails the large FFT test, but the small FFT test at least passes two tests. (The large one fails the first test.) I'm going to see what happens if I undervolt in Intel XTU and retry but I'm not too hopeful. This also isn't a solution because XTU doesn't run in Linux as far as I know.
Last edited: Nov 17, 2016 -
Update: undervolting by 50mV (actually now I'm not sure of the precise units... it was ten times the minimum step Intel XTU permitted, anyway) reduced temperature quite a bit, but the system still crashed while running Prime95 in RAM-stress mode. I wasn't watching at the time so I don't know how long it took but it wasn't that long. In any case I thought the next step would be to contact Dell support. I think the website must have been wrong in saying they'd send me a new RAM module, but I'd like to see what they do offer me. If they just send out a new mainboard then that might still be a pretty good solution. The only thing is that replacing that will be more effort and complication than swapping out some RAM sticks.
I still have the option of repasting without getting anything replaced, but I am still under the impression that if the CPU is being kept under its maximum temperature, the system should be perfectly stable. Perhaps the RAM itself is overheating, but then that's a separate issue not solved by repasting the CPU. -
Just call Dell now.
-
Sorry, I didn't make it clear, I already contacted them, but am waiting for a reply. (I can't really call them while at work so it's via email for now.)
-
Dell didn't reply in the promised 1 working day. I'll ring them on Monday if they still haven't replied by email. Supposing they decide that the RAM is to blame, as seems reasonable, are they likely to send out a new motherboard for me to install, or will they get me to send it to them? If the former, is there some risk to me if I screw up the installation? The process doesn't seem too risky but there's always a slight danger of the screwdriver slipping and whacking something sensitive or whatever, which obviously increases the more involved the task. -
Yes, obviously. I was trying to get an idea of what to expect.
-
Motherboard replaced today (and repasted) - instability seems to be fixed, but the temperature seems to be the same. I'm less worried about the temperature, of course.
EDIT: The coil whine on the new motherboard is godawful on battery in Linux...
Last edited: Nov 22, 2016 -
Got the same issue. Memory is faulty on my 9360. Prime95 found errors and so did the Dell diagnostic software.
But Dell is unable to repair the unit because there are no spare parts. This has been going on for weeks. -
Demand the unit is replaced with a new one. -
Already asked and talked to a manager. They refuse to replace/refund. But I'm dealing with the French platform, which is not really performing well. I'm now telling them that I'm going to sue them if they do not take action.
Worst experience!!! I will NEVER buy a Dell again. -
Yeah you really need the resolutions team for this one m8, hope you paid with a credit card, I would be starting a chargeback about now.
Are you within your 1st 14 days? They have to refund in that timeframe.
Last edited: Feb 9, 2017
XPS 13 9360 - Crashes under load
Discussion in 'Dell XPS and Studio XPS' started by Fish-Face, Nov 16, 2016.