• 23 Posts
  • 1.24K Comments
Joined 2 years ago
Cake day: July 5th, 2023


  • Avid Amoeba@lemmy.ca to Selfhosted@lemmy.world · Syncthing alternatives
    1 minute ago

    Yeah. But it could be the board that burned it. Either way, dead RAM is bad news and something is likely up. If I had data corruption and the RAM didn’t show errors, I’d begin swapping components. If the machine is cheap and swapping components would be too expensive or impractical, I’d swap the whole machine for another, like a cheap second-hand Dell box.





  • Avid Amoeba@lemmy.ca to Selfhosted@lemmy.world · Syncthing alternatives
    17 hours ago

    It’s kind of embarrassing because I used to work as a service technician at a popular computer store in the 2000s, and Memtest86+ was standard fare for testing. I guess outside of OC, the shorter first pass truly was enough to spot bad RAM in the vast majority of cases. Plus multichannel interactions were not nearly as prevalent in the DDR1/2/3 days. I recently installed 4 DIMMs for 128GB on an AM5 machine just to discover that the 5600 RAM only boots at 3600 in a 4-DIMM config, as per AMD’s docs. I could force it higher, but without extra adjustment it can’t go beyond 4600 on this machine. Back in the day, different DIMMs, often with different chips, worked in 2- and 4-DIMM configs so long as they matched their JEDEC spec. backinmyday.jpg




  • Avid Amoeba@lemmy.ca to Selfhosted@lemmy.world · Syncthing alternatives
    17 hours ago

    That’s really weird. I’ve been using it for mobile-desktop-server-offsite sync for many years, with transfer sizes over 15TB, over WiFi, cellular, cable, fiber. I’ve never seen data corruption. Conflicts, sometimes. Permission issues, sometimes. Wiping something accidentally, sometimes. It’s even weirder because Syncthing computes hash values for the files it manages. I don’t know if it performs hash validation after copying remotely, but if not, validation can be forced manually, which would tell you what’s fucked so it can be pulled from the source again, if it still exists.

    Never mind, it verifies the result:

    When a block is copied or received from another device, its SHA256 hash is computed and compared with the expected value. If it matches the block is written to a temporary copy of the file, otherwise it is discarded and Syncthing tries to find another source for the block.

    According to this, if you have data corruption, it can only occur while the temporary file on your destination is being copied or moved to another directory, or it could occur on the source itself. Both of those scenarios are a cause for concern and would likely persist with any utility. Moving or copying a file from one location to another on a sane machine should not corrupt it. If I were you I’d make sure my server doesn’t eat bits. If it’s not the storage media, it could be bit rot or bad RAM.
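
To make the quoted behaviour concrete, here is a rough sketch of per-block hash verification. It is not Syncthing's actual code: the block size and the function names `block_hashes` and `corrupted_blocks` are made up for illustration. The only thing taken from the docs is the idea that each block's SHA-256 is compared with the expected value and a mismatching block is rejected.

```python
import hashlib

# Assumed block size, chosen only for this illustration.
BLOCK_SIZE = 128 * 1024


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Split data into fixed-size blocks and return each block's SHA-256 hex digest."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def corrupted_blocks(received: list, expected: list) -> list:
    """Return the indexes of received blocks whose hash differs from the expected one.

    In a real sync tool these are the blocks that would be discarded and
    fetched again from another source.
    """
    return [
        index
        for index, (block, want) in enumerate(zip(received, expected))
        if hashlib.sha256(block).hexdigest() != want
    ]
```

A single flipped bit anywhere in a block changes its digest, so corruption in transit gets caught before the block is written. What this cannot catch is a source file that was already corrupt when it was hashed, which is why the hardware matters.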

    Just in case everything seems fine, let me tell you what I dealt with. I had a Ryzen 5950X machine with 32GB of RAM. It had worked well since inception with no signs of RAM or data corruption issues. I test every new machine with Memtest86+. At some point I migrated the storage from Ext4 on LVMRAID to ZFS. All good. Then I wrote an alarm for Prometheus to tell me if there are any issues in ZFS.

    A week later I get an email about a ZFS error. I check the system: it says checksum errors, data has been corrected, applications unaffected, run a scrub to clear. I ran a scrub. A few more checksum errors found, all corrected, we’re clean now. There was a strong solar storm around that time, probably that. A couple of weeks later I get another email. Same symptoms, same procedure. No solar storm. Shit. Memtest86+: pass. Hm. A couple of weeks later I get another. Same thing. Memtest again: nothing. This went on for several months. Meanwhile the off-site backup sees nothing like that.

    While running Memtest on another machine I noticed that the passes following the first took longer than the first, a lot longer. I thought something might be wrong with that machine. I dug into it, got into Memtest’s source code and discovered that the first pass is shorter on purpose so that it quickly flags obviously bad RAM. Apparently if you want to detect less obvious issues, you have to run multiple passes. OK. Memtest the main server again, pass 1: OK, pass 2: OK, pass 3: OK, pass 4: FAIL. FUCK. Memtest each stick separately for 4 passes: OK. Memtest 2 at a time: OK. Memtest all 4: FAIL. Alright, now we know why ZFS keeps finding checksum errors. Long story short, this machine could not run this RAM in a 4-DIMM config. I replaced it with RAM that’s rated to run in a 4-DIMM config on that processor. No more checksum issues.

    If I was still running the older Ext4-on-LVMRAID storage stack, I would have caught NONE of these errors and it would have happily corrupted files here and there. In fact it likely did, and I have some corruption. Moral of the story: run many Memtest passes and use a checksumming storage stack like ZFS or Btrfs. I strongly recommend ZFS since its stripe RAID works fine, unlike Btrfs’s. If you don’t find bad RAM, start using ZFS today, even if you’re working with a single disk, and add redundancy when you can. Only after that, swap Syncthing for something else if you still somehow get corruption without ZFS’s knowledge. And if ZFS tells you that you have checksum errors, you likely have bad hardware.
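
For anyone who wants a similar alarm, here is a minimal sketch of the kind of check I mean. It is not my actual Prometheus setup. It assumes the usual `zpool status` table layout with a `NAME STATE READ WRITE CKSUM` header, and the function name `devices_with_errors` is made up. Counters are compared as strings because, as far as I know, large counts may be printed abbreviated.

```python
def devices_with_errors(status_text: str) -> dict:
    """Map device name -> (read, write, cksum) for every row with a non-zero counter.

    Expects the text printed by `zpool status`. Only the device table that
    follows the NAME/STATE/READ/WRITE/CKSUM header is inspected.
    """
    bad = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:5] == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            in_table = True  # device rows start on the next line
            continue
        if not in_table:
            continue
        if len(fields) < 5:
            in_table = False  # a blank or short line ends the device table
            continue
        name, _state, read, write, cksum = fields[:5]
        if (read, write, cksum) != ("0", "0", "0"):
            bad[name] = (read, write, cksum)
    return bad
```

Feed it the output of `zpool status` from a cron job or an exporter script and alert whenever the returned dict is not empty. A non-zero CKSUM counter on every disk at once is a hint to look at RAM, the board or the controller rather than at the individual drives.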