1. 12 Mar, 2021 4 commits
  2. 21 Feb, 2021 1 commit
  3. 12 Feb, 2021 4 commits
    • Merge ufs_fhtovp() into ffs_inotovp(). · 89fd61d9
      Konstantin Belousov authored
      The function alone has not been used for anything but ffs_fstovp() for a long time.
      
      Suggested by:	mckusick
      Reviewed by:	chs, mckusick
      Tested by:	pho
      MFC after:	2 weeks
      Sponsored by:	The FreeBSD Foundation
    • ffs_inotovp(): interface to convert (ino, gen) into alive vnode · 5952c86c
      Konstantin Belousov authored
      It generalizes the VFS_FHTOVP() interface, making it possible to fetch
      the inode without faking a filehandle.  It also adds the ffs flags argument,
      which allows control over the ffs_vgetf() call.
      
      Requested by:	mckusick
      Reviewed by:	chs, mckusick
      Tested by:	pho
      MFC after:	2 weeks
      Sponsored by:	The FreeBSD Foundation
    • ffs: Add FFSV_REPLACE_DOOMED flag to ffs_vgetf() · f16c26b1
      Konstantin Belousov authored
      It specifies that the caller requests a fresh non-doomed vnode.  If a doomed
      vnode is found in the hash, it should behave similarly to FFSV_REPLACE.
      Or, to put it differently, the flag is the same as FFSV_REPLACE, but only
      when the found hashed vnode is doomed.
      
      Reviewed by:	chs, mckusick
      Tested by:	pho
      MFC after:	2 weeks
      Sponsored by:	The FreeBSD Foundation
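
      The flag semantics described above can be sketched as a small
      decision function.  This is only an illustration: the flag values
      and the helper name should_replace_hashed() are hypothetical and
      are not taken from the FFS sources.

```c
#include <stdbool.h>

/* Hypothetical flag values; the real definitions live in the FFS headers. */
#define FFSV_REPLACE        0x0002
#define FFSV_REPLACE_DOOMED 0x0004

/*
 * Decide whether a vnode found in the hash should be replaced by a
 * freshly instantiated one (toy model of the rule in the commit text).
 */
static bool
should_replace_hashed(int ffs_flags, bool found_is_doomed)
{
	/* FFSV_REPLACE always replaces the hashed vnode. */
	if (ffs_flags & FFSV_REPLACE)
		return (true);
	/* FFSV_REPLACE_DOOMED replaces it only when it is doomed. */
	if ((ffs_flags & FFSV_REPLACE_DOOMED) && found_is_doomed)
		return (true);
	return (false);
}
```

      With FFSV_REPLACE_DOOMED alone, a live hashed vnode is kept and
      only a doomed one is replaced.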
    • buf SU hooks: track buf_start() calls with B_IOSTARTED flag · bf0db193
      Konstantin Belousov authored
      and only call buf_complete() if previously started.  Some error paths,
      like CoW failure, might skip buf_start() and do bufdone(), which itself
      calls buf_complete().
      
      Various SU handle_written_XXX() functions check that io was started
      and incomplete parts of the buffer data reverted before restoring them.
      This is a useful invariant, which B_IOSTARTED at the buffer layer lets us
      keep instead of changing check-and-panic into check-and-return.
      
      Reported by:	pho
      Reviewed by:	chs, mckusick
      Tested by:	pho
      MFC after:	2 weeks
      Sponsored by:	The FreeBSD Foundation
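
      A minimal userspace sketch of the idea, assuming a toy buffer
      type: the start hook records that it ran, and the completion
      hook fires only for buffers that were started.  The struct, the
      flag value and the toy_* names are illustrative and do not mirror
      the kernel definitions.

```c
#include <stdint.h>

#define TOY_B_IOSTARTED 0x00000001u	/* illustrative bit, not the kernel's */

struct toy_buf {
	uint32_t b_flags;
	int completions;	/* counts calls to the completion hook */
};

/* Start hook: remember that I/O was started on this buffer. */
static void
toy_buf_start(struct toy_buf *bp)
{
	bp->b_flags |= TOY_B_IOSTARTED;
}

/* Completion hook: stands in for the SU handle_written_XXX() work. */
static void
toy_buf_complete(struct toy_buf *bp)
{
	bp->completions++;
}

/* Done path: run the completion hook only if the start hook ran. */
static void
toy_bufdone(struct toy_buf *bp)
{
	if (bp->b_flags & TOY_B_IOSTARTED) {
		bp->b_flags &= ~TOY_B_IOSTARTED;
		toy_buf_complete(bp);
	}
}
```

      An error path that calls toy_bufdone() without a preceding
      toy_buf_start() leaves the completion hook untouched, so the
      hook can keep asserting that I/O was really started.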
  4. 30 Jan, 2021 1 commit
  5. 16 Jan, 2021 1 commit
    • Eliminate a locking panic when cleaning up UFS snapshots after a · 79a5c790
      Kirk McKusick authored
      disk failure.
      
      Each vnode has an embedded lock that controls access to its contents.
      However vnodes describing a UFS snapshot all share a single snapshot
      lock to coordinate their access and update. As part of mounting a
      UFS filesystem with snapshots, each of the vnodes describing a
      snapshot has its individual lock replaced with the snapshot lock.
      When the filesystem is unmounted the vnode's original lock is
      returned replacing the snapshot lock.
      
      When a disk fails while the UFS filesystem it contains is still
      mounted (for example when a thumb drive is removed) UFS forcibly
      unmounts the filesystem. The loss of the drive causes the GEOM
      subsystem to orphan the provider, but the consumer remains until
      the filesystem has finished with the unmount. Information describing
      the snapshot locks was being prematurely cleared during the orphaning
      causing the return of the snapshot vnode's original locks to fail.
      The fix is to not clear the needed information prematurely.
      
      Sponsored by: Netflix
  6. 12 Jan, 2021 1 commit
    • Eliminate lock order reversal in UFS ffs_unmount(). · 2d4422e7
      Kirk McKusick authored
      UFS uses a new "mntfs" pseudo file system which provides private
      device vnodes for a file system to safely access its disk device.
      The original device vnode is saved in um_odevvp to hold the exclusive
      lock on the device so that any attempts to open it for writing will
      fail. But it is otherwise unused and has its BO_NOBUFS flag set to
      enforce that file systems using mntfs vnodes do not accidentally
      use the original devfs vnode. When the file system is unmounted,
      um_odevvp is no longer needed and is released.
      
      The lock order reversal happens because device vnodes must be locked
      before UFS vnodes. During unmount, the root directory vnode lock
      is held. When calling vrele() on um_odevvp, vrele() attempts to
      exclusive lock um_odevvp causing the lock order reversal. The problem
      is eliminated by doing a non-blocking exclusive lock on um_odevvp
      which will always succeed since there are no users of um_odevvp.
      With um_odevvp locked, it can be released using vput which does not
      attempt to do a blocking exclusive lock request and thus avoids the
      lock order reversal.
      
      Sponsored by: Netflix
  7. 28 Nov, 2020 1 commit
    • Make MAXPHYS tunable. Bump MAXPHYS to 1M. · cd853791
      Konstantin Belousov authored
      Replace MAXPHYS with the runtime variable maxphys. It is initialized from
      MAXPHYS by default, but can also be adjusted with the tunable kern.maxphys.
      
      Make the b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
      cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
      atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
      The +1 for pbufs allows several pbuf consumers, among them vmapbuf(),
      to use unaligned buffers still sized to maxphys, esp. when such
      buffers come from userspace (*).  Overall, we save a significant amount
      of otherwise wasted memory in b_pages[] for buffer cache buffers,
      while bumping MAXPHYS to the desired high value.
      
      Eliminate all direct uses of the MAXPHYS constant in kernel and driver
      sources, except the place which initializes maxphys.  Some random (and
      arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
      straight.  Some drivers, which use MAXPHYS to size embedded structures,
      get a private MAXPHYS-like constant; their conversion is out of scope
      for this work.
      
      Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
      dev/siis, were either submitted by, or based on changes by mav.
      
      Suggested by: mav (*)
      Reviewed by:	imp, mav, mckusick, scottl (intermediate versions)
      Tested by:	pho
      Sponsored by:	The FreeBSD Foundation
      Differential revision:	https://reviews.freebsd.org/D27225
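
      The "+ 1" for pbufs can be checked with a little page arithmetic.
      The sketch below assumes 4 KiB pages and uses a hypothetical
      helper, pages_spanned(); TOY_ATOP() only imitates the kernel's
      atop() macro, and the 64 KiB figure used for a buffer cache
      buffer is an example value rather than something stated in the
      commit.

```c
#include <stddef.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  ((size_t)1 << TOY_PAGE_SHIFT)
/* Bytes to whole pages, in the spirit of the kernel's atop(). */
#define TOY_ATOP(x)    ((size_t)(x) >> TOY_PAGE_SHIFT)

/* Number of pages touched by the byte range [addr, addr + len), len > 0. */
static size_t
pages_spanned(size_t addr, size_t len)
{
	size_t first = addr >> TOY_PAGE_SHIFT;
	size_t last = (addr + len - 1) >> TOY_PAGE_SHIFT;

	return (last - first + 1);
}
```

      With maxphys at 1 MiB, an aligned buffer covers TOY_ATOP(maxphys)
      = 256 pages, while the same length starting one byte into a page
      touches 257 pages, hence the extra b_pages[] slot.  A 64 KiB
      buffer cache buffer needs only 16 slots.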
  8. 14 Nov, 2020 2 commits
    • Handle LoR in flush_pagedep_deps(). · 8a1509e4
      Konstantin Belousov authored
      When operating in SU or SU+J mode, ffs_syncvnode() might need to
      instantiate another vnode by inode number while owning the syncing vnode
      lock.  Typically this other vnode is the parent of our vnode, but due
      to renames occurring right before fsync (or during fsync when we drop
      the syncing vnode lock, see below) it might no longer be the parent.
      
      Moreover, the called function flush_pagedep_deps() needs to lock the other
      vnode while owning the lock for the vnode which owns the buffer for which
      the dependencies are flushed.  This creates another instance of the
      same LoR as was fixed in softdep_sync().
      
      Put the generic code for safe relocking into new SU helper
      get_parent_vp() and use it in flush_pagedep_deps().  The case for safe
      relocking of two vnodes with undefined lock order was extracted into
      vn helper vn_lock_pair().
      
      Due to the call sequence
           ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(),
      ffs_syncvnode() indicates with ERELOOKUP that the passed vnode was
      unlocked in the process, and can return ENOENT if the passed vnode
      was reclaimed.  All callers of the function were inspected.
      
      Because UFS namei lookups store auxiliary information about directory
      entry in in-memory directory inode, and this information is then used
      by UFS code that creates/removes directory entry in the actual
      mutating VOPs, it is critical that directory vnode lock is not dropped
      between lookup and VOP.  For softdep_prelink(), which ensures that
      later link/unlink operation can proceed without overflowing the
      journal, calls were moved to the place where it is safe to drop
      processing VOP because mutations are not yet applied.  Then, ERELOOKUP
      causes restart of the whole VFS operation (typically VFS syscall) at
      top level, including the re-lookup of the involved paths.  [Note that
      we already do the same restart for failing calls to vn_start_write(),
      so formally this patch does not introduce new behavior.]
      
      Similarly, unsafe calls to fsync in snapshot creation code were
      plugged.  A possible view on these failures is that it does not make
      sense to continue creating snapshot if the snapshot vnode was
      reclaimed due to forced unmount.
      
      It is possible that relock/ERELOOKUP situation occurs in
      ffs_truncate() called from ufs_inactive().  In this case, dropping the
      vnode lock is not safe.  Detect the situation with VI_DOINGINACT and
      reschedule inactivation by setting VI_OWEINACT.  ufs_inactive()
      rechecks VI_OWEINACT and avoids reclaiming vnode if truncation failed
      this way.
      
      In ffs_truncate(), allocation of the EOF block for partial truncation
      is re-done after vnode is synced, since we cannot leave the buffer
      locked through ffs_syncvnode().
      
      In collaboration with:	pho
      Reviewed by:	mckusick (previous version), markj
      Tested by:	markj (syzkaller), pho
      Sponsored by:	The FreeBSD Foundation
      Differential revision:	https://reviews.freebsd.org/D26136
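
      The top-level restart on ERELOOKUP can be pictured as a plain
      retry loop.  This is a userspace sketch under assumptions:
      run_with_relookup() and toy_op() are hypothetical, and the value
      given to ERELOOKUP follows my reading of FreeBSD's sys/errno.h
      and should be treated as illustrative.

```c
/* Kernel-internal pseudo-errno; the value here is an assumption. */
#define ERELOOKUP (-5)

typedef int (*vfs_op_t)(void *arg);

/*
 * Run a whole operation (path lookup plus the mutating step) and
 * restart it from the top for as long as it reports ERELOOKUP.
 */
static int
run_with_relookup(vfs_op_t op, void *arg, int *attempts)
{
	int error;

	*attempts = 0;
	do {
		(*attempts)++;
		error = op(arg);
	} while (error == ERELOOKUP);
	return (error);
}

/* Toy operation: asks for a restart until its counter reaches zero. */
static int
toy_op(void *arg)
{
	int *relocks_left = arg;

	if (*relocks_left > 0) {
		(*relocks_left)--;
		return (ERELOOKUP);
	}
	return (0);
}
```

      Because each attempt repeats the lookup, no lookup metadata from
      an attempt that dropped the directory vnode lock is ever reused.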
    • Add a framework that tracks exclusive vnode lock generation count for UFS. · 61846fc4
      Konstantin Belousov authored
      This count is memoized together with the lookup metadata in directory
      inode, and we assert that accesses to lookup metadata are done under
      the same lock generation as they were stored.  Enabled under DIAGNOSTICS.
      
      UFS saves additional data for parent dirent when doing lookup
      (i_offset, i_count, i_endoff), and this data is used later by VOPs
      operating on dirents.  If parent vnode exclusive lock is dropped and
      re-acquired between lookup and the VOP call, we corrupt directories.
      
      Framework asserts that corruption cannot occur that way, by tracking
      vnode lock generation counter.  Updates to inode dirent members also
      save the counter, while users compare current and saved counters
      values.
      
      Also, fix a case in ufs_lookup_ino() where i_offset and i_count could
      be updated under shared lock.  It is not a bug on its own since dvp
      i_offset results from such lookup cannot be used, but it causes false
      positive in the checker.
      
      In collaboration with:	pho
      Reviewed by:	mckusick (previous version), markj
      Tested by:	markj (syzkaller), pho
      Sponsored by:	The FreeBSD Foundation
      Differential revision:	https://reviews.freebsd.org/D26136
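
      A toy model of the generation check, assuming a simplified
      directory inode: every exclusive lock acquisition bumps a
      counter, storing lookup metadata memoizes the counter, and users
      of the metadata compare the two.  All toy_* names and the struct
      layout are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct toy_dirinode {
	uint64_t lock_gen;	/* bumped on each exclusive lock acquisition */
	uint64_t lookup_gen;	/* generation under which i_offset was stored */
	int64_t i_offset;	/* lookup metadata consumed by a later VOP */
};

/* Acquiring the exclusive lock starts a new generation. */
static void
toy_lock_excl(struct toy_dirinode *ip)
{
	ip->lock_gen++;
}

/* Lookup side: store the metadata together with the current generation. */
static void
toy_set_offset(struct toy_dirinode *ip, int64_t off)
{
	ip->i_offset = off;
	ip->lookup_gen = ip->lock_gen;
}

/* True when the lock was not dropped and re-taken since the lookup. */
static int
toy_offset_valid(const struct toy_dirinode *ip)
{
	return (ip->lookup_gen == ip->lock_gen);
}

/* VOP side: assert that the metadata is from the same lock generation. */
static int64_t
toy_get_offset(const struct toy_dirinode *ip)
{
	assert(toy_offset_valid(ip));
	return (ip->i_offset);
}
```

      Re-acquiring the lock between toy_set_offset() and
      toy_get_offset() makes the assertion fire, which is the class of
      bug the framework is meant to catch.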
  9. 25 Oct, 2020 1 commit
    • Various new check-hash checks have been added to the UFS filesystem · 996d40f9
      Kirk McKusick authored
      over various major releases. Superblock check hashes were added for
      the 12 release and cylinder-group and inode check hashes will appear
      in the 13 release.
      
      When a disk with a UFS filesystem is writably mounted, the kernel
      clears the feature flags for anything that it does not support. For
      example, if a UFS disk from a 12-stable kernel is mounted on an
      11-stable system, the 11-stable kernel will clear the flag in the
      filesystem superblock that indicates that superblock check-hashes
      are being maintained. Thus if the disk is later moved back to a
      12-stable system, the 12-stable system will know to ignore its
      incorrect check-hash.
      
      If the only filesystem modification done on the earlier kernel is
      to run a utility such as growfs(8) that modifies the superblock but
      neither updates the check-hash nor clears the feature flag indicating
      that it does not support the check-hash, the disk will fail to mount
      if it is moved back to its original newer kernel.
      
      This patch moves the code that clears the filesystem feature flags
      from the mount code (ffs_mountfs()) to the code that reads the
      superblock (ffs_sbget()). As ffs_sbget() is used by the kernel mount
      code and is imported into libufs(3), all the filesystem utilities
      will now also clear these flags when they make modifications to the
      filesystem.
      
      As suggested by John Baldwin, fsck_ffs(8) has been changed to accept
      and repair bad superblock check-hashes rather than refusing to run.
      This change allows fsck to recover filesystems that have been impacted
      by utilities older than those created after this change and is a
      sensible thing to do in any event.
      
      Reported by:  John Baldwin (jhb@)
      MFC after:    2 weeks
      Sponsored by: Netflix
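
      Clearing unsupported feature flags at superblock read time
      amounts to a mask.  The sketch below is an assumption-laden toy:
      the TOY_CK_* bits, struct toy_fs and toy_sbget_fixup() are
      hypothetical and only imitate the shape of the real superblock
      handling.

```c
#include <stdint.h>

/* Hypothetical feature bits for the three kinds of check hash. */
#define TOY_CK_SUPERBLOCK 0x01u
#define TOY_CK_CYLGRP     0x02u
#define TOY_CK_INODE      0x04u

struct toy_fs {
	uint32_t fs_metackhash;	/* check hashes maintained on this filesystem */
};

/*
 * Run by every reader of the superblock, kernel or utility: drop the
 * flags for features that this reader does not maintain.
 */
static void
toy_sbget_fixup(struct toy_fs *fs, uint32_t supported)
{
	fs->fs_metackhash &= supported;
}
```

      Since the mask is applied where the superblock is read, a
      utility that later writes the superblock back also records that
      it did not maintain the newer check hashes.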
  10. 08 Oct, 2020 1 commit
    • Do not leak B_BARRIER. · e1ef4c29
      Konstantin Belousov authored
      Normally when a buffer with B_BARRIER is written, the flag is cleared
      by g_vfs_strategy() when creating bio.  But in some cases FFS buffer
      might not reach g_vfs_strategy(), for instance when copy-on-write
      reports an error like ENOSPC.  In this case the buffer is returned to
      the dirty queue and might be written later by other means.  Among them,
      bdwrite() reasonably asserts that B_BARRIER is not set.
      
      In fact, the only current use of B_BARRIER is for lazy inode block
      initialization, where write of the new inode block is fenced against
      cylinder group write to mark inode as used.  The situation could be
      seen that we break dependency by updating cg without written out
      inode.  Practically since CoW was not able to find space for a copy of
      inode block, for the same reason cg group block write should fail.
      
      Reported by:	pho
      Discussed with:	chs, imp, mckusick
      Sponsored by:	The FreeBSD Foundation
      MFC after:	1 week
      Differential revision:	https://reviews.freebsd.org/D26511
  11. 01 Sep, 2020 1 commit
  12. 19 Aug, 2020 1 commit
  13. 16 Aug, 2020 1 commit
  14. 10 Aug, 2020 1 commit
  15. 25 Jul, 2020 1 commit
  16. 19 Jun, 2020 2 commits
    • The binary representation of the superblock (the fs structure) is written · 93440bbe
      Kirk McKusick authored
      out verbatim to the disk: see ffs_sbput() in sys/ufs/ffs/ffs_subr.c.
      It contains a pointer to the fs_summary_info structure. This pointer
      value inadvertently causes garbage to be stored. It is garbage because
      the pointer to the fs_summary_info structure is an address in the
      then-current stack or heap. Although a mere pointer does not reveal anything
      useful (like a part of a private key) to an attacker, garbage output
      deteriorates reproducibility.
      
      This commit zeros out the pointer to the fs_summary_info structure
      before writing out the superblock.
      
      Reviewed by:  kib
      Tested by:    Peter Holm
      PR:           246983
      Sponsored by: Netflix
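
      One way to keep an in-memory pointer out of a verbatim on-disk
      image is to clear it around the write and restore it afterwards.
      The sketch below assumes a toy superblock; struct toy_fs,
      struct toy_summary and toy_sbput() are hypothetical and are not
      the code from ffs_sbput().

```c
#include <string.h>

struct toy_summary {
	int dummy;
};

struct toy_fs {
	int fs_magic;
	struct toy_summary *fs_si;	/* meaningful in memory only */
};

/*
 * Copy the superblock verbatim into the output buffer, which must be
 * at least sizeof(struct toy_fs) bytes, with the pointer cleared.
 */
static void
toy_sbput(struct toy_fs *fs, unsigned char *out)
{
	struct toy_summary *saved = fs->fs_si;

	fs->fs_si = NULL;		/* keep the address out of the image */
	memcpy(out, fs, sizeof(*fs));
	fs->fs_si = saved;		/* the in-memory copy stays usable */
}
```

      The written image then no longer depends on where the summary
      structure happened to be allocated, which is what reproducible
      output needs.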
    • Move the pointers stored in the superblock into a separate · 34816cb9
      Kirk McKusick authored
      fs_summary_info structure. This change was originally done
      by the CheriBSD project as they need larger pointers that
      do not fit in the existing superblock.
      
      This cleanup of the superblock eases the task of the commit
      that immediately follows this one.
      
      Suggested by: brooks
      Reviewed by:  kib
      PR:           246983
      Sponsored by: Netflix
  17. 17 Jun, 2020 1 commit
  18. 14 Jun, 2020 1 commit
    • Fix export_args ex_flags field so that it is 64bits, the same as mnt_flags. · 1f7104d7
      Rick Macklem authored
      Since mnt_flags was upgraded to 64bits there has been a quirk in
      "struct export_args", since it holds a copy of mnt_flags
      in ex_flags, which is an "int" (32bits).
      This happens to currently work, since all the flag bits used in ex_flags are
      defined in the low order 32bits. However, new export flags cannot be defined.
      Also, ex_anon is a "struct xucred", which limits it to 16 additional groups.
      This patch revises "struct export_args" to make ex_flags 64bits and replaces
      ex_anon with ex_uid, ex_ngroups and ex_groups (which points to a
      groups list, so it can be malloc'd up to NGROUPS in size).
      This requires that the VFS_CHECKEXP() arguments change, so I also modified the
      last "secflavors" argument to be an array pointer, so that the
      secflavors could be copied in VFS_CHECKEXP() while the export entry is locked.
      (Without this patch VFS_CHECKEXP() returns a pointer to the secflavors
      array and then it is used after being unlocked, which is potentially
      a problem if the exports entry is changed.
      In practice this does not occur when mountd is run with "-S",
      but I think it is worth fixing.)
      
      This patch also deleted the vfs_oexport_conv() function, since
      do_mount_update() does the conversion, as required by the old vfs_cmount()
      calls.
      
      Reviewed by:	kib, freqlabs
      Relnotes:	yes
      Differential Revision:	https://reviews.freebsd.org/D25088
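
      The quirk with a 32-bit copy of 64-bit flags can be shown in a
      few lines.  This sketch is illustrative only: the struct and
      flag names are hypothetical, and uint32_t stands in for the
      original "int" so that the narrowing conversion is well defined.

```c
#include <stdint.h>

#define TOY_FLAG_LOW  (UINT64_C(1) << 3)	/* fits in the low 32 bits */
#define TOY_FLAG_HIGH (UINT64_C(1) << 35)	/* hypothetical future flag */

struct toy_export_old {
	uint32_t ex_flags;	/* 32-bit copy, as before the change */
};

struct toy_export_new {
	uint64_t ex_flags;	/* same width as the mount flags */
};

/* Old layout: bits above 31 are silently dropped by the narrowing. */
static void
toy_fill_old(struct toy_export_old *e, uint64_t mnt_flags)
{
	e->ex_flags = (uint32_t)mnt_flags;
}

/* New layout: every flag bit survives the copy. */
static void
toy_fill_new(struct toy_export_new *e, uint64_t mnt_flags)
{
	e->ex_flags = mnt_flags;
}
```

      With the old layout a flag defined above bit 31 would vanish
      during the copy, which is why new export flags could not be
      defined until the field was widened.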
  19. 25 May, 2020 1 commit
    • This commit enables a UFS filesystem to do a forcible unmount when · d79ff54b
      Chuck Silvers authored
      the underlying media fails or becomes inaccessible. For example
      when a USB flash memory card hosting a UFS filesystem is unplugged.
      
      The strategy for handling disk I/O errors when soft updates are
      enabled is to stop writing to the disk of the affected file system
      but continue to accept I/O requests and report that all future
      writes by the file system to that disk actually succeed. Then
      initiate an asynchronous forced unmount of the affected file system.
      
      There are two cases for disk I/O errors:
      
         - ENXIO, which means that this disk is gone and the lower layers
           of the storage stack already guarantee that no future I/O to
           this disk will succeed.
      
         - EIO (or most other errors), which means that this particular
           I/O request has failed but subsequent I/O requests to this
           disk might still succeed.
      
      For ENXIO, we can just clear the error and continue, because we
      know that the file system cannot affect the on-disk state after we
      see this error. For EIO or other errors, we arrange for the geom_vfs
      layer to reject all future I/O requests with ENXIO just like is
      done when the geom_vfs is orphaned. In both cases, the file system
      code can just clear the error and proceed with the forcible unmount.
      
      This new treatment of I/O errors is needed for writes of any buffer
      that is involved in a dependency. Most dependencies are described
      by a structure attached to the buffer's b_dep field. But some are
      created and processed as a result of the completion of the dependencies
      attached to the buffer.
      
      Clearing of some dependencies requires a read. For example if there
      is a dependency that requires an inode to be written, the disk block
      containing that inode must be read, the updated inode copied into
      place in that buffer, and the buffer then written back to disk.
      
      Often the needed buffer is already in memory and can be used. But
      if it needs to be read from the disk, the read will fail, so we
      fabricate a buffer full of zeroes and pretend that the read succeeded.
      This zero'ed buffer can be updated and written back to disk.
      
      The only case where a buffer full of zeros causes the code to do
      the wrong thing is when reading an inode buffer containing an inode
      that still has an inode dependency in memory that will reinitialize
      the effective link count (i_effnlink) based on the actual link count
      (i_nlink) that we read. To handle this case we now store the i_nlink
      value that we wrote in the inode dependency so that it can be
      restored into the zero'ed buffer thus keeping the tracking of the
      inode link count consistent.
      
      Because applications depend on knowing when an attempt to write
      their data to stable storage has failed, the fsync(2) and msync(2)
      system calls need to return errors if data fails to be written to
      stable storage. So these operations return ENXIO for every call
      made on files in a file system where we have otherwise been ignoring
      I/O errors.
      
      Coauthored by: mckusick
      Reviewed by:   kib
      Tested by:     Peter Holm
      Approved by:   mckusick (mentor)
      Sponsored by:  Netflix
      Differential Revision:  https://reviews.freebsd.org/D24088
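
      The two error cases and the fsync(2) behaviour described above
      can be summarised in a toy state machine.  Everything named
      toy_* below is hypothetical; only ENXIO and EIO come from
      <errno.h>, and the sketch ignores the actual dependency
      processing.

```c
#include <errno.h>
#include <stdbool.h>

struct toy_mount {
	bool reject_all_io;	/* lower layer now fails every request with ENXIO */
	bool unmount_scheduled;	/* asynchronous forced unmount was requested */
};

/*
 * Handle the result of a write of a buffer involved in a dependency.
 * The error is always cleared so that dependency processing can go on.
 */
static int
toy_handle_write_error(struct toy_mount *mp, int error)
{
	if (error == 0)
		return (0);
	/*
	 * ENXIO: the disk is gone and nothing can reach it any more.
	 * EIO and others: later I/O might still succeed, so ask the
	 * lower layer to reject all future requests.
	 */
	if (error != ENXIO)
		mp->reject_all_io = true;
	mp->unmount_scheduled = true;
	return (0);
}

/* fsync-like calls must still tell the application about the failure. */
static int
toy_fsync(const struct toy_mount *mp)
{
	return (mp->unmount_scheduled ? ENXIO : 0);
}
```

      In both cases the write error is swallowed internally, while
      applications that ask for stable storage still see ENXIO.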
  20. 10 Apr, 2020 1 commit
    • ufs: apply suspension for non-forced rw unmounts. · 71f26429
      Konstantin Belousov authored
      Forced rw unmounts and remounts from rw to ro already suspend
      the filesystem, which closes races with writers instantiating new vnodes
      while unmount flushes the queue.  The original intent of not including
      non-forced unmounts in this regime was to allow such unmounts to
      fail if a writer was active, but this did not work well.
      
      A similar change, but one causing all unmounts, even those involving only
      ro filesystems, to suspend, was proposed in D24088, but I believe that
      suspending ro is undesirable, and definitely spends CPU time.
      
      Reported by:	markj
      Discussed with:	chs, mckusick
      Tested by:	pho
      Sponsored by:	The FreeBSD Foundation
      MFC after:	1 week
  21. 06 Mar, 2020 1 commit
  22. 16 Feb, 2020 1 commit
  23. 10 Feb, 2020 2 commits
  24. 26 Jan, 2020 1 commit
  25. 14 Jan, 2020 1 commit
  26. 13 Jan, 2020 2 commits
  27. 03 Jan, 2020 1 commit
  28. 06 Oct, 2019 1 commit
  29. 04 Oct, 2019 1 commit
  30. 06 Sep, 2019 1 commit