Package org.apache.spark.util
Class HadoopFSUtils
java.lang.Object
    org.apache.spark.util.HadoopFSUtils
Utility functions to simplify and speed up file listing.
Constructor Summary
Constructors
HadoopFSUtils()
Method Summary
static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>>
listFiles(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter)
    Lists a collection of paths recursively with a single API invocation.

static org.apache.spark.internal.Logging.LogStringContext
LogStringContext(scala.StringContext sc)

static org.slf4j.Logger
org$apache$spark$internal$Logging$$log_()

static void
org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)

static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>>
parallelListLeafFiles(SparkContext sc, scala.collection.immutable.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
    Lists a collection of paths recursively.

static boolean
shouldFilterOutPath(String path)
    Checks if we should filter out this path.

static boolean
shouldFilterOutPathName(String pathName)
    Checks if we should filter out this path name.
Constructor Details
HadoopFSUtils
public HadoopFSUtils() 
Method Details
parallelListLeafFiles
public static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.immutable.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)

Lists a collection of paths recursively. Picks the listing strategy adaptively depending on the number of paths to list. This may only be called on the driver.
Parameters:
sc - Spark context used to run parallel listing.
paths - Input paths to list.
hadoopConf - Hadoop configuration.
filter - Path filter used to exclude leaf files from the result.
ignoreMissingFiles - Ignore missing files that occur during recursive listing (e.g., due to race conditions).
ignoreLocality - Whether to fetch data locality info when listing leaf files. If false, this will return FileStatus without BlockLocation info.
parallelismThreshold - The threshold to enable parallelism. If the number of input paths is smaller than this value, this will fall back to sequential listing.
parallelismMax - The maximum parallelism for listing. If the number of input paths is larger than this value, parallelism will be throttled to this value to avoid generating too many tasks.
Returns:
for each input path, the set of discovered files for the path
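Example: a minimal Scala sketch of driver-side usage. The root paths, filter rule, and parallelism values below are illustrative assumptions, not API defaults, and the call site is assumed to have access to this Spark-internal utility.

    import org.apache.hadoop.fs.{Path, PathFilter}
    import org.apache.spark.SparkContext
    import org.apache.spark.util.HadoopFSUtils

    val sc: SparkContext = SparkContext.getOrCreate()

    // Hypothetical filter: skip temporary files; any PathFilter works here.
    val filter: PathFilter = (p: Path) => !p.getName.endsWith(".tmp")

    // Hypothetical input roots to list recursively.
    val roots = Seq(new Path("hdfs:///warehouse/t1"), new Path("hdfs:///warehouse/t2"))

    // Driver-only call; per the description above, the listing strategy is
    // chosen adaptively from the number of input paths and the two
    // parallelism knobs.
    val listed = HadoopFSUtils.parallelListLeafFiles(
      sc,
      roots,
      sc.hadoopConfiguration,
      filter,
      ignoreMissingFiles = true,   // tolerate files removed mid-listing
      ignoreLocality = false,      // keep BlockLocation info in the results
      parallelismThreshold = 32,   // fewer input paths than this => sequential listing
      parallelismMax = 10000)      // upper bound on listing parallelism

    listed.foreach { case (root, files) =>
      println(s"$root: ${files.size} leaf files")
    }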
 
listFiles
public static scala.collection.immutable.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.immutable.Seq<org.apache.hadoop.fs.FileStatus>>> listFiles(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter)

Lists a collection of paths recursively with a single API invocation. Like parallelListLeafFiles, this ignores FileNotFoundException on the given root path. This can be called on both the driver and executors.
Parameters:
path - a path to list
hadoopConf - Hadoop configuration
filter - Path filter used to exclude leaf files from the result
Returns:
the set of discovered files for the path
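Example: a minimal Scala sketch; the path and filter are illustrative assumptions. Per the description above, a missing root is tolerated rather than raising FileNotFoundException.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{Path, PathFilter}
    import org.apache.spark.util.HadoopFSUtils

    val hadoopConf = new Configuration()

    // Hypothetical filter: exclude hidden files from the result.
    val noHidden: PathFilter = (p: Path) => !p.getName.startsWith(".")

    // Usable on the driver or inside executor-side tasks; the tree under
    // the given root is listed with a single FileSystem API invocation.
    val listed = HadoopFSUtils.listFiles(new Path("/tmp/dataset"), hadoopConf, noHidden)
    listed.foreach { case (root, files) =>
      files.foreach(f => println(f.getPath))
    }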
 
shouldFilterOutPathName
public static boolean shouldFilterOutPathName(String pathName)

Checks if we should filter out this path name.
shouldFilterOutPath
public static boolean shouldFilterOutPath(String path)

Checks if we should filter out this path.
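Example: a small Scala sketch using both predicates to pre-filter candidate entries before a manual listing. The sample names are illustrative assumptions; the exact filtering rules are those of the implementation.

    import org.apache.spark.util.HadoopFSUtils

    // Hypothetical candidates; the predicates decide what a listing should skip.
    val names = Seq("part-00000", "_SUCCESS", ".hidden")
    val keptNames = names.filterNot(HadoopFSUtils.shouldFilterOutPathName)

    val paths = Seq("/data/part-00000", "/data/_temporary/0/task")
    val keptPaths = paths.filterNot(HadoopFSUtils.shouldFilterOutPath)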
org$apache$spark$internal$Logging$$log_
public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()
org$apache$spark$internal$Logging$$log__$eq
public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)
LogStringContext
public static org.apache.spark.internal.Logging.LogStringContext LogStringContext(scala.StringContext sc)  