AWS Athena MSCK REPAIR TABLE Takes Too Long For A Small Dataset
Answer : MSCK REPAIR TABLE can be a costly operation, because it needs to scan the table's sub-tree in the file system (the S3 bucket). Multiple levels of partitioning can make it more costly, as it needs to traverse additional sub-directories. Assuming all potential combinations of partition values occur in the data set, this can turn into a combinatorial explosion. If you are adding new partitions to an existing table, then you may find that it's more efficient to run ALTER TABLE ADD PARTITION commands for the individual new partitions. This avoids the need to scan the table's entire sub-tree in the file system. It is less convenient than simply running MSCK REPAIR TABLE , but sometimes the optimization is worth it. A viable strategy is often to use MSCK REPAIR TABLE for an initial import, and then use ALTER TABLE ADD PARTITION for ongoing maintenance as new data gets added into the table. If it's really not feasible to use ALTER TABLE ADD PARTITION t...