Hadoop MCQ Set 1
1. Serialization of string columns uses a ________ to form unique column values.
a) Footer
b) STRIPES
c) Dictionary
d) Index
Answer: c [Reason:] The dictionary is sorted to speed up predicate filtering and improve compression ratios.
2. Point out the correct statement :
a) The Avro file dump utility analyzes ORC files
b) Streams are compressed using a codec, which is specified as a table property for all streams in that table
c) The ODC file dump utility analyzes ORC files
d) All of the mentioned
Answer: b [Reason:] The codec can be Snappy, Zlib, or none.
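The ORC codec from question 2 is set as a table property. Below is a minimal sketch of creating such a table through the Hive JDBC driver; the URL, credentials, and table name are illustrative assumptions, not part of the quiz source:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OrcTableExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 driver
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = con.createStatement()) {
            // orc.compress selects the codec (SNAPPY, ZLIB, or NONE)
            // used for all streams in the table.
            stmt.execute("CREATE TABLE orc_demo (id INT, name STRING) "
                    + "STORED AS ORC TBLPROPERTIES ('orc.compress'='SNAPPY')");
        }
    }
}
```

Later sketches in these sets reuse the con and stmt variables from this example.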
3. _______ is a lossless data compression library that favors speed over compression ratio.
a) LOZ
b) LZO
c) OLZ
d) All of the mentioned
Answer: b [Reason:] LZO favors speed over compression ratio; lzo and lzop need to be installed on every node in the Hadoop cluster.
4. Which of the following will prefix the query string with parameters:
a) SET hive.exec.compress.output=false
b) SET hive.compress.output=false
c) SET hive.exec.compress.output=true
d) All of the mentioned
Answer: a [Reason:] Such SET parameters are prefixed to the query string; use the lzop command utility or custom Java code to generate an .lzo.index file for the .lzo files.
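In practice, such parameters are prefixed to a query to control output compression. A hedged sketch of producing LZO-compressed output, reusing stmt from the sketch above; the codec class comes from the separately installed hadoop-lzo library, and the output path is illustrative:

```java
// stmt is a java.sql.Statement on a HiveServer2 connection (see above).
stmt.execute("SET hive.exec.compress.output=true");
stmt.execute("SET mapreduce.output.fileoutputformat.compress=true");
stmt.execute("SET mapreduce.output.fileoutputformat.compress.codec="
        + "com.hadoop.compression.lzo.LzopCodec");
// Queries run after these SETs write .lzo files; index them afterwards
// (lzop utility or custom Java) so the files become splittable.
stmt.execute("INSERT OVERWRITE DIRECTORY '/tmp/lzo_out' SELECT * FROM orc_demo");
```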
5. Point out the wrong statement :
a) TIMESTAMP is only available starting with Hive 0.10.0
b) DECIMAL introduced in Hive 0.11.0 with a precision of 38 digits
c) Hive 0.13.0 introduced user definable precision and scale
d) All of the mentioned
Answer: a [Reason:] TIMESTAMP is available starting with Hive 0.8.0, so statement a is wrong.
6. Integral literals are assumed to be _________ by default.
a) SMALLINT
b) INT
c) BIGINT
d) TINYINT
Answer: b [Reason:] Integral literals are assumed to be INT by default, unless the number exceeds the range of INT (in which case it is interpreted as a BIGINT) or a postfix is present on the number: Y for TINYINT, S for SMALLINT, L for BIGINT.
7. Hive uses _____-style escaping within the strings.
a) C
b) Java
c) Python
d) Scala
Answer: a [Reason:] Hive uses C-style escaping (e.g. \n and \t); string literals can be expressed with either single quotes (') or double quotes (").
8. Which of the following statement will create column with varchar datatype ?
a) CREATE TABLE foo (bar CHAR(10))
b) CREATE TABLE foo (bar VARCHAR(10))
c) CREATE TABLE foo (bar CHARVARYING(10))
d) All of the mentioned
Answer: b [Reason:] The VARCHAR datatype was introduced in Hive 0.12.0.
9. _________ will overwrite any existing data in the table or partition.
a) INSERT WRITE
b) INSERT OVERWRITE
c) INSERT INTO
d) None of the mentioned
Answer: b [Reason:] INSERT OVERWRITE overwrites any existing data; INSERT INTO, by contrast, appends to the table or partition, keeping the existing data intact.
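A short sketch contrasting the two statements, reusing stmt from the Set 1 example; the table is the VARCHAR example from question 8, and Hive 0.14+ is assumed for INSERT ... VALUES:

```java
stmt.execute("CREATE TABLE foo (bar VARCHAR(10))");           // question 8
stmt.execute("INSERT INTO TABLE foo VALUES ('first')");       // appends
stmt.execute("INSERT OVERWRITE TABLE foo SELECT 'replaced'"); // replaces all rows
```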
10. Hive does not support literals for ______ types.
a) Scalar
b) Complex
c) INT
d) CHAR
Answer: b [Reason:] Hive has no literal syntax for complex types, so it is not possible to use them in INSERT INTO…VALUES clauses.
Hadoop MCQ Set 2
1. Hive-specific commands can be run from Beeline when the Hive _______ driver is used.
a) ODBC
b) JDBC
c) ODBC-JDBC
d) All of the Mentioned
Answer: b [Reason:] Hive-specific commands are the same as Hive CLI commands.
2. Point out the correct statement :
a) --helpusage displays a usage message
b) The JDBC connection URL format has the prefix jdbc:hive:
c) Starting with Hive 0.14, there are improved SV output formats
d) None of the mentioned
Answer: c [Reason:] The output formats available are DSV, CSV2, and TSV2.
3. _________ reduce the amount of informational messages displayed (true) or not (false).
a) --silent=[true/false]
b) --autosave=[true/false]
c) --force=[true/false]
d) All of the mentioned
Answer: a [Reason:] It also stops displaying the log messages for the query from HiveServer2.
4. Which of the following is used to set transaction isolation level ?
a) --incremental=[true/false]
b) --isolation=LEVEL
c) --force=[true/false]
d) --truncateTable=[true/false]
Answer: b [Reason:] Set the transaction isolation level to TRANSACTION_READ_COMMITTED or TRANSACTION_SERIALIZABLE.
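Beeline's --isolation flag maps onto the standard JDBC isolation constants. The programmatic equivalent on a HiveServer2 connection would be the following sketch; whether the driver honors it depends on the Hive version:

```java
// con is a java.sql.Connection to HiveServer2, as in the Set 1 sketch.
// Equivalent of: beeline --isolation=TRANSACTION_READ_COMMITTED
con.setTransactionIsolation(java.sql.Connection.TRANSACTION_READ_COMMITTED);
// or: java.sql.Connection.TRANSACTION_SERIALIZABLE
```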
5. Point out the wrong statement :
a) HiveServer2 has a new JDBC driver
b) CSV and TSV output formats are maintained for forward compatibility
c) HiveServer2 supports both embedded and remote access to HiveServer2
d) None of the mentioned
Answer: b [Reason:] CSV and TSV output formats are maintained for backward compatibility.
6. The ________ allows users to read or write Avro data as Hive tables.
a) AvroSerde
b) HiveSerde
c) SqlSerde
d) None of the mentioned
Answer: a [Reason:] AvroSerde understands compressed Avro files.
7. Starting in Hive _______, the Avro schema can be inferred from the Hive table schema.
a) 0.14
b) 0.12
c) 0.13
d) 0.11
Answer: a [Reason:] Starting in Hive 0.14, columns can be added to an Avro-backed Hive table using the ALTER TABLE statement.
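A sketch of both behaviors, reusing stmt from the Set 1 example; the table and column names are illustrative, and Hive 0.14+ is assumed:

```java
// Hive 0.14+ infers the Avro schema from the Hive table schema, so no
// explicit avro.schema.literal/avro.schema.url property is needed.
stmt.execute("CREATE TABLE episodes (title STRING, air_date STRING) STORED AS AVRO");
// Columns can be added to an Avro-backed table afterwards (Hive 0.14+).
stmt.execute("ALTER TABLE episodes ADD COLUMNS (doctor INT)");
```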
8. The AvroSerde has been built and tested against Hive 0.9.1 and later, and uses Avro _______ as of Hive 0.13 and 0.14.
a) 1.7.4
b) 1.7.2
c) 1.7.3
d) None of the mentioned
Answer: d [Reason:] The AvroSerde uses Avro 1.7.5.
9. Which of the following Avro data types is not directly supported by Hive ?
a) map
b) record
c) string
d) enum
Answer: d [Reason:] Hive has no concept of enums; the AvroSerde converts Avro enum fields to Hive strings.
10. Which of the following data types is converted to an Array prior to Hive 0.12.0 ?
a) map
b) long
c) float
d) bytes
Answer: d [Reason:] Bytes are converted to Array[smallint] prior to Hive 0.12.0.
Hadoop MCQ Set 3
1. A _________ grants initial permissions, and subsequently a user may or may not be given the permission to grant/revoke permissions.
a) keyspace
b) superuser
c) sudouser
d) none of the mentioned
Answer: b [Reason:] Object permission management is based on internal authorization.
2. Point out the correct statement :
a) Cassandra accommodates expensive, consumer SSDs extremely well
b) Cassandra re-writes or re-reads existing data, and never overwrites the rows in place
c) Cassandra uses a storage structure similar to a Log-Structured Merge Tree
d) None of the mentioned
Answer: c [Reason:] A log-structured engine that avoids overwrites and uses sequential IO to update data is essential for writing to hard disks (HDD) and solid-state disks (SSD).
3. __________ is one of many possible IAuthorizer implementations, and the one that stores permissions in the system_auth.permissions table to support all authorization-related CQL statements.
a) CassandraAuth
b) CassandraAuthorizer
c) CassAuthorizer
d) All of the mentioned
Answer: b [Reason:] Configuration consists mainly of changing the authorizer option in the cassandra.yaml to use the CassandraAuthorizer.
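Putting questions 1 and 3 together: with authorizer: CassandraAuthorizer enabled in cassandra.yaml, a superuser grants the initial permissions. A hedged sketch using the DataStax Java driver (3.x-era API); contact point, credentials, role, and keyspace names are illustrative, and CREATE ROLE assumes Cassandra 2.2+:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CassandraAuthExample {
    public static void main(String[] args) {
        // Prerequisite (in cassandra.yaml, not in code):
        //   authorizer: CassandraAuthorizer
        try (Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withCredentials("cassandra", "cassandra") // default superuser
                .build();
             Session session = cluster.connect()) {
            // The superuser grants initial permissions; granting the AUTHORIZE
            // permission would let alice grant/revoke permissions herself.
            session.execute("CREATE ROLE alice WITH PASSWORD = 's3cret' AND LOGIN = true");
            session.execute("GRANT SELECT ON KEYSPACE demo TO alice");
        }
    }
}
```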
4. Cassandra creates a ___________ for each table, which allows you to symlink a table to a chosen physical drive or data volume.
a) directory
b) subdirectory
c) domain
d) path
Answer: b [Reason:] The new file name format includes the keyspace name to distinguish which keyspace and table the file contains when streaming or bulk loading data.
5. Point out the wrong statement :
a) Cassandra provides fine-grained control of table storage on disk, writing tables to disk using separate table directories within each keyspace directory
b) The hinted handoff feature and Cassandra conformance and conformance to the ACID
c) Client utilities and application programming interfaces (APIs) for developing applications for data storage and retrieval are available
d) None of the mentioned
Answer: b [Reason:] Statement b is garbled; the documentation discusses the hinted handoff feature and Cassandra's conformance and non-conformance to the ACID (atomic, consistent, isolated, durable) properties.
6. When ___________ contents exceed a configurable threshold, the memtable data, which includes indexes, is put in a queue to be flushed to disk.
a) subtable
b) memtable
c) intable
d) memorytable
Answer: b [Reason:] You can configure the length of the queue by changing memtable_flush_queue_size in the cassandra.yaml.
7. Data in the commit log is purged after its corresponding data in the memtable is flushed to an _________.
a) SSHables
b) SSTable
c) Memtables
d) None of the mentioned
Answer: b [Reason:] SSTables are immutable, not written to again after the memtable is flushed.
8. For each SSTable, Cassandra creates a _________ index.
a) memory
b) partition
c) in memory
d) all of the mentioned
Answer: b [Reason:] The partition index is a list of partition keys and the start position of rows in the data file (on disk).
9. Cassandra marks data to be deleted using :
a) tombstone
b) combstone
c) tenstone
d) none of the mentioned
Answer: a [Reason:] Cassandra also does not delete in place because the SSTable is immutable.
10. Tombstones exist for a configured time period defined by the _______ value set on the table.
a) gc_grace_minutes
b) gc_grace_time
c) gc_grace_seconds
d) gc_grace_hours
Answer: c [Reason:] During compaction, there is a temporary spike in disk space usage and disk I/O because the old and new SSTables co-exist.
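Since gc_grace_seconds is a table property, the grace period can be changed with a one-line CQL statement; a sketch reusing session from the earlier driver example, with an illustrative table name:

```java
// Tombstones in demo.events become eligible for collection after one day.
session.execute("ALTER TABLE demo.events WITH gc_grace_seconds = 86400");
```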
Hadoop MCQ Set 4
1. Mahout provides ____________ libraries for common and primitive Java collections.
a) Java
b) Javascript
c) Perl
d) Python
Answer: a [Reason:] Maths operations are focused on linear algebra and statistics.
2. Point out the correct statement :
a) Mahout is distributed under a commercially friendly Apache Software license
b) Mahout is a library of scalable machine-learning algorithms, implemented on top of Apache Hadoop® and using the MapReduce paradigm
c) Apache Mahout is a project of the Apache Software Foundation to produce free implementations of distributed or otherwise scalable machine learning algorithms
d) All of the mentioned
Answer: d [Reason:] The goal of Mahout is to build a vibrant, responsive, diverse community to facilitate discussions not only on the project itself but also on potential use cases.
3. _________ does not restrict contributions to Hadoop based implementations.
a) Mahout
b) Oozie
c) Impala
d) All of the mentioned
Answer: a [Reason:] Mahout is distributed under a commercially friendly Apache Software license.
4. Mahout provides an implementation of a ______________ identification algorithm which scores collocations using log-likelihood ratio.
a) collocation
b) compaction
c) collection
d) none of the mentioned
Answer: a [Reason:] The log-likelihood score indicates the relative usefulness of a collocation with regard to other term combinations in the text.
5. Point out the wrong statement :
a) ‘Taste’ collaborative-filtering recommender component of Mahout was originally a separate project and can run stand-alone without Hadoop
b) Integration of Mahout with initiatives such as the Pregel-like Giraph are actively under discussion
c) Calculating the LLR is very straightforward
d) None of the mentioned
Answer: d [Reason:] There are a couple of ways to run the LLR-based collocation algorithm in Mahout.
6. The tokens are passed through a Lucene ____________ to produce NGrams of the desired length.
a) ShngleFil
b) ShingleFilter
c) SingleFilter
d) Collfilter
Answer: b [Reason:] The tools in which the collocation identification algorithm is embedded either consume tokenized text as input or let you specify an implementation of the Lucene Analyzer class to perform tokenization, in order to form ngrams.
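A standalone sketch of what a ShingleFilter does, written against the Lucene 5+ analysis API; the tokenizer choice and input text are illustrative assumptions:

```java
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ShingleExample {
    public static void main(String[] args) throws Exception {
        Tokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("the quick brown fox"));
        // Emit 2-grams only; by default unigrams would also be emitted.
        ShingleFilter shingles = new ShingleFilter(tokenizer, 2, 2);
        shingles.setOutputUnigrams(false);
        CharTermAttribute term = shingles.addAttribute(CharTermAttribute.class);
        shingles.reset();
        while (shingles.incrementToken()) {
            System.out.println(term.toString()); // "the quick", "quick brown", "brown fox"
        }
        shingles.end();
        shingles.close();
    }
}
```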
7. The _________ collocation identifier is integrated into the process that is used to create vectors from sequence files of text keys and values.
a) lbr
b) lcr
c) llr
d) lar
Answer: c [Reason:] The --minLLR option can be used to control the cutoff that prevents collocations below the specified LLR score from being emitted.
8. ____________ generates NGrams and counts frequencies for ngrams, head and tail subgrams.
a) CollocationDriver
b) CollocDriver
c) CarDriver
d) All of the mentioned
Answer: b [Reason:] Each call to the mapper passes in the full set of tokens for the corresponding document using a StringTuple.
9. A key of type ___________ is generated which is used later to join ngrams with their heads and tails in the reducer phase.
a) GramKey
b) Primary
c) Secondary
d) None of the mentioned
Answer: a [Reason:] The GramKey is a composite key made up of a string n-gram fragment as the primary key and a secondary key used for grouping and sorting in the reduce phase.
10. ________ phase merges the counts for unique ngrams or ngram fragments across multiple documents.
a) CollocCombiner
b) CollocReducer
c) CollocMerger
d) None of the mentioned
Answer: a [Reason:] The combiner treats the entire GramKey as the key, so identical tuples from separate documents are passed into a single call to the combiner's reduce method, where their frequencies are summed and a single tuple is passed out via the collector.
Hadoop MCQ Set 5
1. __________ is a REST API for HCatalog.
a) WebHCat
b) WbHCat
c) InpHCat
d) None of the mentioned
Answer: a [Reason:] REST stands for “representational state transfer”, a style of API based on HTTP verbs.
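A minimal sketch of calling WebHCat's status endpoint; the host is illustrative, and 50111 is WebHCat's default port. The /templeton/v1 path reflects WebHCat's original name, Templeton (see question 5 below):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHCatStatus {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:50111/templeton/v1/status");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            System.out.println(in.readLine()); // e.g. {"status":"ok","version":"v1"}
        }
    }
}
```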
2. Point out the correct statement :
a) There is no guaranteed read consistency when a partition is dropped
b) Unpartitioned tables effectively have one default partition that must be created at table creation time
c) Once a partition is created, records cannot be added to it, removed from it, or updated in it
d) All of the mentioned
Answer: d [Reason:] Partitioned tables have no partitions at create time.
3. With HCatalog, _________ does not need to modify the table structure.
a) Partition
b) Columns
c) Robert
d) All of the mentioned
Answer: c [Reason:] Without HCatalog, Robert must alter the table to add the required partition.
4. Sally in data processing uses __________ to cleanse and prepare the data.
a) Pig
b) Hive
c) HCatalog
d) Impala
Answer: a [Reason:] Without HCatalog, Sally must be manually informed by Joe when data is available, or poll on HDFS.
5. Point out the wrong statement :
a) The original name of WebHCat was Templeton
b) Robert in client management uses Hive to analyze his clients’ results
c) With HCatalog, HCatalog cannot send a JMS message that data is available
d) All of the mentioned
Answer: c [Reason:] With HCatalog, HCatalog can send a JMS message that data is available; the Pig job can then be started.
6. For ___________ partitioning jobs, simply specifying a custom directory is not good enough.
a) static
b) semi cluster
c) dynamic
d) all of the mentioned
Answer: c [Reason:] A dynamic partitioning job writes to multiple destinations, so instead of a directory specification it requires a pattern specification.
7. ___________ property allows us to specify a custom dir location pattern for all the writes, and will interpolate each variable.
a) hcat.dynamic.partitioning.custom.pattern
b) hcat.append.limit
c) hcat.pig.storer.external.location
d) hcatalog.hive.client.cache.expiry.time
Answer: a [Reason:] hcat.append.limit allows an HCatalog user to specify a custom append limit.
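A hedged sketch of setting this property in a job's Hadoop configuration; the ${year}/${month} pattern and its partition keys are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Each ${...} variable is interpolated with the value of the corresponding
// (hypothetical) partition key at write time.
conf.set("hcat.dynamic.partitioning.custom.pattern", "${year}/${month}");
```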
8. HCatalog maintains a cache of _________ to talk to the metastore.
a) HiveServer
b) HiveClients
c) HCatClients
d) All of the mentioned
Answer: b [Reason:] HCatalog manages a cache of 1 metastore client per thread, defaulting to an expiry of 120 seconds.
9. On the write side, the user is expected to pass in valid _________ with correctly formed data.
a) HRecords
b) HCatRecos
c) HCatRecords
d) None of the mentioned
Answer: c [Reason:] Where an HCat user (such as some older versions of Pig) does not support all the data types supported by Hive, a few config parameters are provided to handle data promotions/conversions so that they can still read data through HCatalog.
10. A float parameter, which defaults to 0.0001f, means we can tolerate 1 error every __________ rows.
a) 1000
b) 10000
c) 1000000
d) None of the mentioned
Answer: b [Reason:] The hcat.input.bad.record.threshold property controls when to throw an error on encountering bad records; 0.0001 corresponds to 1 error per 10,000 rows.
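A sketch of setting the threshold explicitly; 0.0001f is the default value:

```java
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Tolerate roughly 1 bad record per 10,000 input rows before erroring out.
conf.setFloat("hcat.input.bad.record.threshold", 0.0001f);
```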
Hadoop MCQ Set 6
1. ________ systems are scale-out file-based (HDD) systems moving to more uses of memory in the nodes.
a) NoSQL
b) NewSQL
c) SQL
d) All of the mentioned
Answer: a [Reason:] NoSQL systems make the most sense whenever the application is based on data with varying data types and the data can be stored in key-value notation.
2. Point out the correct statement :
a) Hadoop is ideal for the analytical, post-operational, data-warehouse-ish type of workload
b) HDFS runs on a small cluster of commodity-class nodes
c) NEWSQL is frequently the collection point for big data
d) None of the mentioned
Answer: a [Reason:] Together with a relational data warehouse, Hadoop can form a very effective data warehouse infrastructure.
3. Hadoop data is not sequenced and is in 64 MB to 256 MB block sizes of delimited record values, with schema applied on read based on:
a) HCatalog
b) Hive
c) Hbase
d) All of the mentioned
Answer: a [Reason:] Other means of tagging the values also can be used.
4. __________ are highly resilient and eliminate the single-point-of-failure risk with traditional Hadoop deployments.
a) EMR
b) Isilon solutions
c) AWS
d) None of the mentioned
Answer: b [Reason:] Enterprise data protection and security options, including file system auditing and data-at-rest encryption to address compliance requirements, are also provided by the Isilon solution.
5. Point out the wrong statement :
a) EMC Isilon Scale-out Storage Solutions for Hadoop combine a powerful yet simple and highly efficient storage platform
b) Isilon’s native HDFS integration means you can avoid the need to invest in a separate Hadoop infrastructure
c) NoSQL systems do provide high latency access and accommodate less concurrent users
d) None of the mentioned
Answer: c [Reason:] NoSQL systems do provide low latency access and accommodate many concurrent users.
6. HDFS and NoSQL file systems focus almost exclusively on adding nodes to :
a) Scale out
b) Scale up
c) Both Scale out and up
d) None of the mentioned
Answer: a [Reason:] HDFS and NoSQL file systems focus almost exclusively on adding nodes to increase performance (scale-out) but even they require node configuration with elements of scale up.
7. Which is the most popular NoSQL database for a scalable big data store with Hadoop ?
a) HBase
b) MongoDB
c) Cassandra
d) None of the mentioned
Answer: a [Reason:] HBase is the Hadoop database: a distributed, scalable Big Data store that lets you host very large tables — billions of rows multiplied by millions of columns — on clusters built with commodity hardware.
8. The ___________ can also be used to distribute both jars and native libraries for use in the map and/or reduce tasks.
a) DataCache
b) DistributedData
c) DistributedCache
d) All of the mentioned
Answer: c [Reason:] The child-jvm always has its current working directory added to the java.library.path and LD_LIBRARY_PATH.
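A sketch of distributing a jar and a native library through the DistributedCache, using the classic org.apache.hadoop.mapred API; the HDFS paths are illustrative:

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
    public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(CacheSetup.class);
        // The jar lands on each task's classpath; the #libfoo.so fragment
        // symlinks the native library into the task's working directory,
        // which is already on java.library.path / LD_LIBRARY_PATH.
        DistributedCache.addFileToClassPath(new Path("/libs/mylib.jar"), job);
        DistributedCache.addCacheFile(new URI("/native/libfoo.so#libfoo.so"), job);
    }
}
```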
9. HBase provides ___________ like capabilities on top of Hadoop and HDFS.
a) TopTable
b) BigTop
c) Bigtable
d) None of the mentioned
Answer: c [Reason:] Google Bigtable leverages the distributed data storage provided by the Google File System.
10. _______ refers to incremental costs with no major impact on solution design, performance and complexity.
a) Scale-out
b) Scale-down
c) Scale-up
d) None of the mentioned
Answer: c [Reason:] Adding more CPU/RAM/disk capacity to a Hadoop DataNode that is already part of a cluster does not require additional network switches.
Hadoop MCQ Set 7
1. Mapper implementations are passed the JobConf for the job via the ________ method.
a) JobConfigure.configure
b) JobConfigurable.configure
c) JobConfigurable.configureable
d) None of the mentioned
Answer: b [Reason:] Mapper implementations override the JobConfigurable.configure method to initialize themselves.
2. Point out the correct statement :
a) Applications can use the Reporter to report progress
b) The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job
c) The intermediate, sorted outputs are always stored in a simple (key-len, key, value-len, value) format
d) All of the mentioned
Answer: d [Reason:] Reporters can be used to set application-level status messages and update Counters.
3. Input to the _______ is the sorted output of the mappers.
a) Reducer
b) Mapper
c) Shuffle
d) All of the mentioned
Answer: a [Reason:] In the shuffle phase, the framework fetches the relevant partition of the output of all the mappers via HTTP.
4. The right number of reduces seems to be :
a) 0.90
b) 0.80
c) 0.36
d) 0.95
Answer: d [Reason:] The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
5. Point out the wrong statement :
a) Reducer has 2 primary phases
b) Increasing the number of reduces increases the framework overhead, but increases load balancing and lowers the cost of failures
c) It is legal to set the number of reduce-tasks to zero if no reduction is desired
d) The framework groups Reducer inputs by keys (since different mappers may have output the same key) in sort stage
Answer: a [Reason:] Reducer has 3 primary phases: shuffle, sort and reduce.
6. The output of the _______ is not sorted in the Mapreduce framework for Hadoop.
a) Mapper
b) Cascader
c) Scalding
d) None of the mentioned
Answer: d [Reason:] The output of the reduce task is typically written to the FileSystem. The output of the Reducer is not sorted.
7. Which of the following phases occur simultaneously ?
a) Shuffle and Sort
b) Reduce and Sort
c) Shuffle and Map
d) All of the mentioned
Answer: a [Reason:] The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
8. Mapper and Reducer implementations can use the ________ to report progress or just indicate that they are alive.
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
Answer: c [Reason:] Reporter is a facility for MapReduce applications to report progress, set application-level status messages and update Counters.
9. __________ is a generalization of the facility provided by the MapReduce framework to collect data output by the Mapper or the Reducer.
a) Partitioner
b) OutputCollector
c) Reporter
d) All of the mentioned
Answer: b [Reason:] OutputCollector generalizes the facility the framework provides to collect data output by the Mapper or the Reducer, whether intermediate outputs or the output of the job; Hadoop MapReduce also comes bundled with a library of generally useful mappers, reducers, and partitioners.
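A minimal old-API mapper pulling together questions 1, 8, and 9 of this set: it is handed the JobConf via JobConfigurable.configure, emits output through the OutputCollector, and reports liveness through the Reporter. The class and counter names are illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TokenMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void configure(JobConf job) {
        // Passed the JobConf here (question 1); read job settings if needed.
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        for (String token : value.toString().split("\\s+")) {
            output.collect(new Text(token), ONE);      // OutputCollector (question 9)
            reporter.incrCounter("demo", "tokens", 1); // Reporter (question 8)
        }
    }
}
```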
10. _________ is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution.
a) Map Parameters
b) JobConf
c) MemoryConf
d) None of the mentioned
Answer: b [Reason:] JobConf represents a MapReduce job configuration.
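And the matching driver sketch: a JobConf describes the job (question 10) and is handed to the framework for execution. Paths come from the command line; the reduce count is illustrative (cf. question 4's 0.95/1.75 heuristic):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TokenCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TokenCountDriver.class);
        conf.setJobName("token-count");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(TokenMapper.class); // mapper from the sketch above
        conf.setNumReduceTasks(2);              // illustrative; cf. question 4
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```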