
Tuesday, 12 April 2016

Top 90 SQL Server Interview questions


As part of our interview questions series, we are posting SQL Server interview questions which will be very useful for fresher and experienced job seekers.


1) How many servers were you supporting?
2) How were you maintaining the servers?
3) What are your daily activities?
4) How were you scheduling jobs: through SQL Server Agent or a third-party tool?
5) What is your DR strategy?
6) Users are complaining that databases are responding very slowly. How will you troubleshoot the issue?
7) What are the different types of replication?
8) What types of maintenance jobs were there in your previous project?
9) What does UPDATE STATISTICS do? What is its significance?
10) What are a few DBCC commands you use frequently?
11) How will you check fragmentation on a table or index?
12) What is BCP and how is it useful?
13) What are DTS packages, and have you designed any such packages?
14) What does the SQL Server Upgrade Advisor tool do?
15) Why can a table have only one clustered index?
16) How will you troubleshoot a poorly performing query?
17) What filters can you set in SQL Server Profiler?
18) What are the different types of backup?
19) What is the difference between a full and a differential backup?
20) What is your backup strategy?
21) What are the DBCC DBREINDEX and DBCC INDEXDEFRAG commands used for?
22) How will you find the log space usage of all databases?
23) How much memory does SQL Server require?
24) How will you monitor log shipping?
25) How will you set up log shipping? Explain in detail.
26) What are AWE configurations?
27) What are the jobs associated with log shipping?
28) What are index reorganize and index rebuild, and in which situations would you use each?
29) What is a checkpoint?
30) If tempdb is full, what can you do?
31) Can you move tempdb from one drive to another while the business is running?
32) If the master database is corrupt, what can you do?
33) How many files does a database have? If the .ldf file is deleted and no backup is available, what can you do? What if the .mdf file is deleted?
34) What is the use of an .ndf file in a SQL Server database?
35) If blocking occurs in SQL Server, how do you identify it and how do you resolve it?
36) If a deadlock occurs in SQL Server, how do you identify it and how do you resolve it?
37) In how many ways can we move databases? Which is the best one?
38) How will you move databases (system and user) to a different instance?
39) How do you move a particular table to another SQL Server instance?
40) An application is slow; how will you troubleshoot it from the database perspective?
41) A user was able to connect yesterday but is unable to connect today. What will you do?
42) What is the difference between log shipping and mirroring?
43) Ticketing tool and CR tool: are they the same or different?
44) What are the isolation levels in SQL Server 2000?
45) What is the difference between read committed and serializable in SQL Server 2000?
46) How do you grant SELECT and UPDATE permissions in SQL Server?
47) What are the new features in SQL Server 2005 from a DBA perspective?
48) What is the difference between truncate and shrink?
49) Which monitoring tool do you use? Tell me three main things you check in it.
50) If an error occurs, how do you identify whether it is an application error or a database error?
51) What is a view and what is it used for?
52) What is an execution plan in SQL Server?
53) What is the use of a nonclustered index in SQL Server?
54) If the primary server fails, how do you bring up the secondary server?
55) In log shipping, if a transaction log backup is deleted on the primary server, how do you restore it on the secondary server?
56) What types of issues have you resolved in SQL Server recently?
57) What maintenance plans do you use in SQL Server?
58) How do you identify whether a job has failed?
59) How do you configure error logs?
60) What are statistics in SQL Server?
61) What are statistics, under what circumstances do they go out of date, and how do you update them?
62) What are stored procedures and triggers?
63) How long did the update statistics job take to run on your largest database (1300 GB)?
64) If you find a block on the server while update statistics is running, what do you do?
65) How were you maintaining OLTP applications with terabytes of data?
66) Update statistics ran successfully on Sunday; on Wednesday a large volume of data is inserted. What do you do?
67) What will you do in case of poor query performance?
68) What are the prerequisites for migrating from SQL Server 2000 to 2005?
69) How many modes are there in database mirroring?
70) What is a database snapshot? How do you create one?
71) How will you kill a process?
72) How do you find out which process is being blocked?
73) How many clustered and nonclustered indexes can a table have?
74) What is the significance of the UPDATE STATISTICS command?
75) What is blocking and how is it different from a deadlock?
76) How do you transition new databases from the development team to production support?
77) What role does the identity property play in replication?
78) What are the authentication modes? Explain mixed mode.
79) A database crashes; what will you do immediately?
80) What will you do after installing a new SQL Server 2005 instance?
81) How do you configure and manage SQL Server?
82) What is the difference between a role and a privilege?
83) A database has 10 tables and a user should access only one table; how do you set up access for that user?
84) The log shipping primary crashes; how will you bring up the secondary as the primary?
85) How frequently do you use Profiler?
86) How do you handle issues during installation or upgrade? How will you handle those errors?
87) Is .NET Framework 2.0 the minimum required for SQL Server 2005?
88) What is the latency period of log shipping?
89) What is the Copy Database Wizard in SQL Server 2005?
90) How many instances do you handle?
91) Tempdb space is increasing very rapidly; what will you do?
92) If you use the TRUNCATE_ONLY option you will lose current transactions; how do you prevent that?

Top 50 INFORMATICA INTERVIEW QUESTIONS AND ANSWERS





Informatica is one of the leading ETL tools in the market and one of the most widely used ETL tools in IT companies. It comes under data warehousing, and Informatica roles are among the highly paid jobs in this field. I am posting Informatica interview questions and answers for freshers and experienced candidates, which will be helpful for cracking Informatica job interviews.


1.  What do you mean by Enterprise Data Warehousing?
When an organization's data is made available at a single point of access, it is called enterprise data warehousing. Data can be provided with a global view to the server via a single source store, and one can do periodic analysis on that same source. It gives better results, but the time required is higher.
2. What is the difference between a database, a data warehouse and a data mart?
A database includes a set of sensibly affiliated data which is normally small in size compared to a data warehouse. A data warehouse contains assortments of all sorts of data, and data is taken out only according to the customer's needs. A data mart, on the other hand, is a set of data designed to cater to the needs of a particular domain; for instance, an organization may keep separate chunks of data for its different departments, i.e. sales, finance, marketing, etc.
3. What is meant by a domain?
When all related relationships and nodes are covered by a sole organizational point, it is called a domain. Through this, data management can be improved.
4. What is the difference between a repository server and a powerhouse?
The repository server controls the complete repository, which includes tables, charts and various procedures. Its main function is to ensure the integrity and consistency of the repository, while a powerhouse server governs the implementation of various processes among the factors of the server's database repository.
5. How many repositories can be created in Informatica?
There can be any number of repositories in Informatica, but eventually it depends on the number of ports.
6. What is the benefit of partitioning a session?
Partitioning a session means solo implementation sequences within the session. Its main purpose is to improve the server's operation and efficiency. Other transformations, including extractions and other outputs of single partitions, are carried out in parallel.
7. How are indexes created after completing the load process?
For the purpose of creating indexes after the load process, command tasks at the session level can be used. Index-creating scripts can be brought in line with the session's workflow or the post-session implementation sequence. Moreover, this type of index creation cannot be controlled at the transformation level after the load process.
8. Explain sessions. Explain how batches are used to combine executions?
A set of instructions that needs to be implemented to convert data from a source to a target is called a session. A session can be carried out using the session manager or the pmcmd command. Batch execution can be used to combine session executions either in a serial or a parallel manner; batches can have different sessions running in parallel or serially.
9. How many number of sessions can one group in batches?
One can group any number of sessions, but it is easier for migration if the number of sessions in a batch is smaller.
10. Explain the difference between mapping parameter and mapping variable?
Values that change during the session's execution are called mapping variables; upon completion, the Informatica server stores the end value of a variable and reuses it when the session restarts. Values that do not change during the session's execution are called mapping parameters. The mapping procedure explains mapping parameters and their usage; values are allocated to these parameters before starting the session.
11.What is complex mapping?
Following are the features of complex mapping.
·          Difficult requirements
·          Many numbers of transformations
·          Complex business logic
12. How can one identify whether mapping is correct or not without connecting session?
One can find out whether the mapping is correct or not, without connecting the session, with the help of the debugging option.
13. Can one use mapping parameters or variables created in one mapping in any other reusable transformation?
Yes, one can, because a reusable transformation does not contain any mapplet or mapping.
14. Explain the use of the aggregator cache file?
Aggregator transformations are handled in chunks of instructions during each run. They store transitional values, which are kept in local buffer memory. Aggregators provide extra cache files for storing the transformation values if extra memory is required.
15. Briefly describe lookup transformation?
Lookup transformations are those transformations which have access rights to RDBMS-based data sets. The server makes the access faster by using lookup tables to look at explicit table data or the database. The final data is obtained by matching the lookup condition for all lookup ports delivered during transformations.
16. What does role playing dimension mean?
The dimensions that are utilized for playing diversified roles while remaining in the same database domain are called role playing dimensions.
17. How can repository reports be accessed without SQL or other transformations?
Repository reports are established by the metadata reporter. There is no need for SQL or other transformations since it is a web app.
18. What are the types of metadata that stores in repository?
The types of metadata include source definitions, target definitions, mappings, mapplets and transformations.
19. Explain the code page compatibility?
When data moves from one code page to another, provided that both code pages have the same character sets, data loss cannot occur. All the characters of the source page must be available in the target page. If all the characters of the source page are not present in the target page, it would be a subset, and data loss will definitely occur during transformation because the two code pages are not compatible.
20. How can you validate all mappings in the repository simultaneously?
All the mappings cannot be validated simultaneously because each time only one mapping can be validated.
21. Briefly explain the Aggregator transformation?
It allows one to do aggregate calculations such as sums, averages, etc. Unlike the expression transformation, it lets one do calculations on groups of rows.
22. Describe Expression transformation?
Values can be calculated in a single row before writing to the target in this form of transformation. It can be used to perform non-aggregate calculations. Conditional statements can also be tested before the output results go to the target tables.
23. What do you mean by filter transformation?
It is a means of filtering rows in a mapping. Data needs to pass through the filter transformation, where the filter condition is applied. The filter transformation contains all input/output ports, and only the rows which meet the condition pass through the filter.
24. What is Joiner transformation?
 Joiner transformation combines two affiliated heterogeneous sources living in different locations while a source qualifier transformation can combine data emerging from a common source.
25. What is Lookup transformation?
 It is used for looking up data in a relational table through mapping. Lookup definition from any relational database is imported from a source which has tendency of connecting client and server. One can use multiple lookup transformation in a mapping.
26. How Union Transformation is used?
It is a diverse input group transformation which can be used to combine data from different sources. It works like the UNION ALL statement in SQL, which is used to combine the result sets of two SELECT statements.
27. What do you mean Incremental Aggregation?
The option for incremental aggregation is enabled whenever a session is created for a mapping that performs aggregation. PowerCenter performs incremental aggregation through the mapping and historical cache data to perform new aggregation calculations incrementally.
28. What is the difference between a connected look up and unconnected look up?
When the inputs are taken directly from other transformations in the pipeline, it is called a connected lookup. An unconnected lookup doesn't take inputs directly from other transformations, but it can be used in any transformation and can be invoked as a function using the :LKP expression. So it can be said that an unconnected lookup can be called multiple times in a mapping.
29. What is a mapplet?
A reusable object that is created using the Mapplet Designer is called a mapplet. It permits one to reuse the transformation logic in multiple mappings, and it also contains a set of transformations.
30.Briefly define reusable transformation?
A reusable transformation is used numerous times in mappings. It is different from an ordinary transformation since it is stored as metadata, separate from any mapping that uses it. Whenever any change is made to the reusable transformation, the change is reflected in all the mappings that use it.
31. What does update strategy mean, and what are the different option of it?
Row-by-row processing is done by Informatica. By default, every row is marked for insert into the target table. The update strategy is used whenever a row has to be updated or inserted based on some condition, and that condition must be specified in the update strategy for the processed row to be marked as updated or inserted.
32. What is the scenario which compels informatica server to reject files?
This happens when the server encounters DD_Reject in an update strategy transformation, when a database constraint is violated, or when a field in the row is truncated.
33. What is surrogate key?
A surrogate key is a replacement for the natural primary key. It is a unique identifier for each row in the table. It is very beneficial because the natural primary key can change, which eventually makes updates more difficult. Surrogate keys are always numeric (integer) values.
34.What are the prerequisite tasks to achieve the session partition?
In order to perform session partitioning, one needs to configure the session to partition the source data and then install the Informatica server on a machine with multiple CPUs.
35. Which files are created during a session run by the Informatica server?
During a session run, the files created are the errors log, bad file, workflow log and session log.
 36. Briefly define a session task?
It is a set of instructions that guides the PowerCenter server about how and when to transfer data from sources to targets.
 37. What does command task mean?
This task permits one or more shell commands in UNIX, or DOS commands in Windows, to run during the workflow.
38. What is standalone command task?
This task can be used anywhere in the workflow to run the shell commands.
39. What is meant by pre and post session shell command?
A command task can be called as the pre- or post-session shell command for a session task. One can run it as a pre-session command, a post-session success command or a post-session failure command.
40.What is predefined event?
It is a file-watch event. It waits for a specific file to arrive at a specific location.
41. How can you define a user-defined event?
User defined event can be described as a flow of tasks in the workflow. Events can be created and then raised as need arises.
42. What is a workflow?
A workflow is a bunch of instructions that tells the server how to implement tasks.
43. What are the different tools in Workflow Manager?
Following are the different tools in Workflow Manager:
· Task Developer
· Worklet Designer
· Workflow Designer
44. Can you name any tool for scheduling other than Workflow Manager and pmcmd?
The tool for scheduling other than Workflow Manager and pmcmd can be a third-party tool like CONTROL-M.
45. What is OLAP (On-Line Analytical Processing)?
A method by which multi-dimensional analysis occurs.
46. What are the different types of OLAP? Give an example?
ROLAP (e.g. BO), MOLAP (e.g. Cognos), HOLAP and DOLAP.
47. What do you mean by worklet?
When workflow tasks are grouped in a set, it is called a worklet. Workflow tasks include timer, decision, command, event wait, mail, session, link, assignment, control, etc.
48. What is the use of target designer?
Target Definition is created with the help of target designer.
49. Where can we find the throughput option in informatica?
The throughput option can be found in the Workflow Monitor. In the Workflow Monitor, right-click on the session, then click on Get Run Properties; under Source/Target Statistics we can find the throughput option.
50. What is target load order?
Target load order is specified on the basis of the source qualifiers in a mapping. If there are multiple source qualifiers linked to different targets, one can specify the order in which the Informatica server loads data into the targets.


Wednesday, 6 April 2016

Big Data Interview questions and Answers

Big Data Interview Questions asked in companies like IBM, Amazon, HP, Google.





What is a JobTracker in Hadoop? How many instances of JobTracker run on a Hadoop cluster?
The JobTracker is the daemon service for submitting and tracking MapReduce jobs in Hadoop. There is only one JobTracker process running on any Hadoop cluster, and it runs in its own JVM process; in a typical production cluster it runs on a separate machine. Each slave node is configured with the JobTracker node location. The JobTracker is a single point of failure for the Hadoop MapReduce service: if it goes down, all running jobs are halted. The JobTracker performs the following actions (from the Hadoop wiki):

Client applications submit jobs to the JobTracker.
The JobTracker talks to the NameNode to determine the location of the data.
The JobTracker locates TaskTracker nodes with available slots at or near the data.
The JobTracker submits the work to the chosen TaskTracker nodes.
The TaskTracker nodes are monitored. If they do not submit heartbeat signals often enough, they are deemed to have failed and the work is scheduled on a different TaskTracker.

A TaskTracker will notify the JobTracker when a task fails. The JobTracker decides what to do then: it may resubmit the job elsewhere, it may mark that specific record as something to avoid, and it may even blacklist the TaskTracker as unreliable.

When the work is completed, the JobTracker updates its status.
Client applications can poll the JobTracker for information.
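
The list above describes the JobTracker-side flow. Below is a minimal, hedged sketch of the client side: a classic word-count driver written against the old org.apache.hadoop.mapred API (the JobTracker-era API this answer describes). JobClient.runJob hands the JobConf to the JobTracker and polls it until the job finishes; the input and output paths are assumed to be passed as command-line arguments.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountJob {

    // Mapper: emits (word, 1) for every token in the input line.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                out.collect(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountJob.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // assumed input dir
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // assumed output dir
        // JobClient submits the job to the JobTracker and polls it until completion.
        JobClient.runJob(conf);
    }
}

Packaged into a jar, this would typically be launched with something like hadoop jar wc.jar WordCountJob <input> <output>.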

How does the JobTracker schedule a task?
The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated. When the JobTracker tries to find somewhere to schedule a task within the MapReduce operations, it first looks for an empty slot on the same server that hosts the DataNode containing the data, and if not, it looks for an empty slot on a machine in the same rack.

What is a TaskTracker in Hadoop? How many instances of TaskTracker run on a Hadoop cluster?
A TaskTracker is a slave node daemon in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker. There is only one TaskTracker process running on any Hadoop slave node, and it runs in its own JVM process. Every TaskTracker is configured with a set of slots; these indicate the number of tasks that it can accept. The TaskTracker starts separate JVM processes to do the actual work (called Task Instances); this ensures that a process failure does not take down the TaskTracker. The TaskTracker monitors these task instances, capturing the output and exit codes. When the task instances finish, successfully or not, the TaskTracker notifies the JobTracker. The TaskTrackers also send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster work can be delegated.

What is a Task Instance in Hadoop? Where does it run?
Task instances are the actual MapReduce tasks which are run on each slave node. The TaskTracker starts a separate JVM process to do the actual work (called a Task Instance); this ensures that a process failure does not take down the TaskTracker. Each task instance runs in its own JVM process. There can be multiple task instance processes running on a slave node, based on the number of slots configured on the TaskTracker. By default, a new task instance JVM process is spawned for each task.

How many daemon processes run on a Hadoop system?
Hadoop comprises five separate daemons, and each of these daemons runs in its own JVM.
The following three daemons run on master nodes:
1. NameNode - stores and maintains the metadata for HDFS.
2. Secondary NameNode - performs housekeeping functions for the NameNode.
3. JobTracker - manages MapReduce jobs and distributes individual tasks to the machines running TaskTrackers.
The following two daemons run on each slave node:
1. DataNode - stores actual HDFS data blocks.
2. TaskTracker - responsible for instantiating and monitoring individual Map and Reduce tasks.

What is the configuration of a typical slave node on a Hadoop cluster? How many JVMs run on a slave node?

    A single instance of a TaskTracker runs on each slave node, as a separate JVM process.
    A single instance of a DataNode daemon runs on each slave node, as a separate JVM process.
    One or multiple task instances run on each slave node, each as a separate JVM process. The number of task instances can be controlled by configuration; typically a high-end machine is configured to run more task instances.

What is the difference between HDFS and NAS?
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant. Following are the differences between HDFS and NAS:

    In HDFS, data blocks are distributed across the local drives of all machines in a cluster, whereas in NAS data is stored on dedicated hardware.
    HDFS is designed to work with the MapReduce system, since computation is moved to the data. NAS is not suitable for MapReduce since data is stored separately from the computation.
    HDFS runs on a cluster of machines and provides redundancy using a replication protocol, whereas NAS is provided by a single machine and therefore does not provide data redundancy.

How does the NameNode handle DataNode failures?
The NameNode periodically receives a heartbeat and a block report from each of the DataNodes in the cluster. Receipt of a heartbeat implies that the DataNode is functioning properly, and a block report contains a list of all blocks on a DataNode. When the NameNode notices that it has not received a heartbeat message from a DataNode after a certain amount of time, the DataNode is marked as dead. Since the blocks will then be under-replicated, the system begins replicating the blocks that were stored on the dead DataNode. The NameNode orchestrates the replication of data blocks from one DataNode to another; the replication data transfer happens directly between DataNodes and the data never passes through the NameNode.

Does the MapReduce programming model provide a way for reducers to communicate with each other? In a MapReduce job, can a reducer communicate with another reducer?
No, the MapReduce programming model does not allow reducers to communicate with each other. Reducers run in isolation.





Can I set the number of reducers to zero?
Yes, setting the number of reducers to zero is a valid configuration in Hadoop. When you set the reducers to zero, no reducers are executed, and the output of each mapper is stored in a separate file on HDFS. (This is different from the case when the reducers are set to a number greater than zero, where the mapper output, i.e. the intermediate data, is written to the local file system, not HDFS, of each mapper slave node.)
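
A minimal sketch, assuming the classic mapred API, of such a map-only job: with setNumReduceTasks(0) the shuffle and reduce phases are skipped and each mapper writes its output straight to HDFS. No mapper class is set here, so the default IdentityMapper simply copies the input records through; the input and output paths are assumed command-line arguments.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MapOnlyJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapOnlyJob.class);
        conf.setJobName("map-only");

        // Zero reducers: the shuffle/sort phase is skipped and each mapper's
        // output goes directly to a part file on HDFS.
        conf.setNumReduceTasks(0);

        // No mapper class is configured, so the default IdentityMapper is used.
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // assumed input dir
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // assumed output dir
        JobClient.runJob(conf);
    }
}
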
Where is the mapper output (intermediate key-value data) stored?

The mapper output (intermediate data) is stored on the local file system (not HDFS) of each individual mapper node. This is typically a temporary directory location which can be set up in the configuration by the Hadoop administrator. The intermediate data is cleaned up after the Hadoop job completes.

What are combiners? When should I use a combiner in my MapReduce job?
Combiners are used to increase the efficiency of a MapReduce program. They aggregate the intermediate map output locally on the individual mapper nodes. Combiners can help you reduce the amount of data that needs to be transferred across to the reducers. You can use your reducer code as a combiner if the operation performed is commutative and associative. The execution of the combiner is not guaranteed: Hadoop may or may not execute a combiner, and if required it may execute it more than once. Therefore your MapReduce jobs should not depend on the combiner's execution.
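
As a hedged illustration, the driver below reuses the sum reducer from the word-count sketch earlier (WordCountJob.Reduce) as a combiner; summing counts is commutative and associative, so pre-aggregating on the map side is safe even though Hadoop may run the combiner zero or more times.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountWithCombiner {
    public static void main(String[] args) throws Exception {
        // Reuses the Map and Reduce classes from the word-count sketch above.
        JobConf conf = new JobConf(WordCountWithCombiner.class);
        conf.setJobName("wordcount-with-combiner");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(WordCountJob.Map.class);
        conf.setReducerClass(WordCountJob.Reduce.class);

        // The sum reducer doubles as a combiner; Hadoop may run it zero, one
        // or several times, so correctness must not depend on it.
        conf.setCombinerClass(WordCountJob.Reduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // assumed input dir
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // assumed output dir
        JobClient.runJob(conf);
    }
}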

What is Writable & WritableComparable interface?
    org.apache.hadoop.io.Writable is a Java interface. Any key or value type in the Hadoop Map-Reduce framework implements this interface. Implementations typically implement a static read(DataInput) method which constructs a new instance, calls readFields(DataInput) and returns the instance.
    org.apache.hadoop.io.WritableComparable is a Java interface. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. WritableComparable objects can be compared to each other using Comparators.
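
A minimal sketch of a hypothetical composite key (YearMonthKey is an illustrative name, not a Hadoop class) that implements WritableComparable, showing the methods the framework relies on: write/readFields for serialization and compareTo for the sort order. hashCode also matters, since the default HashPartitioner uses it to route keys to reducers.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class YearMonthKey implements WritableComparable<YearMonthKey> {
    private int year;
    private int month;

    public YearMonthKey() {}                      // no-arg constructor required by Hadoop

    public YearMonthKey(int year, int month) {
        this.year = year;
        this.month = month;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // serialization
        out.writeInt(year);
        out.writeInt(month);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // deserialization
        year = in.readInt();
        month = in.readInt();
    }

    @Override
    public int compareTo(YearMonthKey other) {                 // sort order during shuffle
        int cmp = Integer.compare(year, other.year);
        return cmp != 0 ? cmp : Integer.compare(month, other.month);
    }

    @Override
    public int hashCode() {            // used by HashPartitioner to choose a reducer
        return 31 * year + month;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearMonthKey)) return false;
        YearMonthKey k = (YearMonthKey) o;
        return year == k.year && month == k.month;
    }
}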

What is the Hadoop MapReduce API contract for a key and value Class?
    The Key must implement the org.apache.hadoop.io.WritableComparable interface.
    The value must implement the org.apache.hadoop.io.Writable interface.

What is an IdentityMapper and IdentityReducer in MapReduce?
    org.apache.hadoop.mapred.lib.IdentityMapper implements the identity function, mapping inputs directly to outputs. If the MapReduce programmer does not set the mapper class using JobConf.setMapperClass, then IdentityMapper.class is used as the default value.
    org.apache.hadoop.mapred.lib.IdentityReducer performs no reduction, writing all input values directly to the output. If the MapReduce programmer does not set the reducer class using JobConf.setReducerClass, then IdentityReducer.class is used as the default value.
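
A small sketch that wires these two classes in explicitly (the same behaviour you get when no mapper or reducer class is configured at all); the result is a pass-through job whose records are still partitioned and sorted by key on the way from map to reduce. The input and output paths are assumed command-line arguments.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class PassThroughJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PassThroughJob.class);
        conf.setJobName("identity-pass-through");

        // Explicitly setting the defaults Hadoop falls back to anyway.
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}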

What is the meaning of speculative execution in Hadoop? Why is it important?
Speculative execution is a way of coping with individual machine performance. In large clusters where hundreds or thousands of machines are involved, there may be machines which are not performing as fast as others. This may result in delays in a full job due to only one machine not performing well. To avoid this, speculative execution in Hadoop can run multiple copies of the same map or reduce task on different slave nodes. The results from the first node to finish are used.
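
A short sketch of how speculative execution can be tuned per job with the classic JobConf API; switching it off here is just an illustrative assumption (for example, for tasks with side effects where duplicate attempts would be harmful), not a general recommendation.

import org.apache.hadoop.mapred.JobConf;

public class SpeculationSettings {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Speculative execution is enabled by default; it can be disabled per job.
        conf.setMapSpeculativeExecution(false);
        conf.setReduceSpeculativeExecution(false);

        // The equivalent classic property names are
        // mapred.map.tasks.speculative.execution and
        // mapred.reduce.tasks.speculative.execution.
        System.out.println("map speculation: " + conf.getMapSpeculativeExecution());
        System.out.println("reduce speculation: " + conf.getReduceSpeculativeExecution());
    }
}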

When are reducers started in a MapReduce job?
In a MapReduce job, reducers do not start executing the reduce method until all map tasks have completed. Reducers start copying intermediate key-value pairs from the mappers as soon as they are available, but the programmer-defined reduce method is called only after all the mappers have finished.
If reducers do not start before all mappers finish, then why does the progress on a MapReduce job show something like Map(50%) Reduce(10%)? Why is the reducer progress percentage displayed when the mappers are not finished yet?

Reducers start copying intermediate key-value pairs from the mappers as soon as they are available. The progress calculation also takes into account the data transfer done by the reduce process, so the reduce progress starts showing up as soon as any intermediate key-value pair from a mapper is available to be transferred to a reducer. Though the reducer progress is updated, the programmer-defined reduce method is called only after all the mappers have finished.

What is HDFS? How is it different from traditional file systems?
HDFS, the Hadoop Distributed File System, is responsible for storing huge data sets on the cluster. It is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems; however, the differences from other distributed file systems are significant.

    HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware.
    HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
    HDFS is designed to support very large files. Applications that are compatible with HDFS are those that deal with large data sets. These applications write their data only once but they read it one or more times and require these reads to be satisfied at streaming speeds. HDFS supports write-once-read-many semantics on files.

What is the HDFS block size? How is it different from the traditional file system block size?
In HDFS, data is split into blocks and distributed across multiple nodes in the cluster. Each block is typically 64 MB or 128 MB in size, and each block is replicated multiple times (the default is to replicate each block three times). Replicas are stored on different nodes. HDFS utilizes the local file system to store each HDFS block as a separate file. The HDFS block size cannot be compared with the traditional file system block size, which is far smaller.
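
A hedged sketch of how the block size can be influenced from client code: dfs.block.size is the classic property name for the cluster default, and the FileSystem.create overload lets a block size (and replication factor) be chosen for an individual file. The path used here is just an illustrative placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Classic property name for the default HDFS block size (128 MB here).
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);

        // Block size and replication can also be chosen per file at create time:
        // 4096-byte write buffer, 3 replicas, 64 MB blocks for this file only.
        FSDataOutputStream out = fs.create(new Path("/tmp/example.dat"),  // assumed path
                true, 4096, (short) 3, 64L * 1024 * 1024);
        out.writeUTF("hello hdfs");
        out.close();
        fs.close();
    }
}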

What is a NameNode? How many instances of NameNode run on a Hadoop cluster?
The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept; it does not store the data of these files itself. There is only one NameNode process running on any Hadoop cluster, and it runs in its own JVM process; in a typical production cluster it runs on a separate machine. The NameNode is a single point of failure for the HDFS cluster: when the NameNode goes down, the file system goes offline. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives.

What is a DataNode? How many instances of DataNode run on a Hadoop cluster?
A DataNode stores data in the Hadoop file system HDFS. There is only one DataNode process running on any Hadoop slave node, and it runs in its own JVM process. On startup, a DataNode connects to the NameNode. DataNode instances can talk to each other, mostly when replicating data.

How does the client communicate with HDFS?
Client communication with HDFS happens using the Hadoop HDFS API. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add/copy/move/delete a file on HDFS. The NameNode responds to successful requests by returning a list of relevant DataNode servers where the data lives. Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.
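
A minimal sketch of that interaction using the Java FileSystem API: the open() call goes to the NameNode to locate the blocks, and the subsequent reads stream the bytes directly from the DataNodes. The file path is an assumed placeholder.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Configuration picks up the cluster settings (including the NameNode
        // address) from the config files on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // open() asks the NameNode where the blocks live; the bytes are then
        // streamed directly from the DataNodes.
        Path file = new Path("/user/hadoop/input/sample.txt");   // assumed path
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(fs.open(file)));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        fs.close();
    }
}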

How are HDFS blocks replicated?
HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file: an application can specify the number of replicas of a file, and the replication factor can be specified at file creation time and changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding the replication of blocks. HDFS uses a rack-aware replica placement policy: in the default configuration there are a total of 3 copies of a data block on HDFS, with 2 copies stored on DataNodes on the same rack and the 3rd copy on a different rack.
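
A small sketch showing the two levers this answer mentions: dfs.replication as the cluster-wide default, and FileSystem.setReplication to change the factor for an existing file (the NameNode then schedules the extra or surplus copies in the background). The path is an assumed placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Cluster-wide default replication factor (3 unless overridden).
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);

        // Replication can also be changed per file after creation.
        Path file = new Path("/user/hadoop/input/sample.txt");   // assumed path
        fs.setReplication(file, (short) 2);

        System.out.println("replication is now: "
                + fs.getFileStatus(file).getReplication());
        fs.close();
    }
}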

Apache Hadoop Cluster Interview Questions 
What is the Hadoop core configuration?
Hadoop core was configured by two XML files:
1. hadoop-default.xml, which held the read-only default values, and
2. hadoop-site.xml, which held the site-specific overrides.
These files are written in XML format and contain properties consisting of a name and a value. These files do not exist any more; they were split into core-site.xml, hdfs-site.xml and mapred-site.xml.

Which are the three modes in which Hadoop can be run?
The three modes in which Hadoop can be run are:
1. Standalone (local) mode
2. Pseudo-distributed mode
3. Fully distributed mode

What are the features of standalone (local) mode?
In standalone mode there are no daemons and everything runs on a single JVM. It has no DFS and utilizes the local file system. Standalone mode is suitable only for running MapReduce programs during development. It is one of the least used environments.

What are the features of Pseudo mode?
Pseudo mode is used both for development and in the QA environment. In the Pseudo
mode all the daemons run on the same machine.

Can we call VMs pseudos?
No, VMs are not pseudos, because a VM is something different and pseudo mode is very specific to Hadoop.

What are the features of fully distributed mode?
Fully distributed mode is used in the production environment, where we have 'n' number of machines forming a Hadoop cluster. Hadoop daemons run on a cluster of machines: there is one host on which the NameNode runs, other hosts on which DataNodes run, and machines on which TaskTrackers run. We have separate masters and separate slaves in this distribution.

Does Hadoop follow the UNIX pattern?
Yes, Hadoop closely follows the UNIX pattern. Hadoop also has a 'conf' directory, as in the case of UNIX.

In which directory is Hadoop installed?
Cloudera and Apache have the same directory structure. Hadoop is installed in /usr/lib/hadoop-0.20/.

What are the port numbers of the NameNode, JobTracker and TaskTracker?
The default web UI port number for the NameNode is 50070, for the JobTracker 50030, and for the TaskTracker 50060.

What are the Hadoop configuration files at present?
There are 3 configuration files in Hadoop:
1. core-site.xml
2. hdfs-site.xml
3. mapred-site.xml
These files are located in the conf/ subdirectory.
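
These files are plain name/value XML and are read through the Configuration class. The short sketch below, assuming the classic (pre-YARN) property names, prints two of the merged settings: a plain Configuration picks up core-site.xml from the classpath, while JobConf additionally layers mapred-site.xml on top.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class ShowConfig {
    public static void main(String[] args) {
        // Configuration loads core-default.xml and core-site.xml from the
        // conf/ directory on the classpath; JobConf adds the MapReduce settings.
        Configuration core = new Configuration();
        JobConf mapred = new JobConf();

        // The second argument is the fallback used when a property is not set.
        System.out.println("fs.default.name    = " + core.get("fs.default.name", "file:///"));
        System.out.println("mapred.job.tracker = " + mapred.get("mapred.job.tracker", "local"));
    }
}
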

How to exit the Vi editor?
To exit the Vi Editor, press ESC and type :q and then press enter.

What is the spill factor with respect to RAM?
The spill factor is the size threshold after which data in the in-memory buffer is moved ("spilled") to temporary files on disk; the Hadoop temp directory is used for this.

Is fs.mapr.working.dir a single directory?
Yes, fs.mapr.working.dir is just one directory.

Which are the three main hdfs-site.xml properties?
The three main hdfs-site.xml properties are:
1. dfs.name.dir, which gives you the location where the NameNode metadata will be stored and whether the DFS is located on disk or on a remote location.
2. dfs.data.dir, which gives you the location where the data is going to be stored.
3. fs.checkpoint.dir, which is for the Secondary NameNode.

How do you come out of insert mode?
To come out of insert mode, press ESC, then type :q (if you have not written anything) or :wq (if you have written anything in the file) and then press ENTER.

What is Cloudera and why is it used?
Cloudera is a distribution of Hadoop; 'cloudera' is also the user created by default on the Cloudera VM. The distribution packages Apache Hadoop and is used for data processing.

What happens if you get a 'connection refused java exception' when you type hadoop fsck /?
It could mean that the NameNode is not working on your VM.

We are using the Ubuntu operating system with Cloudera, but where can we download Hadoop from, or does it come by default with Ubuntu?
Hadoop does not come with Ubuntu by default. This is a default configuration of Hadoop that you have to download from Cloudera or from Edureka's dropbox and then run on your systems. You can also proceed with your own configuration, but you need a Linux box, be it Ubuntu or Red Hat. There are installation steps present at the Cloudera location or in Edureka's dropbox. You can go either way.



  REAL TIME MAP REDUCE INTERVIEW QUESTIONS

What does the 'jps' command do?
This command checks whether your NameNode, DataNode, TaskTracker, JobTracker, etc. are working or not.

How can I restart the NameNode?
1. Run stop-all.sh and then start-all.sh, OR
2. Run sudo hdfs (press enter), su-hdfs (press enter), /etc/init.d/ha (press enter) and then /etc/init.d/hadoop-0.20-namenode start (press enter).

What is the full form of fsck?
Full form of fsck is File System Check.

How can we check whether the NameNode is working or not?
To check whether the NameNode is working or not, use the command /etc/init.d/hadoop-0.20-namenode status, or simply run jps.

What does mapred.job.tracker do?
The mapred.job.tracker property tells Hadoop which of your nodes is acting as the JobTracker, and on which port.

What does /etc/init.d do?
/etc/init.d is where daemon (service) scripts are placed, and it can be used to see the status of these daemons. It is very Linux-specific and has nothing to do with Hadoop.

How can we look for the NameNode in the browser?
If you have to look for the NameNode in the browser, you don't use localhost:8021; the port number to look for the NameNode in the browser is 50070.

How do you change from SU to the cloudera user?
To change from SU back to the cloudera user, just type exit.

Which files are used by the startup and shutdown commands?
The slaves and masters files are used by the startup and shutdown commands.

What do slaves consist of?
The slaves file consists of a list of hosts, one per line, that host DataNode and TaskTracker servers.

What do masters consist of?
The masters file contains a list of hosts, one per line, that are to host Secondary NameNode servers.

What does hadoop-env.sh do?
hadoop-env.sh provides the environment for Hadoop to run. JAVA_HOME is set over here.

Can we have multiple entries in the master files?
Yes, we can have multiple entries in the Master files.

Where is hadoop-env.sh file present?
hadoop-env.sh file is present in the conf location.

In HADOOP_PID_DIR, what does PID stand for?
PID stands for 'Process ID'.

What does /var/hadoop/pids do?
It stores the PID.

What does hadoop-metrics.properties file do?
hadoop-metrics.properties is used for ‘Reporting‘ purposes. It controls the reporting for
Hadoop. The default status is ‘not to report‘.